Part Two: The Data Layer
Data is Everywhere
There are many parts to an application, especially in the enterprise. The most valuable of all those parts is data. As we enter the age of Machine Learning and AI, this data has only become more valuable. Data is so valuable that companies constantly throw money and free products at customers to get it.
The way that we store data is changing as the volume of data increases. Applications used to store all of their data in a single database, but at enterprise scale that is often no longer feasible. Google realized this long before many of us when it set out to index every website on the internet. The MapReduce method it devised to solve that problem has become the backbone of many of the large-scale data stores that are often referred to as data lakes.
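To make the idea concrete, here is a minimal, single-process sketch of the MapReduce pattern using word counting, the classic example. The function names and data are made up for illustration; on a real cluster, the map step runs on many nodes in parallel, each over its own chunk of the data.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each document independently.
    On a real cluster, each node runs this over its local chunk."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group the emitted pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cat sat", "the dog sat"]
print(reduce_phase(map_phase(docs)))  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

Because each map call only looks at one document, the work splits cleanly across as many machines as you have, which is what made the approach viable at the scale of the web.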
What are Data Lakes?
Traditional relational databases (RDBMSs) store data in tables, with constraints between the tables to keep the data consistent. These constraints are easy to enforce when the data is small, but when you scale up to the sizes that Hadoop can deal with, enforcing uniqueness and referential integrity would bog everything down, and it isn't worth the time to try with the type of data you are storing.
When you are selling products to millions of customers every single day, you need your databases to record those sales fast. Traditional databases use indexes, which speed up certain types of queries by precomputing a lookup structure (such as a hash) over part of the data. Maintaining that structure adds computation to every write, though, and it only scales to a point: many rows can share the same hash, so the database still has to scan within a bucket to find the data in question.
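A toy hash index makes both sides of that trade-off visible. This is an illustrative sketch, not how any particular database implements indexing: every insert pays to update the index, and a lookup still scans linearly inside its bucket.

```python
# Toy hash index: buckets narrow the scan, but colliding keys still
# force a linear scan inside the bucket, and every insert pays the
# cost of updating the index.
rows = [("alice", 30), ("bob", 25), ("carol", 41), ("dave", 25)]

NUM_BUCKETS = 4
index = {}
for i, (name, _age) in enumerate(rows):
    bucket = hash(name) % NUM_BUCKETS            # extra work on every insert
    index.setdefault(bucket, []).append(i)

def lookup(name):
    bucket = hash(name) % NUM_BUCKETS
    # Only rows in the matching bucket are scanned, not the whole table.
    return [rows[i] for i in index.get(bucket, []) if rows[i][0] == name]

print(lookup("carol"))  # [('carol', 41)]
```

With four rows the index cost is invisible; with millions of writes per day, that per-insert bookkeeping is exactly what starts to hurt.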
With data lake volumes, data is coming in so fast that you really don't have time to continually update indexes. You need simple flows that won't get bogged down and drop data. To achieve this, data lakes split the data into many smaller chunks that are easier to manage. Inserting lots of data is easy because you just dispatch it to one of the available nodes that can write it. Reading the data just requires sending queries to the nodes and batching together the results (there is a lot more technical optimization to it, but this is the general idea).
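The write and read paths just described can be sketched in a few lines. This is a deliberately simplified model, with Python lists standing in for storage nodes and round-robin dispatch standing in for a real placement strategy:

```python
import itertools

# Each list stands in for a storage node; dispatch cycles through them.
nodes = [[], [], []]
dispatch = itertools.cycle(range(len(nodes)))

def insert(record):
    # Write to whichever node is next; no global index to update.
    nodes[next(dispatch)].append(record)

def query(predicate):
    # Fan the query out to every node and batch the answers together.
    results = []
    for node in nodes:
        results.extend(r for r in node if predicate(r))
    return results

for sale in [("apple", 3), ("pear", 1), ("apple", 5), ("plum", 2)]:
    insert(sale)
print(query(lambda r: r[0] == "apple"))  # [('apple', 3), ('apple', 5)]
```

Notice that the insert path does no coordination at all, which is why ingest stays fast; the cost is paid at query time, when every node has to be consulted.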
Another optimization that data lakes give you is that only the data you actually need is sent across the wire. Data lakes can distribute code that you write out to the nodes, which then process the data in place rather than sending it all back for your application to process. This node-level processing allows high-throughput processing of almost any amount of data, which is why Google can give you near-instant results to a search request (Google optimizes this by returning the fastest results first and continuing to process data while you look at the first page; you may have noticed that the number of results changes as you go from page to page).
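Here is the "ship the code to the data" idea in miniature. The data and function names are invented for illustration; the point is that each node returns one small number instead of shipping all of its rows back to the coordinator:

```python
# Instead of shipping every row back, ship the function to each node
# and send back only its (small) partial result.
nodes = [
    [("widget", 10), ("gadget", 4)],
    [("widget", 7), ("gizmo", 2)],
]

def node_local_total(node_rows, product):
    # Runs "on the node": scans local data, returns a single number.
    return sum(qty for name, qty in node_rows if name == product)

# The coordinator only merges tiny partial results from each node.
partials = [node_local_total(rows, "widget") for rows in nodes]
print(sum(partials))  # 17
```

The bytes crossing the wire here are two integers rather than every matching row, and that difference is what makes the approach scale to data sizes that could never be moved to one machine.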
Moving to Data Lakes
Moving to this data lake architecture is a technical challenge, but as I talked about in the last article in this series, container technologies can make it a lot smoother. Frameworks like Hadoop require software to be installed on many servers independently, which can take a lot of time. Using container images, you can take advantage of software that is already set up when you deploy the containers.
Even after you have installed the software, you often still have to configure it. If you use container images, someone has typically already done the work of setting up the servers for you in an easy configuration. Apache has documentation on setting up a Hadoop cluster using Docker containers, and it can generally be done in under an hour. Orchestration frameworks like Docker Compose allow developers to write manifests so that multi-image deployments can be done with a single command.
Do I need to Move to Data Lakes?
Not necessarily, but you should definitely be learning about them as your data needs grow, so you are ready if you decide you do need them. The way that data is stored in data lakes is quite different from traditional databases, but there are frameworks that make retrieving that data more familiar to developers used to traditional RDBMSs. For example, Hadoop has a framework called Hive that lets you query these large data stores with SQL-style queries.
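To give a feel for why this matters, the query below uses Python's built-in sqlite3 purely as a stand-in (it is a traditional database, not Hive), but the SQL a developer would write against a Hive table over a multi-terabyte data lake reads much the same:

```python
import sqlite3

# Stand-in only: sqlite3 instead of a real Hive cluster, but the
# SQL a developer writes against Hive looks much the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("widget", 10), ("widget", 7), ("gizmo", 2)])

total = conn.execute(
    "SELECT SUM(qty) FROM sales WHERE product = 'widget'"
).fetchone()[0]
print(total)  # 17
```

That familiarity is the selling point: teams keep their existing query skills even though, under the hood, Hive is compiling the SQL into distributed jobs over files rather than reading from indexed tables.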
You also don’t have to store everything in the data lake. Your customer and static data can still live in a traditional database close to the app if you want, as this data is generally fairly small. Your sales and marketing data, however, is an excellent candidate for moving to the data lake. You often keep this data around for years even though you may not use it frequently. If you do use it, it is more to gain perspective on how to market products in the future.
Data lakes are also a great way to store data that is common across a large organization. A lot of companies have their own databases that require data pushes from other teams on a regular basis. Teams having separate copies of the same data leads to large problems when that data goes out of sync. It also forces companies to rely on batch processing which slows down how quickly they can respond to issues. Data lakes can help with these issues by centralizing the data and allowing everyone to utilize it at scale.
As we continue to search for the perfect stack, we must keep data storage a primary decision point. Data is only becoming more important with Machine Learning and AI, and companies are looking toward technologies like Hadoop to store and query the vast amounts of data they will need for years to come. Containers have made deploying these kinds of large-scale databases relatively easy, and frameworks exist that can make the transition less painful.
Data lakes also help with the data integrity and latency issues that arise when large organizations have many teams using the same data. Applications can easily use different data sources for different types of data, keeping customer data local and long-term sales data in the data lake. This lets companies deal with almost any size of data and still perform efficient searches, even at the scale of the internet.