Go jump in the data lake: The best advice for 2021
Author: Chris Samulski, Senior Director, Data Engineering and Analytics, Arbela Technologies
Database. Data warehouse. Data mart. Dataverse. Data Lake. It’s a lot, and these are not different names for the same thing. In this blog, we’re going to focus on Azure Data Lake because it will be hugely influential in 2021 and beyond.
First, a quick review of the data storage and dissemination landscape.
- Database: Related, structured data set; usually not all of a business’ data, but a “slice” of it.
- Data warehouse: A system to process and structure defined business data from various databases for reporting and analysis - foundational to business intelligence tools and systems.
- Data mart: A subset in a data warehouse typically devoted to a line-of-business or department.
- Dataverse: The Microsoft Dataverse is essentially database-as-a-service for all your business application data that is accessible and actionable via Power Platform and usable by everyone from administrative assistants to executives to developers.
- Data Lake: Azure Data Lake is, simply put, vast. The data is unstructured (unlike the defined data that lives in a warehouse) and, in most cases, is accessed and understood mainly by professional analysts and data scientists. Its capacity is virtually unlimited: trillions of objects, petabyte-size files (1 PB = 1,000,000 GB), endless options and avenues.
The fluidity of Azure Data Lake
In my opinion, one reason it should be called a Data Lake has less to do with its size (after all, a Dataverse/uni-verse should, semantically speaking, be larger), and more to do with the fluid nature of the data therein.
The size is there, as it’s built on Azure, making the Data Lake endlessly scalable. But it’s the way data can flow into and out of the Data Lake, coming from all sources (nothing is turned away), that makes it lake-like, just as a lake draws water from the ground, from the air, from streams and tributaries, and so on.
How to access the Data Lake and why
You might be wondering: Where does Azure Data Lake live? The answer is everywhere. Look at the following infographic of Microsoft tools, solutions, and systems. Can you see the Data Lake? (Hint: it runs under absolutely everything.)
You might also be wondering: What business purpose does Data Lake serve? The answer is similar to the one above: anything.
Again, the data in Azure Data Lake is unstructured and virtually endless, unlike a data warehouse, where the data is defined and readable by everyday people. That means the purposes it may serve depend on why it’s being accessed and by whom.
Data storage in the data lake: with Arbela, it’s anything but “all wet”
Following the above, we must also say that “unstructured” does not necessarily mean random. Azure Data Lake Storage from Microsoft applies the concept of a hierarchical namespace to storage, organizing data into a system of directories similar to the structure of files stored on your computer. This allows you to use both object-based and file-based models within the same data lake. Microsoft calls this capability multi-modal storage and describes Azure Data Lake Storage as the first cloud-based solution to offer it.
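To make the hierarchical namespace idea concrete, here is a minimal sketch in plain Python. It does not call any Azure API; the object keys and the helper names `build_tree` and `list_directory` are hypothetical and exist only for illustration. It shows how the same flat set of object keys can also be browsed as directories, which is the gist of using object-based and file-based models side by side.

```python
# Hypothetical object keys, as a flat object store might list them.
OBJECT_KEYS = [
    "raw/web/2021/01/clicks.json",
    "raw/web/2021/02/clicks.json",
    "raw/social/2021/01/mentions.json",
    "curated/revenue/summary.parquet",
]


def build_tree(keys):
    """Turn flat keys into a nested dict; files map to None."""
    tree = {}
    for key in keys:
        node = tree
        *dirs, filename = key.split("/")
        for name in dirs:
            # Create each directory level the first time it is seen.
            node = node.setdefault(name, {})
        node[filename] = None
    return tree


def list_directory(tree, path):
    """Return the sorted immediate children of a directory path."""
    node = tree
    for part in filter(None, path.split("/")):
        node = node[part]
    return sorted(node)


tree = build_tree(OBJECT_KEYS)

# Object-style access: the full key identifies one object.
print("raw/web/2021/01/clicks.json" in OBJECT_KEYS)

# File-style access: browse one directory level at a time.
print(list_directory(tree, "raw"))
print(list_directory(tree, "raw/web/2021"))
```

In a real data lake the storage service maintains this directory structure for you; the sketch only illustrates why a directory view makes large volumes of loosely structured data easier to navigate.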
This type of storage is ideal for performing analysis against large amounts of data not stored in a relational way. For example, you could be pulling and storing large amounts of user data from websites such as Google or social media platforms into a data lake for a robust analysis. But keep in mind that results coming straight from a data lake are, in most cases, not in a form that’s easy to present or understand.
Most people prefer to view data grouped together in a relational manner to recognize patterns, trends and outliers so they can make informed decisions affecting their business. For this reason, it makes sense to collect and analyze large amounts of data from multiple sources in data lakes, and then move the analyzed information into a data warehouse for further relational analysis and a more user-friendly presentation of findings.
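The lake-to-warehouse flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: the raw events, the field names, the `revenue_by_source` table, and the helpers `summarize` and `load_warehouse` are all made up for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical raw events, stored as JSON lines the way a lake might hold them.
RAW_EVENTS = [
    '{"source": "web", "region": "west", "amount": 120.0}',
    '{"source": "social", "region": "west", "amount": 30.0}',
    '{"source": "web", "region": "east", "amount": 75.5}',
    '{"source": "web", "region": "west", "amount": 80.0}',
]


def summarize(raw_lines):
    """Analyze the raw data: total amount per (source, region)."""
    totals = Counter()
    for line in raw_lines:
        event = json.loads(line)
        totals[(event["source"], event["region"])] += event["amount"]
    return totals


def load_warehouse(totals):
    """Load the analyzed totals into a relational table (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revenue_by_source (source TEXT, region TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO revenue_by_source VALUES (?, ?, ?)",
        [(source, region, total) for (source, region), total in totals.items()],
    )
    conn.commit()
    return conn


conn = load_warehouse(summarize(RAW_EVENTS))

# Relational view: easy to sort, filter, and hand to a reporting tool.
rows = conn.execute(
    "SELECT source, region, total FROM revenue_by_source ORDER BY total DESC"
).fetchall()
for row in rows:
    print(row)
```

The design point is the split of responsibilities: the heavy, open-ended analysis happens against the raw data, and only the compact, well-defined result lands in a relational table that business users can read.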
Enter Arbela. Arbela Data Insights is a data warehouse solution that extracts, stores, delivers and displays business-critical information from most systems within your organization, including Azure Data Lake Storage. It has pre-configured industry-standard and horizontal-specific KPIs and metrics for actionable BI reporting.
Jumping into the Data Lake
If you’re confused trying to define Data Lake/Data Lakes, you’re not alone. That complexity is why Azure Data Lake can really only be accessed by an experienced data scientist. It’s not for everyday people: most of us would get lost.
But when you equip a data scientist with an application like Arbela Data Insights and a business directive (say, uncovering every revenue stream inside and outside of a business’ core systems over a five-year term) they know, once they jump into the Data Lake, which direction to “swim” in.
Here’s the neat thing: with Data Lake, you never need to ask, “How can I get started?” Because you have data flowing into and out of your enterprise systems all day, every day, you already have.
Want to find out how we can help you make sense of your data? Contact us today.