Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Data Lake vs. Data Warehouse


By:   |   Last Updated: 2019-03-14   |   Comments (4)   |   Related Tips: More > Big Data

Problem

We are already running a large data warehouse for our business, but now the management wants us to build a data lake. What's the difference? Where will we use it? Who will use it?

Solution

A data lake is a centralized location for storing an organization's data assets in their native form. Ideally, these data assets are collected from every touch point of the business and stored with the intention of analyzing in future. The purpose of capturing every data element of interest is to ensure business can use it to gain competitive market advantage. Ever since the start of modern computing, databases have been used for this purpose. Data lakes are a natural extension of databases - and later data warehouses – based on the variety of data and how it's stored or used.

Let's a say a business decides to capture information about its client interactions. This has been the role of CRM (Customer Relationship Management) applications for a long time. Users would record sales prospective, customer feedback and other information in a CRM database. Typically, the database would be relational with predefined tables representing customer and associated entities.

Modern businesses interact with customers in many different ways though: there may be a brick-and-mortar call center using a CRM; but then there will be one or more websites with their feedback forms, direct e-mails, e-commerce store, mobile apps, social media presence or business partner channels. All these are valuable sources of information which can provide a 360 view of the customer. To maintain a competitive advantage in the market, the business will need to capture information from all these outlets.

But not every piece of information can be saved in a database. Some data would be highly unstructured, like images (think of users sending faulty product images); some may be semi-structured, like social media feeds or XML documents. It's impossible to store every type of data in one single database and that's where a data lake can help.

In its basic form, a data lake is nothing but a huge pool of storage where data can be saved in its native, unprocessed form, without any transformation applied. For example, a data lake can store both nightly batches of CSV files (which is structured data) offloaded from the CRM and streaming feeds from the social media channel. The same data lake could be hosting semi-structured customer satisfaction survey files sent by third-party vendors.

How are Data Lakes Different from Data Warehouses?

So now we know a data lake holds all types of information a business deems strategically important, and the information is stored in its native, raw form.

This is not the case the with a data warehouse. First off, data warehouses typically store relational data, which is structured. There are tables in a data warehouse and those tables have relationships and can follow data models like snowflake or star schema.

Secondly, a data warehouse hosts only a subset of data from different sources. For example, a data warehouse can get its data from sales, product, customer and finance database systems, but it may skip any feeds from HR and payroll systems. In other words, data warehouses are purpose-built, meant to answer a specific set of questions. To cater this, source data is cleansed and processed before loading into the data warehouse.

Thirdly, data in a data warehouse may be aggregated in nature. Instead of raw transaction data, aggregated values are used for decision making.

Fourthly, data in a data lake is saved in perpetuity, regardless of use in immediate or near future. The notion here is "store everything; we may need it for future analysis". This is possible because data lakes are typically backed by large amount of storage at low-cost (more on that later).

In contrast, data warehouses are usually run on premium quality, IO-optimized storage for maximum performance. That's why storage for a data warehouse is not cheap, and to preserve performance, a data warehouse does not retain unlimited data: it's usually deleted or archived after an expiration period. Very large data warehouses can use techniques like data marts, partitioning, sharding or federated servers.

Use Cases for Data Lakes vs. Data Warehouses

The use cases for data lakes and data warehouses are quite different as well. There can be more than one way of transforming and analyzing data from a data lake. It may or may not need to be loaded into a separate staging area. For example, CSV files from a data lake may be loaded into a relational database with a traditional ETL tools before cleansing and processing. On other hand, image or video data could be directly analyzed from the lake by a machine learning algorithm. Generally, data from a data lake requires more pre-processing, cleansing or enriching.

This is not the case with data warehouses. Data in a warehouse is already extracted, cleansed, pre-processed, transformed and loaded into predefined schemas and tables, ready to be consumed by business intelligence applications.

That's the reason users and use cases of data lakes and data warehouses are different. Data lakes are often used in big data operations like data mining or machine learning for finding patterns, building predictive models or other complex high value outputs. The users are typically data scientists or advanced data analysts. Tools are typically related to the Hadoop ecosystem like Apache Spark or Hive.

Data warehouse use cases are mainly restricted to business intelligence, reporting, and visualizations with dashboards. The main technical users of a data warehouse are data analysts, report designers or sometimes data scientists, and end users are business decision makers.

This does not necessarily mean data lakes and data warehouses are two isolated business assets. In fact they are supposed to complement each other. For example, raw data in a data lake could be first level assets for a company. A process could load this data in a staging environment, transform it into relational data, aggregate it and then load it in the data warehouse. This transformed data then becomes a second level data asset. Further processing and enriching could be done in the warehouse, resulting in the third and final value-added asset. This final form of data can be then saved back to the data lake for anyone else's consumption.

Schema on Read vs. Schema on Write

Another difference between a data lake and a data warehouse is how data is read. A data lake hosts data in its raw format without any schema attached to it. Schema is only applied when data is read from the lake. This is called schema on read.

For example, let's say a data lake has a collection of many thousand JSON files. These files may not follow any particular schema, they may be many levels deep, but they may also have some common fields. A data scientist can extract only those common fields from each file and populate a table. In this case, the data scientist writes a query to parse the JSON files, extract the field values and populate a table. In other words, a schema is applied on the source files when data is actually read.

Data warehouse schemas are predefined. A schema consists of multiple tables and their fields, data types, constraints and relationships.  The schema is designed and developed with business rules in mind and tested for functionality before data is loaded into it. When data is written to a data warehouse, the process can be called "scheme on write" because the write operation is conforming to the already existing schema, otherwise it will be rejected. When data is read from the warehouse, applications do not impose any schemas of their own.

Storage

Let's talk about data lake storage.

Data in a data lake is slow moving. It's not meant for immediate access, but it needs very large storage space. Cloud-based low-cost object storage mediums are therefore perfect fit for data lakes. If we consider Amazon Web Services (AWS), Simple Storage Service (S3) is an ideal solution. Similarly for Microsoft Azure, the storage could be Azure Data Lake Storage. For Google Cloud Platform, it could be Google Cloud Storage.

Data lakes can be also hosted on Hadoop clusters. In this case data is ingested into the Hadoop File System (HDFS) and remains close to the compute power of the data nodes. Total storage capacity of the cluster is the storage available for the data lake.

ELT vs. ETL

Transforming data is not so much a priority in data lakes as much is loading data. Typically, data pipelines for a data lake extract data from source systems and loads that into target as quickly as possible. ELT (Extract, Load and Transform) tools are therefore ideally suited for this. Many ELT tools can connect to data lake storage systems natively. The transformation part generally varies between data consumers, so there can be many different types of transformation use cases. Generally, transformation is done by end-user applications connecting to the data lake. Using only the "E" and "L" of ELT also means data lake pipelines are simple and inexpensive.

Data Warehouses typically depend on ETL (Extract, Transform and Load) where one or more ETL packages run within one or few dedicated compute resources. This compute layer extracts data, transforms it, and then loads it into the data warehouse. The transformation processes for data warehouses are well defined, represent strict business rules, and repetitive in nature.

Another difference between data lake ELT and data warehouse ETL is how they are scheduled. With data lakes, there may (or may not) be, scheduled loading and transformation processes. Some transformations can be one-off or ad-hoc. Data warehouse transformations are almost always scheduled.

Data Lake Challenges and Best Practices

Like any other technology solution, data lakes come with its own set of challenges and best practices. Here are a few:

Data Governance

Businesses often start data lake projects by assuming it needs to store every piece of data it comes across. This is natural thinking, but it also means the lake could be containing unnecessary information which will never be used. Unless checked, a data lake risks becoming a swamp - a dumping ground for everyone's data.

Storage might be cheap, but storing massive amounts of unnecessary data also means:

  • The time to seek out relevant information will also increase
  • There will be no definite catalog of data in the lake
  • Other issues like data duplication or security will need to be addressed (discussed next)

Data stewards or data architects therefore need to address this proactively from the very onset of the project. The company needs to decide what type of information it's going to store in the lake, where it will be sourced from, how it will be stored and who will be the custodians and consumers of that data.

For example, if a business offers managed IoT service for its customers, saving online ad clickstream data in the lake may not be as important as saving all the IoT data.

Data Duplication

This is another pitfall of poor data governance. When people know about a data lake, they quickly start to build their own "areas" within the lake: silos of data related to their own teams or departments, unaware that another team might be saving the exact information in different form somewhere else in the lake. When a consumer searches for a data item, it may find the same piece of data in two different places, not knowing which version is the latest and more trustworthy.

There are two ways to address this challenge:

  • Data stewards and data architects can build specific areas in the lake for different data sources. These areas should be accessible by every team involved in the data's lifecycle. The existence of the area is then communicated to these teams. 
  • Maintaining appropriate metadata for different sets of data. For example, a data lake folder containing aggregated information from accounts payable system can have a metadata file showing the source server name, the date of loading, the user account that loaded the data and the accounting period.

Data Security

Security is a complex thing to manage in data lakes. Data warehouses usually have DBAs who build security models around databases, schemas or tables for specific sets users, groups or applications and their access requirements. Data lakes are quite different. They have varying - and often ad-hoc - needs of access from users, applications or even external parties. And because it's implemented on top of an object storage system, there is no DBA to manage security, no command to run for restricting access.

There are few ways to address this issue:

  • The same team responsible for the organization's data security should be entrusted with protecting the data lake. This may be the DBA team or the operations team.
  • The security team will provide read access to required files or folders to relevant users as and when necessary.
  • There should be only specific accounts – preferably application accounts – that can load data in the lake.
  • The data lake will not store sensitive data like personal or financial information in its original form. It will be masked before storing. The application responsible for using the data will need to decipher it.
  • If using cloud-based storage, all data should be encrypted with one or more keys that are regularly rotated.
  • There should be adequate measures for backing up critical pieces of data or at least versioning them. For example, Amazon Web Service S3 storage allows long time retention of data at lower cost in Amazon Glacier and enable versioning of files in a bucket.

Data Toolsets

With a wide variety of data types and their use-cases, data lake users can quickly start using tools of their own choice. This can quickly grow into an organization-wide proliferation of non-standard (and often vulnerable) tools and practices. It's therefore necessary to standardize on a set of tools for accessing the data lake and writing to it.

For example, it may be stipulated that data scientists only use a particular type of notebook technology, use specific languages like R or Python and use a set of data manipulation libraries. Using the same model, data analysts could be asked to use one or two specific SQL query tools that are licensed and ETL developers could be trained to use a standard integration tool.

Data Lake Adoption

Perhaps more challenging than implementing security is to actually make people use the data lake. In today's world, information should be democratized, people should be able to access relevant data without any hassle or too many hoops. If users see the data lake as an opaque, complex piece of technology, they will quickly turn away from it, defeating the purpose of a unified platform.

It's therefore necessary to build some self-service capability for different groups of users. This can start with a simple awareness program where users are trained on the data lake's existence, it's purpose, the business value it offers and how to use it with existing tooling. Further down the journey, the company can invest on self-service portals for ad-hoc searching and query building.

Conclusion: Some Data Lake Technologies and Vendors

Hopefully this article has given some background on data lakes and their best practices. We recommend users familiarize themselves with tools and technologies for data lake solutions offered by different vendors. Here are some starting points:

  • AWS Lake Formation is a service from Amazon Web Service for creating a secured data lake with minimal manual configuration. Currently it's in preview mode.
  • Microsoft Azure Data Lake consists of a number of underlying tools for building a data lake.
  • IBM and Hortonworks offer joint solutions for data lake on the Hadoop ecosystem.
Next Steps
  • Find if your company is currently using a data lake. If not, can it benefit from a data lake? Investigate if there are important data elements currently missing from your data stores which can add value to the business.
  • Learn more about data lake technologies offered by different vendors. This can include storage vendors, data integration vendors, database vendors, BI tool vendors.
  • If your company already has a data lake, identify its users, use cases, tools and any current challenges.
  • Understand your organization's data governance practices and how the data is shared.


Last Updated: 2019-03-14


next webcast button


next tip button



About the author
MSSQLTips author Sadequl Hussain Sadequl Hussain has been working with SQL Server since version 6.5 and his life as a DBA has seen him managing mission critical systems.

View all my tips
Related Resources




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Wednesday, March 20, 2019 - 3:04:17 PM - Donna Kelly Back To Top

Appreciate your thoughts . . .

Just one update:  You wrote: "all the historical data is not retained with the same data that is used for current reporting or analytical purposes"

All the data warehouses I've ever created, developed, or worked on do indeed have all historical data retained with current data!  We couldn't report on changes over time, otherwise.  I would remind you of Bill Inmon's definition of a data warehouse:   “A Data Warehouse is a subject oriented, integrated, nonvolatile, and time variant collection of data in support of management’s decisions (Inmon, 2001).”

All my data warehouses grow infinitely.  In practice, that's not all that big.  A few years ago, I validated some hardware as capable of processing the largest data warehouse in Europe (100,000,000 transactions pernight, total 1,000,000,000,000 facts at time of creation).  The main fact tables were only a few terabytes.  The whole thing fitted in a cab of disk.  Nowhere near 'big data' (e.g. weather data, cosmic ray phenomena, etc.) which is in the exabyte-zettabyte range, and fit in server halls. :-) 

You know what would be really interesting:  a case study or two where an enterprise has made a serious investment in a data lake (complementary to their enterprise data warehouse) and has got documented real-world cost-benefit out of the data lake.  I would love to see such a case study.  Do you know of any such?


Friday, March 15, 2019 - 9:42:29 PM - Sadequl Hussain Back To Top

Hi Donna, thanks for your comments!

While it is true that in DWH we have slowly changing dimensions, the underlying structure of the DWH is not changing overnight.

Data warehouses can indeed be very large, and yes, in health or finance industries data has to be retained for a very long time. However, all the historical data is not retained with the same data that is used for current reporting or analytical purposes. That's why we have partitions (in the same table), archive tables (in the same or different databases) or data marts (in the same server or different servers). In other words, the actual DWH being used for querying, reporting etc. is still controlled, and not allowed to grow infinitely, even when the summarized data in a warehouse is much smaller than their transaction sources.

There are two reasons for it: one is to ensure the query performance is not degraded over time, the other is to ensure the disk space available for currently accessed data is optimized. That's why, a data warehouse could be made up of multiple data marts - all data marts could be using a cheap magnetic storage except the main data mart - which could be using SSD storage.

Disk space cost has reduced and is reducing exponentially, and there are serverless data warehouses today like Snowflake or Big Query where you don't have to worry about space, but still, petabyte scale data warehouses like Redshift tries to optimize space - using techniques like compression and distribution.

In contrast, modern data lakes are not based on server and disk model, but rather object data stores. This storage mediums have much slower latency than provisioned IOPs based disk systems used in DWH, but they are also not accessed as frequently as a DWH. And most importantly they are cheaper than the block storage used in DWH. They are also not supposed to answer immediate business questions.

However, if you consider it, a Data Warehouse cannot be a panacea for storing and analyzing and correlating every single data sources for the following reasons:

- Storing multiple unrelated data entities in the same warehouse (e.g IoT and social media data with finance data) does not make sense. In fact, it would waste valuable disk space in a server.

- The cost of designing, testing implementing and managing multiple data pipelines from each source to the warehouse would be prohibitive.

- It will mean data analysts and scientists who want to make sense of multiple pieces of data from different sources would need to access different sources - which is not optimal and also goes against data security.

Compared to this, a data lake serves a completely different purpose and uses a completely different model. The question is not about whether one supersedes the other, but rather how they work together to obtain a greater business value.

Data Lakes do not need disk performance and it's main consumer is NOT operational reports, but analysts and data scientists who try to correlate data to come with business models which are not apparent from DWH reports.

Also, it's not about data duplication. In a properly designed data landscape, a data lake would be the single point of truth - where all relevant data is first staged. They may be structured or unstructured, clean or unclean. A data pipeline and processing stage would be used to extract, clean, summarize and enrich the data, and then load it into a data warehouse. This warehouse would then be used to run operational reporting. Other data not ending up in DWH will still be retained, and they may be loaded into other systems for further analysis. And that's why in reality it's not uncommon to see multiple data warehouses consuming data from the same data lake, but serving different purposes. One could be used by Finance, another by Marketing and so on.

When you implement a data lake, it is understood that you will be dealing with a large storage space eventually - much larger than even petabytes of data - and you will be storing data without immediately understanding it's business value. However, the business value of a data lake is achieved as people make use of it - this is the same for data warehouses too. Some data may never be used, but some data will be used time and again and some data will be discovered later. But yes, governance is a key component here - business needs to understand what they are storing and why they are storing. And they need to ensure it is the single source of truth.

I would not say data lake is a commercial ploy from big vendors. I believe its adoption in the industry has been the result of companies realizing they can get complementary business value by storing extremely large amounts of data in a different location, at a cheaper price, with greater flexibility and then using other technologies like DWH on top of it.

Hope this answers your points :)


Friday, March 15, 2019 - 4:24:32 PM - Donna Kelly Back To Top

Hi and thanks for the comprehensive overview. I thought it was thought-provoking, and I’d like to offer some comments for debate.

First a technical issue. You say “a data warehouse does not retain unlimited data”. I would argue that’s absolutely not the case. We construct such things as Slowly Changing Dimensions of various types, whose purpose is to retain history over long periods. We do this so that we can do things like analyse customer buying patterns as their demographics change over the years. Healthcare data warehouses retain data forever.

I’d further argue that data warehouse data tends to be much smaller in volume than the transaction systems from which it derives. For example, a retail transaction system of 350GB current transactional data translates to 30GB for one year’s stock and sales data (real volumes from a real 6000 employee retail chain). Data lakes on the other hand tend to be very large (several terabytes at least) because they typically contain much unstructured data such as weblogs. We build our data warehouses today to deliver sub-second reporting performance. Data lakes tend to sit on near-line disk optimised for volume, not speed. Because of storage costs, data lakes tend to contain only current data.

The second issue is one of data governance. One of the reasons for going down the single integrated data warehouse route is to ensure all data is clean, in the same place, and with a single set of calculations and KPIs being performed on it. Integrity of the data is assured in this fashion, and information consumers can trust the ‘single source of the truth’. No more spreadmart hell. Introduction of a secondary source of reporting breaks this all to hell and gone. Not good. Not good at all.

The third issue is the most significant; it is a business issue. Data warehouses are structured and optimised for fast response to business queries. They're built to provide direct value to the business. Data lakes are essentially duplicates of operational data which already exists in their original locations. The masses of storage required by a data lake are not cheap. Add the cost of storage the cost of creation and implementation of the associated data governance policies. Creation of such a thing therefore requires a compelling business case. If all you are doing is collecting bucketloads of data with the hope one day of making use of it, then all you’re really doing is spending money without a purpose. If there’s a specific data stream that you do need to analyse and make use of, then why not integrate it into your enterprise data warehouse?

In summary, is not the 'Data Lake' a solution without a problem, just a way for IBM, Amazon and Microsoft to make money flogging disk? In essence, a marketing cliché to extract cash from unwary buyers?

I’ve deliberately been a little provocative in my comments, in the hope of sparking discussion, and perhaps come to common ground. I’d appreciate hearing your response.


Thursday, March 14, 2019 - 1:51:46 PM - Anne Cao Back To Top

Great article, I was wondering what is data lake, this explains the difference very clearly!


Learn more about SQL Server tools