Data Warehouse, Data Lake, Data Mart, Data Hub: A Definition of Terms
In today’s business environment, most organizations are overwhelmed with data. They are looking for a way to tame the overload so that team members can gather and analyze data and make the most of the information held within the walls of the enterprise. When a business enters the domain of data management, it can easily get lost in a morass of terms and concepts. Without a clear understanding of the various categories of data management options, the business may make the wrong choice, or become so mired in the review process that it gives up its quest.
This article is the first of two on the topic of Data Management. Here, we define the various terms so that a business can more easily understand the types of data management solutions and tools. In the second article, entitled ‘Factors and Considerations Involved in Choosing a Data Management Solution’, we discuss the factors and considerations that a business should weigh when it is ready to choose a data management solution.
A Data Warehouse (AKA Datawarehouse, DWH, Enterprise Data Warehouse or EDW) solution is designed to centralize and consolidate large bodies of data from multiple, disparate sources. It is meant to help users execute queries, perform analytics, produce reports and obtain business intelligence. Its data typically comes from applications, log files and historical transactions; the warehouse integrates and stores data from relational databases and other sources originating in various business units and operational entities within the enterprise, e.g., sales, marketing, HR, finance.
A Data Warehouse is a structured environment that comprises one or more databases and is organized in tiers. An interactive, front-end tier provides search results for reporting, analytics and data mining. A middle tier, the search engine, accesses and analyzes the data for presentation, and the foundational tier, the database server, provides the storage and loading repository.
To prepare data for analysis, a Data Warehouse environment will typically use an Extraction, Transformation and Loading (ETL) process. Team members who access a Data Warehouse may use SQL queries, analytical solutions or BI tools to mine, report on, visualize, analyze and present the data.
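The ETL steps described above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database standing in for the warehouse; the source records, the `sales` table and all function names are hypothetical and chosen only for illustration, not taken from any particular product.

```python
import sqlite3

# Hypothetical source records (illustrative only): amounts arrive as strings
# and region names arrive with inconsistent casing and whitespace.
SOURCE_ROWS = [
    {"date": "2024-01-05", "region": " North ", "amount": "1200.50"},
    {"date": "2024-01-05", "region": "south", "amount": "980.00"},
    {"date": "2024-01-06", "region": "NORTH", "amount": "300.25"},
]


def extract():
    """Extract: a real pipeline would read from applications or log files."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: normalize region names and convert amounts to numbers."""
    return [
        (r["date"], r["region"].strip().lower(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(sale_date TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the three steps against an in-memory stand-in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


def total_by_region(conn):
    """The kind of SQL query a team member might run after loading."""
    query = (
        "SELECT region, SUM(amount) FROM sales "
        "GROUP BY region ORDER BY region"
    )
    return conn.execute(query).fetchall()
```

Calling `total_by_region(run_etl())` returns one total per normalized region, which shows why the transformation step matters: without it, " North " and "NORTH" would be reported as separate regions.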
We can think of a Data Mart as a subset of a Data Warehouse. Whereas a Data Warehouse is an enterprise-wide solution that comprises data from across the organization, a Data Mart is a structured environment used to store and present data for a specific team or business unit. This approach allows a business team or unit to curate, leverage and manipulate data that is specific to its own work. For example, a business might create a Data Mart to serve its Marketing, Sales and Advertising teams, or it might expand that use to include Customer Service and Product teams so that it can more easily analyze and collaborate using data culled from specific sources within these business units.
While Data Warehouses access and analyze large volumes of records, a Data Mart improves the response time and performance for end-users by refining the data to provide only data that will support the collective needs of a specified group of users.
Think of a Data Mart as a ‘subject’ or ‘concept’ oriented data repository. A Data Mart often provides a subset of data from a larger Data Warehouse and is designed for ease of consumption, to produce actionable insight and analysis for a particular group.
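As a rough illustration of the 'subset' idea, the sketch below copies only the rows for selected departments from a warehouse table into a separate mart table. It assumes an in-memory SQLite database; the `warehouse_facts` table, its columns, the sample values and the function names are all hypothetical.

```python
import sqlite3


def build_warehouse():
    """Create a tiny in-memory 'warehouse' with rows from several departments."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
        [
            ("marketing", "campaign_spend", 5000.0),
            ("sales", "revenue", 42000.0),
            ("hr", "headcount", 120.0),
            ("finance", "budget", 90000.0),
        ],
    )
    return conn


def create_data_mart(conn, mart_name, departments):
    """Copy only the rows for the given departments into a mart table."""
    # Table names cannot be bound as SQL parameters, so validate the name.
    if not mart_name.isidentifier():
        raise ValueError("mart_name must be a simple identifier")
    conn.execute(
        f"CREATE TABLE {mart_name} (department TEXT, metric TEXT, value REAL)"
    )
    placeholders = ", ".join("?" for _ in departments)
    conn.execute(
        f"INSERT INTO {mart_name} "
        f"SELECT department, metric, value FROM warehouse_facts "
        f"WHERE department IN ({placeholders})",
        list(departments),
    )
    conn.commit()


def mart_rows(conn, mart_name):
    """Return the contents of a mart table, ordered by department."""
    if not mart_name.isidentifier():
        raise ValueError("mart_name must be a simple identifier")
    return conn.execute(
        f"SELECT department, metric, value FROM {mart_name} ORDER BY department"
    ).fetchall()
```

For example, `create_data_mart(conn, "marketing_sales_mart", ["marketing", "sales"])` leaves the warehouse table untouched and gives the two teams a smaller table to query. In practice a mart may also be built as a view or in a separate database; copying rows is just one simple option.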
A Data Lake is a less structured and more flexible approach to data management, with data streaming in from various sources and a more free-wheeling approach to data access, exploration and sampling. A Data Lake stores data without imposing an organization or hierarchy up front. All data types are stored in raw or semi-transformed form, and data is only organized for presentation and use as queries or requests are generated.
A Data Lake can store structured (relational databases, rows and columns), semi-structured (XML, JSON, logs, CSV) and unstructured or binary (Word documents, PDFs, images, email, audio or video) data. It acts as a repository for various data sources, and users can draw on that data for many types of analytics, from visualization and dashboard presentation to machine learning and data processing.
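The idea that a Data Lake keeps data raw and only organizes it when a query arrives can be sketched as follows. This is a simplified illustration, assuming a plain Python list standing in for the lake; the function names, source names and sample payloads are hypothetical.

```python
import csv
import io
import json


def ingest(lake, source, fmt, payload):
    """Store the payload untouched, with only minimal metadata attached."""
    lake.append({"source": source, "format": fmt, "raw": payload})


def read_records(lake):
    """Parse raw payloads into dict records only when a query asks for them."""
    for obj in lake:
        if obj["format"] == "json":
            # Assumes the JSON payload is a list of objects.
            yield from json.loads(obj["raw"])
        elif obj["format"] == "csv":
            yield from csv.DictReader(io.StringIO(obj["raw"]))
        # Other formats (e.g. images, PDFs) stay in the lake but are skipped.


def total_clicks(lake):
    """Example query: sum a 'clicks' field across every parseable source."""
    return sum(int(rec["clicks"]) for rec in read_records(lake))
```

A lake built with `ingest(lake, "web_app", "json", '[{"user": "ana", "clicks": 3}]')` and `ingest(lake, "crm_export", "csv", "user,clicks\nben,5\n")` can also hold a binary payload that this query ignores. Nothing is converted at ingestion time; the structure is applied only inside `read_records`.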
A Data Hub solution is typically a more flexible, personalized approach to data management with various integration technologies and solutions overlaid to provide the structure or output needed by the business. The data flows from various sources – not all of which will be operational. A Data Hub can provide data in various formats and perform actions to refine data for quality, security, duplicate removal, aged data, etc.
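Two of the refinement actions mentioned above, duplicate removal and the handling of aged data, might look something like the sketch below. The record layout (`email` and `updated` fields), the one-year default cutoff and the function name are assumptions made for illustration; a real Data Hub would apply whatever rules the business defines.

```python
from datetime import date, timedelta


def refine(records, today, max_age_days=365):
    """Drop aged records, then drop duplicates by normalized email.

    The first occurrence of each email is kept. Both the matching key and
    the age threshold are illustrative choices, not fixed rules.
    """
    cutoff = today - timedelta(days=max_age_days)
    seen = set()
    refined = []
    for rec in records:
        if rec["updated"] < cutoff:
            continue  # aged data: last updated before the cutoff date
        key = rec["email"].strip().lower()
        if key in seen:
            continue  # duplicate: same email already kept
        seen.add(key)
        refined.append(rec)
    return refined
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the rule repeatable, which helps when the same refinement has to be audited later.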
The Data Hub is meant to collect and connect data to produce insight for collaboration and data sharing. It will act as an integration and data processing hub to connect data sources and make them more readily accessible and usable for team members. The definition of a Data Hub will vary by business use and by organization, as the parameters and structure of the hub environment flex to each organization’s needs. As a result, factors like available models, data governance and access, data persistence, analytical formats and reporting options will vary.
As you consider the various solutions and options for data management, be sure to develop and use a comprehensive and detailed set of requirements and elicit feedback from those who will use and manage the solution.
Now that you understand the various Data Management options, you are ready to select an option for your business. The second article in our two-part series, entitled ‘Factors and Considerations Involved in Choosing a Data Management Solution’, will provide some simple suggestions and recommendations to help you choose the right option.