How the data warehouse can stand between your data and your insights
You have a product that has taken off. Your daily active users metric has been growing exponentially. The number of events you’re logging per day is now in the hundreds of millions.
As a result, you now find yourself with terabytes of data or, if you have become really successful, hundreds of terabytes.
You begin to wonder if you could use all of this data to improve your business. Maybe you can use the data to create a more personalized experience for the users of your product. Or maybe you can use the data to discover demand for new products.
You ask your data team to come up with a way to leverage this data to do just these kinds of things.
The data team recommends that you develop a data pipeline, with the data warehouse as one of its end points.
You may get something like this:
[Figure: Data Pipeline and Data Warehouse]
But after months of work, and many dollars spent building the data warehouse, the data scientists that you hired can’t come up with the insights.
How could all of that data, all of those IT consulting hours, and all of those cloud computing resources be marshalled and still not produce the insights?
The problem likely lies in one of the important components of your pipeline: the data warehouse.
Here are some of the painful things you can experience in the data warehouse:
- Poor Quality Data
- Data that is Hard to Understand
- Inaccurate / Untested Data
- A Slow Data Warehouse
- A Poorly Designed Data Warehouse
- A Data Warehouse that Costs Too Much
- A Data Warehouse that Does Not Factor in Privacy Requirements
Poor Quality Data
Your data may be streaming in from multiple sources. When an analyst runs a JOIN on this data, the resulting table can be inconsistent. Inconsistent data can show up as missing columns that are required to properly identify each data item. Or the data may contain duplicates that take up extra space and prevent you from performing the aggregations needed to reach insights without extra work (meaning extra analyst time cleaning the data via interpolation, and extra compute hours deduplicating it).
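As a rough illustration of the two symptoms described above, here is a minimal Python sketch that counts rows with missing identifying fields and rows whose identifier is duplicated. The function name `quality_report`, the field names `user_id` and `event_id`, and the sample rows are all hypothetical; a real warehouse would more likely run equivalent checks in SQL or in a data quality tool.

```python
from collections import Counter

def quality_report(rows, key_fields):
    """Count rows with missing key fields and rows whose key is duplicated."""
    missing = 0
    keys = Counter()
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if any(value is None for value in key):
            missing += 1          # row cannot be properly identified
        else:
            keys[key] += 1
    # Every occurrence of a key beyond the first is counted as a duplicate row.
    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    return {"rows": len(rows), "missing_key": missing, "duplicate_rows": duplicates}

# Hypothetical joined events; field names are illustrative only.
events = [
    {"user_id": 1, "event_id": "a", "value": 10},
    {"user_id": 1, "event_id": "a", "value": 10},   # duplicate of the first row
    {"user_id": 2, "event_id": None, "value": 7},   # missing identifier
    {"user_id": 3, "event_id": "c", "value": 4},
]
report = quality_report(events, ["user_id", "event_id"])
```

Running a check like this right after the JOIN, rather than at analysis time, is one way to catch the problem before it costs analyst hours.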
Data that is Hard to Understand
You have PhDs on your analyst team. Why are they scratching their heads and shrugging their shoulders after looking at your data? It could be that the tables in the data warehouse are an enigma.
A lot of times, the data warehouse is built by a different team than the analysts. Both groups are trying to manage data but are not necessarily playing for the same data team.
Oftentimes the tables are built in a way that makes them easy to create but not easy to process downstream. The table is created without taking the downstream requirements into consideration! No one thought to begin the data warehouse design with the end goal in mind: quickly enabling insight generation.
Inaccurate / Untested Data
Data items can be wrong. Data items may reflect something that is not possible. The data may reflect something going on in society that you do not want to serve as a basis for downstream analysis. The data must be accurate; otherwise, it will lead your analysis to wrong or detrimental insights. Untested data is worse than having no data at all.
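One way to test for "not possible" values is a small set of sanity rules applied to every record before it is trusted. The sketch below is only an illustration: the function `validate_event`, the fields `session_seconds` and `event_date`, and the specific rules are assumptions, not checks prescribed by any particular warehouse.

```python
from datetime import date

SECONDS_PER_DAY = 24 * 60 * 60

def validate_event(event, today):
    """Return a list of problems found in one event record (empty if none)."""
    problems = []
    if event["session_seconds"] < 0:
        problems.append("negative session length")
    if event["session_seconds"] > SECONDS_PER_DAY:
        problems.append("session longer than a day")
    if event["event_date"] > today:
        problems.append("event dated in the future")
    return problems

# Hypothetical records; the second one violates two of the rules.
today = date(2024, 1, 15)
good = {"session_seconds": 120, "event_date": date(2024, 1, 14)}
bad = {"session_seconds": -5, "event_date": date(2024, 1, 16)}
good_problems = validate_event(good, today)
bad_problems = validate_event(bad, today)
```

Rules like these do not prove the data is right, but they catch values that cannot be right.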
A Slow Data Warehouse
A data warehouse can be of no use because it takes too long to query or goes down often. If users are not trained to write efficient queries, if the warehouse is not built to scale automatically with the growth of the data, or if there are no protections in place to prevent abuse of the warehouse’s compute resources, your insights will never materialize.
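To make "protections against abuse of compute resources" concrete, here is a minimal sketch that aborts a query once it exceeds a rough work budget. It uses Python's built-in `sqlite3` module purely as a stand-in for a real warehouse, and the helper `run_with_budget` and its parameters are hypothetical; production warehouses typically offer their own mechanisms, such as statement timeouts or per-user quotas.

```python
import sqlite3

def run_with_budget(conn, sql, max_steps=100_000, check_every=1_000):
    """Run a query, aborting it once it exceeds roughly max_steps VM instructions."""
    steps = 0

    def guard():
        nonlocal steps
        steps += check_every
        # A non-zero return value tells SQLite to interrupt the running query.
        return 1 if steps > max_steps else 0

    # SQLite calls guard() every `check_every` virtual machine instructions.
    conn.set_progress_handler(guard, check_every)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)  # remove the guard again

conn = sqlite3.connect(":memory:")
cheap = run_with_budget(conn, "SELECT 1")

# A deliberately wasteful query: count five million generated rows.
expensive_sql = """
WITH RECURSIVE counter(x) AS (
    SELECT 1 UNION ALL SELECT x + 1 FROM counter WHERE x < 5000000
)
SELECT count(*) FROM counter
"""
try:
    run_with_budget(conn, expensive_sql)
    aborted = False
except sqlite3.OperationalError:
    aborted = True  # the guard interrupted the query
```

The design choice here is to fail a runaway query early and loudly instead of letting it slow the warehouse down for everyone else.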
A Poorly Designed Data Warehouse
Business leaders who launch a data warehouse without first considering the business needs and translating these into actionable tasks will likely get a data warehouse that does not meet their business needs.
Not understanding these business needs up front leads to miscommunication among the analysts, which in turn leads to confused insights.
A Data Warehouse that Costs Too Much
One possible cause of a costly warehouse is choosing a warehouse implementation option that does not match your needs. Not every organization needs to build a from-scratch, on-premises data warehouse. Doing this takes a lot of time, the right people, and equipment. This can yield a project that is late, over budget, and expensive to maintain or upgrade. As a result, over time your warehouse becomes less useful as other priorities consume the organization’s resources.
A Data Warehouse that Does Not Factor in Privacy Requirements
Even if your product is a game, or something purely consumer oriented, and even if you spell out clearly in the terms of service that whatever data the user shares is yours, you still can’t ignore how the data warehouse will protect your users’ personally identifiable information.
Not taking this into consideration can result in people in the company being able to look up specific users for non-business purposes. It can result in people in the company misusing personally identifiable information, which can hurt your users, and negatively impact daily active user growth. It can result in personally identifiable information inadvertently leaking somewhere downstream.
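One common mitigation is to pseudonymize identifiers and drop direct identifiers before records reach the analysts. The following is a minimal sketch under stated assumptions: the helpers `pseudonymize` and `scrub`, the column names in `PII_FIELDS`, and the sample record are hypothetical, and a keyed hash by itself is pseudonymization, not full anonymization.

```python
import hashlib
import hmac

PII_FIELDS = ("email", "full_name")  # hypothetical column names

def pseudonymize(user_id, secret_key):
    """Replace a user id with a keyed hash: rows still join, ids are not readable."""
    return hmac.new(secret_key, str(user_id).encode("utf-8"), hashlib.sha256).hexdigest()

def scrub(record, secret_key):
    """Drop direct identifiers and pseudonymize the user id before loading."""
    clean = {k: v for k, v in record.items() if k not in PII_FIELDS}
    clean["user_id"] = pseudonymize(record["user_id"], secret_key)
    return clean

# Hypothetical raw record; a real key would live in a secret store, not in code.
raw = {"user_id": 42, "email": "pat@example.com", "full_name": "Pat Doe", "level": 7}
safe = scrub(raw, secret_key=b"demo-key-only")
```

Because the same key always yields the same pseudonym, analysts can still count and join per user without being able to look up who the user is.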
How to Deal?
There is no magic bullet for addressing these many issues. While some of these issues are technical in nature (and just require the right know-how), others are organizational, meaning you can’t just download a freeware tool to solve them.
But briefly, some of these issues can be addressed by:
- Having a well-organized product development process, for example using agile.
- Having a well-thought-out product life cycle process and organizing as cross-functional teams, which can work well.
- Realizing that there is no one-size-fits-all data warehouse. You will have some warehouses configured as high-speed data stores that capture data streaming in from your product. These are configured to prioritize transactional activity. Other data warehouses will be configured as always-on, highly available, scalable, and reliable data stores whose purpose is to hold your hundreds of terabytes of data in a queryable form for the data analysts.