
Data Warehouse With Delta Tables

Nowadays, companies are eager to adopt technologies like Data Lakes and cloud databases to store their data, so BI solutions need to adapt to these new technologies as well. Most of the time, Data Engineers and BI Engineers work with ETL/ELT tools to build pipelines from various sources into a Data Warehouse or data models.

The biggest challenge is choosing the right ETL/ELT tool to ingest data into the different stages. Databricks provides a quick and easy way to extract, transform, and clean data, and we can also build our Data Warehouse in Databricks by using Delta Lake.


We need to follow a few steps to set up the Data Warehouse in Databricks. According to best practices, it is better to use the Multi-Hop (Medallion) architecture to store data in Delta Lake across different stages (a minimal setup sketch follows the list below):

  1. Bronze Layer

  2. Silver Layer

  3. Gold Layer
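
Here is a minimal sketch of registering one database per layer, assuming a Databricks notebook (where `spark` is already available) and hypothetical Data Lake mount paths; adjust the names and locations to your workspace.

```python
# Minimal sketch: one database per medallion layer, backed by hypothetical
# Delta Lake storage paths under a Data Lake mount point.
layers = {
    "bronze": "/mnt/datalake/bronze",
    "silver": "/mnt/datalake/silver",
    "gold":   "/mnt/datalake/gold",
}

for layer, path in layers.items():
    # Create the layer's database at its storage location if it is missing.
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {layer} LOCATION '{path}'")
```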

Before ingesting data into the layers mentioned above, we first need to land data from the source system in a staging zone or landing zone. In some business cases, we can ingest data directly into the Bronze Layer.


Let's take the landing-zone approach before ingesting data into the different stages. Normally, the landing zone contains data filtered on the delta column, so before we start a project we first need to identify that column. Most of the time the delta column is a timestamp, but in some cases we can use a primary key; it depends on the source table's columns and data structure. For the initial load (the first-time load into the Data Lake), however, we need to bring all of the data into the landing zone.
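
A minimal sketch of that incremental pattern is below, assuming a hypothetical source table `source_db.orders` with a timestamp delta column named `last_modified`; the landing path and names are illustrative, not a fixed convention.

```python
from pyspark.sql import functions as F

# Hypothetical source table and landing path; the delta column here is a
# timestamp named "last_modified".
landing_path = "/mnt/datalake/landing/orders"
delta_column = "last_modified"

# Read the high-water mark from the previous run; on the initial load the
# landing table does not exist yet, so fall back to the epoch and extract
# all rows.
try:
    last_run = (
        spark.read.format("delta").load(landing_path)
        .agg(F.max(delta_column))
        .first()[0]
    )
except Exception:  # first run: no landing data yet
    last_run = "1970-01-01 00:00:00"

# Pull only the rows that changed since the last run.
incremental = (
    spark.read.table("source_db.orders")
    .filter(F.col(delta_column) > F.lit(last_run))
)

incremental.write.format("delta").mode("append").save(landing_path)
```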


The bronze layer contains all the raw data from the source system as-is, and we can store the full history there in delta format. The data then needs to be transformed and cleaned according to its behaviour; if garbage values land in the bronze layer, they must be cleaned before the curated data is stored in the silver layer. Databricks can handle all of the reads, writes, and transformations, and it is built to process large volumes of data.
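As a sketch of that bronze-to-silver hop, the example below applies a few typical cleaning rules; the paths, column names, and rules are assumptions for illustration, not a prescribed set.

```python
from pyspark.sql import functions as F

# Read the raw history from the bronze layer (hypothetical path).
bronze_df = spark.read.format("delta").load("/mnt/datalake/bronze/orders")

# Example cleaning rules: drop exact duplicates, discard rows missing the
# business key, and trim whitespace from a text column.
silver_df = (
    bronze_df
    .dropDuplicates()
    .filter(F.col("order_id").isNotNull())
    .withColumn("customer_name", F.trim(F.col("customer_name")))
)

# Write the curated result to the silver layer in delta format.
(silver_df.write.format("delta")
    .mode("overwrite")
    .save("/mnt/datalake/silver/orders"))
```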

We can automate the above process if we can identify common patterns of reading, writing, and transformation, and then use those patterns as metadata to drive the automation. I always keep this metadata in a configuration table so I can use it with Spark commands.
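Here is one way such a metadata-driven loop could look. The configuration table `etl_config.ingestion_rules` and its columns (`source_table`, `target_path`, `delta_column`, `last_value`) are hypothetical; the point is only that each row describes one ingestion pattern.

```python
from pyspark.sql import functions as F

# Read all ingestion rules from a hypothetical configuration table.
config = spark.read.table("etl_config.ingestion_rules").collect()

for rule in config:
    df = spark.read.table(rule["source_table"])

    # If the rule names a delta column, apply the incremental pattern;
    # otherwise take the full table.
    if rule["delta_column"]:
        df = df.filter(F.col(rule["delta_column"]) > F.lit(rule["last_value"]))

    (df.write.format("delta")
        .mode("append")
        .save(rule["target_path"]))
```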



Data Warehouse in Delta Lake

A Data Warehouse has fact and dimension tables used to build dimensional models, data marts, OLAP cubes, and so on. Normally, Data Warehouse tables are stored in databases, but in Delta Lake we can use delta tables to store the facts and dimensions. As you can imagine, everything is in one place now; this is called the single source of truth.

The silver layer tables are used to create facts and dimensions in delta format for the Gold Layer. Databricks lets you use delta tables in BI tools such as Power BI, Tableau, and others, where we can create relationships between the tables and build models to analyse business KPIs.
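
A minimal sketch of that silver-to-gold step is below, building one dimension and one fact; the table names, columns, and surrogate-key approach are illustrative assumptions.

```python
from pyspark.sql import functions as F

# Read the curated silver tables (hypothetical paths and schemas).
orders    = spark.read.format("delta").load("/mnt/datalake/silver/orders")
customers = spark.read.format("delta").load("/mnt/datalake/silver/customers")

# Dimension: one row per customer, with a generated surrogate key.
dim_customer = (
    customers.select("customer_id", "customer_name", "country")
    .dropDuplicates(["customer_id"])
    .withColumn("customer_key", F.monotonically_increasing_id())
)

# Fact: order rows keyed to the dimension through the surrogate key.
fact_orders = (
    orders.join(dim_customer.select("customer_id", "customer_key"),
                "customer_id")
    .select("order_id", "customer_key", "order_date", "amount")
)

# Register both as gold-layer delta tables for BI tools to consume.
dim_customer.write.format("delta").mode("overwrite").saveAsTable("gold.dim_customer")
fact_orders.write.format("delta").mode("overwrite").saveAsTable("gold.fact_orders")
```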



Thank you...
