Let's Use Databricks Workflows in an Effective Way - Trigger Raw Data Load Process

Shamen Paris
Oct 7, 2023

Databricks Workflows keeps getting better. Several of its features are helpful when developing an automated data load process. The example below shows how to build a generic raw data load process that starts a Databricks workflow job when a file arrives at the source location.

The requirements for a generic file-based data load procedure vary, but some are common to most cases.

As is commonly understood, the columns and data types in the file must match those of the target table when loading raw data. Based on such requirements, we can write a single method that loads the data into raw Delta tables. The requirements themselves can be stored as entries in a configuration table, and the matching entry is then passed to that method.
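A minimal sketch of such a method could look like the following. The `RawLoadConfig` fields, the function name `load_raw_data`, and the example paths and table names are assumptions made for illustration; they are not taken from the original setup. The `spark` argument is the session that a Databricks notebook already provides.

```python
from dataclasses import dataclass, field


@dataclass
class RawLoadConfig:
    """One entry of the configuration table (hypothetical column names)."""
    id: int
    source_path: str      # folder where the files arrive
    file_format: str      # e.g. "csv", "json", "parquet"
    schema_ddl: str       # expected columns and types as a DDL string
    target_table: str     # raw (Bronze) Delta table to load into
    reader_options: dict = field(default_factory=dict)
    write_mode: str = "append"


def load_raw_data(spark, config):
    """Read files using the configured schema and write them to a Delta table."""
    df = (
        spark.read.format(config.file_format)
        .schema(config.schema_ddl)          # enforce the expected columns and types
        .options(**config.reader_options)   # e.g. {"header": "true"} for CSV
        .load(config.source_path)
    )
    df.write.format("delta").mode(config.write_mode).saveAsTable(config.target_table)
    return df


# Example entry (assumed values):
sales_config = RawLoadConfig(
    id=1,
    source_path="/mnt/landing/sales/",
    file_format="csv",
    schema_ddl="order_id INT, amount DOUBLE",
    target_table="bronze.sales",
    reader_options={"header": "true"},
)
```

In a notebook, calling `load_raw_data(spark, sales_config)` would then perform the load for that one entry, and a new source only needs a new configuration entry rather than a new function.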

The main benefit of maintaining a single method for loading raw data is that we do not need to build a new data ingestion notebook for every source.

To execute this standard notebook, create a job with a notebook task in Databricks Workflows and add a trigger. Since there is currently no iteration option in Databricks Workflows, one job must be created per entry in the process's configuration table. However, because the notebook already implements the common raw data load, no additional ingestion notebooks need to be built.

To fetch the correct configuration when the notebook runs, we have to pass a parameter to the notebook when we build the workflow. The parameter can be added during workflow setup, as shown in the image below.
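On the notebook side, the parameter can be read with `dbutils.widgets.get`. The sketch below assumes the parameter is named `config_id`, which is a name chosen here for illustration, and takes `dbutils` as an argument so the helper stays easy to test outside a notebook.

```python
def get_config_id(dbutils, param_name="config_id"):
    """Return the job parameter as an integer Id.

    `param_name` is an assumed parameter name; use whatever key was
    entered for the task when the workflow was set up.
    """
    raw_value = dbutils.widgets.get(param_name)  # widget values arrive as strings
    return int(raw_value)
```

Inside the notebook this would be called as `config_id = get_config_id(dbutils)`.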

Additionally, the configuration table needs an Id column so that the configuration can be filtered for each job. The table can be set up as shown below.
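As a rough illustration, the configuration table and the lookup by Id might be sketched as follows. The table name `config.raw_load_config`, its columns, and the helper `get_config_row` are hypothetical; only the Id column is required by the approach described here.

```python
# Assumed layout for the configuration table; adjust columns as needed.
CONFIG_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS config.raw_load_config (
    id INT,
    source_path STRING,
    file_format STRING,
    schema_ddl STRING,
    target_table STRING
) USING DELTA
"""


def get_config_row(spark, config_table, config_id):
    """Return the single configuration entry for `config_id` as a dict."""
    # int() guards against anything other than a number reaching the filter.
    rows = spark.table(config_table).where(f"id = {int(config_id)}").collect()
    if len(rows) != 1:
        raise ValueError(
            f"Expected exactly one configuration for id {config_id}, found {len(rows)}"
        )
    return rows[0].asDict()
```

The table could be created once with `spark.sql(CONFIG_TABLE_DDL)`. Failing loudly when zero or several entries match keeps a misconfigured job from silently loading the wrong source.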

Because the process should run each time a file arrives at the source location, we can use the file arrival trigger in the workflow. The trigger settings can be adjusted to suit the situation, and can be configured as shown below.
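For readers who prefer to define the job in code rather than in the UI, the sketch below builds a job settings payload with a file arrival trigger and the notebook parameter. The function name, job name pattern, paths, and the `config_id` parameter are assumptions; the field names follow the Databricks Jobs API as far as I know and should be checked against the current API reference. Compute settings are left out for brevity.

```python
def build_job_settings(config_id, source_url, notebook_path,
                       min_seconds_between_triggers=60):
    """Build job settings for one configuration entry (illustrative only)."""
    return {
        "name": f"raw_data_load_{config_id}",
        "tasks": [
            {
                "task_key": "raw_data_load",
                "notebook_task": {
                    "notebook_path": notebook_path,
                    # Passed to the notebook; parameter values are strings.
                    "base_parameters": {"config_id": str(config_id)},
                },
            }
        ],
        "trigger": {
            "pause_status": "UNPAUSED",
            "file_arrival": {
                "url": source_url,  # storage location to watch for new files
                "min_time_between_triggers_seconds": min_seconds_between_triggers,
            },
        },
    }


# One job per configuration entry (assumed example values).
sources = {1: "s3://landing/sales/", 2: "s3://landing/customers/"}
jobs = [
    build_job_settings(cid, url, "/Shared/raw_data_load")
    for cid, url in sources.items()
]
```

Each payload in `jobs` could then be submitted through the Jobs API or the Databricks SDK. Generating the payloads from the configuration entries keeps the one-job-per-entry setup manageable as sources are added.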

By using the methods described above, we can build an automated approach for rapidly loading raw data from files into the raw (Bronze layer) area.

If you are reading this, you have read the whole blog entry. I greatly appreciate it. I hope this has given you some insight. If you have any questions, leave a comment below.
