How to Improve Performance When Running Notebooks

Shamen Paris

Updated: Feb 18, 2023

Every developer wants the Spark functions and notebooks they write to execute as quickly as possible. So, in this post, I want to showcase a few ways we can improve the performance of our code when we run the developed solution.


Performance 1 - Partitioning the delta table

To begin, we must consider two aspects: write performance and read performance. If we insert data into delta tables quickly, without defining any partitioning in the scripts, there will be latency when reading the delta table. If you partition the data when you write, Spark can skip irrelevant partitions when you read the delta table with the relevant filters. In that case, writing with partitioning enabled will take a little more time. However, if you do not enable partitioning, the write will run quickly, but reading will take longer than reading the partitioned delta table.


You can use the repartition command to shuffle the data into partitions when you write into the delta table.


from pyspark.sql.functions import col

summer_df.repartition(col("Country"))\
        .write\
        .mode("overwrite")\
        .parquet("/FileStore/datasets/olympics_data_repartition")

There is another command, called partitionBy, to create the partitions when we write into the delta table.

summer_df.repartition(col("Country"))\
        .write\
        .partitionBy("Country")\
        .mode("overwrite")\
        .parquet("/FileStore/datasets/olympics_data_partition")

When we compare these two commands, partitionBy gives much better read performance than repartition alone. Below, I have tested the same query against tables written with the two different partition methods. The query against the table written with partitionBy executed faster than the one against the table written with repartition.


repartition execution time - 7.13 seconds

partitionBy execution time - 0.65 seconds

So, if you want your read queries to run much faster, use the partitionBy command when you write to the delta tables.


Performance 2 - OPTIMIZE and ZORDER

There is another way that you can improve the data read performance. If you have millions of records in the delta table, you can use the OPTIMIZE and ZORDER commands to read the data fast. The OPTIMIZE command compacts the delta table's many small files into larger ones, which speeds up delta lake operations. The ZORDER command, on the other hand, co-locates related data in the same files according to the table fields you specify. When you write a query that filters on those fields, it will skip the unwanted files because of the ZORDER command.


You can run the OPTIMIZE and ZORDER commands like this:

%sql
OPTIMIZE olympics ZORDER BY (Year)

You can compare the performance by querying the delta table with and without using OPTIMIZE and ZORDER.
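One simple way to make that comparison is to time the same query before and after running OPTIMIZE. A rough sketch of such a timing helper follows; the table name and query in the usage lines are my own assumptions, and count() is used only to force full evaluation of the lazy query:

```python
import time

def time_query(spark, sql_text):
    """Run a query, force full evaluation with count(), and return elapsed seconds."""
    start = time.perf_counter()
    spark.sql(sql_text).count()
    return time.perf_counter() - start

# Hypothetical usage against the olympics table:
# before = time_query(spark, "SELECT * FROM olympics WHERE Year = 2012")
# spark.sql("OPTIMIZE olympics ZORDER BY (Year)")
# after = time_query(spark, "SELECT * FROM olympics WHERE Year = 2012")
```

Run each query a couple of times before trusting the numbers, since caching can make the second run of any query look faster than it really is.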



Note that this method is mostly helpful when there are millions of records in your delta table. Otherwise, if you have a small amount of data, you can use the repartition command to reduce the number of output files and avoid the small file problem.
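To check whether you actually have a small file problem, you can count the part files a write produced. Here is a sketch of such a check; the coalesce(4) call, the dataset path, and the Databricks /dbfs local-path prefix in the usage lines are my own examples, not from the original notebook:

```python
import os

def count_parquet_files(path):
    """Count the part files Spark produced under a directory tree."""
    return sum(
        1
        for _root, _dirs, files in os.walk(path)
        for name in files
        if name.endswith(".parquet")
    )

# Coalescing down to a few partitions before writing keeps the file count small:
# summer_df.coalesce(4).write.mode("overwrite").parquet("/FileStore/datasets/olympics_small")
# print(count_parquet_files("/dbfs/FileStore/datasets/olympics_small"))
```

If the count is far larger than the number of executors you read with, each query pays an open-file overhead per part file, which is exactly what OPTIMIZE's compaction (or coalescing before the write) removes.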


Conclusion

If you are looking to improve read performance, the best first step is to partition the delta table when you write it. Then, if the delta table holds a large amount of data, it is better to also use the OPTIMIZE and ZORDER commands.
