
How to deal with missing data in Databricks

Databricks Introduction

Databricks is an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models. Available to all organizations, it allows them to easily achieve the full potential of combining their data, ELT processes, and machine learning.


This Apache-Spark based platform runs a distributed system behind the scenes, meaning the workload is automatically split across various processors and scales up and down on demand. Increased efficiency results in direct time and cost savings for massive tasks.


Let's get started


First, go to Databricks Community Edition and sign up for a free account.


Step 1

  • Go to the Data tab in Databricks and click Create Table to upload a dataset.

Create a dataset in Databricks:
Data tab ---> Create Table
  • Go to Upload Files and click browse to upload the dataset to the Databricks cluster.

Use this file:

Upload the MissingData.csv file provided above.


Step 2

  • Go to the Workspace and create a notebook to start coding.

  • Give the notebook a name and choose a language. In this case, I chose Python.

Step 3

Let's start coding


Missing Data

Data sources are often incomplete, which means you will have missing data. You have three basic options for dealing with it (you will have to decide which approach is right for your use case):

  • Just keep the missing data points.

  • Drop the missing data points (including the entire row).

  • Fill them in with some other value.

Let's cover examples of each of these methods!


Keeping the missing data

A few machine learning algorithms can handle missing data directly. Let's see what it looks like:

# Import SparkSession from PySpark
from pyspark.sql import SparkSession
# Start a Spark session and assign it to a variable
spark = SparkSession.builder.appName("MissingData").getOrCreate()

The dataset is in CSV format, so we use "spark.read.csv". The file contains a header row, so the header option needs to be set to True. Setting inferSchema to True makes Spark infer the correct schema for the dataset automatically. (Note: inferSchema requires an extra pass over the data, which can slow down loading.)

# Load the dataset into a DataFrame
df = spark.read.csv('/FileStore/tables/MissingData.csv',header=True,inferSchema=True)
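To see what inferSchema is doing (and why it costs an extra pass over the data), here is a minimal plain-Python sketch of the idea: scan a column's raw string cells and pick the narrowest type that every non-empty value fits. The helper infer_type is hypothetical, not part of the Spark API:

```python
# Sketch of inferSchema-style type inference: scan a column's raw
# CSV cells and choose the narrowest type that fits all of them.
# infer_type is a hypothetical helper, not a Spark API.

def infer_type(values):
    """Return 'int', 'double', or 'string' for a column of CSV cells."""
    inferred = "int"
    for v in values:
        if v == "":          # empty cell -> treated as null, ignored
            continue
        try:
            int(v)           # stays 'int' if every value parses as int
        except ValueError:
            try:
                float(v)
                inferred = "double"   # widen int -> double
            except ValueError:
                return "string"       # any non-numeric value forces string
    return inferred

print(infer_type(["1", "2", ""]))    # int
print(infer_type(["1.5", "2"]))      # double
print(infer_type(["abc", "2"]))      # string
```

Because this needs to read every value before deciding, Spark has to scan the file once just to build the schema, which is the slowdown the note above refers to.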

Display the DataFrame to verify the data:

df.show()







Drop the missing data

You can use the .na functions for missing data. The drop command has the following parameters:


df.na.drop(how='any', thresh=None, subset=None)

  • param how: 'any' or 'all'.

If 'any', drop a row if it contains any nulls.

If 'all', drop a row only if all its values are null.

  • param thresh: int, default None

If specified, drop rows that have less than `thresh` non-null values.

This overrides the `how` parameter.

  • param subset:

optional list of column names to consider.

# Drop any row that contains missing data
df.na.drop().show()




# Keep only rows that contain at least 2 non-null values
df.na.drop(thresh=2).show()






To drop rows with null values in a specific column:

df.na.drop(subset=["Sales"]).show()






To drop rows that contain one or more null values, use the code below (this is also the default behavior):

df.na.drop(how='any').show()

To drop rows in which every value is null, use the code below:

df.na.drop(how='all').show()
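The drop semantics above can be sketched in plain Python over a list of row dicts. This is only an illustration of what `how`, `thresh`, and `subset` mean, not how Spark implements it, and the sample rows are made up:

```python
# Plain-Python sketch of df.na.drop(how=..., thresh=..., subset=...)
# semantics. Sample rows are made up for illustration.

rows = [
    {"Id": "emp1", "Name": "John",  "Sales": None},
    {"Id": "emp2", "Name": None,    "Sales": None},
    {"Id": "emp3", "Name": None,    "Sales": 345.0},
    {"Id": "emp4", "Name": "Cindy", "Sales": 456.0},
]

def na_drop(rows, how="any", thresh=None, subset=None):
    cols = subset or list(rows[0].keys())
    kept = []
    for row in rows:
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:               # thresh overrides how
            keep = non_null >= thresh
        elif how == "any":
            keep = non_null == len(cols)     # drop if any null
        else:                                # how == "all"
            keep = non_null > 0              # drop only if all null
        if keep:
            kept.append(row)
    return kept

print(len(na_drop(rows)))                    # 1 (only emp4 is complete)
print(len(na_drop(rows, thresh=2)))          # 3 (emp2 has just 1 non-null)
print(len(na_drop(rows, subset=["Sales"])))  # 2 (emp3 and emp4)
```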

Fill the missing values

We can also fill the missing values with new values. If you have nulls across columns of different data types, Spark matches the fill value's type to compatible columns: a string fill value only replaces nulls in string columns. For example:

df.na.fill('NEW VALUE').show()







Likewise, a numeric fill value only applies to numeric columns. The code below replaces nulls in numeric columns with 0:

df.na.fill(0).show()







To fill values in a specific column, pass the column name via subset:

df.na.fill('No Name',subset=['Name']).show()
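The type matching behind the three fill examples can be sketched in plain Python: a string fill value touches only string columns, a numeric fill value only numeric columns, and subset restricts the columns further. The sample rows and the na_fill helper are hypothetical:

```python
# Plain-Python sketch of df.na.fill() type matching. Sample rows
# and the na_fill helper are hypothetical illustrations.

rows = [
    {"Name": None,   "Sales": None},
    {"Name": "John", "Sales": 345.0},
]
types = {"Name": str, "Sales": float}

def na_fill(rows, value, subset=None):
    filled = []
    for row in rows:
        new = dict(row)
        for col, t in types.items():
            if subset is not None and col not in subset:
                continue
            # fill only when the value's type matches the column's type
            matches = (
                (t is str and isinstance(value, str))
                or (t in (int, float)
                    and isinstance(value, (int, float))
                    and not isinstance(value, bool))
            )
            if new[col] is None and matches:
                new[col] = value
        filled.append(new)
    return filled

print(na_fill(rows, "No Name"))  # fills the Name column only
print(na_fill(rows, 0))          # fills the Sales column only
```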







Extra:

A very common practice is to fill values with the mean value for the column, for example:

from pyspark.sql.functions import mean
mean_val = df.select(mean(df['Sales'])).collect()

# Weird nested formatting of Row object!
mean_val[0][0]
#output--->344.3875
mean_sales = mean_val[0][0]
df.na.fill(mean_sales,["Sales"]).show()
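The same mean-imputation logic, sketched in plain Python (the sample values are hypothetical, not the contents of MissingData.csv):

```python
# Sketch of mean imputation: average the non-null values, then
# substitute the mean for each null. Sample values are hypothetical,
# not the contents of MissingData.csv.

sales = [200.0, None, 345.0, None, 490.0]

non_null = [v for v in sales if v is not None]
mean_sales = sum(non_null) / len(non_null)   # (200 + 345 + 490) / 3 = 345.0

filled = [mean_sales if v is None else v for v in sales]
print(filled)  # [200.0, 345.0, 345.0, 345.0, 490.0]
```

For many columns at once, Spark ML also ships an Imputer transformer (pyspark.ml.feature.Imputer) that fills nulls with the mean or median per column.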

I hope this post gives you some ideas for working with missing values.


Good Luck!
