Databricks Introduction
Databricks is an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring that data through machine learning models. Available to organizations of any size, it makes it easy to combine data, ELT processes, and machine learning.
This Apache Spark-based platform runs a distributed system behind the scenes, meaning the workload is automatically split across multiple processors and scales up and down on demand. The increased efficiency translates directly into time and cost savings for large tasks.
Let's get started.
First, go to Databricks Community Edition and sign up for free.
Step 1
Go to the Data tab in Databricks and click Create Table to upload a dataset.
Go to Upload Files and click Browse to upload the dataset to the Databricks cluster.
Use this File:
Upload the MissingData.csv file provided above.
Step 2
Go to Workspace and create a Notebook to start coding.
Give the Notebook a name and choose a language. In this case I chose Python.
Step 3
Let's start coding.
Missing Data
Often data sources are incomplete, which means you will have missing data. You have three basic options for dealing with it (you will have to decide which approach is right for your situation):
Just keep the missing data points.
Drop the missing data points (including the entire row).
Fill them in with some other value.
Let's cover examples of each of these methods!
Keeping the missing data
A few machine learning algorithms can easily deal with missing data. Let's see what that looks like:
#Import Spark Session to cluster
from pyspark.sql import SparkSession
#Start Spark Session and assign to a variable
spark = SparkSession.builder.appName("MissingData").getOrCreate()
We are using a CSV dataset, so we need spark.read.csv. Because this CSV file contains a header row, header must be set to True. Setting inferSchema to True lets Spark infer the correct schema for the dataset automatically. (Note: inferSchema=True can slow down the read, since Spark has to scan the data first.)
# Read the dataset into a DataFrame
df = spark.read.csv('/FileStore/tables/MissingData.csv',header=True,inferSchema=True)
Display the DataFrame to verify the data:
df.show()
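As a quick extra check (not one of the original steps), you can also print the schema that inferSchema detected and count how many nulls each column contains before deciding how to handle them. This is a minimal sketch that assumes the df loaded above:
# Verify the schema that inferSchema detected
df.printSchema()

# Count the null values in each column
from pyspark.sql.functions import col, count, when
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()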
Drop the missing data
You can use the .na functions for missing data. The drop command has the following parameters:
df.na.drop(how='any', thresh=None, subset=None)
param how: 'any' or 'all'.
If 'any', drop a row if it contains any nulls.
If 'all', drop a row only if all its values are null.
param thresh: int, default None
If specified, drop rows that have less than `thresh` non-null values.
This overwrites the `how` parameter.
param subset:
optional list of column names to consider.
# Drop any row that contains missing data
df.na.drop().show()
# Keep only rows that contain at least 2 non-null values
df.na.drop(thresh=2).show()
To drop rows with null values in a specific column:
df.na.drop(subset=["Sales"]).show()
To drop rows that contain one or more null values, use the code below:
df.na.drop(how='any').show()
To drop rows in which every value is null, use the code below:
df.na.drop(how='all').show()
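These parameters can also be combined. As a small sketch (assuming the Name and Sales columns used elsewhere in this post), the line below drops any row that has a null in either of those two columns:
# Drop rows with a null in either the Name or Sales column
df.na.drop(how='any', subset=['Name', 'Sales']).show()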
Fill the missing values
We can also fill the missing values with new values. If you have nulls across columns of different data types, Spark matches the fill value to columns of the same type, so a string value only fills string columns. For example:
df.na.fill('NEW VALUE').show()
Spark handles numeric values the same way: the code below replaces nulls in numeric columns with 0.
df.na.fill(0).show()
To fill values in a specific column, refer to the code below:
df.na.fill('No Name',subset=['Name']).show()
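If you need different fill values for different columns, na.fill also accepts a dictionary that maps column names to replacement values. A minimal sketch using the Name and Sales columns from above:
# Fill each column with its own replacement value in one call
df.na.fill({'Name': 'No Name', 'Sales': 0.0}).show()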
Extra:
A very common practice is to fill values with the mean value for the column, for example:
from pyspark.sql.functions import mean
mean_val = df.select(mean(df['Sales'])).collect()
# Weird nested formatting of Row object!
mean_val[0][0]
#output--->344.3875
mean_sales = mean_val[0][0]
df.na.fill(mean_sales,["Sales"]).show()
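The same idea can be written as a single expression once you are comfortable with the nested Row access; this is just a compact version of the steps above:
# Compute the mean of Sales and fill its nulls in one line (reuses the mean import above)
df.na.fill(df.select(mean(df['Sales'])).collect()[0][0], ['Sales']).show()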
I hope this post gives you an idea of how to handle missing values.
Good Luck!