Preparing the data is essential part of the Machine Learning or dump data into Data warehouse or Data Lake. When we received the data or records there are some missing data and unfilled values. When we face this kind of scenario we need to first analyze the dataset and then need to necessary steps.
In this blog post I will guide you some common scenarios that anyone will find out in the datasets.
Scenario
Table structure
Order ID
Item Code
Order Date
Customer ID
Customer Flag
Quantity
Unit Price
Amount
In this table every column has one or more null values. We need to clean this one by one.
First we need to take numeric columns.
Column : Amount
Amount is calculate with Quantity times Unit Price. If Amount has null values we need to check whether Quantity and Unit Price are null. If one of those columns are null we need to remove those columns.
We need to import two packages in python before clean the data
import numpy as np
import pandas as pd
Then we need to import our data set.
df = pd.read_csv('InputMissingData.csv')
Now we need to split data into two Dataframe. One is for null values and other one for not null values. Reason for that we will clean null values and combine with not null values.
df_remove_Notnull = df[df['Amount'].notnull()].reset_index()
df_null = df[df['Amount'].isnull()].reset_index()
Now we need to loop through the null values and check whether Quantity and Unit Price are null or not. If both of the column row values are null we will eliminate the row other wise we will do the calculation.
index = np.array(df_null.index)
for i in index:
if df_null['Quantity'][i] != 'nan' and df_null['UnitPrice'][i] != 'nan':
df_null['Amount'][i] = df_null['Quantity'][i] * df_null['UnitPrice'][i]
No we need to drop nulls in Amount field because after we done the process and still null values are occur in Amount columns means that record is a null row. So we need to remove those rows.
df_null_new = df_null.dropna(subset=['Amount'])
Now we combine two Dataframes.
df_amount = pd.concat([df_null_new, df_remove_Notnull], ignore_index=True)
Column : Unit Price
As previous we need to split dataset into two Dataframes.
df_unit_price_null = df_amount[df_amount['UnitPrice'].isnull()].reset_index()
df_unit_price_Notnull = df_amount[df_amount['UnitPrice'].notnull()].reset_index()
Now we need to loop through df_unit_price_null DataFrame UnitPrice column to replace values.
index = np.array(df_unit_price_null.index)
for i in index:
if df_unit_price_null['UnitPrice'][i] != 'nan':
df_unit_price_null['UnitPrice'][i] = df_null['Amount'][i] / df_null['Quantity'][i]
After that combine the values
df_unitPrice = pd.concat([df_unit_price_Notnull, df_unit_price_null], ignore_index=True)
Column : Quantity
Do the same process in Quantity Column
df_Quantity_null = df_unitPrice[df_unitPrice['Quantity'].isnull()]
df_Quantity_Notnull = df_unitPrice[df_unitPrice['Quantity'].notnull()]
index = np.array(df_Quantity_null.index)
for i in index:
if df_Quantity_null['Quantity'][i] != 'nan':
df_Quantity_null['Quantity'][i] = df_Quantity_null['Amount'][i] / df_Quantity_null['UnitPrice'][i]
Combine
df_Quantity = pd.concat([df_Quantity_null, df_Quantity_Notnull], ignore_index=True)
Columns : Order ID, Item Code, Order Date, Customer ID, Customer Flag
For other columns process is the same. But Order Date is a date column we need to replace date in Order Date column.
df_Quantity = df_Quantity.fillna(value = 'Unknown')
i = 0
while i < df_Quantity.OrderID.count():
if df_Quantity['OrderData'][i] == "Unknown":
df_Quantity['OrderData'][i] = "1/1/9999"
i = i + 1
Now check the dataset and find out what happened.
I hope you get some idea about how to clean data in Python
Good Luck!!!
Commentaires