Clean the data with Python

Shamen Paris
Jan 24, 2021
2 min read

Preparing the data is essential part of the Machine Learning or dump data into Data warehouse or Data Lake. When we received the data or records there are some missing data and unfilled values. When we face this kind of scenario we need to first analyze the dataset and then need to necessary steps.

In this blog post I will guide you some common scenarios that anyone will find out in the datasets.

Scenario

Table structure

Order ID
Item Code
Order Date
Customer ID
Customer Flag
Quantity
Unit Price
Amount

In this table every column has one or more null values. We need to clean this one by one.

First we need to take numeric columns.

Column : Amount

Amount is calculate with Quantity times Unit Price. If Amount has null values we need to check whether Quantity and Unit Price are null. If one of those columns are null we need to remove those columns.

We need to import two packages in python before clean the data

import numpy as np
import pandas as pd

Then we need to import our data set.

df = pd.read_csv('InputMissingData.csv')

Now we need to split data into two Dataframe. One is for null values and other one for not null values. Reason for that we will clean null values and combine with not null values.

df_remove_Notnull = df[df['Amount'].notnull()].reset_index()
df_null = df[df['Amount'].isnull()].reset_index()

Now we need to loop through the null values and check whether Quantity and Unit Price are null or not. If both of the column row values are null we will eliminate the row other wise we will do the calculation.

index = np.array(df_null.index)
for i in index:
    if df_null['Quantity'][i] != 'nan' and df_null['UnitPrice'][i] != 'nan':
        df_null['Amount'][i] = df_null['Quantity'][i] * df_null['UnitPrice'][i]

No we need to drop nulls in Amount field because after we done the process and still null values are occur in Amount columns means that record is a null row. So we need to remove those rows.

df_null_new = df_null.dropna(subset=['Amount'])

Now we combine two Dataframes.

df_amount = pd.concat([df_null_new, df_remove_Notnull], ignore_index=True)

Column : Unit Price

As previous we need to split dataset into two Dataframes.

df_unit_price_null = df_amount[df_amount['UnitPrice'].isnull()].reset_index()
df_unit_price_Notnull = df_amount[df_amount['UnitPrice'].notnull()].reset_index()

Now we need to loop through df_unit_price_null DataFrame UnitPrice column to replace values.

index = np.array(df_unit_price_null.index)
for i in index:
    if df_unit_price_null['UnitPrice'][i] != 'nan':
        df_unit_price_null['UnitPrice'][i] = df_null['Amount'][i] / df_null['Quantity'][i]

After that combine the values

df_unitPrice = pd.concat([df_unit_price_Notnull, df_unit_price_null], ignore_index=True)

Column : Quantity

Do the same process in Quantity Column

df_Quantity_null = df_unitPrice[df_unitPrice['Quantity'].isnull()]
df_Quantity_Notnull = df_unitPrice[df_unitPrice['Quantity'].notnull()]
index = np.array(df_Quantity_null.index)
for i in index:
    if df_Quantity_null['Quantity'][i] != 'nan':
        df_Quantity_null['Quantity'][i] = df_Quantity_null['Amount'][i] / df_Quantity_null['UnitPrice'][i]

Combine

df_Quantity = pd.concat([df_Quantity_null, df_Quantity_Notnull], ignore_index=True)

Columns : Order ID, Item Code, Order Date, Customer ID, Customer Flag

For other columns process is the same. But Order Date is a date column we need to replace date in Order Date column.

df_Quantity = df_Quantity.fillna(value = 'Unknown')
i =  0
while i < df_Quantity.OrderID.count():
    if df_Quantity['OrderData'][i] == "Unknown":
        df_Quantity['OrderData'][i] = "1/1/9999"
    i = i + 1

Now check the dataset and find out what happened.

I hope you get some idea about how to clean data in Python

Good Luck!!!

Clean the data with Python

Scenario

Recent Posts

Comments

Subscribe Here.