top of page

Clean the data with Python

Preparing the data is essential part of the Machine Learning or dump data into Data warehouse or Data Lake. When we received the data or records there are some missing data and unfilled values. When we face this kind of scenario we need to first analyze the dataset and then need to necessary steps.


In this blog post I will guide you some common scenarios that anyone will find out in the datasets.


Scenario

Table structure

  • Order ID

  • Item Code

  • Order Date

  • Customer ID

  • Customer Flag

  • Quantity

  • Unit Price

  • Amount

In this table every column has one or more null values. We need to clean this one by one.

First we need to take numeric columns.


Column : Amount

Amount is calculate with Quantity times Unit Price. If Amount has null values we need to check whether Quantity and Unit Price are null. If one of those columns are null we need to remove those columns.

We need to import two packages in python before clean the data

import numpy as np
import pandas as pd

Then we need to import our data set.

df = pd.read_csv('InputMissingData.csv')

Now we need to split data into two Dataframe. One is for null values and other one for not null values. Reason for that we will clean null values and combine with not null values.

df_remove_Notnull = df[df['Amount'].notnull()].reset_index()
df_null = df[df['Amount'].isnull()].reset_index()

Now we need to loop through the null values and check whether Quantity and Unit Price are null or not. If both of the column row values are null we will eliminate the row other wise we will do the calculation.

index = np.array(df_null.index)
for i in index:
    if df_null['Quantity'][i] != 'nan' and df_null['UnitPrice'][i] != 'nan':
        df_null['Amount'][i] = df_null['Quantity'][i] * df_null['UnitPrice'][i]

No we need to drop nulls in Amount field because after we done the process and still null values are occur in Amount columns means that record is a null row. So we need to remove those rows.

df_null_new = df_null.dropna(subset=['Amount'])

Now we combine two Dataframes.

df_amount = pd.concat([df_null_new, df_remove_Notnull], ignore_index=True)

Column : Unit Price

As previous we need to split dataset into two Dataframes.

df_unit_price_null = df_amount[df_amount['UnitPrice'].isnull()].reset_index()
df_unit_price_Notnull = df_amount[df_amount['UnitPrice'].notnull()].reset_index()

Now we need to loop through df_unit_price_null DataFrame UnitPrice column to replace values.

index = np.array(df_unit_price_null.index)
for i in index:
    if df_unit_price_null['UnitPrice'][i] != 'nan':
        df_unit_price_null['UnitPrice'][i] = df_null['Amount'][i] / df_null['Quantity'][i]

After that combine the values

df_unitPrice = pd.concat([df_unit_price_Notnull, df_unit_price_null], ignore_index=True)

Column : Quantity

Do the same process in Quantity Column

df_Quantity_null = df_unitPrice[df_unitPrice['Quantity'].isnull()]
df_Quantity_Notnull = df_unitPrice[df_unitPrice['Quantity'].notnull()]
index = np.array(df_Quantity_null.index)
for i in index:
    if df_Quantity_null['Quantity'][i] != 'nan':
        df_Quantity_null['Quantity'][i] = df_Quantity_null['Amount'][i] / df_Quantity_null['UnitPrice'][i]

Combine

df_Quantity = pd.concat([df_Quantity_null, df_Quantity_Notnull], ignore_index=True)

Columns : Order ID, Item Code, Order Date, Customer ID, Customer Flag

For other columns process is the same. But Order Date is a date column we need to replace date in Order Date column.

df_Quantity = df_Quantity.fillna(value = 'Unknown')
i =  0
while i < df_Quantity.OrderID.count():
    if df_Quantity['OrderData'][i] == "Unknown":
        df_Quantity['OrderData'][i] = "1/1/9999"
    i = i + 1

Now check the dataset and find out what happened.


I hope you get some idea about how to clean data in Python


Good Luck!!!

Commentaires


Subscribe Here.

Thanks for subscribing!

+94766088392

Colombo, Sri Lanka.

  • LinkedIn
  • Facebook
  • Instagram
bottom of page