Three common dirty data problems and how to fix them with Pandas

Data Science is a process that involves defining your problem, collecting data, cleaning data, and training and deploying machine learning models. Training machine learning models is what attracts many to take the journey, and I guess it is the cool part. However, many Data Scientists in industry will argue that training a model is the least time-consuming task and that cleaning data for analysis consumes the majority of their time. It can be up to 80%! This is mainly because the processes that create data are imperfect.

Unfortunately, many courses focus on the model training aspect of Data Science and use “clean data”. There is a lack of focus on how to convert “messy data” into “clean data”. One is quick to blame bad model results on not having sufficient data. Whilst this might be true, it is not the only cause, as machine learning models require the right data in a clean format to produce good results. You know the saying: “garbage in, garbage out!”

This post is for those who are just stepping into the field and wondering why machine learning isn’t living up to its promise. Dirty data is one of the more common reasons.

What is “clean data”?

The famous R developer Hadley Wickham uses the term “tidy data” (which I think means pretty much the same thing) and defines it as data where:

• Column headers are variable names.
• A single variable is stored in one column.
• Variables are only stored in columns.
• Each row contains a unique observation.
• A single observational unit is stored in one table.
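
As a minimal, hypothetical sketch of these principles (the countries and population figures below are made up, not the data used later in this post), a “messy” table with values spread across its column headers can be reshaped into a tidy one:

import pandas as pd
# hypothetical "messy" table: the year values are used as column headers
messy = pd.DataFrame({
    "country": ["Kenya", "Brazil"],
    "2019": [52.6, 211.0],
    "2020": [53.8, 213.2],
})
# reshape so that year is its own variable and each row is a single observation
tidy = messy.melt(id_vars = "country", var_name = "year", value_name = "population_millions")
print(tidy)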

Most of these textbook definitions are still quite ambiguous, and therefore this blog post doesn’t focus on definitions but rather on how to diagnose and fix common “messy data” problems. These common problems are:

  1. Type constraints
  2. Range constraints
  3. Uniqueness constraints

Type Constraints

Working with data, you will encounter various data types (see the table below).

Data type | Example
Text      | Tweet, comment, username
Category  | Movie genres, marriage status
Integer   | Age, number of likes
Binary    | Yes/no
Dates     | Subscription date
Decimals  | Temperature
Data types
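
As a rough sketch of how Pandas represents these types (the column names and values below are made up purely for illustration):

import pandas as pd
examples = pd.DataFrame({
    "comment": pd.Series(["great video", "nice"], dtype = "object"),       # text
    "genre": pd.Series(["Comedy", "Music"], dtype = "category"),           # category
    "likes": pd.Series([10, 25], dtype = "int64"),                         # integer
    "subscribed": pd.Series([True, False], dtype = "bool"),                # binary
    "subscription_date": pd.to_datetime(["2018-05-01", "2018-06-01"]),     # dates
    "temperature": pd.Series([21.5, 19.8], dtype = "float64"),             # decimals
})
print(examples.dtypes)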

Data is not always loaded as the correct type. Let us look at an example using the “Trending YouTube Video Statistics” data available on Kaggle. This data set contains several months of data on daily trending YouTube videos. Data is included for various regions, but I focus only on the USA data set. Loading the data set and checking Pandas’ .dtypes attribute produces the output shown below.
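
A minimal loading sketch, assuming the USA file from the Kaggle download is named USvideos.csv:

import pandas as pd
# load the USA "Trending YouTube Video Statistics" data (file name assumed)
dat = pd.read_csv("USvideos.csv")
# check the data type assigned to each column
dat.dtypes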

video_id                  object
trending_date             object
title                     object
channel_title             object
category_id                int64
publish_time              object
tags                      object
views                      int64
likes                      int64
dislikes                   int64
comment_count              int64
thumbnail_link            object
comments_disabled           bool
ratings_disabled            bool
video_error_or_removed      bool
description               object
dtype: object

There are several things incorrect with the above output: category_id, trending_date, publish_time and description have been allocated incorrect types. The .dtypes attribute can be used to identify type constraints. To fix them, the .astype() method and the pd.to_datetime() function can be applied as follows:

# required libraries
import pandas as pd
# perform the type conversions
dat['category_id'] = dat['category_id'].astype('category')
# trending_date is stored as yy.dd.mm, so the original format must be specified
dat['trending_date'] = pd.to_datetime(dat['trending_date'], format="%y.%d.%m").dt.date
# publish_time is a standard timestamp, so the format can be inferred
dat['publish_time'] = pd.to_datetime(dat['publish_time']).dt.date
dat['description'] = dat['description'].astype('str')
# verify the conversions
dat.dtypes

Note that the type conversions to a date format are a bit trickier: in this example I had to specify the original date format for trending_date but did not have to for publish_time. These conversions are all necessary for machine learning models to treat the variables appropriately.
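
As a small illustration of why the format string matters (the date value below is made up, but follows the same yy.dd.mm layout as trending_date):

import pandas as pd
# "18.14.06" in yy.dd.mm order parses correctly with an explicit format
pd.to_datetime("18.14.06", format="%y.%d.%m")   # Timestamp('2018-06-14 00:00:00')
# without the format, Pandas cannot reliably tell which part is the day and which is the month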

Range Constraints

Another common data problem is dealing with values that fall out of range. For example, in the “Trending YouTube Video Statistics” data, if a trending_date observation records a value well into the future, that value is out of range. The best way to identify out-of-range values is through data visualizations such as histograms, box plots, count plots and scatter plots.

# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
# scatter plot of views against likes
dat.plot(x = "views", y = "likes", kind = "scatter")
plt.show()
# extract the year from trending_date
dat["Year_Trend"] = pd.DatetimeIndex(dat['trending_date']).year
# count plot of the trending year
sns.countplot(x = 'Year_Trend', data = dat, palette = 'RdBu')
plt.show()
# box plot of likes
dat['likes'].plot(kind = "box")
plt.show()
# histogram of comment counts
dat['comment_count'].plot(kind = "hist")
plt.show()

Below are the plot outputs produced by the above code; from these, no out-of-range values were identified.

So what if the data does have out-of-range values: how would one go about fixing them? Here are a few methods you can use:

  1. Drop the rows. This is not the ideal method as information can be lost. Only do this if there is no other remedy and the affected rows are few.
  2. Set hard limits: custom minimums and maximums for your columns.
  3. Apply a custom business rule. This is the most common in practice (well, from my experience).
  4. Treat the values as missing and then impute them.
  5. Normalize or scale the values. This works well for numerical outliers.

Let us assume that some values in the trending_date column have incorrectly been captured as a future date. The code below implements the option to drop the rows and the option to set a hard limit, followed by a rough sketch of options 4 and 5.

import datetime as dt
# get the present year
present_year = dt.date.today().year
# extract the year from trending_date
dat["Year_Trend"] = pd.DatetimeIndex(dat['trending_date']).year
# option 1: drop future dates by filtering (keep only rows up to the present year)
dat = dat[dat["Year_Trend"] <= present_year]
# option 1 (alternative): drop future dates using the .drop() method
dat.drop(dat[dat["Year_Trend"] > present_year].index, inplace = True)
# option 2: set a hard limit by capping future years at the present year
dat.loc[dat["Year_Trend"] > present_year, "Year_Trend"] = present_year
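
For completeness, options 4 and 5 from the list above can be sketched roughly as follows. The choice of median imputation and min-max scaling here is an assumption for illustration, not something prescribed by the original analysis:

import numpy as np
# option 4: treat out-of-range years as missing, then impute (here with the median year)
dat["Year_Trend"] = dat["Year_Trend"].astype("float")            # allow NaN values
dat.loc[dat["Year_Trend"] > present_year, "Year_Trend"] = np.nan
dat["Year_Trend"] = dat["Year_Trend"].fillna(dat["Year_Trend"].median())
# option 5: scale a numerical column such as likes to [0, 1] (min-max scaling)
dat["likes_scaled"] = (dat["likes"] - dat["likes"].min()) / (dat["likes"].max() - dat["likes"].min())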

Uniqueness Constraints

This data problem appears when you have the exact same information repeated across multiple rows for some or all columns in your data. This can also be referred to as duplicate data.

To identify duplicate rows you can use the .duplicated() method.

# check for duplicates (keep = False flags every record that shares a video_id)
duplicates = dat.duplicated('video_id', keep = False)
duplicated_post = dat[duplicates].sort_values(by = 'video_id')

The above code returns all records with a duplicated “video_id”; the number of rows returned is 40242. Below is a sample output of the first six rows and four columns.

video_id    | trending_date | channel_title | publish_time
-0CMnp02rNY | 11/06/2018    | TheEllenShow  | 04/06/2018
-0CMnp02rNY | 10/06/2018    | TheEllenShow  | 04/06/2018
-0CMnp02rNY | 06/06/2018    | TheEllenShow  | 04/06/2018
-0CMnp02rNY | 08/06/2018    | TheEllenShow  | 04/06/2018
-0CMnp02rNY | 07/06/2018    | TheEllenShow  | 04/06/2018
-0CMnp02rNY | 09/06/2018    | TheEllenShow  | 04/06/2018
Example of incomplete duplicates

In the above, not all columns contain duplicates: the trending_date is different for each record. This scenario is known as incomplete duplicates. If I do not specify the subset column argument in the .duplicated() method, the result looks different. The resulting data frame returns 48 rows instead of 40242, which is a significant difference. See below a sample of the first three rows.

video_id    | trending_date | channel_title | publish_time
1RZYOeQeIXE | 15/05/2018    | TheEllenShow  | 14/05/2018
1U1u5aKU3AY | 15/05/2018    | CNN           | 18/05/2018
2PH7dK6SLC8 | 15/05/2018    | johnmayerVEVO | 10/05/2018
Odd results from using the .duplicated() method without specifying the subset column(s)

In the above, it looks like Python has arbitrarily filtered rows with a trending_date equal to 15/05/2018 (without a subset, .duplicated() only flags rows that are identical across every column). These results are rather odd and misleading. As a rule of thumb, never use the .duplicated() method without specifying the subset column(s). If we drop these rows, we will still have the incomplete duplicate scenario from the first table. Which records do we use? This is where we need to apply some common-sense logic to the data.

The goal is to have a unique combination of “video_id” and “trending_date” for each row. Notice that the data records the trending statistics of each video for multiple trending_dates, as a video can trend for multiple days. So which trending date is correct? This will depend on the goal of my analysis.

In this analysis I am only interested in the most recent statistics of a post. Therefore, I can specify a method that returns only the most recent trending_date for each post. In Python this is achieved by using the .groupby() and .transform() methods. In the .transform() method I specify the statistic, which in this case is max.

# keep only the rows with the most recent trending_date for each video
idx = dat.groupby('video_id')['trending_date'].transform('max') == dat['trending_date']
mydf = dat[idx]
# check for remaining duplicates on video_id and trending_date
duplicates = mydf.duplicated(['video_id', 'trending_date'], keep = False)
duplicated_post = mydf[duplicates].sort_values(by = 'video_id')
duplicated_post.to_csv("duplicates.csv")
# drop the remaining duplicates
mydat = mydf.drop_duplicates(['video_id', 'trending_date'])

After applying the above method, only six duplicate rows remain.

For illustrative purposes, suppose the first two records of the six duplicated rows looked like the table below. The video_id and trending_date are the same, as expected, however the number of likes for the post is different. Here one can use the .groupby() and .agg() methods to specify the appropriate statistical measures for combining sets of values (e.g. mean, max, sum).

video_id    | trending_date | title                                               | channel_title | publishing_time | likes
Dwc27Lsr1EY | 2018-05-15    | "MANIFEST | Official Trailer | NBC Fall Shows 2018" | Manifest      | 2018-05-10      | 1861
Dwc27Lsr1EY | 2018-05-15    | "MANIFEST | Official Trailer | NBC Fall Shows 2018" | Manifest      | 2018-05-10      | 1760
Duplicate values example

# create a dictionary of aggregations: keep the latest trending_date and average the likes
summaries = {'trending_date': 'max', 'likes': 'mean'}
# apply the appropriate statistic to each column, per video
agg_dat = dat.groupby(by = 'video_id').agg(summaries).reset_index()

The above case study is a great example of how identifying duplicates in your data is not a straightforward process and a common-sense approach is sometimes needed.

For illustrative purposes, let us visualize the data again for range constraints after removing the duplicates.

We now see some evidence of outliers/extreme values in the data. Without cleaning your data, you risk drawing the wrong conclusions and building inaccurate models that will ultimately lead to poor decision making. Never assume that your data is clean.

Note that there are many other problems you may encounter when cleaning your data. These include renaming columns, removing blank spaces, using regular expressions for fuzzy matching, and dropping zero-variance columns.
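
As a rough sketch of what some of these tasks look like in Pandas (channel_title and title are columns from the data set used above; the renamed 'channel' column and the is_trailer flag are hypothetical):

# rename a column
dat = dat.rename(columns = {'channel_title': 'channel'})
# remove leading/trailing blank spaces from a text column
dat['title'] = dat['title'].str.strip()
# use a regular expression to flag titles that mention "trailer" in any casing
dat['is_trailer'] = dat['title'].str.contains(r'trailer', case = False, regex = True)
# drop zero-variance columns (columns with only one unique value)
dat = dat.loc[:, dat.nunique() > 1]

For further reading, you can have a look at the resources below: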

Better data beats fancier algorithms
