
Data Science is a process that involves defining your problem, collecting data, cleaning data, and training and deploying machine learning models. The training of machine learning models is what attracts many to take the journey, and I guess it is the cool part. However, many Data Scientists in industry will argue that training a model is the least time-consuming task and that cleaning data for analysis consumes the majority of their time. It can be up to 80%! This is mainly because the processes that create data are imperfect.
Unfortunately, many courses focus on the model training aspect of Data Science and use “clean data”. There is a lack of focus on how to convert “messy data” into “clean data”. One is quick to blame bad model results on not having sufficient data. Whilst this might be true, it is not the only cause, as machine learning models require the right data in a clean format to produce good results. You know the saying: “garbage in, garbage out!”
This post is for those who are just stepping into the field and wondering why machine learning isn’t living up to its promise; it covers one of the more common reasons.
What is “clean data”?
The famous R developer Hadley Wickham uses the term “tidy data” (which I think means pretty much the same thing) and defines it as data where (a concrete sketch follows the list):
• Column headers are variable names.
• A single variable is stored in one column.
• Variables are only stored in columns.
• Each row contains a unique observation.
• A single observational unit is stored in one table
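To make this concrete, here is a minimal sketch (using a small made-up table, not the YouTube data discussed below) of reshaping a “messy” wide table, where values such as years appear as column headers, into a tidy long format with pandas:

import pandas as pd

# a "messy" table: the years 2019 and 2020 are values, yet they appear as column headers
messy = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [100, 150],
    "2020": [120, 160]
})

# melt the year columns into a single "year" variable with one observation per row
tidy = pd.melt(messy, id_vars="country", var_name="year", value_name="population")
print(tidy)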
Most of these textbook definitions are still quite ambiguous, and therefore this blog post doesn’t focus on definitions but rather on how to diagnose and fix common “messy data” problems. These common problems are:
- Type constraints
- Range constraints
- Uniqueness constraints
Type Constraints
Working with data, you will encounter various data types (see the table below).
Data type | Example |
Text | Tweet, comment, username |
Category | Movie genres, marriage status |
Integer | Age, number of likes |
Binary | Yes/no |
Dates | Subscription date |
Decimals | Temperature |
Data is not always loaded as the correct type. Let us look at an example using the “Trending YouTube Video Statistics” data available on Kaggle, a data set that contains several months of data on daily trending YouTube videos. The data is available for various regions, but I focus only on the USA data set. Loading the data set and inspecting pandas’ .dtypes attribute produces the following:
video_id                   object
trending_date              object
title                      object
channel_title              object
category_id                 int64
publish_time               object
tags                       object
views                       int64
likes                       int64
dislikes                    int64
comment_count               int64
thumbnail_link             object
comments_disabled            bool
ratings_disabled             bool
video_error_or_removed       bool
description                object
dtype: object
There are several things wrong with the above output: category_id, trending_date, publish_time and description have been allocated incorrect types. The .dtypes attribute can be used to identify type constraints. To fix them, the .astype() method and the to_datetime() function can be applied as follows:
# required libraries
import pandas as pd
import datetime as dt

# load the USA data set (file name assumed from the Kaggle download)
dat = pd.read_csv("USvideos.csv")

# perform type conversions
dat['category_id'] = dat['category_id'].astype('category')
dat['trending_date'] = pd.to_datetime(dat['trending_date'], format="%y.%d.%m").dt.date
dat['publish_time'] = pd.to_datetime(dat['publish_time']).dt.date
dat['description'] = dat['description'].astype('str')

# verify
dat.dtypes
Note that type conversions to date formats are a bit more tricky: in this example I had to specify the original date format for trending_date but did not have to for publish_time. These conversions are all necessary in order for machine learning models to treat the variables appropriately.
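If you are unsure whether a format string matches the raw values, one quick check (run before the conversion above, while trending_date is still a string column) is to parse with errors="coerce" and inspect whatever fails to convert; this is a general pandas pattern rather than anything specific to this data set:

# values that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(dat['trending_date'], format="%y.%d.%m", errors="coerce")

# show the raw values that failed to parse (empty if the format is correct)
print(dat.loc[parsed.isna(), 'trending_date'].unique())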
Range Constraints
Another common data problem is dealing with values that fall out of range. For example, in the “Trending YouTube Video Statistics” data, if a trending_date observation records a value well into the future, that value is out of range. The best way to identify out-of-range values is through data visualizations such as histograms, box plots, count plots and scatter plots.
# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# scatter plot
dat.plot(x="views", y="likes", kind="scatter")
plt.show()

# extract the year the video trended
dat["Year_Trend"] = pd.DatetimeIndex(dat['trending_date']).year

# count plot
sns.countplot(x='Year_Trend', data=dat, palette='RdBu')
plt.show()

# box plot
dat['likes'].plot(kind="box")
plt.show()

# histogram
dat['comment_count'].plot(kind="hist")
plt.show()
Below are the plots produced by the above code; no out-of-range values were identified from them.
[Plots: views vs likes scatter plot, trending year count plot, likes box plot, comment count histogram]
So what if the data does have out-of-range values? How would one go about fixing them? Here are a few methods you can use:
- Drop the rows. This is not the most desirable method, as information can be lost. Only do this if there is no other remedy and the affected rows are few.
- Set hard limits: custom minimums and maximums for your columns.
- Apply a custom business rule. This is most common in practice (in my experience, at least).
- Treat the values as missing and then impute them.
- Normalize or scale the data. This works well for numerical outliers.
Let us assume that some values in the trending_date column have incorrectly been captured as a future date. The code below implements the option to drop the rows and the option to set a hard limit; a sketch of the impute and scale options follows it.
import datetime as dt

# get the present year
present_year = dt.date.today().year

# extract the year the video trended
dat["Year_Trend"] = pd.DatetimeIndex(dat['trending_date']).year

# drop by filtering: keep only rows that are not in the future
dat = dat[dat["Year_Trend"] <= present_year]

# drop by using the .drop method
dat.drop(dat[dat["Year_Trend"] > present_year].index, inplace=True)

# convert to a hard limit: cap future years at the present year
dat.loc[dat["Year_Trend"] > present_year, "Year_Trend"] = present_year
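The remaining options from the list above, treating out-of-range values as missing and then imputing, and normalizing or scaling, could look roughly like this (the median imputation and min-max scaling below are just illustrative choices, not part of the original analysis):

import numpy as np

# treat out-of-range years as missing, then impute with the median year
dat.loc[dat["Year_Trend"] > present_year, "Year_Trend"] = np.nan
dat["Year_Trend"] = dat["Year_Trend"].fillna(dat["Year_Trend"].median())

# min-max scale a numeric column so that extreme values are squeezed into [0, 1]
dat["likes_scaled"] = (dat["likes"] - dat["likes"].min()) / (dat["likes"].max() - dat["likes"].min())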
Uniqueness Constraints
This data problem appears when you have the exact same information repeated across multiple rows for some or all columns in your data. This can also be referred to as duplicate data.
To identify duplicate rows you can use the .duplicated() method.
# check for duplicates
duplicates = dat.duplicated('video_id', keep=False)
duplicated_post = dat[duplicates].sort_values(by='video_id')
The above code returns all records with a duplicated “video_id”; the number of rows returned is 40242. Below is a sample output of the first six rows and four columns.
video_id | trending_date | channel_title | publish_time |
-0CMnp02rNY | 11/06/2018 | TheEllenShow | 04/06/2018 |
-0CMnp02rNY | 10/06/2018 | TheEllenShow | 04/06/2018 |
-0CMnp02rNY | 06/06/2018 | TheEllenShow | 04/06/2018 |
-0CMnp02rNY | 08/06/2018 | TheEllenShow | 04/06/2018 |
-0CMnp02rNY | 07/06/2018 | TheEllenShow | 04/06/2018 |
-0CMnp02rNY | 09/06/2018 | TheEllenShow | 04/06/2018 |
In the above, not all columns contain duplicated values: the trending_date is different for each record. This scenario is known as incomplete duplicates. If I do not specify the column subset argument in the .duplicated() method, the result looks very different: the resulting data frame contains 48 rows instead of 40242, a significant difference. Below is a sample of the first three rows.
video_id | trending_date | channel_title | publish_time |
1RZYOeQeIXE | 15/05/2018 | TheEllenShow | 14/05/2018 |
1U1u5aKU3AY | 15/05/2018 | CNN | 18/05/2018 |
2PH7dK6SLC8 | 15/05/2018 | johnmayerVEVO | 10/05/2018 |
At first glance it looks like Python has randomly filtered rows with a trending_date of 15/05/2018. What is actually happening is that, without a subset, .duplicated() only flags rows that are identical across every column, and in this data set those happen to be rows from that date. The results are easy to misread, so as a rule of thumb, never use the .duplicated() method without specifying the subset column(s). Even if we drop these exact duplicates, we will still have the incomplete duplicate scenario from the first table. Which records do we use? This is where we need to apply some common-sense logic to the data.
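To see the difference concretely, the two calls can be compared side by side; the exact counts will depend on the version of the data set you download:

# duplicates on video_id only: videos that appear on several trending days
print(dat.duplicated('video_id', keep=False).sum())

# duplicates across every column: rows that are complete, exact copies
print(dat.duplicated(keep=False).sum())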
The goal is to have a unique “video_id” and “trending_date” combination for each row. Notice that the data records the trending statistics of each video for multiple trending_dates, as a video can trend for multiple days. So which trending date is correct? That depends on the goal of my analysis.
In this analysis I am only interested in the most recent statistics of a post. Therefore, I can keep only the most recent trending_date for each post. In Python this is achieved using the .groupby() and .transform() methods. In .transform() I specify the statistic, which in this case is the max.
# keep only the most recent trending_date per video
idx = dat.groupby('video_id')['trending_date'].transform('max') == dat['trending_date']
mydf = dat[idx]

# check for remaining duplicates
duplicates = mydf.duplicated(['video_id', 'trending_date'], keep=False)
duplicated_post = mydf[duplicates].sort_values(by='video_id')
duplicated_post.to_csv("duplicates.csv")

# drop duplicates
mydat = mydf.drop_duplicates(['video_id', 'trending_date'])
After applying the above method, only six duplicated rows remain.
For illustrative purposes, suppose the first two records of the six duplicated rows looked like the table below. The video_id and trending_date are the same, as expected; however, the number of likes for the post is different. Here one can use the .groupby() and .agg() methods to specify an appropriate statistical measure (e.g. mean, max, sum) to combine the sets of values.
video_id | trending_date | title | channel_title | publish_time | likes |
Dwc27Lsr1EY | 2018-05-15 | MANIFEST \| Official Trailer \| NBC Fall Shows 2018 | Manifest | 2018-05-10 | 1861 |
Dwc27Lsr1EY | 2018-05-15 | MANIFEST \| Official Trailer \| NBC Fall Shows 2018 | Manifest | 2018-05-10 | 1760 |
# create a dictionary of the appropriate statistic for each column
summaries = {'trending_date': 'max', 'likes': 'mean'}

# get only the latest post and aggregate the likes column by its mean
dat_agg = dat.groupby(by='video_id').agg(summaries).reset_index()
The above case study illustrates that identifying duplicates in your data is not a straightforward process and that a common-sense approach is sometimes needed.
For illustrative purposes, let us visualize the data again for range constraints after removing the duplicates.
[Plots after removing duplicates: views vs likes scatter plot, trending year count plot, likes box plot, comment count histogram]
We now see some evidence of outliers/extreme values in the data. Without cleaning your data, you risk drawing the wrong conclusions and building inaccurate models that will ultimately lead to poor decision making. Never assume that your data is not faulty.
Note that there are many other problems you may encounter when cleaning your data. These include renaming columns, removing blank spaces, using regular expressions for fuzzy matching, and dropping zero-variance columns (a brief sketch of some of these follows the resource list below). For further reading, you can have a look at the resources below:
- Pythonic Data Cleaning With Pandas and NumPy
- Reshaping and pivot tables
- Fuzzy String Matching in Python
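As a quick taste of those extra clean-up steps, here is a rough sketch; the renamed column name "channel" and the clean-up rules are purely illustrative:

# rename a column
dat = dat.rename(columns={"channel_title": "channel"})

# remove leading and trailing blank spaces from a text column
dat["channel"] = dat["channel"].str.strip()

# a simple regular expression: keep only letters, digits and spaces
dat["channel"] = dat["channel"].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)

# drop numeric columns with zero variance (only one distinct value)
numeric = dat.select_dtypes("number")
dat = dat.drop(columns=numeric.columns[numeric.nunique() == 1])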
Better data beats fancier algorithms