Marie Kondo can’t help with this one.
Data cleansing (or cleaning), is used to refer to the process of detecting and correcting inaccurate, corrupt or unusable data. It is an essential step before any data analysis project, since every step after it assumes the data is “clean” or, in other words, trustworthy and accurate.
In this 2-part series, we'll be going over the steps required for understanding what makes data "dirty", and how to clean it. Get your rubber gloves ready!
When do I need to clean my data?
There’s a high chance you’ll always have to do some cleaning. When working with datasets, especially large ones, you will almost inevitably find errors among the records- either because of human error or other issues during data collection and storage. It is thus good practice to always take the time to familiarize yourself with the dataset you’re working with, examine it for potential errors and corrupt records, and deal with them before actually trying to make sense of the data.
What tools do we need to clean data?
Most small to medium sized datasets can be properly analyzed and cleansed using Google Sheets or Excel. In this article, we will work with a small a dataset in Google Sheets- think a few hundreds or thousands of records at most.
The dataset we will be using as an example (as well as the final, "cleansed", version) can be found here.
Get to know your Dataset
Before we do any cleaning, let's take the time to familiarize ourselves with the dataset. In this case, we will use a sample of records from the OpenFoodFacts database, containing information about a variety of food products and their nutritional values, brand, etc.
The first step to cleaning your data is being able to verbally describe your dataset, or in other words, speak to its characteristics.
To describe your data, you will need to know:
The size of the dataset
Knowing how many records you’re working with will not only give you an idea of how much data you have available, but it will also let you know how much data was expunged after you are done cleaning.
Specifically, we need to know how many columns (features, or properties) and rows (records) are in a dataset.
You can easily get the # of columns and rows by using the COUNTA formula, which returns the number of cells with values in a given range.
For rows, it should look something like:
You should see a value of 2000. What's happening here is we're asking COUNTA to give us the amount of populated cells on column A.
Now, let's try the same for getting the number of columns:
Protip: You can avoid having to scroll to the right to get the last column by just clicking on row 1. That will select the entire first row for you, with all the columns it spans across.
You should get 155. That means the sample we’re using contains 155 columns and 2000 rows.
Great! Now we know how big our dataset is.
The Features’ Data Types
Now that we've taken a general glance at the dataset, we need to start looking at the types of data contained in each column. Most data can be categorized into 4 basic types:
- Numerical Data: Numbers! It can also be called quantitative data. Measurements like height, weight, distances, or, in our dataset, sugar content, all fall within this category. Numerical data can further be split into continuous or discrete data. Continuous data assumes any value within a range (sugar for 100g can be any value between 0 and 100). Discrete data only takes distinct values (i.e a number of people will always be an integer).
- Categorical Data represents a property of an object that allows grouping it in a category. Gender, nationality, hometown are all categories one could group people into. It is also possible to represent categorical data with numbers, such as 1 for male and 2 for female, which hold no mathematical significance, and thus still represent categorical, not numerical, data.
There is also what we call ordinal data, which is a mix of numerical and categorical data of sorts. In ordinal data, the data still falls into categories, but those categories are ordered or ranked in some particular way. An example would be age groups, such as baby boomers, Gen X, Millenials and Gen Z. These would represent groups of people that can be ranked chronologically.
In our dataset, the “countries” and “brands” columns are both examples of categorical data.
- Time Series: Sequences of numbers representing time-based values that allow the data to be situated chronologically. These can dates or timestamps that allow you to visualize trends and evolutions in time for your data. It is important to know what type of representation is being used for time series data in your dataset. In the example we’re using, we have both UNIX epoch timestamps and ISO format timestamps for the record creation time. In cases where the format isn’t supported by Google Sheets like this one, it will be necessary to convert them to another, more manageable format.
- Text Data: Words or sentences that don’t represent categories, such as item descriptions, names, addresses etc…
Now that you're ready
Once you understand the contents of your data, the data types you will be dealing with, and the general shape of your dataset, you are ready to tackle the real cleaning. It's now time to head on over to Part 2 of this article and find out how you can polish raw data into a jewel of insights.