

Image by Editor
# Introduction
According to CrowdFlower’s survey, data scientists spend 60% of their time organizing and cleaning data.
In this article, we’ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. It contains nearly 200,000 food delivery records, each of which includes dozens of features such as delivery time, total items, and store category (e.g., Mexican, Thai, or American cuisine).
# Predicting Food Delivery Times with DoorDash Data
DoorDash aims to accurately estimate how long it takes to deliver food, from the moment a customer places an order to the time it arrives at their door. In this data project, we’re tasked with creating a model that predicts the total delivery duration based on historical delivery data.
However, we won’t do the entire project; that is, we won’t build a predictive model. Instead, we’ll use the dataset provided in the project and create a data cleaning pipeline.
Our workflow consists of two main steps: data exploration and building the data cleaning pipeline.
# Data Exploration
Let’s start by loading and viewing the first few rows of the dataset.
// Load and Preview the Dataset
import pandas as pd
df = pd.read_csv("historical_data.csv")
df.head()
Here is the output.
This dataset includes datetime columns that capture the order creation time and the actual delivery time, which can be used to calculate delivery duration. It also contains other features such as store category, total item count, subtotal, and minimum item price, making it suitable for various types of data analysis. We can already see that there are some NaN values, which we’ll explore more closely in the next step.
// Explore the Columns With info()
Let’s inspect all the column names with the info() method. We’ll use this method throughout the article to track changes in column value counts; it’s a good indicator of missing data and overall data health.
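Here is the code.
# Summarize column names, non-null counts, and data types
df.info()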
Here is the output.
As you can see, we have 15 columns, but the number of non-null values differs across them. This means some columns contain missing values, which can affect our analysis if not handled properly. One last thing: the created_at and actual_delivery_time data types are objects; these should be datetime.
# Building the Data Cleaning Pipeline
In this step, we build a structured data cleaning pipeline to prepare the dataset for modeling. Each stage addresses common issues such as timestamp formatting, missing values, and irrelevant features.
// Fixing the Date and Time Columns’ Data Types
Before doing any data analysis, we need to fix the columns that record time. Otherwise, the calculation we mentioned (actual_delivery_time - created_at) will go wrong.
What we’re fixing:
- created_at: when the order was placed
- actual_delivery_time: when the food arrived
These two columns are stored as objects, so to do the calculations correctly, we have to convert them to the datetime format. To do that, we can use the datetime functions in pandas. Here is the code.
import pandas as pd
df = pd.read_csv("historical_data.csv")
# Convert timestamp strings to datetime objects
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df.info()
Here is the output.
As you can see from the screenshot above, created_at and actual_delivery_time are now datetime objects.
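With the timestamps converted, the duration calculation we mentioned earlier becomes straightforward. As a quick sketch (the delivery_duration_sec column name is just an illustration, not part of the original dataset), it would look like this:
# Sketch: total delivery duration in seconds (illustrative column name)
df["delivery_duration_sec"] = (
    df["actual_delivery_time"] - df["created_at"]
).dt.total_seconds()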
Among the key columns, store_primary_category has the fewest non-null values (192,668), which means it has the most missing data. That’s why we’ll focus on cleaning it first.
// Data Imputation With mode()
One of the messiest columns in the dataset, judging by its high number of missing values, is store_primary_category. It tells us what kind of food stores are available, like Mexican, American, and Thai. However, many rows are missing this information, which is a problem. For instance, it can limit how we group or analyze the data. So how can we fix it?
We’ll fill these rows instead of dropping them. To do that, we’ll use smarter imputation.
We build a mapping from each store_id to its most frequent category, and then use that mapping to fill in the missing values. Let’s look at the dataset before doing that.
Here is the code.
import numpy as np
# Global most-frequent category as a fallback
global_mode = df["store_primary_category"].mode().iloc[0]
# Build a store-level mapping to the most frequent category (fast and robust)
store_mode = (
    df.groupby("store_id")["store_primary_category"]
    .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
)
# Fill missing categories using the store-level mode, then fall back to the global mode
df["store_primary_category"] = (
    df["store_primary_category"]
    .fillna(df["store_id"].map(store_mode))
    .fillna(global_mode)
)
df.info()
Here is the output.
As you can see from the screenshot above, the store_primary_category column now has a higher non-null count. But let’s double-check with this code.
df["store_primary_category"].isna().sum()
Here is the output showing the number of NaN values. It’s zero; we got rid of all of them.
And let’s look at the dataset after the imputation.
// Dropping Remaining NaNs
In the previous step, we fixed store_primary_category, but did you notice something? The non-null counts across the columns still don’t match!
This is a clear sign that we’re still dealing with missing values in some parts of the dataset. Now, when it comes to data cleaning, we have two options:
- Fill the missing values
- Drop them
Given that this dataset contains nearly 200,000 rows, we can afford to lose some. With smaller datasets, you’d need to be more careful. In that case, it’s advisable to analyze each column, establish standards (decide how missing values will be filled, whether with the mean, median, most frequent value, or domain-specific defaults), and then fill them.
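As a quick illustration of that alternative (a sketch only; the numeric column names subtotal and total_items are assumptions about the schema), a column-specific fill might look like this:
# Sketch only: column-specific fills instead of dropping rows
# ("subtotal" and "total_items" are assumed column names)
for col in ["subtotal", "total_items"]:
    df[col] = df[col].fillna(df[col].median())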
To remove the NaNs here, we’ll use the dropna() method from the pandas library. We set inplace=True to apply the changes directly to the DataFrame without needing to reassign it. Let’s look at the dataset at this point.
Here is the code.
df.dropna(inplace=True)
df.info()
Here is the output.
As you can see from the screenshot above, every column now has the same number of non-null values.
Let’s look at the dataset after all the changes.
// What Can You Do Next?
Now that we have a clean dataset, here are a few things you can do next:
- Perform EDA to understand delivery patterns.
- Engineer new features, like the delivery hour or the busy-dasher ratio, to add more meaning to your analysis (see the sketch after this list).
- Analyze correlations between variables to improve your model’s performance.
- Build different regression models and find the best-performing one.
- Predict the delivery duration with the best-performing model.
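As a starting point for the feature-engineering item, here is a minimal sketch; the dasher-related column names (total_busy_dashers, total_onshift_dashers) are assumptions about the dataset, so adjust them to the actual schema.
# Sketch only: possible engineered features
# (total_busy_dashers and total_onshift_dashers are assumed column names)
df["order_hour"] = df["created_at"].dt.hour
df["busy_dasher_ratio"] = df["total_busy_dashers"] / df["total_onshift_dashers"]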
# Final Thoughts
In this article, we cleaned a real-life dataset from DoorDash by addressing common data quality issues, such as fixing incorrect data types and handling missing values. We built a simple data cleaning pipeline tailored to this data project and explored potential next steps.
Real-world datasets can be messier than you think, but there are also many methods and tricks to solve these issues. Thanks for reading!
Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.