
Image by Author | Canva
The Pandas library has one of the fastest-growing communities. Its popularity has opened the door for alternatives, like Polars. In this article, we'll explore one such alternative, DuckDB.
DuckDB is an SQL database that you can run right in your notebook. No setup is required and no servers are needed. It's easy to install and can work alongside Pandas.
Unlike other SQL databases, you don't have to configure a server. It just works with your notebook after installation, which means no local setup headaches; you're writing code immediately. DuckDB handles filtering, joins, and aggregations with clean SQL syntax, and it performs significantly better than Pandas on large datasets.
So enough with the words, let's get started!
Data Project – Uber Business Modeling
We'll use DuckDB inside a Jupyter Notebook, combining it with Python for data analysis. To make things more exciting, we'll work on a real-life data project. Let's get started!
Here is the link to the data project we'll be using in this article. It's a data project from Uber called Partner's Business Modeling.
Uber used this data project in its recruitment process for data science positions, and you are asked to analyze the data for two different scenarios.
- Scenario 1: Compare the cost of two bonus programs designed to get more drivers online during a busy day.
- Scenario 2: Calculate and compare the annual net income of a traditional taxi driver versus one who partners with Uber and buys a car.
Loading the Dataset
Let's load the dataframe first. This step is needed because we'll register this dataset with DuckDB in the following sections.
import pandas as pd
df = pd.read_csv("dataset_2.csv")
Exploring the Dataset
Here are the first few rows:
Let's see all the columns.
Here is the output.
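Since the preview output isn't reproduced here, below is a minimal sketch of this exploration step, run on a small hypothetical sample that mirrors the columns the later queries rely on; the real dataset_2.csv has more rows and may contain additional columns.

```python
# Hypothetical two-row sample mimicking the Uber dataset's columns
# ("Trips Completed", "Supply Hours", "Accept Rate", "Rating")
import pandas as pd

df = pd.DataFrame({
    "Name": ["Driver A", "Driver B"],
    "Trips Completed": [12, 8],
    "Supply Hours": [9, 5],
    "Accept Rate": ["95%", "80%"],
    "Rating": [4.8, 4.6],
})

print(df.head())            # first few rows
print(df.columns.tolist())  # all column names
```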
Connect DuckDB and Register the DataFrame
Good, it's a really simple dataset, but how do we connect DuckDB to it?
First, install DuckDB if you have not done so yet.
Connecting with DuckDB is simple. Also, if you want to read the documentation, check it out here.
Now, here is the code to make a connection and register the dataframe.
import duckdb
con = duckdb.connect()
con.register("my_data", df)
Good, let's start exploring seven queries that can save you hours of Pandas work!
1. Multi-Criteria Filtering for Complex Eligibility Rules
One of the most significant advantages of SQL is how naturally it handles filtering, especially multi-condition filtering.
Implementation of Multi-Criteria Filtering in DuckDB vs Pandas
DuckDB lets you apply multiple filters using SQL's WHERE clause and AND logic, which scales well as the number of filters grows.
SELECT
*
FROM my_data
WHERE condition_1
AND condition_2
AND condition_3
AND condition_4
Now let's see how we'd write the same logic in Pandas. In Pandas, the same logic is expressed using chained boolean masks with brackets, which can get verbose with many conditions.
filtered_df = df[
(df["condition_1"]) &
(df["condition_2"]) &
(df["condition_3"]) &
(df["condition_4"])
]
Both methods are readable and work fine for basic use, but DuckDB feels more natural and cleaner as the logic gets more complex.
Multi-Criteria Filtering for the Uber Data Project
In this case, we want to find drivers who qualify for a special Uber bonus program.
According to the rules, the drivers must:
- Be online for at least 8 hours
- Complete at least 10 trips
- Accept at least 90% of ride requests
- Have a rating of 4.7 or above
Now all we have to do is write a query that applies all these filters. Here is the code.
SELECT
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM my_data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
To execute this from Python, we wrap the query in con.execute(""" """) and call the fetchdf() method, as shown below:
con.execute("""
SELECT
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM my_data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
""").fetchdf()
We'll do this throughout the article. Now that you know how to run queries in a Jupyter notebook, we'll show only the SQL code from now on, and you'll know how to convert it to the Pythonic version.
Good. Now, remember that the data project wants us to calculate the total payout for Option 1.
We've counted the qualifying drivers, but we should multiply that count by $50, because the payout is $50 for each driver; hence COUNT(*) * 50.
Here is the output.
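As a sanity check, here is a Pandas equivalent of the Option 1 filter, run on a tiny made-up sample of three drivers; the real dataset's counts will of course differ.

```python
import pandas as pd

# Three hypothetical drivers; only the first meets all four Option 1 rules
df = pd.DataFrame({
    "Supply Hours": [9, 10, 5],
    "Trips Completed": [12, 9, 15],
    "Accept Rate": ["95%", "92%", "88%"],
    "Rating": [4.8, 4.9, 4.9],
})

# Strip the '%' sign so the acceptance rate compares numerically,
# mirroring the CAST(REPLACE(...)) in the SQL version
accept = df["Accept Rate"].str.rstrip("%").astype(float)

qualified = df[
    (df["Supply Hours"] >= 8)
    & (accept >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
]
total_payout = len(qualified) * 50  # $50 per qualifying driver
print(len(qualified), total_payout)  # 1 50
```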
2. Fast Aggregation to Estimate Business Incentives
SQL is great for quick aggregation, especially when you need to summarize data across rows.
Implementation of Aggregation in DuckDB vs Pandas
DuckDB lets you aggregate values across rows using SQL functions like SUM and COUNT in a single compact block.
SELECT
COUNT(*) AS num_rows,
SUM(column_name) AS total_value
FROM my_data
WHERE some_condition
In Pandas, you first have to filter the dataframe, then separately count and sum using chained methods.
filtered = df[df["some_condition"]]
num_rows = filtered.shape[0]
total_value = filtered["column_name"].sum()
DuckDB is more concise, easier to read, and doesn't require managing intermediate variables.
Aggregation in the Uber Data Project
Good, let's move on to the second bonus scheme, Option 2. According to the project description, drivers will receive $4 per trip if they:
- Complete at least 12 trips.
- Have a rating of 4.7 or better.
This time, instead of just counting the drivers, we need to add up the number of trips they completed, since the bonus is paid per trip, not per person.
SELECT
    COUNT(*) AS qualified_drivers,
    SUM("Trips Completed") * 4 AS total_payout
FROM my_data
WHERE "Trips Completed" >= 12
  AND Rating >= 4.7
The count here tells us how many drivers qualify. To calculate the total payout, however, we sum their trips and multiply by $4, as required by Option 2.
Here is the output.
With DuckDB, we don't have to loop through the rows or build custom aggregations. The SUM function takes care of everything we need.
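For comparison, the same Option 2 aggregation in Pandas, again on a small hypothetical sample rather than the real dataset:

```python
import pandas as pd

# Hypothetical sample: only the first driver meets both Option 2 rules
df = pd.DataFrame({
    "Trips Completed": [14, 12, 9],
    "Rating": [4.8, 4.6, 4.9],
})

eligible = df[(df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)]
qualified_drivers = len(eligible)
total_payout = eligible["Trips Completed"].sum() * 4  # $4 per completed trip
print(qualified_drivers, total_payout)  # 1 56
```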
3. Detect Overlaps and Differences Using Boolean Logic
In SQL, you can easily combine conditions using Boolean logic, such as AND, OR, and NOT.
Implementation of Boolean Logic in DuckDB vs Pandas
DuckDB supports boolean logic natively in the WHERE clause using AND, OR, and NOT.
SELECT *
FROM my_data
WHERE condition_a
AND condition_b
AND NOT (condition_c)
Pandas requires combining logical operators with masks and parentheses, along with "~" for negation.
filtered = df[
(df["condition_a"]) &
(df["condition_b"]) &
~(df["condition_c"])
]
While both are functional, DuckDB is easier to reason about when the logic involves exclusions or nested conditions.
Boolean Logic for the Uber Data Project
Now that we have calculated Option 1 and Option 2, what comes next? It's time for the comparison. Remember our next question: how many drivers qualify for Option 1 but not for Option 2?
This is where Boolean logic comes in. We'll use a combination of AND and NOT.
SELECT COUNT(*) AS only_option1
FROM my_data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
  AND NOT ("Trips Completed" >= 12 AND Rating >= 4.7)
Here is the output.
Let’s break it down:
- The first four conditions are there for Option 1.
- The NOT (...) part excludes drivers who also qualify for Option 2.
It's pretty simple, right?
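The same exclusion reads like this in Pandas, sketched on two hypothetical drivers who both clear the Option 1 bar, the second of whom also clears Option 2 and is therefore excluded:

```python
import pandas as pd

df = pd.DataFrame({
    "Supply Hours": [9, 10],
    "Trips Completed": [11, 15],
    "Accept Rate": ["95%", "96%"],
    "Rating": [4.8, 4.9],
})

accept = df["Accept Rate"].str.rstrip("%").astype(float)
option1 = (
    (df["Supply Hours"] >= 8)
    & (accept >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
)
option2 = (df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)

# "~" negates the Option 2 mask, mirroring SQL's NOT
only_option1 = int((option1 & ~option2).sum())
print(only_option1)  # 1
```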
4. Quick Cohort Sizing with Conditional Filters
Sometimes, you want to understand how big a specific group, or cohort, is within your data.
Implementation of Conditional Filters in DuckDB vs Pandas
DuckDB handles cohort filtering and percentage calculation in one SQL query, even including subqueries.
SELECT
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM my_data), 2) AS percentage
FROM my_data
WHERE condition_1
AND condition_2
AND condition_3
Pandas requires filtering, counting, and manual division to calculate percentages.
filtered = df[
(df["condition_1"]) &
(df["condition_2"]) &
(df["condition_3"])
]
percentage = round(100.0 * len(filtered) / len(df), 2)
DuckDB is cleaner and faster here. It minimizes the number of steps and avoids repeated code.
Cohort Sizing for the Uber Data Project
Now we're at the last question of Scenario 1. Here, Uber wants us to find the drivers who missed some of the targets, like trips and acceptance rate, yet still had high ratings. Specifically, the drivers who:
- Completed fewer than 10 trips
- Had an acceptance rate lower than 90%
- Had a rating of 4.7 or higher
Now, these are three separate filters combined, and we want to calculate the percentage of drivers satisfying all of them. Let's see the query.
SELECT
    ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM my_data), 2) AS percentage
FROM my_data
WHERE "Trips Completed" < 10
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) < 90
  AND Rating >= 4.7
Here is the output.
Here, we filtered the rows where all three conditions were satisfied, counted them, and divided by the total number of drivers to get a percentage.
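The Pandas version of that percentage calculation looks like this, on a hypothetical three-driver sample (Accept Rate is given numerically here for brevity; in the real dataset it is a string like "85%"):

```python
import pandas as pd

# Only the first driver misses both targets while keeping a high rating
df = pd.DataFrame({
    "Trips Completed": [8, 12, 9],
    "Accept Rate": [85.0, 95.0, 95.0],
    "Rating": [4.8, 4.9, 4.9],
})

cohort = df[
    (df["Trips Completed"] < 10)
    & (df["Accept Rate"] < 90)
    & (df["Rating"] >= 4.7)
]
percentage = round(100.0 * len(cohort) / len(df), 2)
print(percentage)  # 33.33
```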
5. Basic Arithmetic Queries for Revenue Modeling
Now, let's say you want to do some basic math. You can write expressions directly in your SELECT statement.
Implementation of Arithmetic in DuckDB vs Pandas
DuckDB allows arithmetic to be written directly in the SELECT clause, like a calculator.
SELECT
daily_income * work_days * weeks_per_year AS annual_revenue,
weekly_cost * weeks_per_year AS total_cost,
(daily_income * work_days * weeks_per_year) - (weekly_cost * weeks_per_year) AS net_income
FROM my_data
Pandas requires several intermediate variables for the same result.
daily_income = 200
weeks_per_year = 49
work_days = 6
weekly_cost = 500
annual_revenue = daily_income * work_days * weeks_per_year
total_cost = weekly_cost * weeks_per_year
net_income = annual_revenue - total_cost
DuckDB condenses the math into a readable SQL block, while Pandas gets a bit cluttered with variable assignments.
Basic Arithmetic in the Uber Data Project
In Scenario 2, Uber asks us to calculate how much money (after expenses) a driver makes per year without partnering with Uber. The expenses include gas, rent, and insurance.
Now let's calculate the annual revenue and subtract the expenses from it.
SELECT
200 * 6 * (52 - 3) AS annual_revenue,
200 * (52 - 3) AS gas_expense,
500 * (52 - 3) AS rent_expense,
400 * 12 AS insurance_expense,
(200 * 6 * (52 - 3))
- (200 * (52 - 3) + 500 * (52 - 3) + 400 * 12) AS net_income
Here is the output.
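The same arithmetic can be double-checked in plain Python, using the figures from the query above (49 working weeks, $200/day over 6 days, $200/week gas, $500/week rent, $400/month insurance):

```python
weeks_worked = 52 - 3                    # 49 working weeks per year
annual_revenue = 200 * 6 * weeks_worked  # $200/day, 6 days a week
gas_expense = 200 * weeks_worked
rent_expense = 500 * weeks_worked
insurance_expense = 400 * 12
net_income = annual_revenue - (gas_expense + rent_expense + insurance_expense)
print(annual_revenue, net_income)  # 58800 19700
```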
With DuckDB, you can write all of this as a single SQL block. You don't need Pandas dataframes or manual looping!
6. Conditional Calculations for Dynamic Expense Planning
What if your cost structure changes based on certain conditions?
Implementation of Conditional Calculations in DuckDB vs Pandas
DuckDB lets you apply conditional logic using arithmetic adjustments inside your query.
SELECT
original_cost * 1.05 AS increased_cost,
original_cost * 0.8 AS discounted_cost,
0 AS removed_cost,
(original_cost * 1.05 + original_cost * 0.8) AS total_new_cost
Pandas uses the same logic with multiple math lines and manual variable updates.
weeks_worked = 49
gas = 200
insurance = 400
gas_expense = gas * 1.05 * weeks_worked
insurance_expense = insurance * 0.8 * 12
rent_expense = 0
total = gas_expense + insurance_expense
DuckDB turns what would be multi-step logic in Pandas into a single SQL expression.
Conditional Calculations in the Uber Data Project
In this scenario, we model what happens if the driver partners with Uber and buys a car. The expenses change as follows:
- Gas cost increases by 5%
- Insurance decreases by 20%
- No more rent expense
con.execute("""
SELECT
200 * 1.05 * 49 AS gas_expense,
400 * 0.8 * 12 AS insurance_expense,
0 AS rent_expense,
(200 * 1.05 * 49) + (400 * 0.8 * 12) AS total_expense
""").fetchdf()
Here is the output.
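A quick plain-Python re-check of the adjusted expenses, with the same figures as the query:

```python
weeks_worked = 49
gas_expense = 200 * 1.05 * weeks_worked  # gas up 5%
insurance_expense = 400 * 0.8 * 12       # insurance down 20%
rent_expense = 0                         # no more rent
total_expense = gas_expense + insurance_expense + rent_expense
print(gas_expense, insurance_expense, total_expense)
```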
7. Goal-Driven Math for Revenue Targeting
Sometimes, your analysis is driven by a business goal, like hitting a revenue target or covering a one-time cost.
Implementation of Goal-Driven Math in DuckDB vs Pandas
DuckDB handles multi-step logic using CTEs, which keeps the query modular and easy to read.
WITH vars AS (
SELECT base_income, cost_1, cost_2, target_item
),
calc AS (
SELECT
base_income - (cost_1 + cost_2) AS current_profit,
cost_1 * 1.1 + cost_2 * 0.8 + target_item AS new_total_expense
FROM vars
),
final AS (
SELECT
current_profit + new_total_expense AS required_revenue,
required_revenue / 49 AS required_weekly_income
FROM calc
)
SELECT required_weekly_income FROM final
Pandas requires nesting calculations and reusing earlier variables to avoid duplication.
weeks = 49
original_income = 200 * 6 * weeks
original_cost = (200 + 500) * weeks + 400 * 12
net_income = original_income - original_cost
# new expenses + car cost
new_gas = 200 * 1.05 * weeks
new_insurance = 400 * 0.8 * 12
car_cost = 40000
required_revenue = net_income + new_gas + new_insurance + car_cost
required_weekly_income = required_revenue / weeks
DuckDB lets you build the logic pipeline step by step, without cluttering your notebook with scattered code.
Goal-Driven Math in the Uber Data Project
Now that we have modeled the new costs, let's answer the final business question:
How much more does the driver need to earn per week to do both of the following?
- Pay off a $40,000 car within a year
- Maintain the same yearly net income
Now let's write the code representing this logic.
WITH vars AS (
SELECT
52 AS total_weeks_per_year,
3 AS weeks_off,
6 AS days_per_week,
200 AS fare_per_day,
400 AS monthly_insurance,
200 AS gas_per_week,
500 AS vehicle_rent,
40000 AS car_cost
),
base AS (
SELECT
total_weeks_per_year,
weeks_off,
days_per_week,
fare_per_day,
monthly_insurance,
gas_per_week,
vehicle_rent,
car_cost,
total_weeks_per_year - weeks_off AS weeks_worked,
(fare_per_day * days_per_week * (total_weeks_per_year - weeks_off)) AS original_annual_revenue,
(gas_per_week * (total_weeks_per_year - weeks_off)) AS original_gas,
(vehicle_rent * (total_weeks_per_year - weeks_off)) AS original_rent,
(monthly_insurance * 12) AS original_insurance
FROM vars
),
compare AS (
SELECT *,
(original_gas + original_rent + original_insurance) AS original_total_expense,
(original_annual_revenue - (original_gas + original_rent + original_insurance)) AS original_net_income
FROM base
),
new_costs AS (
SELECT *,
gas_per_week * 1.05 * weeks_worked AS new_gas,
monthly_insurance * 0.8 * 12 AS new_insurance
FROM compare
),
final AS (
SELECT *,
new_gas + new_insurance + car_cost AS new_total_expense,
original_net_income + new_gas + new_insurance + car_cost AS required_revenue,
required_revenue / weeks_worked AS required_weekly_revenue,
original_annual_revenue / weeks_worked AS original_weekly_revenue
FROM new_costs
)
SELECT
ROUND(required_weekly_revenue, 2) AS required_weekly_revenue,
ROUND(required_weekly_revenue - original_weekly_revenue, 2) AS weekly_uplift
FROM final
Here is the output.
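As a final sanity check, the CTE pipeline's arithmetic can be walked through in plain Python with the same figures:

```python
weeks_worked = 52 - 3                      # 49 working weeks
original_revenue = 200 * 6 * weeks_worked  # original annual revenue
original_expense = 200 * weeks_worked + 500 * weeks_worked + 400 * 12
original_net_income = original_revenue - original_expense

new_gas = 200 * 1.05 * weeks_worked  # gas up 5%
new_insurance = 400 * 0.8 * 12       # insurance down 20%
car_cost = 40000

# Revenue needed to cover the new costs, the car, and the old net income
required_revenue = original_net_income + new_gas + new_insurance + car_cost
required_weekly_revenue = round(required_revenue / weeks_worked, 2)
weekly_uplift = round(
    required_revenue / weeks_worked - original_revenue / weeks_worked, 2
)
print(required_weekly_revenue, weekly_uplift)
```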
Final Thoughts
In this article, we explored how to connect to DuckDB and analyze data with it. Instead of writing long Pandas chains, we used SQL queries, and we did it all on a real-life data project that Uber used in its data scientist recruitment process.
For data scientists working on analysis-heavy tasks, DuckDB is a lightweight but powerful alternative to Pandas. Try it in your next project, especially when SQL logic fits the problem better.
Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.