
Image by Author | Canva
The Pandas library has one of the fastest-growing communities. Its popularity has opened the door for alternatives, like Polars. In this article, we'll explore one such alternative, DuckDB.
DuckDB is an SQL database that you can run right in your notebook. No setup is required and no servers are needed. It's easy to install and can work alongside Pandas.
Unlike other SQL databases, you don't have to configure a server. It just works with your notebook after installation, which means no local setup headaches; you're writing code immediately. DuckDB handles filtering, joins, and aggregations with clean SQL syntax, and it performs significantly better than Pandas on large datasets.
So enough with the words, let's get started!
Data Project – Uber Business Modeling
We'll use DuckDB inside a Jupyter Notebook, combining it with Python for data analysis. To make things more exciting, we'll work on a real-life data project. Let's get started!
Here is the link to the data project we'll be using in this article. It's a data project from Uber called Partner's Business Modeling.
Uber used this data project in its recruitment process for data science positions, and you are asked to analyze the data for two different scenarios.
- Scenario 1: Compare the cost of two bonus programs designed to get more drivers online during a busy day.
- Scenario 2: Calculate and compare the annual net income of a traditional taxi driver versus one who partners with Uber and buys a car.
Loading the Dataset
Let's load the dataframe first. This step is needed because we'll register this dataset with DuckDB in the following sections.
import pandas as pd
df = pd.read_csv("dataset_2.csv")
Exploring the Dataset
Here are the first few rows:
Let's see all the columns.
Here is the output.
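Since the preview output isn't reproduced here, below is a minimal sketch of this exploration step, run on a small hypothetical sample that mirrors the columns the later queries rely on; the real dataset_2.csv has more rows and may contain additional columns.

```python
# Hypothetical two-row sample mimicking the Uber dataset's columns
# ("Trips Completed", "Supply Hours", "Accept Rate", "Rating")
import pandas as pd

df = pd.DataFrame({
    "Name": ["Driver A", "Driver B"],
    "Trips Completed": [12, 8],
    "Supply Hours": [9, 5],
    "Accept Rate": ["95%", "80%"],
    "Rating": [4.8, 4.6],
})

print(df.head())            # first few rows
print(df.columns.tolist())  # all column names
```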
Connect DuckDB and Register the DataFrame
Good, it's a really simple dataset, but how do we connect DuckDB to it?
First, install DuckDB if you have not done so yet.
Connecting with DuckDB is simple. Also, if you want to read the documentation, check it out here.
Now, here is the code to make a connection and register the dataframe.
import duckdb
con = duckdb.connect()
con.register("my_data", df)
Good, let's start exploring seven queries that can save you hours of Pandas work!
1. Multi-Criteria Filtering for Complex Eligibility Rules
One of the most significant advantages of SQL is how naturally it handles filtering, especially multi-condition filtering.
Implementation of Multi-Criteria Filtering in DuckDB vs Pandas
DuckDB lets you apply multiple filters using SQL's WHERE clause and AND logic, which scales well as the number of filters grows.
SELECT
*
FROM my_data
WHERE condition_1
AND condition_2
AND condition_3
AND condition_4
Now let's see how we'd write the same logic in Pandas. In Pandas, the same logic is expressed using chained boolean masks with brackets, which can get verbose with many conditions.
filtered_df = df[
(df["condition_1"]) &
(df["condition_2"]) &
(df["condition_3"]) &
(df["condition_4"])
]
Both methods are readable and work fine for basic use, but DuckDB feels more natural and cleaner as the logic gets more complex.
Multi-Criteria Filtering for the Uber Data Project
In this case, we want to find drivers who qualify for a special Uber bonus program.
According to the rules, the drivers must:
- Be online for at least 8 hours
- Complete at least 10 trips
- Accept at least 90% of ride requests
- Have a rating of 4.7 or above
Now all we have to do is write a query that applies all these filters. Here is the code.
SELECT
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM my_data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
To execute this from Python, we wrap the query in con.execute(""" """) and call the fetchdf() method, as shown below:
con.execute("""
SELECT
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM my_data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
""").fetchdf()
We'll do this throughout the article. Now that you know how to run queries in a Jupyter notebook, we'll show only the SQL code from now on, and you'll know how to convert it to the Pythonic version.
Good. Now, remember that the data project wants us to calculate the total payout for Option 1.
We've counted the qualifying drivers, but we should multiply that count by $50, because the payout is $50 for each driver; hence COUNT(*) * 50.
Here is the output.
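As a sanity check, here is a Pandas equivalent of the Option 1 filter, run on a tiny made-up sample of three drivers; the real dataset's counts will of course differ.

```python
import pandas as pd

# Three hypothetical drivers; only the first meets all four Option 1 rules
df = pd.DataFrame({
    "Supply Hours": [9, 10, 5],
    "Trips Completed": [12, 9, 15],
    "Accept Rate": ["95%", "92%", "88%"],
    "Rating": [4.8, 4.9, 4.9],
})

# Strip the '%' sign so the acceptance rate compares numerically,
# mirroring the CAST(REPLACE(...)) in the SQL version
accept = df["Accept Rate"].str.rstrip("%").astype(float)

qualified = df[
    (df["Supply Hours"] >= 8)
    & (accept >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
]
total_payout = len(qualified) * 50  # $50 per qualifying driver
print(len(qualified), total_payout)  # 1 50
```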
2. Fast Aggregation to Estimate Business Incentives
SQL is great for quick aggregation, especially when you need to summarize data across rows.
Implementation of Aggregation in DuckDB vs Pandas
DuckDB lets you aggregate values across rows using SQL functions like SUM and COUNT in a single compact block.
SELECT
COUNT(*) AS num_rows,
SUM(column_name) AS total_value
FROM my_data
WHERE some_condition
In Pandas, you first have to filter the dataframe, then separately count and sum using chained methods.
filtered = df[df["some_condition"]]
num_rows = filtered.shape[0]
total_value = filtered["column_name"].sum()
DuckDB is more concise, easier to read, and doesn't require managing intermediate variables.
Aggregation in the Uber Data Project
Good, let's move on to the second bonus scheme, Option 2. According to the project description, drivers will receive $4 per trip if they:
- Complete at least 12 trips.
- Have a rating of 4.7 or better.
This time, instead of just counting the drivers, we need to add up the number of trips they completed, since the bonus is paid per trip, not per person.
SELECT
    COUNT(*) AS qualified_drivers,
    SUM("Trips Completed") * 4 AS total_payout
FROM my_data
WHERE "Trips Completed" >= 12
  AND Rating >= 4.7
The count here tells us how many drivers qualify. To calculate the total payout, however, we sum their trips and multiply by $4, as required by Option 2.
Here is the output.
With DuckDB, we don't have to loop through the rows or build custom aggregations. The SUM function takes care of everything we need.
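For comparison, the same Option 2 aggregation in Pandas, again on a small hypothetical sample rather than the real dataset:

```python
import pandas as pd

# Hypothetical sample: only the first driver meets both Option 2 rules
df = pd.DataFrame({
    "Trips Completed": [14, 12, 9],
    "Rating": [4.8, 4.6, 4.9],
})

eligible = df[(df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)]
qualified_drivers = len(eligible)
total_payout = eligible["Trips Completed"].sum() * 4  # $4 per completed trip
print(qualified_drivers, total_payout)  # 1 56
```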
3. Detect Overlaps and Differences Using Boolean Logic
In SQL, you can easily combine conditions using Boolean logic, such as AND, OR, and NOT.
Implementation of Boolean Logic in DuckDB vs Pandas
DuckDB supports boolean logic natively in the WHERE clause using AND, OR, and NOT.
SELECT *
FROM my_data
WHERE condition_a
AND condition_b
AND NOT (condition_c)
Pandas requires combining logical operators with masks and parentheses, along with "~" for negation.
filtered = df[
(df["condition_a"]) &
(df["condition_b"]) &
~(df["condition_c"])
]
While both are functional, DuckDB is easier to reason about when the logic involves exclusions or nested conditions.
Boolean Logic for the Uber Data Project
Now that we have calculated Option 1 and Option 2, what comes next? It's time for the comparison. Remember our next question: how many drivers qualify for Option 1 but not for Option 2?
This is where Boolean logic comes in. We'll use a combination of AND and NOT.
SELECT COUNT(*) AS only_option1
FROM my_data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
  AND NOT ("Trips Completed" >= 12 AND Rating >= 4.7)
Here is the output.
Let’s break it down:
- The first four conditions are there for Option 1.
- The NOT (...) part excludes drivers who also qualify for Option 2.
It's pretty simple, right?
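The same exclusion reads like this in Pandas, sketched on two hypothetical drivers who both clear the Option 1 bar, the second of whom also clears Option 2 and is therefore excluded:

```python
import pandas as pd

df = pd.DataFrame({
    "Supply Hours": [9, 10],
    "Trips Completed": [11, 15],
    "Accept Rate": ["95%", "96%"],
    "Rating": [4.8, 4.9],
})

accept = df["Accept Rate"].str.rstrip("%").astype(float)
option1 = (
    (df["Supply Hours"] >= 8)
    & (accept >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
)
option2 = (df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)

# "~" negates the Option 2 mask, mirroring SQL's NOT
only_option1 = int((option1 & ~option2).sum())
print(only_option1)  # 1
```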
4. Quick Cohort Sizing with Conditional Filters
Sometimes, you want to understand how big a specific group, or cohort, is within your data.
Implementation of Conditional Filters in DuckDB vs Pandas
DuckDB handles cohort filtering and percentage calculation in one SQL query, even including subqueries.
SELECT
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM my_data), 2) AS percentage
FROM my_data
WHERE condition_1
AND condition_2
AND condition_3
Pandas requires filtering, counting, and manual division to calculate percentages.
filtered = df[
(df["condition_1"]) &
(df["condition_2"]) &
(df["condition_3"])
]
percentage = round(100.0 * len(filtered) / len(df), 2)
DuckDB is cleaner and faster here. It minimizes the number of steps and avoids repeated code.
Cohort Sizing for the Uber Data Project
Now we're at the last question of Scenario 1. Here, Uber wants us to find the drivers who missed some of the targets, like trips and acceptance rate, yet still had high ratings. Specifically, the drivers who:
- Completed fewer than 10 trips
- Had an acceptance rate lower than 90%
- Had a rating of 4.7 or higher
Now, these are three separate filters combined, and we want to calculate the percentage of drivers satisfying all of them. Let's see the query.
SELECT
    ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM my_data), 2) AS percentage
FROM my_data
WHERE "Trips Completed" < 10
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) < 90
  AND Rating >= 4.7
Here is the output.
Here, we filtered the rows where all three conditions were satisfied, counted them, and divided by the total number of drivers to get a percentage.
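The Pandas version of that percentage calculation looks like this, on a hypothetical three-driver sample (Accept Rate is given numerically here for brevity; in the real dataset it is a string like "85%"):

```python
import pandas as pd

# Only the first driver misses both targets while keeping a high rating
df = pd.DataFrame({
    "Trips Completed": [8, 12, 9],
    "Accept Rate": [85.0, 95.0, 95.0],
    "Rating": [4.8, 4.9, 4.9],
})

cohort = df[
    (df["Trips Completed"] < 10)
    & (df["Accept Rate"] < 90)
    & (df["Rating"] >= 4.7)
]
percentage = round(100.0 * len(cohort) / len(df), 2)
print(percentage)  # 33.33
```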
5. Basic Arithmetic Queries for Revenue Modeling
Now, let's say you want to do some basic math. You can write expressions directly in your SELECT statement.
Implementation of Arithmetic in DuckDB vs Pandas
DuckDB allows arithmetic to be written directly in the SELECT clause, like a calculator.
SELECT
daily_income * work_days * weeks_per_year AS annual_revenue,
weekly_cost * weeks_per_year AS total_cost,
(daily_income * work_days * weeks_per_year) - (weekly_cost * weeks_per_year) AS net_income
FROM my_data
Pandas requires several intermediate variables for the same result.
daily_income = 200
weeks_per_year = 49
work_days = 6
weekly_cost = 500
annual_revenue = daily_income * work_days * weeks_per_year
total_cost = weekly_cost * weeks_per_year
net_income = annual_revenue - total_cost
DuckDB condenses the math into a readable SQL block, while Pandas gets a bit cluttered with variable assignments.
Basic Arithmetic in the Uber Data Project
In Scenario 2, Uber asks us to calculate how much money (after expenses) a driver makes per year without partnering with Uber. The expenses include gas, rent, and insurance.
Now let's calculate the annual revenue and subtract the expenses from it.
SELECT
200 * 6 * (52 - 3) AS annual_revenue,
200 * (52 - 3) AS gas_expense,
500 * (52 - 3) AS rent_expense,
400 * 12 AS insurance_expense,
(200 * 6 * (52 - 3))
- (200 * (52 - 3) + 500 * (52 - 3) + 400 * 12) AS net_income
Here is the output.
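The same arithmetic can be double-checked in plain Python, using the figures from the query above (49 working weeks, $200/day over 6 days, $200/week gas, $500/week rent, $400/month insurance):

```python
weeks_worked = 52 - 3                    # 49 working weeks per year
annual_revenue = 200 * 6 * weeks_worked  # $200/day, 6 days a week
gas_expense = 200 * weeks_worked
rent_expense = 500 * weeks_worked
insurance_expense = 400 * 12
net_income = annual_revenue - (gas_expense + rent_expense + insurance_expense)
print(annual_revenue, net_income)  # 58800 19700
```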
With DuckDB, you can write all of this as a single SQL block. You don't need Pandas dataframes or manual looping!
6. Conditional Calculations for Dynamic Expense Planning
What if your cost structure changes based on certain conditions?
Implementation of Conditional Calculations in DuckDB vs Pandas
DuckDB lets you apply conditional logic using arithmetic adjustments inside your query.
SELECT
original_cost * 1.05 AS increased_cost,
original_cost * 0.8 AS discounted_cost,
0 AS removed_cost,
(original_cost * 1.05 + original_cost * 0.8) AS total_new_cost
Pandas uses the same logic with multiple math lines and manual variable updates.
weeks_worked = 49
gas = 200
insurance = 400
gas_expense = gas * 1.05 * weeks_worked
insurance_expense = insurance * 0.8 * 12
rent_expense = 0
total = gas_expense + insurance_expense
DuckDB turns what would be multi-step logic in Pandas into a single SQL expression.
Conditional Calculations in the Uber Data Project
In this scenario, we model what happens if the driver partners with Uber and buys a car. The expenses change as follows:
- Gas cost increases by 5%
- Insurance decreases by 20%
- No more rent expense
con.execute("""
SELECT
200 * 1.05 * 49 AS gas_expense,
400 * 0.8 * 12 AS insurance_expense,
0 AS rent_expense,
(200 * 1.05 * 49) + (400 * 0.8 * 12) AS total_expense
""").fetchdf()
Here is the output.
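A quick plain-Python re-check of the adjusted expenses, with the same figures as the query:

```python
weeks_worked = 49
gas_expense = 200 * 1.05 * weeks_worked  # gas up 5%
insurance_expense = 400 * 0.8 * 12       # insurance down 20%
rent_expense = 0                         # no more rent
total_expense = gas_expense + insurance_expense + rent_expense
print(gas_expense, insurance_expense, total_expense)
```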
7. Goal-Driven Math for Revenue Targeting
Sometimes, your analysis is driven by a business goal, like hitting a revenue target or covering a one-time cost.
Implementation of Goal-Driven Math in DuckDB vs Pandas
DuckDB handles multi-step logic using CTEs, which keeps the query modular and easy to read.
WITH vars AS (
SELECT base_income, cost_1, cost_2, target_item
),
calc AS (
SELECT
base_income - (cost_1 + cost_2) AS current_profit,
cost_1 * 1.1 + cost_2 * 0.8 + target_item AS new_total_expense
FROM vars
),
final AS (
SELECT
current_profit + new_total_expense AS required_revenue,
required_revenue / 49 AS required_weekly_income
FROM calc
)
SELECT required_weekly_income FROM final
Pandas requires nesting calculations and reusing earlier variables to avoid duplication.
weeks = 49
original_income = 200 * 6 * weeks
original_cost = (200 + 500) * weeks + 400 * 12
net_income = original_income - original_cost
# new expenses + car cost
new_gas = 200 * 1.05 * weeks
new_insurance = 400 * 0.8 * 12
car_cost = 40000
required_revenue = net_income + new_gas + new_insurance + car_cost
required_weekly_income = required_revenue / weeks
DuckDB lets you build the logic pipeline step by step, without cluttering your notebook with scattered code.
Goal-Driven Math in the Uber Data Project
Now that we have modeled the new costs, let's answer the final business question:
How much more does the driver need to earn per week to do both of the following?
- Pay off a $40,000 car within a year
- Maintain the same yearly net income
Now let's write the code representing this logic.
WITH vars AS (
SELECT
52 AS total_weeks_per_year,
3 AS weeks_off,
6 AS days_per_week,
200 AS fare_per_day,
400 AS monthly_insurance,
200 AS gas_per_week,
500 AS vehicle_rent,
40000 AS car_cost
),
base AS (
SELECT
total_weeks_per_year,
weeks_off,
days_per_week,
fare_per_day,
monthly_insurance,
gas_per_week,
vehicle_rent,
car_cost,
total_weeks_per_year - weeks_off AS weeks_worked,
(fare_per_day * days_per_week * (total_weeks_per_year - weeks_off)) AS original_annual_revenue,
(gas_per_week * (total_weeks_per_year - weeks_off)) AS original_gas,
(vehicle_rent * (total_weeks_per_year - weeks_off)) AS original_rent,
(monthly_insurance * 12) AS original_insurance
FROM vars
),
compare AS (
SELECT *,
(original_gas + original_rent + original_insurance) AS original_total_expense,
(original_annual_revenue - (original_gas + original_rent + original_insurance)) AS original_net_income
FROM base
),
new_costs AS (
SELECT *,
gas_per_week * 1.05 * weeks_worked AS new_gas,
monthly_insurance * 0.8 * 12 AS new_insurance
FROM compare
),
final AS (
SELECT *,
new_gas + new_insurance + car_cost AS new_total_expense,
original_net_income + new_gas + new_insurance + car_cost AS required_revenue,
required_revenue / weeks_worked AS required_weekly_revenue,
original_annual_revenue / weeks_worked AS original_weekly_revenue
FROM new_costs
)
SELECT
ROUND(required_weekly_revenue, 2) AS required_weekly_revenue,
ROUND(required_weekly_revenue - original_weekly_revenue, 2) AS weekly_uplift
FROM final
Here is the output.
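As a final sanity check, the CTE pipeline's arithmetic can be walked through in plain Python with the same figures:

```python
weeks_worked = 52 - 3                      # 49 working weeks
original_revenue = 200 * 6 * weeks_worked  # original annual revenue
original_expense = 200 * weeks_worked + 500 * weeks_worked + 400 * 12
original_net_income = original_revenue - original_expense

new_gas = 200 * 1.05 * weeks_worked  # gas up 5%
new_insurance = 400 * 0.8 * 12       # insurance down 20%
car_cost = 40000

# Revenue needed to cover the new costs, the car, and the old net income
required_revenue = original_net_income + new_gas + new_insurance + car_cost
required_weekly_revenue = round(required_revenue / weeks_worked, 2)
weekly_uplift = round(
    required_revenue / weeks_worked - original_revenue / weeks_worked, 2
)
print(required_weekly_revenue, weekly_uplift)
```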
Final Thoughts
In this article, we explored how to connect to DuckDB and analyze data with it. Instead of writing long Pandas chains, we used SQL queries, and we did it all on a real-life data project that Uber used in its data scientist recruitment process.
For data scientists working on analysis-heavy tasks, DuckDB is a lightweight but powerful alternative to Pandas. Try it in your next project, especially when SQL logic fits the problem better.
Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.