

Pandas and SQL are each efficient for knowledge evaluation, however what if we may merge their energy? With pandasql, you may write SQL queries immediately inside a Jupyter pocket book. This integration seamlessly permits us to mix SQL logic with Python for efficient knowledge evaluation.
On this article, we are going to use each pandas and SQL collectively on a knowledge undertaking from Uber. Let’s get began!
# What Is pandasql?
Pandasql may be built-in with any DataFrame via an in-memory SQLite engine, so you may write pure SQL inside a Python atmosphere.
# Benefits of Utilizing Pandas and SQL Collectively
SQL is helpful for simply filtering rows, aggregating knowledge, or making use of multi-condition logic.
Python, then again, presents superior instruments for statistical evaluation and customized computations, in addition to set-based operations, which lengthen past SQL’s capabilities.
When used collectively, SQL simplifies knowledge choice, whereas Python provides analytical flexibility.
# The right way to Run pandasql Inside a Jupyter Pocket book?
To run pandasql
inside a Jupyter Pocket book, begin with the next code.
import pandas as pd
from pandasql import sqldf
run = lambda q: sqldf(q, globals())
Subsequent, you may run your SQL code like this:
run("""
SELECT *
FROM df
LIMIT 10;
""")
We’ll use the SQL code with out exhibiting the run
perform every time on this article.
Let’s see how utilizing SQL and Pandas collectively works in a real-life undertaking from Uber.
# Actual-World Venture: Analyzing Uber Driver Efficiency Knowledge


Picture by Writer
On this knowledge undertaking, Uber asks us to research driver efficiency knowledge and consider bonus methods.
// Knowledge Exploration and Analytics
Now, let’s discover the datasets. First, we are going to load the info.
// Preliminary Dataset Loading
Let’s load the dataset by utilizing simply pandas.
import pandas as pd
import numpy as np
df = pd.read_csv('dataset_2.csv')
// Exploring the Knowledge
Now let’s overview the dataset.
The output appears to be like like this:
Now now we have a glimpse of the info.
As you may see, the dataset contains every driver’s identify, the variety of journeys they accomplished, their acceptance charge (i.e., the proportion of journey requests accepted), complete provide hours (the entire hours spent on-line), and their common score.
Let’s confirm the column names earlier than beginning the info evaluation so we will use them accurately.
Right here is the output.
As you may see, our dataset has 5 totally different columns, and there aren’t any lacking values.
Let’s now reply the questions utilizing each SQL and Python.
# Query 1: Who Qualifies for Bonus Possibility 1?
Within the first query, we’re requested to find out the entire bonus payout for Possibility 1, which is:
$50 for every driver that’s on-line at the very least 8 hours, accepts 90% of requests, completes 10 journeys, and has a score of 4.7 or higher throughout the time-frame.
// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)
On this step, we are going to begin utilizing pandasql
.
Within the following code, now we have chosen all drivers who meet the circumstances for the Possibility 1 bonus utilizing the WHERE
clause and the AND
operator for linking a number of circumstances. To discover ways to use WHERE
and AND
, seek advice from this documentation.
opt1_eligible = run("""
SELECT Identify -- maintain solely a reputation column for readability
FROM df
WHERE `Provide Hours` >= 8
AND `Journeys Accomplished` >= 10
AND `Settle for Charge` >= 90
AND Ranking >= 4.7;
""")
opt1_eligible
Right here is the output.
// Step 2: Ending in Pandas
After filtering the dataset utilizing SQL with pandasql
, we swap to Pandas to carry out numerical calculations and finalize the evaluation. This hybrid approach, which mixes SQL and Python, enhances each readability and suppleness.
Subsequent, utilizing the next Python code, we calculate the entire payout by multiplying the variety of certified drivers (utilizing len()
) by the $50 bonus per driver. Take a look at the documentation to see how you should use the len()
perform.
payout_opt1 = 50 * len(opt1_eligible)
print(f"Possibility 1 payout: ${payout_opt1:,}")
Right here is the output.
# Query 2: Calculating the Whole Payout for Bonus Possibility 2
Within the second query, we’re requested to search out the entire bonus payout utilizing Possibility 2:
$4/journey for all drivers who full 12 journeys, and have a 4.7 or higher score.
// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)
First, we use SQL to filter for drivers who meet the Possibility 2 standards: finishing at the very least 12 journeys and sustaining a score of 4.7 or increased.
# Seize solely the rows that fulfill the Possibility-2 thresholds
opt2_drivers = run("""
SELECT Identify,
`Journeys Accomplished`
FROM df
WHERE `Journeys Accomplished` >= 12
AND Ranking >= 4.7;
""")
opt2_drivers.head()
Right here’s what we get.
// Step 2: Ending the Calculation in Pure Pandas
Now let’s carry out the calculation utilizing Pandas. The code computes the entire bonus by summing the Journeys Accomplished
column with sum()
after which multiplying the end result by the $4 bonus per journey.
total_trips = opt2_drivers["Trips Completed"].sum()
option2_bonus = 4 * total_trips
print(f"Whole journeys: {total_trips}, Possibility-2 payout: ${option2_bonus}")
Right here is the end result.
# Query 3: Figuring out Drivers Who Qualify for Possibility 1 However Not Possibility 2
Within the third query, we’re requested to depend the variety of drivers who qualify for Possibility 1 however not for Possibility 2.
// Step 1: Constructing Two Eligibility Tables with SQL (pandasql)
Within the following SQL code, we create two datasets: one for drivers who meet the Possibility 1 standards and one other for individuals who meet the Possibility 2 standards.
# All Possibility-1 drivers
opt1_drivers = run("""
SELECT Identify
FROM df
WHERE `Provide Hours` >= 8
AND `Journeys Accomplished` >= 10
AND `Settle for Charge` >= 90
AND Ranking >= 4.7;
""")
# All Possibility-2 drivers
opt2_drivers = run("""
SELECT Identify
FROM df
WHERE `Journeys Accomplished` >= 12
AND Ranking >= 4.7;
""")
// Step 2: Utilizing Python Set Logic to Spot the Distinction
Subsequent, we are going to use Python to determine the drivers who seem in Possibility 1 however not in Possibility 2, and we are going to use set operations for that.
Right here is the code:
only_opt1 = set(opt1_drivers["Name"]) - set(opt2_drivers["Name"])
count_only_opt1 = len(only_opt1)
print(f"Drivers qualifying for Possibility 1 however not Possibility 2: {count_only_opt1}")
Right here is the output.
By combining these strategies, we leverage SQL for filtering and Python’s set logic for evaluating the ensuing datasets.
# Query 4: Discovering Low-Efficiency Drivers with Excessive Scores
In query 4, we’re requested to find out the proportion of drivers who accomplished fewer than 10 journeys, had an acceptance charge under 90%, and nonetheless maintained a score of 4.7 or increased.
// Step 1: Pulling the Subset with SQL (pandasql)
Within the following code, we choose all drivers who’ve accomplished fewer than 10 journeys, have an acceptance charge of lower than 90%, and maintain a score of at the very least 4.7.
low_kpi_df = run("""
SELECT *
FROM df
WHERE `Journeys Accomplished` = 4.7;
""")
low_kpi_df
Right here is the output.
// Step 2: Calculating the Share in Plain Pandas
On this step, we are going to use Python to calculate the proportion of such drivers.
We merely divide the variety of filtered drivers by the entire driver depend, then multiply by 100 to get the proportion.
Right here is the code:
num_low_kpi = len(low_kpi_df)
total_drivers = len(df)
proportion = spherical(100 * num_low_kpi / total_drivers, 2)
print(f"{num_low_kpi} out of {total_drivers} drivers ⇒ {proportion}%")
Right here is the output.
# Query 5: Calculating Annual Revenue With out Partnering With Uber
Within the fifth query, we have to calculate the annual earnings of a taxi driver with out partnering with Uber, primarily based on the given value and income parameters.
// Step 1: Pulling Yearly Income and Bills with SQL (pandasql)
By utilizing SQL, we first calculate yearly income from each day fares and subtract bills for fuel, lease, and insurance coverage.
taxi_stats = run("""
SELECT
200*6*(52-3) AS annual_revenue,
((200+500)*(52-3) + 400*12) AS annual_expenses
""")
taxi_stats
Right here is the output.
// Step 2: Deriving Revenue and Margin with Pandas
Within the subsequent step, we are going to use Python to compute the revenue and margin the drivers get when not partnering with Uber.
rev = taxi_stats.loc[0, "annual_revenue"]
value = taxi_stats.loc[0, "annual_expenses"]
revenue = rev - value
margin = spherical(100 * revenue / rev, 2)
print(f"Income : ${rev:,}")
print(f"Bills : ${value:,}")
print(f"Revenue : ${revenue:,} (margin: {margin}%)")
Right here’s what we get.
# Query 6: Calculating the Required Fare Improve to Preserve Profitability
Within the sixth query, we assume that the identical driver decides to purchase a City Automobile and companion with Uber.
The fuel bills improve by 5%, insurance coverage decreases by 20%, and rental prices are eradicated, however the driver must cowl the $40,000 value of the automotive. We’re requested to calculate how a lot this driver’s weekly gross fares should improve within the first 12 months to each repay the automotive and preserve the identical annual revenue margin.
// Step 1: Constructing the New One-12 months Expense Stack with SQL
On this step, we are going to use SQL to calculate the brand new one-year bills with adjusted fuel and insurance coverage and no rental charges, plus the automotive value.
new_exp = run("""
SELECT
40000 AS automotive,
200*1.05*(52-3) AS fuel, -- +5 %
400*0.80*12 AS insurance coverage -- –20 %
""")
new_cost = new_exp.sum(axis=1).iloc[0]
new_cost
Right here is the output.
// Step 2: Calculating the Weekly Fare Improve with Pandas
Subsequent, we use Python to calculate how way more the motive force should earn per week to protect that margin after shopping for the automotive.
# Current values from Query 5
old_rev = 58800
old_profit = 19700
old_margin = old_profit / old_rev
weeks = 49
# new_cost was calculated within the earlier step (54130.0)
# We have to discover the brand new income (new_rev) such that the revenue margin stays the identical:
# (new_rev - new_cost) / new_rev = old_margin
# Fixing for new_rev provides: new_rev = new_cost / (1 - old_margin)
new_rev_required = new_cost / (1 - old_margin)
# The entire improve in annual income wanted is the distinction
total_increase = new_rev_required - old_rev
# Divide by the variety of working weeks to get the required weekly improve
weekly_bump = spherical(total_increase / weeks, 2)
print(f"Required weekly gross-fare improve = ${weekly_bump}")
Right here’s what we get.
# Conclusion
Bringing collectively the strengths of SQL and Python, primarily via pandasql
, we solved six totally different issues.
SQL helps in fast filtering and summarizing structured datasets, whereas Python is sweet at superior computation and dynamic manipulation.
All through this evaluation, we leveraged each instruments to simplify the workflow and make every step extra interpretable.
Nate Rosidi is a knowledge scientist and in product technique. He is additionally an adjunct professor educating analytics, and is the founding father of StrataScratch, a platform serving to knowledge scientists put together for his or her interviews with actual interview questions from high corporations. Nate writes on the most recent tendencies within the profession market, provides interview recommendation, shares knowledge science initiatives, and covers all the things SQL.