
Leveraging Pandas and SQL Together for Efficient Data Analysis


What is pandasql (Image by Author | Canva)

 

Pandas and SQL are both effective for data analysis, but what if we could merge their power? With pandasql, you can write SQL queries directly inside a Jupyter notebook. This integration lets us seamlessly combine SQL logic with Python for effective data analysis.

In this article, we will use both Pandas and SQL together on a data project from Uber. Let's get started!

 

What Is pandasql?

 
Pandasql works with any DataFrame through an in-memory SQLite engine, so you can write pure SQL inside a Python environment.
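For instance, here is a minimal sketch (using a small, made-up DataFrame purely for illustration) of how sqldf treats any DataFrame in scope as a SQL table:

import pandas as pd
from pandasql import sqldf

# A small, made-up DataFrame used only to illustrate the idea
drivers = pd.DataFrame({"name": ["A", "B", "C"], "trips": [12, 7, 15]})

# pandasql copies the DataFrame into an in-memory SQLite database and runs the query there
sqldf("SELECT name, trips FROM drivers WHERE trips >= 10;", locals())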

 

Advantages of Using Pandas and SQL Together

 
 
 

SQL is useful for easily filtering rows, aggregating data, or applying multi-condition logic.
Python, on the other hand, offers advanced tools for statistical analysis and custom computations, as well as set-based operations, which extend beyond SQL's capabilities.
When used together, SQL simplifies data selection, while Python adds analytical flexibility.

 

How to Run pandasql Inside a Jupyter Notebook?

 
To run pandasql inside a Jupyter Notebook, start with the following code.

import pandas as pd
from pandasql import sqldf

# Helper that runs a SQL query against the DataFrames in the global scope
run = lambda q: sqldf(q, globals())

 

Next, you can run your SQL code like this:

run("""
SELECT *
FROM df
LIMIT 10;
""")

 

We will use this pattern throughout the article without repeating the helper definition each time.
 
 

Let's see how using SQL and Pandas together works in a real-life project from Uber.

 

Real-World Project: Analyzing Uber Driver Performance Data

 

Real-World Project: Analyzing Uber Driver Performance Data (Image by Author)

 

In this data project, Uber asks us to analyze driver performance data and evaluate bonus strategies.

 

// Data Exploration and Analytics

Now, let's explore the datasets. First, we will load the data.

 

// Initial Dataset Loading

Let's load the dataset using just pandas.

import pandas as pd
import numpy as np
df = pd.read_csv('dataset_2.csv')

 

// Exploring the Data

Now let's review the dataset.
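For example (the exact cell may differ), df.head() gives a quick preview of the first rows:

# Preview the first few rows of the dataset (illustrative; the original cell isn't shown)
df.head()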

 

The output looks like this:
 
Data Exploration and Analytics
 

Now we have a glimpse of the data.
As you can see, the dataset includes each driver's name, the number of trips they completed, their acceptance rate (i.e., the percentage of trip requests accepted), total supply hours (the total hours spent online), and their average rating.
Let's verify the column names before starting the data analysis so we can use them correctly.
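For example (again, the exact cell may differ), df.info() lists each column together with its non-null count:

# Inspect column names, dtypes, and non-null counts (illustrative; the original cell isn't shown)
df.info()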

 

Here is the output.

 
Data Exploration and Analytics
 

As you can see, our dataset has five different columns, and there are no missing values.
Let's now answer the questions using both SQL and Python.

 

Question 1: Who Qualifies for Bonus Option 1?

 
In the first question, we are asked to determine the total bonus payout for Option 1, which is:

$50 for each driver that is online at least 8 hours, accepts 90% of requests, completes 10 trips, and has a rating of 4.7 or better during the time frame.

 

 

// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)

In this step, we will start using pandasql.

In the following code, we select all drivers who meet the conditions for the Option 1 bonus using the WHERE clause and the AND operator to link multiple conditions. To learn how to use WHERE and AND, refer to this documentation.

opt1_eligible = run("""
    SELECT Name                -- keep only the name column for readability
    FROM   df
    WHERE  `Supply Hours`     >=  8
      AND  `Trips Completed`  >= 10
      AND  `Accept Rate`      >= 90
      AND  Rating             >= 4.7;
""")
opt1_eligible

 

Here is the output.

 
Output showing drivers eligible for Option 1
 

// Step 2: Finishing in Pandas

After filtering the dataset using SQL with pandasql, we switch to Pandas to perform the numerical calculations and finalize the analysis. This hybrid approach, which combines SQL and Python, enhances both readability and flexibility.

Next, using the following Python code, we calculate the total payout by multiplying the number of qualified drivers (using len()) by the $50 bonus per driver. Check out the documentation to see how you can use the len() function.

payout_opt1 = 50 * len(opt1_eligible)   # $50 bonus per qualifying driver
print(f"Option 1 payout: ${payout_opt1:,}")

 

Here is the output.

 
Finish in Pandas
 

Question 2: Calculating the Total Payout for Bonus Option 2

 
In the second question, we are asked to find the total bonus payout using Option 2:

$4/trip for all drivers who complete 12 trips and have a 4.7 or better rating.

 

 

// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)

First, we use SQL to filter for drivers who meet the Option 2 criteria: completing at least 12 trips and maintaining a rating of 4.7 or higher.

# Grab only the rows that satisfy the Option 2 thresholds
opt2_drivers = run("""
    SELECT Name,
           `Trips Completed`
    FROM   df
    WHERE  `Trips Completed` >= 12
      AND  Rating            >= 4.7;
""")
opt2_drivers.head()

 

Here's what we get.

 
Filter the qualifying drivers with SQL (pandasql)
 

// Step 2: Finishing the Calculation in Pure Pandas

Now let's perform the calculation using Pandas. The code computes the total bonus by summing the Trips Completed column with sum() and then multiplying the result by the $4 bonus per trip.

total_trips   = opt2_drivers["Trips Completed"].sum()
option2_bonus = 4 * total_trips          # $4 bonus per completed trip
print(f"Total trips: {total_trips},  Option 2 payout: ${option2_bonus}")

 

Here is the result.

 
Finish the calculation in pure Pandas
 

Question 3: Identifying Drivers Who Qualify for Option 1 but Not Option 2

 
In the third question, we are asked to count the number of drivers who qualify for Option 1 but not for Option 2.

 

// Step 1: Building Two Eligibility Tables with SQL (pandasql)

In the following SQL code, we create two datasets: one for drivers who meet the Option 1 criteria and another for those who meet the Option 2 criteria.

# All Option 1 drivers
opt1_drivers = run("""
    SELECT Name
    FROM   df
    WHERE  `Supply Hours`     >=  8
      AND  `Trips Completed`  >= 10
      AND  `Accept Rate`      >= 90
      AND  Rating             >= 4.7;
""")

# All Option 2 drivers
opt2_drivers = run("""
    SELECT Name
    FROM   df
    WHERE  `Trips Completed` >= 12
      AND  Rating            >= 4.7;
""")

 

// Step 2: Using Python Set Logic to Spot the Difference

Next, we will use Python to identify the drivers who appear in Option 1 but not in Option 2, and we will use set operations for that.

Here is the code:

only_opt1 = set(opt1_drivers["Name"]) - set(opt2_drivers["Name"])
count_only_opt1 = len(only_opt1)

print(f"Drivers qualifying for Option 1 but not Option 2: {count_only_opt1}")

 

Here is the output.

 
Use Python set logic to spot the difference
 

By combining these methods, we leverage SQL for filtering and Python's set logic for comparing the resulting datasets.

 

Question 4: Finding Low-Performance Drivers with High Ratings

 
In question 4, we are asked to determine the percentage of drivers who completed fewer than 10 trips, had an acceptance rate below 90%, and still maintained a rating of 4.7 or higher.

 

// Step 1: Pulling the Subset with SQL (pandasql)

In the following code, we select all drivers who have completed fewer than 10 trips, have an acceptance rate of less than 90%, and hold a rating of at least 4.7.

low_kpi_df = run("""
    SELECT *
    FROM   df
    WHERE  `Trips Completed` < 10
      AND  `Accept Rate`     < 90
      AND  Rating            >= 4.7;
""")
low_kpi_df

 

Here is the output.

 
Pull the subset with SQL (pandasql)
 

// Step 2: Calculating the Percentage in Plain Pandas

In this step, we will use Python to calculate the percentage of such drivers.

We simply divide the number of filtered drivers by the total driver count, then multiply by 100 to get the percentage.

Here is the code:

num_low_kpi   = len(low_kpi_df)
total_drivers = len(df)
percentage    = round(100 * num_low_kpi / total_drivers, 2)

print(f"{num_low_kpi} out of {total_drivers} drivers ⇒ {percentage}%")

 

Here is the output.
 
Calculate the percentage in plain Pandas
 

Question 5: Calculating Annual Profit Without Partnering With Uber

 
In the fifth question, we need to calculate the annual profit of a taxi driver who does not partner with Uber, based on the given cost and revenue parameters.

 

// Step 1: Pulling Yearly Revenue and Expenses with SQL (pandasql)

Using SQL, we first calculate the yearly revenue from daily fares, along with the yearly expenses for gas, rent, and insurance.

taxi_stats = run("""
SELECT
    200*6*(52-3)                      AS annual_revenue,   -- $200/day in fares, 6 days a week, 49 working weeks
    ((200+500)*(52-3) + 400*12)       AS annual_expenses   -- weekly gas and rent, plus monthly insurance
""")
taxi_stats

 

Here is the output.
 
Pulling yearly revenue and yearly expenses with SQL (pandasql)
 

// Step 2: Deriving Profit and Margin with Pandas

In the next step, we will use Python to compute the profit and margin the driver earns when not partnering with Uber.

rev  = taxi_stats.loc[0, "annual_revenue"]
cost = taxi_stats.loc[0, "annual_expenses"]

profit = rev - cost
margin = round(100 * profit / rev, 2)

print(f"Revenue  : ${rev:,}")
print(f"Expenses : ${cost:,}")
print(f"Profit   : ${profit:,}    (margin: {margin}%)")

 

Here's what we get.

 
Pandas derives profit and margin from those SQL numbers
 

Question 6: Calculating the Required Fare Increase to Maintain Profitability

 
In the sixth question, we assume that the same driver decides to buy a Town Car and partner with Uber.

The gas expenses increase by 5%, insurance decreases by 20%, and rental costs are eliminated, but the driver needs to cover the $40,000 cost of the car. We are asked to calculate how much this driver's weekly gross fares must increase in the first year to both pay off the car and maintain the same annual profit margin.

 

 

// Step 1: Building the New One-Year Expense Stack with SQL

In this step, we will use SQL to calculate the new one-year expenses with adjusted gas and insurance costs and no rental fees, plus the cost of the car.

new_exp = run("""
SELECT
    40000             AS car,
    200*1.05*(52-3)   AS gas,        -- +5%
    400*0.80*12       AS insurance   -- -20%
""")
new_cost = new_exp.sum(axis=1).iloc[0]
new_cost

 

Here is the output.
 
SQL builds the new one-year expense stack
 

// Step 2: Calculating the Weekly Fare Increase with Pandas

Next, we use Python to calculate how much more the driver must earn per week to preserve that margin after buying the car.

# Existing values from Question 5
old_rev    = 58800
old_profit = 19700
old_margin = old_profit / old_rev
weeks      = 49

# new_cost was calculated in the previous step (54130.0)

# We need to find the new revenue (new_rev) such that the profit margin stays the same:
# (new_rev - new_cost) / new_rev = old_margin
# Solving for new_rev gives: new_rev = new_cost / (1 - old_margin)
new_rev_required = new_cost / (1 - old_margin)

# The total increase in annual revenue needed is the difference
total_increase = new_rev_required - old_rev

# Divide by the number of working weeks to get the required weekly increase
weekly_bump = round(total_increase / weeks, 2)

print(f"Required weekly gross-fare increase = ${weekly_bump}")

 

Here's what we get.
 
Pandas uses the old profit margin and algebra to find the weekly bump
 

Conclusion

 
By bringing together the strengths of SQL and Python, primarily through pandasql, we solved six different problems.

SQL helps with quick filtering and summarizing of structured datasets, while Python excels at advanced computation and dynamic manipulation.

Throughout this analysis, we leveraged both tools to simplify the workflow and make each step more interpretable.
 
 

Nate Rosidi is a data scientist and in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


