
Amazon Machine Learning Project: Sales Data in Python


Machine learning projects work best when they connect theory to real business outcomes. In e-commerce, that means better revenue, smoother operations, and happier customers, all driven by data. By working with realistic datasets, practitioners learn how models turn patterns into decisions that actually matter.

This article walks through a full machine learning workflow using an Amazon sales dataset, from problem framing to a submission-ready prediction file. It gives learners a clear view of how models turn insights into business value.

Understanding the problem statement

Before moving on to the coding part, it is essential to step back and understand the problem statement. The dataset consists of Amazon-style e-commerce transactions that reflect realistic online shopping patterns.

The primary objective of this project is to predict order outcomes and analyze revenue-driving factors using structured transactional data. To do that, we build a supervised machine learning model that learns from past transaction data and forecasts outcomes on a new test dataset.

Key Business Questions Addressed

  • Which factors influence the final order amount? 
  • How do discounts, taxes, and shipping costs affect revenue? 
  • Can we predict order status or total transaction value accurately? 
  • What insights can businesses extract to improve sales performance? 

About the dataset

The dataset consists of 100,000 e-commerce transactions that follow Amazon's transaction style and include 20 organized data fields. The synthetic data mirrors genuine customer behavior patterns along with real business operation processes.

The dataset captures pricing across different product types and customer groups, along with payment options and order tracking statuses. These properties make it well suited for machine learning, analytical work, and dashboard development. The sections and fields are listed below.

Section and field names:

  • Order Details: OrderID, OrderDate, OrderStatus, SellerID
  • Customer Information: CustomerID, CustomerName, City, State, Country
  • Product Information: ProductID, ProductName, Category, Brand, Quantity
  • Pricing & Revenue Metrics: UnitPrice, Discount, Tax, ShippingCost, TotalAmount
  • Payment Details: PaymentMethod

Load essential Python libraries

Model development begins with importing the essential Python libraries for data work. Pandas and NumPy handle data manipulation and numerical calculations, Matplotlib and Seaborn cover the visualization needs, and scikit-learn provides preprocessing utilities and ML algorithms. Here is the set of imports used throughout the project: 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

These libraries let us perform four main activities: loading CSV data, cleaning and transforming it, charting trends, and building the prediction model. 

Load the dataset

Once the environment is set up, we import the data into a Pandas DataFrame. This step turns the raw CSV file into a format that can be analyzed and manipulated programmatically. 

df = pd.read_csv("Amazon.csv")

print("Shape:", df.shape)
Shape: (100000, 20)

After loading, we check the data structure to confirm it was imported correctly. We verify the dataset's dimensions and look for any early issues that could affect data quality. 

print("nMissing values:n", df.isna().sum()) 

df.head()
Missing values:

OrderID          0
OrderDate        0
CustomerID       0
CustomerName     0
ProductID        0
ProductName      0
Category         0
Brand            0
Quantity         0
UnitPrice        0
Discount         0
Tax              0
ShippingCost     0
TotalAmount      0
PaymentMethod    0
OrderStatus      0
City             0
State            0
Country          0
SellerID         0

dtype: int64

OrderID OrderDate CustomerID CustomerName ProductID ProductName Category Brand Quantity UnitPrice Discount Tax ShippingCost TotalAmount PaymentMethod OrderStatus City State Country SellerID
ORD0000001 2023-01-31 CUST001504 Vihaan Sharma P00014 Drone Mini Books BrightLux 3 106.59 0.00 0.00 0.09 319.86 Debit Card Delivered Washington DC India SELL01967
ORD0000002 2023-12-30 CUST000178 Pooja Kumar P00040 Microphone Home & Kitchen UrbanStyle 1 251.37 0.05 19.10 1.74 259.64 Amazon Pay Delivered Fort Worth TX United States SELL01298
ORD0000003 2022-05-10 CUST047516 Sneha Singh P00044 Power Bank 20000mAh Clothing UrbanStyle 3 35.03 0.10 7.57 5.91 108.06 Debit Card Delivered Austin TX United States SELL00908
ORD0000004 2023-07-18 CUST030059 Vihaan Reddy P00041 Webcam Full HD Home & Kitchen Zenith 5 33.58 0.15 11.42 5.53 159.66 Cash on Delivery Delivered Charlotte NC India SELL01164

Data Preprocessing

1. Decomposing Date Features 

Models can't do math on a string like "2023-01-31". Splitting it into numeric parts such as "Month: 1" and "Year: 2023" creates essential numerical attributes that can capture seasonal patterns, including holiday sales. 

df["OrderDate"] = pd.to_datetime(df["OrderDate"], errors="coerce") 
df["OrderYear"] = df["OrderDate"].dt.12 months
df["OrderMonth"] = df["OrderDate"].dt.month 
df["OrderDay"] = df["OrderDate"].dt.day

We have successfully extracted three new features: OrderYear, OrderMonth, and OrderDay. The model can now learn patterns such as "December brings higher sales"; note that capturing weekend effects would require a day-of-week feature rather than the day of the month. A quick aggregation, shown below, illustrates the seasonal signal these columns carry. 
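For example, a small group-by over the new month column surfaces any month-to-month pattern (an optional check, not part of the modeling pipeline; it assumes df still contains TotalAmount at this point):

# Average order value per calendar month (quick seasonality check)
monthly_avg = df.groupby("OrderMonth")["TotalAmount"].mean()
print(monthly_avg)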

2. Dropping Irrelevant Features 

The model only needs certain columns. Unique identifiers (OrderID, CustomerID) carry no predictive information and push the model toward memorizing the training data, i.e. overfitting. We also drop OrderDate, since we have just extracted its useful parts. 

cols_to_drop = [ 
   "OrderID", 
   "CustomerID", 
   "CustomerName", 
   "ProductID", 
   "ProductName", 
   "SellerID", 
   "OrderDate",   # already decomposed 
] 

df = df.drop(columns=cols_to_drop)

The DataFrame now contains only the fields that carry predictive value. The model can detect general patterns through features such as product category and tax rates, while removing customer-specific IDs eliminates a source of leakage and noise. 

3. Handling Missing Values 

The initial check showed no missing values, but the pipeline should handle real-world conditions, where incoming data with gaps would otherwise break the model. We add a safety net by filling gaps with the median (for numbers) or "Unknown" (for text). 

print("nMissing values after transformations:n", df.isna().sum())

# If any lacking values in numeric columns, fill with median
numeric_cols = df.select_dtypes(embody=["int64", "float64"]).columns.tolist()

for col in numeric_cols:
    if df[col].isna().sum() > 0:
        df[col] = df[col].fillna(df[col].median())
Category         0
Brand            0
Quantity         0
UnitPrice        0
Discount         0
Tax              0
ShippingCost     0
TotalAmount      0
PaymentMethod    0
OrderStatus      0
City             0
State            0
Country          0
OrderYear        0
OrderMonth       0
OrderDay         0
dtype: int64
# For categorical columns, fill with "Unknown"
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()

for col in categorical_cols:
    df[col] = df[col].fillna("Unknown")

print("\nFinal dtypes after cleaning:\n")
print(df.dtypes)
Category         object
Brand            object
Quantity          int64
UnitPrice       float64
Discount        float64
Tax             float64
ShippingCost    float64
TotalAmount     float64
PaymentMethod    object
OrderStatus      object
City             object
State            object
Country          object
OrderYear         int32
OrderMonth        int32
OrderDay          int32
dtype: object

The pipeline is now robust. The final dtypes check confirms that the data is fully prepped: all categorical variables are objects (ready for encoding) and all numerical variables are int32 or float64 (ready for scaling). 

Exploratory data analysis (EDA)

Exploratory data analysis begins with an initial examination of the data, which we treat like an interview to learn about its characteristics. The investigation has three main parts, used to identify patterns, spot outliers, and study distributions.

Statistical summary: We need to understand the mathematical properties of the numerical columns. Are the prices reasonable? Are there negative values where none should exist? 

# 2. Basic data understanding / EDA (lightweight)
print("\nDescriptive stats (numeric):\n")

df.describe()

The descriptive statistics table provides critical context: 

  • Quantity: values range from 1 to 5, with an average of about 3. This reflects typical retail shopping behavior rather than bulk B2B purchasing. 
  • UnitPrice: prices range from 5.00 to 599.99, which shows that the catalog spans several product tiers. 

The target variable TotalAmount shows wide variance: its standard deviation approaches 724, so the model must handle transactions ranging from small purchases up to a maximum of 3534.98. 

Categorical Analysis 

We need to know the cardinality (number of unique values) of the categorical features. High cardinality, for example thousands of unique cities, can bloat the model after encoding and lead to overfitting. 

print("nUnique values in some categorical columns:")

for col in ["Category", "Brand", "PaymentMethod", "OrderStatus", "Country"]:
    print(f"{col}: {df[col].nunique()} distinctive")
Distinctive values in some categorical columns: 

Class: 6 distinctive 
Model: 10 distinctive 
PaymentMethod: 6 distinctive 
OrderStatus: 5 distinctive 
Nation: 5 distinctive

Visualizing the Target Distribution 

The histogram shows the frequency of different transaction amounts, and the smooth KDE curve shows the density. Because the curve is only slightly right-skewed, tree-based models like Random Forest handle it well. 

sns.histplot(df["TotalAmount"], kde=True)
plt.title("TotalAmount distribution")
plt.show()

The TotalAmount visualization lets us determine whether the data is heavily skewed. A log transformation would only be needed if the distribution showed extreme skewness, with a few high-priced products and numerous low-cost items; a sketch of that optional step follows. 
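For reference, this is a minimal sketch of that transformation (not applied in this project, since the skew is mild); np.log1p is used so that zero amounts stay valid:

# Optional: log-transform the target if it were heavily right-skewed
df["LogTotalAmount"] = np.log1p(df["TotalAmount"])

sns.histplot(df["LogTotalAmount"], kde=True)
plt.title("log(1 + TotalAmount) distribution")
plt.show()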

Bar Graph
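The original bar chart is not reproduced here; as an illustrative stand-in (an assumption about what the chart showed), a count of orders per product category can be drawn like this:

# Orders per product category (illustrative bar chart)
category_counts = df["Category"].value_counts()

sns.barplot(x=category_counts.index, y=category_counts.values)
plt.title("Number of orders per category")
plt.xticks(rotation=45)
plt.show()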

Feature Engineering

Feature engineering creates new variables by transforming existing ones to boost model performance. In supervised learning, we must explicitly tell the model what to predict (y) and which data to use for making that prediction (X). 

target_column = "TotalAmount"

X = df.drop(columns=[target_column])
y = df[target_column]

# include int32 so the extracted date features are kept as numeric inputs
numeric_features = X.select_dtypes(include=["int64", "int32", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

print("\nNumeric features:", numeric_features)
print("Categorical features:", categorical_features)

Splitting the train and test data 

Model evaluation requires separate data, because a model cannot be judged on the data it was trained on; that would be like giving students the exam answers before the test. The data is split into two parts: a training set the model learns from and a test set used to verify the results. 

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("nTrain form:", X_train.form, "Take a look at form:", X_test.form)

Here we apply the 80-20 rule: 80% of the rows, chosen at random, become the training data and the remaining 20% are held back as the test set. 

Build the Machine Learning Model

Creating the ML pipeline involves the following steps:

1. Creating Preprocessing Pipelines 

The raw numbers sit on very different scales, from 1-5 for Quantity up to hundreds for UnitPrice. Scaling the data helps models converge faster. One-hot encoding converts categorical text into numerical format. The ColumnTransformer lets us apply a different transformation to each column type in the dataset. 

numeric_transformer = Pipeline(
    steps=[
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

2. Defining the Random Forest Model 

We chose a Random Forest Regressor for this project. This ensemble method builds many decision trees and averages their individual predictions to produce the forecast. It is robust against overfitting and excels at handling non-linear relationships between variables. 

model = RandomForestRegressor(
    n_estimators=200,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)

# Full pipeline
regressor = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", model),
    ]
)

We set n_estimators=200 to build 200 decision trees and n_jobs=-1 to use all CPU cores for faster training. As a best practice, the preprocessor and the model are combined into a single Pipeline object so the entire workflow is handled as one unit. 

3. Training the Model 

This stage is where the learning happens. The pipeline runs the training data through the transformation steps and then fits the Random Forest model on the transformed data. 

regressor.fit(X_train, y_train)

print("\nModel training complete.")

The model now understands how the different input variables (Category, UnitPrice, Tax, and so on) relate to the output variable (TotalAmount). The sketch below shows one way to peek at which inputs matter most. 
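As an optional check (a minimal sketch, assuming scikit-learn 1.0 or newer so that get_feature_names_out is available), the fitted pipeline can report feature importances:

# Inspect which (encoded) features the forest relies on most
feature_names = regressor.named_steps["preprocessor"].get_feature_names_out()
importances = regressor.named_steps["model"].feature_importances_

top_features = pd.Series(importances, index=feature_names).sort_values(ascending=False).head(10)
print(top_features)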

Make predictions on the test dataset

Now we test the model on the test data (the 20,000 "unseen" records). Model performance is assessed with statistical metrics that compare the predicted values (y_pred) with the actual values (y_test). 

y_pred = regressor.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("nTest metrics:")
print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
print("R2  :", r2)
Test metrics: 

MAE : 3.886121525000014 
MSE : 41.06268576375389 
RMSE: 6.408017303640331 
R2  : 0.99992116450905

This indicates: 

  • The Mean Absolute Error (MAE) is roughly 3.88, so the prediction is off by about $3.88 on average. 
  • The R² score is roughly 0.9999, which is near perfect. The independent variables (price, quantity, discount, tax, shipping) almost completely determine TotalAmount. That is expected for synthetic financial data, where the total essentially follows UnitPrice × Quantity × (1 − Discount) + Tax + ShippingCost, as the sample rows above suggest; a quick check appears after this list. 
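Here is that check (an illustrative snippet, assuming Discount is stored as a fractional rate applied to the line subtotal, as the sample rows shown earlier suggest):

# Reconstruct TotalAmount from its components and measure the largest deviation
reconstructed = (
    df["UnitPrice"] * df["Quantity"] * (1 - df["Discount"])
    + df["Tax"]
    + df["ShippingCost"]
)
print("Max absolute difference:", (reconstructed - df["TotalAmount"]).abs().max())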

Prepare the submission file

Predictions must be presented in the predetermined output format, which should not be altered. 

# OrderID was dropped during preprocessing, so recover it from the original file;
# index alignment works because the row order was never changed
order_ids = pd.read_csv("Amazon.csv")["OrderID"]

submission = pd.DataFrame({
    "OrderID": order_ids.loc[X_test.index],
    "PredictedTotalAmount": y_pred
})

submission.to_csv("submission.csv", index=False)

This file can be submitted directly to an evaluation system, and stakeholders can also download it for review. 

Conclusion

This machine learning project walks through the complete process of turning raw e-commerce transaction data into useful predictive output. The structured workflow lets you handle real datasets with confidence and a clear understanding of each stage, from preprocessing and EDA to feature engineering, modeling, and evaluation. 

The project builds your machine learning skills while training you to handle real work situations. With further optimization, the pipeline could be extended toward a recommendation system or more advanced models and deep learning techniques. 

Frequently Asked Questions

Q1. What is the main objective of this Amazon sales machine learning project?

A. It aims to predict the total order amount using transactional and pricing data.

Q2. Why was a Random Forest model chosen for this project?

A. It captures complex patterns and reduces overfitting by combining many decision trees.

Q3. What does the final submission file contain?

A. It includes OrderID and the model's predicted total amount for each order.

Hey! I'm Vipin, a passionate data science and machine learning enthusiast with a strong foundation in data analysis, machine learning algorithms, and programming. I have hands-on experience in building models, managing messy data, and solving real-world problems. My goal is to apply data-driven insights to create practical solutions that drive results. I'm eager to contribute my skills in a collaborative environment while continuing to learn and grow in the fields of Data Science, Machine Learning, and NLP.
