In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch and torch.compile, ONNX Runtime, and quantized ONNX. By doing this step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all within a Google Colab environment.
!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate
from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8
print(f"Gadget: {DEVICE} | torch={torch.__version__}")
We begin by installing the required libraries and setting up our environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we check whether we run on CPU or GPU.
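As a quick optional check (not part of the original code), we can also list the execution providers our ONNX Runtime build exposes, so we know in advance whether the CUDAExecutionProvider used later is actually available in this session.

# Optional check: confirm which ONNX Runtime execution providers are available.
import onnxruntime as ort

print("ONNX Runtime providers:", ort.get_available_providers())
print("CUDA visible to PyTorch:", torch.cuda.is_available())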
ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")
def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]
def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))
We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we fairly compare different engines using identical data and batching.
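Before timing real models, a small smoke test can confirm that the harness behaves as expected. The dummy_predict helper below is hypothetical (not part of the original tutorial); it always predicts class 1, so its accuracy merely reflects the label balance while exercising the same batching, scoring, and timing paths.

# Hypothetical smoke test: a predictor with no model behind it.
def dummy_predict(toks):
    # One prediction per example in the batch.
    return [1] * toks["input_ids"].shape[0]

dummy_ms, dummy_sd = bench(dummy_predict, texts)
dummy_acc = run_eval(dummy_predict, texts, labels)
print(f"[Dummy predictor] {dummy_ms:.1f}±{dummy_sd:.1f} ms | acc={dummy_acc:.4f}")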
torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()
@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")
compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))
@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")
We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark/score it on SST-2. We then attempt torch.compile for just-in-time graph optimizations and, if successful, run the same benchmarks to compare speed and accuracy under an identical setup.
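Before reading too much into the torch.compile numbers, we can optionally confirm that the compiled model agrees with the eager one. The minimal parity check below is our own illustration (not in the original code): it compares argmax predictions on a single batch, since small numeric drift in the logits is expected.

# Optional parity check between eager and compiled models on one batch.
if compile_ok:
    sample = next(make_batches(texts[:BATCH]))
    sample = {k: v.to(DEVICE) for k, v in sample.items()}
    with torch.no_grad():
        eager_pred = torch_model(**sample).logits.argmax(-1)
        compiled_pred = compiled_model(**sample).logits.argmax(-1)
    print("Eager and compiled predictions match:", bool((eager_pred == compiled_pred).all()))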
provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)
@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")
Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
qconfig = QuantizationConfig(approach="dynamic", per_channel=False, reduce_range=True)
quantizer.quantize(model_input=ORT_DIR, quantization_config=qconfig, save_dir=Q_DIR)
ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)
@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")
We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum's ORTQuantizer and benchmark both to see how latency improves while accuracy stays comparable.
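Dynamic quantization also shrinks the model on disk. The snippet below is a rough, optional check we add here for illustration; it assumes the exported FP32 ONNX file ends up somewhere under ORT_DIR and the quantized one under Q_DIR (exact file names and locations can vary across Optimum versions, so we search recursively).

# Rough on-disk size comparison of the FP32 and quantized ONNX files.
def onnx_size_mb(root: Path) -> float:
    return sum(p.stat().st_size for p in Path(root).rglob("*.onnx")) / 1e6

fp32_mb, int8_mb = onnx_size_mb(ORT_DIR), onnx_size_mb(Q_DIR)
print(f"ONNX FP32: {fp32_mb:.1f} MB | ONNX INT8: {int8_mb:.1f} MB")
if fp32_mb > 0 and int8_mb > 0:
    print(f"~{fp32_mb / int8_mb:.1f}x smaller after dynamic quantization")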
pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE=="cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
"What a fantastic movie—performed brilliantly!",
"This was a complete waste of time.",
"I’m not sure how I feel about this one."
]
print("nSample predictions (PT | ORT):")
for s in samples:
    a = pt_pipe(s)[0]["label"]
    b = ort_pipe(s)[0]["label"]
    print(f"- {s}\n  PT={a} | ORT={b}")
import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
["ONNX Runtime", ort_ms, ort_sd, ort_acc],
["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok: rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)
print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For bigger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use QuantizationConfig(approach="static") with a calibration set.
""")
We sanity-check predictions with quick sentiment pipelines and print PyTorch vs ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting torch.compile results when available. We conclude with practical notes, allowing us to extend the workflow to other backends and quantization modes.
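As an optional extension (not in the original code), we can also express each engine's mean latency as a speedup over the eager PyTorch baseline, reusing the summary table we just built.

# Optional: add a speedup column relative to the PyTorch eager baseline.
summary = df.copy()
summary["Speedup vs eager"] = (pt_ms / summary["Mean ms (↓)"]).round(2)
display(summary)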
In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends, such as OpenVINO or TensorRT.
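As one concrete direction for that extension, the sketch below shows how the same checkpoint could be exported to OpenVINO via optimum-intel and dropped into the familiar pipeline API. This is only an illustration under the assumption that optimum[openvino] is installed; it is not part of the tutorial above and was not benchmarked here.

# Sketch only: requires `pip install "optimum[openvino]"` (not run above).
from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

ov_model = OVModelForSequenceClassification.from_pretrained(MODEL_ID, export=True)
ov_pipe = pipeline("sentiment-analysis", model=ov_model,
                   tokenizer=AutoTokenizer.from_pretrained(MODEL_ID))
print(ov_pipe("A genuinely delightful film.")[0])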
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.