
Evaluating SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B



Introduction

The ecosystem of LLM inference frameworks has been evolving quickly. As models become larger and more capable, the frameworks that power them must keep pace, optimizing for everything from latency to throughput to memory efficiency. For developers, researchers, and enterprises alike, the choice of framework can dramatically affect both performance and cost.

In this blog, we bring these considerations together by comparing SGLang, vLLM, and TensorRT-LLM. We evaluate how each performs when serving GPT-OSS-120B on 2x NVIDIA H100 GPUs. The results highlight the distinct strengths of each framework and offer practical guidance on which to choose based on your workload and hardware.

Overview of the Frameworks

SGLang: SGLang was designed around the idea of structured generation. It brings unique abstractions like RadixAttention and specialized state management that allow it to deliver low latency for interactive applications. This makes SGLang especially appealing when the workload requires precise control over outputs, such as when generating structured data formats or working with agentic workflows.

vLLM: vLLM has established itself as one of the leading open-source inference frameworks for serving large language models at scale. Its key advantage lies in throughput, powered by continuous batching and efficient memory management via PagedAttention. It also offers broad support for quantization methods like INT8, INT4, GPTQ, AWQ, and FP8, making it a versatile choice for those who want to maximize tokens per second across many concurrent requests.

TensorRT-LLM: TensorRT-LLM is NVIDIA’s TensorRT-based inference runtime, purpose-built to extract maximum performance from NVIDIA GPUs. It is deeply optimized for the Hopper and Blackwell architectures, which means it takes full advantage of hardware features in the H100 and B200. The result is higher efficiency, faster response times, and better scaling as workloads increase. While it requires a bit more setup and tuning compared to the other frameworks, TensorRT-LLM represents NVIDIA’s vision for production-grade inference performance.

| Framework | Design Focus | Key Strengths |
| --- | --- | --- |
| SGLang | Structured generation, RadixAttention | Low latency, efficient token generation |
| vLLM | Continuous batching, PagedAttention | High throughput, supports quantization |
| TensorRT-LLM | TensorRT optimizations | GPU-level efficiency, lowest latency on H100/B200 |

Benchmark Setup and Results

To evaluate the three frameworks fairly, we ran GPT-OSS-120B on 2x NVIDIA H100 GPUs under a variety of conditions. GPT-OSS-120B is a large mixture-of-experts model that pushes the boundaries of open-weight performance. Its size and complexity make it a demanding benchmark, which is exactly why it is ideal for testing inference frameworks and hardware.

We measured three primary categories of performance:

  • Latency – How fast the model generates the first token (time to first token, TTFT) and how quickly it produces subsequent tokens.
  • Throughput – How many tokens per second can be generated under varying levels of concurrency.
  • Concurrency scaling – How well each framework holds up as the number of simultaneous requests increases.
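The latency metrics above can be computed from timestamps collected while consuming a streaming response. A minimal sketch of the measurement logic, using a simulated token stream as a stand-in for a real inference endpoint:

```python
import time

def measure_latency(token_stream):
    """Compute TTFT and mean per-token latency from a token iterator."""
    start = time.perf_counter()
    timestamps = []
    for _ in token_stream:
        timestamps.append(time.perf_counter())
    ttft = timestamps[0] - start
    # Per-token latency: average gap between successive tokens after the first.
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    per_token = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, per_token

def fake_stream(n_tokens=20, delay=0.005):
    """Placeholder generator: yields one 'token' every `delay` seconds."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

ttft, per_token = measure_latency(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, per-token: {per_token * 1000:.1f} ms")
```

In a real benchmark, `fake_stream()` would be replaced by the framework's streaming API response; the timing logic stays the same.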

Latency Results

Let’s start with latency. If you care about responsiveness, two things matter most: the time to first token and the per-token latency once decoding begins.

Here’s how the three frameworks stacked up:

Time to First Token (seconds)

| Concurrency | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- |
| 1 | 0.053 | 0.125 | 0.177 |
| 10 | 1.91 | 1.155 | 2.496 |
| 50 | 7.546 | 3.08 | 4.14 |
| 100 | 1.87 | 8.991 | 5.467 |

Per-Token Latency (seconds)

| Concurrency | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- |
| 1 | 0.005 | 0.004 | 0.004 |
| 10 | 0.011 | 0.01 | 0.009 |
| 50 | 0.021 | 0.015 | 0.018 |
| 100 | 0.019 | 0.021 | 0.049 |

What this shows:

  • vLLM delivered the fastest time to first token at the single-request and 100-request levels, with excellent scaling characteristics under heavy load.
  • SGLang had the most stable per-token latency, staying within roughly 4–21 ms across different loads.
  • TensorRT-LLM showed the slowest time to first token at low concurrency but maintained competitive per-token performance there.

Throughput Results

When it comes to serving multiple requests, throughput is the number to watch. Here’s how the three frameworks performed as concurrency increased:

Total Throughput (tokens/second)

| Concurrency | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- |
| 1 | 187.15 | 230.96 | 242.79 |
| 10 | 863.15 | 988.18 | 867.21 |
| 50 | 2211.85 | 3108.75 | 2162.95 |
| 100 | 4741.62 | 3221.84 | 1942.64 |

One of the most important findings was how vLLM achieved the highest throughput at 100 concurrent requests, reaching 4,741 tokens per second. SGLang showed strong performance at moderate to high concurrency (50 requests), while TensorRT-LLM demonstrated the best single-request throughput but weaker scaling at high concurrency.
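A quick sanity check ties the two result tables together: aggregate throughput is roughly concurrency divided by per-token latency. Using the benchmark's own numbers for vLLM at 100 concurrent requests:

```python
# Back-of-the-envelope check: throughput ≈ concurrency / per-token latency.
# Figures taken from the tables above (vLLM at 100 concurrent requests).
concurrency = 100
per_token_latency = 0.019      # seconds per token
measured_throughput = 4741.62  # tokens per second

estimated = concurrency / per_token_latency
print(f"estimated ~{estimated:.0f} tok/s vs measured {measured_throughput} tok/s")
```

The estimate (~5,263 tok/s) overshoots the measured 4,741 tok/s because it ignores time-to-first-token and scheduling overhead, but it is a useful consistency check when reading benchmark tables.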

Framework Analysis and Recommendations

SGLang

  • Strengths: Stable per-token latency, strong throughput at moderate concurrency, good overall stability.

  • Weaknesses: Slower time-to-first-token for single requests; throughput plateaus at 100 concurrent requests.

  • Best For: Moderate- to high-throughput applications, scenarios requiring consistent token generation timing.

vLLM

  • Strengths: Fastest time-to-first-token at the lowest and highest concurrency levels, highest throughput at high concurrency, excellent scaling.

  • Weaknesses: Slightly higher per-token latency at high loads.

  • Best For: Interactive applications, high-concurrency deployments, scenarios prioritizing fast initial responses and maximum throughput scaling.

TensorRT-LLM

  • Strengths: Best single-request throughput, competitive per-token latency at low concurrency, hardware-optimized performance.

  • Weaknesses: Slowest time-to-first-token, poor scaling at high concurrency, significantly degraded per-token latency at 100 requests.

  • Best For: Single-user or low-concurrency applications, scenarios where hardware optimization matters more than scaling.

Conclusion

No single framework outperforms across all categories. Instead, each has been optimized for different goals, and the right choice depends on workload and infrastructure.

  • Use vLLM for interactive applications and high-concurrency deployments requiring fast responses and maximum throughput scaling.
  • Choose SGLang when moderate throughput and consistent performance are needed.
  • Deploy TensorRT-LLM for single-user applications or when maximizing hardware efficiency at low concurrency is the priority.

The key takeaway is that choosing the right framework depends on workload type and hardware availability, rather than looking for a universal winner. Running GPT-OSS-120B on NVIDIA H100 GPUs with these optimized inference frameworks unlocks powerful options for building and deploying AI applications at scale.

It is worth noting that these performance characteristics can shift dramatically depending on your GPU hardware. We also extended the benchmarks to B200 GPUs, where TensorRT-LLM consistently outperformed both SGLang and vLLM across all metrics, thanks to its deeper optimization for NVIDIA’s latest hardware architecture.

This highlights that framework selection is not just about software capabilities; it is equally about matching the right framework to your specific hardware to unlock maximum performance potential.

 

You can explore the full set of benchmark results here.

Bonus: Serve a Model with Your Preferred Framework

Getting started with these frameworks is straightforward. With Clarifai’s Compute Orchestration, you can serve GPT-OSS-120B, other open-weight models, or your own custom models from your preferred inference engine, whether that is SGLang, vLLM, or TensorRT-LLM.

From setting up the runtime to deploying a production-ready API, you can quickly go from model to application. The best part is that you are not locked into a single framework. You can experiment with different runtimes and choose the one that best aligns with your performance and cost requirements.
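Part of what makes runtime swapping painless is that all three engines can expose an OpenAI-compatible `/v1/chat/completions` endpoint, so the same client request works against each. A minimal sketch of building such a request body (the model name and prompt here are illustrative placeholders, not from the benchmark):

```python
import json

def chat_request(model: str, prompt: str, stream: bool = True) -> bytes:
    """Build an OpenAI-compatible chat completion request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,   # stream tokens so TTFT can be measured client-side
        "max_tokens": 256,
    }
    return json.dumps(payload).encode("utf-8")

# The same body can be POSTed to a vLLM, SGLang, or TensorRT-LLM server that
# speaks the OpenAI API; only the base URL changes per runtime.
body = chat_request("gpt-oss-120b", "Summarize PagedAttention in one line.")
print(body.decode("utf-8"))
```

Keeping the client on the OpenAI wire format is what lets you benchmark or switch runtimes without touching application code.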

This flexibility makes it easy to integrate cutting-edge frameworks into your workflows and ensures you are always getting the best possible performance from your hardware. Check out the documentation to learn how to upload your own models.


