
Benchmarking GPT-OSS Across H100s and B200s



This blog post focuses on new features and enhancements. For a complete list, including bug fixes, please see the release notes.

Benchmarking GPT-OSS Across H100s and B200s

OpenAI has released gpt-oss-120b and gpt-oss-20b, a new generation of open-weight reasoning models under the Apache 2.0 license. Built for strong instruction following, powerful tool use, and advanced reasoning, these models are designed for next-generation agentic workflows.

With a Mixture of Experts (MoE) design, an extended context length of 131K tokens, and quantization that allows the 120b model to run on a single 80 GB GPU, GPT-OSS combines massive scale with practical deployment. Developers can adjust reasoning levels from low to high to optimize for speed, cost, or accuracy, and use built-in browsing, code execution, and custom tools for complex workflows.
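A back-of-envelope calculation (our own, assuming ~120B weights and MXFP4-style 4-bit quantization at roughly 4.25 bits per weight, i.e. 4-bit values plus a shared scale per 32-weight block) shows why the quantized model fits on a single 80 GB card while the unquantized one cannot:

```python
# Rough estimate of weight memory for a ~120B-parameter model at different
# precisions. The 4.25 bits/weight figure assumes MXFP4-style quantization
# (4-bit values plus a shared scale per 32-weight block).

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

params = 120e9
print(f"FP16 : {weight_memory_gb(params, 16):.2f} GB")   # far beyond one 80 GB GPU
print(f"MXFP4: {weight_memory_gb(params, 4.25):.2f} GB") # fits, leaving room for KV cache
```

Activations and the KV cache add overhead on top of the weights, which is why the 4-bit footprint needs to land comfortably under 80 GB rather than just at it.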

Our research team benchmarked gpt-oss-120b across NVIDIA B200 and H100 GPUs using vLLM, SGLang, and TensorRT-LLM. Tests covered single-request scenarios and high-concurrency workloads with 50–100 requests. Key findings include:

  • Single-request speed: B200 with TensorRT-LLM delivers a 0.023 s time-to-first-token (TTFT), outperforming dual-H100 setups in several cases.

  • High concurrency: B200 sustains 7,236 tokens/sec at maximum load with lower per-token latency.

  • Efficiency: One B200 can replace two H100s for equal or better performance, with lower power use and less complexity.

  • Performance gains: Some workloads see up to 15x faster inference compared to a single H100.
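As a quick sanity check on those headline figures (our own arithmetic, not part of the published benchmark), aggregate throughput converts to effective per-token latency and per-request throughput like this:

```python
# Sanity math on the benchmark figures above: 7,236 tokens/sec aggregate
# throughput, assuming the upper end (~100) of the tested concurrency range.

def per_token_latency_ms(tokens_per_sec: float) -> float:
    """Effective aggregate latency per generated token, in milliseconds."""
    return 1000.0 / tokens_per_sec

b200_throughput = 7236   # tokens/sec at max load
concurrency = 100        # upper end of the 50-100 request range

print(f"{per_token_latency_ms(b200_throughput):.3f} ms per token (aggregate)")
print(f"{b200_throughput / concurrency:.2f} tokens/sec per request")
```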

For detailed benchmarks on throughput, latency, time to first token, and other metrics, read our full blog on NVIDIA B200 vs H100.

If you are looking to deploy GPT-OSS models on H100s, you can do it today on Clarifai across multiple clouds. Support for B200s is coming soon, giving you access to the latest NVIDIA GPUs for testing and production.

Developer Plan

Last month we launched Local Runners, and the response from developers has been incredible. From AI hobbyists to production teams, many have been eager to run open-source models locally on their own hardware while still taking advantage of the Clarifai platform. With Local Runners, you can run and test models on your own machines, then access them through a public API for integration into any application.
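In outline, calling a locally hosted model through its public endpoint looks like the sketch below. The endpoint URL, header names, and payload shape here are illustrative placeholders, not Clarifai's actual API; the tutorial linked below covers the real interface.

```python
# Sketch of calling a model exposed by a Local Runner through a public URL.
# The endpoint, auth header, and JSON shape are hypothetical placeholders.
import json
import urllib.request

def build_request(endpoint: str, token: str, prompt: str) -> urllib.request.Request:
    """Package a prompt as an authenticated JSON POST to the runner's public URL."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Authorization": f"Key {token}", "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("https://api.example.com/runners/my-model", "MY_PAT", "Hello!")
print(req.get_full_url())
# Sending it is one line once the runner is live:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The point of the pattern is that the model process stays on your hardware; only the HTTPS endpoint is public.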

Now, with the arrival of the latest GPT-OSS models including gpt-oss-20b, you can run these advanced reasoning models locally with full control of your compute and the ability to deploy agentic workflows immediately.

To make it even easier, we're introducing the Developer Plan at a promotional price of just $1/month. It includes everything in the Community Plan, plus:

Check out the Developer Plan and start running your own models locally today. If you are ready to run GPT-OSS-20b on your hardware, follow our step-by-step tutorial here.

Published Models

We have expanded our model library with new open-weight and specialized models that are ready to use in your workflows.

The latest additions include:

  • GPT-OSS-120b – an open-weight language model designed for strong reasoning, advanced tool use, and efficient on-device deployment. It supports extended context lengths and variable reasoning levels, making it ideal for complex agentic applications.

  • GPT-5, GPT-5 Mini, and GPT-5 Nano – GPT-5 is the flagship model for the most demanding reasoning and generative tasks. GPT-5 Mini offers a faster, cost-effective alternative for real-time applications. GPT-5 Nano delivers ultra-low-latency inference for edge and budget-sensitive deployments.

  • Qwen3-Coder-30B-A3B-Instruct – a high-efficiency coding model with long-context support and strong agentic capabilities, well suited for code generation, refactoring, and development automation.

You can start exploring these models immediately in the Clarifai Playground or access them via the API to integrate into your applications.

Ollama Assist

Ollama makes it simple to download and run powerful open-source models directly on your machine. With Clarifai Local Runners, you can now expose these locally running models via a secure public API.

We've also added an Ollama toolkit to the Clarifai CLI, letting you download, run, and expose Ollama models with a single command.

Read our step-by-step guide on running Ollama models locally and making them accessible via API.
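For reference, a locally running Ollama server answers on its default REST endpoint at `http://localhost:11434`. The snippet below builds a non-streaming `/api/generate` request; the `gpt-oss:20b` tag and prompt are examples, and the actual call is commented out since it requires Ollama to be running:

```python
# Build a request body for Ollama's local /api/generate REST endpoint.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate_payload(model: str, prompt: str) -> bytes:
    """JSON body for a non-streaming /api/generate call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

payload = generate_payload("gpt-oss:20b", "Explain MoE routing in one sentence.")
# With Ollama running locally (after "ollama pull gpt-oss:20b"):
# req = urllib.request.Request(OLLAMA_URL, data=payload,
#                              headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
print(json.loads(payload)["model"])
```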

Playground Enhancements

You can now compare multiple models side by side in the Playground instead of testing them one at a time. Quickly spot differences in output, speed, and quality to choose the best fit for your use case.

We've also added enhanced inference controls, Pythonic support, and model version selectors for smoother experimentation.


More Updates

Python SDK:

  • Improved logging, pipeline handling, authentication, Local Runner support, and code validation.

  • Added live logging, verbose output, and integration with GitHub repositories for flexible model initialization.

Platform:

Clarifai Organizations:

Ready to start building?

With Clarifai's Compute Orchestration, you can deploy GPT-OSS, Qwen3-Coder, and other open-source models, as well as your own custom models, on dedicated GPUs like NVIDIA B200s and H100s, on-prem or in the cloud. Serve models, MCP servers, or full agentic workflows directly from your hardware with full control over performance, cost, and security.


