
Benchmarking Speed, Scale, and Cost Efficiency



This blog post focuses on new features and improvements. For a complete list, including bug fixes, please see the release notes.

GPT-OSS-120B: Benchmarking Speed, Scale, and Cost Efficiency

Artificial Analysis has benchmarked Clarifai's Compute Orchestration with the GPT-OSS-120B model, one of the most advanced open-source large language models available today. The results underscore Clarifai as one of the top hardware- and GPU-agnostic engines for AI workloads where speed, flexibility, efficiency, and reliability matter most.

What the benchmark shows (P50, last 72h; single query, 1k-token prompt):

  • High throughput: 313 output tokens per second, among the very fastest measured in this configuration.

  • Low latency: 0.27s time-to-first-token (TTFT), so responses begin streaming almost instantly.

  • Compelling price/performance: Positioned in the benchmark's "most attractive quadrant" (high speed + low price).

Pricing that scales:

Clarifai offers GPT-OSS-120B at $0.09 per 1M input tokens and $0.36 per 1M output tokens. Artificial Analysis reports a blended price (3:1 input:output) of just $0.16 per 1M tokens, placing Clarifai significantly below the $0.26–$0.28 cluster of competitors while matching or exceeding their performance.
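
The blended figure follows directly from that 3:1 weighting; as a quick sanity check (a minimal sketch, not Clarifai code):

```python
# Blended price per 1M tokens under Artificial Analysis's 3:1 input:output weighting
input_price = 0.09    # $ per 1M input tokens
output_price = 0.36   # $ per 1M output tokens

blended = (3 * input_price + 1 * output_price) / 4
print(f"${blended:.4f} per 1M tokens")  # -> $0.1575, i.e. ~$0.16 blended
```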

Below is a comparison of output speed versus price across leading providers for GPT-OSS-120B. Clarifai stands out in the "most attractive quadrant," combining high throughput with competitive pricing.


Output Speed vs. Price

This chart compares latency (time to first token) against output speed. Clarifai demonstrates one of the lowest latencies while sustaining top-tier throughput, placing it among the best-in-class providers.


Latency vs. Output Speed

Why GPT-OSS-120B Matters

As one of the leading open-source "GPT-OSS" models, GPT-OSS-120B represents the growing demand for transparent, community-driven alternatives to closed-source LLMs. Running a model of this scale requires infrastructure that can not only deliver high speed and low latency, but also keep costs under control at production scale. That's exactly where Clarifai's Compute Orchestration makes a difference.

Why This Benchmark Matters

These results are more than numbers; they show how Clarifai has engineered every layer of the stack to optimize GPU utilization. With Compute Orchestration, multiple models can run on the same GPUs, workloads scale elastically, and enterprises can squeeze more value out of every accelerator. The payoff is fast, reliable, and cost-efficient inference that can support both experimentation and large-scale deployment.

Check out the full benchmarks on Artificial Analysis here.

Here's a quick demo of how you can access the GPT-OSS-120B model in the Playground.

Local Runners

Local Runners let you develop and run models on your own hardware (laptops, workstations, edge boxes) while making them callable through Clarifai's cloud API. Clarifai handles the public URL, routing, and authentication; your model executes locally and your data stays on your machine. It behaves like any other Clarifai-hosted model.

Why teams use Local Runners

  • Build where your data and tools live. Keep models close to local data, internal databases, and OS-level utilities.

  • No custom networking. Start a runner and get a public URL, with no port-forwarding or reverse proxies.

  • Use your own compute. Bring your GPUs and custom setups; the platform still provides the API, workflows, and governance around them.

New: Ollama Toolkit (now in the CLI)

We've added an Ollama Toolkit to the Clarifai CLI so you can initialize an Ollama-backed model directory in a single command (and choose any model from the Ollama library). It pairs perfectly with Local Runners: download, run, and expose an Ollama model via a public API with minimal setup.

The CLI supports --toolkit ollama plus flags like --model-name, --port, and --context-length, making it trivial to target specific Ollama models.
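
For example, an Ollama-backed model directory could be scaffolded like this (a sketch assuming the CLI's `model init` subcommand; the port and context-length values are illustrative, so check `clarifai model init --help` on your version):

```bash
# Scaffold an Ollama-backed model directory for gpt-oss:20b (run inside an empty folder)
clarifai model init \
  --toolkit ollama \
  --model-name gpt-oss:20b \
  --port 11434 \
  --context-length 8192
```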

Example workflow: run Gemma 3 270M or GPT-OSS 20B locally and serve it through a public API

  1. Pick a model in Ollama.

    • Gemma 3 270M (tiny, fast; 32K context): gemma3:270m

    • GPT-OSS 20B (OpenAI open-weight, optimized for local use): gpt-oss:20b

  2. Initialize the project with the Ollama Toolkit.
    Use the command above, swapping --model-name for your pick (e.g., gpt-oss:20b). This creates a new model directory structure that's compatible with the Clarifai platform. You can customize or optimize the generated model by editing the 1/model.py file as needed.

  3. Start your Local Runner.
    From the model directory:
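    A minimal sketch, assuming the `local-runner` subcommand from the Clarifai CLI:

```bash
# Start a Local Runner for the model defined in the current directory
clarifai model local-runner
```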

     

    The runner registers with Clarifai and exposes your local model through a public URL; the CLI prints a ready-to-run client snippet.

  4. Call it like any Clarifai model.
    For example (Python SDK):
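    A minimal sketch using the Clarifai Python SDK's Model client; the URL and PAT are placeholders, and the snippet the CLI prints for your runner is the authoritative version:

```python
# Call the locally served model through Clarifai's cloud API
from clarifai.client.model import Model

model = Model(
    url="https://clarifai.com/YOUR_USER_ID/YOUR_APP_ID/models/YOUR_MODEL_ID",  # placeholder
    pat="YOUR_PAT",  # personal access token
)

# Send a text prompt; the request is routed to the runner on your machine
response = model.predict_by_bytes(
    b"Write a one-line summary of what Local Runners do.",
    input_type="text",
)
print(response.outputs[0].data.text.raw)
```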

    Behind the scenes, the API call is routed to your machine; results return to the caller over Clarifai's secure control plane.

     

Deep dive: We published a step-by-step guide that walks through running Ollama models locally and exposing them with Local Runners. Check it out here.

Try it on the Developer Plan

You can start for free, or use the Developer Plan ($1/month for the first year), which includes up to 5 Local Runners and unlimited runner hours.

Check out the full example and setup guide in the documentation here.

Billing

We've made billing more transparent and flexible with this release. Monthly spending limits have been introduced: $100 for the Developer and Essential plans, and $500 for the Professional plan. If you need higher limits, you can reach out to our team.

We've also added a new credit card pre-authorization process. A temporary charge is applied to verify card validity and available funds: $50 for Developer, $100 for Essential, and $500 for Professional plans. The amount is automatically refunded within seven days, ensuring a seamless verification experience.

Control Center

  • The Control Center gets even more flexible and informative with this update. You can now resize charts to half their original size on the configure page, making side-by-side comparisons smoother and layouts more manageable.
  • Charts are smarter too: the Stored Inputs Cost chart now correctly shows the average cost for the selected period, while longer date ranges automatically display weekly aggregated data for easier readability. Empty charts display meaningful messages instead of zeros, so you always know when data isn't available.
  • We've also added cross-links between compute cost and usage charts, making it simple to navigate between these views and get a complete picture of your AI infrastructure.

More Changes

  • Python SDK: Fixed the Local Runner CLI command, updated protocol and gRPC versions, integrated secrets, corrected num_threads defaults, added stream_options validation, prevented downloading original checkpoints, improved model upload and deployment, and added user confirmation to prevent Dockerfile overwrite during uploads.
    Check out all SDK updates here.
  • Platform Updates: Added a public resource filter to quickly view Community-shared resources, improved Playground error messaging for streaming limits, and extended login session duration for Google and GitHub SSO users to seven days.
    Find all platform changes here.

Ready to start building?

With Local Runners, you can now serve models, MCP servers, or agents directly from your own hardware without uploading model weights or managing infrastructure. It's the fastest way to test, iterate, and securely run models from your laptop, workstation, or on-prem server. You can read the documentation to get started, or check out the blog to see how to run Ollama models locally and expose them through a public API.


