HomeiOS DevelopmentResponses Bug in LM Studio

Responses Bug in LM Studio


It began, as this stuff do, with a shortcut I used to be sure would work.

I’ve been constructing SwiftAgents, my Swift framework for speaking to language fashions, and one of many native suppliers it helps is LM Studio — the app a whole lot of us attain for to run fashions on our personal Macs. LM Studio not too long ago grew assist for the newer “Responses” API, the OpenAI-style endpoint that may keep in mind a dialog for you. As an alternative of re-sending the entire chat historical past on each flip, you ship solely the brand new message plus just a little breadcrumb — previous_response_id — that tells the server “you already keep in mind the remaining.” Much less information over the wire, much less bookkeeping on the consumer. An apparent win, and I needed it in SwiftAgents.

Earlier than wiring it in for good, I requested Claude Code to benchmark it. Ten turns of the identical little dialog, run two methods: as soon as with the brand new chaining trick, and as soon as the old school manner the place you resend your complete historical past each single time. I simply needed to substantiate the intelligent path was quicker earlier than committing to it.

The numbers got here again backwards.

When the shortcut is the great distance

Here’s what the benchmark discovered, working a small Qwen3 mannequin inside LM Studio. The left column is the “optimization” — chaining with previous_response_id, sending solely the brand new message every flip. The suitable column is the brute-force method — resending your complete dialog, each time, like a caveman.

The quantity proven is what number of enter tokens the server really needed to course of on that flip:

Flip Chaining (solely the brand new message despatched) Full resend (complete historical past each time)
1 26 26
2 48 48
3 98 69
4 206 95
5 415 120
6 829 141
7 1,669 169
8 3,338 191
9 6,677 211
10 13,364 238

Learn it twice, as a result of I needed to. The wasteful method — resending every thing — retains the workload flat, round 240 tokens by flip ten. The intelligent method, the place I ship nearly nothing, by some means makes the server grind by way of 13 thousand.

And take a look at the form of that left column: 26, 48, 98, 206, 415, 829… it doubles each flip. A textbook geometric balloon. Regardless of the server does internally when it “remembers” the dialog for you, it rebuilds the entire thing roughly twice as massive every time. Because the mannequin has to learn all of these tokens earlier than it could actually say a phrase, the wait balloons proper together with the token rely. By flip ten a single reply took 28 seconds with chaining, towards 3 seconds with out.

The optimization was, comfortably, the slowest potential method to maintain the dialog.

Ensuring it wasn’t simply me

A outcome that foolish deserves suspicion, so the subsequent step was to examine whether or not I’d misconfigured one thing or stumbled onto one unhealthy mannequin. The primary thought was to run the benchmark towards official GPT 5.5 – and there the caching behaved precisely as you’d anticipate. Then I requested Claude Code to run the identical probe throughout quite a few LLMs I had beforehand downloaded.

The balloon confirmed up each single time — small fashions and huge, previous architectures and brand-new ones, the plain ones and the flamboyant “reasoning” ones, and even a mixture-of-experts mannequin. Similar fingerprint every time: the chained path doubles each flip, the full-resend path stays flat.

A couple of of the extra memorable information factors:

  • gpt-oss (a 20-billion-parameter mixture-of-experts mannequin): ballooned to 16,833 tokens by flip ten — for a dialog that was genuinely 283 tokens lengthy. That’s a 59× tax. The stunning irony right here is that this mannequin barely “thinks” out loud in any respect, but it scored the worst blowup of the lot, which advised us the bug has nothing to do with how a lot the mannequin generates and every thing to do with how the server rebuilds the historical past.
  • A 12-billion Gemma mannequin: by flip ten, a single reply took 37.6 seconds as an alternative of the ~2.6 seconds the identical dialog wanted over the plain chat endpoint.

Importantly, this isn’t the Responses API being a foul thought, and it isn’t LM Studio being unhealthy software program — its atypical chat endpoint is fast and caches superbly. It’s one particular characteristic, the server-side dialog reconstruction behind previous_response_id, that misbehaves. I do know it’s particular to LM Studio as a result of the plain factors of comparability don’t do it: OpenAI’s personal servers preserve the token rely equal to the actual dialog, and Ollama — which merely declines to be stateful — retains it flat too. Solely LM Studio’s reconstruction inflates.

So fairly than ship a characteristic that makes issues slower, I did the boring, right factor in SwiftAgents: on LM Studio it resends the complete historical past and skips the chaining solely. And I wrote the entire thing up, with a runnable replica script, as a bug report on LM Studio’s tracker. Generally the deliverable is a paper path.

A aspect quest: the app I beloved versus the one I didn’t

Someplace in the midst of all this benchmarking, a distinct query crept in.

I’ve at all times most well-liked LM Studio. It’s the better-looking app, it feels extra trendy, and — the rationale that truly mattered to me — it supported MLX, Apple’s on-device machine-learning framework, lengthy earlier than Ollama did. On Apple Silicon, MLX is the quick path, so for a great whereas LM Studio was merely the faster method to run a mannequin on a Mac. Ollama was the command-line workhorse I revered however didn’t attain for.

Whereas poking at Gemma 4, I seen Ollama had quietly closed that hole — it now runs the identical trendy, accelerated mannequin codecs I’d switched to LM Studio for within the first place. Which meant, for the primary time, I might put the 2 of them on a really degree enjoying area: the similar mannequin, within the similar quantization, and simply race them.

So I did. Right here’s Gemma-4-E4B, similar nvfp4 construct on each:

Ollama LM Studio
Studying your immediate (immediate processing) 910 tok/s 445 tok/s
Writing the reply (era) 62.7 tok/s 51.7 tok/s
Time till the primary phrase seems 72 ms 121 ms
Re-reading a 1,780-token immediate it simply noticed (heat cache) 65 ms 657 ms

Ollama wins each row. It reads prompts twice as quick, generates noticeably faster, begins answering sooner, and — the one which stunned me most — reuses its cache about ten occasions extra cheaply. Ask it to re-read a immediate it simply processed and it’s completed in 65 milliseconds; LM Studio takes the higher a part of a second to do the identical factor.

I need to be honest, as a result of there’s an trustworthy caveat buried in right here. The primary time I raced them I had LM Studio on MLX and Ollama on the older format, and in that mismatched setup LM Studio’s era seemed quicker. It was a entice — I used to be evaluating the quick format towards the sluggish one. The second I matched them quant-for-quant, the obvious win evaporated and Ollama pulled forward on every thing. So I gained’t declare Ollama is universally quicker at every thing for everybody; I’ll declare the factor my information really helps, which is that on the identical mannequin in the identical format, Ollama got here out forward in every single place I seemed.

That’s a barely uncomfortable conclusion for me, given how a lot I preferred the opposite app. However the stopwatch doesn’t care what’s prettier.

The half I preserve serious about

Right here’s the bit that genuinely tickles me, and it’s not likely about tokens in any respect.

I didn’t write any of those benchmarks. I described what I needed to know — “load a mannequin, run ten turns every manner, observe the response time” — and Claude Code wrote the Python, ran it and computed all of the statistics. When it wanted a mannequin that wasn’t loaded, it drove LM Studio’s command-line software to load it, checked the API to substantiate it was actually resident, and benchmarked it.

At one level it quoted a era velocity that seemed too good, paused, determined the measurement window had been too quick to belief, rewrote the benchmark to generate an extended pattern, and re-ran it to get an trustworthy quantity. It even filed the bug report on my behalf. You may see how more information was added as feedback as I used to be discovering extra information.

On the similar time my agentic CI loop was ticking as nicely on the SwiftAgents PR. When the pull request’s continuous-integration construct went pink on Linux — as a result of a kind I’d used lives in a distinct module off the Mac — it identified the failure, reached for my very own SwiftCross shim to repair it, pushed, watched the construct, discovered a second spot with the identical drawback, fastened that too, and waited with me till all six platforms went inexperienced. I largely watched.

A couple of months in the past, writing a benchmark harness by hand would have been an excessive amount of work for me. So I wouldn’t have completed this analysis, however I might have simply complained on Twitter about one other drawback in any individual else’s code. And I might have been annoyed that I couldn’t do something about it. On this new actuality brokers do the analysis, the write-up and the submitting of the difficulty. The ball is now in LM Studio’s courtroom. This new actuality nonetheless feels faintly like dishonest.

I put the benchmarking scripts in gist for reference.

What I modified

Two issues got here out of a day that was solely ever meant to substantiate a one-line optimization.

SwiftAgents now does the wise factor on LM Studio: it resends the complete dialog and leaves previous_response_id chaining nicely alone till the underlying balloon is fastened. The “optimization” stays on the shelf.

And by myself machine, my default has quietly shifted from the app I preferred to the one which’s quicker. I nonetheless assume LM Studio is the nicer factor to have a look at. However I’ve been doing this lengthy sufficient to know that when the numbers are that constant, you go the place the numbers level — even after they level someplace you didn’t anticipate, and even when an AI is the one holding the stopwatch.

Do you utilize any native inferencing? In that case, which do you favor?


Classes: Bug Studies

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments