
Gemma 3 vs. MiniCPM vs. Qwen 2.5 VL


Introduction

Vision-Language Models (VLMs) are quickly becoming the core of many generative AI applications, from multimodal chatbots and agentic systems to automated content analysis tools. As open-source models mature, they offer promising alternatives to proprietary systems, enabling developers and enterprises to build cost-effective, scalable, and customizable AI solutions.

However, the growing number of VLMs presents a common dilemma: how do you choose the right model for your use case? It is usually a balancing act between output quality, latency, throughput, context length, and infrastructure cost.

This blog aims to simplify the decision-making process by providing detailed benchmarks and model descriptions for three leading open-source VLMs: Gemma-3-4B, MiniCPM-o 2.6, and Qwen2.5-VL-7B-Instruct. All benchmarks were run using Clarifai's Compute Orchestration, our own inference engine, to ensure consistent conditions and reliable comparisons across models.

Before diving into the results, here's a quick breakdown of the key metrics used in the benchmarks; a short sketch after the list shows how they can be computed from raw timings. All results were generated using Clarifai's Compute Orchestration on NVIDIA L40S GPUs, with input tokens set to 500 and output tokens set to 150.

  1. Latency per Token: The time it takes to generate each output token. Lower latency means faster responses, which is especially important for chat-like experiences.
  2. Time to First Token (TTFT): Measures how quickly the model produces the first token after receiving the input. It affects perceived responsiveness in streaming generation tasks.
  3. End-to-End Throughput: The number of tokens the model can generate per second for a single request, taking the full request processing time into account. Higher end-to-end throughput means the model generates output efficiently while keeping latency low.
  4. Overall Throughput: The total number of tokens generated per second across all concurrent requests. This reflects the model's ability to scale and maintain performance under load.
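To make these definitions concrete, here is a small sketch of how the four metrics can be derived from raw per-request timings. It is purely illustrative: the field names and helper functions are our own for this example, not part of any particular benchmarking tool.

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float          # wall-clock time the request was sent (seconds)
    first_token: float    # wall-clock time the first output token arrived
    end: float            # wall-clock time the last output token arrived
    output_tokens: int    # number of tokens generated for this request

def per_request_metrics(t: RequestTiming) -> dict:
    total = t.end - t.start
    return {
        "ttft_sec": t.first_token - t.start,               # Time to First Token
        "latency_per_token_sec": total / t.output_tokens,  # Latency per Token
        "end_to_end_throughput": t.output_tokens / total,  # tokens/sec for one request
    }

def overall_throughput(timings: list[RequestTiming]) -> float:
    # Total tokens generated across all concurrent requests, divided by the
    # wall-clock window in which those requests ran.
    window = max(t.end for t in timings) - min(t.start for t in timings)
    return sum(t.output_tokens for t in timings) / window

def requests_per_minute(timings: list[RequestTiming]) -> float:
    window = max(t.end for t in timings) - min(t.start for t in timings)
    return len(timings) / window * 60.0
```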

Now, let's dive into the details of each model, starting with Gemma-3-4B.

Gemma-3-4B

Gemma-3-4B, part of Google's latest Gemma 3 family of open multimodal models, is designed to handle both text and image inputs, producing coherent and contextually rich text responses. With support for up to 128K context tokens, 140+ languages, and tasks like text generation, image understanding, reasoning, and summarization, it's built for production-grade applications across diverse use cases.
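For context on what a single multimodal request looks like, here is a minimal sketch that sends one image plus a short text prompt to a Gemma-3-4B deployment through an OpenAI-compatible chat endpoint. The base URL, API key, and model identifier are placeholders for whatever your own deployment (for example a vLLM server or a hosted endpoint) exposes.

```python
from openai import OpenAI

# Placeholder endpoint and credentials; substitute the values your deployment exposes.
client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gemma-3-4b-it",  # placeholder model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Summarize what this chart shows in two sentences."},
        ],
    }],
    max_tokens=150,  # matches the 150-token output size used in these benchmarks
)
print(response.choices[0].message.content)
```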

Benchmark Summary: Performance on L40S GPU

Gemma-3-4B continues to show strong performance across both text and image tasks, with consistent behavior under varying concurrency levels. All benchmarks were run using Clarifai's Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. Gemma-3-4B is optimized for low-latency text processing and handles image inputs up to 512px with stable throughput across concurrency levels.

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)

  • Time to First Token (TTFT): 0.135 sec

  • End-to-end throughput: 202.25 tokens/sec

  • Requests per minute (RPM): Up to 329.90 at 32 concurrent requests

  • Overall throughput: 942.57 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 718.63 tokens/sec, 252.16 RPM at 32 concurrency

  • 512px images: 688.21 tokens/sec, 242.04 RPM

Scales with Concurrency (End-to-End Throughput):

  • At 2 concurrent requests:

  • At 8 concurrent requests:

  • At 16 concurrent requests:

  • At 32 concurrent requests:

Overall Insight:

Gemma-3-4B offers fast and reliable performance for text-heavy and structured vision-language tasks. For large image inputs (512px), performance remains stable, but you may need to scale compute resources to maintain low latency and high throughput.

If you're evaluating GPU performance for serving this model, we've published a separate comparison of A10 vs. L40S to help you choose the best hardware for your needs.

[Figure: Gemma-3-4B throughput vs. concurrency]

MiniCPM-o 2.6

MiniCPM-o 2.6 represents a major leap in end-side multimodal LLMs. It expands input modalities to images, video, audio, and text, offering real-time speech conversation and multimodal streaming support.

With an architecture integrating SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, the model has a total of 8 billion parameters. MiniCPM-o 2.6 demonstrates significant improvements over its predecessor, MiniCPM-V 2.6, and introduces real-time speech conversation, multimodal live streaming, and greater efficiency in token processing.
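If you want to experiment with the model directly, the sketch below follows the image-chat pattern published on the MiniCPM-o 2.6 model card, using Hugging Face Transformers with trust_remote_code. Treat it as illustrative: exact arguments (for example, the flags that enable the audio and TTS heads) vary between releases, and the image path and prompt are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the model with its custom (remote) code; bfloat16 keeps memory use reasonable on a single GPU.
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-o-2_6", trust_remote_code=True)

# Single-turn image + text chat; "chart.png" is a placeholder input.
image = Image.open("chart.png").convert("RGB")
msgs = [{"role": "user", "content": [image, "Describe this image in two sentences."]}]
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```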

Benchmark Summary: Performance on L40S GPU

All benchmarks were run using Clarifai's Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. MiniCPM-o 2.6 performs exceptionally well across both text and image workloads, scaling smoothly across concurrency levels. Shared vLLM serving provides significant gains in overall throughput while maintaining low latency.

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)

  • Time to First Token (TTFT): 0.087 sec

  • End-to-end throughput: 213.23 tokens/sec

  • Requests per minute (RPM): Up to 362.83 at 32 concurrent requests

  • Overall throughput: 1075.28 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 1039.60 tokens/sec, 353.19 RPM at 32 concurrency

  • 512px images: 957.37 tokens/sec, 324.66 RPM

Scales with Concurrency (End-to-End Throughput):

  • At 2 concurrent requests:

  • At 8 concurrent requests:

  • At 16 concurrent requests:

  • At 32 concurrent requests:

Overall Insight:

MiniCPM-o 2.6 performs reliably across a wide range of tasks and input sizes. It maintains low latency, scales linearly with concurrency, and remains performant even with 512px image inputs. This makes it a solid choice for real-time applications running on modern GPUs like the L40S. These results reflect performance on that specific hardware configuration and may differ depending on the environment or GPU tier.

[Figure: MiniCPM-o 2.6 throughput vs. concurrency]

Qwen2.5-VL-7B-Instruct

Qwen2.5-VL is a vision-language model designed for visual recognition, reasoning, long video analysis, object localization, and structured data extraction.

Its architecture integrates window attention into the Vision Transformer (ViT), significantly improving both training and inference efficiency. Additional optimizations such as SwiGLU activation and RMSNorm further align the ViT with the Qwen2.5 LLM, enhancing overall performance and consistency.
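To give a sense of how the model is typically run, here is a sketch based on the usage pattern from the Qwen2.5-VL model card. It assumes a recent Transformers release (which includes the Qwen2.5-VL classes) and the qwen_vl_utils helper package; the image URL and prompt are placeholders.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Placeholder document-parsing prompt over a placeholder image URL.
messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://example.com/invoice.png"},
    {"type": "text", "text": "Extract the invoice number and total amount as JSON."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=150)
# Strip the prompt tokens before decoding so only the answer is printed.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```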

Benchmark Summary: Performance on L40S GPU

Qwen2.5-VL-7B-Instruct delivers consistent performance across both text and image-based tasks. Benchmarks from Clarifai's Compute Orchestration highlight its ability to handle multimodal inputs at scale, with strong throughput and responsiveness under varying concurrency levels.

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)

  • Time to First Token (TTFT): 0.089 sec

  • End-to-end throughput: 205.67 tokens/sec

  • Requests per minute (RPM): Up to 353.78 at 32 concurrent requests

  • Overall throughput: 1017.16 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 854.53 tokens/sec, 318.64 RPM at 32 concurrency

  • 512px images: 832.28 tokens/sec, 345.98 RPM

Scales with Concurrency (End-to-End Throughput):

  • At 2 concurrent requests:

  • At 8 concurrent requests:

  • At 16 concurrent requests:

  • At 32 concurrent requests:

Overall Insight:

Qwen2.5-VL-7B-Instruct is well-suited for both text and multimodal tasks. While larger images introduce latency and throughput trade-offs, the model performs reliably with small to medium-sized inputs even at high concurrency. It's a strong choice for scalable vision-language pipelines that prioritize throughput and moderate latency.

[Figure: Qwen2.5-VL-7B-Instruct throughput vs. concurrency]

Which VLM is Right for You?

Choosing the right Vision-Language Model (VLM) depends on your workload type, input modality, and concurrency requirements. All benchmarks in this report were generated using NVIDIA L40S GPUs via Clarifai's Compute Orchestration.

These results reflect performance on enterprise-grade infrastructure. If you're using lower-end hardware or targeting larger batch sizes or ultra-low latency, actual performance may differ. It's important to evaluate based on your specific deployment setup.
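One practical way to do that is to run a quick concurrency test against your own endpoint. The sketch below fires a handful of concurrent streaming requests at an OpenAI-compatible endpoint and reports overall throughput and requests per minute. It is a simplified illustration rather than the harness used for the published numbers: the endpoint URL, API key, and model name are placeholders, and counting one streamed chunk as one token is only a rough approximation.

```python
import asyncio
import time
from openai import AsyncOpenAI

# Placeholder endpoint and credentials; swap in your own deployment details.
client = AsyncOpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_API_KEY")

async def one_request(prompt: str) -> dict:
    start = time.perf_counter()
    first_token = None
    tokens = 0
    stream = await client.chat.completions.create(
        model="gemma-3-4b-it",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token is None:
                first_token = time.perf_counter()
            tokens += 1  # rough proxy: one streamed chunk ~ one token
    end = time.perf_counter()
    return {"ttft": (first_token or end) - start, "tokens": tokens, "total": end - start}

async def run(concurrency: int = 8) -> None:
    prompt = "word " * 500  # crude stand-in for a ~500-token input
    t0 = time.perf_counter()
    results = await asyncio.gather(*[one_request(prompt) for _ in range(concurrency)])
    window = time.perf_counter() - t0
    print("overall throughput (tokens/s):", sum(r["tokens"] for r in results) / window)
    print("requests per minute:", concurrency / window * 60)

asyncio.run(run(concurrency=8))
```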

MiniCPM-o 2.6
MiniCPM offers consistent performance across both text and image tasks, especially when deployed with shared vLLM. It scales well up to 32 concurrent requests, maintaining high throughput and low latency even with 1024px image inputs.

If your application requires stable performance under load and flexibility across modalities, MiniCPM is the most well-rounded choice in this group.

Gemma-3-4B
Gemma performs best on text-heavy workloads with occasional image input. It handles concurrency well up to 16 requests but starts to dip at 32, particularly with large images such as 2048px.

If your use case is primarily focused on fast, high-quality text generation with small to medium image inputs, Gemma delivers strong performance without the need for high-end scaling.

Qwen2.5-VL-7B-Instruct
Qwen2.5 is optimized for structured vision-language tasks such as document parsing, OCR, and multimodal reasoning, making it a strong choice for applications that require precise visual and textual understanding.

If your priority is accurate visual reasoning and multimodal understanding, Qwen2.5 is a strong fit, especially when output quality matters more than peak throughput.

To help you compare at a glance, here's a summary of the key performance metrics for all three models at 32 concurrent requests across text and image inputs.

Vision-Language Model Benchmark Summary (32 Concurrent Requests, L40S GPU)

 

 

Metric                              Model                     Text Only   256px Image   512px Image
Latency per Token (sec)             Gemma-3-4B                0.027       0.036         0.037
                                    MiniCPM-o 2.6             0.024       0.026         0.028
                                    Qwen2.5-VL-7B-Instruct    0.025       0.032         0.032
Time to First Token (sec)           Gemma-3-4B                0.236       1.034         1.164
                                    MiniCPM-o 2.6             0.120       0.347         0.786
                                    Qwen2.5-VL-7B-Instruct    0.121       0.364         0.341
End-to-End Throughput (tokens/s)    Gemma-3-4B                168.45      124.56        120.01
                                    MiniCPM-o 2.6             188.86      176.29        160.14
                                    Qwen2.5-VL-7B-Instruct    186.91      179.69        191.94
Overall Throughput (tokens/s)       Gemma-3-4B                942.58      718.63        688.21
                                    MiniCPM-o 2.6             1075.28     1039.60       957.37
                                    Qwen2.5-VL-7B-Instruct    1017.16     854.53        832.28
Requests per Minute (RPM)           Gemma-3-4B                329.90      252.16        242.04
                                    MiniCPM-o 2.6             362.84      353.19        324.66
                                    Qwen2.5-VL-7B-Instruct    353.78      318.64        345.98

 

Note: These benchmarks were run on L40S GPUs. Results may vary depending on GPU class (such as A100 or H100), CPU limitations, or runtime configurations including batching, quantization, or model variants.

Conclusion

We have seen the benchmarks across MiniCPM-o 2.6, Gemma-3-4B, and Qwen2.5-VL-7B-Instruct, covering their performance on latency, throughput, and scalability under different concurrency levels and image sizes. Each model performs differently depending on the task and workload requirements.

If you want to try out these models, we've launched a new AI Playground where you can explore them directly. We will continue adding the latest models to the platform, so keep an eye on our updates and join our Discord community for the latest announcements.

If you're also looking to deploy these open-source VLMs on your own dedicated compute, our platform supports production-grade inference and scalable deployments. You can quickly get started by setting up your own node pool and running inference efficiently. Check out the tutorial below to get started.

 


