
Gemma 3 vs. MiniCPM vs. Qwen 2.5 VL


Introduction

Vision-Language Models (VLMs) are quickly becoming the core of many generative AI applications, from multimodal chatbots and agentic systems to automated content analysis tools. As open-source models mature, they offer promising alternatives to proprietary systems, enabling developers and enterprises to build cost-effective, scalable, and customizable AI solutions.

However, the growing number of VLMs presents a common dilemma: how do you choose the right model for your use case? It is usually a balancing act between output quality, latency, throughput, context length, and infrastructure cost.

This blog aims to simplify the decision-making process by providing detailed benchmarks and model descriptions for three leading open-source VLMs: Gemma-3-4B, MiniCPM-o 2.6, and Qwen2.5-VL-7B-Instruct. All benchmarks were run using Clarifai's Compute Orchestration, our own inference engine, to ensure consistent conditions and reliable comparisons across models.

Before diving into the results, here's a quick breakdown of the key metrics used in the benchmarks; a short sketch after the list shows how they can be computed from raw timings. All results were generated using Clarifai's Compute Orchestration on NVIDIA L40S GPUs, with input tokens set to 500 and output tokens set to 150.

  1. Latency per Token: The time it takes to generate each output token. Lower latency means faster responses, which is especially important for chat-like experiences.
  2. Time to First Token (TTFT): Measures how quickly the model produces the first token after receiving the input. It affects perceived responsiveness in streaming generation tasks.
  3. End-to-End Throughput: The number of tokens the model can generate per second for a single request, taking the full request processing time into account. Higher end-to-end throughput means the model generates output efficiently while keeping latency low.
  4. Overall Throughput: The total number of tokens generated per second across all concurrent requests. This reflects the model's ability to scale and maintain performance under load.
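To make these definitions concrete, here is a small sketch of how the four metrics can be derived from raw per-request timings. It is purely illustrative: the field names and helper functions are our own for this example, not part of any particular benchmarking tool.

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float          # wall-clock time the request was sent (seconds)
    first_token: float    # wall-clock time the first output token arrived
    end: float            # wall-clock time the last output token arrived
    output_tokens: int    # number of tokens generated for this request

def per_request_metrics(t: RequestTiming) -> dict:
    total = t.end - t.start
    return {
        "ttft_sec": t.first_token - t.start,               # Time to First Token
        "latency_per_token_sec": total / t.output_tokens,  # Latency per Token
        "end_to_end_throughput": t.output_tokens / total,  # tokens/sec for one request
    }

def overall_throughput(timings: list[RequestTiming]) -> float:
    # Total tokens generated across all concurrent requests, divided by the
    # wall-clock window in which those requests ran.
    window = max(t.end for t in timings) - min(t.start for t in timings)
    return sum(t.output_tokens for t in timings) / window

def requests_per_minute(timings: list[RequestTiming]) -> float:
    window = max(t.end for t in timings) - min(t.start for t in timings)
    return len(timings) / window * 60.0
```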

Now, let's dive into the details of each model, starting with Gemma-3-4B.

Gemma-3-4B

Gemma-3-4B, part of Google's latest Gemma 3 family of open multimodal models, is designed to handle both text and image inputs, producing coherent and contextually rich text responses. With support for up to 128K context tokens, 140+ languages, and tasks like text generation, image understanding, reasoning, and summarization, it's built for production-grade applications across diverse use cases.
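For context on what a single multimodal request looks like, here is a minimal sketch that sends one image plus a short text prompt to a Gemma-3-4B deployment through an OpenAI-compatible chat endpoint. The base URL, API key, and model identifier are placeholders for whatever your own deployment (for example a vLLM server or a hosted endpoint) exposes.

```python
from openai import OpenAI

# Placeholder endpoint and credentials; substitute the values your deployment exposes.
client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gemma-3-4b-it",  # placeholder model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Summarize what this chart shows in two sentences."},
        ],
    }],
    max_tokens=150,  # matches the 150-token output size used in these benchmarks
)
print(response.choices[0].message.content)
```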

Benchmark Summary: Performance on L40S GPU

Gemma-3-4B continues to show strong performance across both text and image tasks, with consistent behavior under varying concurrency levels. All benchmarks were run using Clarifai's Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. Gemma-3-4B is optimized for low-latency text processing and handles image inputs up to 512px with stable throughput across concurrency levels.

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)

  • Time to First Token (TTFT): 0.135 sec

  • End-to-end throughput: 202.25 tokens/sec

  • Requests per minute (RPM): Up to 329.90 at 32 concurrent requests

  • Overall throughput: 942.57 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 718.63 tokens/sec, 252.16 RPM at 32 concurrency

  • 512px images: 688.21 tokens/sec, 242.04 RPM

Scales with Concurrency (End-to-End Throughput):

  • At 2 concurrent requests:

  • At 8 concurrent requests:

  • At 16 concurrent requests:

  • At 32 concurrent requests:

Overall Insight:

Gemma-3-4B offers fast and reliable performance for text-heavy and structured vision-language tasks. For large image inputs (512px), performance remains stable, but you may need to scale compute resources to maintain low latency and high throughput.

If you're evaluating GPU performance for serving this model, we've published a separate comparison of A10 vs. L40S to help you choose the best hardware for your needs.

[Figure: Gemma-3-4B throughput vs. concurrency]

MiniCPM-o 2.6

MiniCPM-o 2.6 represents a major leap in end-side multimodal LLMs. It expands input modalities to images, video, audio, and text, offering real-time speech conversation and multimodal streaming support.

With an architecture integrating SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, the model has a total of 8 billion parameters. MiniCPM-o 2.6 demonstrates significant improvements over its predecessor, MiniCPM-V 2.6, and introduces real-time speech conversation, multimodal live streaming, and greater efficiency in token processing.
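If you want to experiment with the model directly, the sketch below follows the image-chat pattern published on the MiniCPM-o 2.6 model card, using Hugging Face Transformers with trust_remote_code. Treat it as illustrative: exact arguments (for example, the flags that enable the audio and TTS heads) vary between releases, and the image path and prompt are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the model with its custom (remote) code; bfloat16 keeps memory use reasonable on a single GPU.
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-o-2_6", trust_remote_code=True)

# Single-turn image + text chat; "chart.png" is a placeholder input.
image = Image.open("chart.png").convert("RGB")
msgs = [{"role": "user", "content": [image, "Describe this image in two sentences."]}]
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```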

Benchmark Summary: Performance on L40S GPU

All benchmarks were run using Clarifai's Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. MiniCPM-o 2.6 performs exceptionally well across both text and image workloads, scaling smoothly across concurrency levels. Shared vLLM serving provides significant gains in overall throughput while maintaining low latency.

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)

  • Time to First Token (TTFT): 0.087 sec

  • End-to-end throughput: 213.23 tokens/sec

  • Requests per minute (RPM): Up to 362.83 at 32 concurrent requests

  • Overall throughput: 1075.28 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 1039.60 tokens/sec, 353.19 RPM at 32 concurrency

  • 512px images: 957.37 tokens/sec, 324.66 RPM

Scales with Concurrency (End-to-End Throughput):

  • At 2 concurrent requests:

  • At 8 concurrent requests:

  • At 16 concurrent requests:

  • At 32 concurrent requests:

Overall Insight:

MiniCPM-o 2.6 performs reliably across a wide range of tasks and input sizes. It maintains low latency, scales linearly with concurrency, and remains performant even with 512px image inputs. This makes it a solid choice for real-time applications running on modern GPUs like the L40S. These results reflect performance on that specific hardware configuration and may differ depending on the environment or GPU tier.

[Figure: MiniCPM-o 2.6 throughput vs. concurrency]

Qwen2.5-VL-7B-Instruct

Qwen2.5-VL is a vision-language model designed for visual recognition, reasoning, long video analysis, object localization, and structured data extraction.

Its architecture integrates window attention into the Vision Transformer (ViT), significantly improving both training and inference efficiency. Additional optimizations such as SwiGLU activation and RMSNorm further align the ViT with the Qwen2.5 LLM, enhancing overall performance and consistency.
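To give a sense of how the model is typically run, here is a sketch based on the usage pattern from the Qwen2.5-VL model card. It assumes a recent Transformers release (which includes the Qwen2.5-VL classes) and the qwen_vl_utils helper package; the image URL and prompt are placeholders.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Placeholder document-parsing prompt over a placeholder image URL.
messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://example.com/invoice.png"},
    {"type": "text", "text": "Extract the invoice number and total amount as JSON."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=150)
# Strip the prompt tokens before decoding so only the answer is printed.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```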

Benchmark Summary: Performance on L40S GPU

Qwen2.5-VL-7B-Instruct delivers consistent performance across both text and image-based tasks. Benchmarks from Clarifai's Compute Orchestration highlight its ability to handle multimodal inputs at scale, with strong throughput and responsiveness under varying concurrency levels.

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)

  • Time to First Token (TTFT): 0.089 sec

  • End-to-end throughput: 205.67 tokens/sec

  • Requests per minute (RPM): Up to 353.78 at 32 concurrent requests

  • Overall throughput: 1017.16 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 854.53 tokens/sec, 318.64 RPM at 32 concurrency

  • 512px images: 832.28 tokens/sec, 345.98 RPM

Scales with Concurrency (End-to-End Throughput):

  • At 2 concurrent requests:

  • At 8 concurrent requests:

  • At 16 concurrent requests:

  • At 32 concurrent requests:

Overall Insight:

Qwen2.5-VL-7B-Instruct is well-suited for both text and multimodal tasks. While larger images introduce latency and throughput trade-offs, the model performs reliably with small to medium-sized inputs even at high concurrency. It's a strong choice for scalable vision-language pipelines that prioritize throughput and moderate latency.

[Figure: Qwen2.5-VL-7B-Instruct throughput vs. concurrency]

Which VLM is Right for You?

Choosing the right Vision-Language Model (VLM) depends on your workload type, input modality, and concurrency requirements. All benchmarks in this report were generated using NVIDIA L40S GPUs via Clarifai's Compute Orchestration.

These results reflect performance on enterprise-grade infrastructure. If you're using lower-end hardware or targeting larger batch sizes or ultra-low latency, actual performance may differ. It's important to evaluate based on your specific deployment setup.
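One practical way to do that is to run a quick concurrency test against your own endpoint. The sketch below fires a handful of concurrent streaming requests at an OpenAI-compatible endpoint and reports overall throughput and requests per minute. It is a simplified illustration rather than the harness used for the published numbers: the endpoint URL, API key, and model name are placeholders, and counting one streamed chunk as one token is only a rough approximation.

```python
import asyncio
import time
from openai import AsyncOpenAI

# Placeholder endpoint and credentials; swap in your own deployment details.
client = AsyncOpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_API_KEY")

async def one_request(prompt: str) -> dict:
    start = time.perf_counter()
    first_token = None
    tokens = 0
    stream = await client.chat.completions.create(
        model="gemma-3-4b-it",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token is None:
                first_token = time.perf_counter()
            tokens += 1  # rough proxy: one streamed chunk ~ one token
    end = time.perf_counter()
    return {"ttft": (first_token or end) - start, "tokens": tokens, "total": end - start}

async def run(concurrency: int = 8) -> None:
    prompt = "word " * 500  # crude stand-in for a ~500-token input
    t0 = time.perf_counter()
    results = await asyncio.gather(*[one_request(prompt) for _ in range(concurrency)])
    window = time.perf_counter() - t0
    print("overall throughput (tokens/s):", sum(r["tokens"] for r in results) / window)
    print("requests per minute:", concurrency / window * 60)

asyncio.run(run(concurrency=8))
```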

MiniCPM-o 2.6
MiniCPM offers consistent performance across both text and image tasks, especially when deployed with shared vLLM. It scales well up to 32 concurrent requests, maintaining high throughput and low latency even with 1024px image inputs.

If your application requires stable performance under load and flexibility across modalities, MiniCPM is the most well-rounded choice in this group.

Gemma-3-4B
Gemma performs best on text-heavy workloads with occasional image input. It handles concurrency well up to 16 requests but starts to dip at 32, particularly with large images such as 2048px.

If your use case is primarily focused on fast, high-quality text generation with small to medium image inputs, Gemma delivers strong performance without the need for high-end scaling.

Qwen2.5-VL-7B-Instruct
Qwen2.5 is optimized for structured vision-language tasks such as document parsing, OCR, and multimodal reasoning, making it a strong choice for applications that require precise visual and textual understanding.

If your priority is accurate visual reasoning and multimodal understanding, Qwen2.5 is a strong fit, especially when output quality matters more than peak throughput.

To help you compare at a glance, here's a summary of the key performance metrics for all three models at 32 concurrent requests across text and image inputs.

Vision-Language Model Benchmark Summary (32 Concurrent Requests, L40S GPU)

 

 

Metric                              Model                     Text Only   256px Image   512px Image
Latency per Token (sec)             Gemma-3-4B                0.027       0.036         0.037
                                    MiniCPM-o 2.6             0.024       0.026         0.028
                                    Qwen2.5-VL-7B-Instruct    0.025       0.032         0.032
Time to First Token (sec)           Gemma-3-4B                0.236       1.034         1.164
                                    MiniCPM-o 2.6             0.120       0.347         0.786
                                    Qwen2.5-VL-7B-Instruct    0.121       0.364         0.341
End-to-End Throughput (tokens/s)    Gemma-3-4B                168.45      124.56        120.01
                                    MiniCPM-o 2.6             188.86      176.29        160.14
                                    Qwen2.5-VL-7B-Instruct    186.91      179.69        191.94
Overall Throughput (tokens/s)       Gemma-3-4B                942.58      718.63        688.21
                                    MiniCPM-o 2.6             1075.28     1039.60       957.37
                                    Qwen2.5-VL-7B-Instruct    1017.16     854.53        832.28
Requests per Minute (RPM)           Gemma-3-4B                329.90      252.16        242.04
                                    MiniCPM-o 2.6             362.84      353.19        324.66
                                    Qwen2.5-VL-7B-Instruct    353.78      318.64        345.98

 

Note: These benchmarks were run on L40S GPUs. Results may vary depending on GPU class (such as A100 or H100), CPU limitations, or runtime configurations including batching, quantization, or model variants.

Conclusion

We have seen the benchmarks across MiniCPM-o 2.6, Gemma-3-4B, and Qwen2.5-VL-7B-Instruct, covering their performance on latency, throughput, and scalability under different concurrency levels and image sizes. Each model performs differently depending on the task and workload requirements.

If you want to try out these models, we've launched a new AI Playground where you can explore them directly. We will continue adding the latest models to the platform, so keep an eye on our updates and join our Discord community for the latest announcements.

If you're also looking to deploy these open-source VLMs on your own dedicated compute, our platform supports production-grade inference and scalable deployments. You can quickly get started by setting up your own node pool and running inference efficiently. Check out the tutorial below to get started.

 


