Qwen has released Qwen3-ASR-Toolkit, an MIT-licensed Python CLI that programmatically works around the Qwen3-ASR-Flash API's 3-minute/10 MB per-request limit by performing VAD-aware chunking, parallel API calls, and automated resampling/format normalization through FFmpeg. The result is reliable, hour-scale transcription pipelines with configurable concurrency, context injection, and clean text post-processing. Python ≥3.8 is required; install with:
pip install qwen3-asr-toolkit
What the toolkit provides on top of the API
- Long-audio handling. The toolkit slices input using voice activity detection (VAD) at natural pauses, keeping each chunk below the API's hard duration/size caps, then merges outputs in order.
- Parallel throughput. A thread pool dispatches multiple chunks concurrently to DashScope endpoints, improving wall-clock latency for hour-long inputs. You control concurrency via -j/--num-threads.
- Format & rate normalization. Any common audio/video container (MP4/MOV/MKV/MP3/WAV/M4A, etc.) is converted to the API's required mono 16 kHz before submission. Requires FFmpeg installed on PATH.
- Text cleanup & context. The tool includes post-processing to reduce repetitions/hallucinations and supports context injection to bias recognition toward domain terms; the underlying API also exposes language detection and inverse text normalization (ITN) toggles.
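The format-normalization step can be sketched in Python. This is illustrative only: `build_ffmpeg_cmd` and `normalize` are hypothetical names, not the toolkit's actual internals; the flag choices simply mirror the "mono 16 kHz" requirement described above.

```python
import subprocess

def build_ffmpeg_cmd(src, dst):
    # Hypothetical sketch: convert any container to the mono 16 kHz WAV
    # the API expects. -ac 1 / -ar 16000 implement the "mono 16 kHz"
    # normalization described above.
    return [
        "ffmpeg", "-y",   # overwrite output without prompting
        "-i", src,        # input: MP4/MOV/MKV/MP3/WAV/M4A, etc.
        "-vn",            # drop any video stream
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # resample to 16 kHz
        dst,
    ]

def normalize(src, dst):
    # Requires FFmpeg on PATH, as noted above.
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```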
The official Qwen3-ASR-Flash API is single-turn and enforces ≤3 min duration and ≤10 MB payloads per call. That's reasonable for interactive requests but awkward for long media. The toolkit operationalizes best practices (VAD-aware segmentation plus concurrent calls) so teams can batch large archives or live capture dumps without writing orchestration from scratch.
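A quick back-of-envelope check shows which cap binds first. Assuming uncompressed 16-bit PCM at the API's mono 16 kHz rate (an assumption; uploads may be compressed, which would make the size cap even less binding):

```python
# Which per-request cap binds for mono 16 kHz, 16-bit PCM (an assumption)?
BYTES_PER_SEC = 16_000 * 2                        # sample rate x 2 bytes/sample
SIZE_CAP_SEC = 10 * 1024 * 1024 / BYTES_PER_SEC   # 10 MB expressed in seconds
DURATION_CAP_SEC = 3 * 60                         # the 3-minute cap

# The chunker must respect whichever cap is tighter.
max_chunk_sec = min(DURATION_CAP_SEC, SIZE_CAP_SEC)
print(round(SIZE_CAP_SEC, 1))   # ~327.7 s: the size cap allows ~5.5 min
print(max_chunk_sec)            # 180 s: the duration cap binds first
```

So under these assumptions each chunk must stay under 180 seconds; the 10 MB limit would only bind for higher-bitrate payloads.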
Quick start
- Install prerequisites
# System: FFmpeg must be available
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install -y ffmpeg
- Install the CLI
pip install qwen3-asr-toolkit
- Configure credentials
# International endpoint key
export DASHSCOPE_API_KEY="sk-..."
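A wrapper script can resolve credentials the same way; this helper is a sketch (the function name and precedence order are our assumptions), with `DASHSCOPE_API_KEY` taken from the setup above:

```python
import os

def get_dashscope_key(explicit_key=None):
    # Sketch: prefer an explicitly passed key (as with -key on the CLI),
    # falling back to the DASHSCOPE_API_KEY environment variable.
    key = explicit_key or os.environ.get("DASHSCOPE_API_KEY")
    if not key:
        raise RuntimeError("Set DASHSCOPE_API_KEY or pass a key explicitly.")
    return key
```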
- Run
# Basic: local video, default 4 threads
qwen3-asr -i "/path/to/lecture.mp4"
# Faster: raise parallelism and pass the key explicitly (optional if the env var is set)
qwen3-asr -i "/path/to/podcast.wav" -j 8 -key "sk-..."
# Improve domain accuracy with context
qwen3-asr -i "/path/to/earnings_call.m4a" \
  -c "tickers, CFO name, product names, Q3 revenue guidance"
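For whole folders of media, a thin Python wrapper can shell out to the CLI per file. This loop is a sketch: it assumes the `qwen3-asr` entry point shown above and a key already exported in the environment; the helper names are ours.

```python
import pathlib
import subprocess

MEDIA_EXTS = {".mp4", ".mov", ".mkv", ".mp3", ".wav", ".m4a"}

def batch_cmds(folder, threads=8):
    # Build one CLI invocation per media file (sorted for determinism).
    cmds = []
    for path in sorted(pathlib.Path(folder).iterdir()):
        if path.suffix.lower() in MEDIA_EXTS:
            cmds.append(["qwen3-asr", "-i", str(path), "-j", str(threads)])
    return cmds

def run_batch(folder, threads=8):
    # Relies on DASHSCOPE_API_KEY being exported, as in the step above.
    for cmd in batch_cmds(folder, threads):
        subprocess.run(cmd, check=True)
```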
Arguments you'll actually use: -i/--input-file (file path or http/https URL), -j/--num-threads, -c/--context, -key/--dashscope-api-key, -t/--tmp-dir, -s/--silence. Output is printed to the console and saved as a .txt transcript.
Minimal pipeline architecture
1) Load local file or URL → 2) VAD to find silence boundaries → 3) Chunk below API caps → 4) Resample to 16 kHz mono → 5) Parallel submit to DashScope → 6) Aggregate segments in order → 7) Post-process text (dedupe repetitions) → 8) Emit .txt transcript.
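The ordered aggregation in steps 5–6 falls out naturally if chunks are submitted through `ThreadPoolExecutor.map`, which yields results in input order even when calls finish out of order. This is a sketch of that pattern, not the toolkit's code; `transcribe_chunk` stands in for the real DashScope call.

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_all(chunks, transcribe_chunk, num_threads=4):
    # Steps 5-6: parallel submit, then aggregate in order.
    # executor.map preserves input order regardless of completion
    # order, so no manual re-sorting of segments is needed.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        segments = list(pool.map(transcribe_chunk, chunks))
    # Step 7 (post-processing) would run here before writing the .txt file.
    return " ".join(segments)
```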
Summary
Qwen3-ASR-Toolkit turns Qwen3-ASR-Flash into a practical long-audio pipeline by combining VAD-based segmentation, FFmpeg normalization (mono/16 kHz), and parallel API dispatch under the 3-minute/10 MB caps. Teams get deterministic chunking, configurable throughput, and optional context/LID/ITN controls without custom orchestration. For production, pin the package version, verify regional endpoints/keys, and tune thread count to your network and QPS; then pip install qwen3-asr-toolkit and ship.
Check out the GitHub page for the code.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.