In this tutorial, we walk through an advanced implementation of WhisperX, exploring transcription, alignment, and word-level timestamps in detail. We set up the environment, load and preprocess the audio, and then run the full pipeline, from transcription to alignment and analysis, while keeping memory usage in check and supporting batch processing. Along the way, we also visualize results, export them in multiple formats, and even extract keywords to gain deeper insights from the audio content.
!pip install -q git+https://github.com/m-bain/whisperX.git
!pip install -q pandas matplotlib seaborn

import whisperx
import torch
import gc
import os
import json
import pandas as pd
from pathlib import Path
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')

CONFIG = {
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "compute_type": "float16" if torch.cuda.is_available() else "int8",
    "batch_size": 16,
    "model_size": "base",
    "language": None,
}

print(f"🚀 Running on: {CONFIG['device']}")
print(f"📊 Compute type: {CONFIG['compute_type']}")
print(f"🎯 Model: {CONFIG['model_size']}")
We begin by installing WhisperX along with the essential libraries and then configure our setup. We detect whether CUDA is available, select the compute type, and set parameters such as batch size, model size, and language to prepare for transcription.
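If you land on a CPU-only runtime, a smaller checkpoint and batch size keep memory and runtime manageable. The values below are an optional, illustrative tweak on top of the configuration above, not part of the original pipeline.

# Optional: scale the configuration down on CPU-only runtimes.
# These values are illustrative defaults; tune them for your hardware.
if CONFIG["device"] == "cpu":
    CONFIG["model_size"] = "tiny"   # smaller Whisper checkpoint
    CONFIG["batch_size"] = 4        # reduce memory pressure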
def download_sample_audio():
    """Download a sample audio file for testing"""
    !wget -q -O sample.mp3 https://github.com/mozilla-extensions/speaktome/raw/master/content/cv-valid-dev/sample-000000.mp3
    print("✅ Sample audio downloaded")
    return "sample.mp3"

def load_and_analyze_audio(audio_path):
    """Load audio and display basic info"""
    audio = whisperx.load_audio(audio_path)
    duration = len(audio) / 16000
    print(f"📁 Audio: {Path(audio_path).name}")
    print(f"⏱️ Duration: {duration:.2f} seconds")
    print(f"🎵 Sample rate: 16000 Hz")
    display(Audio(audio_path))
    return audio, duration

def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
    """Transcribe audio using WhisperX (batched inference)"""
    print("\n🎤 STEP 1: Transcribing audio...")
    model = whisperx.load_model(
        model_size,
        CONFIG["device"],
        compute_type=CONFIG["compute_type"]
    )
    transcribe_kwargs = {
        "batch_size": CONFIG["batch_size"]
    }
    if language:
        transcribe_kwargs["language"] = language
    result = model.transcribe(audio, **transcribe_kwargs)
    total_segments = len(result["segments"])
    total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
    del model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    print(f"✅ Transcription complete!")
    print(f"   Language: {result['language']}")
    print(f"   Segments: {total_segments}")
    print(f"   Total text length: {sum(len(seg['text']) for seg in result['segments'])} characters")
    return result
We download a sample audio file, load it for analysis, and then transcribe it with WhisperX. We set up batched inference with our chosen model size and configuration, and we print key details such as the detected language, the number of segments, and the total text length.
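As a quick sanity check, the helpers above can be chained as in the sketch below. It is kept commented out, in the same spirit as the examples at the end of the tutorial; the explicit language="en" override is optional, since WhisperX auto-detects the language when it is omitted.

# Quick usage sketch (uncomment to run; downloads the sample clip):
# audio_path = download_sample_audio()
# audio, duration = load_and_analyze_audio(audio_path)
# result = transcribe_audio(audio, language="en")  # omit language= to auto-detect
# print(result["segments"][0]["text"])             # first transcribed segment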
def align_transcription(segments, audio, language_code):
    """Align transcription for accurate word-level timestamps"""
    print("\n🎯 STEP 2: Aligning for word-level timestamps...")
    try:
        model_a, metadata = whisperx.load_align_model(
            language_code=language_code,
            device=CONFIG["device"]
        )
        result = whisperx.align(
            segments,
            model_a,
            metadata,
            audio,
            CONFIG["device"],
            return_char_alignments=False
        )
        total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
        del model_a
        gc.collect()
        if CONFIG["device"] == "cuda":
            torch.cuda.empty_cache()
        print(f"✅ Alignment complete!")
        print(f"   Aligned words: {total_words}")
        return result
    except Exception as e:
        print(f"⚠️ Alignment failed: {str(e)}")
        print("   Continuing with segment-level timestamps only...")
        return {"segments": segments, "word_segments": []}
We align the transcription to generate precise word-level timestamps. By loading the alignment model and applying it to the audio, we refine timing accuracy, and we report the total number of aligned words while making sure memory is freed for efficient processing.
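To see what the aligned output looks like, here is a small helper sketch: each aligned segment carries a "words" list whose entries hold "word", "start", "end", and "score" fields (occasionally a word lacks timing, so the sketch guards for that). The helper name is our own and is not part of the pipeline above.

def preview_word_timestamps(result, n=5):
    """Sketch: print the first n aligned words with their timings."""
    shown = 0
    for seg in result["segments"]:
        for word in seg.get("words", []):
            start, end = word.get("start"), word.get("end")
            if start is None or end is None:
                continue  # some words come back without timings
            print(f"{word['word']:>15}  {start:6.2f}s -> {end:6.2f}s  (score {word.get('score', 0.0):.2f})")
            shown += 1
            if shown >= n:
                return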
def analyze_transcription(result):
    """Generate statistics about the transcription"""
    print("\n📊 TRANSCRIPTION STATISTICS")
    print("="*70)
    segments = result["segments"]
    total_duration = max(seg["end"] for seg in segments) if segments else 0
    total_words = sum(len(seg.get("words", [])) for seg in segments)
    total_chars = sum(len(seg["text"].strip()) for seg in segments)
    print(f"Total duration: {total_duration:.2f} seconds")
    print(f"Total segments: {len(segments)}")
    print(f"Total words: {total_words}")
    print(f"Total characters: {total_chars}")
    if total_duration > 0:
        print(f"Words per minute: {(total_words / total_duration * 60):.1f}")
    pauses = []
    for i in range(len(segments) - 1):
        pause = segments[i+1]["start"] - segments[i]["end"]
        if pause > 0:
            pauses.append(pause)
    if pauses:
        print(f"Average pause between segments: {sum(pauses)/len(pauses):.2f}s")
        print(f"Longest pause: {max(pauses):.2f}s")
    word_durations = []
    for seg in segments:
        if "words" in seg:
            for word in seg["words"]:
                duration = word["end"] - word["start"]
                word_durations.append(duration)
    if word_durations:
        print(f"Average word duration: {sum(word_durations)/len(word_durations):.3f}s")
    print("="*70)
We analyze the transcription by generating detailed statistics such as total duration, segment count, word count, and character count. We also calculate words per minute, pauses between segments, and average word duration to better understand the pacing and flow of the audio.
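Matplotlib and seaborn were installed at the start but are not used by the pipeline itself; the sketch below is one optional way to visualize the word-level timings. The function name is our own and it is not called anywhere in the pipeline.

import matplotlib.pyplot as plt

def plot_word_durations(result):
    """Optional sketch: histogram of word durations from an aligned result."""
    durations = [
        word["end"] - word["start"]
        for seg in result["segments"]
        for word in seg.get("words", [])
        if "start" in word and "end" in word
    ]
    if not durations:
        print("No word-level timestamps to plot.")
        return
    plt.figure(figsize=(8, 4))
    plt.hist(durations, bins=30, color="steelblue", edgecolor="white")
    plt.xlabel("Word duration (s)")
    plt.ylabel("Count")
    plt.title("Distribution of word durations")
    plt.tight_layout()
    plt.show()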
def display_results(result, show_words=False, max_rows=50):
    """Display transcription results in a formatted table"""
    data = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        start = f"{seg['start']:.2f}s"
        end = f"{seg['end']:.2f}s"
        duration = f"{seg['end'] - seg['start']:.2f}s"
        if show_words and "words" in seg:
            for word in seg["words"]:
                data.append({
                    "Start": f"{word['start']:.2f}s",
                    "End": f"{word['end']:.2f}s",
                    "Duration": f"{word['end'] - word['start']:.3f}s",
                    "Text": word["word"],
                    "Score": f"{word.get('score', 0):.2f}"
                })
        else:
            data.append({
                "Start": start,
                "End": end,
                "Duration": duration,
                "Text": text
            })
    df = pd.DataFrame(data)
    if len(df) > max_rows:
        print(f"Showing first {max_rows} rows of {len(df)} total...")
        display(HTML(df.head(max_rows).to_html(index=False)))
    else:
        display(HTML(df.to_html(index=False)))
    return df
def export_results(result, output_dir="output", filename="transcript"):
    """Export results in multiple formats"""
    os.makedirs(output_dir, exist_ok=True)
    json_path = f"{output_dir}/{filename}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    srt_path = f"{output_dir}/{filename}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
    vtt_path = f"{output_dir}/{filename}.vtt"
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp_vtt(seg["start"])
            end = format_timestamp_vtt(seg["end"])
            f.write(f"{start} --> {end}\n{seg['text'].strip()}\n\n")
    txt_path = f"{output_dir}/{filename}.txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            f.write(f"{seg['text'].strip()}\n")
    csv_path = f"{output_dir}/{filename}.csv"
    df_data = []
    for seg in result["segments"]:
        df_data.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip()
        })
    pd.DataFrame(df_data).to_csv(csv_path, index=False)
    print(f"\n💾 Results exported to '{output_dir}/' directory:")
    print(f"   ✓ {filename}.json (full structured data)")
    print(f"   ✓ {filename}.srt (subtitles)")
    print(f"   ✓ {filename}.vtt (web video subtitles)")
    print(f"   ✓ {filename}.txt (plain text)")
    print(f"   ✓ {filename}.csv (timestamps + text)")
def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_timestamp_vtt(seconds):
    """Convert seconds to VTT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"
def batch_process_files(audio_files, output_dir="batch_output"):
    """Process multiple audio files in batch"""
    print(f"\n📦 Batch processing {len(audio_files)} files...")
    results = {}
    for i, audio_path in enumerate(audio_files, 1):
        print(f"\n[{i}/{len(audio_files)}] Processing: {Path(audio_path).name}")
        try:
            result, _ = process_audio_file(audio_path, show_output=False)
            results[audio_path] = result
            filename = Path(audio_path).stem
            export_results(result, output_dir, filename)
        except Exception as e:
            print(f"❌ Error processing {audio_path}: {str(e)}")
            results[audio_path] = None
    print(f"\n✅ Batch processing complete! Processed {len(results)} files.")
    return results
def extract_keywords(result, top_n=10):
    """Extract the most common words from the transcription"""
    from collections import Counter
    import re
    text = " ".join(seg["text"] for seg in result["segments"])
    words = re.findall(r'\b\w+\b', text.lower())
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
                  'of', 'with', 'is', 'was', 'are', 'were', 'be', 'been', 'being',
                  'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
                  'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'}
    filtered_words = [w for w in words if w not in stop_words and len(w) > 2]
    word_counts = Counter(filtered_words).most_common(top_n)
    print(f"\n🔑 Top {top_n} Keywords:")
    for word, count in word_counts:
        print(f"   {word}: {count}")
    return word_counts
We format the results into clean tables, export transcripts to JSON/SRT/VTT/TXT/CSV, and keep timestamps precise with the helper formatters. We also batch-process multiple audio files end-to-end and extract top keywords, letting us quickly turn raw transcriptions into analysis-ready artifacts.
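The exporter above writes segment-level rows only; if you also want the word-level timings in a spreadsheet, a small extension along the lines below would work. The function name and column names here are our own, not part of the pipeline above.

def export_word_csv(result, output_dir="output", filename="transcript_words"):
    """Optional sketch: write word-level timestamps to a CSV file."""
    os.makedirs(output_dir, exist_ok=True)
    rows = [
        {
            "start": word.get("start"),
            "end": word.get("end"),
            "word": word.get("word", "").strip(),
            "score": word.get("score"),
        }
        for seg in result["segments"]
        for word in seg.get("words", [])
    ]
    csv_path = f"{output_dir}/{filename}.csv"
    pd.DataFrame(rows).to_csv(csv_path, index=False)
    print(f"✓ Word-level CSV written to {csv_path}")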
def process_audio_file(audio_path, show_output=True, analyze=True):
    """Complete WhisperX pipeline"""
    if show_output:
        print("="*70)
        print("🎵 WhisperX Advanced Tutorial")
        print("="*70)
    audio, duration = load_and_analyze_audio(audio_path)
    result = transcribe_audio(audio, CONFIG["model_size"], CONFIG["language"])
    aligned_result = align_transcription(
        result["segments"],
        audio,
        result["language"]
    )
    if analyze and show_output:
        analyze_transcription(aligned_result)
        extract_keywords(aligned_result)
    if show_output:
        print("\n" + "="*70)
        print("📋 TRANSCRIPTION RESULTS")
        print("="*70)
        df = display_results(aligned_result, show_words=False)
        export_results(aligned_result)
    else:
        df = None
    return aligned_result, df

# Example 1: Process the sample audio
# audio_path = download_sample_audio()
# result, df = process_audio_file(audio_path)

# Example 2: Show word-level details
# result, df = process_audio_file(audio_path)
# word_df = display_results(result, show_words=True)

# Example 3: Process your own audio
# audio_path = "your_audio.wav"  # or .mp3, .m4a, etc.
# result, df = process_audio_file(audio_path)

# Example 4: Batch process multiple files
# audio_files = ["audio1.mp3", "audio2.wav", "audio3.m4a"]
# results = batch_process_files(audio_files)

# Example 5: Use a larger model for better accuracy
# CONFIG["model_size"] = "large-v2"
# result, df = process_audio_file("audio.mp3")

print("\n✨ Setup complete! Uncomment the examples above to run.")
We run the full WhisperX pipeline end-to-end, loading the audio, transcribing it, and aligning it for word-level timestamps. When enabled, we analyze the statistics, extract keywords, render a clean results table, and export everything to multiple formats, ready for real use.
In conclusion, we built a complete WhisperX pipeline that not only transcribes audio but also aligns it with precise word-level timestamps. We export the results in multiple formats, process files in batches, and analyze patterns to make the output more meaningful. With this, we have a flexible, ready-to-use workflow for transcription and audio analysis on Colab, and we are ready to extend it further into real-world projects.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.