In this tutorial, we walk through an advanced implementation of WhisperX, exploring transcription, alignment, and word-level timestamps in detail. We set up the environment, load and preprocess the audio, and then run the full pipeline, from transcription to alignment and analysis, while keeping memory usage in check and supporting batch processing. Along the way, we also visualize results, export them in multiple formats, and even extract keywords to gain deeper insights from the audio content.
!pip install -q git+https://github.com/m-bain/whisperX.git
!pip install -q pandas matplotlib seaborn

import whisperx
import torch
import gc
import os
import json
import pandas as pd
from pathlib import Path
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')

CONFIG = {
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "compute_type": "float16" if torch.cuda.is_available() else "int8",
    "batch_size": 16,
    "model_size": "base",
    "language": None,
}

print(f"🚀 Running on: {CONFIG['device']}")
print(f"📊 Compute type: {CONFIG['compute_type']}")
print(f"🎯 Model: {CONFIG['model_size']}")
We begin by installing WhisperX along with the essential libraries and then configure our setup. We detect whether CUDA is available, select the compute type, and set parameters such as batch size, model size, and language to prepare for transcription.
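If you land on a CPU-only runtime, a smaller checkpoint and batch size keep memory and runtime manageable. The values below are an optional, illustrative tweak on top of the configuration above, not part of the original pipeline.

# Optional: scale the configuration down on CPU-only runtimes.
# These values are illustrative defaults; tune them for your hardware.
if CONFIG["device"] == "cpu":
    CONFIG["model_size"] = "tiny"   # smaller Whisper checkpoint
    CONFIG["batch_size"] = 4        # reduce memory pressure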
def download_sample_audio():
    """Download a sample audio file for testing"""
    !wget -q -O sample.mp3 https://github.com/mozilla-extensions/speaktome/raw/master/content/cv-valid-dev/sample-000000.mp3
    print("✅ Sample audio downloaded")
    return "sample.mp3"

def load_and_analyze_audio(audio_path):
    """Load audio and display basic info"""
    audio = whisperx.load_audio(audio_path)
    duration = len(audio) / 16000
    print(f"📁 Audio: {Path(audio_path).name}")
    print(f"⏱️ Duration: {duration:.2f} seconds")
    print(f"🎵 Sample rate: 16000 Hz")
    display(Audio(audio_path))
    return audio, duration

def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
    """Transcribe audio using WhisperX (batched inference)"""
    print("\n🎤 STEP 1: Transcribing audio...")
    model = whisperx.load_model(
        model_size,
        CONFIG["device"],
        compute_type=CONFIG["compute_type"]
    )
    transcribe_kwargs = {
        "batch_size": CONFIG["batch_size"]
    }
    if language:
        transcribe_kwargs["language"] = language
    result = model.transcribe(audio, **transcribe_kwargs)
    total_segments = len(result["segments"])
    total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
    del model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    print(f"✅ Transcription complete!")
    print(f"   Language: {result['language']}")
    print(f"   Segments: {total_segments}")
    print(f"   Total text length: {sum(len(seg['text']) for seg in result['segments'])} characters")
    return result
We download a sample audio file, load it for analysis, and then transcribe it with WhisperX. We set up batched inference with our chosen model size and configuration, and we print key details such as the detected language, the number of segments, and the total text length.
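As a quick sanity check, the helpers above can be chained as in the sketch below. It is kept commented out, in the same spirit as the examples at the end of the tutorial; the explicit language="en" override is optional, since WhisperX auto-detects the language when it is omitted.

# Quick usage sketch (uncomment to run; downloads the sample clip):
# audio_path = download_sample_audio()
# audio, duration = load_and_analyze_audio(audio_path)
# result = transcribe_audio(audio, language="en")  # omit language= to auto-detect
# print(result["segments"][0]["text"])             # first transcribed segment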
def align_transcription(segments, audio, language_code):
    """Align transcription for accurate word-level timestamps"""
    print("\n🎯 STEP 2: Aligning for word-level timestamps...")
    try:
        model_a, metadata = whisperx.load_align_model(
            language_code=language_code,
            device=CONFIG["device"]
        )
        result = whisperx.align(
            segments,
            model_a,
            metadata,
            audio,
            CONFIG["device"],
            return_char_alignments=False
        )
        total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
        del model_a
        gc.collect()
        if CONFIG["device"] == "cuda":
            torch.cuda.empty_cache()
        print(f"✅ Alignment complete!")
        print(f"   Aligned words: {total_words}")
        return result
    except Exception as e:
        print(f"⚠️ Alignment failed: {str(e)}")
        print("   Continuing with segment-level timestamps only...")
        return {"segments": segments, "word_segments": []}
We align the transcription to generate precise word-level timestamps. By loading the alignment model and applying it to the audio, we refine timing accuracy, and we report the total number of aligned words while making sure memory is freed for efficient processing.
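To see what the aligned output looks like, here is a small helper sketch: each aligned segment carries a "words" list whose entries hold "word", "start", "end", and "score" fields (occasionally a word lacks timing, so the sketch guards for that). The helper name is our own and is not part of the pipeline above.

def preview_word_timestamps(result, n=5):
    """Sketch: print the first n aligned words with their timings."""
    shown = 0
    for seg in result["segments"]:
        for word in seg.get("words", []):
            start, end = word.get("start"), word.get("end")
            if start is None or end is None:
                continue  # some words come back without timings
            print(f"{word['word']:>15}  {start:6.2f}s -> {end:6.2f}s  (score {word.get('score', 0.0):.2f})")
            shown += 1
            if shown >= n:
                return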
def analyze_transcription(result):
    """Generate statistics about the transcription"""
    print("\n📊 TRANSCRIPTION STATISTICS")
    print("="*70)
    segments = result["segments"]
    total_duration = max(seg["end"] for seg in segments) if segments else 0
    total_words = sum(len(seg.get("words", [])) for seg in segments)
    total_chars = sum(len(seg["text"].strip()) for seg in segments)
    print(f"Total duration: {total_duration:.2f} seconds")
    print(f"Total segments: {len(segments)}")
    print(f"Total words: {total_words}")
    print(f"Total characters: {total_chars}")
    if total_duration > 0:
        print(f"Words per minute: {(total_words / total_duration * 60):.1f}")
    pauses = []
    for i in range(len(segments) - 1):
        pause = segments[i+1]["start"] - segments[i]["end"]
        if pause > 0:
            pauses.append(pause)
    if pauses:
        print(f"Average pause between segments: {sum(pauses)/len(pauses):.2f}s")
        print(f"Longest pause: {max(pauses):.2f}s")
    word_durations = []
    for seg in segments:
        if "words" in seg:
            for word in seg["words"]:
                duration = word["end"] - word["start"]
                word_durations.append(duration)
    if word_durations:
        print(f"Average word duration: {sum(word_durations)/len(word_durations):.3f}s")
    print("="*70)
We analyze the transcription by generating detailed statistics such as total duration, segment count, word count, and character count. We also calculate words per minute, pauses between segments, and average word duration to better understand the pacing and flow of the audio.
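Matplotlib and seaborn were installed at the start but are not used by the pipeline itself; the sketch below is one optional way to visualize the word-level timings. The function name is our own and it is not called anywhere in the pipeline.

import matplotlib.pyplot as plt

def plot_word_durations(result):
    """Optional sketch: histogram of word durations from an aligned result."""
    durations = [
        word["end"] - word["start"]
        for seg in result["segments"]
        for word in seg.get("words", [])
        if "start" in word and "end" in word
    ]
    if not durations:
        print("No word-level timestamps to plot.")
        return
    plt.figure(figsize=(8, 4))
    plt.hist(durations, bins=30, color="steelblue", edgecolor="white")
    plt.xlabel("Word duration (s)")
    plt.ylabel("Count")
    plt.title("Distribution of word durations")
    plt.tight_layout()
    plt.show()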
def display_results(result, show_words=False, max_rows=50):
    """Display transcription results in a formatted table"""
    data = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        start = f"{seg['start']:.2f}s"
        end = f"{seg['end']:.2f}s"
        duration = f"{seg['end'] - seg['start']:.2f}s"
        if show_words and "words" in seg:
            for word in seg["words"]:
                data.append({
                    "Start": f"{word['start']:.2f}s",
                    "End": f"{word['end']:.2f}s",
                    "Duration": f"{word['end'] - word['start']:.3f}s",
                    "Text": word["word"],
                    "Score": f"{word.get('score', 0):.2f}"
                })
        else:
            data.append({
                "Start": start,
                "End": end,
                "Duration": duration,
                "Text": text
            })
    df = pd.DataFrame(data)
    if len(df) > max_rows:
        print(f"Showing first {max_rows} rows of {len(df)} total...")
        display(HTML(df.head(max_rows).to_html(index=False)))
    else:
        display(HTML(df.to_html(index=False)))
    return df
def export_results(result, output_dir="output", filename="transcript"):
    """Export results in multiple formats"""
    os.makedirs(output_dir, exist_ok=True)
    json_path = f"{output_dir}/{filename}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    srt_path = f"{output_dir}/{filename}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
    vtt_path = f"{output_dir}/{filename}.vtt"
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp_vtt(seg["start"])
            end = format_timestamp_vtt(seg["end"])
            f.write(f"{start} --> {end}\n{seg['text'].strip()}\n\n")
    txt_path = f"{output_dir}/{filename}.txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            f.write(f"{seg['text'].strip()}\n")
    csv_path = f"{output_dir}/{filename}.csv"
    df_data = []
    for seg in result["segments"]:
        df_data.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip()
        })
    pd.DataFrame(df_data).to_csv(csv_path, index=False)
    print(f"\n💾 Results exported to '{output_dir}/' directory:")
    print(f"   ✓ {filename}.json (full structured data)")
    print(f"   ✓ {filename}.srt (subtitles)")
    print(f"   ✓ {filename}.vtt (web video subtitles)")
    print(f"   ✓ {filename}.txt (plain text)")
    print(f"   ✓ {filename}.csv (timestamps + text)")
def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_timestamp_vtt(seconds):
    """Convert seconds to VTT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"
def batch_process_files(audio_files, output_dir="batch_output"):
    """Process multiple audio files in batch"""
    print(f"\n📦 Batch processing {len(audio_files)} files...")
    results = {}
    for i, audio_path in enumerate(audio_files, 1):
        print(f"\n[{i}/{len(audio_files)}] Processing: {Path(audio_path).name}")
        try:
            result, _ = process_audio_file(audio_path, show_output=False)
            results[audio_path] = result
            filename = Path(audio_path).stem
            export_results(result, output_dir, filename)
        except Exception as e:
            print(f"❌ Error processing {audio_path}: {str(e)}")
            results[audio_path] = None
    print(f"\n✅ Batch processing complete! Processed {len(results)} files.")
    return results
def extract_keywords(result, top_n=10):
    """Extract the most common words from the transcription"""
    from collections import Counter
    import re
    text = " ".join(seg["text"] for seg in result["segments"])
    words = re.findall(r'\b\w+\b', text.lower())
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
                  'of', 'with', 'is', 'was', 'are', 'were', 'be', 'been', 'being',
                  'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
                  'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'}
    filtered_words = [w for w in words if w not in stop_words and len(w) > 2]
    word_counts = Counter(filtered_words).most_common(top_n)
    print(f"\n🔑 Top {top_n} Keywords:")
    for word, count in word_counts:
        print(f"   {word}: {count}")
    return word_counts
We format the results into clean tables, export transcripts to JSON/SRT/VTT/TXT/CSV, and keep timestamps precise with the helper formatters. We also batch-process multiple audio files end-to-end and extract top keywords, letting us quickly turn raw transcriptions into analysis-ready artifacts.
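The exporter above writes segment-level rows only; if you also want the word-level timings in a spreadsheet, a small extension along the lines below would work. The function name and column names here are our own, not part of the pipeline above.

def export_word_csv(result, output_dir="output", filename="transcript_words"):
    """Optional sketch: write word-level timestamps to a CSV file."""
    os.makedirs(output_dir, exist_ok=True)
    rows = [
        {
            "start": word.get("start"),
            "end": word.get("end"),
            "word": word.get("word", "").strip(),
            "score": word.get("score"),
        }
        for seg in result["segments"]
        for word in seg.get("words", [])
    ]
    csv_path = f"{output_dir}/{filename}.csv"
    pd.DataFrame(rows).to_csv(csv_path, index=False)
    print(f"✓ Word-level CSV written to {csv_path}")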
def process_audio_file(audio_path, show_output=True, analyze=True):
    """Complete WhisperX pipeline"""
    if show_output:
        print("="*70)
        print("🎵 WhisperX Advanced Tutorial")
        print("="*70)
    audio, duration = load_and_analyze_audio(audio_path)
    result = transcribe_audio(audio, CONFIG["model_size"], CONFIG["language"])
    aligned_result = align_transcription(
        result["segments"],
        audio,
        result["language"]
    )
    if analyze and show_output:
        analyze_transcription(aligned_result)
        extract_keywords(aligned_result)
    if show_output:
        print("\n" + "="*70)
        print("📋 TRANSCRIPTION RESULTS")
        print("="*70)
        df = display_results(aligned_result, show_words=False)
        export_results(aligned_result)
    else:
        df = None
    return aligned_result, df

# Example 1: Process the sample audio
# audio_path = download_sample_audio()
# result, df = process_audio_file(audio_path)

# Example 2: Show word-level details
# result, df = process_audio_file(audio_path)
# word_df = display_results(result, show_words=True)

# Example 3: Process your own audio
# audio_path = "your_audio.wav"  # or .mp3, .m4a, etc.
# result, df = process_audio_file(audio_path)

# Example 4: Batch process multiple files
# audio_files = ["audio1.mp3", "audio2.wav", "audio3.m4a"]
# results = batch_process_files(audio_files)

# Example 5: Use a larger model for better accuracy
# CONFIG["model_size"] = "large-v2"
# result, df = process_audio_file("audio.mp3")

print("\n✨ Setup complete! Uncomment the examples above to run.")
We run the full WhisperX pipeline end-to-end, loading the audio, transcribing it, and aligning it for word-level timestamps. When enabled, we analyze the statistics, extract keywords, render a clean results table, and export everything to multiple formats, ready for real use.
In conclusion, we built a complete WhisperX pipeline that not only transcribes audio but also aligns it with precise word-level timestamps. We export the results in multiple formats, process files in batches, and analyze patterns to make the output more meaningful. With this, we have a flexible, ready-to-use workflow for transcription and audio analysis on Colab, and we are ready to extend it further into real-world projects.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.