Language models have been evolving rapidly. Now, with Multimodal LLMs taking the forefront of this race, it is important to understand how we can leverage their capabilities. We are transitioning from traditional text-based AI-powered chatbots to voice-based ones that act as personal assistants, available at a moment's notice to tend to our needs. In this blog, we'll build an Emergency Operator voice-based chatbot. The idea is pretty simple:
- We speak to the chatbot
- It listens to and understands what we've said
- It responds with a voice note

Our Use-Case
Let's consider a real-world scenario. We live in a country with over 1.4 billion people, and with such a huge population, emergencies are bound to happen, whether it's a medical issue, a fire breakout, police intervention, or even mental health support such as suicide prevention.
In such moments, every second counts. Given the shortage of emergency operators and the overwhelming volume of issues raised, a voice-based chatbot can make a big difference by offering quick, spoken assistance when people need it most.
- Emergency Help: Immediate assistance for health, fire, crime, or disaster-related queries without waiting for a human operator (when one is not available).
- Mental Health Helpline: A voice-based emotional support assistant that guides users with compassion.
- Rural Accessibility: Areas with limited access to mobile apps can benefit from a simple voice interface, since people in such areas often communicate by speaking.
That's exactly what we're going to build. We will act as someone seeking help, and the chatbot will play the role of an emergency responder, powered by a large language model.
To implement our voice chatbot, we will use the AI models mentioned below:
- Whisper (Large) – OpenAI's speech-to-text model, running via GroqCloud, to convert voice into text.
- GPT-4.1-mini – Powered by CometAPI (a free LLM provider), this is the brain of our chatbot that understands our queries and generates meaningful responses.
- Google Text-to-Speech (gTTS) – Converts the chatbot's responses back into voice so it can talk to us.
- FFmpeg – A handy library that helps us record and manage audio easily.
Requirements
Before we start coding, we need to set up a few things:
- GroqCloud API Key: Get it from here: https://console.groq.com/keys
- CometAPI Key: Register and store your API key from: https://api.cometapi.com/
- ElevenLabs API Key: Register and store your API key from: https://elevenlabs.io/app/home
- FFmpeg Installation: If you don't already have it, follow this guide to install FFmpeg on your system: https://itsfoss.com/ffmpeg/

Confirm the installation by typing "ffmpeg -version" in your terminal.
Once these are set up, you're ready to dive into building your very own voice-enabled chatbot!
Project Structure
The project structure will be quite simple and rudimentary; most of our work will happen in the app.py and utils.py Python scripts.
```
VOICE-CHATBOT/
├── venv/              # Virtual environment for dependencies
├── .env               # Environment variables (API keys, etc.)
├── app.py             # Main application script
├── emergency.png      # Emergency-related image asset
├── README.md          # Project documentation (optional)
├── requirements.txt   # Python dependencies
└── utils.py           # Utility/helper functions
```
There are a couple of files to modify to make sure all our dependencies are satisfied:
In the .env file
```
# The .env entry is truncated in the original post; these are the keys the code
# below reads (COMET_API_KEY appears in the getenv calls; the ElevenLabs key
# name is an assumption for the TTS client).
GROQ_API_KEY = "<your-groq-api-key>"
COMET_API_KEY = "<your-cometapi-key>"
ELEVENLABS_API_KEY = "<your-elevenlabs-api-key>"
```
In the requirements.txt
```
ffmpeg-python
pydub
pyttsx3
langchain
langchain-community
langchain-core
langchain-groq
langchain_openai
python-dotenv
streamlit==1.37.0
audio-recorder-streamlit
dotenv
elevenlabs
gtts
```
Setting Up the Virtual Environment
We will also set up a virtual environment (a good practice). We will do this in the terminal.
- Create the virtual environment:
```
~/Desktop/Emergency-Voice-Chatbot$ conda create -p venv python==3.12 -y
```
- Activate the virtual environment:
```
~/Desktop/Emergency-Voice-Chatbot$ conda activate venv/
```
- After you finish running the application, you can deactivate the virtual environment:
```
~/Desktop/Emergency-Voice-Chatbot$ conda deactivate
```
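With the environment active, install the dependencies. The post doesn't show this step explicitly, but assuming the requirements.txt above, it is the standard:
```
~/Desktop/Emergency-Voice-Chatbot$ pip install -r requirements.txt
```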

Main Python Scripts
Let's first explore the utils.py script.
1. Main Imports
- time, tempfile, os, re, BytesIO – Handle timing, temporary files, environment variables, regex, and in-memory files.
- requests – Makes HTTP requests (e.g., calling APIs).
- gTTS, elevenlabs, pydub – Convert text to speech, speech to text, and play/manipulate audio.
- groq, langchain_* – Use Groq/OpenAI LLMs with LangChain to process and generate text.
- streamlit – Build interactive web apps.
- dotenv – Load environment variables (like API keys) from a .env file.
```python
import time
import requests
import tempfile
import re
from io import BytesIO
from gtts import gTTS
from elevenlabs.client import ElevenLabs
from elevenlabs import play
from pydub import AudioSegment
from groq import Groq
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import streamlit as st
import os
from dotenv import load_dotenv

load_dotenv()
```
2. Load your API keys and initialize your models
```python
# Initialize the Groq client (used for Whisper speech-to-text)
client = Groq(api_key=os.getenv('GROQ_API_KEY'))

# Initialize the LLM (GPT-4.1-mini served through CometAPI's OpenAI-compatible endpoint)
llm = ChatOpenAI(
    model_name="gpt-4.1-mini",
    openai_api_key=os.getenv("COMET_API_KEY"),
    openai_api_base="https://api.cometapi.com/v1"
)

# Set the path to the ffmpeg executable (adjust if ffmpeg lives elsewhere on your system)
AudioSegment.converter = "/bin/ffmpeg"
```
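Note that create_welcome_message further below uses a tts_client that the post never shows being created; presumably it is initialized here alongside the other clients. A minimal sketch, assuming the ELEVENLABS_API_KEY name from the .env file above:
```python
# Assumed initialization: the ElevenLabs TTS client used by create_welcome_message
tts_client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
```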
3. Converting the audio file (our voice recording) into .wav format
Here, we read our recorded audio bytes with BytesIO and AudioSegment and convert them into WAV format:
```python
def audio_bytes_to_wav(audio_bytes):
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_wav:
            audio = AudioSegment.from_file(BytesIO(audio_bytes))
            # Downsample to reduce file size if needed
            audio = audio.set_frame_rate(16000).set_channels(1)
            audio.export(temp_wav.name, format="wav")
            return temp_wav.name
    except Exception as e:
        st.error(f"Error during WAV file conversion: {e}")
        return None
```
4. Splitting Audio
We'll write a function to split our audio into chunks according to the input parameter (chunk_length_ms). We'll also write a function that strips punctuation with the help of regex.
```python
def split_audio(file_path, chunk_length_ms):
    audio = AudioSegment.from_wav(file_path)
    # Slice the audio into fixed-length chunks
    return [audio[i:i + chunk_length_ms] for i in range(0, len(audio), chunk_length_ms)]

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)
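```
The app.py script later calls a speech_to_text helper that the post doesn't show. A minimal sketch of what it might look like, assuming Groq's whisper-large-v3 transcription endpoint and the split_audio helper above (the chunk length and exact names are our assumptions, not the repository's code):
```python
def speech_to_text(audio_bytes):
    """Transcribe recorded audio with Whisper (Large) on GroqCloud (sketch)."""
    wav_path = audio_bytes_to_wav(audio_bytes)
    if wav_path is None:
        return None
    try:
        texts = []
        # Split long recordings into 5-minute chunks before transcribing
        for chunk in split_audio(wav_path, chunk_length_ms=5 * 60 * 1000):
            with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
                chunk.export(tmp.name, format="wav")
            with open(tmp.name, "rb") as f:
                result = client.audio.transcriptions.create(
                    file=(os.path.basename(tmp.name), f.read()),
                    model="whisper-large-v3",
                )
            texts.append(result.text)
        return " ".join(texts)
    except Exception as e:
        st.error(f"Error during speech-to-text: {e}")
        return None
```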
5. LLM Response Generation
Now for the main responder functionality, where the LLM generates an apt response to our queries. In the prompt template, we provide instructions to the LLM on how it should respond. We implement this with the LangChain Expression Language (LCEL), piping the prompt, model, and output parser together.
```python
def get_llm_response(query, chat_history):
    try:
        template = """
        You are an experienced Emergency Response Phone Operator trained to handle critical situations in India.
        Your role is to guide users calmly and clearly during emergencies involving:
        - Medical crises (accidents, heart attacks, etc.)
        - Fire incidents
        - Police/law enforcement assistance
        - Suicide prevention or mental health crises

        You must:
        1. **Remain calm and assertive**, as if speaking on a phone call.
        2. **Ask for and confirm key details** like location, condition of the person, number of people involved, etc.
        3. **Provide immediate and practical steps** the user can take before help arrives.
        4. **Share accurate, India-based emergency helpline numbers** (e.g., 112, 102, 108, 1091, 1098, 9152987821, etc.).
        5. **Prioritize user safety**, and clearly instruct them what *not* to do as well.
        6. If the situation involves **suicidal thoughts or mental distress**, respond with compassion and direct them to appropriate mental health helplines and safety actions.

        If the user's query is not related to an emergency, respond with:
        "I can only assist with urgent emergency-related issues. Please contact a general support line for non-emergency questions."

        Use an authoritative, supportive tone, short and direct sentences, and tailor your guidance to **urban and rural Indian contexts**.

        **Chat History:** {chat_history}
        **User:** {user_query}
        """
        prompt = ChatPromptTemplate.from_template(template)
        chain = prompt | llm | StrOutputParser()
        response_gen = chain.stream({
            "chat_history": chat_history,
            "user_query": query
        })
        response_text = "".join(list(response_gen))
        response_text = remove_punctuation(response_text)
        # Remove repeated lines
        response_lines = response_text.split('\n')
        unique_lines = list(dict.fromkeys(response_lines))  # Drop duplicates while preserving order
        cleaned_response = "\n".join(unique_lines)
        return cleaned_response
    except Exception as e:
        st.error(f"Error during LLM response generation: {e}")
        return "Error"
```
6. Text to Speech
We'll build a function that converts our text to speech with the help of the ElevenLabs TTS client, which returns the audio. We could also use other TTS models, such as Nari Labs' Dia or Google's gTTS. ElevenLabs gives us some free credits to start with, after which we have to pay for more; gTTS, on the other hand, is completely free to use.
```python
def text_to_speech(text: str, retries: int = 3, delay: int = 5):
    attempt = 0
    while attempt < retries:
        # ... (the rest of the body is cut off in the original post)
```
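Since the body above is truncated, here is a minimal sketch of how the retry loop might be completed, assuming the same ElevenLabs client, voice, and model settings used in create_welcome_message below:
```python
def text_to_speech(text: str, retries: int = 3, delay: int = 5):
    attempt = 0
    while attempt < retries:
        try:
            # Synthesize speech with ElevenLabs (streaming generator of MP3 bytes)
            response_stream = tts_client.text_to_speech.convert(
                text=text,
                voice_id="JBFqnCBsd6RMkjVDRZzb",
                model_id="eleven_multilingual_v2",
                output_format="mp3_44100_128",
            )
            # Save the streamed bytes to a temp file and return its path
            with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as f:
                for chunk in response_stream:
                    f.write(chunk)
            return f.name
        except Exception:
            attempt += 1
            time.sleep(delay)  # Back off before retrying
    st.error("Text-to-speech failed after all retries.")
    return None
```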
7. Create Introductory Message
We will also create an introductory text and pass it to our TTS model, since a respondent would normally introduce themselves and ask what assistance the user might need. The function returns the path of the generated MP3 file.
If you opt for gTTS instead of ElevenLabs, two parameters matter (a gTTS variant is sketched after the function below):
- lang="en" -> English
- tld="co.in" -> produces different localized 'accents' for a given language; the default is "com"
```python
def create_welcome_message():
    welcome_text = (
        "Hello, you've reached the Emergency Help Desk. "
        "Please let me know if it's a medical, fire, police, or mental health emergency—"
        "I am here to guide you right away."
    )
    try:
        # Request speech synthesis (streaming generator)
        response_stream = tts_client.text_to_speech.convert(
            text=welcome_text,
            voice_id="JBFqnCBsd6RMkjVDRZzb",
            model_id="eleven_multilingual_v2",
            output_format="mp3_44100_128",
        )
        # Save the streamed bytes to a temp file
        with tempfile.NamedTemporaryFile(delete=False, suffix='.mp3') as f:
            for chunk in response_stream:
                f.write(chunk)
        return f.name
    except requests.ConnectionError:
        st.error("Failed to generate the welcome message due to a connection error.")
    except Exception as e:
        st.error(f"Error creating welcome message: {e}")
    return None
```
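For reference, a completely free alternative using gTTS with the lang and tld parameters described above might look like this (a minimal sketch, not the repository's code):
```python
def create_welcome_message_gtts():
    welcome_text = "Hello, you've reached the Emergency Help Desk."
    # lang="en" selects English; tld="co.in" gives an Indian-English accent
    tts = gTTS(text=welcome_text, lang="en", tld="co.in")
    with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as f:
        tts.save(f.name)
    return f.name
```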
Streamlit App
Now, let's jump into the app.py script, where we will use Streamlit to visualize our chatbot.
Import Libraries and Functions
Import our libraries and the functions we built in utils.py:
```python
import tempfile
import re  # This can be removed if not used
from io import BytesIO
from pydub import AudioSegment
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import streamlit as st
from audio_recorder_streamlit import audio_recorder
from utils import *
```
Streamlit Setup
Now, we'll set our page title and a nice "Emergency" visual image.
```python
st.title(":blue[Emergency Help Bot] 🚨🚑🆘")
st.sidebar.image('./emergency.jpg', use_column_width=True)
```
We'll set our session state variables to keep track of our chats and audio.
```python
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
if "chat_histories" not in st.session_state:
    st.session_state.chat_histories = []
if "played_audios" not in st.session_state:
    st.session_state.played_audios = {}
```
Invoking our utils functions
We'll create our welcome message introduction from the respondent's side. This will be the start of our conversation.
```python
if len(st.session_state.chat_history) == 0:
    welcome_audio_path = create_welcome_message()
    st.session_state.chat_history = [
        AIMessage(content="Hello, you've reached the Emergency Help Desk. Please let me know if it's a medical, fire, police, or mental health emergency—I'm here to guide you right away.", audio_file=welcome_audio_path)
    ]
    st.session_state.played_audios[welcome_audio_path] = False
```
Now, in the sidebar, we'll set up our voice recorder along with the speech-to-text, LLM response, and text-to-speech logic, which is the main crux of this project.
```python
with st.sidebar:
    audio_bytes = audio_recorder(
        energy_threshold=0.01,
        pause_threshold=0.8,
        text="Speak on clicking the ICON (Max 5 min) \n",
        recording_color="#e9b61d",   # yellow
        neutral_color="#2abf37",     # green
        icon_name="microphone",
        icon_size="2x"
    )
    if audio_bytes:
        temp_audio_path = audio_bytes_to_wav(audio_bytes)
        if temp_audio_path:
            try:
                user_input = speech_to_text(audio_bytes)
                if user_input:
                    st.session_state.chat_history.append(HumanMessage(content=user_input, audio_file=temp_audio_path))
                    response = get_llm_response(user_input, st.session_state.chat_history)
                    audio_response = text_to_speech(response)
                    # The snippet is cut off here in the post; presumably the AI reply
                    # is appended to the chat history before the except clause:
                    st.session_state.chat_history.append(AIMessage(content=response, audio_file=audio_response))
            except Exception as e:
                st.error(f"Error processing audio: {e}")
```
We will also set up a button in the sidebar that lets us restart our session if need be, replaying the introductory voice note from the respondent's side.
```python
if st.button("Start New Chat"):
    st.session_state.chat_histories.append(st.session_state.chat_history)
    welcome_audio_path = create_welcome_message()
    st.session_state.chat_history = [
        AIMessage(content="Hello, you've reached the Emergency Help Desk. Please let me know if it's a medical, fire, police, or mental health emergency—I'm here to guide you right away.", audio_file=welcome_audio_path)
    ]
```
And on the main page of our app, we will visualize our chat history as click-to-play audio files.
```python
for msg in st.session_state.chat_history:
    if isinstance(msg, AIMessage):
        with st.chat_message("AI"):
            st.audio(msg.audio_file, format="audio/mp3")
    else:  # HumanMessage
        with st.chat_message("user"):
            st.audio(msg.audio_file, format="audio/wav")
```
Now we're done with all the Python scripts needed to run our app. We'll launch the Streamlit app with the following command:
```
streamlit run app.py
```
So, this is what our project workflow looks like:
[User speaks] → audio_recorder → audio_bytes_to_wav → speech_to_text → get_llm_response → text_to_speech → st.audio
For the complete code, visit this GitHub repository.
Final Output

The Streamlit app looks quite clean and functions correctly!
Let's look at some of its responses:
- User: Hi, someone is having a heart attack right now, what should I do?
We then had a conversation about the location and the state of the person, after which the chatbot provided its guidance.
- User: Hello, there has been a huge fire breakout in Delhi. Please send help quick.
The respondent enquires about the situation and my current location, then proceeds to give preventive measures accordingly.
- User: Hey there, there's a person standing alone along the edge of the bridge. How should I proceed?
The respondent enquires about my location and the mental state of the person I mentioned.
Overall, our chatbot is able to respond to our queries according to the situation and ask relevant questions in order to suggest preventive measures.
Read More: How to build a chatbot in Python?
What Improvements Can Be Made?
- Multilingual Support: Integrating LLMs with strong multilingual capabilities would allow the chatbot to interact seamlessly with users across different regions and dialects.
- Real-Time Transcription and Translation: Adding real-time speech-to-text and translation can help bridge communication gaps.
- Location-Based Services: By integrating GPS or other real-time location-based APIs, the system could detect a user's location and direct them to the nearest emergency facilities.
- Speech-to-Speech Interaction: We could also use speech-to-speech models, which would make conversations feel more natural since they are built for exactly this kind of interaction.
- Fine-Tuning the LLM: Custom fine-tuning of the LLM on emergency-specific data can improve its understanding and yield more accurate responses.
Conclusion
In this article, we successfully built a voice-based emergency response chatbot using a combination of AI models and a few relevant tools. The chatbot replicates the role of a trained emergency operator capable of handling high-stress situations, from medical crises and fire incidents to mental health support, in a calm, assertive tone. The system prompt tailors the behavior of our LLM to diverse real-world emergencies, making the experience lifelike for both urban and rural scenarios.