If you've ever tried to move your LLM prototype out of your Jupyter notebook and into the real world, then you already know it's not as simple as clicking "Run." I remember the first time I tried hosting an LLM endpoint; it was messy. The model ran fine locally, but once users started sending multiple requests… everything broke.
That's when I realized what I was missing wasn't ML knowledge, but LLMOps.
This guide is my attempt to take you from zero to LLMOps Hero: a complete walkthrough of how to run LLMs in production using LangChain, FastAPI, Docker, and a conceptual overview of AWS deployment.
We'll build a small but functional RAG chatbot as an ideal entry point to understand the LLMOps lifecycle.
What Exactly is LLMOps?
Simply put, LLMOps is the extension of MLOps for Large Language Models.
It's everything that happens after you've trained or chosen your model: inference optimization, monitoring, prompt orchestration, data pipelines, deployment, scaling, and governance.
Think of it as the bridge between your model and the real world, ensuring reliability, scalability, observability, and compliance.
Here's the high-level LLMOps pipeline we'll cover:
- Data ingestion & vectorization
- Understanding the Dataset
- Building the Vector Store
- Building and Running the RAG Chatbot
- Prompt orchestration using LangChain
- Serving via FastAPI
- Chat UI with Streamlit
- Public Access via Ngrok
- Containerization using Docker
- Deployment concepts with AWS
- Monitoring, versioning, evaluation, and governance
Basic Setup
Before we start implementing anything, let's do the basic setup by installing the necessary libraries. I'm using Google Colab as the programming environment for this project. Copy the following code to set up the libraries:
# Note: in Colab, each ! command runs in its own shell, so the cd/activate lines mainly matter if you run this locally
!mkdir llmops-demo
!cd llmops-demo
!python3 -m venv venv
!source venv/bin/activate
!pip install langchain openai faiss-cpu fastapi uvicorn python-dotenv langchain-community
!pip install nest_asyncio pyngrok streamlit
We're also going to use OpenAI and ngrok for this project. You can get their API access tokens/keys from these links:
Note that the free version of their API/auth tokens is enough for this blog. Write the following code in Colab to set up these tokens:
import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass("Enter OPENAI KEY:")
os.environ['NGROK_AUTHTOKEN'] = getpass("ENTER NGROK TOKEN:")
Data Ingestion and Vectorization
Understanding the Dataset
Before building anything, let's start with the data, because that's really what gives your RAG system its brain. I used a small Kaggle dataset called the Sample RAG Knowledge Item Dataset. It's simple, clean, and ideal for learning. Each row is like a mini "IT helpdesk" note: a short piece of knowledge under a specific topic.
You'll find two main columns:
- ki_text → the actual content (like "Steps to troubleshoot your VPN connection")
- ki_topic → the topic label, such as "Networking", "Hardware", or "Security"
This dataset is deliberately small, which I really love, because it lets you quickly test different RAG ideas like chunk sizes, embedding models, and retrieval strategies without waiting hours for indexing. It's extracted from a larger IT knowledge base dataset (around 100 articles), but this trimmed-down version is ideal for experimentation since it's fast, focused, and great for demos.
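If you want a quick look at the data yourself before indexing it, a small sanity check like this helps (the file path is an assumption: it matches the location the loading code below expects once you've uploaded the Kaggle CSV to Colab):
import pandas as pd
# Peek at the dataset: row count and the two columns we care about
df = pd.read_csv("/content/rag_sample_qas_from_kis.csv")
print(df.shape)
print(df[["ki_topic", "ki_text"]].head(3))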
Now that we know what our data looks like, let's start teaching our model to "remember" it by turning this text into embeddings and storing them inside a vector database.
Building the Vector Store
Once we have our dataset ready, the next goal is to make the model understand it, not just read it. To do that, we need to convert text into embeddings: numerical vectors that represent meaning.
Here's the code to do that:
import os
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document

# Load dataset
df = pd.read_csv("/content/rag_sample_qas_from_kis.csv")

# Use ki_text as the main text and ki_topic as metadata
docs = [
    Document(page_content=row["ki_text"], metadata={"topic": row["ki_topic"]})
    for _, row in df.iterrows()
]

# Split into chunks for embedding
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed and store in FAISS
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
db = FAISS.from_documents(chunks, embeddings)
db.save_local("vectorstore")

print("Vectorstore created successfully using ki_text as content!")
Whoa, that's a lot of code, so let's see what's happening here!
After setting up the dataset, the next step is to help the model "understand" it by converting text into embeddings. We start by loading the CSV and wrapping each record into LangChain Document objects, storing "ki_text" as content and "ki_topic" as metadata. This metadata can later help with filtering or topic-specific retrieval.
Next, each document is split into smaller overlapping chunks (800 characters, with a 100-character overlap) so no idea gets lost between splits. Then we use OpenAIEmbeddings to convert each chunk into vectors: dense numerical representations of semantic meaning.
Finally, all embeddings are stored in a FAISS vector store, an efficient similarity-search index that allows quick retrieval of the most relevant chunks during queries.
At the end of this step, the vectorstore/ folder acts as your model's local "memory," ready to be queried in the next section.
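If you want to sanity-check the index before wiring it into a chain, a quick similarity search does the trick (a minimal sketch; the query string is just an example):
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Reload the saved index and fetch the two chunks most similar to a sample query
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
db = FAISS.load_local("vectorstore", embeddings, allow_dangerous_deserialization=True)
results = db.similarity_search("How do I troubleshoot my VPN connection?", k=2)
for doc in results:
    print(doc.metadata["topic"], "→", doc.page_content[:120])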
Building and Running the RAG Chatbot
Here's what our final chatbot looks like when deployed:
Now let's cover all the libraries/tools we need to make this happen and see how it all comes together.
Prompt Orchestration using LangChain
This is where we bring our vector store, retriever, and LLM together using LangChain's RetrievalQA chain.
The FAISS vector store created earlier is loaded back into memory and linked to OpenAI embeddings. The retriever then acts as the intelligent lookup engine, fetching only the most relevant chunks from your dataset.
Each query now flows through this pipeline: retrieved → augmented → generated, as sketched below.
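To make those three stages concrete, here is a hand-rolled, minimal sketch of what the chain does under the hood (for illustration only; the sample question is made up, and the full code below uses RetrievalQA to handle all of this for us):
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI

embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
db = FAISS.load_local("vectorstore", embeddings, allow_dangerous_deserialization=True)
retriever = db.as_retriever()
llm = OpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"), temperature=0)

query = "How do I reset my VPN password?"  # example question

# 1. Retrieve: fetch the most relevant chunks from the vector store
docs = retriever.get_relevant_documents(query)

# 2. Augment: stuff the retrieved context into the prompt
context = "\n\n".join(doc.page_content for doc in docs)
prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"

# 3. Generate: let the LLM produce the final, grounded answer
print(llm(prompt))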
Serving via FastAPI
To simulate a production-ready setup, we add a lightweight FastAPI backend. While optional for Colab, this step mirrors how your RAG model would be exposed in a real-world setting, with endpoints ready for external integrations or API requests.
The "/" route simply returns a health message for now, but this structure can easily be extended to handle queries from your UI, Slack bots, or web clients, as sketched below.
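For example, a hypothetical /ask endpoint (it isn't part of the full code below, but it is the shape the Dockerized Streamlit UI later expects, and it assumes qa_chain has already been built as in the orchestration step) could look like this:
from fastapi import FastAPI

app = FastAPI(title="RAG Chatbot – Backend")

@app.get("/")
def home():
    return {"message": "Backend ready"}

# Hypothetical extension: the Docker section's UI calls GET /ask?query=...
# and reads {"answer": ...} from the response.
@app.get("/ask")
def ask(query: str):
    answer = qa_chain.run(query)  # qa_chain: the RetrievalQA chain built earlier
    return {"answer": answer}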
Chat UI with Streamlit
On top of the backend, we build an interactive chat interface using Streamlit. This provides a clean, browser-based experience for talking to your RAG pipeline. Users can type questions, hit "Send," and see contextual, document-aware responses powered by LangChain's retrieval and reasoning chain.
Each exchange is saved in st.session_state, creating a persistent conversational flow between you and the model.
Public Access via Ngrok
Since Colab doesn't support direct hosting, we use ngrok to expose the Streamlit app securely. Ngrok tunnels your local port to a temporary public URL, letting anyone (or just you) access the chatbot UI in a real browser tab.
Once ngrok prints a public link, the chatbot becomes instantly accessible.
Putting It All Together
- LangChain orchestrates embeddings, retrieval, and generation
- FastAPI provides an optional backend layer
- Streamlit serves as the interactive UI
- Ngrok bridges Colab to the outside world
And with that, you have a fully functional, end-to-end RAG-powered chatbot running live from your notebook.
Here's the full code for the RAG chatbot, combining all of the above tools:
import os
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from pyngrok import ngrok
import nest_asyncio
import threading
import streamlit as st
import uvicorn
from fastapi import FastAPI

# --- Create FastAPI backend (optional) ---
app = FastAPI(title="RAG Chatbot – Backend")

embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
db = FAISS.load_local("vectorstore", embeddings, allow_dangerous_deserialization=True)
retriever = db.as_retriever()

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"), temperature=0),
    chain_type="stuff",
    retriever=retriever,
)

@app.get("/")
def home():
    return {"message": "Backend ready"}

# --- Streamlit Chat UI ---
def run_streamlit():
    st.set_page_config(page_title="💬 RAG Chatbot", layout="centered")
    st.title("💬 Chat with your RAG-Powered LLM")

    if "history" not in st.session_state:
        st.session_state.history = []

    query = st.text_input("Ask me something:")
    if st.button("Send") and query:
        with st.spinner("Thinking..."):
            answer = qa_chain.run(query)
            st.session_state.history.append((query, answer))

    for q, a in reversed(st.session_state.history):
        st.markdown(f"**You:** {q}")
        st.markdown(f"**Bot:** {a}")
        st.markdown("---")

# --- Launch ngrok and Streamlit ---
ngrok.kill()
public_url = ngrok.connect(8501)
print(f"Chat UI available at: {public_url}")

# Run the Streamlit app in the background
nest_asyncio.apply()

# Save UI to file
with open("app.py", "w") as f:
    f.write('''
import streamlit as st
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
import os

os.environ["OPENAI_API_KEY"] = "''' + os.environ["OPENAI_API_KEY"] + '''"

st.set_page_config(page_title="💬 RAG Chatbot", layout="centered")
st.title("💬 Chat with your RAG-Powered LLM")

embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
db = FAISS.load_local("vectorstore", embeddings, allow_dangerous_deserialization=True)
retriever = db.as_retriever()
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"), temperature=0),
    chain_type="stuff",
    retriever=retriever,
)

if "history" not in st.session_state:
    st.session_state.history = []

query = st.text_input("Ask me something:")
if st.button("Send") and query:
    with st.spinner("Thinking..."):
        answer = qa_chain.run(query)
        st.session_state.history.append((query, answer))

for q, a in reversed(st.session_state.history):
    st.markdown(f"**You:** {q}")
    st.markdown(f"**Bot:** {a}")
    st.markdown("---")
''')

# Start Streamlit on port 8501 (the port ngrok is tunnelling)
get_ipython().system_raw('streamlit run app.py --server.port 8501 &')

print("Chatbot is running. Open the ngrok URL above")
If you run the above code, you’re going to get an output like this:
Chat UI available at: NgrokTunnel: "https://22f6c6d1ef68.ngrok-free.app" -> "http://localhost:8501" Chatbot is running. Open the ngrok URL above
This is the URL on which your chatbot is deployed: https://22f6c6d1ef68.ngrok-free.app/
This link is accessible from anywhere on the internet. You can share it with your friends or use it to try the bot yourself. Pretty neat, right?
By using tools such as Streamlit, FastAPI, ngrok, and LangChain, we were able to deploy an end-to-end RAG-based chatbot in the real world with just a few lines of code. Imagine the possibilities this opens up for you; the sky is the limit.
Containerization using Docker
At this point, we have a fully working RAG chatbot: backend, UI, and all. But if you've ever tried deploying that setup across environments, you know the pain: missing dependencies, version mismatches, and broken paths are the usual chaos.
That's where Docker comes in. Docker gives us a neat, portable way to package the entire environment, with all the tools such as FastAPI, Streamlit, LangChain, FAISS, and even your vector store, into one consistent unit that can run anywhere.
If you want to deploy the very same RAG chatbot as above but through Docker, this would be your updated code to do so:
import os
import subprocess
from pyngrok import ngrok

# --- Create Streamlit UI ---
streamlit_code = """
import streamlit as st
import requests

st.set_page_config(page_title="RAG Chatbot", page_icon="", layout="centered")
st.title("RAG Chatbot – LLMOps Demo")
st.write("Ask questions from your knowledge base. Backend powered by FastAPI, deployed in Docker.")

query = st.text_input("💬 Your question:")
if query:
    with st.spinner("Thinking..."):
        try:
            response = requests.get(f"http://localhost:8000/ask", params={"query": query})
            if response.status_code == 200:
                st.success(response.json()["answer"])
            else:
                st.error("Something went wrong with the backend.")
        except Exception as e:
            st.error(f"Error: {e}")
"""

with open("streamlit_app.py", "w") as f:
    f.write(streamlit_code.strip())

# --- Create requirements.txt ---
with open("requirements.txt", "w") as f:
    f.write("fastapi\nuvicorn\nstreamlit\nlangchain\nopenai\nfaiss-cpu\npyngrok\nrequests\n")

# --- Create Dockerfile ---
dockerfile = f"""
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 8000 8501
ENV OPENAI_API_KEY=${{OPENAI_API_KEY}}
CMD bash -c "uvicorn from_zero_to_llmops_hero_your_101_guide_to_running_llms_in_production:app --host 0.0.0.0 --port 8000 & streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0"
"""

with open("Dockerfile", "w") as f:
    f.write(dockerfile.strip())

# --- Install Docker if needed ---
try:
    subprocess.run(["docker", "--version"], check=True)
except Exception:
    print("Installing Docker...")
    subprocess.run(["apt-get", "update", "-qq"], check=True)
    subprocess.run(["apt-get", "install", "-qq", "-y", "docker.io"], check=True)

# --- Check whether the Docker daemon is available ---
try:
    subprocess.run(["docker", "info"], check=True)
    docker_available = True
except Exception:
    docker_available = False

if docker_available:
    print("Building Docker image...")
    subprocess.run(["docker", "build", "-t", "rag-chatbot-ui", "."], check=True)

    print("Running container (FastAPI + Streamlit)...")
    subprocess.run(["docker", "run", "-d", "-p", "8000:8000", "-p", "8501:8501",
                    "-e", f"OPENAI_API_KEY={os.getenv('OPENAI_API_KEY')}",
                    "rag-chatbot-ui"], check=True)
else:
    print("Docker not supported in Colab; running natively instead.")
    print("Starting FastAPI + Streamlit locally...")

    # Run both apps directly in background threads
    import threading

    def run_fastapi():
        os.system("uvicorn from_zero_to_llmops_hero_your_101_guide_to_running_llms_in_production:app --host 0.0.0.0 --port 8000")

    def run_streamlit():
        os.system("streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0")

    threading.Thread(target=run_fastapi).start()
    threading.Thread(target=run_streamlit).start()

# --- Expose via ngrok ---
ngrok.kill()
public_url = ngrok.connect(8501)
print(f"Your RAG Chatbot UI is live at: {public_url}")
print("API (FastAPI docs) at: /docs")
What Actually Changed in the Code
If you look closely at the new code block, a few major things stand out:
1. Creation of a Dockerfile
This is like the recipe for your app. We start from a lightweight python:3.10-slim image, copy all project files, install dependencies from requirements.txt, and expose two ports:
- 8000 for the FastAPI backend, and
- 8501 for the Streamlit UI.
At the end, both servers are launched inside the same container using a single command:
CMD bash -c "uvicorn ... & streamlit run ..."
That single line makes Docker run FastAPI and Streamlit together, one serving the API, the other serving the interface.
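For reference, the generated Dockerfile ends up looking roughly like this (the module name in the uvicorn command is simply whatever your FastAPI script is called; here it matches the name used in the code above):
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt
# 8000 → FastAPI backend, 8501 → Streamlit UI
EXPOSE 8000 8501
ENV OPENAI_API_KEY=${OPENAI_API_KEY}
CMD bash -c "uvicorn from_zero_to_llmops_hero_your_101_guide_to_running_llms_in_production:app --host 0.0.0.0 --port 8000 & streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0"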
2. Automated Requirements Handling
Before building the Docker image, we generate a clean requirements.txt file programmatically, containing all the libraries we used: fastapi, uvicorn, streamlit, langchain, openai, faiss-cpu, pyngrok, and requests.
This ensures your Docker image always installs the exact packages your notebook used, with no manual copy-paste.
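The generated file is just a plain package list, one name per line:
fastapi
uvicorn
streamlit
langchain
openai
faiss-cpu
pyngrok
requests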
3. Building and Running the Container
Once Docker is available, we run:
subprocess.run(["docker", "build", "-t", "rag-chatbot-ui", "."])
subprocess.run(["docker", "run", "-d", "-p", "8000:8000", "-p", "8501:8501", ...])
This builds an image named rag-chatbot-ui and spins up a container that runs both the backend and the frontend. If you're on a local machine, this gives you two live endpoints immediately: the Streamlit UI on port 8501 and the FastAPI backend (with its /docs page) on port 8000.
4. Fallback for Colab
Because Google Colab doesn't support the Docker daemon directly, we handle it gracefully. If Docker isn't available, the script automatically launches FastAPI and Streamlit in two background threads, so the experience stays identical.
Why This Matters
This small shift from notebook code to a containerized setup is the real turning point from "playing with LLMs" to "running LLMs like a product."
Now you have:
- Reproducibility → the same environment everywhere
- Portability → deploy on AWS, GCP, or even your laptop
- Scalability → multiple containers behind a load balancer later
And yes, it's the very same app, just wrapped neatly in Docker.
Deployment Concepts with AWS
Deploying your LLM system on AWS opens up production-grade scalability, but it's a topic deserving its own deep dive. From hosting your FastAPI app on EC2 or ECS, to storing vector databases on S3, and using AWS Lambda for event-driven triggers, the possibilities are vast. You can even combine CloudFront + API Gateway for secure, globally distributed inference endpoints.
For now, just know that AWS gives you all the building blocks to move from a local Docker container to a fully managed, auto-scaling setup. We'll explore this entire workflow, with infrastructure-as-code, CI/CD pipelines, and deployment scripts, in the next article.
Monitoring, Versioning, Evaluation, and Governance
Once deployed, your system's real work begins: monitoring and maintaining it. Tracking latency, retrieval accuracy, hallucination rates, and version drift is essential to keeping your chatbot both reliable and explainable. Add versioning, evaluation, and governance, and you have the foundations of production-grade LLMOps.
These areas deserve their own spotlight. In the next article, we'll go hands-on with logging, evaluation dashboards, and model governance, all built directly into your RAG pipeline.
Conclusion
We started from zero, setting up a simple chatbot UI, and step by step built a complete LLMOps workflow:
- Data ingestion, vector store creation, embedding, prompt orchestration, serving via FastAPI, containerizing with Docker, and preparing for deployment on AWS.
- Each layer added structure, scalability, and reliability, turning an experiment into a deployable, maintainable product.
- But this is just the beginning. Real LLMOps starts after deployment, when you're monitoring behavior, optimizing retrieval, versioning embeddings, and keeping your models safe and compliant.
In the next part of this series, we'll go beyond setup and dive into AWS deployments, observability, versioning, and governance: the things that truly make your LLMs production-ready.
Let me know your thoughts/questions below!