

Image by Author
llama.cpp is the original high-performance framework that powers many popular local AI tools, including Ollama, local chatbots, and other on-device LLM solutions. By working directly with llama.cpp, you can cut overhead, gain fine-grained control, and optimize performance for your specific hardware, making your local AI agents and applications faster and more configurable.
In this tutorial, I will guide you through building AI applications using llama.cpp, a powerful C/C++ library for running large language models (LLMs) efficiently. We will cover setting up a llama.cpp server, integrating it with Langchain, and building a ReAct agent capable of using tools like web search and a Python REPL.
1. Setting Up the llama.cpp Server
This section covers installing llama.cpp and its dependencies, configuring it for CUDA support, building the necessary binaries, and running the server.
Note: we are using an NVIDIA RTX 4090 graphics card on a Linux operating system with the CUDA toolkit pre-configured. If you do not have access to similar local hardware, you can rent GPU instances from Vast.ai at a lower cost.


Screenshot from Vast.ai | Console
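- (Optional) Before building anything, you can confirm that the GPU driver and CUDA toolkit are visible, assuming both are installed on the instance:
nvidia-smi
nvcc --version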
- Update your system's package list and install essential tools like build-essential, cmake, curl, and git. pciutils is included for hardware information, and libcurl4-openssl-dev is required for llama.cpp to download models from Hugging Face.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git -y
- Clone the official llama.cpp repository from GitHub and use cmake to configure the build.
# Clone llama.cpp repository
git clone https://github.com/ggml-org/llama.cpp

# Configure build with CUDA support
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON
- Compile llama.cpp and all of its tools, including the server. For convenience, copy all of the compiled binaries from the llama.cpp/build/bin/ directory to the main llama.cpp/ directory.
# Build all required binaries, including the server
cmake --build llama.cpp/build --config Release -j --clean-first

# Copy all binaries to the main directory
cp llama.cpp/build/bin/* llama.cpp/
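- (Optional) A quick way to confirm the build succeeded is to run one of the compiled binaries and print its version banner:
./llama.cpp/llama-cli --version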
- Start the llama.cpp server with the unsloth/gemma-3-4b-it-GGUF model.
./llama.cpp/llama-server \
    -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
    --host 0.0.0.0 \
    --port 8000 \
    --n-gpu-layers 999 \
    --ctx-size 8192 \
    --threads $(nproc) \
    --temp 0.6 \
    --cache-type-k q4_0 \
    --jinja
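- (Optional) The server exposes a simple /health endpoint, so you can confirm it is up before sending real requests:
curl http://localhost:8000/health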
- You can test whether the server is running correctly by sending a POST request using curl.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello! How are you today?"}
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
Output:
{"decisions":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"nOkay, user greeted me with a simple "Hello! How are you today?" nnHmm, this seems like a casual opening. The user might be testing the waters to see if I respond naturally, or maybe they genuinely want to know how an AI assistant conceptualizes "being" but in a friendly way. nnI notice they used an exclamation mark, which feels warm and possibly playful. Maybe they're in a good mood or just trying to make conversation feel less robotic. nnSince I don't have emotions, I should clarify that gently but still keep it warm. The response should acknowledge their greeting while explaining my nature as an AI. nnI wonder if they're asking because they're curious about AI consciousness, or just being polite"}}],"created":1749319250,"mannequin":"gpt-3.5-turbo","system_fingerprint":"b5605-5787b5da","object":"chat.completion","utilization":{"completion_tokens":150,"prompt_tokens":9,"total_tokens":159},"id":"chatcmpl-jNfif9mcYydO2c6nK0BYkrtpNXSnseV1","timings":{"prompt_n":9,"prompt_ms":65.502,"prompt_per_token_ms":7.278,"prompt_per_second":137.40038472107722,"predicted_n":150,"predicted_ms":1207.908,"predicted_per_token_ms":8.052719999999999,"predicted_per_second":124.1816429728092}}
2. Building an AI Agent with LangGraph and llama.cpp
Now, let's use LangGraph and Langchain to interact with the llama.cpp server and build a multi-tool AI agent.
- Set your Tavily API key for search capabilities.
- For Langchain to work with the local llama.cpp server (which emulates an OpenAI API), you can set OPENAI_API_KEY to local or any non-empty string, since the base_url directs requests locally.
export TAVILY_API_KEY="your_api_key_here"
export OPENAI_API_KEY=local
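- If you are working in a notebook rather than a shell, the same variables can be set from Python instead; a small sketch:
import os

# Equivalent setup from inside Python; the OpenAI key only needs to be non-empty
os.environ["TAVILY_API_KEY"] = "your_api_key_here"
os.environ["OPENAI_API_KEY"] = "local"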
- Install the necessary Python libraries: langgraph for creating agents, tavily-python for the Tavily search tool, and various langchain packages for LLM interactions and tools.
%%capture
!pip install -U langgraph tavily-python langchain langchain-community langchain-experimental langchain-openai
- Configure ChatOpenAI from Langchain to communicate with your local llama.cpp server.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="unsloth/gemma-3-4b-it-GGUF:Q4_K_XL",
    temperature=0.6,
    base_url="http://localhost:8000/v1",
)
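- A one-line sanity check confirms that Langchain can reach the server before you build the agent (assuming the server from step 1 is still running):
# .invoke returns an AIMessage; .content holds the generated text
print(llm.invoke("Reply with one short sentence: are you online?").content)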
- Set up the tools that your agent will be able to use.
- TavilySearchResults: Allows the agent to search the web.
- PythonREPLTool: Provides the agent with a Python Read-Eval-Print Loop to execute code.
from langchain_community.tools import TavilySearchResults
from langchain_experimental.tools.python.tool import PythonREPLTool

search_tool = TavilySearchResults(max_results=5, include_answer=True)
code_tool = PythonREPLTool()
tools = [search_tool, code_tool]
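- You can exercise each tool on its own before handing them to the agent; a quick sketch (the search call assumes a valid Tavily key is set):
# Run Python code directly in the REPL tool
print(code_tool.run("print(2 ** 10)"))

# Query Tavily directly; returns a list of result dictionaries
print(search_tool.invoke({"query": "latest llama.cpp release"}))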
- Use LangGraph's prebuilt create_react_agent function to create an agent that can reason and act (ReAct framework) using the LLM and the defined tools.
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model=llm,
    tools=tools,
)
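- By default, invoking the agent returns only the final state. If you want to watch the reasoning loop as it happens, LangGraph agents can also stream intermediate steps; a minimal sketch:
# Stream full state snapshots and print the newest message at each step
for step in agent.stream(
    {"messages": [{"role": "user", "content": "What is 17 * 23?"}]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()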
3. Test the AI Agent with Example Queries
Now, we will test the AI agent and also display which tools the agent uses.
- This helper function extracts the names of the tools used by the agent from the conversation history. This is useful for understanding the agent's decision-making process.
def extract_tool_names(conversation: dict) -> list[str]:
    tool_names = set()
    for msg in conversation.get('messages', []):
        calls = []
        if hasattr(msg, 'tool_calls'):
            calls = msg.tool_calls or []
        elif isinstance(msg, dict):
            calls = msg.get('tool_calls') or []
            if not calls and isinstance(msg.get('additional_kwargs'), dict):
                calls = msg['additional_kwargs'].get('tool_calls', [])
        else:
            ak = getattr(msg, 'additional_kwargs', None)
            if isinstance(ak, dict):
                calls = ak.get('tool_calls', [])
        for call in calls:
            if isinstance(call, dict):
                if 'name' in call:
                    tool_names.add(call['name'])
                elif 'function' in call and isinstance(call['function'], dict):
                    fn = call['function']
                    if 'name' in fn:
                        tool_names.add(fn['name'])
    return sorted(tool_names)
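- The helper can be sanity-checked with a hand-built conversation dictionary that mimics the OpenAI-style tool-call shape (a hypothetical example, not real agent output):
fake_conversation = {
    "messages": [
        {"tool_calls": [{"function": {"name": "tavily_search_results_json"}}]},
    ]
}
print(extract_tool_names(fake_conversation))  # ['tavily_search_results_json']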
- Define a function that runs the agent with a given question and returns the tools used along with the final answer.
def run_agent(question: str):
    result = agent.invoke({"messages": [{"role": "user", "content": question}]})
    raw_answer = result["messages"][-1].content
    tools_used = extract_tool_names(result)
    return tools_used, raw_answer
- Let's ask the agent for the top 5 breaking news stories. It should use the tavily_search_results_json tool.
tools, answer = run_agent("What are the top 5 breaking news stories?")
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['tavily_search_results_json']
Here are the top 5 breaking news stories based on the provided sources:
1. **Gaza Humanitarian Crisis:** Ongoing conflict and challenges in Gaza, including the Eid al-Adha holiday and the retrieval of a Thai hostage's body.
2. **Russian Drone Attacks on Kharkiv:** Russia continues to target Ukrainian cities with drone and missile strikes.
3. **Wagner Group Departure from Mali:** The Wagner Group is leaving Mali after heavy losses, but Russia's Africa Corps remains.
4. **Trump-Musk Feud:** A dispute between former President Trump and Elon Musk could have implications for Tesla stock and the U.S. space program.
5. **Education Department Staffing Cuts:** The Biden administration is seeking Supreme Court intervention to block planned staffing cuts at the Education Department.
- Let's ask the agent to write and execute Python code for the Fibonacci sequence. It should use the Python_REPL tool.
tools, answer = run_agent(
    "Write a code for the Fibonacci sequence and execute it using Python REPL."
)
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['Python_REPL']
The Fibonacci sequence up to 10 terms is [0, 1, 1, 2, 3, 5, 8, 13, 21, 34].
Final Thoughts
In this guide, I have used a small quantized LLM, which sometimes struggles with accuracy, especially when it comes to selecting tools. If your goal is to build production-ready AI agents, I highly recommend running the latest full-sized models with llama.cpp. Larger and more recent models generally provide better results and more reliable outputs.
It is important to note that setting up llama.cpp can be more challenging than user-friendly tools like Ollama. However, if you are willing to invest the time to debug, optimize, and tailor llama.cpp to your specific hardware, the performance gains and flexibility are well worth it.
One of the biggest advantages of llama.cpp is its efficiency: you don't need high-end hardware to get started. It runs well on regular CPUs and laptops without dedicated GPUs, making local AI accessible to almost everyone. And if you ever need more power, you can always rent an affordable GPU instance from a cloud provider.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.