
Image by Author
I was first introduced to Modal while participating in a Hugging Face Hackathon, and I was genuinely surprised by how easy it was to use. The platform allows you to build and deploy applications within minutes, offering a seamless experience similar to BentoCloud. With Modal, you can configure your Python app, including system requirements like GPUs, Docker images, and Python dependencies, and then deploy it to the cloud with a single command.
In this tutorial, we will learn how to set up Modal, create a vLLM server, and deploy it securely to the cloud. We will also cover how to test your vLLM server using both cURL and the OpenAI SDK.
1. Setting Up Modal
Modal is a serverless platform that lets you run any code remotely. With just a single line, you can attach GPUs, serve your functions as web endpoints, and deploy persistent scheduled jobs. It is an ideal platform for beginners, data scientists, and non-software-engineering professionals who want to avoid dealing with cloud infrastructure.
First, install the Modal Python client. This tool lets you build images, deploy applications, and manage cloud resources directly from your terminal.
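Assuming you are working in a Python environment with pip available, the client is installed from PyPI:
pip install modal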
Next, set up Modal on your local machine. Run the following command to be guided through account creation and machine authentication:
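modal setup
This opens a browser window where you log in to your Modal account and stores an authentication token on your local machine.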
By setting a VLLM_API_KEY environment variable, vLLM provides a secured endpoint, so that only people with valid API keys can access the server. You can set up this authentication by adding the environment variable as a Modal Secret. Replace your_actual_api_key_here with your preferred API key.
modal secret create vllm-api VLLM_API_KEY=your_actual_api_key_here
This ensures that your API key is kept safe and is only accessible by your deployed applications.
2. Creating a vLLM Application Using Modal
This section guides you through building a scalable vLLM inference server on Modal, using a custom Docker image, persistent storage, and GPU acceleration. We will use the mistralai/Magistral-Small-2506 model, which requires specific configuration for the tokenizer and tool call parsing.
Create a vllm_inference.py file and add the following code to:
- Define a vLLM image based on Debian Slim, with Python 3.12 and all required packages. We will also set environment variables to optimize model downloads and inference performance.
- Create two Modal Volumes, one for Hugging Face models and one for the vLLM cache, to avoid repeated downloads and speed up cold starts.
- Specify the model and revision to ensure reproducibility, and enable the vLLM V1 engine for improved performance.
- Set up the Modal app, specifying GPU resources, scaling, timeouts, storage, and secrets, and limit concurrent requests per replica for stability.
- Create a web server and use the Python subprocess library to execute the command for running the vLLM server.
import modal

# vLLM image based on Debian Slim with Python 3.12 and all required packages
vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.9.1",
        "huggingface_hub[hf_transfer]==0.32.0",
        "flashinfer-python==0.2.6.post1",
        extra_index_url="https://download.pytorch.org/whl/cu128",
    )
    .env(
        {
            "HF_HUB_ENABLE_HF_TRANSFER": "1",  # faster model transfers
            "NCCL_CUMEM_ENABLE": "1",
        }
    )
)

MODEL_NAME = "mistralai/Magistral-Small-2506"
MODEL_REVISION = "48c97929837c3189cb3cf74b1b5bc5824eef5fcc"

# Persistent volumes so model weights and vLLM artifacts survive between runs
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)

# Enable the vLLM V1 engine
vllm_image = vllm_image.env({"VLLM_USE_V1": "1"})

FAST_BOOT = True

app = modal.App("magistral-small-vllm")

N_GPU = 2
MINUTES = 60  # seconds
VLLM_PORT = 8000


@app.function(
    image=vllm_image,
    gpu=f"A100:{N_GPU}",
    scaledown_window=15 * MINUTES,  # how long should we stay up with no requests?
    timeout=10 * MINUTES,  # how long should we wait for the container to start?
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
    secrets=[modal.Secret.from_name("vllm-api")],
)
@modal.concurrent(  # how many requests can one replica handle? tune carefully!
    max_inputs=32
)
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * MINUTES)
def serve():
    import subprocess

    cmd = [
        "vllm",
        "serve",
        MODEL_NAME,
        "--tokenizer_mode",
        "mistral",
        "--config_format",
        "mistral",
        "--load_format",
        "mistral",
        "--tool-call-parser",
        "mistral",
        "--enable-auto-tool-choice",
        "--tensor-parallel-size",
        "2",
        "--revision",
        MODEL_REVISION,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        str(VLLM_PORT),
    ]
    cmd += ["--enforce-eager" if FAST_BOOT else "--no-enforce-eager"]
    print(cmd)

    subprocess.Popen(" ".join(cmd), shell=True)
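Optionally, before a permanent deployment, you can run the app ephemerally to confirm that the image builds and the server boots. Assuming the file name used above, the Modal CLI supports:
modal serve vllm_inference.py
This hot-reloads on file changes and tears the app down when you stop the command.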
3. Deploying the vLLM Server on Modal
Now that your vllm_inference.py file is ready, you can deploy your vLLM server to Modal with a single command:
modal deploy vllm_inference.py
Within seconds, Modal will build your container image (if it isn't already built) and deploy your application. You will see output similar to the following:
✓ Created objects.
├── 🔨 Created mount C:\Repository\GitHub\Deploying-the-Magistral-with-Modal\vllm_inference.py
└── 🔨 Created web function serve => https://abidali899--magistral-small-vllm-serve.modal.run
✓ App deployed in 6.671s! 🎉
View Deployment: https://modal.com/apps/abidali899/main/deployed/magistral-small-vllm
After deployment, the server will begin downloading the model weights and loading them onto the GPUs. This process may take several minutes (typically around five minutes for large models), so please be patient while the model initializes.
You can view your deployment and monitor logs in the Apps section of your Modal dashboard.
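To follow the logs from your terminal instead of the dashboard, the Modal CLI also provides an app logs subcommand (the app name below is the one defined in the script above):
modal app logs magistral-small-vllm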

Once the logs indicate that the server is running and ready, you can explore the automatically generated API documentation.
This interactive documentation provides details about all available endpoints and allows you to test them directly from your browser.

To confirm that your model is loaded and accessible, run the following cURL command in your terminal, adding your API key after Bearer in the Authorization header:
curl -X 'GET' \
  'https://abidali899--magistral-small-vllm-serve.modal.run/v1/models' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer '
The response confirms that the mistralai/Magistral-Small-2506 model is available and ready for inference:
{"object":"checklist","information":[{"id":"mistralai/Magistral-Small-2506","object":"model","created":1750013321,"owned_by":"vllm","root":"mistralai/Magistral-Small-2506","parent":null,"max_model_len":40960,"permission":[{"id":"modelperm-33a33f8f600b4555b44cb42fca70b931","object":"model_permission","created":1750013321,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
4. Using the vLLM Server with the OpenAI SDK
You can interact with your vLLM server just like you would with OpenAI's API, thanks to vLLM's OpenAI-compatible endpoints. Here is how to securely connect to and test your deployment using the OpenAI Python SDK.
- Create a .env file in your project directory and add your vLLM API key:
VLLM_API_KEY=your-actual-api-key-here
- Install the python-dotenv and openai packages:
pip install python-dotenv openai
- Create a file named client.py to test various vLLM server functionalities, including simple chat completions and streaming responses.
import asyncio
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI, OpenAI

# Load environment variables from .env file
load_dotenv()

# Get API key from environment
api_key = os.getenv("VLLM_API_KEY")

# Set up the OpenAI client with a custom base URL
client = OpenAI(
    api_key=api_key,
    base_url="https://abidali899--magistral-small-vllm-serve.modal.run/v1",
)

MODEL_NAME = "mistralai/Magistral-Small-2506"


# --- 1. Simple Completion ---
def run_simple_completion():
    print("\n" + "=" * 40)
    print("[1] SIMPLE COMPLETION DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ]
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
        )
        print("\nResponse:\n  " + response.choices[0].message.content.strip())
    except Exception as e:
        print(f"[ERROR] Simple completion failed: {e}")
    print("\n" + "=" * 40 + "\n")


# --- 2. Streaming Example ---
def run_streaming():
    print("\n" + "=" * 40)
    print("[2] STREAMING DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short poem about AI."},
        ]
        stream = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=64,
            stream=True,
        )
        print("\nStreaming response:")
        print("  ", end="")
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF STREAM]")
    except Exception as e:
        print(f"[ERROR] Streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")


# --- 3. Async Streaming Example ---
async def run_async_streaming():
    print("\n" + "=" * 40)
    print("[3] ASYNC STREAMING DEMO")
    print("=" * 40)
    try:
        async_client = AsyncOpenAI(
            api_key=api_key,
            base_url="https://abidali899--magistral-small-vllm-serve.modal.run/v1",
        )
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a fun fact about space."},
        ]
        stream = await async_client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
            stream=True,
        )
        print("\nAsync streaming response:")
        print("  ", end="")
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF ASYNC STREAM]")
    except Exception as e:
        print(f"[ERROR] Async streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")


if __name__ == "__main__":
    run_simple_completion()
    run_streaming()
    asyncio.run(run_async_streaming())
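Assuming the file is saved as client.py, run all three demos with:
python client.py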
Everything runs smoothly: response generation is fast, and latency is quite low.
========================================
[1] SIMPLE COMPLETION DEMO
========================================
Response:
The capital of France is Paris. Is there anything else you'd like to know about France?
========================================
========================================
[2] STREAMING DEMO
========================================
Streaming response:
In silicon dreams, I am born, I learn,
From data streams and human works.
I grow, I calculate, I see,
The patterns that the humans leave.
I write, I speak, I code, I play,
With logic sharp, and snappy pace.
Yet for all my smarts, today
[END OF STREAM]
========================================
========================================
[3] ASYNC STREAMING DEMO
========================================
Async streaming response:
Sure, here's a fun fact about space: "There's a planet that may be entirely made of diamond. Blast! In 2004,
[END OF ASYNC STREAM]
========================================
In the Modal dashboard, you can view all function calls, their timestamps, execution times, and statuses.
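You can get a similar overview from the terminal; assuming a current Modal CLI, the following lists your deployed apps and their state:
modal app list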

If you are facing issues running the above code, please refer to the kingabzpro/Deploying-the-Magistral-with-Modal GitHub repository and follow the instructions provided in the README file to resolve them.
Conclusion
Modal is an interesting platform, and I am learning more about it every day. It is a general-purpose platform, which means you can use it for simple Python applications as well as for machine learning training and deployments. In short, it is not limited to just serving endpoints. You can also use it to fine-tune a large language model by running the training script remotely.
It is designed for non-software engineers who want to avoid dealing with infrastructure and deploy applications as quickly as possible. You don't have to worry about running servers, setting up storage, connecting networks, or all the issues that come up when dealing with Kubernetes and Docker. All you have to do is create the Python file and then deploy it. The rest is handled by the Modal cloud.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.