Nano-vLLM meets Inference Endpoints

Community Article · Published June 25, 2025

Recently, I worked on a proof of concept to simplify the integration of various engines for providing inference endpoint services. This led me to create hfendpoint-draft, a project for experimenting with ways to isolate the exposed API logic from the inference engine as much as possible. The main idea was to avoid forcing native Python bindings (which most projects do, and for good reasons) and instead build a system that is not tied to the language or build environment used by the engine.

While testing the integration with transformers and llama.cpp, I added APIs for image generation, embeddings, and chat completions with streaming support. And I was pretty happy with the first results and benchmarks.

So, what's new? An exciting project popped up recently: Nano-vLLM, a lightweight vLLM implementation built from scratch, the kind of minimalist project I love to hear about. It was also the perfect way to challenge my new project! How hard would it be to bind this new engine? Let's find out!

Understanding the architecture

The project provides a very simple and clean example that shows exactly how the engine can be used, using Qwen/Qwen3-0.6B, the model Nano-vLLM is designed for at the moment:

import os
from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer

def main():
    path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
    tokenizer = AutoTokenizer.from_pretrained(path)
    llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

    sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
    prompts = [
        "introduce yourself",
        "list all prime numbers within 100",
    ]
    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=True
        )
        for prompt in prompts
    ]
    outputs = llm.generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print("\n")
        print(f"Prompt: {prompt!r}")
        print(f"Completion: {output['text']!r}")

if __name__ == "__main__":
    main()

This code is great for offline generation, but we can't use it for a concurrent web service. We need more control over the request lifecycle.

So, before writing a single line of code, let's break down the building blocks of Nano-vLLM. A quick look at the source code reveals a modular and well-designed architecture:

  • LLMEngine: The class used in the example, the one we need to rewrite.

  • Scheduler: It manages a queue of running and waiting sequences, decides which sequences to process in the next batch, and handles the logic for both prefilling (ingesting the prompt) and decoding (generating tokens).

  • ModelRunner: This is where the magic happens. It loads the model weights, manages the GPU memory and the KV cache, and executes the actual forward pass of the model.

  • Sequence: A simple data class that represents a single request, holding its token IDs, status and sampling parameters.

Our goal is to build a service that can handle many concurrent requests without blocking. To do this, we will create our own engine loop using the Scheduler and ModelRunner directly.
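
To get a feel for how these pieces fit together, here is a minimal sketch of a single-request generation loop driven directly by Scheduler and ModelRunner, mirroring what happens under the hood of the llm.generate call in the example above. The internal import paths are assumptions based on the current repository layout and may change, and decoding the finished sequence back to text is omitted for brevity. We will wrap exactly this pattern in our Worker below.

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
from nanovllm import SamplingParams
# Internal modules; paths assumed from the repository layout.
from nanovllm.config import Config
from nanovllm.engine.model_runner import ModelRunner
from nanovllm.engine.scheduler import Scheduler
from nanovllm.engine.sequence import Sequence

path = snapshot_download(repo_id="Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained(path)
config = Config(path, enforce_eager=True, tensor_parallel_size=1)
config.eos = tokenizer.eos_token_id

model_runner = ModelRunner(config, 0, [])  # rank 0, no extra tensor-parallel processes
scheduler = Scheduler(config)

scheduler.add(Sequence(tokenizer.encode("introduce yourself"), SamplingParams(max_tokens=64)))
while not scheduler.is_finished():
    sequences, is_prefill = scheduler.schedule()                 # pick the next batch
    token_ids = model_runner.call("run", sequences, is_prefill)  # forward pass
    scheduler.postprocess(sequences, token_ids)                  # append tokens, mark finished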

Building a Custom Engine Service

The Worker class

To achieve this, we will build a Worker class that encapsulates the core Nano-vLLM components and manages them in a dedicated thread. The constructor of our Worker is a direct adaptation of LLMEngine's __init__ method, but we introduce two key modifications:

  • Asynchronous Hooks: We add a queue.Queue and a threading.Condition. These will serve as the interface for submitting requests from the asynchronous endpoint to our synchronous engine.

  • Automated Model Fetching: We replace the path argument with a call to huggingface_hub.snapshot_download. This makes the service self-contained and simplifies deployment by ensuring the model is automatically downloaded to the cache if not already present.

That was easy, we’re mostly stealing some code so far.

class Worker:
    def __init__(self):
        # Fetch the model from the Hub (or reuse the local cache).
        model_path = snapshot_download(repo_id="Qwen/Qwen3-0.6B")
        self.config = Config(model_path, **CONFIG)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
        self.config.eos = self.tokenizer.eos_token_id
        # Asynchronous hooks: a thread-safe queue for incoming requests and a
        # condition variable to wake up the engine thread.
        self.requests = queue.Queue()
        self.notifier = threading.Condition()
        self.loop = None
        self.engine = None
        self.processes = []
        self.events = []
        # Spawn one extra ModelRunner process per additional GPU.
        if self.config.tensor_parallel_size > 1:
            ctx = mp.get_context("spawn")
            for i in range(1, self.config.tensor_parallel_size):
                event = ctx.Event()
                process = ctx.Process(target=ModelRunner, args=(self.config, i, event))
                process.start()
                self.processes.append(process)
                self.events.append(event)
        # Rank 0 runs in this process, alongside the scheduler.
        self.model_runner = ModelRunner(self.config, 0, self.events)
        self.scheduler = Scheduler(self.config)
        atexit.register(self.stop)
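
The stop method registered here is not shown in this walkthrough. Below is a minimal sketch of what it does, adapted from LLMEngine's shutdown logic; the "exit" call and the exact cleanup steps are assumptions based on that code.

    def stop(self):
        # Ask the rank-0 ModelRunner to shut down (propagating the call to the
        # tensor-parallel worker processes), then wait for those processes.
        # Adapted from LLMEngine; the call name is an assumption.
        self.model_runner.call("exit")
        del self.model_runner
        for process in self.processes:
            process.join()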

The running loop

To build a responsive service, we need a processing loop that runs on its own. That’s the job of the _run method. This method acts as the beating heart of our worker. It connects two worlds: the synchronous, batch-oriented logic of the Scheduler and ModelRunner, and the asynchronous, event-driven world of our web service. Its job is simple: keep checking for work, process a batch, and send results back to the waiting code.

    def _run(self):
        while True:
            try:
                with self.notifier:
                    self.notifier.wait_for(lambda: not self.requests.empty() or not self.scheduler.is_finished())

                while not self.requests.empty():
                    seq = self.requests.get_nowait()
                    self.scheduler.add(seq)

                sequences, is_prefill = self.scheduler.schedule()
                if not sequences:
                    continue

                new_token_ids = self.model_runner.call("run", sequences, is_prefill)
                self.scheduler.postprocess(sequences, new_token_ids)

                for seq, token_id in zip(sequences, new_token_ids):
                    response_queue = getattr(seq, 'response_queue', None)
                    if not response_queue:
                        continue
                    self.loop.call_soon_threadsafe(response_queue.put_nowait, token_id)
                    if seq.is_finished:
                        self.loop.call_soon_threadsafe(response_queue.put_nowait, None)

            except Exception as e:
                hfendpoint.error(f"worker loop: {e}")

Let’s break it down:

  • The loop starts with self.notifier.wait_for(...), which puts the thread to sleep until new requests arrive or the scheduler still has work to do.

  • When it wakes up, it drains the request queue and adds any new sequences to the scheduler.

  • The scheduler then selects the next batch of sequences, telling us whether this step is a prefill or a decode step.

  • Next comes the compute: self.model_runner.call("run", ...) generates the next token for each sequence. self.scheduler.postprocess(...) updates the state, appends the new tokens, and marks finished sequences.

  • Finally, we send the tokens back to the right queue. When a sequence is done, we also send a None to signal the end of the stream.

That’s it!

Submitting requests

With the engine loop running, we need a thread-safe method to submit requests from our web service. This is the role of the submit method, which is the spiritual successor to the add_request method found in the LLMEngine class.

    def submit(self, prompt_token_ids: list[int], sampling_params: SamplingParams) -> asyncio.Queue:
        seq = Sequence(prompt_token_ids, sampling_params)
        seq.response_queue = asyncio.Queue()
        self.requests.put(seq)
        with self.notifier:
            self.notifier.notify()
        return seq.response_queue

Let’s break it down too:

  • First, we create a Sequence, like in LLMEngine. But this time, we attach an asyncio.Queue to it. This gives each request its own private channel for streaming tokens back to the handler that made the call.

  • We put the sequence into self.requests and call self.notifier.notify() to wake up the engine thread if it’s waiting.

  • Finally, and most importantly, we return the response_queue. The caller can start waiting for tokens immediately, without being blocked during queueing or inference.

That was even easier!

Creating the handler

The final step is to create the hfendpoint handler that will expose our service to the daemon.

worker = Worker()

@hfendpoint.handler("chat_completions")
async def chat(request_data: Dict[str, Any]):
    prompt_text = worker.tokenizer.apply_chat_template(
        request_data["messages"],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True
    )
    prompt_token_ids = worker.tokenizer.encode(prompt_text)

    sampling_params = SamplingParams(
        temperature=request_data.get("temperature", 0.7),
        max_tokens=request_data.get("max_tokens", 2048),
    )
    response = worker.submit(prompt_token_ids, sampling_params)
    decoder = DecodeStream(skip_special_tokens=True)

    while True:
        token_id = await response.get()
        if token_id is None:
            break
        output = decoder.step(worker.tokenizer._tokenizer, token_id)
        if output:
            yield {"content": output}

    yield {"content":"", "finish_reason": "stop"}

if __name__ == "__main__":
    asyncio.run(worker.start())

After the Worker is instantiated, the chat handler becomes responsible for processing each incoming chat_completions request. It tokenizes the input data, creates the appropriate SamplingParams, and then calls worker.submit to dispatch the job to the engine.

The handler can then await tokens from this queue. We use the convenient DecodeStream utility from the tokenizers library to convert the stream of tokens back into valid text. Once the None is received, the loop terminates and a final message is yielded to the client.
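
One piece not shown above is the start method called by asyncio.run at the bottom of the script. It has to capture the event loop used by the handlers (the self.loop that _run relies on), launch the engine thread, and then hand control to the hfendpoint daemon. Here is a minimal sketch; the hfendpoint.run entry point is a hypothetical name, and the real call lives in the full source.

    async def start(self):
        # Record the asyncio loop so the engine thread can push tokens back
        # to the per-request queues with call_soon_threadsafe().
        self.loop = asyncio.get_running_loop()
        # Run the blocking engine loop in a dedicated background thread.
        self.engine = threading.Thread(target=self._run, daemon=True)
        self.engine.start()
        # Hand control to the hfendpoint daemon; the entry point name is assumed.
        await hfendpoint.run()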

And with that, the implementation is complete. The full source code for the service is available here.

Deploying on Inference Endpoints

Now you're ready to do some inference. You hit Enter, and... crickets. Nothing. That's because Nano-vLLM requires a modern GPU architecture. This is where Hugging Face Inference Endpoints particularly shines: it provides top-tier hardware, fully configured, at a minimal cost when you just want to run a few quick tests.

To get started, go to this page and click on New endpoint. Then, in the Hardware Configuration, select GPU and choose the Nvidia L4 instance with 1 GPU:

Hardware Configuration

Finally, in the Container Configuration, select the Custom type and use this URL for the container:

ghcr.io/angt/hfendpoint-draft-nanovllm

Container Configuration

When the endpoint is ready, its status will change to Running. A dedicated URL will be available to access your container, which will look something like this: https://<your-endpoint-name>.endpoints.huggingface.cloud.

By default, the endpoint is Protected, so you need an access token to send requests. Luckily, it's super easy to get one. Go to your Access Tokens section and create a token (the Read role is sufficient).

Access Tokens

Let's set HF_ENDPOINT_URL and HF_TOKEN for convenience:

HF_ENDPOINT_URL=https://<your-endpoint-name>.endpoints.huggingface.cloud
HF_TOKEN=hf_xxxx

and run the following curl command:

curl "$HF_ENDPOINT_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $HF_TOKEN" -d'{
    "messages": [
        { "role": "system", "content": "You are a helpful assistant." },
        { "role": "user", "content": "Hello!" }
    ]
}'
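
Since the route follows the OpenAI chat completions format, you can also query the endpoint with the official openai Python client. Here is a minimal sketch reusing the environment variables set above, assuming the daemon accepts non-streaming requests; the model field is informational, since the handler always serves Qwen/Qwen3-0.6B.

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["HF_ENDPOINT_URL"] + "/v1",
    api_key=os.environ["HF_TOKEN"],
)
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",  # informational: the handler loads this model regardless
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)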

Et voilà! I hope this inspires you to try Nano-vLLM on your own endpoints. Happy hacking!
