Nano-vLLM meets Inference Endpoints
Recently, I worked on a proof of concept to simplify the integration of various engines for providing inference endpoint services. This led me to create hfendpoint-draft, a project for experimenting with isolating the exposed API logic from the inference engine as much as possible. The main idea was to avoid forcing native Python bindings (which most projects do, and for good reasons) and instead build a system that is independent of the language and build environment used by the engine.
While testing the integration with transformers and llama.cpp, I added APIs for image generation, embeddings, and chat completions with streaming support, and I was pretty happy with the first results and benchmarks.
So, what's new? An exciting project popped up recently: Nano-vLLM, a lightweight vLLM implementation built from scratch, the kind of minimalist project I love to hear about. It was also the perfect way to challenge my new project! How hard would it be to bind this new engine? Let's find out!
Understanding the architecture
The project provides a very simple and clean example that shows exactly how the engine can be used with Qwen/Qwen3-0.6B, the model Nano-vLLM is designed for at the moment:
import os
from nanovllm import LLM, SamplingParams
from transformers import AutoTokenizer

def main():
    path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
    tokenizer = AutoTokenizer.from_pretrained(path)
    llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)
    sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
    prompts = [
        "introduce yourself",
        "list all prime numbers within 100",
    ]
    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=True
        )
        for prompt in prompts
    ]
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("\n")
        print(f"Prompt: {prompt!r}")
        print(f"Completion: {output['text']!r}")

if __name__ == "__main__":
    main()
This code is great for offline generation, but we can't use it for a concurrent web service. We need more control over the request lifecycle.
So, before writing a single line of code, let's break down the building blocks of Nano-vLLM. A quick look at the source code reveals a modular and well-designed architecture:
- LLMEngine: The class used in the example, the one we need to rewrite.
- Scheduler: It manages a queue of running and waiting sequences, decides which sequences to process in the next batch, and handles the logic for both prefilling (ingesting the prompt) and decoding (generating tokens).
- ModelRunner: This is where the magic happens. It loads the model weights, manages the GPU memory and the KV cache, and executes the actual forward pass of the model.
- Sequence: A simple data class that represents a single request, holding its token IDs, status and sampling parameters.
Our goal is to build a service that can handle many concurrent requests without blocking. To do this, we will create our own engine loop using the Scheduler and ModelRunner directly.
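To make the plan concrete, here is a rough sketch of a single engine step driving these two components by hand. It is only a simplified preview of the _run loop we will build below; the calls it relies on (schedule, call("run", ...), postprocess) are the same ones used later.

def engine_step(scheduler, model_runner):
    # Pick the next batch: either prompts to prefill or running sequences to decode.
    sequences, is_prefill = scheduler.schedule()
    if not sequences:
        return
    # Forward pass: sample one new token per sequence in the batch.
    new_token_ids = model_runner.call("run", sequences, is_prefill)
    # Append the new tokens and mark the sequences that have finished.
    scheduler.postprocess(sequences, new_token_ids)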
Building a Custom Engine Service
The Worker class
To achieve this, we will build a Worker that will encapsulate the core Nano-vLLM components and manage them in a dedicated thread. The constructor of our Worker is a direct adaptation of the LLMEngine's __init__ method, but we introduce two key modifications:
- Asynchronous Hooks: We add a queue.Queue and a threading.Condition. These will serve as the interface for submitting requests from the asynchronous endpoint to our synchronous engine.
- Automated Model Fetching: We replace the path argument with a call to huggingface_hub.snapshot_download. This makes the service self-contained and simplifies deployment by ensuring the model is automatically downloaded to the cache if not already present.
That was easy; we're mostly stealing some code so far.
class Worker:
    def __init__(self):
        model_path = snapshot_download(repo_id="Qwen/Qwen3-0.6B")
        self.config = Config(model_path, **CONFIG)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
        self.config.eos = self.tokenizer.eos_token_id
        self.requests = queue.Queue()
        self.notifier = threading.Condition()
        self.loop = None
        self.engine = None
        self.processes = []
        self.events = []
        if self.config.tensor_parallel_size > 1:
            ctx = mp.get_context("spawn")
            for i in range(1, self.config.tensor_parallel_size):
                event = ctx.Event()
                process = ctx.Process(target=ModelRunner, args=(self.config, i, event))
                process.start()
                self.processes.append(process)
                self.events.append(event)
        self.model_runner = ModelRunner(self.config, 0, self.events)
        self.scheduler = Scheduler(self.config)
        atexit.register(self.stop)
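The start and stop methods referenced here (atexit.register(self.stop)) and in the final asyncio.run(worker.start()) are not shown in this post; the full source linked at the end has the real versions. As a rough, hypothetical sketch of what they do, with hfendpoint.run() as a stand-in for whatever serving entry point the library actually exposes:

class Worker:
    # ... __init__ as above ...

    async def start(self):
        # Capture the event loop so _run can hand tokens back via call_soon_threadsafe.
        self.loop = asyncio.get_running_loop()
        # Run the engine loop in a dedicated background thread.
        threading.Thread(target=self._run, daemon=True).start()
        # Placeholder: hand control to hfendpoint so it can dispatch incoming
        # requests to the registered handlers. The real entry point may differ.
        await hfendpoint.run()

    def stop(self):
        # Shut down the runner and join the tensor-parallel processes,
        # mirroring what LLMEngine does on exit (assumption).
        self.model_runner.call("exit")
        for process in self.processes:
            process.join()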
The running loop
To build a responsive service, we need a processing loop that runs on its own. That's the job of the _run method.
This method acts as the beating heart of our worker. It connects two worlds: the synchronous, batch-oriented logic of the Scheduler and ModelRunner, and the asynchronous, event-driven world of our web service. Its job is simple: keep checking for work, process a batch, and send results back to the waiting code.
def _run(self):
    while True:
        try:
            with self.notifier:
                self.notifier.wait_for(lambda: not self.requests.empty() or not self.scheduler.is_finished())
            while not self.requests.empty():
                seq = self.requests.get_nowait()
                self.scheduler.add(seq)
            sequences, is_prefill = self.scheduler.schedule()
            if not sequences:
                continue
            new_token_ids = self.model_runner.call("run", sequences, is_prefill)
            self.scheduler.postprocess(sequences, new_token_ids)
            for seq, token_id in zip(sequences, new_token_ids):
                response_queue = getattr(seq, 'response_queue', None)
                if not response_queue:
                    continue
                self.loop.call_soon_threadsafe(response_queue.put_nowait, token_id)
                if seq.is_finished:
                    self.loop.call_soon_threadsafe(response_queue.put_nowait, None)
        except Exception as e:
            hfendpoint.error(f"worker loop: {e}")
Let’s break it down:
- The loop starts with self.notifier.wait_for(...), which puts the thread to sleep until new requests arrive or the scheduler still has work to do.
- When it wakes up, it drains the request queue and adds any new sequences to the scheduler. The scheduler then batches sequences.
- Next comes the compute: self.model_runner.call("run", ...) generates the next token for each sequence. self.scheduler.postprocess(...) updates the state, appends the new tokens, and marks finished sequences.
- Finally, we send the tokens back to the right queue. When a sequence is done, we also send a None to signal the end of the stream (see the small standalone sketch after this list).
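The only subtle part is that last step: the engine thread must hand tokens to an asyncio.Queue owned by the event loop, which is exactly what loop.call_soon_threadsafe is for. Here is a tiny, self-contained illustration of that pattern on its own (nothing Nano-vLLM specific):

import asyncio
import threading

async def main():
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    def producer():
        # Runs in a plain thread: schedule each put on the event loop, thread-safely.
        for token in ("Hello", " world", None):  # None signals the end of the stream
            loop.call_soon_threadsafe(queue.put_nowait, token)

    threading.Thread(target=producer, daemon=True).start()
    while (item := await queue.get()) is not None:
        print(item)

asyncio.run(main())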
That’s it!
Submitting requests
With the engine loop running, we need a thread-safe method to submit requests from our web service. This is the role of the submit method, which is the spiritual successor to the add_request method found in the LLMEngine class.
def submit(self, prompt_token_ids: list[int], sampling_params: SamplingParams) -> asyncio.Queue:
    seq = Sequence(prompt_token_ids, sampling_params)
    seq.response_queue = asyncio.Queue()
    self.requests.put(seq)
    with self.notifier:
        self.notifier.notify()
    return seq.response_queue
Let’s break it down too:
- First, we create a Sequence, like in LLMEngine. But this time, we attach an asyncio.Queue to it. This gives each request its own private channel for streaming tokens back to the handler that made the call.
- We put the sequence into self.requests and call self.notifier.notify() to wake up the engine thread if it's waiting.
- Finally, and most importantly, we return the response_queue. The caller can start waiting for tokens immediately, without being blocked during queueing or inference.
That was even easier!
Creating the handler
The final step is to create the hfendpoint handler that will expose our service to the daemon.
worker = Worker()

@hfendpoint.handler("chat_completions")
async def chat(request_data: Dict[str, Any]):
    prompt_text = worker.tokenizer.apply_chat_template(
        request_data["messages"],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True
    )
    prompt_token_ids = worker.tokenizer.encode(prompt_text)
    sampling_params = SamplingParams(
        temperature=request_data.get("temperature", 0.7),
        max_tokens=request_data.get("max_tokens", 2048),
    )
    response = worker.submit(prompt_token_ids, sampling_params)
    decoder = DecodeStream(skip_special_tokens=True)
    while True:
        token_id = await response.get()
        if token_id is None:
            break
        output = decoder.step(worker.tokenizer._tokenizer, token_id)
        if output:
            yield {"content": output}
    yield {"content": "", "finish_reason": "stop"}

if __name__ == "__main__":
    asyncio.run(worker.start())
After the Worker is instantiated, the chat handler becomes responsible for processing each incoming chat_completions request. It tokenizes the input data, creates the appropriate SamplingParams, and then calls worker.submit to dispatch the job to the engine.
The handler can then await tokens from this queue. We use the convenient DecodeStream utility from the tokenizers library to convert the stream of tokens back into valid text. Once the None is received, the loop terminates and a final message is yielded to the client.
And with that, the implementation is complete. The full source code for the service is available here.
Deploying on Inference Endpoints
Now you're ready to do some inference. You hit Enter, and... crickets. Nothing. That's because Nano-vLLM requires a modern GPU architecture. This is where Hugging Face Inference Endpoints particularly shines. It provides the best hardware, fully configured, at no cost when we just want to do some tests.
To get started, go to this page and click on New endpoint. Then, in the Hardware Configuration, select GPU and choose the Nvidia L4 instance with 1 GPU.
Finally, in the Container Configuration, select the Custom type and use this URL for the container:
ghcr.io/angt/hfendpoint-draft-nanovllm
When the endpoint is ready, its status will change to Running. A dedicated URL will be available to access your container, which will look something like this: https://<your-endpoint-name>.endpoints.huggingface.cloud.
By default, the endpoint is Protected, so you need an access token to send requests. Luckily, it's super easy to get one. Go to your Access Tokens section and create a token (the Read role is sufficient).
Let's set HF_ENDPOINT_URL and HF_TOKEN for convenience:
HF_ENDPOINT_URL=https://<your-endpoint-name>.endpoints.huggingface.cloud
HF_TOKEN=hf_xxxx
and run the following curl command:
curl "$HF_ENDPOINT_URL/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $HF_TOKEN" -d'{
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello!" }
]
}'
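If you prefer Python, the same request can be sent with the openai client, assuming the daemon emits standard OpenAI-style streaming chunks (the model name below is just a placeholder; the handler above only reads messages, temperature and max_tokens):

import os
from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['HF_ENDPOINT_URL']}/v1",
    api_key=os.environ["HF_TOKEN"],
)

stream = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",  # placeholder, not used by the handler above
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)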
Et voilà! I hope this inspires you to try Nano-vLLM on your own endpoints. Happy hacking!