Deep Dive: How Does vLLM Handle Complex & Multiple User Behaviors?
When you use ChatGPT or similar tools, sometimes you:
- Ask for multiple answers to choose from, like "Give me 3 variations",
- Or want the best possible answer out of many options, like "Translate this correctly",
- Or ask repeatedly using the same format, like "Summarize this article, then this one, then that one..."
Sometimes, too many people try to use the AI at the same time, and the computer (especially its memory) can't handle everything at once. When thousands of people use ChatGPT simultaneously, the serving system:
- Batches requests smartly,
- Shares memory when people ask similar prompts,
- Evicts low-priority tasks,
- Swaps old tasks to CPU RAM,
- Or redoes the work efficiently later.
That's how it handles millions of complex, simultaneous requests without crashing or slowing down too much.
Here comes our hero - vLLM.
Let us see how vLLM solves these problems. But first, let's go through some unique concepts in brief:
Using vLLM for Other Types of Responses (Decoding Scenarios)
What does vLLM do?
Think of vLLM as a very smart waiter at a crowded restaurant. Instead of taking each order separately and running back and forth for every little thing, it does this:
- Parallel Sampling (Multiple Answers) - When you want several outputs, vLLM reuses what's common (like the same question) to save time and memory. It only copies data when the answers start to differ (like different replies to the same prompt).
- Beam Search (Best Answer Out of Many) - Imagine trying 5 different answers at once and keeping only the best ones at each step. vLLM shares the common parts between answers and only separates them when they start to differ.
- Shared Prompt Prefix - If many people start their questions the same way (like "Translate this English word to French..."), vLLM remembers the common start and doesn't repeat the same work every time.
- Mixing Different Requests - Some people ask for one answer, others ask for five; vLLM can handle them all together, efficiently, without mixing them up.
Now let us take a deeper dive into each of these:
Parallel Sampling - Asking for Multiple Responses at Once
You might ask an LLM: “Give me 3 different versions of this tweet.”
Normally, this would take 3× memory, since each response has to be generated and stored
separately.
But with vLLM, if all 3 responses start from the same input (prompt), it shares
the starting part's memory across all samples. It only creates separate copies
when the answers begin to change. This is called copy-on-write
– memory is copied only when a change happens.
Think of it like: 3 people writing different endings to the same
story – you give them one copy of the start, and they continue
individually from there.
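To make this concrete, here is a minimal sketch using vLLM's offline Python API (the small model name is just an example, and defaults may differ across versions). The key part is SamplingParams(n=3): three completions of one prompt, so vLLM can share the prompt's KV-cache blocks and copy them only once the samples diverge.

```python
from vllm import LLM, SamplingParams

# Small example model; any model supported by vLLM works here.
llm = LLM(model="facebook/opt-125m")

# n=3 asks for three completions of the same prompt. vLLM keeps one copy of
# the prompt's KV cache and applies copy-on-write only when the three
# continuations start producing different tokens.
params = SamplingParams(n=3, temperature=0.8, max_tokens=64)

outputs = llm.generate(["Write a short tweet about autumn."], params)
for sample in outputs[0].outputs:
    print(sample.text.strip())
```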
Beam Search - Picking the Best Response from Many
Suppose you want not just different replies, but the best possible
one. Beam search tries many options at each step and keeps the top few
(like a tournament). Normally each "beam" takes its own memory. But
vLLM solves this problem.
vLLM’s advantage:
- Shares the common parts of beams.
- Only stores new memory when the beams start to differ.
- When a beam is no longer useful, its memory is freed.
- This reduces memory use by over 50% in many cases.
You can think of it like following several storylines at once and keeping only the best ones. Remember all the possibilities Dr. Strange looked through in Endgame, out of which only one worked? Bam!! It is similar to that.
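As a rough, version-dependent sketch: older vLLM releases exposed beam search through a flag on SamplingParams (shown below), while newer releases moved it to a dedicated beam-search entry point, so check your version's docs before copying the parameter names.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Beam search with 4 beams: at every step only the 4 highest-scoring partial
# sequences survive. Beams that still share a history point at the same
# KV-cache blocks; a block is copied only when two beams diverge, and a
# dropped beam's blocks are freed immediately.
params = SamplingParams(n=4, use_beam_search=True, temperature=0.0, max_tokens=64)

outputs = llm.generate(["Translate to French: I love programming."], params)
for beam in outputs[0].outputs:
    print(beam.cumulative_logprob, beam.text.strip())
```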
Shared Prompt Prefix – Using a Common Starting Template
In many AI tasks, the beginning of every prompt is the same. For example:
“Translate English to French: ‘apple’ => ‘pomme’”
Why generate the same thing over and over? vLLM allows this prefix to be:
- Cached once and reused across many requests.
- Shared much like operating systems share libraries between programs.
This saves memory and speeds up response time.
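Here is a minimal sketch of that idea with vLLM's automatic prefix caching, assuming a version that supports the enable_prefix_caching flag; the few-shot prefix and the small model are made up purely for illustration.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching keeps the KV cache of a previously seen prompt prefix
# so later requests that start with the same tokens can reuse it.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

few_shot_prefix = (
    "Translate English to French:\n"
    "apple => pomme\n"
    "house => maison\n"
)

# Both prompts share the prefix above; its attention work is done only once.
prompts = [few_shot_prefix + "cheese =>", few_shot_prefix + "car =>"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=8))
for out in outputs:
    print(out.outputs[0].text.strip())
```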
Mixing Different Requests
For these mixed workloads, vLLM:
- Batches different request types together; no need to separate them.
- Uses a block-based memory system so each request can be tracked and managed individually.
- Shares memory wherever possible (like shared prompts or common starting tokens).
- Keeps things efficient without mixing up responses.
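As a sketch of such a mixed batch (prompts are illustrative): recent vLLM versions let generate() take either one SamplingParams for all prompts or a list with one entry per prompt, so a single-answer request and a five-sample request can ride in the same batch.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

prompts = [
    "Summarize: The cat sat on the mat.",    # wants one deterministic answer
    "Give me a tagline for a coffee shop.",  # wants five creative candidates
]

# One SamplingParams per prompt. vLLM batches both requests in the same
# iteration; per-request block tables keep their KV caches from mixing.
per_request_params = [
    SamplingParams(temperature=0.0, max_tokens=32),
    SamplingParams(n=5, temperature=0.9, max_tokens=16),
]

outputs = llm.generate(prompts, per_request_params)
for request in outputs:
    for candidate in request.outputs:
        print(candidate.text.strip())
```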
Now, let us combine all these together
People can send all types of requests—some simple, some complex.
Older systems can’t batch them well because of the differences.
vLLM solves this by abstracting memory management:
- It gives each request a "list of blocks" to use.
- The model doesn't care how memory is shared; it just works.
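The following is a toy sketch, not vLLM's actual code, of that "list of blocks" idea: each request owns a block table, physical blocks carry a reference count, forking a sequence (for parallel samples or beams) shares every existing block, and writing into a shared block triggers copy-on-write.

```python
BLOCK_SIZE = 16  # tokens per block, mirroring vLLM's default block size

class PhysicalBlock:
    """A fixed-size chunk of KV-cache memory that sequences can share."""
    def __init__(self):
        self.ref_count = 1
        self.tokens = []

class BlockTable:
    """One request's mapping from logical blocks to (possibly shared) physical blocks."""
    def __init__(self):
        self.blocks = []

    def fork(self):
        # A new sample/beam shares every existing block with its parent.
        child = BlockTable()
        child.blocks = list(self.blocks)
        for block in child.blocks:
            block.ref_count += 1
        return child

    def append_token(self, token):
        last = self.blocks[-1] if self.blocks else None
        if last is None or len(last.tokens) == BLOCK_SIZE:
            last = PhysicalBlock()           # start a fresh, private block
            self.blocks.append(last)
        elif last.ref_count > 1:             # copy-on-write: block is shared
            last.ref_count -= 1
            copy = PhysicalBlock()
            copy.tokens = list(last.tokens)
            self.blocks[-1] = copy
            last = copy
        last.tokens.append(token)

# A 20-token prompt fills one full block and part of a second one.
parent = BlockTable()
for tok in range(20):
    parent.append_token(tok)
children = [parent.fork() for _ in range(3)]  # 3 parallel samples share both blocks
children[0].append_token(99)                  # only now is the partial block copied
```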
What If Memory Runs Out?
vLLM chooses which requests to pause or remove
from GPU memory (temporarily).
There are two smart strategies:
- Swapping – Move the Memory to CPU (like a slower storage room): When memory is full, vLLM moves some paused requests from GPU to CPU RAM. When it's time to resume, it moves them back. This avoids completely canceling requests. Like putting less-needed items in a storage room when your desk is full.
- Recomputation – Redo the Work If Needed: Instead of storing old memory, vLLM can recalculate it quickly when required. This is especially useful when recomputation is faster than swapping. Like rewriting a note from memory instead of searching for the old one.
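For the "slower storage room", vLLM's engine exposes a swap_space argument (CPU RAM in GiB per GPU) reserved for preempted requests' KV-cache blocks; whether a paused request is swapped out or recomputed is decided by the scheduler and may be configurable depending on the version. A minimal sketch, with an example model name:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
    swap_space=4,                 # 4 GiB of CPU RAM per GPU as swap space
)

outputs = llm.generate(["Hello there."], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text.strip())
```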
Special Features:
It handles grouped requests
like beam search together—they’re scheduled or paused as a group. The swap area size is limited, so it never overwhelms the CPU.
vLLM does the following:
- First-Come, First-Served - Like a queue in a bakery, the first person is served first, so it's fair.
- Preemption (Temporary Pausing) - If new requests keep coming in and there's no space, vLLM temporarily pauses the newest ones to focus on the older ones.
- Two Smart Tricks to Save Work - If it had to stop working on a request, it has two options:
1. Swapping: Move the work to a slower room (CPU) and come back to it later.
2. Recomputation: Instead of storing everything, it just redoes the work quickly when it's needed again.
Think of it like putting things in a freezer (swap) or just cooking
again from a recipe (recompute), depending on what’s faster and more
efficient at the time.
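To make the queueing idea concrete, here is a toy sketch, not vLLM's real scheduler: first-come-first-served admission, and when KV-cache blocks run out during decoding, the most recently admitted request is paused (to be swapped out or recomputed later) so older requests keep making progress.

```python
from collections import deque

class ToyScheduler:
    """Toy sketch of FCFS admission plus preemption of the newest request."""

    def __init__(self, total_gpu_blocks):
        self.free_blocks = total_gpu_blocks
        self.waiting = deque()    # FCFS arrival queue of request dicts
        self.running = []         # admitted requests, oldest first
        self.preempted = []       # paused requests: swap out or recompute later

    def submit(self, request_id, prompt_blocks):
        self.waiting.append({"id": request_id, "blocks": prompt_blocks})

    def admit(self):
        # Serve the queue strictly in arrival order while memory allows.
        while self.waiting and self.waiting[0]["blocks"] <= self.free_blocks:
            req = self.waiting.popleft()
            self.free_blocks -= req["blocks"]
            self.running.append(req)

    def decode_step(self):
        # Give each running request one more block; preempt the newest if needed.
        i = 0
        while i < len(self.running):
            req = self.running[i]
            if self.free_blocks == 0:
                victim = self.running.pop()     # newest admitted request
                self.free_blocks += victim["blocks"]
                self.preempted.append(victim)   # would be swapped or recomputed
                if victim is req:
                    break                       # the current request was paused
                continue                        # retry with the freed blocks
            self.free_blocks -= 1
            req["blocks"] += 1
            i += 1

sched = ToyScheduler(total_gpu_blocks=8)
sched.submit("A", prompt_blocks=3)
sched.submit("B", prompt_blocks=3)
sched.submit("C", prompt_blocks=3)
sched.admit()          # A and B fit; C waits (only 2 blocks left)
sched.decode_step()    # A and B each grow by one block, memory fills up
sched.decode_step()    # no free blocks: B (the newest) is paused so A continues
print([r["id"] for r in sched.running], [r["id"] for r in sched.preempted])
```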
How vLLM Works Across Multiple GPUs (Distributed)
Big models (like GPT-3 or LLaMA-65B) don’t fit into one GPU. So we use multiple GPUs
working as a team.
vLLM's design is such that it:
- Uses a common method called tensor model parallelism (like in Megatron-LM).
- Gives each GPU a slice of the model (like one person in an assembly line).
Memory management in vLLM works through a central manager: the scheduler keeps track of all memory across GPUs. Each GPU receives the input tokens and its memory block map, runs its part of the model, and shares results using fast GPU communication (all-reduce).
GPUs don't need to coordinate memory themselves. They just follow the instructions given by the central brain (the scheduler).
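As a closing sketch, the tensor_parallel_size argument is how vLLM is told to shard a model across GPUs in this Megatron-style fashion. The 70B model name is only an example of something too large for one GPU, and you would need four suitable GPUs (and access to the weights) for this to actually run.

```python
from vllm import LLM, SamplingParams

# Shard every layer's weights across 4 GPUs (tensor model parallelism).
# The scheduler still manages a single shared block table; the 4 workers
# exchange partial results with all-reduce at each layer.
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)

outputs = llm.generate(["Explain tensor parallelism in one sentence."],
                       SamplingParams(max_tokens=48))
print(outputs[0].outputs[0].text.strip())
```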