
Deep Dive: How vLLM Handles Complex & Multiple User Behaviors

When you use ChatGPT or similar tools, sometimes you:

Ask for multiple answers to choose from, like "Give me 3 variations",

Or want the best possible answer out of many options, like "Translate this correctly",

Or ask repeatedly using the same format, like "Summarize this article, then this one, then that one..."

Sometimes too many people try to use the AI at the same time, and the computer (especially its GPU memory) can't handle everything at once. When thousands of people use ChatGPT simultaneously:

  1. The serving system batches requests smartly,
  2. Shares memory when people ask similar prompts,
  3. Evicts low-priority tasks,
  4. Swaps old tasks to RAM,
  5. Or redoes the work efficiently later.

That’s how it handles millions of complex, simultaneous requests without crashing or slowing down too much.

Here comes our hero - vLLM.

Let us see how vLLM solves this. But first, let's go through some unique concepts in brief:

Using vLLM for Other Types of Responses (Decoding Scenarios)

What does vLLM do?

Think of vLLM as a very smart waiter at a crowded restaurant. Instead of taking each order separately and running back and forth for every little thing, it does this:

  1. Parallel Sampling (Multiple Answers) - When you want several outputs, vLLM reuses what’s common (like the same question) to save time and memory. It only copies data when the answers start to differ (like different replies to the same prompt).
  2. Beam Search (Best Answer Out of Many) - Imagine trying 5 different answers at once and keeping only the best ones at each step. vLLM shares the common parts between answers and only separates them when they start to differ.
  3. Shared Prompt Prefix - If many people start their questions the same way (like “Translate this English word to French…”), vLLM remembers the common start and doesn't repeat the same work every time.
  4. Mixing Different Requests - Some people ask for one answer, others ask for five—vLLM can handle them all together, efficiently, without mixing them up.

Let us deep dive into an advanced explanation:

Parallel Sampling - Asking for Multiple Responses at Once

You might ask an LLM: “Give me 3 different versions of this tweet.”

Normally, this would take 3× the memory, since each response has to be generated and stored separately.

But with vLLM, if all 3 responses start from the same input (prompt), it shares the starting part's memory across all samples. It only creates separate copies when the answers begin to change. This is called copy-on-write – memory is copied only when a change happens.

Think of it like: 3 people writing different endings to the same story – you give them one copy of the start, and they continue individually from there.
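Here is a small sketch of what this looks like with vLLM's offline Python API (the tiny `facebook/opt-125m` model is just a stand-in, and exact defaults may differ between vLLM versions). Setting `n=3` asks for three samples of one prompt, and vLLM serves them from one shared copy of the prompt's KV cache:

```python
from vllm import LLM, SamplingParams

# Stand-in model; swap in whatever model you actually serve.
llm = LLM(model="facebook/opt-125m")

# n=3 asks for three independent samples of the same prompt.
# vLLM keeps one shared copy of the prompt's KV cache and only copies
# blocks once the three continuations diverge (copy-on-write).
params = SamplingParams(n=3, temperature=0.8, max_tokens=64)

outputs = llm.generate(["Write a tweet about paged memory for LLMs."], params)
for completion in outputs[0].outputs:
    print("---")
    print(completion.text)
```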

Beam Search - Picking the Best Response from Many

Suppose you want not just different replies, but the best possible one. Beam search tries many options at each step and keeps the top few (like a tournament). Normally each "beam" takes its own memory. But vLLM solves this problem.

vLLM’s advantage:

  1. Shares the common parts of beams.
  2. Only stores new memory when the beams start to differ.
  3. When a beam is no longer useful, its memory is freed.
  4. This reduces memory use by over 50% in many cases.

You can think of it like following several storylines at once and keeping only the best ones. Remember how Dr. Strange explored millions of possible futures in Endgame and kept only the one that worked? It is similar to that.
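For completeness, here is roughly how beam search is requested through vLLM's Python API. One hedge: the `use_beam_search` / `best_of` parameters below follow earlier vLLM releases; newer versions moved beam search to a dedicated entry point, so check the docs for your version.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # stand-in model

# Older-style beam search: explore 5 beams, return the single best one.
# Beams that share a history also share KV-cache blocks; new memory is
# allocated only where they diverge, and dead beams are freed.
params = SamplingParams(
    n=1,                  # return only the best beam
    best_of=5,            # keep the top 5 candidates at each step
    use_beam_search=True, # assumption: older vLLM API; removed in newer releases
    temperature=0.0,
    max_tokens=64,
)

outputs = llm.generate(["Translate to French: 'The cat sits on the mat.'"], params)
print(outputs[0].outputs[0].text)
```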

Shared Prompt Prefix – Using a Common Starting Template

In many AI tasks, the beginning of every prompt is the same. For example:

“Translate English to French: ‘apple’ => ‘pomme’”

Why generate the same thing over and over? vLLM allows this prefix to be cached and shared:

  1. The prefix is computed once and reused across many requests.
  2. This works like how operating systems share libraries between programs.
  3. It saves memory and speeds up response time.
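A minimal sketch of this in code, assuming a vLLM version that supports automatic prefix caching via the `enable_prefix_caching` flag:

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets vLLM reuse KV-cache blocks for any prompt
# prefix it has already computed.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

template = (
    "Translate English to French.\n"
    "Example: 'apple' => 'pomme'\n"
    "Now translate: "
)

# All three prompts share the same template, so its attention cache is
# computed once and reused by every request.
prompts = [template + word for word in ["'house'", "'river'", "'bicycle'"]]

outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=16))
for out in outputs:
    print(out.outputs[0].text.strip())
```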

Mixing Different Requests

  1. Batches different request types together — no need to separate them.
  2. Uses a block-based memory system so each request can be tracked and managed individually.
  3. Shares memory wherever possible (like shared prompts or common starting tokens).
  4. Keeps things efficient without mixing up responses.
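In code, mixing request types is as simple as passing one `SamplingParams` per prompt. A small sketch (stand-in model, illustrative prompts):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # stand-in model

prompts = [
    "Summarize: vLLM manages KV-cache memory in fixed-size blocks.",
    "Give me tweet ideas about paged attention.",
]

# One prompt wants a single answer, the other wants five; vLLM batches
# them in the same iterations while tracking each request's blocks separately.
per_request_params = [
    SamplingParams(n=1, temperature=0.0, max_tokens=48),
    SamplingParams(n=5, temperature=0.9, max_tokens=48),
]

outputs = llm.generate(prompts, per_request_params)
for out in outputs:
    print(f"{len(out.outputs)} completion(s) for: {out.prompt!r}")
```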

Now, let us put all of this together

People can send all types of requests—some simple, some complex. Older systems can’t batch them well because of the differences.

vLLM solves this by abstracting memory management:

  1. It gives each request a “list of blocks” to use.
  2. The model doesn’t care how memory is shared—it just works.
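To make the "list of blocks" idea concrete, here is a toy illustration in Python. This is not vLLM's real code, just a sketch of the bookkeeping: a new physical block is grabbed only when the previous one fills up, and everything is returned to the free pool when the request finishes.

```python
# Toy illustration only (not vLLM's real code): each request owns a list
# of block IDs, and a manager maps them to physical GPU blocks.
BLOCK_SIZE = 16  # tokens per block, mirroring vLLM's default

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))
        self.tables: dict[str, list[int]] = {}  # request id -> physical block ids

    def append_token(self, request_id: str, num_tokens_so_far: int) -> None:
        """Grab a new physical block only when the last one has filled up."""
        table = self.tables.setdefault(request_id, [])
        if not table or num_tokens_so_far % BLOCK_SIZE == 1:
            table.append(self.free.pop())

    def free_request(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free.extend(self.tables.pop(request_id, []))

mgr = BlockManager(num_physical_blocks=8)
for t in range(1, 20):          # generate 19 tokens for one request
    mgr.append_token("req-1", t)
print(mgr.tables["req-1"])      # two physical blocks cover all 19 tokens
```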

What If Memory Runs Out?

vLLM chooses which requests to pause or remove from GPU memory (temporarily).

There are two smart strategies:

  1. Swapping – Move the Memory to CPU (like a slower storage room) - When memory is full, vLLM moves some paused requests from GPU to CPU RAM. When it’s time to resume, it moves them back. This avoids completely canceling requests. Like putting less-needed items in a storage room when your desk is full.
  2. Recomputation – Redo the Work If Needed - Instead of storing old memory, vLLM can recalculate it quickly when required. Especially useful when recomputation is faster than swapping. Like rewriting a note from memory instead of searching for the old one.

Special Features:

It handles grouped requests like beam search together—they’re scheduled or paused as a group. The swap area size is limited, so it never overwhelms the CPU.
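If you want to see where these knobs live, here is a hedged configuration sketch. `gpu_memory_utilization` and `swap_space` are standard engine arguments; `preemption_mode` exists in recent vLLM releases, but its name and default may differ in your version, so treat it as an assumption to verify:

```python
from vllm import LLM

# Hedged sketch: gpu_memory_utilization and swap_space are standard engine
# arguments; preemption_mode is assumed from recent vLLM releases (check
# your version). "swap" moves paused blocks to CPU RAM; "recompute" frees
# them and redoes the work when the request resumes.
llm = LLM(
    model="facebook/opt-125m",    # stand-in model
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may use
    swap_space=4,                 # GiB of CPU RAM per GPU for swapped-out blocks
    preemption_mode="swap",       # or "recompute"
)
```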

vLLM does the following:

  1. First-Come, First-Served - Like a queue in a bakery, the first person is served first, so it’s fair.
  2. Preemption (Temporary Pausing) - If new requests keep coming in and there's no space, vLLM temporarily pauses the newest ones to focus on the older ones.
  3. Two Smart Tricks to Save Work - If it had to stop working on a request, it has two options:

1. Swapping: Move the work to a slower room (CPU) and come back to it later.

2. Recomputation: Instead of storing everything, it just redoes the work quickly when it’s needed again.

Think of it like putting things in a freezer (swap) or just cooking again from a recipe (recompute), depending on what’s faster and more efficient at the time.
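Here is a toy sketch of that scheduling policy, first-come first-served with the newest requests preempted first. It is an illustration of the idea only, not vLLM's actual scheduler:

```python
from collections import deque

# Toy illustration of first-come-first-served scheduling with preemption
# (not vLLM's real scheduler). Each running request needs one more
# KV-cache block per decode step; when blocks run out, the newest request
# is paused and its blocks are reclaimed (to be swapped out or recomputed later).
def step(running: deque, waiting: deque, free_blocks: int) -> int:
    needed = len(running)                 # one new block per running request
    while needed > free_blocks and running:
        victim = running.pop()            # newest request is preempted first
        waiting.appendleft(victim)        # it rejoins the front of the queue
        free_blocks += victim["blocks"]   # its blocks are reclaimed
        needed -= 1
    for req in running:                   # oldest requests keep generating
        req["blocks"] += 1
    return free_blocks - needed

running = deque([{"id": "old", "blocks": 6}, {"id": "new", "blocks": 6}])
waiting: deque = deque()
free_blocks = step(running, waiting, free_blocks=1)
print([r["id"] for r in running], [r["id"] for r in waiting], free_blocks)
# ['old'] ['new'] 6  -> the newer request was paused so the older one can finish
```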

How vLLM Works Across Multiple GPUs (Distributed)

Big models (like GPT-3 or LLaMA-65B) don't fit into one GPU. So we use multiple GPUs working as a team. vLLM's design is such that it:

  1. Uses a common method called tensor model parallelism (like in Megatron-LM).
  2. Each GPU takes a slice of the model (like one person in an assembly line).

Memory management in vLLM works through a central manager: the scheduler keeps track of all memory across GPUs. Each GPU receives the input tokens and the memory block map, runs its part of the model, and shares results using fast GPU communication (all-reduce).

GPUs don’t need to coordinate memory themselves. They just follow the instructions given by the central brain (scheduler).
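A minimal sketch of turning this on from the Python API. `tensor_parallel_size` is the real knob; the 70B model name and the assumption of 4 GPUs with enough memory are just for illustration:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=4 splits every layer's weights across 4 GPUs
# (Megatron-LM style tensor model parallelism). The central scheduler
# still plans memory; each GPU just follows the block tables it is given.
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```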

