vllm
[MISC] Add prefix cache hit rate to metrics
#7606
Merged


comaniac 278 days ago (edited)

This PR adds prefix cache hit rate to log metrics. The metrics will be logged only when the prefix cache is enabled. Here is an example:

[INFO 08-16 11:53:40 metrics.py:418] Avg prompt throughput: 2876.7 tokens/s, Avg generation throughput: 384.8 tokens/s, Running: 91 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 95.2%, CPU KV cache usage: 0.0%.
[INFO 08-16 11:53:40 metrics.py:434] Prefix cache hit rate: GPU: 22.16%, CPU: 0.00%

This PR also makes a minor improvement after #7193. Specifically, in evictor v2 we don't have to call .move_to_end after updating the last access time, because a hit block is always removed from the evictor and only added back when it is freed. Since the free_table is an ordered dict, this process already guarantees that the blocks are sorted by access time. Evictor v1 also leverages this characteristic.
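To illustrate the invariant, here is a minimal, self-contained sketch (not the actual vLLM evictor; the helper functions are illustrative), assuming an OrderedDict-backed free table like evictor v2 uses:

from collections import OrderedDict

# free_table maps block_id -> last_accessed; insertion order doubles as
# last-access order because of how blocks enter and leave the table.
free_table: OrderedDict[int, float] = OrderedDict()

def free(block_id: int, now: float) -> None:
    # The block becomes evictable and is appended at the end (newest).
    free_table[block_id] = now

def hit(block_id: int) -> None:
    # A prefix cache hit puts the block back in use, so it leaves the
    # evictable set entirely; no move_to_end is needed on update.
    free_table.pop(block_id)

def evict() -> int:
    # The LRU victim is simply the first entry in insertion order.
    block_id, _ = next(iter(free_table.items()))
    del free_table[block_id]
    return block_id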

Here are some results based on my downstream task for Llama-3-8B on L4:

| Block Manager | Hit Rate | Throughput  |
|---------------|----------|-------------|
| v1            | 18.93%   | 3614 toks/s |
| v2 (main)     | 22.16%   | 3184 toks/s |
| v2 (this PR)  | 22.16%   | 3208 toks/s |

The gap between v1 and v2 (this PR) is still under investigation and is out of scope of this PR.

cc @cadedaniel @xiaobochen123

github-actions 278 days ago

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

cadedaniel commented on 2024-08-16

small comments. can we add a test for at least the block manager v2 case? should be pretty easy to add at the block allocator level

class TestPrefixCachingBlockAllocator:
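For reference, a rough sketch of the kind of allocator-level check being asked for; the constructor arguments and accessor name below are assumptions, not necessarily the merged test:

    @staticmethod
    def test_prefix_cache_hit_rate():
        # Hypothetical sketch: allocate the same token sequence twice so the
        # second pass hits the cached blocks, then read back the metric.
        allocator = PrefixCachingBlockAllocator(num_blocks=4, block_size=16)
        # ... first allocation: all misses; second allocation: hits ...
        rate = allocator.get_prefix_cache_hit_rate()
        assert 0.0 <= rate <= 1.0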

vllm/core/block/interfaces.py
                             num_lookahead_slots: int = 0) -> int:
        pass

+    def prefix_cache_hit_rate(self) -> float:
+        """Prefix cache hit rate. -1 means not supported or disabled."""
+        return -1
cadedaniel 278 days ago

nit: better to keep it abstract so this file doesn't have any implementation logic (trying to keep them as strictly interfaces, not interface+impl)
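A sketch of the fully abstract form being suggested (the interface class name here is illustrative, not the real vLLM class):

from abc import ABC, abstractmethod

class BlockAllocatorInterface(ABC):  # illustrative stand-in for the real interface
    @abstractmethod
    def prefix_cache_hit_rate(self) -> float:
        """Prefix cache hit rate. -1 means not supported or disabled."""
        ...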

vllm/core/block/prefix_caching_block.py
    def all_block_ids(self) -> FrozenSet[int]:
        return self._hashless_allocator.all_block_ids

+   def prefix_cache_hit_rate(self):
cadedaniel 278 days ago

nit: typing -> float
nit: this is a good use case for a read-only @property. If you want to leave it as a method since the upper layers take a device arg, then make it a verb: get_prefix_cache_hit_rate
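Two possible shapes for that accessor, sketched under the assumption of a hypothetical self._metric_data tracker (not the merged code):

    # Option 1: read-only property with an explicit float return type.
    @property
    def prefix_cache_hit_rate(self) -> float:
        return self._metric_data.get_hit_rate()

    # Option 2: keep it a method, named as a verb, since upper layers pass a device.
    def get_prefix_cache_hit_rate(self) -> float:
        return self._metric_data.get_hit_rate()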

vllm/engine/llm_engine.py
        cpu_cache_usage_sys = 1.0 - (num_free_cpu / num_total_cpu)

+       # Prefix Cache Hit Rate
+       cpu_prefix_cache_hit_rate = self.scheduler[
+           0].block_manager.prefix_cache_hit_rate(Device.CPU)
+       gpu_prefix_cache_hit_rate = self.scheduler[
+           0].block_manager.prefix_cache_hit_rate(Device.GPU)
cadedaniel 278 days ago

nit: should this go into the scheduler so that the coupling between the llm engine and the scheduler is reduced?

Do we care about PP here? How do we manage that?

comaniac 278 days ago
  • Moved to the scheduler (see the sketch below).
  • This should also cover PP because the KV cache in all PP stages is identical (correct me if wrong).
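A sketch of what that delegation might look like (method names are assumptions, not necessarily the merged code):

    # In the scheduler: expose the metric so the engine never reaches into
    # block_manager directly.
    def get_prefix_cache_hit_rate(self, device: Device) -> float:
        return self.block_manager.prefix_cache_hit_rate(device)

    # In the engine: query the first scheduler instance.
    cpu_prefix_cache_hit_rate = self.scheduler[0].get_prefix_cache_hit_rate(Device.CPU)
    gpu_prefix_cache_hit_rate = self.scheduler[0].get_prefix_cache_hit_rate(Device.GPU)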
cadedaniel approved these changes on 2024-08-16
tests/core/block/test_prefix_caching_block.py
+    # Test case for cache metrics
+    @staticmethod
+    def test_metric():
cadedaniel 278 days ago

nit: test overflow case

comaniac 278 days ago (edited)

I improved the way overflow is handled so there won't be any overflow anymore. Specifically, we aggregate the hit rate over groups of n*1000 queries, where n is an integer. Additionally, we maintain hit_count and query_count for the remaining (fewer than 1000) queries. Combining the two gives the real hit rate:

incomplete_ratio = query_count / 1000
hit_rate = (grouped_hit_rate * n + (hit_count / query_count) * incomplete_ratio) / (n + incomplete_ratio)

Also improved the test to cover this case.
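A minimal sketch of that bookkeeping (the class and attribute names are illustrative, not necessarily the merged code):

class CacheMetricSketch:
    """Tracks hit rate by folding every 1000 queries into a running
    average, so the raw counters never grow without bound."""

    def __init__(self, interval: int = 1000):
        self.interval = interval
        self.n = 0                   # completed groups of `interval` queries
        self.grouped_hit_rate = 0.0  # running average over completed groups
        self.hit_count = 0           # hits in the current, incomplete group
        self.query_count = 0         # queries in the current, incomplete group

    def query(self, hit: bool) -> None:
        self.query_count += 1
        self.hit_count += int(hit)
        if self.query_count == self.interval:
            # Fold the completed group into the running average and reset.
            group_rate = self.hit_count / self.query_count
            self.grouped_hit_rate = (
                self.grouped_hit_rate * self.n + group_rate) / (self.n + 1)
            self.n += 1
            self.hit_count = 0
            self.query_count = 0

    def get_hit_rate(self) -> float:
        if self.query_count == 0:
            return self.grouped_hit_rate
        incomplete_ratio = self.query_count / self.interval
        return (self.grouped_hit_rate * self.n +
                (self.hit_count / self.query_count) * incomplete_ratio
                ) / (self.n + incomplete_ratio)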

cadedaniel 278 days ago

SG. BTW, I don't think we need this, since a Python int won't overflow.

comaniac 278 days ago

That's true. I'm just afraid that if we host an endpoint for months, the counter will grow to a huge number, which might hurt performance.

cadedaniel 278 days ago

I feel there will be many other performance issues in such a case in vLLM. But I don't mind this code being here, as long as it's well tested.

comaniac added the ready label
comaniac pushed commit "update" (d674e590)
comaniac force-pushed to d674e590 275 days ago
comaniac merged 3ac50b47 into main 275 days ago
comaniac deleted the prefix-hit-rate branch 275 days ago
yudian0504 commented on 2024-11-27
vllm/core/evictor_v2.py
    def update(self, block_id: int, last_accessed: float):
        self.free_table[block_id].last_accessed = last_accessed
-       self.free_table.move_to_end(block_id)
yudian0504 175 days ago

Why was this line removed? The free_table will become unordered if an update happens.
