vllm
[MISC] Add prefix cache hit rate to metrics
#7606
Merged


comaniac 278 days ago (edited)

This PR adds prefix cache hit rate to log metrics. The metrics will be logged only when the prefix cache is enabled. Here is an example:

[INFO 08-16 11:53:40 metrics.py:418] Avg prompt throughput: 2876.7 tokens/s, Avg generation throughput: 384.8 tokens/s, Running: 91 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 95.2%, CPU KV cache usage: 0.0%.
[INFO 08-16 11:53:40 metrics.py:434] Prefix cache hit rate: GPU: 22.16%, CPU: 0.00%

This PR also makes a minor improvement after #7193. Specifically, in evictor v2 we don't have to call .move_to_end after updating the last access time, because a hit block is always removed from the evictor and only added back when it is freed. Since the free_table is an ordered dict, this process already guarantees that the blocks are sorted by access time. Evictor v1 also leverages this characteristic.
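To illustrate the invariant, here is a minimal, self-contained sketch (not the actual vLLM evictor; the helper functions are illustrative), assuming an OrderedDict-backed free table like evictor v2 uses:

from collections import OrderedDict

# free_table maps block_id -> last_accessed; insertion order doubles as
# last-access order because of how blocks enter and leave the table.
free_table: OrderedDict[int, float] = OrderedDict()

def free(block_id: int, now: float) -> None:
    # The block becomes evictable and is appended at the end (newest).
    free_table[block_id] = now

def hit(block_id: int) -> None:
    # A prefix cache hit puts the block back in use, so it leaves the
    # evictable set entirely; no move_to_end is needed on update.
    free_table.pop(block_id)

def evict() -> int:
    # The LRU victim is simply the first entry in insertion order.
    block_id, _ = next(iter(free_table.items()))
    del free_table[block_id]
    return block_id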

Here are some results based on my downstream task for Llama-3-8B on L4:

| Block Manager | Hit Rate | Throughput  |
|---------------|----------|-------------|
| v1            | 18.93%   | 3614 toks/s |
| v2 (main)     | 22.16%   | 3184 toks/s |
| v2 (this PR)  | 22.16%   | 3208 toks/s |

The gap between v1 and v2 (this PR) is still under investigation and is out of scope of this PR.

cc @cadedaniel @xiaobochen123

github-actions 278 days ago

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

cadedaniel commented on 2024-08-16

small comments. can we add a test for at least the block manager v2 case? should be pretty easy to add at the block allocator level

class TestPrefixCachingBlockAllocator:
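For reference, a rough sketch of the kind of allocator-level check being asked for; the constructor arguments and accessor name below are assumptions, not necessarily the merged test:

    @staticmethod
    def test_prefix_cache_hit_rate():
        # Hypothetical sketch: allocate the same token sequence twice so the
        # second pass hits the cached blocks, then read back the metric.
        allocator = PrefixCachingBlockAllocator(num_blocks=4, block_size=16)
        # ... first allocation: all misses; second allocation: hits ...
        rate = allocator.get_prefix_cache_hit_rate()
        assert 0.0 <= rate <= 1.0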

vllm/core/block/interfaces.py
                             num_lookahead_slots: int = 0) -> int:
        pass

+    def prefix_cache_hit_rate(self) -> float:
+        """Prefix cache hit rate. -1 means not supported or disabled."""
+        return -1
cadedaniel 278 days ago

nit: better to keep it abstract so this file doesn't have any implementation logic (trying to keep them as strictly interfaces, not interface+impl)
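A sketch of the fully abstract form being suggested (the interface class name here is illustrative, not the real vLLM class):

from abc import ABC, abstractmethod

class BlockAllocatorInterface(ABC):  # illustrative stand-in for the real interface
    @abstractmethod
    def prefix_cache_hit_rate(self) -> float:
        """Prefix cache hit rate. -1 means not supported or disabled."""
        ...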

vllm/core/block/prefix_caching_block.py
    def all_block_ids(self) -> FrozenSet[int]:
        return self._hashless_allocator.all_block_ids

+   def prefix_cache_hit_rate(self):
cadedaniel 278 days ago

nit: typing -> float
nit: this is a good use case for a read-only @property. If you want to leave it as a method since the upper layers take a device arg, then make it a verb: get_prefix_cache_hit_rate
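Two possible shapes for that accessor, sketched under the assumption of a hypothetical self._metric_data tracker (not the merged code):

    # Option 1: read-only property with an explicit float return type.
    @property
    def prefix_cache_hit_rate(self) -> float:
        return self._metric_data.get_hit_rate()

    # Option 2: keep it a method, named as a verb, since upper layers pass a device.
    def get_prefix_cache_hit_rate(self) -> float:
        return self._metric_data.get_hit_rate()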

vllm/engine/llm_engine.py
        cpu_cache_usage_sys = 1.0 - (num_free_cpu / num_total_cpu)

+       # Prefix Cache Hit Rate
+       cpu_prefix_cache_hit_rate = self.scheduler[
+           0].block_manager.prefix_cache_hit_rate(Device.CPU)
+       gpu_prefix_cache_hit_rate = self.scheduler[
+           0].block_manager.prefix_cache_hit_rate(Device.GPU)
cadedaniel 278 days ago

nit: should this go into the scheduler so that the coupling between the llm engine and the scheduler is reduced?

Do we care about PP here? How do we manage that?

comaniac 278 days ago
  • Moved to the scheduler (see the sketch below).
  • This should also cover PP because the KV cache in all PP stages is identical (correct me if wrong).
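A sketch of what that delegation might look like (method names are assumptions, not necessarily the merged code):

    # In the scheduler: expose the metric so the engine never reaches into
    # block_manager directly.
    def get_prefix_cache_hit_rate(self, device: Device) -> float:
        return self.block_manager.prefix_cache_hit_rate(device)

    # In the engine: query the first scheduler instance.
    cpu_prefix_cache_hit_rate = self.scheduler[0].get_prefix_cache_hit_rate(Device.CPU)
    gpu_prefix_cache_hit_rate = self.scheduler[0].get_prefix_cache_hit_rate(Device.GPU)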
cadedaniel approved these changes on 2024-08-16
tests/core/block/test_prefix_caching_block.py
+    # Test case for cache metrics
+    @staticmethod
+    def test_metric():
cadedaniel 278 days ago

nit: test overflow case

comaniac 278 days ago (edited)

I improved the way overflow is handled so there won't be any overflow anymore. Specifically, we aggregate the hit rate over groups of n*1000 queries, where n is an integer. Additionally, we maintain hit_count and query_count for the remaining (fewer than 1000) queries. Combining the two gives the real hit rate:

incomplete_ratio = query_count / 1000
hit_rate = (grouped_hit_rate * n + (hit_count / query_count) * incomplete_ratio) / (n + incomplete_ratio)

Also improved the test to cover this case.
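A minimal sketch of that bookkeeping (the class and attribute names are illustrative, not necessarily the merged code):

class CacheMetricSketch:
    """Tracks hit rate by folding every 1000 queries into a running
    average, so the raw counters never grow without bound."""

    def __init__(self, interval: int = 1000):
        self.interval = interval
        self.n = 0                   # completed groups of `interval` queries
        self.grouped_hit_rate = 0.0  # running average over completed groups
        self.hit_count = 0           # hits in the current, incomplete group
        self.query_count = 0         # queries in the current, incomplete group

    def query(self, hit: bool) -> None:
        self.query_count += 1
        self.hit_count += int(hit)
        if self.query_count == self.interval:
            # Fold the completed group into the running average and reset.
            group_rate = self.hit_count / self.query_count
            self.grouped_hit_rate = (
                self.grouped_hit_rate * self.n + group_rate) / (self.n + 1)
            self.n += 1
            self.hit_count = 0
            self.query_count = 0

    def get_hit_rate(self) -> float:
        if self.query_count == 0:
            return self.grouped_hit_rate
        incomplete_ratio = self.query_count / self.interval
        return (self.grouped_hit_rate * self.n +
                (self.hit_count / self.query_count) * incomplete_ratio
                ) / (self.n + incomplete_ratio)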

cadedaniel 278 days ago

SG. BTW, I don't think we need this, since a Python int won't overflow.

comaniac 278 days ago

That's true. I'm just afraid that if we host an endpoint for months, the counter will grow to a huge number, which might hurt performance.

cadedaniel 278 days ago

I feel there will be many other performance issues in such a case in vLLM. But I don't mind this code being here, as long as it's well tested.

comaniac added the ready label
comaniac pushed commit "update" (d674e590)
comaniac force-pushed to d674e590 275 days ago
comaniac merged 3ac50b47 into main 275 days ago
comaniac deleted the prefix-hit-rate branch 275 days ago
yudian0504 commented on 2024-11-27
vllm/core/evictor_v2.py
    def update(self, block_id: int, last_accessed: float):
        self.free_table[block_id].last_accessed = last_accessed
-       self.free_table.move_to_end(block_id)
yudian0504 175 days ago

Why was this line removed? The free_table will become unordered if an update happens.
