👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck
CI which consists a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of default ones by unblocking the steps in your fast-check
build on Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
/ready
on the PRready
label to the PR🚀
small comments. can we add a test for at least the block manager v2 case? should be pretty easy to add at the block allocator level
684 | 684 | ||
685 | # Test case for cache mertics | ||
686 | @staticmethod | ||
687 | def test_metric(): |
nit: test overflow case
I improved the way of handling overflow so there won't be overflow anymore. Specifically, we group the hit rate of n*1000
queries, where n
is an integer. Additionally, we maintain hit_count
and query_count
for less than 1000 queries. Thus, we could combine them to get the real hit rate:
incomplete_ratio = query_count / 1000
(grouped_hit_rate * n + (hit_count / query_count) * incomplete_ratio) / (n + incomplete_ratio)
Also improved the test to cover this case.
SG. btw i don't think we need this since python int won't overflow
That's true. I'm just afraid that if we host an endpoint for months, the counter will grow to a huge number which might hurt performance
I feel there will be many other performance issues in such a case in vLLM. But I don't mind this code being here, as long as it's well tested.
110 | 112 | ||
111 | 113 | def update(self, block_id: int, last_accessed: float): | |
112 | 114 | self.free_table[block_id].last_accessed = last_accessed | |
113 | self.free_table.move_to_end(block_id) |
why remove this line?
the free_table will be unordered if update op happens.
Login to write a write a comment.
This PR adds prefix cache hit rate to log metrics. The metrics will be logged only when the prefix cache is enabled. Here is an example:
This PR also makes a minor improvement after #7193. Specifically in the evictor v2, we don't have to
.move_to_end
after updating the last access time, because the hit block will always be removed from evictor and added back when free. Since thefree_table
is an ordered dict, this process already guarantees the blocks are sorted by access time. The evictor v1 also leverages this characteristic.Here are some results based on my downstream task for Llama-3-8B on L4:
The gap between v1 and v2 (this PR) is still under investigation and is out of scope of this PR.
cc @cadedaniel @xiaobochen123