openvino
e5c24ae9 - [NPUW] Support prefill-chunk for text-embedding model (#33076)

Commit

135 days ago

[NPUW] Support prefill-chunk for text-embedding model (#33076)

### Details:
Qwen3-text-embedding is a transformer-based causal model, but it is not a traditional LLM and is not directly adapted to NPUW.

The benefits of prefill-chunk for `Qwen3-text-embedding`:
- Support for long context
- Performance improvement

Changes:
- Added KVCache nodes to the model and updated the shapes of related nodes.
- Added a `position_ids` input node, since it is hardcoded in the original model.
- Created a separate model to handle the post-processing.
- Cached the output of prefill, since `mean` post-processing needs the entire output data.

Notes:
1. Although the kvcache model is not needed at all, it is still there, because I don't want to add many `if-else` branches. The penalty is increased compilation time.
2. Padding is only supported in the mean post-processing mode for now, which keeps things simple. I can add left-padding support in following PRs if required.
3. GenAI PR: https://github.com/openvinotoolkit/openvino.genai/pull/3088
4. The [tests](https://jira.devtools.intel.com/secure/attachment/5782028/text_embeddings.py) have been verified to work with both the NPUW and GenAI updates.

Update:
1. Introduced new files `embedding_model_utils.cpp` and `embedding_model_utils.hpp` to encapsulate all embedding-related functionality.
2. Added `embedding_infer_request.cpp` and `embedding_infer_request.hpp` to implement the new request type `EmbeddingInferRequest`.
3. Created `llm_infer_base_request.hpp` as a common base class for `llm-infer-request` and `embedding-infer-request`.

### Tickets:
- [CVS-177453](https://jira.devtools.intel.com/browse/CVS-177453)
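
The commit message says the prefill output is cached because `mean` post-processing needs the entire output. A minimal Python sketch of that idea follows; it is an illustration only, not the NPUW implementation. The names `chunked_prefill_mean`, `run_chunk`, and `toy_run_chunk` are hypothetical, right-side padding is assumed, and the KV cache that the real model carries between chunks is not modeled here.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def chunked_prefill_mean(
    token_ids: Sequence[int],
    attention_mask: Sequence[int],
    chunk_size: int,
    run_chunk: Callable[[Sequence[int], Sequence[int]], List[Vector]],
) -> Vector:
    """Run prefill chunk by chunk, cache every output row, then mean-pool.

    `run_chunk` is a hypothetical stand-in for one prefill inference call.
    It receives the chunk's token ids and explicit position ids and returns
    one hidden-state vector per token.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(token_ids) != len(attention_mask):
        raise ValueError("token_ids and attention_mask must have equal length")

    cached: List[Vector] = []  # outputs of all chunks, in token order
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Positions continue across chunks, which is why they are passed in
        # explicitly rather than being fixed inside the model.
        position_ids = list(range(start, start + len(chunk)))
        cached.extend(run_chunk(chunk, position_ids))

    # Mean pooling needs every non-padded row, hence the cache above.
    kept = [row for row, keep in zip(cached, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(row[d] for row in kept) / len(kept) for d in range(dim)]


def toy_run_chunk(chunk: Sequence[int], position_ids: Sequence[int]) -> List[Vector]:
    # Toy "model" for demonstration: a 2-element vector per token.
    return [[float(t), float(p)] for t, p in zip(chunk, position_ids)]


if __name__ == "__main__":
    tokens = [10, 20, 30, 40, 0, 0]  # last two entries are right padding
    mask = [1, 1, 1, 1, 0, 0]
    print(chunked_prefill_mean(tokens, mask, 4, toy_run_chunk))
```

In this sketch the pooled vector does not depend on the chunk size, because pooling runs only after all chunk outputs have been collected.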

References

#33076 - [NPUW] Support prefill-chunk for text-embedding model

Author

mengweiguo

Parents

894827f0
