a91232af - [serving] Fix continuous batching JSON response serialization (#45057)

* Fix continuous batching JSON response serialization

  Change model_dump_json() to model_dump() to avoid double JSON encoding. When using continuous batching with stream=false, the response was being double-encoded as a string instead of returning a proper JSON object.

* Add example script eval-job
* Fix script
* Add test for continuous batching non-streaming JSON response

  The test verifies that non-streaming responses with continuous batching return proper JSON objects rather than double-encoded JSON strings. This is a regression test for the fix where model_dump_json() was changed to model_dump() in the continuous batching response handler.

* Fix CI
* Update eval script to use the official transformers repo main branch

  Changed the dependency from a personal fork to official huggingface/transformers@main for production use of the evaluation script.

* Add kernels and flash attn 2
* Add continuous batching configuration CLI arguments to the serve command

  - Add --cb-block-size, --cb-num-blocks, --cb-max-batch-tokens, --cb-max-memory-percent, and --cb-use-cuda-graph flags
  - The flags let users customize KV cache and performance settings for continuous batching
  - Update transformers_serve_cb_eval_job.py to support and pass through CB config arguments
  - Update the transformers dependency to use the NathanHB/transformers@fix-continuous-batching-json-response branch
  - All arguments fall back to auto-inference defaults when not specified (backward compatible)

* Add a thread lock for manager creation to avoid creating a duplicate manager
* Change transformers dep

---------

Co-authored-by: remi-or <remi.pierre_o@orange.fr>
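The double-encoding bug fixed here can be illustrated with a minimal sketch using the standard `json` module standing in for Pydantic's `model_dump_json()`/`model_dump()` and the server's own JSON encoder; the payload and field names below are hypothetical, not the real response schema:

```python
import json

# Hypothetical payload standing in for the Pydantic response model's fields.
payload = {"id": "resp-1", "choices": [{"text": "hello"}]}

# Buggy path: the handler returned model_dump_json() (already a JSON string),
# and the serving layer then JSON-encoded that string a second time.
double_encoded = json.dumps(json.dumps(payload))

# A client decoding once gets a quoted string, not a JSON object.
decoded_once = json.loads(double_encoded)
assert isinstance(decoded_once, str)

# Fixed path: hand back a plain dict (model_dump()) so the serving layer
# encodes exactly once, producing a proper JSON object.
single_encoded = json.dumps(payload)
assert json.loads(single_encoded) == payload
```

This is why the fix is a one-word change: returning the dict form lets the framework own the single serialization step.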
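The "thread lock for manager creation" item can be sketched as double-checked locking around lazy initialization; the class and method names below are stand-ins (the commit only states that a lock was added so concurrent requests cannot create two managers), not the actual transformers serving API:

```python
import threading

class ServeSession:
    """Sketch: guard lazy creation of a batching manager with a lock,
    so two concurrent requests cannot each build their own manager."""

    def __init__(self):
        self._manager = None
        self._lock = threading.Lock()
        self.created = 0  # counts how many managers were actually built

    def _create_manager(self):
        # Stand-in for constructing the real continuous-batching manager.
        self.created += 1
        return object()

    def get_manager(self):
        # Double-checked locking: fast path when the manager already exists;
        # take the lock only around creation to avoid a duplicate manager.
        if self._manager is None:
            with self._lock:
                if self._manager is None:
                    self._manager = self._create_manager()
        return self._manager

session = ServeSession()
results = []
threads = [threading.Thread(target=lambda: results.append(session.get_manager()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert session.created == 1
assert all(m is results[0] for m in results)
```

Without the lock, two requests racing through the `None` check could both construct a manager, which is exactly the "double manager" the commit message describes.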