[serving] Fix continuous batching JSON response serialization (#45057)
* Fix continuous batching JSON response serialization
Change model_dump_json() to model_dump() to avoid double JSON encoding.
When using continuous batching with stream=false, the response was being
double-encoded as a string instead of returning a proper JSON object.
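The bug can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual handler code: `payload` stands in for the fields of the pydantic response model, and the string returned by `model_dump_json()` is simulated with `json.dumps`.

```python
import json

# Hypothetical response payload standing in for the response model's fields.
payload = {"id": "cmpl-1", "object": "chat.completion", "content": "hello"}

# model_dump_json() already returns a JSON *string*; simulate it here:
model_dump_json = json.dumps(payload)

# Buggy path: serializing that string again double-encodes it.
buggy_body = json.dumps(model_dump_json)
# Fixed path: serialize the plain dict (what model_dump() returns) once.
fixed_body = json.dumps(payload)

print(type(json.loads(buggy_body)))  # <class 'str'> -- client receives a string
print(type(json.loads(fixed_body)))  # <class 'dict'> -- client receives an object
```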
* Add example eval-job script
* fix script
* Add test for continuous batching non-streaming JSON response
Test verifies that non-streaming responses with continuous batching
return proper JSON objects rather than double-encoded JSON strings.
This is a regression test for the fix where model_dump_json() was
changed to model_dump() in the continuous batching response handler.
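The check the regression test relies on can be expressed as a small helper; this is a sketch of the idea, not the test from the PR. A double-encoded body parses to a `str` on the first `json.loads()`, while a correct body parses straight to a `dict`:

```python
import json

def parse_response(body: str) -> dict:
    # A correctly encoded response parses directly to a dict; a
    # double-encoded one parses to a str on the first json.loads().
    parsed = json.loads(body)
    assert isinstance(parsed, dict), "response body was double-encoded"
    return parsed

good = '{"id": "cmpl-1"}'   # proper JSON object
bad = json.dumps(good)      # double-encoded variant of the same body
print(parse_response(good)) # {'id': 'cmpl-1'}
```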
* Fix CI
* Update eval script to use official transformers repo main branch
Changed dependency from personal fork to official huggingface/transformers@main
for production use of the evaluation script.
* Add kernels and Flash Attention 2
* Add continuous batching configuration CLI arguments to serve command
- Add --cb-block-size, --cb-num-blocks, --cb-max-batch-tokens, --cb-max-memory-percent, and --cb-use-cuda-graph flags
- Flags allow users to customize KV cache and performance settings for continuous batching
- Update transformers_serve_cb_eval_job.py to support and pass through CB config arguments
- Update transformers dependency to use NathanHB/transformers@fix-continuous-batching-json-response branch
- All arguments use auto-inference defaults when not specified (backward compatible)
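A possible invocation using the new flags might look like the following; the flag names come from this PR, while the values are purely illustrative, and any flag left unspecified falls back to its auto-inferred default:

```shell
transformers serve \
  --cb-block-size 128 \
  --cb-num-blocks 1024 \
  --cb-max-batch-tokens 2048 \
  --cb-max-memory-percent 0.9 \
  --cb-use-cuda-graph
```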
* Add a thread lock around manager creation to avoid creating duplicate managers
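The locking pattern is the standard double-checked creation of a singleton; a minimal sketch, assuming a module-level manager similar to the continuous-batching manager (the `object()` is a stand-in for the real manager class):

```python
import threading

_manager = None
_manager_lock = threading.Lock()

def get_manager():
    global _manager
    if _manager is None:
        with _manager_lock:
            # Re-check inside the lock: another thread may have
            # created the manager while we were waiting.
            if _manager is None:
                _manager = object()  # stand-in for the real manager class
    return _manager
```

The second `is None` check is the important part: without it, two threads that both pass the first check would each create a manager once they acquire the lock in turn.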
* Change transformers dependency
---------
Co-authored-by: remi-or <remi.pierre_o@orange.fr>