[CB] [Major] Asynchronous batching (#43960)
* Cleanup: batch is more self-contained
* Created utils.py file
* Moved pad to utils
* Pin memory for input and outputs
* Consolidate inputs into a bulk tensor
* Consolidated read and write indices
* Add the transfer_inputs fn
* Renames and getters
* Remove useless sync
* Move graphs to the IOs
* Async done except for carry_in_ids
* Add carry over (scheduler not picking it up though)
* Remodeled scheduling
* Fix carry over
* Fix stream
* Bumped _upper_bound_num_blocks
* Faster compute for physical read indices
* Final actual changes
* Address some todos
* Rename input_outputs
* Modify the behavior of async
* Fix bugs
* Added async tests
* Fix test
* Remodel example
* Fix offload test
* Fix real cause of offload fail
* Nits
* Propagate use_async
* Performance fixes 1
* More flexibility for cuda graphs
* Remodeled the read and write indices
* Review compliance
* More doc and beautiful ASCII
* Style
* Fixes for end of generation
* Review compliance
* More tokens to pass test