Fixes for continuous batching (#40828)
* Fix for CB attn mask and refactor
* Tests for CB (not all passing)
* Passing tests and a logger fix
* Fixed the KV metrics that were broken when we moved to hybrid alloc
* Fix circular import and style
* Added tests for FA
* Unfolded test to have device expectations
* Fixes for H100
* More fixes for H100
* H100 are good
* Style
* Adding some comments from #40831
* Rename test
* Avoid 1 letter variables
* Dictionary is only removed during kwargs
* Test for supported sample
* Fix an involuntary slice
* Fixes for non-sliced inputs and small example improvements
* Slice inputs is more understandable
* Style