Fixing a bug where allocating a 4GB block results in using 8GB of memory (#95827)
I added two constants: the first avoids rounding once an allocation exceeds a certain threshold, and the second controls which blocks can be cached.
Allocations larger than `kMaxRoundThreshold` are no longer rounded up to the next power of two. Larger allocations are generally expected to be less frequent, and this more or less matches what `CudaCachingAllocator` does.
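For illustration, here is a minimal sketch of the intended rounding behavior. `kMaxRoundThreshold` is the constant from this PR; the `roundSize` helper and the threshold value are hypothetical and not the actual ATen implementation:

```cpp
#include <cstddef>

// Threshold above which requests are no longer rounded up.
// (Constant name from this PR; the value here is an assumption.)
constexpr std::size_t kMaxRoundThreshold = 1024ULL * 1024 * 1024; // e.g. 1GB

// Round small requests up to the next power of two; pass large
// requests through unchanged so a 4GB request is not inflated to 8GB.
// (Hypothetical helper, for illustration only.)
std::size_t roundSize(std::size_t nbytes) {
  if (nbytes == 0 || nbytes > kMaxRoundThreshold) {
    return nbytes;
  }
  std::size_t rounded = 1;
  while (rounded < nbytes) {
    rounded <<= 1;
  }
  return rounded;
}
```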
Blocks larger than `kMaxCachedSize` will not be cached. This is a separate problem from the one above, but I noticed the caching here is poorly implemented: it does nothing to avoid fragmentation or to improve resource utilization. For example, the following allocations:
```
t1 = alloc(4GB)
del t1
t2 = alloc(10k)
t3 = alloc(4GB)
```
This results in 8GB being allocated: the cached 4GB block freed by `t1` gets assigned to the 10k allocation, wasting the rest of the block, so `t3` has to allocate a fresh 4GB block.
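For illustration, a simplified model of that caching behavior. `kMaxCachedSize` is the constant from this PR; the value, the multimap cache, and the `allocate`/`release` functions are hypothetical, not the real allocator:

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>

// Blocks larger than this are not kept in the free-block cache.
// (Constant name from this PR; the value here is an assumption.)
constexpr std::size_t kMaxCachedSize = 256ULL * 1024 * 1024; // e.g. 256MB

// Free blocks keyed by size. (Simplified model, for illustration only.)
static std::multimap<std::size_t, void*> free_blocks;

void* allocate(std::size_t nbytes) {
  // Reuse the smallest cached block that fits; without a size guard a
  // cached 4GB block can satisfy a 10k request, wasting almost all of it.
  auto it = free_blocks.lower_bound(nbytes);
  if (it != free_blocks.end()) {
    void* ptr = it->second;
    free_blocks.erase(it);
    return ptr;
  }
  return std::malloc(nbytes);
}

void release(std::size_t nbytes, void* ptr) {
  if (nbytes > kMaxCachedSize) {
    std::free(ptr);  // too large to cache: return it to the system
    return;
  }
  free_blocks.emplace(nbytes, ptr);  // keep smaller blocks for reuse
}
```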
Lastly, I would ideally make these constants configurable, but looking around the code I didn't see any existing mechanism in ATen to configure things at runtime.
Fixes #95823
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95827
Approved by: https://github.com/ngimel