[inductor] improve bandwidth computation (#97057)
When we compute the bandwidth for a kernel, we should double the memory usage of in-place arguments, since we need to read them once and write them once.
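The accounting can be illustrated with a minimal sketch. This is not the Inductor implementation: the function name `estimate_bandwidth_gbps` and its parameters are hypothetical, and it assumes each argument's size in bytes and the kernel's elapsed time are already known.

```python
from typing import Iterable, Sequence


def estimate_bandwidth_gbps(
    arg_nbytes: Sequence[int],
    inplace_indices: Iterable[int],
    elapsed_ms: float,
) -> float:
    """Estimate achieved memory bandwidth in GB/s (hypothetical helper).

    arg_nbytes: size in bytes of each kernel argument.
    inplace_indices: positions of arguments that are updated in place.
    elapsed_ms: measured kernel time in milliseconds.
    """
    inplace = set(inplace_indices)
    total_bytes = 0
    for i, nbytes in enumerate(arg_nbytes):
        # An in-place argument is read once and written once,
        # so its bytes are counted twice.
        total_bytes += nbytes * (2 if i in inplace else 1)
    seconds = elapsed_ms * 1e-3
    return total_bytes / seconds / 1e9


# Two 4 MB arguments, the first one in-place, kernel time 1 ms:
# 12 MB of traffic is counted instead of 8 MB.
bw = estimate_bandwidth_gbps([4_000_000, 4_000_000], [0], 1.0)
```

Without the doubling, the same kernel would be credited with only 8 MB of traffic, so its bandwidth would be underestimated.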
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97057
Approved by: https://github.com/Chillee