[inductor] let coordinate descent tuning respect max block size (#103660)
It turns out that we need to fix https://github.com/pytorch/pytorch/issues/103656 in the coordinate descent tuner.
Inductor generates Triton code under an assumption about the max block size: if Inductor is sure that numel is a multiple of the max block size, it safely skips the corresponding mask check for performance.
Previously, the coordinate descent tuner did not respect this assumption and could pick a Triton config with an even larger block size. That causes an IMA (illegal memory access).
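The invariant can be sketched roughly as below. This is an illustrative simplification, not Inductor's actual code: the `MAX_BLOCK` values and `is_valid_config` helper are hypothetical stand-ins for the per-dimension caps the generated kernel assumed.

```python
# Hypothetical per-dimension block-size caps (values are illustrative only).
MAX_BLOCK = {"X": 2048, "Y": 1024, "R": 4096}

def is_valid_config(candidate: dict) -> bool:
    """Reject a candidate config whose block size exceeds the assumed max.

    If numel is a multiple of MAX_BLOCK[prefix], the generated Triton code
    may have dropped the bounds mask; running it with a larger block size
    would then read/write out of bounds (an illegal memory access).
    """
    return all(size <= MAX_BLOCK[prefix] for prefix, size in candidate.items())

# Coordinate descent may propose doubling XBLOCK; configs past the cap
# must be filtered out before benchmarking.
assert is_valid_config({"X": 2048})
assert not is_valid_config({"X": 4096})
```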
As an aside, I was wondering how we picked those max block sizes. Not enforcing a max block size may let the coordinate descent tuner find an even better config, but it may slow down other cases a bit because of the extra mask check.
Test:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --amp --performance --inference --inductor --only alexnet
```
Fails before this change and works after.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103660
Approved by: https://github.com/spectrometerHBH, https://github.com/jansel