[inductor] add the ability to do heavier search for coordinate descent tuning (#99403)
When checking Meta's internal cmf10x model, I found this interesting kernel https://gist.github.com/shunting314/d4b1fc7352c840ef185c607392e21f31 . Coordinate descent tuning that starts from the out-of-the-box tuner's config finds a sub-optimal config: one that is worse than the best config the max-autotuner can find.
This indicates that the coordinate descent tuner does not necessarily find the optimal config. The starting point matters.
I want to make coordinate descent tuning less dependent on the starting point. I also think that this improvement may make the coordinate descent tuner more likely to find even better configs when starting from the max-autotune result.
There are 2 ideas.
1. Currently, coordinate descent tuning only considers changing one field/coordinate at a time. I add the ability to check all directions (i.e. tuning all tunable fields at the same time) after the normal coordinate descent search fails to find a better choice. I'll check how that works in cmf10x.
2. Currently, when we change a field, we only change it by 1 step (i.e. the radius is 1). I add the ability to use a larger radius. This only affects the search in all directions and does not affect the normal coordinate descent search workflow.
Both are disabled by default.
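To make the two ideas concrete, here is a minimal, self-contained sketch. It is not the inductor implementation: the helper names (`neighbor_values`, `coordinate_descent`, `toy_cost`), the assumption that every tunable field is a power of two changed by doubling or halving, and the toy cost function are all hypothetical and chosen only for illustration. The toy cost has its minimum placed at XBLOCK=4, RBLOCK=128, num_warps=4 by construction; it is not a measurement.

```python
import itertools
import math

# Hypothetical tunable fields. Assumption: each value is a power of two and
# one "step" doubles or halves it.
FIELDS = ("XBLOCK", "RBLOCK", "num_warps")


def neighbor_values(value, radius):
    """Values reachable from `value` within `radius` doubling/halving steps (itself included)."""
    out = [value]
    for step in range(1, radius + 1):
        out.append(value * 2**step)
        if value // 2**step >= 1:
            out.append(value // 2**step)
    return out


def coordinate_descent(start, bench, radius=1, search_all_directions=False):
    """Minimize bench(config); optionally fall back to an all-directions search."""
    best = dict(start)
    best_time = bench(best)
    improved = True
    while improved:
        improved = False
        # Normal pass: change a single field by a single step.
        for field in FIELDS:
            for value in neighbor_values(best[field], 1):
                candidate = {**best, field: value}
                t = bench(candidate)
                if t < best_time:
                    best, best_time, improved = candidate, t, True
        # Heavier pass: only when the normal pass found nothing better.
        # All fields may change at once, each by up to `radius` steps.
        if not improved and search_all_directions:
            old = best
            choices = [neighbor_values(old[f], radius) for f in FIELDS]
            for combo in itertools.product(*choices):
                candidate = {**old, **dict(zip(FIELDS, combo))}
                t = bench(candidate)
                if t < best_time:
                    best, best_time, improved = candidate, t, True
    return best, best_time


def toy_cost(cfg):
    """Toy stand-in for a benchmark: a narrow diagonal valley in (XBLOCK, RBLOCK)."""
    x, r, w = (math.log2(cfg[f]) for f in FIELDS)
    # Leaving the line XBLOCK * RBLOCK == 512 is heavily penalized, so any
    # single-field change looks worse even though the minimum lies elsewhere.
    return 10 * abs(x + r - 9) + abs(x - 2) + abs(w - 2)


start = {"XBLOCK": 64, "RBLOCK": 8, "num_warps": 4, "num_stages": 1}
default_result = coordinate_descent(start, toy_cost)
heavy_result = coordinate_descent(start, toy_cost, search_all_directions=True)
```

On this toy cost, the default search stays at the starting config because every single-field change leaves the valley, while the all-directions search walks along the valley to XBLOCK=4, RBLOCK=128. A larger radius lets each all-directions round jump further, at the price of benchmarking (2 * radius + 1)^3 candidates per round instead of 27.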
Here are the tests I've done:
- OOB (out of the box): 0.083ms 0.003GB 38.13GB/s
- MA (max autotune): 0.016ms 0.003GB 195.60GB/s
  - best config: XBLOCK: 4, RBLOCK: 128, num_warps: 4, num_stages: 1
Default coordinate descent:
- Coordesc (coordinate descent tuner) upon OOB: 0.024ms 0.003GB 131.52GB/s ( **WORSE than Max Autotune** )
  - best config: XBLOCK: 64, RBLOCK: 4, num_warps: 16, num_stages: 1
- Coordesc upon MA: 0.016ms 0.003GB 194.31GB/s (no further improvement upon MA)
Search in all directions: (radius = 1)
- Coordesc upon OOB: 0.017ms 0.003GB 184.55GB/s
  - best config: XBLOCK: 32, RBLOCK: 16, num_warps: 32, num_stages: 1
  - **IMPROVED FROM 0.024ms TO 0.017ms. QUITE CLOSE TO THE ONE FOUND BY MAX-AUTOTUNE**
- Coordesc upon MA: no further improvements upon MA
Search in all directions: (radius = 2)
- Coordesc upon OOB: 0.016ms 0.003GB 192.60GB/s
  - best config: XBLOCK: 8, RBLOCK: 16, num_warps: 8, num_stages: 1
  - **SLIGHTLY BETTER THAN RADIUS=1 for this kernel and on par with max-autotune**
- Coordesc upon MA: no further improvements upon MA
**Overall, the max-autotuner does a really good job for this kernel**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99403
Approved by: https://github.com/jansel