[reland][inductor] do benchmark in sub processes for max autotuning (#97215)
The previous attempt to land this PR was reverted due to a land race: https://github.com/pytorch/pytorch/pull/96410 .
The cause was that `PyCodeCache.load` had gained a new linemap argument, which my previous PR did not handle because it was based on a stale checkout. The fix is trivial.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97215
Approved by: https://github.com/Chillee, https://github.com/jansel