[pt2.0/inductor] Fix race in cache dir across ranks on the same host (#92664)
Summary:
It looks like there is a race in the Triton codegen cache directory when multiple processes run on the same host:
1. Ranks A and B both miss the cache (`/tmp/uid/triton/cache`) and start compiling independently.
2. Most of the time the generated code is identical, but occasionally the two compilations produce different LLIR and different shared-memory sizes (544 vs. 2560 in our case; both are valid for the generated LLIR/PTX). See repro D42584580.
3. Both ranks write the compiled .so and its metadata into the local cache folder under the same directory name (same hash, which does not take the device id into account). There is a race here even if they grab the file lock, because the lock protects each individual file but not the entire transaction.
4. Each rank then loads the .so and metadata back from disk. We can end up loading the .so written by rank A together with the shared-memory size written by rank B, and the two mismatch (see the sketch after this list).
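
As a minimal sketch of the fix (hypothetical helper name; the actual integration point in inductor's codecache is not shown here), making the cache directory unique per process removes the shared-directory race entirely:

```python
import getpass
import os
import tempfile


def per_process_cache_dir() -> str:
    # Hypothetical helper illustrating the fix: include the current pid in
    # the cache path so concurrent ranks on one host never share a directory.
    # Before: /tmp/<user>/triton/cache/<hash> was shared by all local ranks.
    # After:  /tmp/<user>/<pid>/triton/cache/<hash> is private to each rank,
    # matching the paths shown in the test plan below.
    return os.path.join(
        tempfile.gettempdir(),
        getpass.getuser(),
        str(os.getpid()),
        "triton",
        "cache",
    )
```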
Test Plan:
Run the previously faulty program and verify that each rank now resolves a distinct cache directory:
```
[trainer5]: cache dir: /tmp/root/4951/triton/cache/198ef4405d2e525acd20d5c2d01ad099
[trainer1]: cache dir: /tmp/root/4947/triton/cache/198ef4405d2e525acd20d5c2d01ad099
```
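The per-process path component now differs between the two trainer ranks (4951 vs. 4947, presumably the pid of each process), so they no longer share a cache directory and the mismatched load in step 4 cannot occur.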
Differential Revision: D42619405
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92664
Approved by: https://github.com/bertmaher, https://github.com/ngimel, https://github.com/jansel