[pt2][inductor] add `triton.__version__` as cache key, update cache layout (#98010)
Summary:
* change caching to have `system` and `cache` components, where `system` serves as an identifier for that machine's performance characteristics. This is similar to the original method of using the GPU type and CUDA version as cache keys, but now also includes the Triton version. `cache` holds the same content as the original cache, but without the GPU name or CUDA version in its keys:
```
{
"system": {
"device": "NVIDIA PG509-210",
"version": {
"cuda": "11.4.0",
"triton": "2.1.0"
},
"hash": "e7cfb8786d2e1366b3df564bcb2f957d07545e98bf20c98d33a43b6ee80a91e0"
},
"cache": {
"bias_addmm": {
"[('cuda', 'torch.float32', 2048, 160, 0, 1, 0), ('cuda', 'torch.float32', 2048, 1140, 228148, 1, 206080), ('cuda', 'torch.float32', 1140, 160, 1, 1140, 0)]": {
"bias_addmm-alpha=1-beta=1-c73frtshmeth2spjun3zc4l2q7ck43wl356pnlmsmxgmzbfsz7ef": 0.03654399886727333,
"addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf": 0.03564799949526787,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=4-cxgwpjkimm4azwffrfuqniwncnv4h5bxrpo4od4an4bstnh7qrqh": 0.04927999898791313,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=128-EVEN_K=False-GROUP_M=8-num_stages=3-num_warps=4-cqlirysniekkuuvc4ue33dr4gpfzsb5e4bexarrsnsyei4slxvcz": 0.03651199862360954,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=128-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=3-num_warps=4-cww5uss3k4d3ei2c4lx63pudyzxdwl3ieibhxcrue4zg424eqrnu": 0.03580800071358681,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=128-EVEN_K=False-GROUP_M=8-num_stages=4-num_warps=8-cqcla5edxdm3n6rrkmjehexsudravx6lpphfo5zazldpo3rzpqc4": 0.03558399900794029,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=128-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=4-num_warps=8-c7gdf2snt4bjlnuzdy3px4pyq3lbsdh4jp6jaie7lq6mdxccy6nl": 0.03455999866127968,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=32-EVEN_K=False-GROUP_M=8-num_stages=5-num_warps=8-cjhcy4scxgy4lxbhjiinvxl3bbrqya63jilcckx2ltsg3mpzxyqr": 0.036288000643253326,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=32-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=5-num_warps=8-cu32a5vsbaln3t55jm2y6xhwgyggejmoatyakcm2huvxofw2zzva": 0.0398080013692379,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=128-BLOCK_N=128-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=8-croberh4l55jxlrlgkttigtebsnmosycc5rdtbtn3lp3bpovgz4a": 0.0732479989528656,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=64-BLOCK_M=64-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=3-num_warps=8-c6oxgunysrqpiwwoinylb3sb2hzvx66yhehma64drqvmz52h3r5t": 0.0306560005992651,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=128-BLOCK_M=32-BLOCK_N=32-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=4-cdrev5e3zno6z6flmhlbxgd26gkdpurljyhrw3ovx6pftoe62dpf": 0.04800000041723251,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=16-BLOCK_M=64-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=4-ce3ofrgngrwuo45hw5wqlzztium7gfkf4n5x25gwu4d6ygkea4bs": 0.0751039981842041,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=16-BLOCK_M=32-BLOCK_N=32-EVEN_K=False-GROUP_M=8-num_stages=1-num_warps=2-cfkz2smezre4x7hyhc2kbeawhqup6qpwzgiavrai2ghe5ghouvn4": 0.07401599735021591
},
...,
},
...,
}
}
```
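A minimal sketch of how the `system` entry above could be derived; the function name and exact hashing scheme are illustrative assumptions, not the actual `torch._inductor` implementation (the 64-hex-character `hash` field suggests SHA-256 over the system metadata):

```python
import hashlib
import json


def build_system_entry(device_name, cuda_version, triton_version):
    # Hypothetical helper: collect the machine-identifying fields that
    # the new cache layout stores under the "system" key.
    system = {
        "device": device_name,
        "version": {"cuda": cuda_version, "triton": triton_version},
    }
    # Hash a canonical JSON form so a change in device, CUDA version, or
    # Triton version yields a different hash and invalidates old timings.
    system["hash"] = hashlib.sha256(
        json.dumps(system, sort_keys=True).encode()
    ).hexdigest()
    return system


entry = build_system_entry("NVIDIA PG509-210", "11.4.0", "2.1.0")
```

With this structure, comparing a stored `system` block's `hash` against the freshly computed one is a single string comparison, rather than checking each component key individually.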
Test Plan:
MAST no global: sw-966772723-OfflineTraining_df2509b8
MAST global: sw-966766969-OfflineTraining_19df7c20
Differential Revision: D44550100
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98010
Approved by: https://github.com/jansel