[Static Runtime] Added a cache for NNC generated code across different calls to the same ops (#62921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62921
Added a cache for NNC generated code across different calls to the same ops.
Before this diff:
```
ProcessedNode time 13402.9 ms
Static Module initialization took 30964.8 ms
```
After this diff:
```
ProcessedNode time 85.4195 ms
Static Module initialization took 4348.42 ms
```
There is one global cache shared by all the ops, guarded by a reader-writer lock. The lock is necessary because multiple threads may load different models in parallel. Note that this locking does not guarantee that exactly one piece of code is generated per op: several threads may generate code for the same op simultaneously, and each of them will update the cache in some order. The duplicated work is small, bounded by the number of threads, and there is no correctness issue: the generated code is always the same, so the copy written by the last thread is retained in the cache and reused later while running the model.
Test Plan: Tested inline_cvr model
Reviewed By: hlu1
Differential Revision: D30104017
fbshipit-source-id: 32e9af43d7e724ed54b661dfe58a73a14e443ff7