[PyTorch Edge] Make contexts thread local for quantized matmul (#74676)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74676
We don't want to create and destroy a new context on every multiplication, so the quantized matmul context is made thread local and reused across calls.
Test Plan:
From fbcode:
```buck test caffe2/test:quantization -- test_qmatmul```
# Performance Improvement
*Benchmarking was done on a model which performs matmuls of the same shapes and counts as the Transformer model, as determined in D30901505*
*Notebook in which benchmarking was performed: https://www.internalfb.com/intern/anp/view/?id=1582075&revision_id=1891629751047842*
**Improvement from this diff alone**
~9.71% reduction in latency
- Non Thread Local Contexts (before this diff, D35087184 v2): [8.5410ms](https://www.internalfb.com/intern/aibench/details/661728682381311)
- Thread Local Contexts (this diff, v12): [7.7113ms](https://www.internalfb.com/intern/aibench/details/956655867696198)
**FP32 Matmul vs Quantized Matmul, Overall Improvement from this diff stack**
~56% reduction in latency compared to FP32 matmul, ~71% reduction compared to naive quantized matmul
- FP32 Matmul: [17.4910ms](https://www.internalfb.com/intern/aibench/details/875394396322469)
- Quantized Matmul (after this diff): [7.7113ms](https://www.internalfb.com/intern/aibench/details/956655867696198)
- Naive Quantized Matmul (dequantize → fp32 matmul → quantize, sketched below): [26.8639ms](https://www.internalfb.com/intern/aibench/details/52181682131461)
Reviewed By: kimishpatel
Differential Revision: D34756288
fbshipit-source-id: b000658152cf71b4185dcd34a3cccc71b4cec1f0
(cherry picked from commit 5bc7ef6b5c3255388eb8fab230e44073004d2266)