[Pytorch] Improve scale and zero point extraction for per channel quantized (#53726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53726
In quantized linear layers, during deserialization we extract scales and zero
points that are later used by the qnnpack kernels.
Scale and zero point extraction for per-channel quantized tensors is slow:
we index directly into the zero point and scales tensors, and each index
creates a one-element tensor slice which is then cast to int32 or float.
This per-element overhead noticeably increases model loading time.
This diff fixes that.
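The cost difference can be sketched in Python (the actual change is in the C++ deserialization path; the function names `slow_extract`/`fast_extract` below are hypothetical and only illustrate the idea of replacing per-element slice-and-cast with a single bulk conversion):

```python
import torch

def slow_extract(scales, zero_points):
    # Slow pattern: each index builds a 1-element tensor slice,
    # which is then cast to a Python float / int, one channel at a time.
    s = [float(scales[i]) for i in range(scales.numel())]
    z = [int(zero_points[i]) for i in range(zero_points.numel())]
    return s, z

def fast_extract(scales, zero_points):
    # Faster pattern: convert each whole tensor to a Python list once,
    # avoiding per-channel tensor-slice creation.
    return scales.tolist(), zero_points.tolist()

# Per-channel quantization params: one scale/zero point per output channel.
scales = torch.rand(1024, dtype=torch.float64)
zero_points = torch.zeros(1024, dtype=torch.int64)

# Both paths produce identical values; only the extraction cost differs.
assert slow_extract(scales, zero_points) == fast_extract(scales, zero_points)
```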
Test Plan: CI
Reviewed By: raziel
Differential Revision: D26922138
fbshipit-source-id: b78e8548f736e8fa2f6636324ab1a2239b94a27c