[PyTorch] Don't read 1 char per iteration in Unpickler::readString (#51901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51901
It's much more efficient to read multiple chars with 1 memcpy than to call `read<char>` multiple times.
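For illustration, here is a minimal sketch of the difference, not the actual patch. `BufferedReader`, `readStringCharByChar`, and `readStringBulk` are hypothetical names, and the sketch assumes the whole input is already in one in-memory buffer and that strings are newline-terminated; the real `Unpickler` may refill its buffer and validate characters differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a buffered reader; NOT the real torch::jit::Unpickler.
class BufferedReader {
 public:
  BufferedReader(const char* data, std::size_t size) : data_(data), size_(size) {}

  // Generic single-value read: one bounds check and one memcpy per call.
  template <typename T>
  T read() {
    if (size_ - pos_ < sizeof(T)) {
      throw std::runtime_error("unexpected end of input");
    }
    T value;
    std::memcpy(&value, data_ + pos_, sizeof(T));
    pos_ += sizeof(T);
    return value;
  }

  // Slow path: pays the per-call overhead of read<char>() for every character.
  std::string readStringCharByChar() {
    std::string out;
    while (true) {
      const char c = read<char>();
      if (c == '\n') {
        break;
      }
      out.push_back(c);
    }
    return out;
  }

  // Fast path: locate the terminator once, then copy the whole run in one go.
  std::string readStringBulk() {
    const char* start = data_ + pos_;
    const void* nl = std::memchr(start, '\n', size_ - pos_);
    if (nl == nullptr) {
      throw std::runtime_error("unexpected end of input");
    }
    const std::size_t len =
        static_cast<std::size_t>(static_cast<const char*>(nl) - start);
    std::string out(start, len);  // single bulk copy of len bytes
    pos_ += len + 1;              // also skip the '\n' terminator
    assert(pos_ <= size_);
    return out;
  }

 private:
  const char* data_;
  std::size_t size_;
  std::size_t pos_ = 0;
};
```

Both methods return the same string and leave the read position just past the newline; only the number of bounds checks and copies differs.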
ghstack-source-id: 121278774
Test Plan:
Run WireSerializerBench before/after for small tensors:
```
/tmp/WireSerializerBench.Reader --real_data /mnt/homedir/hwwang/test_serialized_api_request --real_pytorch_api_request --bm_regex '[Ss]mall'
```
Before:
```
DeSerializeWire(Small) 7.65us 130.65K
DeSerializeWire(small_Zstd) 100.49% 7.62us 131.29K
DeSerializeWire(small_Snappy) 100.49% 7.62us 131.29K
DeSerializeWireIValue(Small) 82.89% 9.23us 108.30K
DeSerializeWireIValue(small_Zstd) 82.87% 9.24us 108.27K
DeSerializeWireIValue(small_Snappy) 82.33% 9.30us 107.57K
DeSerializeC2ToBlob(small_NoCompress) 1150.28% 665.39ns 1.50M
DeSerializeC2ToBlob(small_Zstd) 1149.70% 665.72ns 1.50M
DeSerializeC2ToBlob(small_Zstd_Fast) 1150.94% 665.00ns 1.50M
DeSerializeC2ToBlob(Small_Snappy) 1151.70% 664.57ns 1.50M
DeSerializeC2ToString(small) 9297.81% 82.32ns 12.15M
```
After:
```
DeSerializeWire(Small) 6.86us 145.84K
DeSerializeWire(small_Zstd) 100.52% 6.82us 146.60K
DeSerializeWire(small_Snappy) 100.13% 6.85us 146.03K
DeSerializeWireIValue(Small) 83.94% 8.17us 122.42K
DeSerializeWireIValue(small_Zstd) 84.00% 8.16us 122.50K
DeSerializeWireIValue(small_Snappy) 84.53% 8.11us 123.28K
DeSerializeC2ToBlob(small_NoCompress) 1019.48% 672.58ns 1.49M
DeSerializeC2ToBlob(small_Zstd) 1020.03% 672.23ns 1.49M
DeSerializeC2ToBlob(small_Zstd_Fast) 1020.59% 671.85ns 1.49M
DeSerializeC2ToBlob(Small_Snappy) 1020.30% 672.05ns 1.49M
DeSerializeC2ToString(small) 7709.63% 88.94ns 11.24M
```
Second run after the change, to demonstrate the improvement wasn't just variance:
```
DeSerializeWire(Small) 6.92us 144.57K
DeSerializeWire(small_Zstd) 99.24% 6.97us 143.47K
DeSerializeWire(small_Snappy) 99.58% 6.95us 143.97K
DeSerializeWireIValue(Small) 84.83% 8.15us 122.63K
DeSerializeWireIValue(small_Zstd) 84.72% 8.16us 122.49K
DeSerializeWireIValue(small_Snappy) 84.59% 8.18us 122.29K
DeSerializeC2ToBlob(small_NoCompress) 1031.03% 670.89ns 1.49M
DeSerializeC2ToBlob(small_Zstd) 1030.64% 671.14ns 1.49M
DeSerializeC2ToBlob(small_Zstd_Fast) 1013.39% 682.57ns 1.47M
DeSerializeC2ToBlob(Small_Snappy) 1013.95% 682.19ns 1.47M
DeSerializeC2ToString(small) 8155.98% 84.81ns 11.79M
```
By the way, this gets us closer to deserialization parity for the real data sample included in D26049387:
Baseline:
```
DeSerializeWire(RealData) 7.34ms 136.24
DeSerializeWire(RealData_Zstd) 99.95% 7.34ms 136.17
DeSerializeWire(RealData_Snappy) 100.09% 7.33ms 136.36
DeSerializeWireIValue(RealData) 82.69% 8.88ms 112.65
DeSerializeWireIValue(RealData_Zstd) 82.76% 8.87ms 112.75
DeSerializeWireIValue(RealData_Snappy) 82.68% 8.88ms 112.64
DeSerializeC2ToBlob(RealData_NoCompress) 116.87% 6.28ms 159.23
DeSerializeC2ToBlob(RealData_Zstd) 117.33% 6.26ms 159.85
DeSerializeC2ToBlob(RealData_Zstd_Fast) 117.38% 6.25ms 159.91
DeSerializeC2ToBlob(RealData_Snappy) 117.61% 6.24ms 160.23
DeSerializeC2ToString(RealData) 4571.81% 160.55us 6.23K
```
With this diff:
```
DeSerializeWire(RealData) 6.57ms 152.17
DeSerializeWire(RealData_Zstd) 100.17% 6.56ms 152.43
DeSerializeWire(RealData_Snappy) 100.09% 6.57ms 152.31
DeSerializeWireIValue(RealData) 83.06% 7.91ms 126.40
DeSerializeWireIValue(RealData_Zstd) 83.16% 7.90ms 126.54
DeSerializeWireIValue(RealData_Snappy) 83.22% 7.90ms 126.64
DeSerializeC2ToBlob(RealData_NoCompress) 104.02% 6.32ms 158.29
DeSerializeC2ToBlob(RealData_Zstd) 103.46% 6.35ms 157.43
DeSerializeC2ToBlob(RealData_Zstd_Fast) 104.64% 6.28ms 159.23
DeSerializeC2ToBlob(RealData_Snappy) 104.65% 6.28ms 159.25
DeSerializeC2ToString(RealData) 4051.03% 162.22us 6.16K
```
Reviewed By: qizzzh
Differential Revision: D26321083
fbshipit-source-id: 92d45e760580bb290078ddac84128174daef0e55