Improve float pickling speed. (#28553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28553
This change improves double pickling in a 1M-element double-list
microbenchmark by roughly 40% (33 msec -> 20 msec).
The main benefit is avoiding per-byte bounds checks: each double is
now bounds-checked 2 times (opcode byte, then the 8-byte payload)
rather than 9 times (once per byte).
Unpickling already does something reasonable, so it needs no change.
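As a rough sketch of the technique (not the committed Pickler code; the `out` buffer and `pushDouble` name here are hypothetical), the per-double write becomes one opcode push plus one 8-byte copy instead of nine single-byte pushes:

```cpp
// Hedged sketch: pickle's BINFLOAT encoding is one opcode byte followed by
// the double's 8 bytes in big-endian order. Instead of pushing those 9 bytes
// one at a time (a capacity/bounds check per byte), check once for the
// opcode and once for the whole 8-byte payload.
#include <cstdint>
#include <cstring>
#include <vector>

constexpr char kBinFloat = 'G'; // BINFLOAT opcode

void pushDouble(std::vector<char>& out, double value) {
  out.push_back(kBinFloat);                 // check #1 (opcode byte)

  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits)); // reinterpret the double's bytes
  bits = __builtin_bswap64(bits);           // to big-endian (little-endian host assumed)

  const size_t off = out.size();
  out.resize(off + sizeof(bits));           // check #2 (whole 8-byte payload)
  std::memcpy(out.data() + off, &bits, sizeof(bits));
}
```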
FWIW, putting the byte-swapping logic in a separate function/lambda
consistently gave roughly 20% better results in the microbenchmark.
Looking at the objdump disassembly, gcc somehow generates better code
when the swap is separated out.
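A sketch of that separation, under the same assumptions as above (illustrative names, not the committed code):

```cpp
// The byte-swap factored into its own small function (a lambda works
// equally well); the push path then just calls it and does one 8-byte copy.
#include <cstdint>
#include <cstring>

static inline uint64_t swapToBigEndian(double value) {
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits)); // bit pattern of the double
  return __builtin_bswap64(bits);           // gcc/clang builtin byte swap
}

// The push path from the sketch above would then reduce to:
//   const uint64_t big = swapToBigEndian(value);
//   std::memcpy(out.data() + off, &big, sizeof(big));
```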
ghstack-source-id: 92585739
Test Plan:
Benchmarks: buck build mode/opt experimental/jeremyl/c2:SerializationBench
buck-out/opt/gen/experimental/jeremyl/c2/SerializationBench --bm_regex=.*Float.*
Correctness: buck build mode/dev-nosan caffe2/test/...
Differential Revision: D18089481
fbshipit-source-id: a5f39e5d38c432893844241a7cce244831037e1f