String optimizations related to serialization. (#28230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28230
This change improves the small-data pickling benchmark by roughly 30%
(25.8usec -> 18.05usec).
One of the main issues was that we were spending 25%+ of the CPU profile
time in std::[o]stringstream constructors alone.
The change has two main parts:
- Change std::stringstream to std::ostringstream where it showed up
on hot-ish paths and the conversion was trivial.
Roughly 27% of the std::stringstream constructor time is spent
building the constituent std::basic_istream. If the istream isn't
needed, don't construct it.
- For a couple of very hot paths (e.g. Pickler::pushGlobal), just
convert to plain std::string::append(). std::ostringstream is
convenient, but not particularly efficient.
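A minimal sketch of the two patterns above, for illustration only. The
helper names (globalViaStream, globalViaAppend) and the "module\nname\n"
layout are assumptions made for this example; this is not the actual
Pickler code from the diff.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not from the PR): builds "module\nname\n" through a
// stream. std::ostringstream is output-only, so it skips the istream half
// that std::stringstream would construct, but each call still pays for
// constructing the stream object itself.
std::string globalViaStream(const std::string& module, const std::string& name) {
  std::ostringstream ss;
  ss << module << '\n' << name << '\n';
  return ss.str();
}

// Hypothetical helper (not from the PR): same result using plain
// std::string appends, with one up-front reservation so the buffer does
// not need to grow while appending.
std::string globalViaAppend(const std::string& module, const std::string& name) {
  std::string out;
  out.reserve(module.size() + name.size() + 2);  // two strings plus two '\n'
  out.append(module);
  out.push_back('\n');
  out.append(name);
  out.push_back('\n');
  return out;
}
```

Both helpers return the same string; the second avoids constructing any
stream object. Whether that matters depends on how hot the call site is,
so measure before converting.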
ghstack-source-id: 92153103
Test Plan:
Benchmarking: buck build mode/opt experimental/jeremyl/c2:SerializationBench
Correctness: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D17982181
fbshipit-source-id: 7fd4d267293231244c10c1e5b8f4951a7a3d852f