Reduce string formatting overhead in PyWarningHandler
Closes #76952
This runs `processErrorMsg` in place on the warning string, so that in
the fast path, where no type translation is needed, it doesn't have to
allocate a new string just to copy the contents over. I also replaced
`ostringstream` with `fmt::format_to`, which has noticeably better
performance.
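
As a rough illustration of the in-place idea (not the actual PyTorch
code), here is a minimal sketch using only the standard library. The
names `replace_all_inplace`, `process_message_inplace`, and the
two-entry translation table are hypothetical; the real `processErrorMsg`
has its own table and signature. The point is only that when nothing
matches, the string is scanned but never copied or reallocated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: replaces every occurrence of `from` with `to`
// directly inside `str`. Returns true if anything was replaced.
bool replace_all_inplace(std::string& str,
                         const std::string& from,
                         const std::string& to) {
  assert(!from.empty());
  bool changed = false;
  std::size_t pos = 0;
  while ((pos = str.find(from, pos)) != std::string::npos) {
    str.replace(pos, from.size(), to);
    pos += to.size();  // skip past the text we just inserted
    changed = true;
  }
  return changed;
}

// Hypothetical stand-in for an in-place message translation step.
// The table entries are illustrative assumptions, not the real table.
bool process_message_inplace(std::string& msg) {
  static const std::pair<const char*, const char*> kTable[] = {
      {"CPUFloatType", "torch.FloatTensor"},
      {"CPULongType", "torch.LongTensor"},
  };
  bool changed = false;
  for (const auto& entry : kTable) {
    // Fast path: if `find` never matches, `msg` is left untouched and
    // no new string is allocated for the result.
    changed |= replace_all_inplace(msg, entry.first, entry.second);
  }
  return changed;
}
```

A caller that owns the message can pass it by reference and keep using
the same object afterwards. The copying alternative would return a new
`std::string` even when no translation applies, which is the allocation
this change avoids.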
Overall, in a benchmark of `torch.floor_divide`, this drops the
callgrind instruction count from 703,168 to 571,774 (about 19%), and
the benchmark improves by roughly 300 ns, from 2.26 us to 1.94 us.
With this change, ~80% of the callgrind count for `~PyWarningHandler`
comes from `PyErr_WarnEx`, so this is probably about as fast as our
warning handling can reasonably get.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76977
Approved by: https://github.com/swolchok