Remove unnecessary atomic ops in DispatchStub (#40930)
Summary:
I noticed a very unusual use of atomics in `at::native::DispatchStub`. The comment asserts that `choose_cpu_impl()` returns the same value on every thread, yet the code uses a CAS loop to install that value instead of a simple store. The loop is unnecessary: the result of the exchange is never read, and by the comment's own claim every thread would write the same pointer anyway.
This replaces the CAS loop with a simple store and also reduces the already-initialized path to a single atomic load instead of two.
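
For illustration, here is a minimal sketch of the resulting pattern. It is not the actual `DispatchStub` code: `Stub`, `FnPtr`, `default_kernel`, and `choose_impl` are hypothetical stand-ins, and the relaxed memory ordering is an assumption of this sketch that relies on the chosen pointer being identical on every thread.

```cpp
#include <atomic>

// Hypothetical kernel signature, used only for this sketch.
using FnPtr = int (*)(int);

static int default_kernel(int x) { return x + 1; }

// Hypothetical stand-in for choose_cpu_impl(): it must return the same
// pointer no matter which thread calls it.
static FnPtr choose_impl() { return &default_kernel; }

struct Stub {
  std::atomic<FnPtr> impl{nullptr};

  int call(int x) {
    // Already-initialized path: exactly one atomic load.
    FnPtr fn = impl.load(std::memory_order_relaxed);
    if (!fn) {
      // Initializing path: racing threads all compute the same pointer,
      // so a plain store is enough and no compare-exchange is needed.
      fn = choose_impl();
      impl.store(fn, std::memory_order_relaxed);
    }
    return fn(x);
  }
};
```

If two threads race on the first call, both may run `choose_impl()` and both may store, but they store the same value, so the outcome does not depend on which store lands last.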
For reference, the `compare_exchange` was added in https://github.com/pytorch/pytorch/issues/32148 and the while loop was added in https://github.com/pytorch/pytorch/issues/35794.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40930
Differential Revision: D22438224
Pulled By: ezyang
fbshipit-source-id: d56028ce18c8c5dbabdf366379a0b6aaa41aa391