Directly initialize a zeroed intermediate buffer to reduce overhead in the batch_norm CPU path (#82558)
In the batch_norm CPU path, the intermediate buffer is first allocated as an empty buffer and then filled with zeros in a separate step. This costs two dispatches, **empty** and **zero_**. Allocating the buffer as zeros directly reduces the dispatch overhead.
See the following profiler output for batch_norm backward:
![image](https://user-images.githubusercontent.com/16217777/182063350-a6680a06-6901-4e12-8207-93517c3c4529.png)
The **at::native_batch_norm_backward** time is 0.173 ms, and **aten::zero_** consumes 0.019 ms, roughly 11% of that figure.
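As a rough Python-level analogue of the two allocation patterns (the actual change is in the C++ CPU kernel, which is not reproduced here), the sketch below contrasts allocate-then-zero with allocating zeros directly. The function names `zero_buffer_two_step` and `zero_buffer_direct` are hypothetical and exist only for illustration; the sketch shows that both patterns yield the same buffer contents and makes no claim about timing.

```python
import torch


def zero_buffer_two_step(n: int) -> torch.Tensor:
    """Old pattern: allocate uninitialized memory, then zero it in place."""
    buf = torch.empty(n)  # first call: allocation only, contents undefined
    buf.zero_()           # second call: in-place fill with zeros
    return buf


def zero_buffer_direct(n: int) -> torch.Tensor:
    """New pattern: request an already-zeroed buffer in a single call."""
    return torch.zeros(n)


if __name__ == "__main__":
    a = zero_buffer_two_step(8)
    b = zero_buffer_direct(8)
    # Both patterns produce identical contents; only the call sequence differs.
    print(torch.equal(a, b))
```

Running either function under `torch.profiler` on a given build is one way to check which operators are actually recorded for each pattern.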
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82558
Approved by: https://github.com/mingfeima, https://github.com/albanD