Get rid of some template arguments in GPU loop (#33308)
Summary:
Globally define
```C++
constexpr int num_threads = C10_WARP_SIZE * 2;
constexpr int thread_work_size = 4;
constexpr int block_work_size = thread_work_size * num_threads;
```
and remove all the template arguments that pass these values around.
These values are effectively global, but until this change they were passed around as template arguments, which made the code needlessly inconvenient to write.
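A minimal sketch of the kind of simplification this allows. It is not the actual code touched by this PR: the `grid_size_templated` and `grid_size` helpers are hypothetical, and `C10_WARP_SIZE` is stubbed with an assumed value of 32 so the snippet compiles on its own.
```C++
// Stand-in for the real C10_WARP_SIZE macro from c10; 32 is an assumption
// made only so this sketch is self-contained.
#ifndef C10_WARP_SIZE
#define C10_WARP_SIZE 32
#endif

// The global definitions from this change.
constexpr int num_threads = C10_WARP_SIZE * 2;
constexpr int thread_work_size = 4;
constexpr int block_work_size = thread_work_size * num_threads;

// Before (hypothetical helper): every caller has to forward the launch
// parameters as template arguments.
template <int nt, int vt>
constexpr int grid_size_templated(int n) {
  return (n + nt * vt - 1) / (nt * vt);  // ceil(n / (nt * vt))
}

// After (hypothetical helper): the helper reads the globals directly,
// so the template parameters disappear from the signature.
constexpr int grid_size(int n) {
  return (n + block_work_size - 1) / block_work_size;  // ceil(n / block_work_size)
}
```
Under these assumptions, a call site goes from `grid_size_templated<num_threads, thread_work_size>(n)` to `grid_size(n)`, and both return the same value.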
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33308
Differential Revision: D19907250
Pulled By: ngimel
fbshipit-source-id: 4623b69baea7e6e77f460ffdfa07cf9f8cba588a