Speed up CUDA kernel launch when block/thread extents are statically known (#42899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42899
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23078708
Pulled By: bertmaher
fbshipit-source-id: 237404b47a31672d7145d70996868a3b9b97924e