Use acc16 only when n > 128 and k > 128 on Skylake (#18672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18672
On Skylake, acc16 (16-bit accumulation) is slower when n < 128 or k < 128, so restrict it to cases where both n and k exceed 128.
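
The size check described above could look roughly like the sketch below. This is an illustration only, not the code from this change: the names `kAcc16MinDim` and `ShouldUseAcc16OnSkylake` are hypothetical, and the strict `>` comparison follows the commit title.

```cpp
#include <cstdint>

// Hypothetical threshold, taken from the 128 stated in the commit title.
constexpr std::int64_t kAcc16MinDim = 128;

// Hypothetical helper: returns true only when both n and k exceed the
// threshold, i.e. the case in which acc16 is kept enabled on Skylake.
// Detecting that the CPU is Skylake is assumed to happen elsewhere.
constexpr bool ShouldUseAcc16OnSkylake(std::int64_t n, std::int64_t k) {
  return n > kAcc16MinDim && k > kAcc16MinDim;
}
```

A caller would fall back to the non-acc16 path whenever this returns false.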
Reviewed By: jianyuh
Differential Revision: D14700576
fbshipit-source-id: 80ca9f1af4626637eed9c5ca49f95ae744811189