qnnpack hardswish - LUTs (#36252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36252
Adds a baseline hardswish kernel using LUTs in QNNPACK.
Throughput is 1.9 GB/s on a Nexus 6 and 2.2 GB/s on a Pixel 3, on par with other LUT-based ops.
The output scale and zero point (zp) are enforced to be equal to the input's, to match the server implementation.
Rewriting this as NEON kernels could give a further speedup; that is
deferred until later, if we need it.
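For readers unfamiliar with the LUT approach, here is a minimal Python sketch of the idea, not the QNNPACK C code itself. It assumes uint8 quantization, the standard definition hardswish(x) = x * relu6(x + 3) / 6, and an output that reuses the input scale and zero point as described above. The names `build_hardswish_lut` and `apply_lut` are hypothetical and chosen for illustration only.

```python
def build_hardswish_lut(scale: float, zero_point: int) -> bytes:
    """Precompute the quantized hardswish output for every uint8 input.

    Assumption: the output uses the same scale and zero point as the input.
    """
    table = []
    for q in range(256):
        x = scale * (q - zero_point)                    # dequantize
        y = x * min(max(x + 3.0, 0.0), 6.0) / 6.0       # x * relu6(x + 3) / 6
        q_out = round(y / scale) + zero_point           # requantize
        table.append(min(max(q_out, 0), 255))           # clamp to uint8 range
    return bytes(table)


def apply_lut(lut: bytes, data: bytes) -> bytes:
    """Map each input byte through the 256-entry table."""
    return data.translate(lut)


# Illustrative parameters only; real values come from the quantized tensor.
lut = build_hardswish_lut(scale=0.05, zero_point=128)
out = apply_lut(lut, bytes([0, 128, 255]))
```

Because the table has only 256 entries, the per-element work at run time is a single byte lookup, which is why the throughput matches other LUT-based ops.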
Test Plan:
```
with-proxy ./scripts/build-local.sh
./build/local/hardswish-test
with-proxy scripts/build-android-armv7.sh
adb push ./build/android/armeabi-v7a/hardswish-* /data/qnnpack
adb shell
/data/qnnpack/hardswish-test
/data/qnnpack/hardswish-bench
with-proxy scripts/build-android-arm64.sh
adb push ./build/android/arm64-v8a/hardswish-* /data/qnnpack
adb shell
/data/qnnpack/hardswish-test
/data/qnnpack/hardswish-bench
```
Imported from OSS
Differential Revision: D20965044
fbshipit-source-id: 982938361971513cb15873438e12c23a38e819e3