Support aarch32 neon backend for Vec256 (#41267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41267
Due to an LLVM bug and some unsupported intrinsics, we could not use
intrinsics directly to implement the aarch32 NEON backend for Vec256.
Instead, we resort to inline assembly.
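As an illustration only, here is a minimal sketch of the general
technique: replacing a NEON intrinsic with inline assembly on aarch32.
The helper name add4_f32 is hypothetical and is not taken from this PR,
and the scalar fallback exists only so the sketch builds on other
targets.

```cpp
// Hypothetical helper (not from the PR): out[i] = a[i] + b[i] for 4 floats.
void add4_f32(const float* a, const float* b, float* out) {
#if defined(__arm__) && defined(__ARM_NEON)
  // aarch32 path: load, add, and store with inline assembly rather than
  // the vld1q_f32 / vaddq_f32 / vst1q_f32 intrinsics.
  __asm__ volatile(
      "vld1.32 {d0, d1}, [%[a]]\n\t"    // q0 = a[0..3]
      "vld1.32 {d2, d3}, [%[b]]\n\t"    // q1 = b[0..3]
      "vadd.f32 q0, q0, q1\n\t"         // q0 = q0 + q1
      "vst1.32 {d0, d1}, [%[out]]\n\t"  // out[0..3] = q0
      :
      : [a] "r"(a), [b] "r"(b), [out] "r"(out)
      : "memory", "q0", "q1");
#else
  // Portable fallback so the sketch compiles on non-aarch32 targets.
  for (int i = 0; i < 4; ++i) {
    out[i] = a[i] + b[i];
  }
#endif
}
```

The registers used are listed as clobbers so the compiler does not keep
live values in them across the asm block.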
Test Plan:
Ran vec256_test on an Android phone.
Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D22482196
fbshipit-source-id: 1c22cf67ec352942c465552031e9329550b27b3e