Mobile CPU allocator. (#36032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36032
QNNPACK and XNNPACK may access the input and/or output tensors out of bounds.
This is by design, chosen to make the implementation of micro-kernels both
simpler and faster, as they need not individually handle the corner cases
where the number of processed elements is not a multiple of the SIMD register
width. This behavior will trigger ASAN, though, and may result in a segfault
if the accessed memory location happens to fall on a page the current process
has no read access to.

Here we define a custom allocator that allocates the extra storage required
to make this behavior safe. This allocator could have been restricted to
QNNPACK and XNNPACK only, but that would have negative performance
ramifications: any input tensor not allocated with this allocator to begin
with would have to be reallocated and copied over. Making this allocator the
default on mobile builds minimizes the probability of unnecessary
reallocations and copies, and also enables acceleration of operations where
the output tensor is allocated outside of the function doing the
implementation, in which case the implementation cannot simply re-allocate
the output with the guarding allocator.
Test Plan: Imported from OSS
Differential Revision: D20970217
Pulled By: AshkanAliabadi
fbshipit-source-id: 65cca2d38d7c0cef63c732f393016f50f1fa5199