[ESIMD] Fix perf regression caused by assumed align in block_load(usm) (#11850)
The element-size address alignment is valid from a correctness point of
view, but implicitly using 1-byte and 2-byte alignment causes a
performance regression for block_load(const int8_t *, ...) and
block_load(const int16_t *, ...) because the GPU back-end has to generate
a slower GATHER instead of the more efficient BLOCK-LOAD. Without this
fix, block_load() causes up to a 44% slow-down on some apps that relied
on the alignment assumptions in effect before block_load(usm, ...,
compile_time_props) was implemented.
The reasoning for raising the expected/assumed alignment from
element-size to 4 bytes for byte- and word-vectors is as follows:
the point of a block_load() call (as opposed to a gather() call) is to
get an efficient block load, so the assumed alignment is one that allows
the back-end to generate a block load. This is a bit trickier for the
user, but that is how the block_load/store API has always worked:
block loads had alignment restrictions that needed to be honored.
To be on the safe side, the user can always pass the guaranteed
alignment explicitly.
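As a hedged illustration (not part of this patch), passing the guaranteed
alignment explicitly could look like the sketch below. It follows the
sycl_ext_intel_esimd extension's compile-time properties API
(properties{alignment<N>}); the function name and vector size are made up
for illustration, and the exact spelling may vary by compiler version:

```cpp
#include <sycl/ext/intel/esimd.hpp>
using namespace sycl::ext::intel::esimd;

// Hypothetical device function: load 32 bytes from USM memory.
SYCL_EXTERNAL void load_bytes(const int8_t *ptr) SYCL_ESIMD_FUNCTION {
  // Promise the compiler that ptr is 16-byte aligned, so the GPU
  // back-end can emit an efficient BLOCK-LOAD instead of falling
  // back to a slower GATHER.
  simd<int8_t, 32> v =
      block_load<int8_t, 32>(ptr, properties{alignment<16>});
  // ... use v ...
}
```

If the pointer's actual alignment is lower than what is stated, the
result is undefined, so the stated alignment must be a real guarantee.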
---------
Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>