[msan] Fix bfmmla instrumentation incompatibility issue (#188834)
#176264 instrumented bfmmla by applying ummla to the shadows. However,
some targets (for example, Armv8.2+bf16) support bfmmla but not ummla,
so that instrumentation is not always compatible with the backend.
This patch changes the bfmmla instrumentation to use bfmmla and basic
LLVM intrinsics, thus guaranteeing backend compatibility. The key
insights are that we can 1) use CreateSelect to convert the integer
shadows to bf16, 2) apply bfmmla to these "shadows", and 3) use FCmpULT
to check that the matrix entries denote fully initialized outputs.
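
As a rough illustration of the three steps, here is a scalar C++ model
(not the IRBuilder code in this patch). It rests on assumptions that are
not stated above: the select encodes a fully initialized element as 1.0
and anything else as 0.0, the threshold compared against is 4.0 (the
number of products per output entry), and the accumulator shadow is
carried through unchanged for clean entries. The names `bfmmlaModel`,
`encodeShadow`, and `propagateShadow` are hypothetical, and plain
`float` stands in for bf16.

```cpp
#include <array>
#include <cstdint>

// Hypothetical scalar model; float stands in for bf16 (0..4 are exact in both).
using Mat2x4 = std::array<std::array<float, 4>, 2>;
using Mat2x2 = std::array<std::array<float, 2>, 2>;
using Shadow2x4 = std::array<std::array<std::uint16_t, 4>, 2>;
using Shadow2x2 = std::array<std::array<std::uint32_t, 2>, 2>;

// bfmmla shape: Out (2x2) = Acc (2x2) + A (2x4) * transpose(B), with B stored 2x4.
Mat2x2 bfmmlaModel(const Mat2x2 &Acc, const Mat2x4 &A, const Mat2x4 &B) {
  Mat2x2 Out = Acc;
  for (int I = 0; I < 2; ++I)
    for (int J = 0; J < 2; ++J)
      for (int K = 0; K < 4; ++K)
        Out[I][J] += A[I][K] * B[J][K];
  return Out;
}

// Step 1 (the "select"): assumed encoding, 1.0 if the element's shadow is
// all zero (fully initialized), 0.0 if any shadow bit is set.
Mat2x4 encodeShadow(const Shadow2x4 &S) {
  Mat2x4 E{};
  for (int I = 0; I < 2; ++I)
    for (int K = 0; K < 4; ++K)
      E[I][K] = (S[I][K] == 0) ? 1.0f : 0.0f;
  return E;
}

// Steps 2 and 3: run the matrix multiply on the encoded shadows, then treat
// an entry as poisoned unless all four products contributed 1.0.
Shadow2x2 propagateShadow(const Shadow2x2 &AccS, const Shadow2x4 &AS,
                          const Shadow2x4 &BS) {
  Mat2x2 Zero{}; // value-initialized to all zeros
  Mat2x2 Count = bfmmlaModel(Zero, encodeShadow(AS), encodeShadow(BS));
  Shadow2x2 Out{};
  for (int I = 0; I < 2; ++I)
    for (int J = 0; J < 2; ++J) {
      // "Unordered or less than", mirroring FCmpULT: true for NaN or < 4.0.
      bool Poisoned = !(Count[I][J] >= 4.0f);
      // Assumption: a clean entry keeps the accumulator's own shadow.
      Out[I][J] = Poisoned ? 0xFFFFFFFFu : AccS[I][J];
    }
  return Out;
}
```

In this model, a single poisoned element in row i of the first operand
poisons both outputs in row i, and a poisoned element in row j of the
second operand poisons both outputs in column j; only the multiply
itself and basic select/compare operations are needed.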
This patch significantly refactors `handleNEONMatrixMultiply`, which is
also used for {s,u,su}mmla instrumentation, but the output is unaffected
for {s,u,su}mmla.