Add BFloat16 runtime intrinsics. (#51790)
After switching to LLVM for BFloat16 in #51470 (i.e., relying on
`Intrinsics.sub_float` etc. instead of hand-rolling bit-twiddling
implementations), we also need to provide fallback runtime
implementations for these intrinsics. This is too bad; I had hoped to
put as many BFloat16-related things as possible in BFloat16s.jl.
This required modifying the unary operator preprocessor macros in order
to differentiate between Float16 and BFloat16; I didn't generalize that to
all intrinsics as the code is hairy enough already (and it's currently
only useful for fptrunc/fpext).
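
For readers unfamiliar with what a software fallback for these intrinsics
involves, here is a rough sketch in C. It is not the code added by this
change: the names `bf16_bits`, `bf16_to_float`, `float_to_bf16`, `BF16_BINOP`
and `bf16_sub` are hypothetical, and it assumes the usual approach of widening
to a 32-bit IEEE 754 `float`, operating there, and rounding back to nearest
even.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type: a BFloat16 is the top 16 bits of a float. */
typedef uint16_t bf16_bits;

/* Widening (fpext-like): shift into the high half of a 32-bit float. */
static float bf16_to_float(bf16_bits h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrowing (fptrunc-like): round to nearest, ties to even. */
static bf16_bits float_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        /* NaN: keep the sign, force a quiet-NaN payload bit. */
        return (bf16_bits)((bits >> 16) | 0x0040u);
    }
    /* Add 0x7fff plus the lowest kept bit so exact ties round to even. */
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return (bf16_bits)(bits >> 16);
}

/* Hypothetical macro: define a binary operation by widening both
   operands, applying `op` in float, and narrowing the result. */
#define BF16_BINOP(name, op)                                        \
    static bf16_bits name(bf16_bits a, bf16_bits b) {               \
        return float_to_bf16(bf16_to_float(a) op bf16_to_float(b)); \
    }

BF16_BINOP(bf16_sub, -)
```

Float16 and BFloat16 are both 16 bits wide, so a macro keyed only on operand
size cannot tell them apart. Presumably that is why the conversion macros have
to distinguish the two types explicitly.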