[AMDGPU] In promote-alloca, if index is dynamic, sandwich load with bitcasts to reduce excessive codegen (#171253)
Investigation revealed that scalarized copy results in a long chain of
extract/insert elements which can explode in generated temps in the
AMDGPU backend as there is no efficient representation for extracting
subvector with dynamic index. Using identity bitcasts can reduce the
number of extract/insert elements down to 1 and produce much smaller,
efficient generated code.
Credit: ruiling