clang/AMDGPU: Fix workgroup size builtins for nonuniform work group sizes (#185098)
These were assuming uniform work group sizes. Emit the v4 and v5
sequences to take the remainder group for the nonuniform case.
Currently the device libs uses this builtin on the legacy ABI path with
the same sequence to calculate the remainder, and fully implements the
v5 path. If you perform a franken-build of the library with the updated
builtin, the result is worse. The duplicate sequence does not fully fold out.
However, it does not appear to be wrong. The relevant conformance tests still
pass.