[libclc] Refine generic __clc_get_sub_group_size with fast full sub-group path (#188895)
Add a fast path for the common case where the total work-group size is
a multiple of the max sub-group size.
The fallback path is ported from amdgpu/workitem/clc_get_sub_group_size.cl.
The compiler can generate predicated instructions for the fallback path
to avoid branches.
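
For illustration only, the two paths can be modelled on the host as
below. This is a sketch, not the code in this patch: the function name
and its parameters are hypothetical stand-ins for the work-item queries
the real implementation would use, and it assumes work-items are packed
into sub-groups in local linear id order, so only the last sub-group
can be partial.

```c
#include <stdint.h>

/* Hypothetical host-side model of the sub-group size computation.
 * wg_size:         total number of work-items in the work-group
 * max_sg_size:     maximum sub-group size (assumed non-zero)
 * local_linear_id: linear id of the calling work-item, < wg_size
 */
static uint32_t sub_group_size_model(uint32_t wg_size, uint32_t max_sg_size,
                                     uint32_t local_linear_id) {
  /* Fast path: the work-group divides evenly, so every sub-group is full. */
  if (wg_size % max_sg_size == 0)
    return max_sg_size;

  /* Fallback: find the first work-item of this sub-group and count how
   * many work-items remain from there, clamped to the maximum size.
   * The clamp is a simple select, which is easy to predicate. */
  uint32_t sg_base = (local_linear_id / max_sg_size) * max_sg_size;
  uint32_t remaining = wg_size - sg_base;
  return remaining < max_sg_size ? remaining : max_sg_size;
}
```

With a work-group of 100 work-items and a max sub-group size of 32, the
model returns 32 for ids 0..95 and 4 for ids 96..99.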