[AMDGPU] Insert waitcnt for non-global fence release in GFX12 (#159282)

A fence release may be followed by a barrier, so the fence must wait for
the relevant memory accesses to complete, even if MMRAs limit it to LDS.
Previously, this wait was skipped for non-global fence releases.

Fixes SWDEV-554932.