[AMDGPU] Global and Buffer loads to LDS should not increase `lgkmcnt` (#179305)
`global_load_lds` and `buffer_load` to LDS only increment `vmcnt` and do
not touch `lgkmcnt`. Modeling them as also incrementing `lgkmcnt` causes
invalid `waitcnts` for some Triton kernels, similar to the cases covered
by the added lit tests.
Note that the change for buffer ops is not strictly necessary: the lit
test passes even before this PR, apparently because `SIInsertWaitcnts`
does not use `LGKM_CNT` for buffer ops. The change might still prevent a
bug in the future.
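
For illustration only, the intended counter behavior can be sketched as a
small standalone C++ model. This is not the `SIInsertWaitcnts` code: the
`Counter` bit values, the `MemOp` classes, and `countersIncrementedBy` are
hypothetical names invented for this sketch. It only encodes the rule stated
above, that loads to LDS bump `vmcnt` and leave `lgkmcnt` alone.

```cpp
#include <cstdint>

// Hypothetical counter bits for this sketch; not the LLVM enum values.
enum Counter : std::uint8_t {
  VM_CNT = 1u << 0,   // vector memory counter (vmcnt)
  LGKM_CNT = 1u << 1, // LDS/GDS/constant/message counter (lgkmcnt)
};

// Hypothetical instruction classes, for illustration only.
enum class MemOp {
  GlobalLoad,    // global load into VGPRs
  GlobalLoadLDS, // global_load_lds: global load written to LDS
  BufferLoadLDS, // buffer_load to LDS
  DSRead,        // plain LDS read
};

// Returns the set of counters the given operation increments.
// Loads to LDS are issued as vector memory operations, so they are
// tracked by vmcnt only, even though their destination is LDS.
std::uint8_t countersIncrementedBy(MemOp Op) {
  switch (Op) {
  case MemOp::GlobalLoad:
  case MemOp::GlobalLoadLDS:
  case MemOp::BufferLoadLDS:
    return VM_CNT;
  case MemOp::DSRead:
    return LGKM_CNT;
  }
  return 0; // unreachable for the values above
}
```

Under this model, a consumer of data written by `global_load_lds` would need
to wait on `vmcnt`, and a wait on `lgkmcnt` alone would not be sufficient.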