[mlir][sparse][gpu] fix sparse GPU codegen out buffer (#189221)
When lowering sparse tensor operations to GPU code using
`-sparse-gpu-codegen`, the generated `gpu.memcpy` op for device-to-host
copy was targeting the wrong buffer. In my case, it did not copy back
the output buffer and instead only copied back the input positions
buffer which results in the output buffer in host memory being empty.
The `SparseGPUCodegen` pass carries an assumption that the first buffer
is the out buffer. It looks like this assumption is not always true, as
in my case its the input positions buffer which made it the only buffer
getting copied back to host.
This change introduces a fix by removing the assumption and replacing it
with an analysis that checks for `memref::StoreOp` and write
MemoryEffects. This change also adds a regression test which highlights
the problematic edge case.
Assisted by Gemini 3.1 Pro for finding the issue of using incorrect
buffers in `gpu.memcpy` op in the lowered code.