[NVPTX] Improve device function byval parameter lowering (#129188)
PTX supports 2 methods of accessing device function parameters:
- "simple" case: If a parameters is only loaded, and all loads can
address the parameter via a constant offset, then the parameter may be
loaded via the ".param" address space. This case is not possible if the
parameters is stored to or has it's address taken. This method is
preferable when possible.
- "move param" case: For more complex cases the address of the param may
be placed in a register via a "mov" instruction. This mov also
implicitly moves the param to the ".local" address space and allows for
it to be written to. This essentially defers the responsibilty of the
byval copy to the PTX calling convention.
The handling of these cases in the NVPTX backend for byval pointers has
some major issues. We currently attempt to determine if a copy is
necessary in NVPTXLowerArgs and either explicitly make an additional
copy in the IR, or insert "addrspacecast" to move the param to the param
address space. Unfortunately the criteria for determining which case is
possible are not correct, leading to miscompilations
(https://godbolt.org/z/Gq1fP7a3G). Further, the criteria for the
"simple" case aren't enforceable in LLVM IR across other transformations
and instruction selection, making deciding between the 2 cases in
NVPTXLowerArgs brittle and buggy.
This patch aims to fix these issues and improve address space related
optimization. In NVPTXLowerArgs, we conservatively assume that all
parameters will use the "move param" case and the local address space.
Responsibility for switching to the "simple" case is given to a new
MachineIR pass, NVPTXForwardParams, which runs once it has become clear
whether or not this is possible. This ensures that the correct address
space is known for the "move param" case allowing for optimization,
while still using the "simple" case where ever possible.