[offload] Fix kernel record/replay and add extensible mechanism (#190588)
This commit fixes the kernel record replay on both AMD and CUDA devices. It
also re-organizes the record replay code, moves the whole code to separate
files, and makes it extensible to support other record formats (potentially in
the future). The environment variables for controlling the recording have also
been modified.