fix(zero): detach flat buffer to prevent autograd inplace error on CPU accelerator

The on-device flatten path (introduced in #7828) passes nn.Parameter objects
with requires_grad=True to torch.cat(), creating a flat buffer with a
CatBackward0 grad_fn. Later, _unflatten_dense_tensors produces SplitBackward0
views that are assigned to the model params. An inplace copy_() on these views
during the optimizer step raises:

    RuntimeError: Output 0 of SplitBackward0 is a view and is being modified inplace.

This especially affects CPU training where CPU_Accelerator.is_available()
returns True and available_memory() returns system RAM, so the on-device path
is always taken.

Fix: add .detach() to the flattened buffer, matching the implicit detach
behavior of the CPU-offload path (param.data.cpu() + .to(device)).

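A minimal repro sketch of the failure and the fix (illustrative only; it calls
torch.cat() and Tensor.split() directly rather than DeepSpeed's actual flatten
helpers):

```python
import torch

# Two small "params" standing in for model parameters (requires_grad=True).
params = [torch.nn.Parameter(torch.randn(3)) for _ in range(2)]

# Buggy path: cat over requires_grad params records CatBackward0 on the
# flat buffer, so splitting it yields SplitBackward0 multi-output views.
flat_bad = torch.cat(params)            # grad_fn=CatBackward0
views_bad = flat_bad.split(3)           # each view: grad_fn=SplitBackward0
err = None
try:
    views_bad[0].copy_(torch.zeros(3))  # inplace write into a multi-output view
except RuntimeError as e:
    err = str(e)                        # "Output 0 of SplitBackward0 is a view..."

# Fixed path: .detach() severs the autograd graph, so the split produces
# plain views and the inplace copy_() is allowed.
flat_ok = torch.cat(params).detach()
views_ok = flat_ok.split(3)
views_ok[0].copy_(torch.zeros(3))       # succeeds
```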
Also rename flatten_on_gpu -> flatten_on_accelerator and replace GPU-specific
terminology in comments/logs with accelerator-generic equivalents.

Signed-off-by: Guokai Ma <guokai.ma@intel.com>