LU solve/unpack fix to prevent bad memory usage on CPU (#85922)
Fixes https://github.com/pytorch/pytorch/issues/77898
Fixes https://github.com/pytorch/pytorch/issues/85026
There is a minor perf impact but:
- For lu_solve, the actual compute is going to be more expensive than this O(n) check (ones pass over the other matrices is O(n^2) in any case)
- For lu_unpack, the check inside the kernel should be almost free.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85922
Approved by: https://github.com/ngimel, https://github.com/nikitaved