Reduce peak GPU memory usage and support MoE estimation (#981)
- Reduce peak memory usage by calling clear_memory, taking the performance cost into account.
- Move best_params to CPU, and make sure memory is cleared before moving them back.
- Move the loss device to the second card if card_0_in_high_risk.
- Support DeepSeek R1 W4A16 tuning with 3 CUDA cards (80GB) (--enable_torch_compile).
- Support Llama 3.3 70B W4A16 tuning with 2 Intel GPU cards (24GB) (--enable_torch_compile).
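
The loss-device rule above can be sketched roughly as follows. This is an illustration only, not the code from this change: the function name `pick_loss_device`, the `used_fraction` input, and the 0.9 threshold are all hypothetical, and the commit does not say how card_0_in_high_risk is actually computed.

```python
from typing import Sequence


def pick_loss_device(
    used_fraction: Sequence[float],
    high_risk_threshold: float = 0.9,  # assumed value, not taken from the commit
    prefix: str = "cuda",
) -> str:
    """Pick the device on which to compute the loss.

    used_fraction[i] is the fraction of memory already in use on card i
    (a hypothetical input chosen for this sketch).
    """
    if not used_fraction:
        # No accelerator visible: fall back to CPU.
        return "cpu"

    # Assumption: card 0 counts as "high risk" once its usage reaches the threshold.
    card_0_in_high_risk = used_fraction[0] >= high_risk_threshold

    # Only move the loss off card 0 when a second card actually exists.
    if card_0_in_high_risk and len(used_fraction) > 1:
        return f"{prefix}:1"
    return f"{prefix}:0"
```

For example, `pick_loss_device([0.95, 0.40])` selects the second card, while `pick_loss_device([0.50, 0.40])` keeps the loss on card 0. Passing `prefix="xpu"` gives the same behavior for Intel GPU device strings.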