Using explicit GPU upcast for ZeRO-Offload (#6962)
Following the discussion in
[PR-6670](https://github.com/microsoft/DeepSpeed/pull/6670), an explicit
upcast is much more efficient than an implicit upcast. This PR
replaces the implicit upcast with an explicit one.
The results on a 3B model are shown below:
| Option | BWD (ms) | Speedup |
|------------|-----|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8x |
| After this PR | 309.2 | 82.8x |
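
The difference between the two upcast styles can be sketched as follows. This is a minimal illustration, not the code changed by this PR: it assumes that "implicit upcast" means letting `copy_` convert dtypes during a low-precision-to-fp32 copy, and that "explicit upcast" means converting to fp32 on the source device before a same-dtype copy. The function names `implicit_upcast_copy` and `explicit_upcast_copy` are hypothetical, and the snippet falls back to CPU when no GPU is available.

```python
import torch


def implicit_upcast_copy(src: torch.Tensor, dst: torch.Tensor) -> None:
    # Implicit upcast (assumed meaning): copy_ handles the
    # bf16 -> fp32 conversion as part of the copy itself.
    dst.copy_(src)


def explicit_upcast_copy(src: torch.Tensor, dst: torch.Tensor) -> None:
    # Explicit upcast (assumed meaning): convert to fp32 on the
    # source device first, then copy fp32 -> fp32 into the destination.
    dst.copy_(src.float())


# Hypothetical stand-in for a low-precision gradient living on the accelerator.
device = "cuda" if torch.cuda.is_available() else "cpu"
grad_bf16 = torch.randn(1024, device=device).to(torch.bfloat16)

# Hypothetical stand-ins for fp32 gradient buffers offloaded to the CPU.
cpu_buf_implicit = torch.empty(1024, dtype=torch.float32)
cpu_buf_explicit = torch.empty(1024, dtype=torch.float32)

implicit_upcast_copy(grad_bf16, cpu_buf_implicit)
explicit_upcast_copy(grad_bf16, cpu_buf_explicit)

# Both paths should hold the same fp32 values; only where the
# conversion happens differs.
print(torch.equal(cpu_buf_implicit, cpu_buf_explicit))
```

Both paths are intended to yield the same fp32 values, so any timing difference would come from where the conversion runs rather than from the result. The timings in the table above come from the author's 3B-model run, not from this sketch.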