Fix issue with BF16 optimizer selection (#7788)
**Note:** Updated based on change
64b10739a66704afc4112d10ab2d70f2b3a2266c for #7790. With that fix,
`BF16_Optimizer` now requires ZeRO stage 1 to be explicitly enabled.
The test `test_bf16_optimizer_fragments` fails with an `AssertionError`
because the `BF16_Optimizer` is not being instantiated when expected.
The test checks for `_hp_mapping` attribute on parameters, which is only
set by `BF16_Optimizer`.
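The check can be illustrated with a small sketch (hypothetical names; `FakeParam` and `attach_hp_mapping` are stand-ins for a torch parameter and for the attribute setup `BF16_Optimizer` performs, not DeepSpeed code):

```python
# Sketch of what the test asserts: BF16_Optimizer attaches a
# _hp_mapping attribute to each parameter; other optimizers do not,
# so the hasattr check fails when FP16_Optimizer is selected instead.
class FakeParam:
    """Stand-in for a torch parameter."""
    pass

def attach_hp_mapping(params):
    # Stand-in for what BF16_Optimizer does during initialization.
    for p in params:
        p._hp_mapping = object()

params = [FakeParam(), FakeParam()]
attach_hp_mapping(params)
print(all(hasattr(p, "_hp_mapping") for p in params))  # True
```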
Specifically, the failure has three causes:
1. The test config (`bf16=True` without grad_accum_dtype) **correctly**
uses `FP16_Optimizer`, but the test expects `BF16_Optimizer` (which sets
`_hp_mapping`)
2. `BFLOAT16` and `DDP_BFLOAT16` have the same value `"bf16"`,
preventing proper optimizer selection
3. `BF16_Optimizer` is missing attributes required by the base class API
This PR addresses these issues.
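Cause 2 follows from how Python's `Enum` treats duplicate values: a second member with the same value becomes a silent alias of the first, so the two modes cannot be distinguished. A minimal sketch (the class name `AmpType` is illustrative, not the actual DeepSpeed enum):

```python
from enum import Enum

class AmpType(Enum):
    BFLOAT16 = "bf16"
    DDP_BFLOAT16 = "bf16"  # silently becomes an alias of BFLOAT16

# The two names refer to the same member, so branching on the
# enum cannot tell the configurations apart.
print(AmpType.DDP_BFLOAT16 is AmpType.BFLOAT16)  # True
print(AmpType("bf16"))                           # AmpType.BFLOAT16
print(len(list(AmpType)))                        # 1 (alias is hidden)
```

Giving the members distinct values restores the distinction.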
Optimizer selection summary:
| ZeRO Stage | Config | Optimizer | Gradient Accumulation |
|------------|--------|-----------|-----------------------|
| 0 | `bf16=True` (default) | `FP16_Optimizer` | bf16 |
| 0 | `bf16=True` + `grad_accum_dtype=fp32` | `NotImplementedError` | - |
| 1 | `bf16=True` + `grad_accum_dtype=fp32` | `BF16_Optimizer` | fp32 |
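The table above can be sketched as a selection function (a hypothetical helper for illustration only, not DeepSpeed's actual code; it returns the optimizer name as a string):

```python
from typing import Optional

def select_optimizer(zero_stage: int, bf16: bool,
                     grad_accum_dtype: Optional[str]) -> str:
    """Hypothetical sketch mirroring the selection table above."""
    if not bf16:
        raise ValueError("the table only covers bf16=True configurations")
    if zero_stage == 0 and grad_accum_dtype is None:
        return "FP16_Optimizer"   # gradient accumulation in bf16
    if zero_stage == 0 and grad_accum_dtype == "fp32":
        raise NotImplementedError("stage 0 with fp32 gradient accumulation")
    if zero_stage == 1 and grad_accum_dtype == "fp32":
        return "BF16_Optimizer"   # gradient accumulation in fp32
    raise ValueError("combination not covered by the table")
```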
This selection logic remains confusing (e.g., `FP16_Optimizer` handles
both fp16 and bf16); a future change should simplify these code paths
and clarify the behavior.
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>