Fix CPU Attention causal mask alignment (#29050)
## Summary
- align CPU ONNX Attention causal masking with upper-left alignment when
`q_len == 1`, `kv_len > 1`, and there is no past KV
- preserve the existing `nonpad_kv_seqlen` / TensorScatter single-query
causal behavior
- update the causal mask in the Python attention reference to model ONNX
upper-left alignment with an explicit past offset
- add a regression test for issue #29020
Fixes #29020
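
For reviewers, here is a minimal sketch of the masking rule the summary describes. It is not the code in `common.py` or the CPU kernel. The helper names `upper_left_allowed` and `bottom_right_allowed` are hypothetical, and the rule "query `i` may attend key `j` iff `j <= i + past_len`" is an assumed reading of "upper-left alignment with an explicit past offset".

```python
from __future__ import annotations


def upper_left_allowed(q_len: int, kv_len: int, past_len: int = 0) -> list[list[bool]]:
    """Assumed upper-left rule: query i may attend key j iff j <= i + past_len."""
    return [[j <= i + past_len for j in range(kv_len)] for i in range(q_len)]


def bottom_right_allowed(q_len: int, kv_len: int) -> list[list[bool]]:
    """Bottom-right alignment for contrast: the offset is implied by kv_len - q_len."""
    offset = kv_len - q_len
    return [[j <= i + offset for j in range(kv_len)] for i in range(q_len)]


# q_len=1, kv_len=3, no past: the case named in the summary.
print(upper_left_allowed(1, 3))              # [[True, False, False]]
print(bottom_right_allowed(1, 3))            # [[True, True, True]]
# With an explicit past offset of 2, the single query sees all three keys.
print(upper_left_allowed(1, 3, past_len=2))  # [[True, True, True]]
```

Under this reading, the two alignments differ only when `kv_len != q_len` and no past offset is supplied, which is the single-query case the regression test targets.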
## Validation
- `python -m py_compile
onnxruntime/test/python/transformers/test_onnx_attention/common.py
onnxruntime/test/python/transformers/test_onnx_attention/test_mha.py
onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py
onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py`
- `git diff --check`
Notes:
- `pytest
onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py
-k "cpu_fp32 and causal" -q` could not run locally because this Python
environment does not have `onnx` / `onnxruntime` installed.
- After the latest follow-up commit, an incremental rebuild of
`onnxruntime_provider_test` was attempted but failed in MSBuild before
compiling this change due to a local environment issue: duplicate `Path`
/ `PATH` environment keys when launching `CL.exe`.