Fix segfault for multiple GPU run (regression) (#15823)
### Fix segfault for multiple GPU run
https://github.com/microsoft/onnxruntime/pull/15618 introduced
`GetOrtDeviceByMemType`. The intention was presumably to handle the CPU
device differently in the `if` branch, but the branch mistakenly passes
the id of the default non-CPU device to the CPU device.
```
OrtDevice CUDAExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const {
  if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) {
    return OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, default_device_.Id());
  }
  return default_device_;
}
```
We observed a segmentation fault when running multi-GPU training:
`
CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch
--nproc_per_node=2
examples/onnxruntime/training/language-modeling/run_mlm.py
--model_name_or_path distilbert-base-uncased --dataset_name wikitext
--dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10
--per_device_train_batch_size 8 --per_device_eval_batch_size 8
--do_train --do_eval --overwrite_output_dir --output_dir ./outputs222/
--seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps
400 --logging_steps 1
`
GPU0 works fine, but GPU1 throws a segmentation fault. Looking further,
a Shape node, while allocating its output tensor, tries to fetch the
corresponding allocator with ORTDevice(Device:[DeviceType:0 MemoryType:1
DeviceId:1]). No CPU device has device id = 1, so no allocator is
returned. When we then call `AsStreamBasedAllocator` on that allocator,
the segmentation fault happens because no null check is done there.
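
The failure mode described above can be reproduced in isolation with a minimal sketch. All names here (`Device`, `Allocator`, `AllocatorMap`, `FindAllocator`, `MakeAllocatorsForRank`, and the two `PinnedDevice...` helpers) are hypothetical stand-ins, not the onnxruntime types, and pinning the CPU device id to 0 is an assumption about what a fix could look like rather than a statement of what this PR changes.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <tuple>

// Hypothetical stand-ins for illustration only; not the onnxruntime types.
enum class DeviceType { CPU = 0, GPU = 1 };
enum class MemType { DEFAULT = 0, CUDA_PINNED = 1 };

struct Device {
  DeviceType type;
  MemType mem_type;
  int16_t id;

  // Strict weak ordering so Device can be used as a std::map key.
  bool operator<(const Device& other) const {
    return std::tie(type, mem_type, id) <
           std::tie(other.type, other.mem_type, other.id);
  }
};

struct Allocator {};

using AllocatorMap = std::map<Device, std::shared_ptr<Allocator>>;

// Allocators registered by one process that drives the GPU `gpu_id`.
// Assumption: pinned CPU memory is only ever registered under id 0.
AllocatorMap MakeAllocatorsForRank(int16_t gpu_id) {
  AllocatorMap allocators;
  allocators.emplace(Device{DeviceType::GPU, MemType::DEFAULT, gpu_id},
                     std::make_shared<Allocator>());
  allocators.emplace(Device{DeviceType::CPU, MemType::CUDA_PINNED, 0},
                     std::make_shared<Allocator>());
  return allocators;
}

// Returns nullptr when no allocator is registered for `device`.
// Callers must check the result before dereferencing it.
std::shared_ptr<Allocator> FindAllocator(const AllocatorMap& allocators,
                                         const Device& device) {
  auto it = allocators.find(device);
  return it == allocators.end() ? nullptr : it->second;
}

// Mirrors the problematic branch: the CPU device inherits the GPU's id.
Device PinnedDeviceWithGpuId(int16_t gpu_id) {
  return Device{DeviceType::CPU, MemType::CUDA_PINNED, gpu_id};
}

// Assumed fix: the CPU device always uses id 0.
Device PinnedDeviceWithZeroId() {
  return Device{DeviceType::CPU, MemType::CUDA_PINNED, 0};
}
```

On the process that owns GPU 0 both helpers produce the same key, which is consistent with the bug only surfacing on GPU1. On the process that owns GPU 1, the lookup with the inherited id returns `nullptr`, and any unchecked dereference of that result would crash.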
### Motivation and Context