[Pallas TPU] Add use_dummy_device_index to dma_wait_p for inserting dummy device ids in the lowering.
This change addresses a design issue with the wait_dma2 op, where we don't provide both src/dst refs and sems. DMA wait_sends can be purely local (e.g. waiting on SC for an SC -> TC DMA to be sent) but still need a device_id/core_id to be lowered correctly. Requiring a real device_id/core_id doesn't make sense for a wait_send, because neither the ref nor the sem is non-local, so with use_dummy_device_index set the lowering inserts a dummy device id instead.
This can be undone once wait_dma3 lands, which matches the signature of enqueue_dma and will therefore have enough info to lower correctly without this additional bit.
PiperOrigin-RevId: 903562977