xla
Fix race condition when using multiple threads to transfer data in the parallel loader
#5267
Merged

aws-tianquaw commented 2 years ago (edited)

This pull request fixes an issue introduced by a previous pull request, which added support for increasing the number of host-to-device transfer threads. When more than one worker thread is used to transfer data, a race condition can occur and cause some batches of data to be lost:

Currently, each thread calls queue.close_write() when there is no more data in the loader queue. But if one thread calls queue.close_write() while other threads still need to put data into the queue, those batches of data are lost. In that case, the next_item() call in this line returns None when it should return valid data.

Ideally, we should only call queue.close_write() after all threads have completed writing data to the queue to avoid this race condition.
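
For illustration, here is a minimal, self-contained sketch of that shutdown pattern, assuming a single consumer: only the last producer thread to finish is allowed to close the queue, so no batch written by a slower worker is dropped. The `ClosableQueue` class, `produce_batches` helper, and counter/lock bookkeeping below are hypothetical stand-ins for the loader internals, not the actual torch_xla parallel loader code.

```python
import queue
import threading


class ClosableQueue:
    """Queue whose consumer sees None once all producers are done."""

    _SENTINEL = object()

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, item):
        self._queue.put(item)

    def close_write(self):
        # Tell the consumer that no more data will arrive.
        self._queue.put(self._SENTINEL)

    def next_item(self):
        item = self._queue.get()
        return None if item is self._SENTINEL else item


def produce_batches(batches, out_queue, done_counter, lock, num_workers):
    for batch in batches:
        out_queue.put(batch)
    # Race-free shutdown: only the last worker to finish closes the queue,
    # so batches enqueued by slower workers are never dropped.
    with lock:
        done_counter[0] += 1
        if done_counter[0] == num_workers:
            out_queue.close_write()


if __name__ == "__main__":
    num_workers = 4
    out_queue = ClosableQueue()
    done_counter, lock = [0], threading.Lock()
    work = [list(range(i * 10, (i + 1) * 10)) for i in range(num_workers)]
    threads = [
        threading.Thread(
            target=produce_batches,
            args=(work[i], out_queue, done_counter, lock, num_workers),
        )
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()

    received = []
    while (item := out_queue.next_item()) is not None:
        received.append(item)
    for t in threads:
        t.join()
    assert len(received) == num_workers * 10  # no batches lost
```
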

aws-tianquaw Fix race condition when use > 1 threads to transfer data in parallel …
8f4b5243
aws-tianquaw marked this pull request as draft 2 years ago
aws-tianquaw marked this pull request as ready for review 2 years ago
JackCaoG requested a review from chandrasekhard2 2 years ago
JackCaoG commented 2 years ago

@Liyang90 Do you have cycles to take a look at this one?

chandrasekhard2 approved these changes on 2023-07-13
JackCaoG commented 2 years ago

@aws-tianquaw Do you mind rebasing this PR? Then CI should start passing.

aws-tianquaw Merge branch 'pytorch:master' into fix-parallel-loader
6cc17a01
JackCaoG merged 1dc5af55 into master 2 years ago
