Capacity aware partitioning (#22766)
### Description
Allow users to specify per-EP resource constraints.
Currently, models that do not fit into device memory fail with an error.
This PR lays the groundwork for EP-specific, resource-constrained graph
partitioning, subject to incremental feature additions.
Partitioning in this context means assigning graph nodes to a specific
device (Execution Provider)
up to a limit that is either automatically inferred or provided
by configuration.
In this implementation, we stop assigning nodes to CUDA once we reach
the specified memory limit.
This allows users to run models on devices with limited memory or other
constrained resources and
offload parts of the graph to the CPU or other EPs as configured.
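As a rough illustration of the behavior described above, here is a minimal sketch of limit-based assignment. The node names, byte counts, and the `partition` helper are all hypothetical, not the actual ONNX Runtime implementation: nodes are taken in graph order, placed on CUDA until the budget is reached, and everything after that point falls back to CPU.

```python
def partition(nodes, cuda_budget_bytes):
    """Assign nodes to CUDA in graph order until the memory budget is
    reached; once it is exhausted, the remaining nodes go to CPU.
    `nodes` is a list of (name, estimated_bytes) pairs. Hypothetical
    sketch only -- not the real partitioner."""
    assignment, used, cuda_full = {}, 0, False
    for name, size in nodes:
        if not cuda_full and used + size <= cuda_budget_bytes:
            assignment[name] = "CUDA"
            used += size
        else:
            # Stop assigning to CUDA after the first node that would
            # exceed the budget, mirroring the stop-at-limit behavior.
            cuda_full = True
            assignment[name] = "CPU"
    return assignment

nodes = [("embed", 400), ("block0", 300), ("block1", 300), ("head", 200)]
print(partition(nodes, cuda_budget_bytes=1000))
```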
The PR also introduces the ability to profile and save resource
consumption on a per-node basis.
The results of one or more runs are saved to a CSV file, which can then
be loaded to assist
partitioning.
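To show how recorded per-node stats might feed back into partitioning, here is a small parsing sketch. The CSV column names (`node_name`, `peak_memory_bytes`) are assumptions for illustration; the actual file format produced by the profiler may differ.

```python
import csv
import io

# Hypothetical CSV layout for per-node resource stats; real column
# names and fields may differ from what the profiler emits.
csv_text = """node_name,peak_memory_bytes
embed,400
block0,300
head,200
"""

def load_node_stats(text):
    """Parse per-node resource stats into {node_name: bytes}, suitable
    for use as size estimates when partitioning."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["node_name"]: int(row["peak_memory_bytes"]) for row in reader}

stats = load_node_stats(csv_text)
print(stats)
```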
Model-architecture-based partitioning (e.g. placing N transformer blocks
on GPU and the embeddings on CPU) is not implemented in this PR but will
come in the future.
### Motivation and Context
We want to allow models to run in constrained environments.
### Pending
Annotation-assisted partitioning