Revise Send and Recv (#4547)
* Add ability to retrieve inferred shapes when executing a kernel.
This ability helps Recv to know its output shapes without doing
actual cummunication. Of course, if the output shapes cannot be
inferred, Recv still needs to do communication to get shapes from
Send.
* Avoid communicating shape information when it can be inferred statically
* Replace unordered_map with thread-safe wrapper.
We don't want to have racing condition and undefined behavior
when using parallel executor.y
* Remove cout
* Add missing file
* Address comments
* Check dim_value. -1 means missing
* lock properly
* Address comments (remove thread-safe map)
* Remove poc header
* Replace Stream with DeferredReleaseCPUPtr