Add client-side compilation cache.
A hit in the client-side cache skips several costs of a compile request:
The ClientSession::Run() method needs to serialize the request to proto, which for large computation objects can take a few ms.
The gRPC latency can also be in the range of a few ms.
On the service side, the compilation cache is distributed, which requires a network lookup.
The de-serialization on the service side for large computation objects can also take a few ms.
For reference, each of the 8 parallel computations we issue for resnet50 is about 500KB in size.
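
As an illustration of the idea only, and not the code in this change, a client-side cache could be sketched as below. The names `CompilationCache`, `CompiledHandle`, `GetOrCompile` and the string fingerprint key are hypothetical; the sketch assumes the caller can produce a cheap fingerprint of the computation and that `compile_fn` wraps the full remote path (serialization, gRPC call, service-side lookup).

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for whatever the service returns after a compile.
struct CompiledHandle {
  int64_t id;
};

// Minimal thread-safe cache keyed by a computation fingerprint (assumed to be
// cheap to compute compared with serializing the whole computation).
class CompilationCache {
 public:
  // Assumed to perform the expensive remote compile for a given fingerprint.
  using CompileFn = std::function<CompiledHandle(const std::string&)>;

  explicit CompilationCache(CompileFn compile_fn)
      : compile_fn_(std::move(compile_fn)) {}

  std::shared_ptr<CompiledHandle> GetOrCompile(const std::string& fingerprint) {
    // The lock is held across the compile for simplicity; this serializes
    // concurrent misses, which a real implementation may want to avoid.
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(fingerprint);
    if (it != cache_.end()) {
      ++hits_;  // Hit: no serialization, no RPC, no remote cache lookup.
      return it->second;
    }
    ++misses_;
    auto handle = std::make_shared<CompiledHandle>(compile_fn_(fingerprint));
    cache_.emplace(fingerprint, handle);
    return handle;
  }

  std::size_t hits() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return hits_;
  }

  std::size_t misses() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return misses_;
  }

 private:
  CompileFn compile_fn_;
  mutable std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<CompiledHandle>> cache_;
  std::size_t hits_ = 0;
  std::size_t misses_ = 0;
};
```

With this shape, repeated calls to `GetOrCompile` with the same fingerprint invoke `compile_fn` only once, so only the first compile of each computation pays the remote costs listed above.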