Normalize reward-to-go in C++ actor-critic (#33550)
Summary:
Compared with the [Python implementation](https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py), the tensor of normalized reward-to-go is computed in the C++ example but never used. Even though it is only an integration test, this PR switches to the normalized version for better convergence.
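
For context, a minimal sketch of the step in question (names and the epsilon value are illustrative, not the exact code in the test): the discounted reward-to-go is accumulated backwards over the episode and then standardized before it scales the policy loss.

```cpp
#include <torch/torch.h>
#include <vector>

// Sketch: build the reward-to-go tensor and normalize it to zero mean /
// unit variance, as the Python actor_critic.py example does.
torch::Tensor normalized_rewards_to_go(const std::vector<double>& rewards, double gamma) {
  std::vector<double> returns(rewards.size());
  double running = 0.0;
  // Accumulate discounted reward-to-go from the end of the episode.
  for (int64_t i = static_cast<int64_t>(rewards.size()) - 1; i >= 0; --i) {
    running = rewards[i] + gamma * running;
    returns[i] = running;
  }
  auto t = torch::tensor(returns, torch::kFloat64);
  // Small epsilon (illustrative value) guards against division by zero
  // when all returns in the episode are identical.
  return (t - t.mean()) / (t.std() + 1e-7);
}
```

The change is simply to feed this normalized tensor, rather than the raw returns, into the advantage computation used by the actor-critic loss.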
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33550
Differential Revision: D20024393
Pulled By: yf225
fbshipit-source-id: ebcf0fee14ff39f65f6744278fb0cbf1fc92b919