Fix sync placement in some cases where it was less than optimal or wrong. (#1600)
* Fix placement of added test.
* Place RAW sync at computeAt position rather than unroll position
When a shared-memory tensor is unrolled (which should be uncommon, since
unroll is meant to allocate enough registers for loop unrolling), its
RAW (read-after-write) sync must be placed at the computeAt loop, because
its consumers share that loop.
Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>