[inductor] realize boundaries in bucketize() lowering (#106107)
ops.bucketize() implements a binary search: it takes values and offsets, where offsets defines the boundaries of a set of buckets, and returns, for each value, the index of the bucket that value lies in. The op is elementwise with respect to the values and outputs, but it needs access to the entire offsets tensor in global memory in order to perform the binary search. We therefore need to realize the boundaries into global memory before running the op. The scheduler won't try to fuse the two kernels together because the input to ops.bucketize() is marked as a StarDep.
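For illustration only, here is a minimal pure-Python sketch of the semantics described above. It is not the Inductor lowering: the helper name `bucketize` and the sample data are hypothetical, and the `right` flag is assumed to follow the same convention as `torch.bucketize`. Each output depends on a single value but on the whole boundaries list, which is why the boundaries have to be fully materialized before any lookup runs.

```python
from bisect import bisect_left, bisect_right


def bucketize(values, boundaries, right=False):
    """Return, for each value, the index of the bucket it falls in.

    `boundaries` must be sorted. Every lookup is independent of the other
    values (elementwise), but each one binary-searches the entire
    `boundaries` list.
    """
    # Assumed convention: right=False places a value equal to a boundary
    # in the lower bucket, right=True places it in the upper bucket.
    search = bisect_right if right else bisect_left
    return [search(boundaries, v) for v in values]


if __name__ == "__main__":
    boundaries = [1, 3, 5, 7]  # hypothetical sample data
    values = [0, 3, 4, 9]
    print(bucketize(values, boundaries))              # [0, 1, 2, 4]
    print(bucketize(values, boundaries, right=True))  # [0, 2, 2, 4]
```

The only difference between the two calls is how a value that equals a boundary (here, 3) is assigned.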
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106107
Approved by: https://github.com/jansel