[PT-Vulkan] aten::conv1d - opt: width-pack weight tensor (>2x speedup) (#118835)
## This diff
This optimization reduces the number of `texelFetch(uKernel, ...)` calls by a factor of 4, as sketched below.
We reuse MatMul's re-packing logic:
https://www.internalfb.com/code/fbsource/[7e8ef1b8adeda224a736f8cc4bf870e0a659df95]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/Mm.cpp?lines=20%2C50
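For intuition, here is a minimal before/after sketch of the weight reads in GLSL. The coordinate layout and names (`k`, `c`, `n`) are illustrative assumptions, not the actual shader:

```glsl
// Before: one weight per texel position along the kernel's length, so
// reading 4 adjacent weights costs 4 texelFetch calls.
float w0 = texelFetch(uKernel, ivec3(k + 0, c, n), 0).x;
float w1 = texelFetch(uKernel, ivec3(k + 1, c, n), 0).x;
float w2 = texelFetch(uKernel, ivec3(k + 2, c, n), 0).x;
float w3 = texelFetch(uKernel, ivec3(k + 3, c, n), 0).x;

// After width-packing: the same 4 weights share one RGBA texel, so
// reading them is a single texelFetch.
vec4 w = texelFetch(uKernel, ivec3(k / 4, c, n), 0);
```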
## Future optimizations
We already batch reads from the input/weight tensors and writes to the output tensor.
Here are other ideas, which I won't pursue for now; (2) is the most doable.
1. **Batch reads/writes along the dimension that is most commonly > 1.** For the weights, the length dimension is clearly the right choice, and the input/output tensors could likely leverage their length dimensions too. However, `stride != 1` would complicate this optimization: adjacent output elements would no longer read adjacent (same-texel) input elements.
2. **Batch an optimal number of reads/writes.** Instead of defaulting to 4 elements (the contents of 1 texel), consider batching more, e.g. MatMul's 4x4 texel tile; see the sketch after this list.
3. **Obscure shader compiler optimizations.** MatMul seemed to benefit from several seemingly equivalent ways of writing the same code, so similar rewrites may help here.
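As a rough illustration of (2), a hypothetical GLSL fragment where one invocation computes a 4-wide strip of outputs, so each fetched weight is reused 4 times. All names (`uInput`, `uKernelWidth`, `x`) are assumptions, and it is simplified to a single input channel, `stride == 1`, no padding, and unpacked scalars; the real shader operates on packed vec4 texels:

```glsl
// out[x + i] = sum_k w[k] * in[x + i + k], for i in 0..3.
float acc[4] = float[4](0.0, 0.0, 0.0, 0.0);
for (int k = 0; k < uKernelWidth; ++k) {
  float w = texelFetch(uKernel, ivec3(k, 0, 0), 0).x;  // fetched once per k
  for (int i = 0; i < 4; ++i) {
    // The same weight feeds all 4 accumulators, cutting weight reads by 4x
    // versus one output per invocation.
    acc[i] += w * texelFetch(uInput, ivec3(x + i + k, 0, 0), 0).x;
  }
}
```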
Differential Revision: [D53204674](https://our.internmc.facebook.com/intern/diff/D53204674/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118835
Approved by: https://github.com/SS-JIA, https://github.com/liuk22