[SYCL RTC] Introduce `--persistent-auto-pch` support (#20374)
Built on top of `--auto-pch` (in-memory) introduced in
https://github.com/intel/llvm/pull/20226.
The most significant technical decision was how to implement the
filesystem cache. I've looked into the following options:
* `sycl/source/detail/persistent_device_code_cache.hpp` (see also
`sycl/doc/design/KernelProgramCache.md`). It seems to be tailored to
very specific usage scenarios, and it would be very resource-consuming
to split it into a generic data structure that could then serve two
different use cases.
This cache is disabled by default and I'm not sure how well-tested it
is. Also, using plain ".lock" files for "advisory locking" instead of
the native filesystem mechanisms (e.g., locking APIs in
`fcntl`/`flock`/`CreateFile`/`LockFileEx`) made me question if it's
worth generalizing and how much work would be necessary there.
* `llvm/include/llvm/Support/Caching.h` Originally implemented as part
of ThinLTO and later moved into `LLVMSupport` with the following commit
message:
> We would like to move ThinLTO’s battle-tested file caching
> mechanism to the LLVM Support library so that we can use it
> elsewhere in LLVM.
The API is rather unexpected, so my research didn't stop there.
* `lldb/include/lldb/Core/DataFileCache.h` Uses `LLVMSupport`'s caching
from the previous bullet under the hood, but provides an
easier-to-grasp API. If we were developing upstream, I think uplifting
that abstraction into the `LLVMSupport` library and then using it in
both `lldb` and `libsycl` would probably be the choice I'd vote for.
However, doing that downstream was too much effort, so I ultimately
decided not to go with this approach.
That cache also has a `std::mutex` on the "hot"
`DataFileCache::GetCachedData` path, I presume to avoid creating the
same entry from multiple threads.
In the end, I've chosen to use `LLVMSupport`'s quirky (or maybe I just
haven't grown enough to appreciate it) caching API directly, and that's
what is done in this PR. Unlike `lldb`'s cache, I decided to accept
possible duplicate work (concurrent threads may each build the preamble
on a cache miss) in exchange for no inter-thread synchronization on the
cache hit path (not profiled/measured though) and a simpler
implementation.
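To illustrate that trade-off, here is a minimal standard-library sketch.
It is not the code from this PR (which goes through `LLVMSupport`'s
caching API) and it is not an LLVM interface: `LockFreeFileCache`,
`lookup` and `getOrBuild` are hypothetical names, and the key is assumed
to be a filename-safe hash. It only shows the shape of the idea: the hit
path is a plain file read with no mutex, and on a miss each thread builds
the data itself and publishes it via a temporary file plus rename, so
concurrent misses may duplicate work but never need to coordinate.

```cpp
#include <atomic>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <functional>
#include <optional>
#include <sstream>
#include <string>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical illustration only; not the PR's code and not an LLVM API.
class LockFreeFileCache {
  fs::path Root;
  // Makes temporary file names unique within this process. A real cache
  // would also need uniqueness across processes.
  static inline std::atomic<unsigned> TempId{0};

public:
  explicit LockFreeFileCache(fs::path Dir) : Root(std::move(Dir)) {
    fs::create_directories(Root);
  }

  // Hit path: a plain file read, no inter-thread synchronization.
  std::optional<std::string> lookup(const std::string &Key) const {
    assert(!Key.empty() && "Key is assumed to be a filename-safe hash");
    std::ifstream In(Root / Key, std::ios::binary);
    if (!In)
      return std::nullopt; // Cache miss.
    std::ostringstream Contents;
    Contents << In.rdbuf();
    return Contents.str();
  }

  // Miss path: every thread that misses runs Build() on its own, so the
  // same entry may be built more than once. Each result goes to a private
  // temporary file that is then renamed over the final name.
  std::string getOrBuild(const std::string &Key,
                         const std::function<std::string()> &Build) const {
    if (auto Hit = lookup(Key))
      return *Hit;
    std::string Data = Build();
    fs::path Tmp =
        Root / (Key + ".tmp" + std::to_string(TempId.fetch_add(1)));
    {
      std::ofstream Out(Tmp, std::ios::binary);
      Out << Data;
    } // Stream is flushed and closed here, before the rename.
    std::error_code EC;
    fs::rename(Tmp, Root / Key, EC);
    if (EC)
      fs::remove(Tmp, EC); // Publishing failed; still return built data.
    return Data;
  }
};
```

A `std::mutex` around `getOrBuild`, as in `lldb`'s `DataFileCache`,
would avoid the duplicate `Build()` calls at the cost of taking a lock
on every lookup, including hits.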