[`FA`] Clean up loading logic (#41427)
* fix
* style
* fix kernels loading as well
* fix typing
* refactor CB loading logic as well
* fix base fa logic
* rename
* properly lazy-load paged fa
* fix
* check if ci is crashing again
* fix fallback
* style
* allow varlen-only kernels, e.g. the metal kernel
* fix up new naming from flash-attn to flash-attn2
* make it a bit more explicit
* add comment