Most of the matching time is spent in loops, so move memory accesses out of the loop header and into our jump locations.
This requires some code duplication, but because the hot loops no longer read `self.instructions`, most memory operations are eliminated and matching gets considerably faster.
With this we consistently beat `fast-glob` and are neck and neck with `globset`.
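
A minimal sketch of the idea, not the actual code in this change: the `Instr` enum, the `SkipTo` instruction, and both method names are hypothetical, and the instruction set is far smaller than a real glob matcher's (no backtracking, and the program is assumed to end with `End`). The naive version goes back through the dispatch loop, and therefore re-reads `self.instructions[pc]`, for every input byte it skips. The hoisted version reads the instruction once at the jump location and then scans in a duplicated inner loop that only touches the input.

```rust
/// Hypothetical, heavily simplified instruction set.
#[derive(Clone, Copy)]
enum Instr {
    /// Match exactly this byte.
    Byte(u8),
    /// Skip bytes until the next occurrence of this byte (not consumed).
    SkipTo(u8),
    /// Succeed only if the whole input has been consumed.
    End,
}

struct Matcher {
    /// Assumption: always terminated by `Instr::End`.
    instructions: Vec<Instr>,
}

impl Matcher {
    /// Baseline: every skipped byte goes back through the dispatch loop,
    /// so `self.instructions[pc]` is loaded again on each iteration.
    fn is_match_naive(&self, input: &[u8]) -> bool {
        let (mut pc, mut pos) = (0usize, 0usize);
        loop {
            match self.instructions[pc] {
                Instr::Byte(b) => {
                    if input.get(pos) != Some(&b) {
                        return false;
                    }
                    pos += 1;
                    pc += 1;
                }
                Instr::SkipTo(b) => match input.get(pos) {
                    None => return false,
                    Some(&c) if c == b => pc += 1,
                    // `pc` is unchanged: the same instruction is re-read.
                    Some(_) => pos += 1,
                },
                Instr::End => return pos == input.len(),
            }
        }
    }

    /// Hoisted: the operand is read once when we jump to `SkipTo`, and the
    /// duplicated inner loop never touches `self.instructions`.
    fn is_match_hoisted(&self, input: &[u8]) -> bool {
        let (mut pc, mut pos) = (0usize, 0usize);
        loop {
            match self.instructions[pc] {
                Instr::Byte(b) => {
                    if input.get(pos) != Some(&b) {
                        return false;
                    }
                    pos += 1;
                    pc += 1;
                }
                Instr::SkipTo(b) => {
                    // `b` lives in a local for the duration of the scan.
                    loop {
                        match input.get(pos) {
                            None => return false,
                            Some(&c) if c == b => break,
                            Some(_) => pos += 1,
                        }
                    }
                    pc += 1;
                }
                Instr::End => return pos == input.len(),
            }
        }
    }
}
```

Both methods return the same result for any program and input. The difference is only where the instruction load happens, and whether that shows up in a profile depends on what the optimizer already does with the baseline.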