stdlib: fix performance regression for long string appends.
Rewrote the inner memcpy loops so that they can be vectorized.
Also added a few @inline(__always) attributes.
Since we removed some @inlineable attributes, this string-append code is no longer code-generated in the client.
The code generated in the stdlib binary is different because not all of the precondition checks are folded away there.
Using explicit loop control statements instead of for-in-range removes the precondition overhead for those time-critical memcpy loops.
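
For illustration, the change in loop shape looks roughly like the sketch below. This is not the actual stdlib code: copyForIn and copyWhile are hypothetical helper names, and the element type UInt8 is an assumption made to keep the example small.

```swift
// Hypothetical helper, not a stdlib function: copy using for-in over a range.
func copyForIn(_ dst: UnsafeMutablePointer<UInt8>,
               _ src: UnsafePointer<UInt8>,
               _ count: Int) {
  // Forming 0..<count carries a precondition (lower bound <= upper bound).
  for i in 0..<count {
    dst[i] = src[i]
  }
}

// Hypothetical helper, not a stdlib function: same copy with explicit loop control.
func copyWhile(_ dst: UnsafeMutablePointer<UInt8>,
               _ src: UnsafePointer<UInt8>,
               _ count: Int) {
  // No Range is constructed, so there is no range precondition to check.
  var i = 0
  while i < count {
    dst[i] = src[i]
    i &+= 1  // cannot overflow: i < count <= Int.max at this point
  }
}
```

Both helpers copy the same bytes; only the loop control differs. When the range precondition is folded away, the two forms would be expected to compile to similar code.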