strings: assert nothrow effects on Char and String predicates (#61625)
Continuation of #61616 #61615
Developed with Claude.
---
Several `Char` predicates and a number of `String` /
`SubString{String}` functions in `base/strings/unicode.jl` and
`base/strings/string.jl` are spuriously inferred as `nothrow=false`,
even though they are provably total. This prevents the compiler from
constant-folding calls to them and from eliminating those calls when
the result is unused.
The root cause is that the compiler cannot prove that the `ismalformed`
/ `is_overlong_enc` guards rule out the throwing paths in the bodies of
these functions, for example the `Int(::Cint)` conversion after the
`utf8proc` ccall, or the `UInt32(::AbstractChar)` conversion in
`category_code`. The `AbstractChar` contract guarantees that `UInt32(c)`
only throws when `ismalformed(c)` returns `true`, and the
`utf8proc_charwidth` result fits in a `Cint`, so the `Int` conversion
never throws `InexactError`.
This PR annotates the affected definitions with `@assume_effects` and
adds regression tests in `test/char.jl` and `test/strings/basic.jl`.
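
As a minimal sketch of the pattern (not the PR's actual diff), the snippet below annotates a hypothetical helper, `codepoint_or_zero`, whose guard makes the conversion safe, and then asks inference for the resulting effects, the way a regression test might. The helper name and the choice of `isvalid` as the guard are assumptions made for illustration.

```julia
# Hypothetical helper, not part of the PR. `isvalid(c)` is false for
# malformed and overlong `Char`s, so the `UInt32(c)` below cannot throw.
Base.@assume_effects :nothrow function codepoint_or_zero(c::Char)
    isvalid(c) || return UInt32(0)
    return UInt32(c)
end

# Regression-style check: query the inferred effects for the method.
effects = Base.infer_effects(codepoint_or_zero, (Char,))
nothrow_ok = Core.Compiler.is_nothrow(effects)
```

Because `@assume_effects` is a promise the compiler does not verify, the guard has to cover every throwing path before the annotation is sound.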
### Scope
`Char`:
- `textwidth`, `category_code`, `category_abbrev`, `category_string`,
`isassigned`, `islowercase`, `isuppercase`
`String`:
- `isascii`, `lastindex`, `textwidth`
`SubString{String}`:
- `length`, `isascii`, `lastindex`, `textwidth`
`String` / `SubString{String}` predicates:
- `startswith`, `endswith`, `in(::Char, _)`
Annotations are added on narrow concrete-type method overloads so they
don't constrain user-defined `AbstractChar` / `AbstractString` subtypes.
A static `@assert length(category_strings) == 32` is added so the
`category_string(::Char)` annotation stays sound if the table is ever
edited.
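
The role of the length assertion can be illustrated with a scaled-down sketch. Everything here (`DEMO_NAMES`, `demo_code`, `demo_name`) is a hypothetical stand-in, assuming the real lookup indexes a fixed table with `code + 1`: the `:nothrow` promise holds only while the table has an entry for every possible code.

```julia
# Hypothetical 4-entry table standing in for the real category table.
const DEMO_NAMES = ("zero", "one", "two", "three")

# Fails at load time if the table is edited to a different size,
# instead of silently invalidating the `:nothrow` promise below.
@assert length(DEMO_NAMES) == 4

# The mask keeps the code in 0:3, so `code + 1` is always in 1:4.
demo_code(x::UInt8) = Int(x & 0x03)

Base.@assume_effects :nothrow demo_name(x::UInt8) = DEMO_NAMES[demo_code(x) + 1]
```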
The bodies of `startswith` / `endswith` consist of a `_memcmp` call
whose length is bounded by `sizeof(b) ≤ ncodeunits(a)`, plus `nextind` /
`thisind` calls on indices proven to be in range, so they cannot throw
even on strings containing arbitrary malformed UTF-8 byte sequences. A
stress test exercises every annotated function on adversarial byte
sequences and on `Char` values constructed from invalid bit patterns,
locking in the `:nothrow` soundness against future refactors.
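
A stress test of this kind might look roughly like the following sketch. The `completes` helper and the specific inputs are illustrative assumptions, not the test added by the PR. The inputs are malformed (truncated or invalid lead bytes) rather than overlong, and only a subset of the annotated functions is exercised.

```julia
# Returns true if calling f(args...) finishes without throwing.
function completes(f, args...)
    try
        f(args...)
        return true
    catch
        return false
    end
end

# `Char` values built from malformed bit patterns (illustrative choices).
bad_chars = [reinterpret(Char, u) for u in (0x80000000, 0xff000000, 0xe2820000)]

# Strings holding byte sequences that are not valid UTF-8.
bad_strings = [
    String(UInt8[0xff, 0xfe, 0x61]),  # invalid lead bytes
    String(UInt8[0x61, 0x80, 0x62]),  # stray continuation byte
    String(UInt8[0xe2, 0x82]),        # truncated three-byte sequence
]

chars_ok = all(completes(f, c) for f in (textwidth, islowercase, isuppercase), c in bad_chars)
strings_ok = all(completes(f, s) for f in (isascii, lastindex, length, textwidth), s in bad_strings)
pairs_ok = all(completes(f, a, b) for f in (startswith, endswith), a in bad_strings, b in bad_strings)
in_ok = all(completes(in, 'a', s) for s in bad_strings)
```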
### Effect on generated code
`textwidth("hello world")` now constant-folds:
**nightly:**
```julia-repl
julia> code_typed(() -> textwidth("hello world"), ())[1]
CodeInfo(
1 ─ %1 = invoke Base._foldl_impl(
Base.MappingRF{typeof(textwidth), Base.BottomRF{typeof(+)}}(
textwidth, Base.BottomRF{typeof(+)}(+))::Base.MappingRF{...},
0::Int64, "hello world"::String)::Int64
└── return %1
) => Int64
```
**this PR:**
```julia-repl
julia> code_typed(() -> textwidth("hello world"), ())[1]
CodeInfo(
1 ─ return 11
) => Int64
```
Likewise, `startswith` / `endswith` / `in` over literal `String`s now
constant-fold.
**nightly:**
```julia-repl
julia> code_typed(() -> startswith("https://example.com/foo", "https://"), ())[1]
CodeInfo(
1 ─ %1 = invoke Main.startswith("https://example.com/foo"::String, "https://"::String)::Bool
└── return %1
) => Bool
julia> code_typed(() -> endswith("foo.jl", ".jl"), ())[1]
CodeInfo(
1 ─ %1 = invoke Main.endswith("foo.jl"::String, ".jl"::String)::Bool
└── return %1
) => Bool
julia> code_typed(() -> in('a', "aeiou"), ())[1]
CodeInfo(
1 ─ %1 = invoke Base.codeunits("aeiou"::String)::Base.CodeUnits{UInt8, String}
│ %2 = $(Expr(:gc_preserve_begin, :(%1)))
│ %3 = builtin Base.getfield(%1, :s)::String
│ %4 = invoke Base.cconvert(Ptr{UInt8}::Type{Ptr{UInt8}}, %3::String)::String
│ %5 = $(Expr(:foreigncall, :((:jl_string_ptr,)), Ptr{UInt8}, svec(Any), 0, :(:ccall), :(%4)))::Ptr{UInt8}
│ %6 = intrinsic Base.add_ptr(%5, 0x0000000000000001)::Ptr{UInt8}
│ %7 = intrinsic Base.sub_ptr(%6, 0x0000000000000001)::Ptr{UInt8}
│ %8 = $(Expr(:foreigncall, :((:memchr,)), Ptr{UInt8}, svec(Ptr{UInt8}, Int32, UInt64), 0, :(:ccall), :(%7), 97, 0x0000000000000005, 0x0000000000000005, 97, :(%7)))::Ptr{UInt8}
│ $(Expr(:gc_preserve_end, :(%2)))
│ %10 = intrinsic Core.bitcast(Core.UInt, %8)::UInt64
│ %11 = builtin (%10 === 0x0000000000000000)::Bool
└── goto #3 if not %11
2 ─ goto #4
3 ─ goto #4
4 ┄ %15 = φ (#2 => true, #3 => false)::Bool
│ %16 = intrinsic Core.Intrinsics.not_int(%15)::Bool
└── goto #5
5 ─ return %16
) => Bool
```
**this PR:**
```julia-repl
julia> code_typed(() -> startswith("https://example.com/foo", "https://"), ())[1]
CodeInfo(
1 ─ return true
) => Bool
julia> code_typed(() -> endswith("foo.jl", ".jl"), ())[1]
CodeInfo(
1 ─ return true
) => Bool
julia> code_typed(() -> in('a', "aeiou"), ())[1]
CodeInfo(
1 ─ return true
) => Bool
```
And `SubString{String}` queries can now be eliminated as dead code when
their result is unused. For a `caller` that discards the result of
`describe(ss)`, where `describe` calls `length`, `lastindex`,
`textwidth`, and `isascii`:
**nightly:**
```llvm
define i64 @julia_caller_0(...) {
top:
%"sret::NamedTuple" = alloca [4 x i64], align 8
call void @j_describe_0(ptr ... sret(...) align 8 %"sret::NamedTuple", ...)
ret i64 42
}
```
**this PR:**
```llvm
define i64 @julia_caller_0(...) {
top:
ret i64 42
}
```
---------
Co-authored-by: Claude <noreply@anthropic.com>