julia
0677c85b - strings: assert nothrow effects on Char and String predicates (#61625)

Commit
14 days ago
strings: assert nothrow effects on Char and String predicates (#61625)

Continuation of #61616 and #61615.

Developed with Claude:

---

Several `Char` predicates and a couple of `String` predicates in `base/strings/unicode.jl` and `base/strings/string.jl` are spuriously inferred as `nothrow=false`, even though they are provably total. This prevents the compiler from constant-folding them or from eliminating them when their result is unused.

The root cause is that the compiler cannot prove that the `ismalformed` / `is_overlong_enc` guards rule out the throwing paths in their bodies, for example the `Int(::Cint)` conversion after the `utf8proc` ccall, or the `UInt32(::AbstractChar)` conversion in `category_code`. The `AbstractChar` contract guarantees that `UInt32(c)` only throws when `ismalformed(c)` returns `true`, and the `utf8proc_charwidth` result fits in a `Cint`, so the `Int` conversion never throws `InexactError`.

This PR annotates the affected definitions with `@assume_effects` and adds regression tests in `test/char.jl` and `test/strings/basic.jl`.

### Scope

`Char`:
- `textwidth`, `category_code`, `category_abbrev`, `category_string`, `isassigned`, `islowercase`, `isuppercase`

`String`:
- `isascii`, `lastindex`, `textwidth`

`SubString{String}`:
- `length`, `isascii`, `lastindex`, `textwidth`

`String` / `SubString{String}` predicates:
- `startswith`, `endswith`, `in(::Char, _)`

Annotations are added on narrow concrete-type method overloads so they don't constrain user-defined `AbstractChar` / `AbstractString` subtypes.

A static `@assert length(category_strings) == 32` is added so the `category_string(::Char)` annotation stays sound if the table is ever edited.

The bodies of `startswith` / `endswith` consist of a `_memcmp` bounded by `sizeof(b) ≤ ncodeunits(a)`, plus `nextind` / `thisind` calls on indices proven in range, so they cannot throw even on strings containing arbitrary, malformed UTF-8 byte sequences.
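
The guard-then-convert pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: `codepoint_or_sentinel` and `count_rejected` are hypothetical names that do not appear in Base or in this PR, and the sentinel return value is an arbitrary choice. The sketch guards the `UInt32(::Char)` conversion against malformed and overlong encodings, asserts `:nothrow` on the concrete `Char` method only, and then runs a small sweep over raw bit patterns.

```julia
# Hypothetical helper for illustration only; not part of Base or of this PR.
# The guard covers the cases in which `UInt32(::Char)` throws (malformed or
# overlong encodings), which is what makes the `:nothrow` assertion reasonable.
Base.@assume_effects :nothrow function codepoint_or_sentinel(c::Char)
    (Base.ismalformed(c) || Base.isoverlong(c)) && return typemax(UInt32)
    return UInt32(c)
end

# Miniature sweep over 32-bit patterns reinterpreted as `Char`.
# The stride keeps the loop short; the loop only checks that every call
# returns, and counts how many patterns the guard rejected.
function count_rejected()
    rejected = 0
    for u in 0x00000000:0x00010001:0xffffffff
        c = reinterpret(Char, u)
        rejected += codepoint_or_sentinel(c) == typemax(UInt32)
    end
    return rejected
end
```

Restricting the annotation to the `Char` method mirrors the scoping choice described above: a user-defined `AbstractChar` subtype would not inherit the assertion.
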
A stress test exercises every annotated function on adversarial byte sequences and on `Char` values constructed from invalid bit-patterns to lock in the `:nothrow` soundness against future refactors.

### Effect on generated code

`textwidth("hello world")` now constant-folds:

**nightly:**
```julia-repl
julia> code_typed(() -> textwidth("hello world"), ())[1]
CodeInfo(
1 ─ %1 = invoke Base._foldl_impl(
           Base.MappingRF{typeof(textwidth), Base.BottomRF{typeof(+)}}(
               textwidth, Base.BottomRF{typeof(+)}(+))::Base.MappingRF{...},
           0::Int64,
           "hello world"::String)::Int64
└──      return %1
) => Int64
```

**this PR:**
```julia-repl
julia> code_typed(() -> textwidth("hello world"), ())[1]
CodeInfo(
1 ─     return 11
) => Int64
```

Likewise, `startswith` / `endswith` / `in` over literal `String`s now constant-fold.

**nightly:**
```julia-repl
julia> code_typed(() -> startswith("https://example.com/foo", "https://"), ())[1]
CodeInfo(
1 ─ %1 = invoke Main.startswith("https://example.com/foo"::String, "https://"::String)::Bool
└──      return %1
) => Bool

julia> code_typed(() -> endswith("foo.jl", ".jl"), ())[1]
CodeInfo(
1 ─ %1 = invoke Main.endswith("foo.jl"::String, ".jl"::String)::Bool
└──      return %1
) => Bool

julia> code_typed(() -> in('a', "aeiou"), ())[1]
CodeInfo(
1 ─ %1  = invoke Base.codeunits("aeiou"::String)::Base.CodeUnits{UInt8, String}
│   %2  = $(Expr(:gc_preserve_begin, :(%1)))
│   %3  = builtin Base.getfield(%1, :s)::String
│   %4  = invoke Base.cconvert(Ptr{UInt8}::Type{Ptr{UInt8}}, %3::String)::String
│   %5  = $(Expr(:foreigncall, :((:jl_string_ptr,)), Ptr{UInt8}, svec(Any), 0, :(:ccall), :(%4)))::Ptr{UInt8}
│   %6  = intrinsic Base.add_ptr(%5, 0x0000000000000001)::Ptr{UInt8}
│   %7  = intrinsic Base.sub_ptr(%6, 0x0000000000000001)::Ptr{UInt8}
│   %8  = $(Expr(:foreigncall, :((:memchr,)), Ptr{UInt8}, svec(Ptr{UInt8}, Int32, UInt64), 0, :(:ccall), :(%7), 97, 0x0000000000000005, 0x0000000000000005, 97, :(%7)))::Ptr{UInt8}
│         $(Expr(:gc_preserve_end, :(%2)))
│   %10 = intrinsic Core.bitcast(Core.UInt, %8)::UInt64
│   %11 = builtin (%10 === 0x0000000000000000)::Bool
└──       goto #3 if not %11
2 ─       goto #4
3 ─       goto #4
4 ┄ %15 = φ (#2 => true, #3 => false)::Bool
│   %16 = intrinsic Core.Intrinsics.not_int(%15)::Bool
└──       goto #5
5 ─       return %16
) => Bool
```

**this PR:**
```julia-repl
julia> code_typed(() -> startswith("https://example.com/foo", "https://"), ())[1]
CodeInfo(
1 ─     return true
) => Bool

julia> code_typed(() -> endswith("foo.jl", ".jl"), ())[1]
CodeInfo(
1 ─     return true
) => Bool

julia> code_typed(() -> in('a', "aeiou"), ())[1]
CodeInfo(
1 ─     return true
) => Bool
```

And `SubString{String}` queries can now be DCE'd when their result is unused. With `describe(ss)` calling `length`, `lastindex`, `textwidth`, `isascii`:

**nightly:**
```llvm
define i64 @julia_caller_0(...) {
top:
  %"sret::NamedTuple" = alloca [4 x i64], align 8
  call void @j_describe_0(ptr ... sret(...) align 8 %"sret::NamedTuple", ...)
  ret i64 42
}
```

**this PR:**
```llvm
define i64 @julia_caller_0(...) {
top:
  ret i64 42
}
```

---------

Co-authored-by: Claude <noreply@anthropic.com>