8365675: Add String.toCaseFold() to support Unicode Case-Folding #26892
+816
−28
Summary
Case folding is a key operation for case-insensitive matching (e.g., string equality, hashing, indexing, or regex matching), where the goal is to eliminate case distinctions without applying locale- or language-specific conversions.
Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:
String.equalsIgnoreCase(String)
Character.toLowerCase(int) / Character.toUpperCase(int)
String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)
Examples
Some cases where current APIs differ from Unicode case folding:
Greek sigma forms
1:M mappings, e.g. U+00DF (ß)
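The two cases above can be reproduced with APIs that already exist in the JDK. The following is a minimal sketch, not part of this PR; the class name `CurrentApiGaps` and the helper `lowerCaseEquals` are hypothetical names chosen for illustration.

```java
import java.util.Locale;

// Illustrative only: uses existing JDK APIs, not the proposed toCaseFold().
class CurrentApiGaps {
    // Hypothetical helper: the common "lowercase both sides" comparison idiom.
    static boolean lowerCaseEquals(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String sharpS = "\u00DF";                        // ß
        String odosUpper = "\u039F\u0394\u039F\u03A3";   // ΟΔΟΣ
        String odosMedial = "\u03BF\u03B4\u03BF\u03C3";  // οδοσ, ends in medial sigma U+03C3

        // 1:M mapping: ß uppercases to "SS", yet neither idiom treats "ß" and "SS" as equal.
        System.out.println(sharpS.toUpperCase(Locale.ROOT));          // SS
        System.out.println(sharpS.equalsIgnoreCase("SS"));            // false
        System.out.println(lowerCaseEquals(sharpS, "SS"));            // false

        // Greek sigma: toLowerCase applies the context-sensitive final-sigma rule,
        // so the uppercase word lowercases to a string ending in U+03C2, not U+03C3.
        System.out.println(lowerCaseEquals(odosUpper, odosMedial));   // false
        System.out.println(odosUpper.equalsIgnoreCase(odosMedial));   // true
    }
}
```

Note that `equalsIgnoreCase` handles the sigma case because it compares per-character mappings, while the lowercase idiom does not; neither handles the 1:M case.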
Motivation
Adding a direct API in the JDK aligns Java with other languages and makes Unicode-compliant caseless matching simpler and more efficient.
The New API
Usage Examples
// Kelvin sign (U+212A) is case-folded to "k"
"\u212A".toCaseFold().equals("k"); // true
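For readers on a JDK without `toCaseFold()`, the Kelvin sign behavior can be approximated with existing per-code-point mappings. This is a rough sketch under stated assumptions: `SimpleFoldSketch` and `approxSimpleFold` are hypothetical names, and the upper-then-lower mapping is only an approximation of simple case folding. It is not the implementation in this PR and does not match CaseFolding.txt for every code point.

```java
// Illustrative approximation only; NOT the proposed toCaseFold().
class SimpleFoldSketch {
    // Hypothetical helper: maps each code point through toUpperCase and then
    // toLowerCase. Both mappings are 1:1 and context-free, so there are
    // no 1:M expansions and no final-sigma handling.
    static String approxSimpleFold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .map(cp -> Character.toLowerCase(Character.toUpperCase(cp)))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Kelvin sign (U+212A) maps to ASCII "k".
        System.out.println(approxSimpleFold("\u212A").equals("k"));           // true
        // Final sigma (U+03C2) maps to medial sigma (U+03C3).
        System.out.println(approxSimpleFold("\u03C2").equals("\u03C3"));      // true
        // ß has no 1:1 mapping here, so it is left unchanged.
        System.out.println(approxSimpleFold("\u00DF").equals("\u00DF"));      // true
    }
}
```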
Performance
A JMH microbenchmark has been added (StringToCaseFold.java) to compare toCaseFold() with the commonly used toUpperCase().toLowerCase() pattern.
Results (Latin-1, BMP, and surrogate-containing inputs) show that toCaseFold() is both faster and more memory-efficient, since it requires a single pass over the data.
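For reference, the baseline pattern the benchmark compares against looks roughly like the sketch below. The names `UpperLowerPattern` and `upperThenLower` are hypothetical, and passing `Locale.ROOT` is an assumption made here to keep the result independent of the default locale; the actual benchmark code lives in StringToCaseFold.java and may differ.

```java
import java.util.Locale;

// Illustrative only: the two-pass idiom that toCaseFold() is meant to replace.
class UpperLowerPattern {
    // Hypothetical helper: one pass for toUpperCase, a second pass for toLowerCase.
    // When the input actually changes case, an intermediate String is allocated.
    static String upperThenLower(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // ß expands to "SS" in the first pass, then lowercases to "ss".
        System.out.println(upperThenLower("Stra\u00DFe"));   // strasse
        // Kelvin sign (U+212A) is already uppercase; the second pass maps it to "k".
        System.out.println(upperThenLower("\u212A"));        // k
    }
}
```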
Refs
Unicode Standard 5.18.4 Caseless Matching
Unicode® Standard Annex #44: 5.6 Case and Case Mapping
Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches
Unicode SpecialCasing.txt
Unicode CaseFolding.txt
Other Languages
Python str.casefold()
The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons.
Perl’s fc()
Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings.
Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.
Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD and "prop_invmap()" in Unicode::UCD.
ICU4J UCharacter.foldCase (Java)
Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF).
Method Signature (String based): public static String foldCase(String str, int options)
Method Signature (CharSequence & Appendable based): public static <A extends Appendable> A foldCase(CharSequence src, A dest, int options, Edits edits)
Key Features:
Case Folding: Converts a string to its case-folded equivalent.
Locale Independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings.
Context Insensitive: The mapping of a character is not affected by surrounding characters.
Turkic Option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text.
Result Length: The resulting string can be longer or shorter than the original.
Edits Recording: Allows for recording of edits for index mapping, styled text, and getting only changes.
u_strFoldCase (C/C++)
A lower-level C API function for case folding a string.
Case Folding Options: Similar options as UCharacter.foldCase for controlling case folding behavior.
Availability: Found in the ustring.h and unistr.h headers in the ICU4C library.
Progress
Issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26892/head:pull/26892
$ git checkout pull/26892
Update a local copy of the PR:
$ git checkout pull/26892
$ git pull https://git.openjdk.org/jdk.git pull/26892/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26892
View PR using the GUI difftool:
$ git pr show -t 26892
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26892.diff