8365675: Add String.toCaseFold() to support Unicode Case-Folding #26892

Draft: xuemingshen-oracle wants to merge 1 commit into master

@xuemingshen-oracle xuemingshen-oracle commented Aug 22, 2025

Summary

Case folding is a key operation for case-insensitive matching (e.g., string equality, hashing, indexing, or regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.

The JDK does not currently expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:

String.equalsIgnoreCase(String)

  • Unicode-aware, locale-independent.
  • Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
  • Limited: does not support 1:M mapping defined in Unicode case folding.

Character.toLowerCase(int) / Character.toUpperCase(int)

  • Locale-independent, single code point only.
  • No support for 1:M mappings.

String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)

  • Based on Unicode SpecialCasing.txt, supports 1:M mappings.
  • Intended primarily for presentation/display, not structural case-insensitive matching.
  • Not fully aligned with Unicode case folding rules.
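
The 1:M limitation of the per-code-point approach can be seen in a minimal sketch. The class and method names below are hypothetical, and the loop only mirrors the Character.toLowerCase(Character.toUpperCase(int)) normalization described above; it is not the JDK's actual equalsIgnoreCase implementation.

```java
class PerCodePointMatch {
    // Normalizes one code point the way the description above attributes
    // to equalsIgnoreCase: upper-case first, then lower-case.
    static int normalize(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    // Compares two strings code point by code point. Each code point maps to
    // exactly one code point, so inputs with different code point counts can
    // never match. That rules out 1:M foldings such as U+00DF to "ss".
    static boolean matches(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        if (x.length != y.length) {
            return false;
        }
        for (int i = 0; i < x.length; i++) {
            if (normalize(x[i]) != normalize(y[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Capital sigma vs. final sigma: a 1:1 mapping, so this matches.
        System.out.println(matches("\u03A3", "\u03C2")); // prints true
        // Sharp s vs. "ss": one code point vs. two, so this cannot match.
        System.out.println(matches("\u00DF", "ss"));     // prints false
    }
}
```

A per-code-point scheme handles the sigma forms because each maps to a single code point, but it has no way to express a mapping whose result is longer than its input.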

Examples
Some cases where current APIs differ from Unicode case folding:

Greek sigma forms

  • U+03A3 (Σ), U+03C2 (ς), U+03C3 (σ)
  • equalsIgnoreCase() matches correctly
  • toUpperCase().toLowerCase() does not unify final sigma (ς) with normal sigma (σ)
  • Case folding maps all forms consistently.
jshell> "ΜΙΚΡΟΣ Σ".equalsIgnoreCase("μικροσ σ")
$20 ==> true

jshell> "ΜΙΚΡΟΣ Σ".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("μικροσ σ")
$21 ==> false

1:M mappings, e.g. U+00DF (ß)

  • "ß".toUpperCase(Locale.ROOT) → "SS"
  • Case folding produces "ss", matching Unicode caseless comparison rules.
jshell> "\u00df".equalsIgnoreCase("ss")
$22 ==> false

jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
$24 ==> true

Motivation

Adding a direct API in the JDK aligns Java with other languages and makes Unicode-compliant case-less matching simpler and more efficient.

  • Unicode-compliant full case folding.
  • Simpler, stable and more efficient case-less matching without workarounds.
  • Consistency with other programming languages/libraries (Python str.casefold(), Perl fc(), ICU4J UCharacter.foldCase(), etc.).

The New API

/**
 * Returns a case-folded copy of this {@code String}, using the Unicode
 * case folding mappings defined in
 * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">
 * Unicode Case Folding Properties</a>.
 *
 * <p>Case folding is a locale-independent, language-neutral form of
 * case mapping, primarily intended for case-insensitive matching.
 * Unlike {@link #toLowerCase()} or {@link #toUpperCase()}, which are
 * designed for locale-sensitive or display-oriented transformations,
 * case folding provides a stable and consistent mapping across all
 * environments. It may include one-to-many mappings; for example,
 * the German sharp s ({@code U+00DF}) folds to the sequence
 * {@code "ss"}.
 *
 * <p>This method performs the "Full" case folding as defined in the
 * Unicode CaseFolding data file. The result is suitable for use in
 * case-insensitive string comparison, searching, or indexing.
 *
 * @apiNote
 * Case folding is intended for caseless matching, not for locale-sensitive
 * presentation. For example:
 *
 * <pre>{@code
 * String a = "Maße";
 * String b = "MASSE";
 * if (a.toCaseFold().equals(b.toCaseFold())) {
 *     // true, matches according to Unicode case-insensitive rules
 * }
 * }</pre>
 *
 * @return a {@code String} containing the case-folded form of this string
 * @see #toLowerCase()
 * @see #toUpperCase()
 * @since 26
 */
public String toCaseFold();

/**
 * A Comparator that orders {@code String} objects as by
 * {@link #compareToCaseFold(String) compareToCaseFold}.
 *
 * @since 26
 */
public static final Comparator<String> CASE_FOLD_ORDER;

/**
 * Compares two strings lexicographically using Unicode case folding.
 * <p>
 * This method returns an integer whose sign is that of calling {@code compareTo}
 * on the case folded versions of the strings.  Unicode Case folding eliminates
 * differences in case according to the Unicode Standard, using the mappings
 * defined in
 * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a>,
 * including one-to-many mappings, such as {@code "ß"} → {@code "ss"}.
 * <p>
 * Note that this method does <em>not</em> take locale into account, and may
 * produce results that differ from locale-sensitive ordering. For locale-aware
 * comparisons, use {@link java.text.Collator}.
 * @param   str   the {@code String} to be compared.
 * @return  a negative integer, zero, or a positive integer as the specified
 *          String is greater than, equal to, or less than this String,
 *          ignoring case considerations by case folding.
 * @see     java.text.Collator
 * @see     #toCaseFold()
 * @since   26
 */
public int compareToCaseFold(String str);
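
The contract above (the sign of compareTo applied to the case-folded strings) can be sketched on current JDKs with a stand-in folding function. In the sketch below, the class name, the comparatorFor helper, and the use of toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT) as the folding step are all assumptions made for illustration. That stand-in is the workaround described earlier and is not full Unicode case folding.

```java
import java.util.Comparator;
import java.util.Locale;
import java.util.function.UnaryOperator;

class FoldedOrderSketch {
    // Stand-in for case folding (an assumption, not the proposed API).
    static final UnaryOperator<String> UPPER_LOWER =
            s -> s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);

    // Builds a comparator that orders strings by compareTo on their folded forms.
    static Comparator<String> comparatorFor(UnaryOperator<String> fold) {
        return (a, b) -> fold.apply(a).compareTo(fold.apply(b));
    }

    public static void main(String[] args) {
        Comparator<String> folded = comparatorFor(UPPER_LOWER);
        String a = "Stra\u00DFe"; // contains sharp s (U+00DF)
        String b = "STRASSE";

        // Both sides become "strasse" under the stand-in, so they compare equal.
        System.out.println(folded.compare(a, b) == 0);
        // The existing char-by-char comparator cannot equate sharp s with "SS".
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare(a, b) == 0);
    }
}
```

Passing the folding step in as a parameter keeps the comparator logic separate from the choice of folding, so the same shape would apply if a real case-folding method were substituted.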

Usage Examples

// Sharp s (U+00DF) case-folds to "ss"
"straße".toCaseFold().equals("strasse");  // true

// Greek sigma variants fold consistently
"ΜΙΚΡΟΣ Σ".toCaseFold().equals("μικροσ σ");  // true

// Kelvin sign (U+212A) is case-folded to "k"
"\u212A".toCaseFold().equals("k"); // true

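On a JDK without this change, the behavior in the usage examples can be approximated with a small lookup table plus a per-code-point fallback. This is only a sketch: the class name, the fold helper, and the two-entry table are hypothetical, the table covers a tiny hand-picked subset of the full (F) mappings in CaseFolding.txt, and the fallback does not agree with Unicode case folding for every code point.

```java
import java.util.Map;

class CaseFoldSketch {
    // Hypothetical hand-picked subset of one-to-many (full) foldings.
    private static final Map<Integer, String> FULL_FOLDS = Map.of(
            0x00DF, "ss",  // LATIN SMALL LETTER SHARP S
            0xFB01, "fi"   // LATIN SMALL LIGATURE FI
    );

    // Folds a string in a single pass: table lookup for 1:M mappings,
    // otherwise an upper-then-lower mapping of the individual code point.
    static String fold(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            String full = FULL_FOLDS.get(cp);
            if (full != null) {
                out.append(full);
            } else {
                out.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp)));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("stra\u00DFe").equals("strasse")); // prints true
        System.out.println(fold("\u212A").equals("k"));            // prints true
        System.out.println(fold("Ma\u00DFe").equals(fold("MASSE"))); // prints true
    }
}
```

Unlike the toUpperCase().toLowerCase() pattern, this shape has no context-sensitive step, so a capital sigma always maps to σ regardless of its position in the word.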
Performance

A JMH microbenchmark has been added (StringToCaseFold.java) to compare toCaseFold() with the commonly used toUpperCase().toLowerCase() pattern.

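In the results below (LATIN, BMP, SUPPLEMENTARY, and MIXED inputs), toCaseFold() has higher throughput on the BMP, SUPPLEMENTARY, and MIXED datasets (roughly 1.4x, 1.4x, and 4.8x respectively), while upperLower scores higher on LATIN (6974 vs. 6065 ops/ms). Because toCaseFold() requires a single pass over the data rather than two, it is also more memory-efficient.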

Benchmark                        (dataset)   Mode  Cnt      Score     Error   Units
StringToCaseFold.caseFold            LATIN  thrpt   25   6064.804 ± 219.175  ops/ms
StringToCaseFold.caseFold              BMP  thrpt   25   8603.312 ±  10.762  ops/ms
StringToCaseFold.caseFold    SUPPLEMENTARY  thrpt   25  10607.604 ± 146.350  ops/ms
StringToCaseFold.caseFold            MIXED  thrpt   25  13587.844 ± 172.257  ops/ms
StringToCaseFold.upperLower          LATIN  thrpt   25   6974.160 ±  29.734  ops/ms
StringToCaseFold.upperLower            BMP  thrpt   25   6217.914 ±  53.833  ops/ms
StringToCaseFold.upperLower  SUPPLEMENTARY  thrpt   25   7619.315 ±  29.023  ops/ms
StringToCaseFold.upperLower          MIXED  thrpt   25   2823.679 ± 163.521  ops/ms

Refs

Unicode Standard 5.18.4 Caseless Matching
Unicode® Standard Annex #44: 5.6 Case and Case Mapping
Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches
Unicode SpecialCasing.txt
Unicode CaseFolding.txt

Other Languages

Python str.casefold()

The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons.

Perl’s fc()

Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings.
Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.
Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD and "prop_invmap()" in Unicode::UCD.

ICU4J UCharacter.foldCase (Java)

  • Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF).
  • Method signature (String based): public static String foldCase(String str, int options)
  • Method signature (CharSequence and Appendable based): public static A foldCase(CharSequence src, A dest, int options, Edits edits)

Key features:

  • Case folding: Converts a string to its case-folded equivalent.
  • Locale independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings.
  • Context insensitive: The mapping of a character is not affected by surrounding characters.
  • Turkic option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text.
  • Result length: The resulting string can be longer or shorter than the original.
  • Edits recording: Allows for recording of edits for index mapping, styled text, and getting only changes.

u_strFoldCase (C/C++)

  • A lower-level C API function for case folding a string.
  • Case folding options: Similar options as UCharacter.foldCase for controlling case folding behavior.
  • Availability: Found in the ustring.h and unistr.h headers in the ICU4C library.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8365675: Add String.toCaseFold() to support Unicode Case-Folding (Enhancement - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26892/head:pull/26892
$ git checkout pull/26892

Update a local copy of the PR:
$ git checkout pull/26892
$ git pull https://git.openjdk.org/jdk.git pull/26892/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26892

View PR using the GUI difftool:
$ git pr show -t 26892

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26892.diff


bridgekeeper bot commented Aug 22, 2025

👋 Welcome back sherman! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Aug 22, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk bot changed the title from "8365675: Add String.toCaseFold() to support Unicode case-folding" to "8365675: Add String.toCaseFold() to support Unicode Case-Folding" on Aug 22, 2025

openjdk bot commented Aug 22, 2025

@xuemingshen-oracle The following labels will be automatically applied to this pull request:

  • build
  • core-libs
  • i18n

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk bot added the build (build-dev@openjdk.org), core-libs (core-libs-dev@openjdk.org), and i18n (i18n-dev@openjdk.org) labels on Aug 22, 2025