Unicode Normalization

Unicode is generous with ways to encode “the same” string. é can be one codepoint (U+00E9, precomposed) or two (U+0065 e + U+0301 COMBINING ACUTE ACCENT, decomposed). Both render identically. Neither compares equal to the other under ==. Normalization is the step that picks a canonical form and turns this problem into a non-problem.
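
To make the two encodings visible, here is a minimal std-only sketch. The helper name `codepoints` is hypothetical and not part of ftui_text; it only lists what a string contains and performs no normalization.

```rust
/// Lists a string's codepoints as `U+XXXX` labels so the encoding is visible.
/// (Hypothetical helper for illustration; not part of ftui_text.)
fn codepoints(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Plain `==` on `&str` compares bytes, so canonically equivalent
/// strings with different encodings are reported as different.
fn byte_equal(a: &str, b: &str) -> bool {
    a == b
}
```

With these, `codepoints("\u{00E9}")` yields a single entry and `codepoints("e\u{0301}")` yields two, while `byte_equal("\u{00E9}", "e\u{0301}")` is false even though both render as é.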

ftui_text::normalization wraps the unicode-normalization crate with a small, opinionated surface.

The four forms

pub enum NormForm {
    Nfc,  // Canonical Composition (default for storage)
    Nfd,  // Canonical Decomposition (default for processing)
    Nfkc, // Compatibility Composition
    Nfkd, // Compatibility Decomposition
}

| Form | Semantics | Typical use |
|------|-----------|-------------|
| NFC  | Compose into precomposed characters where possible | storage, display |
| NFD  | Decompose into base + combining marks | text processing |
| NFKC | NFC + compatibility substitutions (e.g. ﬁ (U+FB01) → fi) | search, indexing |
| NFKD | NFD + compatibility substitutions | case-folding pipelines |

Visualized

Input:     café          (with either precomposed é or e + ◌́)

NFC:       c a f é       ← single-codepoint é (U+00E9)
NFD:       c a f e ◌́     ← base + combining mark (U+0065 U+0301)
NFKC/NFKD: identical in this case; differences show up on ﬁ (U+FB01), ﬀ (U+FB00), ①, etc.

The API

pub fn normalize(s: &str, form: NormForm) -> String;
pub fn is_normalized(s: &str, form: NormForm) -> bool;
pub fn normalize_for_search(s: &str) -> String; // NFKC + case-fold
pub fn eq_normalized(a: &str, b: &str, form: NormForm) -> bool;

pub fn nfc_iter(s: &str) -> impl Iterator<Item = char> + '_;
pub fn nfd_iter(s: &str) -> impl Iterator<Item = char> + '_;
pub fn nfkc_iter(s: &str) -> impl Iterator<Item = char> + '_;
pub fn nfkd_iter(s: &str) -> impl Iterator<Item = char> + '_;

  • normalize — allocate once and return the normalized string.
  • is_normalized — O(n) check without allocation; cheap fast path.
  • eq_normalized — normalize both sides then compare; the correct way to ask “are these the same user-perceived string?”.
  • normalize_for_search — NFKC + case-folding. Use this for search-index keys and fuzzy-search inputs.
  • *_iter — streaming normalization when you don’t want the String allocation.
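
As one possible use of the streaming iterators, two normalized streams can be compared element by element without building either String. The sketch below is std-only and generic over any char iterator; `streams_equal` is a hypothetical helper, and pairing it with `nfc_iter` is an assumption based on the signatures listed above.

```rust
/// Compares two streams of chars one element at a time, without collecting
/// either side into a String. Stops at the first mismatch.
/// (Hypothetical helper for illustration; not part of ftui_text.)
fn streams_equal<A, B>(a: A, b: B) -> bool
where
    A: Iterator<Item = char>,
    B: Iterator<Item = char>,
{
    a.eq(b)
}

// Assumed usage with the iterators listed above (not compiled here):
//     streams_equal(nfc_iter(left), nfc_iter(right))
```

This trades the two allocations that a normalize-then-compare approach needs for a single lazy pass over both inputs.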

When to normalize where

Storage → NFC

Store text in NFC when it lands in a rope, a persistent database, or a file. NFC is usually the most compact of the four forms, and it is the form most external APIs and file formats expect.

Processing → NFD

When you need to inspect combining marks, strip accents, or analyze scripts, decompose to NFD first. Then the base character and each mark are separate codepoints you can examine independently.
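
A rough sketch of accent stripping on already-decomposed text, using only std. The function name is hypothetical, and it assumes the input is in NFD and that only marks from the Combining Diacritical Marks block (U+0300..=U+036F) matter; real text can carry combining marks from other blocks.

```rust
/// Drops combining marks in U+0300..=U+036F from text that is ALREADY in NFD.
/// (Hypothetical helper; it does not normalize, so precomposed characters
/// such as U+00E9 pass through untouched.)
fn strip_combining_marks(nfd: &str) -> String {
    nfd.chars()
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .collect()
}
```

Applied to "cafe\u{0301}" this returns "cafe", while the precomposed "caf\u{00E9}" comes back unchanged, which is why decomposing first matters.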

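Search → NFKC + case-fold

Search indices should not distinguish café from cafe\u{0301}, nor ﬁle (spelled with the U+FB01 ligature) from file. Canonical equivalence is too strict for the second pair; compatibility equivalence plus case folding is the intuitive “same word” relation, and normalize_for_search applies both.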

Shaping → NFC first

The shaping layer works best with composed forms because OpenType font tables are usually keyed on precomposed glyphs. Normalize to NFC before building TextRuns.

Worked example

search_index.rs
use std::collections::HashMap;

use ftui_text::normalization::{eq_normalized, normalize_for_search, NormForm};

fn main() {
    // Build a search index keyed on the normalized form.
    // "\u{FB01}le" is "ﬁle" written with the U+FB01 ligature; it folds to the
    // same key as "file", so the later entry overwrites the earlier one.
    let docs = ["cafe", "café", "\u{FB01}le", "file"];
    let index: HashMap<String, usize> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (normalize_for_search(d), i))
        .collect();

    // User searches for "cafe"
    let query = normalize_for_search("cafe");
    assert!(index.contains_key(&query));

    // Canonical equality treats precomposed é and e + U+0301 as the same string
    assert!(eq_normalized("café", "cafe\u{0301}", NormForm::Nfc));
}

Fast-path: skip normalization when already normal

fast_path.rs
use ftui_text::normalization::{is_normalized, normalize, NormForm};

fn store(s: &str) -> String {
    if is_normalized(s, NormForm::Nfc) {
        s.to_owned() // cheap
    } else {
        normalize(s, NormForm::Nfc) // rebuild
    }
}

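For mostly-ASCII content, is_normalized is extremely fast and lets you skip the normalization pass in the common case. Note that store still copies s with to_owned; returning a Cow<'_, str> instead would avoid that allocation as well.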
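
If the copy itself matters, one option is to return a borrowed string on the fast path. The sketch below is std-only: `store_cow` is a hypothetical helper, and the check and the normalizer are passed in as closures so the pattern can be shown without assuming anything about the crate beyond the signatures listed earlier.

```rust
use std::borrow::Cow;

/// Returns the input borrowed when `already_normal` accepts it, and an owned,
/// normalized copy otherwise. (Hypothetical helper; not part of ftui_text.)
fn store_cow<'a>(
    s: &'a str,
    already_normal: impl Fn(&str) -> bool,
    normalize: impl Fn(&str) -> String,
) -> Cow<'a, str> {
    if already_normal(s) {
        Cow::Borrowed(s) // no copy, no normalization pass
    } else {
        Cow::Owned(normalize(s)) // rebuild only when needed
    }
}

// Assumed wiring with the functions documented above (not compiled here):
//     store_cow(s, |t| is_normalized(t, NormForm::Nfc), |t| normalize(t, NormForm::Nfc))
```

Callers that need a String can still call into_owned on the result, so the allocation is paid only where an owned value is actually required.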

Pitfalls

Normalization is not case folding. normalize("Café", NormForm::Nfc) is still "Café". If you want case-insensitive equality, use normalize_for_search or explicitly fold case afterward.
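
The reverse also holds: changing case does not normalize. A small std-only illustration follows; the helper name is hypothetical, and to_lowercase is used as a stand-in for real case folding, which it only approximates.

```rust
/// Compares two strings after lowercasing only, with no normalization.
/// (Hypothetical helper; `to_lowercase` is a stand-in for case folding.)
fn lowercase_only_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}
```

`lowercase_only_eq("Caf\u{00E9}", "caf\u{00E9}")` is true, but `lowercase_only_eq("CAFE\u{0301}", "caf\u{00E9}")` is false, because lowercasing leaves the decomposed é as two codepoints. Both steps are needed for the “same word” comparison.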

NFKC loses information. ① becomes 1, ﬁ (U+FB01) becomes fi. That’s great for search, terrible for display or storage. Don’t NFKC your rope.

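Normalization changes length. After normalizing, the char count and byte count can both shift, and the compatibility forms (NFKC, NFKD) can change the grapheme count as well, for example when ﬁ (U+FB01) expands to fi. If you cache indices into the string, recompute them after normalizing.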
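
The size difference is easy to check with std alone. In this sketch `lengths` is a hypothetical helper and the two inputs are written out by hand in their NFC and NFD spellings rather than produced by a normalizer.

```rust
/// Returns (byte length, char count) for a string.
/// (Hypothetical helper for illustration; not part of ftui_text.)
fn lengths(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For the NFC spelling "caf\u{00E9}" this gives (5, 4); for the NFD spelling "cafe\u{0301}" it gives (6, 5). A byte offset or char index cached against one form therefore points at the wrong place in the other.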
