Unicode normalization forms NFC / NFD / NFKC / NFKD: filename comparison, search, and identifier equality

“Same but the comparison says they’re different.” “Created the file on macOS, copied it to Linux, and ls | grep won’t match it.” Both are Unicode normalization-form mismatches. This article works through what each of the four forms does, which to use for which task, and where the cross-platform mismatches actually bite.

The four forms

Unicode lets the same visible character be represented by more than one code-point sequence. For example, が has two representations:

が (U+304C)               ← single code point (composed)
が = か (U+304B) + ◌゙ (U+3099)   ← two code points (decomposed)
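The two spellings can be built in Python with explicit escapes and compared directly. This is a minimal sketch using only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

composed = "\u304c"          # が as a single code point
decomposed = "\u304b\u3099"  # か followed by the combining voiced sound mark

# The strings render identically but are different code-point sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# Normalizing both sides to the same form makes them comparable.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(same_after_nfc)  # True
```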

Normalization eliminates this ambiguity. Four forms are defined in UAX #15:

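Form | Operation | Above example | Length tendency
-----|-----------|---------------|----------------
NFC | canonical decomposition + canonical recomposition | が (1 cp) | usually shorter
NFD | canonical decomposition only | か + ◌゙ (2 cp) | usually longer
NFKC | compatibility decomposition + canonical recomposition | same as NFC, plus variant folding | similar to NFC; varies with what gets folded
NFKD | compatibility decomposition only | same as NFD, plus variant folding | typically longest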

Two axes:

  • Canonical vs Compatibility: compatibility folds visually-distinct variants into common forms (fullwidth A → ASCII A, ㈱ → (株)).
  • Composed vs Decomposed: composed (C) is shorter; decomposed (D) is longer but exposes the constituent parts.
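To make the two axes concrete, the sketch below runs three sample characters through all four forms. It assumes nothing beyond Python's standard unicodedata module; the helper name all_forms and the sample labels are made up for illustration.

```python
import unicodedata

samples = {
    "fullwidth A": "\uff21",          # compatibility variant of ASCII "A"
    "parenthesized stock": "\u3231",  # ㈱, compatibility variant of (株)
    "ga": "\u304c",                   # が, has a canonical decomposition
}

def all_forms(text: str) -> dict:
    """Return the text in each of the four normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }

for label, text in samples.items():
    # Print code points so that invisible differences become visible.
    shown = {
        form: [f"U+{ord(ch):04X}" for ch in value]
        for form, value in all_forms(text).items()
    }
    print(label, shown)
```

The canonical forms leave the fullwidth letter and ㈱ untouched, while the K forms fold them; the C/D split only shows up on が.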

NFC: the storage and transit default

For most use cases, this is the answer.

  • W3C recommends NFC for HTML, XML, URLs, and any web content.
  • Most database collations (PostgreSQL, SQL Server) expect NFC by default.
  • JSON string values and HTTP header values are safest in NFC.

If your code emits NFC, downstream systems can assume it’s normalized.
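One way to hold that line is to normalize once, at the point where text enters the system. The following is a sketch under that assumption; to_storage_form is a hypothetical helper name, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata

def to_storage_form(text: str) -> str:
    """Return text in NFC, skipping the work if it is already normalized."""
    # is_normalized is a cheap check that avoids building a new string.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

incoming = "Cafe\u0301"             # decomposed é, as a macOS client might send it
stored = to_storage_form(incoming)  # "Caf\u00e9", four code points
print(len(incoming), len(stored))   # 5 4
```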

NFD: macOS filenames

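macOS has a long history of decomposed filenames. HFS+ stores names in an Apple variant of NFD (the form iconv calls UTF-8-MAC), which leaves a few code-point ranges composed. APFS stores whatever form it is given and only compares names normalization-insensitively, so decomposed names that originated on older volumes or came through decomposing APIs keep circulating. This is the classic cross-platform interop trap: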

Create "がっこう.txt" on macOS
→ filesystem byte sequence: か + ◌゙ + っ + こ + う (NFD)

rsync to Linux
→ ext4 stores filenames as raw bytes
→ typing "がっこう.txt" in your terminal (IME emits NFC) names a *different* file

The same problem hits GitHub, rsync, zip/tar, Docker images — anywhere filenames travel.
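The mismatch, and the usual way around it when comparing names in code, can be sketched without touching a real filesystem. The listing below is a stand-in for what a directory read might return; find_matching_names is a hypothetical helper.

```python
import unicodedata

def nfc(name: str) -> str:
    return unicodedata.normalize("NFC", name)

def find_matching_names(listing, wanted):
    """Return entries from listing that equal wanted after NFC normalization."""
    target = nfc(wanted)
    return [name for name in listing if nfc(name) == target]

# Names as they might arrive from a macOS-created archive: が is か + U+3099.
listing = ["\u304b\u3099\u3063\u3053\u3046.txt", "notes.txt"]
typed = "\u304c\u3063\u3053\u3046.txt"  # what an IME typically produces (NFC)

print(typed in listing)                     # False: raw comparison misses it
print(find_matching_names(listing, typed))  # finds the decomposed entry
```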

Fixes

  • Set git config core.precomposeUnicode true (Git’s macOS option, on by default in newer versions).
  • rsync with --iconv=utf-8-mac,utf-8 (the order is LOCAL,REMOTE, so this is the form to use when running rsync on the Mac).
  • Bulk-convert on the Linux side: convmv -r -f utf8 -t utf8 --nfc --notest . (run it without --notest first; convmv then only prints what it would rename).
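Where none of those tools is available, a dry-run pass in Python can at least report which names would change. This is a rough sketch, not a replacement for convmv: plan_nfc_renames is a hypothetical helper, it only plans and never renames, and it assumes it runs on a filesystem that keeps names as raw bytes (such as ext4).

```python
import os
import tempfile
import unicodedata

def plan_nfc_renames(root):
    """Return (old_path, new_path) pairs for entries whose names are not NFC."""
    plan = []
    # Walk bottom-up so children are listed before their parent directories;
    # applying the plan in order then never invalidates a later path.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            normalized = unicodedata.normalize("NFC", name)
            if normalized != name:
                plan.append((os.path.join(dirpath, name),
                             os.path.join(dirpath, normalized)))
    return plan

# Demonstration in a throwaway directory with one decomposed filename.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "\u304b\u3099.txt"), "w").close()
    planned_names = [
        (os.path.basename(old), os.path.basename(new))
        for old, new in plan_nfc_renames(root)
    ]

print(planned_names)
```

Applying the plan would be a loop of os.rename calls, with a check that the target does not already exist.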

NFKC: search indexes and identifier comparison

Compatibility folding makes NFKC the right choice when you want to collapse visual variants:

Café        (NFC: 4 cp)
Cafe + ◌́   (NFD: 5 cp, decomposed)
Ｃａｆé     ← fullwidth letters might also arrive

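Search engines (Elasticsearch, MeiliSearch, Algolia) typically normalize at index and query time, often with NFKC plus case folding (Elasticsearch's ICU analysis plugin, for example, defaults its normalizer to nfkc_cf), so that “Café”, fullwidth “Ｃａｆé”, and the decomposed “Cafe + ◌́” all match the same query. Matching plain “Cafe” as well requires accent stripping on top, which NFKC does not do.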

Concrete NFKC use cases

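  • Reducing homograph risk on usernames and emails (visually similar characters used to impersonate accounts). NFKC covers compatibility variants such as fullwidth letters and ligatures; cross-script lookalikes (Cyrillic а vs Latin a) need a separate confusables check such as the one in UTS #39.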
  • Hashtag / slug uniqueness.
  • Searching across kanji variants: normalization folds CJK compatibility ideographs (U+F900 to U+FAFF) onto their unified forms, but it leaves variation selectors (IVS) in place, so those have to be stripped separately.
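For the username and slug cases, a comparison key is typically derived from the name rather than stored in its place. The sketch below assumes a simple policy of NFKC, case folding, then NFKC again; identifier_key is a hypothetical name, and real systems often add script restrictions on top.

```python
import unicodedata

def identifier_key(name: str) -> str:
    """Derive a comparison key: NFKC, case-fold, then NFKC again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    # Case folding can produce sequences that are no longer normalized.
    return unicodedata.normalize("NFKC", folded)

taken = {identifier_key("admin")}

fullwidth = "\uff41\uff44\uff4d\uff49\uff4e"  # "admin" typed in fullwidth letters
print(identifier_key(fullwidth) in taken)      # True: collides with "admin"

cyrillic = "\u0430dmin"                        # first letter is Cyrillic а
print(identifier_key(cyrillic) in taken)       # False: NFKC does not catch this
```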

NFKC pitfall: information loss

NFKC destroys the original presentation information. Treating ㈱ as identical to (株) is exactly what NFKC does, but it also erases the writer’s intent in choosing ㈱. Don’t apply NFKC to the data you store as the source of truth; use it for derived structures (search indexes, comparison keys).
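In practice that means keeping the original string and indexing it under a derived key. A toy in-memory version, with the index layout and function names chosen purely for illustration:

```python
import unicodedata
from collections import defaultdict

def search_key(text: str) -> str:
    """Lossy key used only for lookup, never for display."""
    return unicodedata.normalize("NFKC", text).casefold()

index = defaultdict(list)  # derived key -> original strings

def add(text: str) -> None:
    original = unicodedata.normalize("NFC", text)  # NFC keeps ㈱ intact
    index[search_key(original)].append(original)

add("\u3231\u30a2\u30af\u30e1")    # ㈱アクメ
add("(\u682a)\u30a2\u30af\u30e1")  # (株)アクメ

# Either spelling finds both records, and each keeps its own presentation.
hits = index[search_key("(\u682a)\u30a2\u30af\u30e1")]
print(hits)
```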

NFKD: accent stripping and ASCII-fold search

Converting Café to Cafe:

str
	.normalize('NFKD') // 'Café' → 'Cafe' + '◌́' (U+0301)
	.replace(/\p{M}/gu, '') // strip combining marks; \p{…} requires the u flag
	.toLowerCase(); // 'cafe'

NFKD splits combining marks off; the \p{M} (Mark category) regex then drops them. Useful as preprocessing for old-style ASCII LIKE searches.
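The same pipeline in Python, as a sketch that uses the general category from unicodedata instead of a regex (the function name strip_accents is illustrative). It also shows a limit of the technique: letters with no decomposition pass through unchanged.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose with NFKD, drop combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")  # Mn, Mc, Me
    )
    return without_marks.lower()

print(strip_accents("Caf\u00e9"))   # cafe
print(strip_accents("S\u00f8ren"))  # søren: ø has no decomposition, so it survives
```

If the target really is pure ASCII, characters such as ø or ł need an explicit mapping table after this step.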

Same input, all four forms

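Input Café㈱がＡ (precomposed é, trailing fullwidth Ａ):

Form | Result | Code points | UTF-8 bytes
-----|--------|-------------|------------
NFC | Café㈱がＡ | 7 | 14
NFD | Café㈱がＡ (é and が decomposed) | 9 | 18
NFKC | Café(株)がA | 9 | 14
NFKD | Café(株)がA (é and が decomposed) | 11 | 18

NFKC turns ㈱ into (株) and fullwidth Ａ into ASCII A. NFD and NFKD each add two code points by splitting é and が. The NFKC byte count happens to equal NFC's here: folding Ａ to ASCII saves two bytes, while expanding ㈱ into three characters costs two.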
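Counts like these are easy to check mechanically. The sketch below assumes the input uses a precomposed é (U+00E9) and ends in a fullwidth Ａ (U+FF21); measure is a hypothetical helper that returns code points and UTF-8 bytes for a given form.

```python
import unicodedata

text = "Caf\u00e9\u3231\u304c\uff21"  # Café㈱が followed by fullwidth A

def measure(form: str, s: str):
    """Return (code points, UTF-8 bytes) of s in the given normalization form."""
    normalized = unicodedata.normalize(form, s)
    return len(normalized), len(normalized.encode("utf-8"))

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    code_points, utf8_bytes = measure(form, text)
    print(f"{form:5} {code_points:3} code points {utf8_bytes:3} bytes")
```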

Cheat-sheet by use case

Situation | Form
----------|-----
Storing user input | NFC
Generating HTTP bodies / JSON values | NFC
Search index tokenization | NFKC
Username / slug uniqueness | NFKC
Accent stripping | NFKD + \p{M} removal
macOS-vs-Linux filename comparison | Normalize both sides to NFC before comparing
Regex preprocessing (so \p{M} can match the marks) | NFD (in NFC the marks are folded into precomposed characters)
Strings being hashed or signed | Whatever the protocol specifies; default to NFC if it specifies nothing

API in each language

// JavaScript
'café'.normalize('NFC');

# Python
import unicodedata
unicodedata.normalize('NFC', 'café')

// Java
import java.text.Normalizer;
Normalizer.normalize(s, Normalizer.Form.NFC);

// Go (golang.org/x/text/unicode/norm)
import "golang.org/x/text/unicode/norm"
norm.NFC.String("café")

// Rust (unicode-normalization crate)
use unicode_normalization::UnicodeNormalization;
"café".nfc().collect::<String>()

Standard-library support varies: JavaScript, Python, and Java ship normalization built in; Go provides it in golang.org/x/text (maintained by the Go project but outside the standard library); Rust needs an external crate.

Summary

Picking a normalization form is two decisions: preserve presentation or collapse variants, and composed or decomposed. The defaults are NFC for storage, NFKC for search indexes, NFKD for accent stripping. Most cross-platform “same string but they don’t match” bugs trace to NFC vs NFD mismatch, often originating on macOS.
