Unicode normalization forms NFC / NFD / NFKC / NFKD: filename comparison, search, and identifier equality
“Same が but the comparison says they’re different.” “Created the file on macOS, copied it to Linux, and ls | grep won’t match it.” Both are Unicode normalization-form mismatches. This article works through what each of the four forms does, which to use for which task, and where the cross-platform mismatches actually bite.
The four forms
Unicode lets the same visible character be represented by multiple different code-point sequences. For example, が has two encodings:
- が (U+304C) ← single code point (composed)
- が = か (U+304B) + ◌゙ (U+3099) ← two code points (decomposed)

Normalization eliminates this ambiguity. Four forms are defined in UAX #15:
| Form | Operation | Above example | Length tendency |
|---|---|---|---|
| NFC | canonical decomp + canonical recomp | が (1 cp) | usually shorter |
| NFD | canonical decomp only | が = か + ◌゙ (2 cp) | usually longer |
| NFKC | compatibility decomp + canonical recomp | NFC + variant folding | similar to / shorter than NFC |
| NFKD | compatibility decomp only | NFD + variant folding | typically longest |
Two axes:
- Canonical vs Compatibility: compatibility folds visually-distinct variants into common forms (fullwidth A → ASCII A, ㈱ → (株)).
- Composed vs Decomposed: composed (C) is shorter; decomposed (D) is longer but exposes the constituent parts.
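The composed/decomposed split is easy to see directly. A quick sketch with Python's built-in `unicodedata` module, using the が example from above:

```python
import unicodedata

composed = "\u304C"            # が as a single code point
decomposed = "\u304B\u3099"    # か + combining voiced sound mark

# Raw code-point comparison says they differ, even though they render identically
print(composed == decomposed)                                  # False

# Normalizing either side to a common form makes them equal
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```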
NFC: the storage and transit default
For most use cases, this is the answer.
- W3C recommends NFC for HTML, XML, URLs, and any web content.
- Databases (PostgreSQL, SQL Server, and others) generally compare strings byte-for-byte without normalizing, so mixed forms silently create lookalike "duplicates"; feeding them consistent NFC avoids this.
- JSON string values and HTTP header values are safest in NFC.
If your code emits NFC, downstream systems can assume it’s normalized.
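Normalizing at the input boundary is cheap: Python 3.8+ ships `unicodedata.is_normalized`, which lets you skip the copy when the string is already NFC. A minimal sketch (the helper name `to_nfc` is illustrative):

```python
import unicodedata

def to_nfc(s: str) -> str:
    # No-op for already-normalized input; otherwise produce the NFC form
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)

# NFD-style input, e.g. a filename that came from macOS tooling
decomposed = "\u304B\u3099っこう"     # か + ◌゙ + っこう
print(to_nfc(decomposed) == "がっこう")  # True
```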
NFD: macOS filenames
macOS's HFS+ stores filenames in a slightly modified NFD (a few ranges are left composed). APFS preserves the bytes you hand it but, since macOS 10.13, treats differently normalized names as the same file, and names produced by macOS tooling still typically arrive decomposed. This is the classic cross-platform interop trap:
```text
Create "がっこう.txt" on macOS
  → filesystem byte sequence: か + ◌゙ + っ + こ + う (NFD)
rsync to Linux
  → ext4 stores filenames as raw bytes
  → typing "がっこう.txt" in your terminal (IME emits NFC) names a *different* file
```

The same problem hits GitHub, rsync, zip/tar, Docker images: anywhere filenames travel.
Fixes
- Set `git config core.precomposeUnicode true` (Git's macOS option, on by default in newer versions).
- rsync with `--iconv=utf-8-mac,utf-8`.
- Bulk-convert: `find . -depth -execdir convmv -f utf8-mac -t utf8 --notest {} \;`.
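When you can't fix the filesystem, fix the comparison: normalize both sides before matching. A sketch (the helper name `find_file` is illustrative):

```python
import os
import unicodedata

def find_file(directory: str, wanted: str):
    # Compare NFC-to-NFC so a macOS-decomposed on-disk name
    # matches the composed name your IME produces
    target = unicodedata.normalize("NFC", wanted)
    for name in os.listdir(directory):
        if unicodedata.normalize("NFC", name) == target:
            return name  # the on-disk spelling, whatever form it uses
    return None
```

The same trick works for `in` checks against any list of filenames; the point is that both operands pass through the same form before comparison.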
NFKC: search indexes and identifier comparison
Compatibility folding makes NFKC the right choice when you want to collapse visual variants:
```text
Café        (NFC: 4 cp)
Cafe + ◌́    (NFD: 5 cp, decomposed)
Ｃａｆé         (a fullwidth variant might also arrive)
```

Search engines (Elasticsearch, MeiliSearch, Algolia) typically normalize tokens with NFKC so that "Café," "Cafe," and "cafe" all match the same query.
Concrete NFKC use cases
- Homograph attack prevention on usernames and emails (visually-similar characters used to impersonate accounts).
- Hashtag / slug uniqueness.
- Searching across kanji variant selectors (IVS) — NFKC doesn’t unify all of them but does fold the common compatibility pairs.
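For username uniqueness, the usual pattern is to derive a comparison key with NFKC plus case folding while storing the original string untouched. A sketch (the helper name `identifier_key` is illustrative):

```python
import unicodedata

def identifier_key(name: str) -> str:
    # Comparison key only -- store and display the original string separately
    return unicodedata.normalize("NFKC", name).casefold()

# Fullwidth "Ａｄｍｉｎ" collapses onto plain "admin",
# closing one easy impersonation route
print(identifier_key("\uFF21\uFF44\uFF4D\uFF49\uFF4E") == identifier_key("admin"))  # True
```

Enforce uniqueness on the key column, not on the display name.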
NFKC pitfall: information loss
NFKC destroys the original presentation information. Treating ㈱ as identical to (株) is exactly what NFKC does, but it also erases the writer’s intent in choosing ㈱. Don’t use NFKC on canonical stored data; use it for derived structures (search indexes, comparison keys).
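The loss is one-way and can change meaning, not just presentation. Two small examples (Python's built-in `unicodedata`):

```python
import unicodedata

# Presentation lost: nothing in the output records that ㈱ was ever a single glyph
print(unicodedata.normalize("NFKC", "\u3231"))   # (株)

# Meaning lost: the superscript in x² folds to a plain digit
print(unicodedata.normalize("NFKC", "x\u00B2"))  # x2
```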
NFKD: accent stripping and ASCII-fold search
Converting Café to Cafe:
```javascript
str
  .normalize('NFKD')        // 'Cafe' + '◌́' (marks split off)
  .replace(/\p{M}/gu, '')   // strip combining marks (Mark category)
  .toLowerCase();           // 'cafe'
```

NFKD splits combining marks off; the `\p{M}` (Mark category) property escape then drops them. Useful as preprocessing for old-style ASCII `LIKE` searches.
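The same pipeline in Python uses `unicodedata.combining()` to detect marks instead of a regex. A sketch (the helper name `ascii_fold` is illustrative; note that a few marks carry combining class 0 and would survive this filter, which is fine for ordinary accents):

```python
import unicodedata

def ascii_fold(s: str) -> str:
    # NFKD splits accents into separate combining characters ...
    decomposed = unicodedata.normalize("NFKD", s)
    # ... which combining() identifies (nonzero class), so we drop them
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

print(ascii_fold("Café"))  # cafe
```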
Same input, all four forms
Input Café㈱がＡ (the trailing Ａ is fullwidth, U+FF21):

| Form | Result | Code points | UTF-8 bytes |
|---|---|---|---|
| NFC | Café㈱がＡ | 7 | 14 |
| NFD | Café㈱がＡ (é and が decomposed) | 9 | 18 |
| NFKC | Café(株)がA | 9 | 14 |
| NFKD | Cafe(株)かA (further decomposed) | 11 | 18 |
NFKC turns ㈱ into (株) and fullwidth Ａ into ASCII A. The decomposed forms raise the code-point count; NFKC's count also rises, because ㈱ expands to three characters, while its byte total stays level with NFC (the fullwidth-to-ASCII fold saves exactly the bytes the ㈱ expansion adds).
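These counts can be reproduced directly with Python's built-in `unicodedata`:

```python
import unicodedata

# Café ㈱ が fullwidth-Ａ, written with explicit escapes so the source is unambiguously NFC
s = "Caf\u00E9\u3231\u304C\uFF21"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    t = unicodedata.normalize(form, s)
    print(form, len(t), len(t.encode("utf-8")))
    # NFC 7 14 / NFD 9 18 / NFKC 9 14 / NFKD 11 18
```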
Cheat-sheet by use case
| Situation | Form |
|---|---|
| Storing user input | NFC |
| Generating HTTP bodies / JSON values | NFC |
| Search index tokenization | NFKC |
| Username / slug uniqueness | NFKC |
| Accent stripping | NFKD + \p{M} removal |
| macOS-vs-Linux filename comparison | Normalize both sides to NFC before comparing |
| Regex preprocessing (so \p{M} works) | NFD (combining marks are otherwise hidden) |
| Strings being hashed for crypto | Whatever the protocol specifies; NFC if unspecified |
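The hashing row deserves a demonstration, since the failure is invisible: two renderings of the same text produce different digests unless you normalize first. A sketch with `hashlib`:

```python
import hashlib
import unicodedata

a = "Caf\u00E9"       # "Café" in NFC
b = "Cafe\u0301"      # the same text in NFD; renders identically

# Same visible string, different SHA-256 digests
print(hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest())  # False

# Normalize both sides before hashing and the digests agree
na, nb = (unicodedata.normalize("NFC", s) for s in (a, b))
print(hashlib.sha256(na.encode()).digest() == hashlib.sha256(nb.encode()).digest())  # True
```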
API in each language
```javascript
// JavaScript
'café'.normalize('NFC');
```

```python
# Python
import unicodedata
unicodedata.normalize('NFC', 'café')
```

```java
// Java
java.text.Normalizer.normalize(s, Normalizer.Form.NFC);
```

```go
// Go (golang.org/x/text/unicode/norm)
norm.NFC.String("café")
```

```rust
// Rust (unicode-normalization crate)
use unicode_normalization::UnicodeNormalization;
"café".nfc().collect::<String>()
```

Note that standard-library status varies: Go ships normalization in golang.org/x/text (a sibling repo, not stdlib); Rust requires an external crate.
Summary
Picking a normalization form is two decisions: preserve presentation or collapse variants, and composed or decomposed. The defaults are NFC for storage, NFKC for search indexes, NFKD for accent stripping. Most cross-platform “same string but they don’t match” bugs trace to NFC vs NFD mismatch, often originating on macOS.
An online Unicode normalizer tool will show all four forms side by side for any input.