Unicode normalization forms NFC / NFD / NFKC / NFKD: filename comparison, search, and identifier equality
“Same が but the comparison says they’re different.” “Created the file on macOS, copied it to Linux, and ls | grep won’t match it.” Both are Unicode normalization-form mismatches. This article works through what each of the four forms does, which to use for which task, and where the cross-platform mismatches actually bite.
The four forms
Unicode lets the same visible character be represented by multiple different code-point sequences. For example, が has two encodings:
- が (U+304C) ← single code point (composed)
- が = か (U+304B) + ◌゙ (U+3099) ← two code points (decomposed)

Normalization eliminates this ambiguity. Four forms are defined in UAX #15:
| Form | Operation | Above example | Length tendency |
|---|---|---|---|
| NFC | canonical decomp + canonical recomp | が (1 cp) | usually shorter |
| NFD | canonical decomp only | が = か + ◌゙ (2 cp) | usually longer |
| NFKC | compatibility decomp + canonical recomp | NFC + variant folding | similar to / shorter than NFC |
| NFKD | compatibility decomp only | NFD + variant folding | typically longest |
Two axes:
- Canonical vs Compatibility: compatibility folds visually-distinct variants into common forms (fullwidth A → ASCII A, ㈱ → (株)).
- Composed vs Decomposed: composed (C) is shorter; decomposed (D) is longer but exposes the constituent parts.
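The composed/decomposed split is easy to see directly. A quick sketch with Python's built-in `unicodedata` module, using the が example from above:

```python
import unicodedata

composed = "\u304C"            # が as a single code point
decomposed = "\u304B\u3099"    # か + combining voiced sound mark

# Raw code-point comparison says they differ, even though they render identically
print(composed == decomposed)                                  # False

# Normalizing either side to a common form makes them equal
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```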
NFC: the storage and transit default
For most use cases, this is the answer.
- W3C recommends NFC for HTML, XML, URLs, and any web content.
- Databases (PostgreSQL, SQL Server, and others) generally compare strings byte-for-byte without normalizing, so mixed forms silently create lookalike "duplicates"; feeding them consistent NFC avoids this.
- JSON string values and HTTP header values are safest in NFC.
If your code emits NFC, downstream systems can assume it’s normalized.
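Normalizing at the input boundary is cheap: Python 3.8+ ships `unicodedata.is_normalized`, which lets you skip the copy when the string is already NFC. A minimal sketch (the helper name `to_nfc` is illustrative):

```python
import unicodedata

def to_nfc(s: str) -> str:
    # No-op for already-normalized input; otherwise produce the NFC form
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)

# NFD-style input, e.g. a filename that came from macOS tooling
decomposed = "\u304B\u3099っこう"     # か + ◌゙ + っこう
print(to_nfc(decomposed) == "がっこう")  # True
```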
NFD: macOS filenames
macOS's HFS+ stores filenames in a slightly modified NFD (a few ranges are left composed). APFS preserves the bytes you hand it but, since macOS 10.13, treats differently normalized names as the same file, and names produced by macOS tooling still typically arrive decomposed. This is the classic cross-platform interop trap:
```text
Create "がっこう.txt" on macOS
  → filesystem byte sequence: か + ◌゙ + っ + こ + う (NFD)
rsync to Linux
  → ext4 stores filenames as raw bytes
  → typing "がっこう.txt" in your terminal (IME emits NFC) names a *different* file
```

The same problem hits GitHub, rsync, zip/tar, Docker images: anywhere filenames travel.
Fixes
- Set `git config core.precomposeUnicode true` (Git's macOS option, on by default in newer versions).
- rsync with `--iconv=utf-8-mac,utf-8`.
- Bulk-convert: `find . -depth -execdir convmv -f utf8-mac -t utf8 --notest {} \;`.
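When you can't fix the filesystem, fix the comparison: normalize both sides before matching. A sketch (the helper name `find_file` is illustrative):

```python
import os
import unicodedata

def find_file(directory: str, wanted: str):
    # Compare NFC-to-NFC so a macOS-decomposed on-disk name
    # matches the composed name your IME produces
    target = unicodedata.normalize("NFC", wanted)
    for name in os.listdir(directory):
        if unicodedata.normalize("NFC", name) == target:
            return name  # the on-disk spelling, whatever form it uses
    return None
```

The same trick works for `in` checks against any list of filenames; the point is that both operands pass through the same form before comparison.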
NFKC: search indexes and identifier comparison
Compatibility folding makes NFKC the right choice when you want to collapse visual variants:
```text
Café        (NFC: 4 cp)
Cafe + ◌́    (NFD: 5 cp, decomposed)
Ｃａｆé         (a fullwidth variant might also arrive)
```

Search engines (Elasticsearch, MeiliSearch, Algolia) typically normalize tokens with NFKC so that "Café," "Cafe," and "cafe" all match the same query.
Concrete NFKC use cases
- Homograph attack prevention on usernames and emails (visually-similar characters used to impersonate accounts).
- Hashtag / slug uniqueness.
- Searching across kanji variant selectors (IVS) — NFKC doesn’t unify all of them but does fold the common compatibility pairs.
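For username uniqueness, the usual pattern is to derive a comparison key with NFKC plus case folding while storing the original string untouched. A sketch (the helper name `identifier_key` is illustrative):

```python
import unicodedata

def identifier_key(name: str) -> str:
    # Comparison key only -- store and display the original string separately
    return unicodedata.normalize("NFKC", name).casefold()

# Fullwidth "Ａｄｍｉｎ" collapses onto plain "admin",
# closing one easy impersonation route
print(identifier_key("\uFF21\uFF44\uFF4D\uFF49\uFF4E") == identifier_key("admin"))  # True
```

Enforce uniqueness on the key column, not on the display name.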
NFKC pitfall: information loss
NFKC destroys the original presentation information. Treating ㈱ as identical to (株) is exactly what NFKC does, but it also erases the writer’s intent in choosing ㈱. Don’t use NFKC on canonical stored data; use it for derived structures (search indexes, comparison keys).
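The loss is one-way and can change meaning, not just presentation. Two small examples (Python's built-in `unicodedata`):

```python
import unicodedata

# Presentation lost: nothing in the output records that ㈱ was ever a single glyph
print(unicodedata.normalize("NFKC", "\u3231"))   # (株)

# Meaning lost: the superscript in x² folds to a plain digit
print(unicodedata.normalize("NFKC", "x\u00B2"))  # x2
```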
NFKD: accent stripping and ASCII-fold search
Converting Café to Cafe:
```javascript
str
  .normalize('NFKD')        // 'Cafe' + '◌́' (marks split off)
  .replace(/\p{M}/gu, '')   // strip combining marks (Mark category)
  .toLowerCase();           // 'cafe'
```

NFKD splits combining marks off; the `\p{M}` (Mark category) property escape then drops them. Useful as preprocessing for old-style ASCII `LIKE` searches.
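The same pipeline in Python uses `unicodedata.combining()` to detect marks instead of a regex. A sketch (the helper name `ascii_fold` is illustrative; note that a few marks carry combining class 0 and would survive this filter, which is fine for ordinary accents):

```python
import unicodedata

def ascii_fold(s: str) -> str:
    # NFKD splits accents into separate combining characters ...
    decomposed = unicodedata.normalize("NFKD", s)
    # ... which combining() identifies (nonzero class), so we drop them
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

print(ascii_fold("Café"))  # cafe
```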
Same input, all four forms
Input Café㈱がＡ (the trailing Ａ is fullwidth, U+FF21):

| Form | Result | Code points | UTF-8 bytes |
|---|---|---|---|
| NFC | Café㈱がＡ | 7 | 14 |
| NFD | Café㈱がＡ (é and が decomposed) | 9 | 18 |
| NFKC | Café(株)がA | 9 | 14 |
| NFKD | Cafe(株)かA (further decomposed) | 11 | 18 |
NFKC turns ㈱ into (株) and fullwidth Ａ into ASCII A. The decomposed forms raise the code-point count; NFKC's count also rises, because ㈱ expands to three characters, while its byte total stays level with NFC (the fullwidth-to-ASCII fold saves exactly the bytes the ㈱ expansion adds).
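These counts can be reproduced directly with Python's built-in `unicodedata`:

```python
import unicodedata

# Café ㈱ が fullwidth-Ａ, written with explicit escapes so the source is unambiguously NFC
s = "Caf\u00E9\u3231\u304C\uFF21"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    t = unicodedata.normalize(form, s)
    print(form, len(t), len(t.encode("utf-8")))
    # NFC 7 14 / NFD 9 18 / NFKC 9 14 / NFKD 11 18
```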
Cheat-sheet by use case
| Situation | Form |
|---|---|
| Storing user input | NFC |
| Generating HTTP bodies / JSON values | NFC |
| Search index tokenization | NFKC |
| Username / slug uniqueness | NFKC |
| Accent stripping | NFKD + \p{M} removal |
| macOS-vs-Linux filename comparison | Normalize both sides to NFC before comparing |
| Regex preprocessing (so \p{M} works) | NFD (combining marks are otherwise hidden) |
| Strings being hashed for crypto | Whatever the protocol specifies; NFC if unspecified |
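The hashing row deserves a demonstration, since the failure is invisible: two renderings of the same text produce different digests unless you normalize first. A sketch with `hashlib`:

```python
import hashlib
import unicodedata

a = "Caf\u00E9"       # "Café" in NFC
b = "Cafe\u0301"      # the same text in NFD; renders identically

# Same visible string, different SHA-256 digests
print(hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest())  # False

# Normalize both sides before hashing and the digests agree
na, nb = (unicodedata.normalize("NFC", s) for s in (a, b))
print(hashlib.sha256(na.encode()).digest() == hashlib.sha256(nb.encode()).digest())  # True
```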
API in each language
```javascript
// JavaScript
'café'.normalize('NFC');
```

```python
# Python
import unicodedata
unicodedata.normalize('NFC', 'café')
```

```java
// Java
java.text.Normalizer.normalize(s, Normalizer.Form.NFC);
```

```go
// Go (golang.org/x/text/unicode/norm)
norm.NFC.String("café")
```

```rust
// Rust (unicode-normalization crate)
use unicode_normalization::UnicodeNormalization;
"café".nfc().collect::<String>()
```

Note that standard-library status varies: Go ships normalization in golang.org/x/text (a sibling repo, not stdlib); Rust requires an external crate.
Summary
Picking a normalization form is two decisions: preserve presentation or collapse variants, and composed or decomposed. The defaults are NFC for storage, NFKC for search indexes, NFKD for accent stripping. Most cross-platform “same string but they don’t match” bugs trace to NFC vs NFD mismatch, often originating on macOS.
An online Unicode normalizer tool will show all four forms side by side for any input.