Unicode and emoji: why one emoji is often several code points
Working with emoji-bearing text quickly produces surprises: "👨👩👧".length returns 8, one emoji takes multiple bytes in storage, and string slicing produces broken characters. This article walks through Unicode’s emoji structure.
Unicode basics: code points
Unicode assigns each character a unique number, its code point, in the range U+0000 to U+10FFFF (about 1.1 million possible values).
A = U+0041
あ = U+3042
🍎 (apple emoji) = U+1F34E
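The mapping between characters and code points can be checked directly in JavaScript. The following is a minimal sketch; the helper name toCodePointLabel is made up for this example, while codePointAt and String.fromCodePoint are standard methods.

```javascript
// Format the first code point of a string as U+XXXX.
// toCodePointLabel is an illustrative helper, not a built-in.
function toCodePointLabel(ch) {
  const cp = ch.codePointAt(0);
  return 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');
}

const labels = ['A', 'あ', '🍎'].map(toCodePointLabel);
console.log(labels); // [ 'U+0041', 'U+3042', 'U+1F34E' ]

// Going the other way: build a string from a code point.
const appleFromCodePoint = String.fromCodePoint(0x1F34E);
console.log(appleFromCodePoint === '🍎'); // true
```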
UTF-16 and surrogate pairs
JavaScript strings use UTF-16 internally. UTF-16 stores most characters in one 16-bit unit, but anything above U+FFFF (most emoji) takes two units — a surrogate pair.
'🍎'.length; // 2 (two 16-bit units)
length counts UTF-16 code units, not characters, so emoji break the intuitive count.
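To see where the two units come from, the surrogate pair can be computed by hand. This sketch follows the standard UTF-16 encoding rule (subtract 0x10000, then split the remaining 20 bits in half); the function name toSurrogatePair is only an illustration.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
// toSurrogatePair is an illustrative helper, not a built-in.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;         // leaves a 20-bit value
  const highUnit = 0xD800 + (offset >> 10);   // top 10 bits
  const lowUnit = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [highUnit, lowUnit];
}

const [high, low] = toSurrogatePair(0x1F34E);
console.log(high.toString(16), low.toString(16)); // d83c df4e

// charCodeAt reports the same two units for the apple emoji.
const appleUnits = ['🍎'.charCodeAt(0), '🍎'.charCodeAt(1)];
console.log(appleUnits[0] === high && appleUnits[1] === low); // true
```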
Counting emoji correctly
[...'🍎'].length; // 1 (iterator yields code points)
[...'👨\u200D👩\u200D👧'].length; // 5 (ZWJ sequence, see below)
[...str] or Array.from(str) iterates by code point, not UTF-16 unit. But a ZWJ sequence still doesn’t collapse to one.
ZWJ sequences: combining multiple emoji
👨👩👧 (family emoji) is five code points:
👨 (U+1F468) + ZWJ (U+200D) + 👩 (U+1F469) + ZWJ (U+200D) + 👧 (U+1F467)
ZWJ (Zero Width Joiner, U+200D) signals “render these as one combined glyph”. A renderer that supports the sequence shows a single family emoji; one that doesn’t shows three separate emoji.
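The breakdown above can be reproduced by assembling the sequence from its code points. This is a small sketch using only standard string methods; the variable names are arbitrary, and the ZWJ is written as an escape so that it stays visible in the source.

```javascript
// Build the family emoji from its five code points.
const ZWJ = 0x200D;
const family = String.fromCodePoint(0x1F468, ZWJ, 0x1F469, ZWJ, 0x1F467);

// List each code point in U+XXXX form.
const familyCodePoints = [...family].map(
  (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase()
);
console.log(familyCodePoints);
// [ 'U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467' ]

console.log(family.length);      // 8 UTF-16 units (2 + 1 + 2 + 1 + 2)
console.log([...family].length); // 5 code points

// Removing the joiners leaves three independent emoji.
const separated = family.replace(/\u200D/g, '');
console.log([...separated].length); // 3
```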
To collapse ZWJ sequences into single units, use Intl.Segmenter:
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...segmenter.segment('👨\u200D👩\u200D👧')].length; // 1
Skin tone modifiers: Fitzpatrick scale
Emoji that show people or body parts can take a skin tone modifier:
👍 (base) + 🏽 (medium, U+1F3FD) = 👍🏽
There are five modifiers, in the range U+1F3FB to U+1F3FF, based on the Fitzpatrick scale. Place one immediately after a base emoji, and a capable renderer combines them.
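As a rough illustration, a modifier can be appended in code. The SKIN_TONES key names and the withSkinTone helper are this sketch's own labels, and the helper does not check whether the base emoji actually accepts a modifier.

```javascript
// The five Fitzpatrick modifiers, U+1F3FB to U+1F3FF.
// Key names are illustrative labels chosen for this sketch.
const SKIN_TONES = {
  light: 0x1F3FB,
  mediumLight: 0x1F3FC,
  medium: 0x1F3FD,
  mediumDark: 0x1F3FE,
  dark: 0x1F3FF,
};

// Append a modifier directly after the base emoji (no validation).
function withSkinTone(baseEmoji, tone) {
  return baseEmoji + String.fromCodePoint(SKIN_TONES[tone]);
}

const thumbsUp = withSkinTone('\u{1F44D}', 'medium'); // 👍 + 🏽
console.log(thumbsUp);
console.log(thumbsUp.length);      // 4 UTF-16 units
console.log([...thumbsUp].length); // 2 code points

// A grapheme segmenter treats base + modifier as one unit.
const toneSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const thumbsUpGraphemes = [...toneSegmenter.segment(thumbsUp)].length;
console.log(thumbsUpGraphemes);    // 1
```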
Flags: regional indicator symbols
Country flags are two regional indicator symbols combined:
🇯🇵 = 🇯 (U+1F1EF, Regional Indicator J) + 🇵 (U+1F1F5, Regional Indicator P)
“J” + “P” = the flag of Japan. A renderer that recognizes the pair as an ISO 3166-1 two-letter country code shows the flag; one that doesn’t typically shows the two letters separately.
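Because the 26 regional indicators are laid out in alphabetical order starting at U+1F1E6 (A), a flag can be derived from a country code with simple arithmetic. The helper below is a sketch with a hypothetical name, and it only validates the shape of the code, not whether it is an assigned ISO 3166-1 code.

```javascript
// Regional Indicator Symbol Letter A is U+1F1E6; B to Z follow in order.
const REGIONAL_INDICATOR_A = 0x1F1E6;

// countryCodeToFlag is an illustrative helper, not a built-in.
function countryCodeToFlag(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError('expected a two-letter country code');
  }
  const points = [...code.toUpperCase()].map(
    // 65 is the char code of 'A'
    (letter) => REGIONAL_INDICATOR_A + (letter.charCodeAt(0) - 65)
  );
  return String.fromCodePoint(...points);
}

const japanFlag = countryCodeToFlag('JP');
console.log(japanFlag);
console.log([...japanFlag].length); // 2 code points
console.log(japanFlag.length);      // 4 UTF-16 units
```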
Variation selectors: emoji vs text presentation
Some characters (☎, ☂, etc.) can render as either emoji or plain glyph depending on context:
☎ ← renderer choice
☎️ (☎ + U+FE0F) ← force emoji presentation
☎︎ (☎ + U+FE0E) ← force text presentation
U+FE0F and U+FE0E are the variation selectors.
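One practical consequence is that strings which look alike may not compare equal, because the selector is a real code point in the string. The sketch below shows this and one possible normalization; stripVariationSelectors is a hypothetical helper, and whether stripping is appropriate depends on the application.

```javascript
const TEXT_PRESENTATION = '\uFE0E';  // variation selector 15
const EMOJI_PRESENTATION = '\uFE0F'; // variation selector 16

const telephone = '\u260E';          // ☎ with no selector
const asEmoji = telephone + EMOJI_PRESENTATION;
const asText = telephone + TEXT_PRESENTATION;

// The selector adds one UTF-16 unit and changes equality.
console.log(asEmoji.length);        // 2
console.log(asEmoji === telephone); // false

// Illustrative helper: drop both selectors before comparing.
function stripVariationSelectors(str) {
  return str.replace(/[\uFE0E\uFE0F]/g, '');
}

console.log(
  stripVariationSelectors(asEmoji) === stripVariationSelectors(asText)
); // true
```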
Byte counts: 4 bytes in UTF-8
In UTF-8 storage:
ASCII letter: 1 byte
Latin extended: 2 bytes
CJK characters: 3 bytes
Most emoji: 4 bytes
Systems that limit a column by bytes rather than characters can silently truncate emoji-bearing input.
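MySQL’s utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character, so a 4-byte emoji is rejected with an error in strict SQL mode or truncated with a warning otherwise. Use utf8mb4, which is the default since MySQL 8.0.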
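The byte counts listed above can be measured with TextEncoder, which always encodes to UTF-8. This is a minimal sketch; utf8ByteLength is a name chosen for the example.

```javascript
// TextEncoder always produces UTF-8.
const utf8Encoder = new TextEncoder();

// Illustrative helper: number of bytes a string needs in UTF-8.
function utf8ByteLength(str) {
  return utf8Encoder.encode(str).length;
}

console.log(utf8ByteLength('A'));          // 1 (ASCII)
console.log(utf8ByteLength('\u00E9'));     // 2 (é)
console.log(utf8ByteLength('\u3042'));     // 3 (あ)
console.log(utf8ByteLength('\u{1F34E}'));  // 4 (🍎)

// A ZWJ family: three 4-byte emoji plus two 3-byte joiners.
const familyBytes = utf8ByteLength('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}');
console.log(familyBytes);                  // 18
```

A single visible character can therefore consume 18 bytes of a byte-limited column.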
Things to watch in code
1. Counting characters
A “max 50 characters” check using str.length overcounts emoji, so it rejects input that looks shorter than the limit:
// ❌ one emoji counts as 2+
if (str.length > 50) reject();
// ✅ count grapheme clusters
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const count = [...seg.segment(str)].length;
if (count > 50) reject();
2. String truncation
str.slice(0, 20) cuts UTF-16 units, potentially mid-emoji:
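'🍎🍎🍎'.slice(0, 5); // '🍎🍎�' (broken: the third apple is cut in half, leaving a lone surrogate)
Use Intl.Segmenter to cut on grapheme boundaries. [...str].slice(0, 5).join('') cuts on code points, which avoids lone surrogates but can still split a ZWJ sequence.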
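One way to truncate safely is to walk grapheme segments and stop at the limit. The function below is a sketch under that assumption; truncateGraphemes and its parameters are names invented for this example.

```javascript
// Keep at most maxGraphemes user-visible characters.
// truncateGraphemes is an illustrative helper, not a built-in.
function truncateGraphemes(str, maxGraphemes, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let result = '';
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count >= maxGraphemes) break;
    result += segment;
    count += 1;
  }
  return result;
}

const familyEmoji = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
const sample = '\u{1F34E}' + familyEmoji + '\u{1F34E}'; // apple, family, apple

// Grapheme-based: the family stays intact.
const byGrapheme = truncateGraphemes(sample, 2);
console.log(byGrapheme === '\u{1F34E}' + familyEmoji); // true

// Code-point-based: the family is cut after its first member.
const byCodePoint = [...sample].slice(0, 2).join('');
console.log(byCodePoint === '\u{1F34E}\u{1F468}'); // true
```

Constructing a segmenter on every call keeps the sketch short; code that truncates many strings would likely reuse one instance.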
3. Regular expressions
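The u flag makes a regex treat each emoji as one code point instead of two UTF-16 units. A literal match such as /🍎/ works either way, but the dot, character classes, and quantifiers need the flag:
/^.$/u.test('🍎'); // true
/^.$/.test('🍎'); // false (. matches a single UTF-16 unit)
4. Database storage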
Postgres / SQLite / modern MySQL handle this fine. Legacy MySQL utf8 (3-byte) does not.
Summary
- Emoji often span two UTF-16 units (surrogate pair).
- ZWJ sequences combine multiple code points into one visual character.
- Skin tones, flags, and variation selectors are likewise multi-code-point sequences that render as one character.
- Use Intl.Segmenter to count what the user sees.
To explore emoji and grab their code points, the emoji picker on this site lets you copy and inspect them.