Unicode and emoji: why one emoji is often several code points

Apr 26, 2026

Working with emoji-bearing text quickly produces surprises: "👨‍👩‍👧".length returns 8, one emoji takes multiple bytes in storage, and string slicing produces broken characters. This article walks through Unicode’s emoji structure.

Unicode basics: code points

Unicode assigns each character a unique number, its code point, in the range U+0000 to U+10FFFF (1,114,112 possible values, about 1.1 million).

A = U+0041
あ = U+3042
🍎 (apple emoji) = U+1F34E
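To check these values yourself, here is a minimal sketch using the standard codePointAt and String.fromCodePoint methods. The helper name toCodePointLabel is made up for this example.

```javascript
// Hypothetical helper: format the first code point of a string as "U+XXXX".
function toCodePointLabel(str) {
  const codePoint = str.codePointAt(0); // full code point, even above U+FFFF
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

console.log(toCodePointLabel('A'));  // U+0041
console.log(toCodePointLabel('あ')); // U+3042
console.log(toCodePointLabel('🍎')); // U+1F34E

// The reverse direction: build a string from a code point number.
console.log(String.fromCodePoint(0x1F34E) === '🍎'); // true
```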

UTF-16 and surrogate pairs

JavaScript strings use UTF-16 internally. UTF-16 stores most characters in one 16-bit unit, but anything above U+FFFF (most emoji) takes two units — a surrogate pair.

'🍎'.length; // 2 (two 16-bit units)

length counts UTF-16 units, not characters, so emoji break the intuitive count.
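The arithmetic behind a surrogate pair is short enough to sketch. The function below follows the standard UTF-16 encoding rule; the name toSurrogatePair is an assumption for illustration, not a built-in.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('only U+10000..U+10FFFF are encoded as surrogate pairs');
  }
  const offset = codePoint - 0x10000;       // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);     // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F34E);
console.log(high.toString(16), low.toString(16)); // d83c df4e

// charCodeAt reads individual UTF-16 units, so it sees the same two values.
console.log('🍎'.charCodeAt(0) === high, '🍎'.charCodeAt(1) === low); // true true
```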

Counting emoji correctly

[...'🍎'].length; // 1 (iterator yields code points)
[...'👨‍👩‍👧'].length; // 5 (ZWJ sequence, see below)

[...str] or Array.from(str) iterates by code point, not UTF-16 unit. But ZWJ sequences still don’t collapse to one.

ZWJ sequences: combining multiple emoji

👨‍👩‍👧 (family emoji) is five code points:

👨 (U+1F468) + ZWJ (U+200D) + 👩 (U+1F469) + ZWJ + 👧 (U+1F467)

ZWJ (Zero-Width Joiner, U+200D) signals “render these as one combined glyph”. A capable renderer shows a family emoji; a non-capable one shows three separate emoji.
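To see the pieces of a ZWJ sequence, you can list a string's code points. This is a small sketch; listCodePoints is a hypothetical helper, and the family emoji is written with escapes so the joiners are visible.

```javascript
// Hypothetical helper: list every code point in a string as "U+XXXX".
function listCodePoints(str) {
  // Array.from iterates a string by code point, not by UTF-16 unit.
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// man, ZWJ, woman, ZWJ, girl
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

console.log(listCodePoints(family).join(' '));
// U+1F468 U+200D U+1F469 U+200D U+1F467

console.log(family.length); // 8 UTF-16 units: 2 + 1 + 2 + 1 + 2
```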

To collapse ZWJ sequences into single units, use Intl.Segmenter:

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...segmenter.segment('👨‍👩‍👧')].length; // 1
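If you need the clusters themselves rather than just a count, a small wrapper might look like this. It is a sketch that assumes a runtime with Intl.Segmenter available (recent browsers and Node.js); splitGraphemes is a made-up name.

```javascript
// Hypothetical helper: split a string into user-perceived characters.
function splitGraphemes(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  // Each item yielded by segment() carries a `segment` string and its `index`.
  return Array.from(segmenter.segment(str), (item) => item.segment);
}

// family (ZWJ sequence) + apple + plain letter
const sample = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}' + '\u{1F34E}' + 'a';

console.log(sample.length);                 // 11 UTF-16 units
console.log([...sample].length);            // 7 code points
console.log(splitGraphemes(sample).length); // 3 grapheme clusters
```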

Skin tone modifiers: Fitzpatrick scale

Emoji that depict people or body parts can take a skin tone modifier:

👍 (base) + 🏽 (medium, U+1F3FD) = 👍🏽

There are five modifiers, U+1F3FB to U+1F3FF, derived from the Fitzpatrick scale (its two lightest types share U+1F3FB). Place one immediately after a base emoji, and a capable renderer combines them.
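As a rough sketch of how this works in code, the helpers below append or remove a modifier. The tone labels and function names are assumptions for this example, and the code does not check whether the base emoji actually accepts a modifier.

```javascript
// Modifier code points U+1F3FB..U+1F3FF, keyed by hypothetical labels.
const SKIN_TONES = {
  light: 0x1F3FB,
  mediumLight: 0x1F3FC,
  medium: 0x1F3FD,
  mediumDark: 0x1F3FE,
  dark: 0x1F3FF,
};

// Append the modifier directly after the base emoji.
function applySkinTone(base, tone) {
  const modifier = SKIN_TONES[tone];
  if (modifier === undefined) throw new RangeError(`unknown tone: ${tone}`);
  return base + String.fromCodePoint(modifier);
}

// Remove skin tone modifiers; the u flag makes the range work by code point.
function stripSkinTones(str) {
  return str.replace(/[\u{1F3FB}-\u{1F3FF}]/gu, '');
}

const thumbsUp = '\u{1F44D}';
const toned = applySkinTone(thumbsUp, 'medium');

console.log(toned);                              // 👍🏽
console.log([...toned].length);                  // 2 code points
console.log(stripSkinTones(toned) === thumbsUp); // true
```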

Flags: regional indicator symbols

Country flags are two regional indicator symbols combined:

🇯🇵 = 🇯 (U+1F1EF, Regional Indicator J) + 🇵 (U+1F1F5, Regional Indicator P)

“J” + “P” = the flag of Japan. The pair spells an ISO 3166-1 two-letter country code; renderers that have a flag for that code show it, and others fall back to showing the two letters.
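Because the regional indicators run in alphabetical order from U+1F1E6 (A), converting between a country code and a flag is simple offset arithmetic. A minimal sketch, with hypothetical function names and no check that the code is actually assigned to a country:

```javascript
const REGIONAL_INDICATOR_A = 0x1F1E6; // Regional Indicator Symbol Letter A
const LATIN_CAPITAL_A = 0x41;

// Hypothetical helper: "JP" -> flag string made of two regional indicators.
function countryCodeToFlag(code) {
  const upper = code.toUpperCase();
  if (!/^[A-Z]{2}$/.test(upper)) {
    throw new RangeError('expected a two-letter code such as "JP"');
  }
  return Array.from(upper, (letter) =>
    String.fromCodePoint(REGIONAL_INDICATOR_A + letter.charCodeAt(0) - LATIN_CAPITAL_A)
  ).join('');
}

// Hypothetical helper: flag string -> two-letter code.
// Assumes the input consists only of regional indicator symbols.
function flagToCountryCode(flag) {
  return Array.from(flag, (symbol) =>
    String.fromCharCode(symbol.codePointAt(0) - REGIONAL_INDICATOR_A + LATIN_CAPITAL_A)
  ).join('');
}

console.log(countryCodeToFlag('JP'));                     // 🇯🇵
console.log(flagToCountryCode('\u{1F1EF}\u{1F1F5}'));     // JP
```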

Variation selectors: emoji vs text presentation

Some characters (☎, ☂, etc.) can render as either emoji or plain glyph depending on context:

☎     ← renderer choice
☎️ (☎ + U+FE0F) ← force emoji presentation
☎︎ (☎ + U+FE0E) ← force text presentation

U+FE0F (VS16) and U+FE0E (VS15) are the variation selectors that request emoji and text presentation respectively.
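In code, choosing a presentation is just appending one of the two selectors. The sketch below is meant for a single character such as ☎; the helper names are made up, and stripping selectors from longer sequences may change how they render.

```javascript
const VS15 = '\uFE0E'; // requests text presentation
const VS16 = '\uFE0F'; // requests emoji presentation

// Remove any presentation selector already attached.
function stripPresentationSelectors(str) {
  return str.replace(/[\uFE0E\uFE0F]/g, '');
}

// Hypothetical helpers for a single base character.
function asEmoji(ch) {
  return stripPresentationSelectors(ch) + VS16;
}

function asText(ch) {
  return stripPresentationSelectors(ch) + VS15;
}

const phone = '\u260E'; // ☎

console.log(asEmoji(phone).length);                   // 2 (base + selector)
console.log(asEmoji(asText(phone)) === phone + VS16); // true
```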

Byte counts: 4 bytes in UTF-8

In UTF-8 storage:

ASCII letter:  1 byte
Latin extended: 2 bytes
CJK characters: 3 bytes
Most emoji:     4 bytes

Systems that limit a column by bytes rather than characters can silently truncate emoji-bearing input.
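To measure what a byte-limited column will see, you can encode the string with the standard TextEncoder. The sketch below also shows one way to trim to a byte budget without cutting an emoji in half; both function names are assumptions, and it relies on Intl.Segmenter being available.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

function utf8ByteLength(str) {
  return encoder.encode(str).length;
}

// Hypothetical helper: keep whole grapheme clusters while they fit in maxBytes.
function truncateToBytes(str, maxBytes) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  let result = '';
  let used = 0;
  for (const { segment } of segmenter.segment(str)) {
    const size = utf8ByteLength(segment);
    if (used + size > maxBytes) break; // the next cluster would not fit
    result += segment;
    used += size;
  }
  return result;
}

console.log(utf8ByteLength('A'));         // 1
console.log(utf8ByteLength('\u00E9'));    // 2 (é)
console.log(utf8ByteLength('あ'));        // 3
console.log(utf8ByteLength('🍎'));        // 4
console.log(utf8ByteLength('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}')); // 18 = 4 + 3 + 4 + 3 + 4

console.log(truncateToBytes('ab🍎', 5)); // 'ab' (the emoji needs 4 more bytes)
```

Note that a ZWJ sequence can be far larger than 4 bytes: the family emoji above takes 18.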

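MySQL’s utf8 (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold 4-byte emoji: in strict SQL mode the insert fails with an error, and otherwise the value is truncated with a warning. Use utf8mb4, the default character set since MySQL 8.0.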

Things to watch in code

1. Counting characters

A “max 50 characters” check using str.length overcounts emoji, so it rejects input the user sees as well under the limit:

// ❌ one emoji counts as 2+
if (str.length > 50) reject();

// ✅ count grapheme clusters
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const count = [...seg.segment(str)].length;
if (count > 50) reject();

2. String truncation

str.slice(0, 20) cuts UTF-16 units, potentially mid-emoji:

'🍎🍎🍎'.slice(0, 5); // '🍎🍎�' — broken

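Use Intl.Segmenter to cut on grapheme boundaries. [...str].slice(0, 5).join('') is a cheaper alternative that never splits a surrogate pair, but it cuts by code point and can still break ZWJ sequences, skin tones, and flags.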
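A grapheme-aware truncation could be sketched like this. The function name truncateGraphemes is hypothetical; it uses the index that Intl.Segmenter reports for each cluster to find a safe cut position.

```javascript
// Hypothetical helper: keep at most maxGraphemes user-perceived characters.
function truncateGraphemes(str, maxGraphemes, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const { index } of segmenter.segment(str)) {
    // `index` is the UTF-16 offset where this cluster starts, so it is
    // always a safe place to cut.
    if (count === maxGraphemes) return str.slice(0, index);
    count += 1;
  }
  return str; // the string is already short enough
}

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

console.log(truncateGraphemes('🍎🍎🍎', 2) === '🍎🍎');              // true
console.log(truncateGraphemes(family + 'abc', 2) === family + 'a'); // true

// Cutting by code point instead leaves a dangling joiner: man + ZWJ.
console.log([...(family + 'abc')].slice(0, 2).join('') === '\u{1F468}\u200D'); // true
```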

3. Regular expressions

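The u flag makes a regular expression work on code points instead of UTF-16 units. A literal emoji matches either way, but ., character classes, and quantifiers misbehave without it:

/🍎/.test('🍎'); // true (the two units match in sequence)
/^.$/.test('🍎'); // false (. matches a single UTF-16 unit)
/^.$/u.test('🍎'); // true (. matches a whole code point)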
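Character classes and Unicode property escapes are where the flag matters most in practice. The sketch below assumes a runtime that supports \p{...} property escapes (current browsers and Node.js do); hasPictograph is a made-up helper name.

```javascript
// Without u, the class holds four separate surrogate units, each matching
// one unit, so an anchored match against a two-unit emoji fails.
console.log(/^[🍎🍊]$/.test('🍎'));  // false
console.log(/^[🍎🍊]$/u.test('🍎')); // true

// Property escapes are only available with the u flag.
function hasPictograph(str) {
  return /\p{Extended_Pictographic}/u.test(str);
}

console.log(hasPictograph('I like 🍎')); // true
console.log(hasPictograph('plain text')); // false

const matches = 'I like 🍎 and 🍊'.match(/\p{Extended_Pictographic}/gu);
console.log(matches.length); // 2

// \p{Emoji} is broader than it sounds: digits carry the Emoji property.
console.log(/\p{Emoji}/u.test('7')); // true
```

Extended_Pictographic matches only the base symbols, so joiners, skin tone modifiers, and variation selectors are not included in these matches.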

4. Database storage

Postgres / SQLite / modern MySQL handle this fine. Legacy MySQL utf8 (3-byte) does not.

Summary

Emoji often span two UTF-16 units (surrogate pair).
ZWJ sequences combine multiple code points into one visual character.
Skin tones, flags, and variation selectors likewise build one visible emoji from two or more code points.
Use Intl.Segmenter to count what the user sees.

Emoji are just Unicode code points at the data layer, but ZWJ-joined compounds (family emoji like 👨‍👩‍👧) render inconsistently across platforms because fonts and operating systems differ in which sequences they support. The emoji picker on this site lets you browse by category and copy with code-point info displayed. For apps doing serious emoji-aware text handling, the grapheme cluster is the right unit to work in, and Intl.Segmenter is the standard JavaScript API for it.