Unicode and emoji: why one emoji is often several code points

Working with emoji-bearing text quickly produces surprises: "👨‍👩‍👧".length returns 8, one emoji takes multiple bytes in storage, and string slicing produces broken characters. This article walks through Unicode’s emoji structure.

Unicode basics: code points

Unicode assigns each character a unique number, its code point, in the range U+0000 to U+10FFFF (about 1.1 million possible values).

  • A = U+0041
  • あ = U+3042
  • 🍎 (apple emoji) = U+1F34E
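To check a code point yourself, a minimal sketch using the built-in codePointAt and String.fromCodePoint might look like this. The helper name toCodePointLabel is made up for illustration.

```javascript
// Hypothetical helper: format the first code point of a string as U+XXXX.
function toCodePointLabel(char) {
  const codePoint = char.codePointAt(0);
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

console.log(toCodePointLabel('A'));         // U+0041
console.log(toCodePointLabel('\u3042'));    // U+3042 (hiragana "a")
console.log(toCodePointLabel('\u{1F34E}')); // U+1F34E (apple emoji)

// The reverse direction: build a string from a code point.
console.log(String.fromCodePoint(0x1F34E) === '\u{1F34E}'); // true
```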

UTF-16 and surrogate pairs

JavaScript strings use UTF-16 internally. UTF-16 stores most characters in one 16-bit unit, but anything above U+FFFF (most emoji) takes two units, called a surrogate pair.

'🍎'.length; // 2 (two 16-bit units)

length counts UTF-16 units, not characters, so emoji break the intuitive count.
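The two units are derived from the code point by fixed arithmetic. The sketch below follows the standard UTF-16 encoding rule; the function name toSurrogatePair is an illustrative choice, and it assumes its input is above U+FFFF.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
// Assumes 0x10000 <= codePoint <= 0x10FFFF; no validation is done here.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;       // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);     // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [high, low];
}

const [appleHigh, appleLow] = toSurrogatePair(0x1F34E);
console.log(appleHigh.toString(16), appleLow.toString(16)); // d83c df4e

// The same two units are what charCodeAt reports for the apple emoji.
console.log('\u{1F34E}'.charCodeAt(0) === appleHigh); // true
console.log('\u{1F34E}'.charCodeAt(1) === appleLow);  // true
```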

Counting emoji correctly

[...'🍎'].length; // 1 (iterator yields code points)
[...'👨‍👩‍👧'].length; // 5 (ZWJ sequence, see below)

[...str] or Array.from(str) iterates by code point, not UTF-16 unit. But ZWJ sequences still don’t collapse to one.

ZWJ sequences: combining multiple emoji

👨‍👩‍👧 (family emoji) is five code points:

👨 (U+1F468) + ZWJ (U+200D) + 👩 (U+1F469) + ZWJ + 👧 (U+1F467)

ZWJ (Zero-Width Joiner, U+200D) signals “render these as one combined glyph”. A capable renderer shows a family emoji; a non-capable one shows three separate emoji.
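To see the five code points directly, here is a small sketch that builds the family emoji from escapes and lists its parts. The variable names are illustrative only.

```javascript
// The family emoji written as explicit escapes so each code point is visible.
const ZWJ = '\u200D';
const familyEmoji = '\u{1F468}' + ZWJ + '\u{1F469}' + ZWJ + '\u{1F467}';

// Array.from iterates by code point, so each ZWJ shows up as its own entry.
const familyCodePoints = Array.from(familyEmoji, (ch) =>
  'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
);
console.log(familyCodePoints);
// [ 'U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467' ]

// Splitting on ZWJ recovers the three component emoji, roughly what a
// renderer without support for the sequence ends up displaying.
const familyParts = familyEmoji.split(ZWJ);
console.log(familyParts.length); // 3
console.log(familyEmoji.length); // 8 UTF-16 units
```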

To collapse ZWJ sequences into single units, use Intl.Segmenter:

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...segmenter.segment('👨‍👩‍👧')].length; // 1
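Putting the three ways of counting side by side makes the difference concrete. This is a sketch, assuming a runtime that ships Intl.Segmenter (recent browsers and Node.js versions do); the measure helper is hypothetical. The skin tone and flag samples are explained in the sections that follow.

```javascript
// Hypothetical helper: measure one string three ways.
function measure(str) {
  const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return {
    utf16Units: str.length,
    codePoints: [...str].length,
    graphemes: [...graphemeSegmenter.segment(str)].length,
  };
}

const samples = {
  apple: '\u{1F34E}',
  family: '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}',
  thumbsUpMedium: '\u{1F44D}\u{1F3FD}',
  flagJapan: '\u{1F1EF}\u{1F1F5}',
};

for (const [name, value] of Object.entries(samples)) {
  console.log(name, measure(value));
}
// Expected counts (UTF-16 units / code points / graphemes):
//   apple          2 / 1 / 1
//   family         8 / 5 / 1
//   thumbsUpMedium 4 / 2 / 1
//   flagJapan      4 / 2 / 1
```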

Skin tone modifiers: Fitzpatrick scale

Emoji that depict people or body parts can take a skin tone modifier:

👍 (base) + 🏽 (medium, U+1F3FD) = 👍🏽

There are five modifiers, U+1F3FB to U+1F3FF, based on the Fitzpatrick scale. Place one immediately after a base emoji, and a capable renderer combines them into a single glyph.
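Because a modifier is just an extra code point appended to the base, adding or removing a skin tone is plain string manipulation. The sketch below uses illustrative names (SKIN_TONES, withSkinTone, stripSkinTone) and does not check whether the base emoji actually accepts a modifier.

```javascript
const THUMBS_UP = '\u{1F44D}';

// The five modifiers, keyed by informal tone names chosen for this example.
const SKIN_TONES = {
  light: '\u{1F3FB}',
  mediumLight: '\u{1F3FC}',
  medium: '\u{1F3FD}',
  mediumDark: '\u{1F3FE}',
  dark: '\u{1F3FF}',
};

// Append a modifier directly after the base emoji.
function withSkinTone(base, tone) {
  return base + SKIN_TONES[tone];
}

// Remove every skin tone modifier; the u flag lets the class span code points.
function stripSkinTone(str) {
  return str.replace(/[\u{1F3FB}-\u{1F3FF}]/gu, '');
}

const tonedThumbsUp = withSkinTone(THUMBS_UP, 'medium');
console.log([...tonedThumbsUp].length);                  // 2 code points
console.log(stripSkinTone(tonedThumbsUp) === THUMBS_UP); // true
```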

Flags: regional indicator symbols

Country flags are two regional indicator symbols combined:

🇯🇵 = 🇯 (U+1F1EF, Regional Indicator J) + 🇵 (U+1F1F5, Regional Indicator P)

“J” + “P” = the flag of Japan. Renderers that know the ISO 3166-1 two-letter country code show the flag.
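The mapping from letters to regional indicators is a fixed offset, so a flag can be built from its country code. This is a minimal sketch with hypothetical helper names; it only checks that the input is two ASCII letters, not that the code is an assigned country.

```javascript
// Regional Indicator Symbol Letter A; B through Z follow consecutively.
const REGIONAL_INDICATOR_A = 0x1F1E6;
const ASCII_A = 'A'.charCodeAt(0);

// 'JP' -> the two regional indicators that render as the flag of Japan.
function countryCodeToFlag(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError('expected a two-letter country code');
  }
  return [...code.toUpperCase()]
    .map((letter) =>
      String.fromCodePoint(REGIONAL_INDICATOR_A + letter.charCodeAt(0) - ASCII_A)
    )
    .join('');
}

// The reverse: read the letters back out of a flag sequence.
function flagToCountryCode(flag) {
  return [...flag]
    .map((ri) => String.fromCharCode(ri.codePointAt(0) - REGIONAL_INDICATOR_A + ASCII_A))
    .join('');
}

console.log(countryCodeToFlag('JP') === '\u{1F1EF}\u{1F1F5}'); // true
console.log(flagToCountryCode('\u{1F1EF}\u{1F1F5}'));          // JP
```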

Variation selectors: emoji vs text presentation

Some characters (☎, for example) can render as either an emoji or a plain text glyph depending on context:

☎     ← renderer choice
☎️ (☎ + U+FE0F) ← force emoji presentation
☎︎ (☎ + U+FE0E) ← force text presentation

U+FE0F (VS16) and U+FE0E (VS15) are variation selectors: the first requests emoji presentation, the second requests text presentation.
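Variation selectors are invisible, which means two strings that look alike can compare as different. The sketch below shows that and one way to normalize; stripVariationSelectors is a hypothetical helper, and whether stripping is appropriate depends on your use case.

```javascript
const VS15_TEXT = '\uFE0E';  // requests text presentation
const VS16_EMOJI = '\uFE0F'; // requests emoji presentation
const telephone = '\u260E';

const telephoneAsEmoji = telephone + VS16_EMOJI;
const telephoneAsText = telephone + VS15_TEXT;

// Each variant is two code points, and the variants are not equal strings.
console.log(telephoneAsEmoji.length);              // 2
console.log(telephoneAsEmoji === telephoneAsText); // false

// Hypothetical normalizer: drop both selectors before comparing or indexing.
function stripVariationSelectors(str) {
  return str.replace(/[\uFE0E\uFE0F]/g, '');
}

console.log(
  stripVariationSelectors(telephoneAsEmoji) === stripVariationSelectors(telephoneAsText)
); // true
```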

Byte counts: 4 bytes in UTF-8

In UTF-8 storage:

ASCII letter:  1 byte
Latin extended: 2 bytes
CJK characters: 3 bytes
Most emoji:     4 bytes

Systems that limit a column by bytes rather than characters can silently truncate emoji-bearing input.
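To check how many bytes a string will occupy before it reaches storage, the standard TextEncoder API can be used. The helper name utf8ByteLength is an illustrative choice. Note that a ZWJ sequence costs far more than 4 bytes, since every component is encoded separately.

```javascript
const utf8Encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: number of bytes a string takes in UTF-8.
function utf8ByteLength(str) {
  return utf8Encoder.encode(str).length;
}

console.log(utf8ByteLength('A'));         // 1 (ASCII)
console.log(utf8ByteLength('\u00E9'));    // 2 (e with acute accent)
console.log(utf8ByteLength('\u3042'));    // 3 (hiragana "a")
console.log(utf8ByteLength('\u{1F34E}')); // 4 (apple emoji)

// Family emoji: 4 + 3 + 4 + 3 + 4 bytes (each ZWJ is 3 bytes).
console.log(utf8ByteLength('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}')); // 18
```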

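MySQL’s utf8 is actually a 3-byte subset (an alias for utf8mb3), so it cannot store 4-byte emoji: in strict SQL mode the insert fails with an error, and otherwise the value is truncated with a warning. Use utf8mb4, which is the default character set since MySQL 8.0.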

Things to watch in code

1. Counting characters

A “max 50 characters” check using str.length overcounts emoji, so it rejects input that is visibly within the limit:

// ❌ one emoji counts as 2+
if (str.length > 50) reject();

// ✅ count grapheme clusters
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const count = [...seg.segment(str)].length;
if (count > 50) reject();

2. String truncation

str.slice(0, 20) cuts UTF-16 units, potentially mid-emoji:

'🍎🍎🍎'.slice(0, 5); // '🍎🍎�' — broken

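Use Intl.Segmenter to cut on grapheme boundaries. [...str].slice(0, 5).join('') cuts on code point boundaries instead, which keeps surrogate pairs intact but can still split ZWJ sequences, skin tone modifiers, and flags.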
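A grapheme-aware truncation could be sketched as follows, assuming Intl.Segmenter is available in the target runtime. The function name truncateGraphemes is hypothetical.

```javascript
// Keep at most maxGraphemes user-perceived characters, never cutting inside one.
function truncateGraphemes(str, maxGraphemes) {
  const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  let result = '';
  let kept = 0;
  for (const { segment } of graphemeSegmenter.segment(str)) {
    if (kept >= maxGraphemes) break;
    result += segment;
    kept += 1;
  }
  return result;
}

const threeApples = '\u{1F34E}\u{1F34E}\u{1F34E}';
console.log(truncateGraphemes(threeApples, 2) === '\u{1F34E}\u{1F34E}'); // true

// A ZWJ sequence survives as one unit: family emoji + 'a' are the first two.
const familyThenText = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}abc';
console.log(truncateGraphemes(familyThenText, 2).length); // 9 UTF-16 units
```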

3. Regular expressions

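Without the u flag, a pattern works on UTF-16 units, so ., character classes, and quantifiers treat an emoji as two characters. A literal match such as /🍎/ still succeeds, but anything that relies on “one character” does not:

/^.$/u.test('🍎'); // true
/^.$/.test('🍎'); // false ('.' matches a single UTF-16 unit)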
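Character classes and Unicode property escapes are where the u flag changes results most visibly. The following sketch illustrates both; the variable names are illustrative, and \p{Extended_Pictographic} is one of several emoji-related properties you might choose.

```javascript
const redApple = '🍎';
const tangerine = '🍊';

// Without u, the class holds individual surrogate halves, so it can
// never match a whole emoji as one character.
console.log(/^[🍎🍊]$/.test(tangerine));  // false
console.log(/^[🍎🍊]$/u.test(tangerine)); // true

// Replacing through a non-u class corrupts the string: only the first
// surrogate half is swapped out.
const corrupted = redApple.replace(/[🍎]/, 'x');
console.log(corrupted.length); // 2 ('x' plus a lone low surrogate)

// With u, the whole emoji is replaced.
console.log(redApple.replace(/[🍎]/u, 'x')); // x

// Unicode property escapes need the u (or v) flag to be recognized.
console.log(/\p{Extended_Pictographic}/u.test(redApple)); // true
console.log(/\p{Extended_Pictographic}/u.test('A'));      // false
```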

4. Database storage

Postgres / SQLite / modern MySQL handle this fine. Legacy MySQL utf8 (3-byte) does not.
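If you are stuck with a 3-byte column for now, one option is to detect problematic input in application code before it reaches the database. This is only a sketch of such a guard; the function name hasSupplementaryCodePoint is hypothetical, and it flags any code point above U+FFFF, which is exactly the set that needs 4 bytes in UTF-8.

```javascript
// True if the string contains any code point above U+FFFF.
// Such code points take 4 bytes in UTF-8 and do not fit a 3-byte charset.
function hasSupplementaryCodePoint(str) {
  return /[\u{10000}-\u{10FFFF}]/u.test(str);
}

console.log(hasSupplementaryCodePoint('hello'));           // false
console.log(hasSupplementaryCodePoint('\u3042'));          // false (3 bytes)
console.log(hasSupplementaryCodePoint('\u260E\uFE0F'));    // false (both in the BMP)
console.log(hasSupplementaryCodePoint('hi \u{1F34E}'));    // true
```

Rejecting or stripping such input is a stopgap; migrating the column to a 4-byte-capable character set removes the need for the check.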

Summary

  • Emoji often span two UTF-16 units (surrogate pair).
  • ZWJ sequences combine multiple code points into one visual character.
  • Skin tones, flags, and variation selectors are further multi-code-point sequences that render as one character.
  • Use Intl.Segmenter to count what the user sees.

To explore emoji and grab their code points, the emoji picker on this site lets you copy and inspect them.