Text and bytes: ASCII, UTF-8, and the gap between characters and bytes

“Text” is what people read; “bytes” is what computers store. Character encoding is the bridge between them. This article goes from ASCII to UTF-8 and covers the practical traps along the way.

ASCII: 7 bits for English

ASCII (1963) represents letters, digits, and symbols in 7 bits (128 values):

  • 0–31: control characters (tab, newline, ESC, …)
  • 32: space
  • 33–126: printable characters
  • 127: DEL

Sample values:

Char    Decimal  Hex   Binary
A       65       0x41  01000001
Z       90       0x5A  01011010
a       97       0x61  01100001
0       48       0x30  00110000
space   32       0x20  00100000
\n      10       0x0A  00001010

Stored in 8-bit bytes, each ASCII character takes exactly one byte, and the high bit is always 0.
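The sample values above can be reproduced in a few lines. This is a minimal sketch; the helper name asciiRow is illustrative and not part of any library.

```javascript
// Return the decimal, hex, and 8-bit binary forms of one ASCII character.
// asciiRow is a hypothetical helper name used only for this example.
function asciiRow(ch) {
  const code = ch.charCodeAt(0);
  if (code > 127) throw new RangeError(`not ASCII: ${ch}`);
  return {
    dec: code,
    hex: '0x' + code.toString(16).toUpperCase().padStart(2, '0'),
    bin: code.toString(2).padStart(8, '0'), // high bit is always 0
  };
}

for (const ch of ['A', 'Z', 'a', '0', ' ', '\n']) {
  console.log(JSON.stringify(ch), asciiRow(ch));
}
```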

ASCII’s limit: only English

128 values can’t cover:

  • East Asian scripts (Japanese, Chinese, Korean).
  • Many European letters (ç, ü, ñ).
  • Emoji or special symbols.

The world responded with dozens of incompatible encodings:

  • ISO-8859-1 (Western Europe)
  • Shift_JIS, EUC-JP (Japanese)
  • GB2312, Big5 (Chinese)
  • KOI8-R (Russian)

Mixing them produced “encoding hell” — data corrupted across system boundaries.

Unicode: a number for every character

Unicode assigns every character a single code point:

  • Up to U+10FFFF (~1.1 million).
  • One number per character.
  • Language-independent.

A  = U+0041
あ = U+3042
🍎 = U+1F34E

Unicode is just numbering — how those numbers become bytes is a separate concern (encoding).
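To look up code points yourself, JavaScript's built-in codePointAt and String.fromCodePoint are enough. The sketch below assumes nothing beyond those two built-ins; toCodePointLabel is a made-up helper name.

```javascript
// Format the first code point of a string in U+XXXX notation.
// toCodePointLabel is a hypothetical helper for illustration.
function toCodePointLabel(ch) {
  const cp = ch.codePointAt(0);
  return 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');
}

console.log(toCodePointLabel('A'));  // U+0041
console.log(toCodePointLabel('あ')); // U+3042
console.log(toCodePointLabel('🍎')); // U+1F34E

// The reverse direction: from a number back to a string.
console.log(String.fromCodePoint(0x3042)); // あ
```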

UTF-8: variable-length encoding

UTF-8 stores Unicode in 1 to 4 bytes:

Code point range     UTF-8 size  Bit pattern
U+0000–U+007F        1 byte      0xxxxxxx
U+0080–U+07FF        2 bytes     110xxxxx 10xxxxxx
U+0800–U+FFFF        3 bytes     1110xxxx 10xxxxxx 10xxxxxx
U+10000–U+10FFFF     4 bytes     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Properties:

  • ASCII-compatible — code points 0–127 are one byte, identical to ASCII.
  • Self-synchronizing — you can find character boundaries from any starting point.
  • Universal — every Unicode code point is representable.
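Self-synchronization follows from the bit patterns: continuation bytes always start with 10, and no other byte does. A rough sketch of finding a character boundary from an arbitrary byte offset, assuming the input is already valid UTF-8 (isContinuation and characterStart are illustrative names):

```javascript
// Continuation bytes look like 10xxxxxx; any other byte starts a character.
function isContinuation(byte) {
  return (byte & 0b11000000) === 0b10000000;
}

// Step back from an arbitrary index to the first byte of the character
// that contains it. Assumes `bytes` holds valid UTF-8.
function characterStart(bytes, index) {
  let i = index;
  while (i > 0 && isContinuation(bytes[i])) i--;
  return i;
}

const bytes = new TextEncoder().encode('Aあ🍎'); // 1 + 3 + 4 bytes
console.log(characterStart(bytes, 2)); // 1 (inside あ, which starts at byte 1)
console.log(characterStart(bytes, 6)); // 4 (inside 🍎, which starts at byte 4)
```

At most three backward steps are ever needed, because a character is never longer than four bytes.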

Examples:

  • A = 01000001 (1 byte)
  • あ = 11100011 10000001 10000010 (3 bytes)
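The bit patterns in the table translate almost directly into code. The following is a hand-written sketch for illustration only (encodeCodePoint and toBinary are made-up names); in practice the built-in TextEncoder does this for you.

```javascript
// Encode one Unicode code point into UTF-8 bytes, following the bit-pattern table.
// Illustrative only: real code should use TextEncoder.
function encodeCodePoint(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError('not an encodable code point');
  }
  if (cp <= 0x7F) return [cp];                                   // 0xxxxxxx
  if (cp <= 0x7FF) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 110xxxxx 10xxxxxx
  if (cp <= 0xFFFF) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const toBinary = (bytes) =>
  bytes.map((b) => b.toString(2).padStart(8, '0')).join(' ');

console.log(toBinary(encodeCodePoint(0x41)));   // 01000001
console.log(toBinary(encodeCodePoint(0x3042))); // 11100011 10000001 10000010
```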

UTF-16 and UTF-32

Other encodings:

  • UTF-16 — 2 or 4 bytes per character. Used internally by Windows, Java, JavaScript.
  • UTF-32 — always 4 bytes. Simple but wasteful in storage.

UTF-16 represents code points outside the BMP (most emoji) as surrogate pairs — 4 bytes total. This is why '🍎'.length === 2 in JavaScript.
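The surrogate pair for a given code point can be computed by hand. A small sketch of the arithmetic, with toSurrogatePair as a hypothetical helper name:

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError('code point does not need a surrogate pair');
  }
  const offset = cp - 0x10000;            // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F34E);
console.log(high.toString(16), low.toString(16));     // d83c df4e
console.log(String.fromCharCode(high, low) === '🍎'); // true
console.log('🍎'.length, [...'🍎'].length);           // 2 1
```

The last line shows the practical consequence: length counts UTF-16 code units, while spreading the string iterates by code point.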

Byte order marks (BOM)

A byte order mark is a marker of a few bytes that may appear at the start of a file:

  • UTF-8 BOM: EF BB BF (optional).
  • UTF-16 LE: FF FE.
  • UTF-16 BE: FE FF.

UTF-8 doesn’t need a BOM, but Windows tools sometimes add one, which breaks shell scripts (#!/bin/bash won’t be recognized after a BOM).
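When you read raw bytes yourself, dropping a leading UTF-8 BOM is a three-byte comparison. A minimal sketch, where stripUtf8Bom is an illustrative name rather than a standard function:

```javascript
// Return the bytes without a leading UTF-8 BOM (EF BB BF), if one is present.
function stripUtf8Bom(bytes) {
  const hasBom =
    bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
  return hasBom ? bytes.subarray(3) : bytes;
}

const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x23, 0x21]); // BOM + "#!"
const clean = stripUtf8Bom(withBom);
console.log(clean.length);                    // 2
console.log(new TextDecoder().decode(clean)); // #!
```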

Detecting an encoding is unreliable

You cannot determine an encoding from raw bytes alone with certainty:

  • Pure ASCII bytes are valid ASCII, UTF-8, Shift_JIS, and ISO-8859-1.
  • Japanese files require statistical guessing to distinguish UTF-8 from Shift_JIS.

Libraries such as chardet and ICU can guess, but they are not 100% accurate. Always declare the encoding in metadata:

  • HTML: <meta charset="utf-8">
  • HTTP: Content-Type: text/html; charset=utf-8
  • Files: extension conventions or BOM.
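To see why guessing is unreliable, take two bytes that are valid under more than one encoding. The sketch below decodes them as UTF-8 with TextDecoder and as ISO-8859-1 by hand (each byte maps directly to the code point with the same value); the byte values are chosen only for illustration.

```javascript
// The same two bytes, read under two different encodings.
const bytes = new Uint8Array([0xC3, 0xA9]);

// UTF-8: one two-byte sequence.
const asUtf8 = new TextDecoder('utf-8').decode(bytes);

// ISO-8859-1: every byte is the code point of the same value.
const asLatin1 = String.fromCharCode(...bytes);

console.log(asUtf8);   // é
console.log(asLatin1); // Ã©
```

Both results are legal text, so nothing in the bytes themselves says which reading was intended. Only the declared encoding does.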

Common pitfalls

1. Character count vs byte count

'あ'.length; // 1 (UTF-16 code units, which here equals the character count)
new Blob(['あ']).size; // 3 (UTF-8 bytes)

Database column limits may be counted in characters or in bytes depending on the system, so the same string can fit in one column and overflow another.
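Before checking a string against a limit, it helps to measure it in each unit. A small sketch (measure is a hypothetical helper name) that reports UTF-16 code units, code points, and UTF-8 bytes side by side:

```javascript
// Measure one string three ways.
function measure(text) {
  return {
    codeUnits: text.length,                           // what String.length reports
    codePoints: [...text].length,                     // iteration is by code point
    utf8Bytes: new TextEncoder().encode(text).length, // size when stored as UTF-8
  };
}

console.log(measure('abc')); // codeUnits 3, codePoints 3, utf8Bytes 3
console.log(measure('あ'));  // codeUnits 1, codePoints 1, utf8Bytes 3
console.log(measure('🍎'));  // codeUnits 2, codePoints 1, utf8Bytes 4
```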

2. Combining characters

が can be either:

  • One code point: が (U+304C), or
  • Two code points: か (U+304B) + the combining voiced sound mark (U+3099).

They render identically but compare as different. Normalize before comparison:

'\u304B\u3099'.normalize('NFC'); // → 'が' (U+304C)
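Writing the two forms with escapes makes the difference visible in source code. The comparison helper below, sameText, is an illustrative name; it only wraps the built-in normalize.

```javascript
const composed = '\u304C';         // が as a single code point
const decomposed = '\u304B\u3099'; // か followed by the combining voiced sound mark

// Compare two strings after bringing both to the same normalization form.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(composed === decomposed);                 // false
console.log(sameText(composed, decomposed));          // true
console.log([...decomposed.normalize('NFC')].length); // 1
```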

3. Invalid UTF-8 bytes

Truncated sequences, overlong encodings, encoded surrogates, and other malformed input need handling at trust boundaries. Validate before processing.
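One way to validate is to let the platform's decoder do the work. This sketch relies on the standard TextDecoder option fatal: true, which makes decoding throw on malformed input instead of substituting U+FFFD; isValidUtf8 is a made-up helper name.

```javascript
// Return true only if the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8(new Uint8Array([0xE3, 0x81, 0x82]))); // true  (あ)
console.log(isValidUtf8(new Uint8Array([0xE3, 0x81])));       // false (truncated sequence)
console.log(isValidUtf8(new Uint8Array([0xC0, 0x80])));       // false (overlong encoding)
console.log(isValidUtf8(new Uint8Array([0xED, 0xA0, 0x80]))); // false (encoded surrogate)
```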

4. ASCII-only legacy systems

Email headers (RFC 5322) and URLs (RFC 3986) are ASCII-only. Non-ASCII content needs MIME encoded-words (RFC 2047) in headers, percent-encoding in URL paths and queries, and Punycode in domain names.
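Each of these ASCII escapes can be tried with built-ins. The sketch below is deliberately simplified: toEncodedWord is a hypothetical helper that produces a single Base64 encoded-word and ignores the 75-character limit RFC 2047 places on each encoded-word, so a real mailer would need to split long values.

```javascript
// URL paths and queries: percent-encode the UTF-8 bytes.
console.log(encodeURIComponent('あ')); // %E3%81%82

// Email headers: wrap the UTF-8 bytes as a Base64 encoded-word.
// Simplified: does not split values that exceed the encoded-word length limit.
function toEncodedWord(text) {
  const bytes = new TextEncoder().encode(text);
  const base64 = btoa(String.fromCharCode(...bytes));
  return `=?UTF-8?B?${base64}?=`;
}
console.log(toEncodedWord('あ')); // =?UTF-8?B?44GC?=

// Domain names: the URL parser converts non-ASCII labels to "xn--" Punycode labels.
const host = new URL('https://例え.テスト/').hostname;
console.log(host);
```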

Summary

  • ASCII = 7 bits, English only.
  • Unicode assigns numbers; encoding turns them into bytes.
  • UTF-8 is variable-length, ASCII-compatible, and today’s standard.
  • Character count and byte count are different.
  • Always declare encoding in metadata.

To inspect text as binary in different bases, the text-to-binary tool on this site converts both ways.