Text and bytes: ASCII, UTF-8, and the gap between characters and bytes
“Text” is what people read; “bytes” is what computers store. Character encoding is the bridge. This article walks from ASCII to UTF-8 and covers the practical traps along the way.
ASCII: 7 bits for English
ASCII (1963) represents letters, digits, and symbols in 7 bits (128 values):
- 0–31: control characters (tab, newline, ESC, …)
- 32: space
- 33–126: printable characters
- 127: DEL
Sample values:
| Char | Decimal | Hex | Binary |
|---|---|---|---|
| A | 65 | 0x41 | 01000001 |
| Z | 90 | 0x5A | 01011010 |
| a | 97 | 0x61 | 01100001 |
| 0 | 48 | 0x30 | 00110000 |
| space | 32 | 0x20 | 00100000 |
| \n | 10 | 0x0A | 00001010 |
One byte per character (the high bit stays 0).
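The rows of the table can be reproduced in a few lines. This is a minimal sketch; `describeAscii` is a hypothetical helper name, not part of any library.

```javascript
// Return the ASCII code of a single character in decimal, hex, and 8-bit binary.
function describeAscii(ch) {
  const code = ch.charCodeAt(0);
  if (code > 0x7f) {
    throw new RangeError('not an ASCII character: ' + ch);
  }
  return {
    decimal: code,
    hex: '0x' + code.toString(16).toUpperCase().padStart(2, '0'),
    // Padding to 8 digits shows that the high bit is always 0.
    binary: code.toString(2).padStart(8, '0'),
  };
}

for (const ch of ['A', 'Z', 'a', '0', ' ', '\n']) {
  console.log(JSON.stringify(ch), describeAscii(ch));
}
```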
ASCII’s limit: only English
128 values can’t cover:
- East Asian scripts (Japanese, Chinese, Korean).
- Many European letters (ç, ü, ñ).
- Emoji or special symbols.
The world responded with dozens of incompatible encodings:
- ISO-8859-1 (Western Europe)
- Shift_JIS, EUC-JP (Japanese)
- GB2312, Big5 (Chinese)
- KOI8-R (Russian)
Mixing them produced “encoding hell” — data corrupted across system boundaries.
Unicode: a number for every character
Unicode assigns every character a single code point:
- Up to U+10FFFF (~1.1 million).
- One number per character.
- Language-independent.
A = U+0041
あ = U+3042
🍎 = U+1F34E
Unicode is just numbering; how those numbers become bytes is a separate concern (encoding).
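In JavaScript, the mapping between a character and its code point can be inspected with the standard `codePointAt` and `String.fromCodePoint`. The sketch below uses a hypothetical helper, `toCodePointLabel`, to format the number in the usual U+XXXX notation.

```javascript
// Format the first code point of a string as "U+XXXX" (at least 4 hex digits).
function toCodePointLabel(ch) {
  const cp = ch.codePointAt(0);
  return 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');
}

console.log(toCodePointLabel('A'));  // U+0041
console.log(toCodePointLabel('あ')); // U+3042
console.log(toCodePointLabel('🍎')); // U+1F34E

// The reverse direction: from a number back to a character.
console.log(String.fromCodePoint(0x3042)); // あ
```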
UTF-8: variable-length encoding
UTF-8 stores Unicode in 1 to 4 bytes:
| Code point range | UTF-8 size | Bit pattern |
|---|---|---|
| U+0000–U+007F | 1 byte | 0xxxxxxx |
| U+0080–U+07FF | 2 bytes | 110xxxxx 10xxxxxx |
| U+0800–U+FFFF | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000–U+10FFFF | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Properties:
- ASCII-compatible — code points 0–127 are one byte, identical to ASCII.
- Self-synchronizing — you can find character boundaries from any starting point.
- Universal — every Unicode code point is representable.
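The self-synchronizing property follows from the bit patterns in the table: continuation bytes always have the form 10xxxxxx, and no lead byte does. A rough sketch of stepping back to a character boundary, assuming the input is well-formed UTF-8 (`startOfCharacter` is a hypothetical helper):

```javascript
// True for bytes of the form 10xxxxxx, which never start a character.
function isContinuationByte(b) {
  return (b & 0xc0) === 0x80;
}

// Given any index into well-formed UTF-8, walk backwards to the lead byte
// of the character that contains it.
function startOfCharacter(bytes, index) {
  let i = index;
  while (i > 0 && isContinuationByte(bytes[i])) {
    i--;
  }
  return i;
}

// 'A' takes 1 byte, 'あ' takes 3, '🍎' takes 4: indices 0, 1-3, 4-7.
const sample = new TextEncoder().encode('Aあ🍎');
console.log(startOfCharacter(sample, 2)); // 1
console.log(startOfCharacter(sample, 7)); // 4
```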
Examples:
A = 01000001 (1 byte)
あ = 11100011 10000001 10000010 (3 bytes)
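To see where those bits come from, the bit-pattern table can be turned into code directly. This is an illustrative sketch, not a replacement for the built-in `TextEncoder`; `encodeCodePoint` and `toBinary` are hypothetical names.

```javascript
// Encode one Unicode code point as UTF-8 bytes, following the table above.
function encodeCodePoint(cp) {
  if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError('not an encodable code point: ' + cp);
  }
  if (cp <= 0x7f) return [cp];
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  }
  if (cp <= 0xffff) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Render bytes as space-separated 8-bit binary strings.
function toBinary(bytes) {
  return bytes.map((b) => b.toString(2).padStart(8, '0')).join(' ');
}

console.log(toBinary(encodeCodePoint(0x41)));   // 01000001
console.log(toBinary(encodeCodePoint(0x3042))); // 11100011 10000001 10000010
```

In real code, `new TextEncoder().encode(text)` does the same job for whole strings.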
UTF-16 and UTF-32
Other encodings:
- UTF-16: 2 or 4 bytes per code point. Used internally by Windows, Java, and JavaScript.
- UTF-32: always 4 bytes per code point. Simple but wasteful in storage.
UTF-16 represents code points outside the BMP (most emoji) as surrogate pairs — 4 bytes total. This is why '🍎'.length === 2 in JavaScript.
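The surrogate pair for a code point above U+FFFF is computed by subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves. A small sketch (`toSurrogatePair` is a hypothetical helper) that can be checked against what JavaScript stores:

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogates.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10ffff) {
    throw new RangeError('code point is not outside the BMP: ' + cp);
  }
  const offset = cp - 0x10000;            // 20 bits remain
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f34e);
console.log(high.toString(16), low.toString(16)); // d83c df4e

// JavaScript strings expose the same two code units.
console.log('🍎'.charCodeAt(0) === high, '🍎'.charCodeAt(1) === low); // true true
console.log('🍎'.length, [...'🍎'].length); // 2 1
```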
Byte order marks (BOM)
A few-byte marker that may appear at the start of a file:
- UTF-8 BOM: EF BB BF (optional).
- UTF-16 LE: FF FE.
- UTF-16 BE: FE FF.
UTF-8 doesn’t need a BOM, but Windows tools sometimes add one, which breaks shell scripts (#!/bin/bash won’t be recognized after a BOM).
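When reading files of unknown origin, it can help to check for these markers and drop them before further processing. The sketch below assumes the input is a `Uint8Array`; `detectBom` and `stripBom` are hypothetical helpers, and only the three BOMs listed above are handled.

```javascript
// The byte order marks listed above. UTF-32 is deliberately not covered,
// so a UTF-32 LE file (FF FE 00 00) would be reported as UTF-16LE here.
const BOMS = [
  { name: 'UTF-8', bytes: [0xef, 0xbb, 0xbf] },
  { name: 'UTF-16LE', bytes: [0xff, 0xfe] },
  { name: 'UTF-16BE', bytes: [0xfe, 0xff] },
];

// Return the matching BOM entry, or null if the data starts with none of them.
function detectBom(bytes) {
  for (const bom of BOMS) {
    if (bom.bytes.every((b, i) => bytes[i] === b)) return bom;
  }
  return null;
}

// Return a view of the data without its leading BOM, if any.
function stripBom(bytes) {
  const bom = detectBom(bytes);
  return bom ? bytes.subarray(bom.bytes.length) : bytes;
}

// A shell script saved with a UTF-8 BOM: the first bytes are not "#!".
const script = Uint8Array.of(0xef, 0xbb, 0xbf, ...new TextEncoder().encode('#!/bin/bash\n'));
console.log(detectBom(script).name);                 // UTF-8
console.log(String.fromCharCode(stripBom(script)[0])); // #
```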
Detecting an encoding is unreliable
You cannot determine an encoding from raw bytes alone with certainty:
- Pure ASCII bytes are valid ASCII, UTF-8, Shift_JIS, and ISO-8859-1.
- Japanese files require statistical guessing to distinguish UTF-8 from Shift_JIS.
Libraries (chardet, ICU) guess but aren’t 100%. Always declare the encoding in metadata:
- HTML: <meta charset="utf-8">
- HTTP: Content-Type: text/html; charset=utf-8
- Files: extension conventions or BOM.
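The reason a declaration matters is that the same bytes decode without error under more than one encoding. A short sketch: the ISO-8859-1 decoder here is hand-written (`decodeLatin1` is a hypothetical helper), relying on the fact that ISO-8859-1 maps each byte to the code point with the same value.

```javascript
// ISO-8859-1 maps byte value N directly to code point U+00NN.
function decodeLatin1(bytes) {
  return String.fromCharCode(...bytes);
}

function decodeUtf8(bytes) {
  return new TextDecoder('utf-8').decode(bytes);
}

// Pure ASCII: both interpretations agree, so the bytes say nothing
// about which encoding was intended.
const ascii = Uint8Array.of(0x48, 0x69); // "Hi"
console.log(decodeUtf8(ascii) === decodeLatin1(ascii)); // true

// Non-ASCII: both decoders still succeed, but produce different text.
const ambiguous = Uint8Array.of(0xe3, 0x81, 0x82);
console.log(decodeUtf8(ambiguous));          // あ
console.log(decodeLatin1(ambiguous).length); // 3 ("ã" plus two control characters)
```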
Common pitfalls
1. Character count vs byte count
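'あ'.length; // 1 (UTF-16 code units, which JavaScript counts as “characters”)
new Blob(['あ']).size; // 3 (UTF-8 bytes)
Database column limits may be defined in characters or in bytes depending on the system, so the same string can fit under one interpretation and overflow under the other.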
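The different ways of counting the same string can be compared side by side. A minimal sketch using only standard APIs; `measure` is a hypothetical helper name.

```javascript
// Count one string three ways.
function measure(text) {
  return {
    utf16Units: text.length,                          // what .length reports
    codePoints: [...text].length,                     // string iteration is by code point
    utf8Bytes: new TextEncoder().encode(text).length, // size when stored as UTF-8
  };
}

console.log(measure('A'));  // { utf16Units: 1, codePoints: 1, utf8Bytes: 1 }
console.log(measure('あ')); // { utf16Units: 1, codePoints: 1, utf8Bytes: 3 }
console.log(measure('🍎')); // { utf16Units: 2, codePoints: 1, utf8Bytes: 4 }
```

A length check against a byte-limited column would need `utf8Bytes`, while a limit expressed in characters is closer to `codePoints`.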
2. Combining characters
が can be either:
- One code point: が (U+304C), or
- Two code points: か (U+304B) + ゛ (U+3099).
They render identically but compare as different. Normalize before comparison:
'\u304B\u3099'.normalize('NFC') === '\u304C'; // true: か + combining mark composes to が (U+304C)
3. Invalid UTF-8 bytes
Malformed sequences (truncated multi-byte sequences, overlong encodings, encoded surrogates) need handling at trust boundaries. Validate input before processing it.
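One way to validate is to decode in strict mode. The standard `TextDecoder` accepts a `fatal: true` option that makes it throw on malformed input instead of silently substituting U+FFFD. A sketch, with `isValidUtf8` as a hypothetical wrapper:

```javascript
// Return true if the bytes are well-formed UTF-8, false otherwise.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidUtf8(Uint8Array.of(0xe3, 0x81, 0x82))); // true  (あ)
console.log(isValidUtf8(Uint8Array.of(0xe3, 0x81)));       // false (truncated sequence)
console.log(isValidUtf8(Uint8Array.of(0xc0, 0x80)));       // false (overlong encoding)
console.log(isValidUtf8(Uint8Array.of(0xed, 0xa0, 0x80))); // false (encoded surrogate)
```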
4. ASCII-only legacy systems
Email headers (RFC 5322) and URLs (RFC 3986) are traditionally ASCII-only. Non-ASCII content needs MIME encoded-words in headers, percent-encoding in URL paths and queries, or Punycode in domain names.
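The sketch below shows what such ASCII-safe forms look like. It assumes a runtime with the standard `encodeURIComponent`, `URL`, `TextEncoder`, and `btoa` globals (browsers and recent Node.js); `encodeHeaderWord` is a hypothetical helper that ignores the length limit real mail software applies to encoded-words.

```javascript
// URL paths and queries: each UTF-8 byte becomes %XX.
console.log(encodeURIComponent('あ')); // %E3%81%82

// Domain names: the URL parser converts non-ASCII labels to Punycode ("xn--...").
console.log(new URL('http://例え.jp/').hostname);

// Email headers: a MIME encoded-word wraps Base64 of the UTF-8 bytes.
function encodeHeaderWord(text) {
  const bytes = new TextEncoder().encode(text);
  // btoa expects a "binary string" with one character per byte.
  const binary = String.fromCharCode(...bytes);
  return '=?UTF-8?B?' + btoa(binary) + '?=';
}

console.log(encodeHeaderWord('あ')); // =?UTF-8?B?44GC?=
```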
Summary
- ASCII = 7 bits, English only.
- Unicode assigns numbers; encoding turns them into bytes.
- UTF-8 is variable-length, ASCII-compatible, today’s standard.
- Character count and byte count are different.
- Always declare encoding in metadata.
To inspect text as binary in different bases, the text-to-binary tool on this site converts both ways.