Text and bytes: ASCII, UTF-8, and the gap between characters and bytes

Apr 26, 2026

“Text” is what people read; “bytes” is what computers store. Character encoding is the bridge. This article covers ASCII, Unicode, and UTF-8, along with the practical traps.

ASCII: 7 bits for English

ASCII (1963) represents letters, digits, and symbols in 7 bits (128 values):

0–31: control characters (tab, newline, ESC, …)
32: space
33–126: printable characters
127: DEL

Sample values:

Char	Decimal	Hex	Binary
`A`	65	0x41	01000001
`Z`	90	0x5A	01011010
`a`	97	0x61	01100001
`0`	48	0x30	00110000
space	32	0x20	00100000
`\n`	10	0x0A	00001010

Stored in 8-bit bytes, each ASCII character takes exactly one byte, and the high bit is always 0.
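The table rows can be reproduced with a few lines of JavaScript. This is a minimal sketch; `describeAscii` is a hypothetical helper name used only for illustration.

```javascript
// Show one ASCII character in decimal, hex, and 8-bit binary.
// describeAscii is a hypothetical helper, not part of any library.
function describeAscii(ch) {
  const code = ch.charCodeAt(0);
  if (code > 127) throw new RangeError('not an ASCII character');
  return {
    decimal: code,
    hex: '0x' + code.toString(16).toUpperCase().padStart(2, '0'),
    binary: code.toString(2).padStart(8, '0'), // leading bit is always 0
  };
}

console.log(describeAscii('A'));  // { decimal: 65, hex: '0x41', binary: '01000001' }
console.log(describeAscii('\n')); // { decimal: 10, hex: '0x0A', binary: '00001010' }
```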

ASCII’s limit: only English

128 values can’t cover:

East Asian scripts (Japanese, Chinese, Korean).
Many European letters (ç, ü, ñ).
Emoji or special symbols.

The world responded with dozens of incompatible encodings:

ISO-8859-1 (Western Europe)
Shift_JIS, EUC-JP (Japanese)
GB2312, Big5 (Chinese)
KOI8-R (Russian)

Mixing them produced “encoding hell” — data corrupted across system boundaries.

Unicode: a number for every character

Unicode assigns every character a single code point:

Up to U+10FFFF (~1.1 million).
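One number per encoded character (a visible character can still be a sequence of several code points, as with the combining marks covered under Common pitfalls).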
Language-independent.

A  = U+0041
あ = U+3042
🍎  = U+1F34E

Unicode is just numbering — how those numbers become bytes is a separate concern (encoding).
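In JavaScript, the code point behind a character can be read with `codePointAt` and turned back into a character with `String.fromCodePoint`. A small sketch, where `toCodePointLabel` is a hypothetical helper name:

```javascript
// Format the first code point of a string in U+XXXX notation.
// toCodePointLabel is a hypothetical helper for illustration.
function toCodePointLabel(ch) {
  const cp = ch.codePointAt(0);
  return 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');
}

console.log(toCodePointLabel('A'));  // U+0041
console.log(toCodePointLabel('あ')); // U+3042
console.log(toCodePointLabel('🍎')); // U+1F34E

// The reverse direction: code point to character.
console.log(String.fromCodePoint(0x3042)); // あ
```

Nothing here involves bytes yet; the numbers are the same whichever encoding later stores them.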

UTF-8: variable-length encoding

UTF-8 stores Unicode in 1 to 4 bytes:

Code point range	UTF-8 size	Bit pattern
U+0000–U+007F	1 byte	`0xxxxxxx`
U+0080–U+07FF	2 bytes	`110xxxxx 10xxxxxx`
U+0800–U+FFFF	3 bytes	`1110xxxx 10xxxxxx 10xxxxxx`
U+10000–U+10FFFF	4 bytes	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

Properties:

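ASCII-compatible: code points 0–127 are one byte, identical to ASCII.
Self-synchronizing: continuation bytes always start with the bits 10, so you can find a character boundary from any byte offset.
Universal: every Unicode code point is representable, except the surrogate range U+D800–U+DFFF, which is reserved for UTF-16.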
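The self-synchronizing property can be sketched as follows: starting from an arbitrary byte offset, step backwards while the byte looks like a continuation byte (`10xxxxxx`). `findCharStart` is a hypothetical helper, and the sketch assumes the input is already valid UTF-8.

```javascript
// Return the index of the first byte of the character that contains `index`.
// Assumes `bytes` holds valid UTF-8; findCharStart is a hypothetical helper.
function findCharStart(bytes, index) {
  let i = index;
  while (i > 0 && (bytes[i] & 0xC0) === 0x80) {
    i--; // 10xxxxxx is a continuation byte, keep stepping back
  }
  return i;
}

// 'Aあ🍎' encodes as 41 | E3 81 82 | F0 9F 8D 8E
const sample = new TextEncoder().encode('Aあ🍎');
console.log(findCharStart(sample, 2)); // 1 (start of あ)
console.log(findCharStart(sample, 6)); // 4 (start of 🍎)
```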

Examples:

A = 01000001 (1 byte)
あ = 11100011 10000001 10000010 (3 bytes)
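The bit patterns in the table translate fairly directly into code. The following is a sketch of a hand-written encoder for a single code point, meant for study only; in practice the built-in `TextEncoder` does this job. `encodeCodePoint` and `toBinary` are hypothetical names.

```javascript
// Encode one Unicode code point as an array of UTF-8 bytes.
// encodeCodePoint is a hypothetical helper written for illustration.
function encodeCodePoint(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError('not an encodable code point');
  }
  if (cp <= 0x7F) return [cp];                                  // 0xxxxxxx
  if (cp <= 0x7FF) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 110xxxxx 10xxxxxx
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const toBinary = (bytes) => bytes.map((b) => b.toString(2).padStart(8, '0')).join(' ');

console.log(toBinary(encodeCodePoint(0x41)));   // 01000001
console.log(toBinary(encodeCodePoint(0x3042))); // 11100011 10000001 10000010
```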

UTF-16 and UTF-32

Other encodings:

UTF-16: 2 or 4 bytes per code point. Used internally by Windows, Java, and JavaScript.
UTF-32: always 4 bytes per code point. Simple to index but wasteful in storage.

UTF-16 represents code points outside the BMP (most emoji) as surrogate pairs: two 16-bit code units, 4 bytes in total. This is why '🍎'.length === 2 in JavaScript, where length counts UTF-16 code units.
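The surrogate pair arithmetic is short enough to write out. A sketch, with `toSurrogatePair` as a hypothetical helper name:

```javascript
// Split a code point above U+FFFF into its UTF-16 high and low surrogates.
// toSurrogatePair is a hypothetical helper for illustration.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError('code point does not need a surrogate pair');
  }
  const offset = cp - 0x10000;            // 20 bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F34E);
console.log(high.toString(16), low.toString(16)); // d83c df4e

console.log('🍎'.length);      // 2 (UTF-16 code units)
console.log([...'🍎'].length); // 1 (code points, via the string iterator)
```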

Byte order marks (BOM)

A BOM is a short byte sequence that may appear at the start of a file to mark its encoding and byte order:

UTF-8 BOM: EF BB BF (optional).
UTF-16 LE: FF FE.
UTF-16 BE: FE FF.

UTF-8 doesn’t need a BOM, but Windows tools sometimes add one, which breaks shell scripts: the kernel looks for #! in the first two bytes of the file, so #!/bin/bash isn’t recognized after a BOM.
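When a file may or may not carry a UTF-8 BOM, one option is to strip it at the byte level before further processing. A minimal sketch, assuming the file contents are already in a `Uint8Array`; `stripUtf8Bom` is a hypothetical helper name.

```javascript
// Remove a leading UTF-8 BOM (EF BB BF) if present.
// stripUtf8Bom is a hypothetical helper; it returns a view, not a copy.
function stripUtf8Bom(bytes) {
  const hasBom =
    bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
  return hasBom ? bytes.subarray(3) : bytes;
}

const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x23, 0x21]); // BOM + "#!"
console.log(stripUtf8Bom(withBom)[0].toString(16)); // 23, the "#" is now the first byte
```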

Detecting an encoding is unreliable

You cannot determine an encoding from raw bytes alone with certainty:

Pure ASCII bytes are valid ASCII, UTF-8, Shift_JIS, and ISO-8859-1.
For Japanese files, telling UTF-8 from Shift_JIS or EUC-JP relies on validity checks and statistical guessing, which can fail on short inputs.

Libraries (chardet, ICU) can guess, but they aren’t 100% accurate. Always declare the encoding in metadata:

HTML: <meta charset="utf-8">
HTTP: Content-Type: text/html; charset=utf-8
Files: extension conventions or BOM.
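To see why detection is only a guess, take the same two bytes and decode them under two encodings. Both decodes succeed, and only metadata can say which one the author meant. This sketch uses `TextDecoder` for UTF-8 and maps bytes straight to code points for ISO-8859-1, which works because ISO-8859-1 byte values equal the first 256 Unicode code points.

```javascript
// The same bytes are valid in both encodings but mean different text.
const ambiguous = new Uint8Array([0xC3, 0xA9]);

// Strict UTF-8 decode: throws on malformed input instead of guessing.
const asUtf8 = new TextDecoder('utf-8', { fatal: true }).decode(ambiguous);

// ISO-8859-1 decode: each byte is its own code point.
const asLatin1 = String.fromCharCode(...ambiguous);

console.log(asUtf8);   // é
console.log(asLatin1); // Ã©
```

Passing a strict UTF-8 decode is a useful hint, since random non-UTF-8 data rarely validates, but it is not proof.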

Common pitfalls

1. Character count vs byte count

'あ'.length; // 1 (UTF-16 code units; one character here)
new Blob(['あ']).size; // 3 (UTF-8 bytes)

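Database column limits may be counted in characters or in bytes depending on the database and column type, so a string that fits a 10-character limit can still overflow a 10-byte one.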
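A quick way to see all three counts side by side is a small helper like the one below. It is a sketch; `measure` is a hypothetical name, and it uses `TextEncoder` rather than `Blob` so that it runs outside the browser as well.

```javascript
// Report the three common "lengths" of a string.
// measure is a hypothetical helper for illustration.
function measure(text) {
  return {
    utf16Units: text.length,                          // what .length counts
    codePoints: [...text].length,                     // the string iterator walks code points
    utf8Bytes: new TextEncoder().encode(text).length, // size when stored as UTF-8
  };
}

console.log(measure('A'));  // { utf16Units: 1, codePoints: 1, utf8Bytes: 1 }
console.log(measure('あ')); // { utf16Units: 1, codePoints: 1, utf8Bytes: 3 }
console.log(measure('🍎')); // { utf16Units: 2, codePoints: 1, utf8Bytes: 4 }
```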

2. Combining characters

が can be either:

One code point が (U+304C), or
Two code points: か (U+304B) + the combining voiced sound mark (U+3099).

They render identically but compare as different. Normalize before comparison:

'\u304B\u3099'.normalize('NFC') === '\u304C'; // true: か + U+3099 composes to が (U+304C)
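For comparisons, it helps to normalize both sides rather than only one. A sketch of such a comparison, where `equalsNormalized` is a hypothetical helper and NFC is chosen as the form (NFD would work equally well, as long as both sides use the same one):

```javascript
// Compare two strings after bringing both to the same normalization form.
// equalsNormalized is a hypothetical helper for illustration.
function equalsNormalized(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const composed = '\u304C';         // が as one code point
const decomposed = '\u304B\u3099'; // か + combining voiced sound mark

console.log(composed === decomposed);                // false
console.log(equalsNormalized(composed, decomposed)); // true
console.log(composed.normalize('NFD').length);       // 2 (NFD splits it apart again)
```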

3. Invalid UTF-8 bytes

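Truncated sequences, overlong encodings, stray continuation bytes, and UTF-8-encoded surrogates (U+D800–U+DFFF) are all malformed. Validate at trust boundaries, and either reject the input or replace the bad bytes with U+FFFD before processing.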
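One way to validate at a boundary is a strict decode that refuses malformed input. The sketch below relies on the `fatal` option of `TextDecoder`; `decodeStrict` is a hypothetical wrapper, and returning `null` on failure is just one possible convention.

```javascript
// Decode bytes as UTF-8, returning null instead of silently repairing bad input.
// decodeStrict is a hypothetical wrapper for illustration.
function decodeStrict(bytes) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (err) {
    return null; // malformed UTF-8
  }
}

const truncated = new Uint8Array([0xE3, 0x81]);       // first two bytes of あ
const overlong = new Uint8Array([0xC0, 0x80]);        // overlong form of U+0000
const surrogate = new Uint8Array([0xED, 0xA0, 0x80]); // UTF-8-style encoding of U+D800

console.log(decodeStrict(truncated)); // null
console.log(decodeStrict(overlong));  // null
console.log(decodeStrict(surrogate)); // null
console.log(decodeStrict(new Uint8Array([0xE3, 0x81, 0x82]))); // あ

// Without `fatal`, the decoder substitutes U+FFFD instead of failing.
console.log(new TextDecoder().decode(truncated).includes('\uFFFD')); // true
```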

4. ASCII-only legacy systems

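Email headers (RFC 5322) and URLs (RFC 3986) are traditionally ASCII-only. Non-ASCII content needs MIME encoded-words (RFC 2047) in email headers, percent-encoding in URL paths and queries, and Punycode in host names.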
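Each of these ASCII-safe forms can be produced in a few lines. The sketch below assumes a runtime that provides `btoa` and the WHATWG `URL` class (current browsers and recent Node.js versions). `toEncodedWord` is a hypothetical helper, and it ignores the length limit that RFC 2047 places on a single encoded-word.

```javascript
// Percent-encoding for URL paths and query values: UTF-8 bytes as %XX.
console.log(encodeURIComponent('あ')); // %E3%81%82

// MIME encoded-word with base64 ("B" encoding) of the UTF-8 bytes.
// toEncodedWord is a hypothetical helper; long text would need to be split.
function toEncodedWord(text) {
  const bytes = new TextEncoder().encode(text);
  const base64 = btoa(String.fromCharCode(...bytes));
  return `=?UTF-8?B?${base64}?=`;
}
console.log(toEncodedWord('あ')); // =?UTF-8?B?44GC?=

// Punycode for host names: the URL parser converts them on its own.
console.log(new URL('http://bücher.example/').hostname); // xn--bcher-kva.example
```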

Summary

ASCII = 7 bits, English only.
Unicode assigns numbers; encoding turns them into bytes.
UTF-8 is variable-length, ASCII-compatible, today’s standard.
Character count and byte count are different.
Always declare encoding in metadata.

Inspecting text as binary or hex comes up when debugging encoding issues, parsing network protocols, or learning low-level programming. The text-to-binary tool on this site shows ASCII text in multiple bases at once — A lands at 0x41 and you can see it immediately. Multi-byte characters (Japanese, etc.) appear as their UTF-8 byte sequences, so character count and byte count diverge once you leave ASCII.