Counting characters and words: rules differ between English and Japanese
“5 sheets of 400-character manuscript paper”, “under 2,000 words”, “140-character limit” — counting rules vary by context. This article surveys the common ones.
Characters vs words
The default unit differs:
- English — typically counted in words.
- Japanese — typically counted in characters.
“Hello world” is 2 words in English, or 11 characters counting the space (10 without it).
Counting words (English)
Whitespace-separated split is the default:
"The quick brown fox jumps over the lazy dog."
→ 9 words
Edge cases:
- Hyphenated — well-being: one word or two?
- Apostrophes — it's: one word?
- Number-with-unit — 100m: one or two?
Microsoft Word’s defaults: a hyphenated compound counts as one word, and so does a word containing an apostrophe.
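The whitespace rule above can be sketched in a few lines. This is a minimal illustration, not how any particular tool counts; `countWords` is a hypothetical helper name. Because it splits only on whitespace, hyphenated compounds, contractions, and number-with-unit tokens each count as one word.

```javascript
// Count words by splitting on runs of whitespace.
// Tokens such as "well-being", "it's" and "100m" contain no whitespace,
// so each one counts as a single word.
function countWords(text) {
  const trimmed = text.trim();
  if (trimmed === '') return 0; // avoid counting [''] as one word
  return trimmed.split(/\s+/).length;
}

console.log(countWords('The quick brown fox jumps over the lazy dog.')); // 9
console.log(countWords('well-being')); // 1
console.log(countWords("it's 100m away")); // 3
```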
Counting characters (Japanese)
Three levels of granularity:
Bytes
File-size units. In UTF-8:
- ASCII — 1 byte.
- Latin extensions, Greek, Cyrillic — 2 bytes.
- Kanji, hiragana, katakana (including half-width kana) — 3 bytes.
- Most emoji — 4 bytes.
“あいう” is 9 bytes.
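The byte counts above can be checked with the standard `TextEncoder`, which always encodes to UTF-8. A small sketch, where `utf8ByteLength` is a hypothetical helper name and the strings are written as escapes so the code points are unambiguous:

```javascript
// TextEncoder always produces UTF-8, so the array length is the byte count.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log(utf8ByteLength('abc'));                // 3 (ASCII, 1 byte each)
console.log(utf8ByteLength('\u00E9'));             // 2 (é, a Latin extension)
console.log(utf8ByteLength('\u3042\u3044\u3046')); // 9 (あいう, 3 bytes each)
console.log(utf8ByteLength('\uFF71'));             // 3 (half-width katakana ｱ)
console.log(utf8ByteLength('\u{1F34E}'));          // 4 (🍎)
```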
Code points
Unicode scalar values. String.length in JavaScript counts UTF-16 code units instead, and the two differ for any character outside the Basic Multilingual Plane:
- “あいう” — 3 code points.
- “🍎” — 1 code point but 2 UTF-16 code units.
Grapheme clusters (visible characters)
What a human reads as “one character”:
- “が” (か + combining dakuten) — 1 grapheme cluster (2 code points).
- “👨‍👩‍👧” (family emoji) — 1 grapheme cluster (5 code points: three emoji joined by two zero-width joiners).
This is usually what users mean by “character count”.
JavaScript’s character count
JavaScript counts UTF-16 code units by default:
'hello'.length; // 5
'あいう'.length; // 3
'🍎'.length; // 2 ← surprise
'👨‍👩‍👧'.length; // 8 ← surrogate pairs + ZWJ
“Count emoji as one character”:
[...'🍎'].length; // 1 (iterator splits by code point)
[...'👨‍👩‍👧'].length; // 5 (still splits the ZWJ-joined sequence)
For grapheme clusters use Intl.Segmenter:
const seg = new Intl.Segmenter('ja', { granularity: 'grapheme' });
[...seg.segment('👨‍👩‍👧')].length; // 1
Twitter (X) character count
Twitter’s own rules:
- Full-width characters (CJK, hiragana, etc.) — count as 2.
- Half-width (ASCII) — count as 1.
- The limit is a weighted total of 280: up to 140 full-width or 280 half-width characters.
URLs are auto-shortened to a fixed 23 characters (via t.co).
Images and videos don’t count.
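A rough sketch of the weighting described above. It is deliberately simplified and is not Twitter's actual algorithm: the cut-off at U+10FF for "weight 1" characters, the URL pattern, and the names `weightedLength`, `MAX_WEIGHT` and `URL_WEIGHT` are all assumptions made for illustration. The real rules also treat certain punctuation ranges and emoji specially.

```javascript
const MAX_WEIGHT = 280; // weighted limit from the rules above
const URL_WEIGHT = 23;  // every URL counts as a flat 23

function weightedLength(text) {
  let weight = 0;
  // Strip each URL and charge the fixed weight for it instead.
  // Assumption: anything starting with http:// or https:// up to the next
  // whitespace is a URL.
  const withoutUrls = text.replace(/https?:\/\/\S+/g, () => {
    weight += URL_WEIGHT;
    return '';
  });
  // for...of iterates by code point, so surrogate pairs are not split.
  for (const ch of withoutUrls) {
    // Assumption: code points up to U+10FF weigh 1, everything else 2.
    weight += ch.codePointAt(0) <= 0x10ff ? 1 : 2;
  }
  return weight;
}

console.log(weightedLength('hello'));              // 5
console.log(weightedLength('\u3042\u3044\u3046')); // 6 (あいう, 2 each)
console.log(weightedLength('see https://example.com/a/very/long/path')); // 27
console.log(weightedLength('\u3042'.repeat(140)) <= MAX_WEIGHT);         // true
```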
SMS
SMS encoding controls the limit:
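- GSM 7-bit (basic Latin letters, digits, common punctuation) — 160 chars / message.
- UCS-2 (incl. Japanese, emoji) — 70 chars / message.
- Past the limit, the message splits into concatenated SMS parts, each slightly shorter (153 or 67 chars) because it carries a header.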
The often-quoted “Japanese SMS limit of 70 characters” is the UCS-2 case.
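The encoding rule can be sketched as a segment estimator. Two assumptions are baked in and should be treated as such: printable ASCII stands in for the GSM 7-bit alphabet (the real alphabet differs, and some symbols such as brackets take two slots), and concatenated parts hold 153 (GSM 7-bit) or 67 (UCS-2) characters because each part carries a header. `smsSegments` is a hypothetical helper name.

```javascript
function smsSegments(text) {
  // Assumption: printable ASCII plus CR/LF approximates the GSM 7-bit alphabet.
  const isSevenBit = /^[\x20-\x7E\r\n]*$/.test(text);
  // UCS-2 limits apply to UTF-16 code units, which is what .length returns.
  // An emoji outside the BMP therefore uses 2 of the 70 slots.
  const units = text.length;
  const single = isSevenBit ? 160 : 70;  // fits in one message
  const perPart = isSevenBit ? 153 : 67; // per part once concatenated
  let segments = 0;
  if (units > 0) {
    segments = units <= single ? 1 : Math.ceil(units / perPart);
  }
  return { encoding: isSevenBit ? 'GSM-7' : 'UCS-2', units, segments };
}

console.log(smsSegments('a'.repeat(160)).segments);      // 1
console.log(smsSegments('a'.repeat(161)).segments);      // 2
console.log(smsSegments('\u3042'.repeat(70)).segments);  // 1
console.log(smsSegments('\u3042'.repeat(71)).segments);  // 2
```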
Word’s “character count”
Microsoft Word’s count dialog reports:
- Words (whitespace-separated).
- Characters (no spaces).
- Characters (with spaces).
- Paragraphs.
- Lines.
For “under 2,000 characters” in academic or business writing, the “characters with spaces” number is usually expected.
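To see how the word and character figures relate, here is a small sketch that computes three of them for one string. It only approximates what a word processor reports; `basicCounts` is a hypothetical name, characters are counted by code point, and how a given tool treats line breaks is not modelled.

```javascript
function basicCounts(text) {
  const trimmed = text.trim();
  const words = trimmed === '' ? 0 : trimmed.split(/\s+/).length;
  // Spread splits by code point, so an emoji counts once rather than twice.
  const withSpaces = [...text].length;
  // \s also matches the full-width space U+3000 used in Japanese text.
  const noSpaces = [...text.replace(/\s/g, '')].length;
  return { words, withSpaces, noSpaces };
}

console.log(basicCounts('Hello world'));
// words: 2, withSpaces: 11, noSpaces: 10
```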
Publishing / printing (Japan)
- 400-character manuscript paper — 400 characters per sheet.
- 20 columns × 20 rows.
- Punctuation occupies one cell.
“5 sheets” = 2,000 characters. This is a common unit for commissioning magazine and book manuscripts.
SEO and content length
Common SEO rules of thumb (Google itself does not publish a preferred length):
- Very short (under 300 chars in Japanese, under 200 words in English) — weak SEO signal.
- Sweet spot — 1,500–2,500 characters / 800–1,500 words.
- Too long — readers don’t finish.
“Length fits the content.” Padding hurts.
Translation pricing
Translation industry conventions:
- Source-language character price — JP→EN priced per Japanese source character.
- Target-language word price — priced per English target word.
Around ¥10 per source character is common for JP→EN. At that rate, translating a 1,000-character source text costs ¥10,000.
Database column lengths
VARCHAR(255) -- 255 of what?
In most modern DBs, this means characters, not bytes. But MySQL sizes the column in bytes from the charset's maximum bytes per character:
- utf8mb4 — 255 × 4 = 1,020 bytes.
- latin1 — 255 × 1 = 255 bytes.
Designs need to be clear about character vs byte units.
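Application code can make the character-versus-byte distinction explicit before a value reaches the database. The sketch below is illustrative only: `columnFootprint` and its parameters are hypothetical, characters are counted as code points, and the byte figure assumes the value is stored as UTF-8.

```javascript
// maxChars: the N in VARCHAR(N); bytesPerChar: charset maximum (4 for utf8mb4).
function columnFootprint(value, maxChars, bytesPerChar) {
  const chars = [...value].length;                       // code points
  const bytes = new TextEncoder().encode(value).length;  // actual UTF-8 bytes
  return {
    chars,
    bytes,
    fitsChars: chars <= maxChars,
    worstCaseBytes: maxChars * bytesPerChar,
  };
}

const r = columnFootprint('\u3042'.repeat(255), 255, 4);
console.log(r.chars);          // 255
console.log(r.bytes);          // 765 (3 bytes per hiragana)
console.log(r.fitsChars);      // true
console.log(r.worstCaseBytes); // 1020
```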
Reading speed
Rules of thumb:
- Japanese — 400–600 chars/min (normal pace).
- English — 200–250 words/min.
- Trained speed-reader — 1,000–1,500 chars/min.
“5-minute read” in Japanese ≈ 2,000–3,000 characters.
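The estimate above is a simple division, sketched here as a range. `readingMinutes` is a hypothetical helper, and the rates passed in are the rules of thumb from the list, not measured values.

```javascript
// count: characters (Japanese) or words (English).
// slowRate / fastRate: units per minute at the ends of the normal range.
function readingMinutes(count, slowRate, fastRate) {
  return { fastest: count / fastRate, slowest: count / slowRate };
}

console.log(readingMinutes(2400, 400, 600)); // fastest: 4, slowest: 6 (Japanese chars)
console.log(readingMinutes(1000, 200, 250)); // fastest: 4, slowest: 5 (English words)
```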
Newlines and paragraphs
Whether they count varies by tool:
- Newline \n as 1 character?
- Empty lines included?
- Paragraphs reported separately?
When a “character limit X” is specified, verify the rule.
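One way to make these choices explicit is to expose them as options. The sketch below is an illustration under stated assumptions, not a description of any specific tool: the option names `countNewlines` and `countBlankLines` are invented, Windows line endings are normalized so that \r\n counts as one newline, and a paragraph is taken to be any non-blank line.

```javascript
function countChars(text, { countNewlines = true, countBlankLines = true } = {}) {
  // Treat CRLF and lone CR as a single newline.
  let normalized = text.replace(/\r\n?/g, '\n');
  // Optionally collapse runs of newlines so empty lines add nothing.
  if (!countBlankLines) normalized = normalized.replace(/\n{2,}/g, '\n');
  if (!countNewlines) normalized = normalized.replace(/\n/g, '');
  return [...normalized].length; // count by code point
}

function countParagraphs(text) {
  return text
    .replace(/\r\n?/g, '\n')
    .split(/\n+/)
    .filter((line) => line.trim() !== '').length;
}

console.log(countChars('ab\r\ncd'));                              // 5
console.log(countChars('ab\r\ncd', { countNewlines: false }));    // 4
console.log(countChars('ab\n\ncd'));                              // 6
console.log(countChars('ab\n\ncd', { countBlankLines: false }));  // 5
console.log(countParagraphs('ab\n\ncd'));                         // 2
```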
Summary
- English defaults to word count, Japanese to character count.
- Character count has three layers — bytes, code points, grapheme clusters.
- Twitter — full-width 2, half-width 1.
- SMS — 70 or 160 chars depending on encoding.
- Translation and publishing have their own conventions.
For arbitrary text counting, the word counter on this site reports characters, words, and lines.