Counting characters and words: rules differ between English and Japanese

4 min read

“5 sheets of 400-character manuscript paper”, “under 2,000 words”, “140-character limit” — counting rules vary by context. This article surveys the common ones.

Characters vs words

The default unit differs:

  • English — typically counted in words.
  • Japanese — typically counted in characters.

“Hello world” is 2 words in English. As characters it is 11 with the space or 10 without, depending on whether spaces count.

Counting words (English)

Splitting on whitespace is the default:

"The quick brown fox jumps over the lazy dog."
→ 9 words

Edge cases:

  • Hyphenated (well-being): one word? Two?
  • Apostrophes (it's): one word?
  • Number with unit (100m): one or two?

Microsoft Word’s defaults: a hyphenated compound counts as one word, and so does a word joined by an apostrophe.
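The whitespace rule above can be sketched in a few lines. This is a minimal illustration, not how any particular word processor implements its counter, and the function name countWords is hypothetical.

```javascript
// Count words by splitting on runs of whitespace.
// "well-being" and "it's" contain no whitespace, so each counts as one word,
// which matches the hyphen and apostrophe defaults described above.
function countWords(text) {
  const trimmed = text.trim();
  if (trimmed === '') return 0; // ''.split(/\s+/) would give [''], not []
  return trimmed.split(/\s+/).length;
}

countWords('The quick brown fox jumps over the lazy dog.'); // 9
countWords("It's good for your well-being.");               // 5
```

A tool that wants “100m” to count as two words would need an extra tokenizing rule; plain whitespace splitting always treats it as one.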

Counting characters (Japanese)

Three levels of granularity:

Bytes

File-size units. In UTF-8:

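  • ASCII: 1 byte.
  • Latin letters with diacritics, Greek, Cyrillic: 2 bytes.
  • Hiragana, katakana (full-width and half-width), and common kanji: 3 bytes.
  • Most emoji, and rare kanji outside the Basic Multilingual Plane: 4 bytes.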

“あいう” is 9 bytes.
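To check byte counts like these, the standard TextEncoder API (available in browsers and Node.js) encodes a string as UTF-8. The helper name utf8ByteLength is hypothetical.

```javascript
// TextEncoder always encodes to UTF-8 and returns a Uint8Array,
// so the array length is the byte count.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

utf8ByteLength('abc');        // 3 (ASCII, 1 byte each)
utf8ByteLength('\u00e9');     // 2 (é, a Latin letter with a diacritic)
utf8ByteLength('あいう');     // 9 (3 bytes each)
utf8ByteLength('\u{1F34E}');  // 4 (the apple emoji)
```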

Code points

One number per Unicode character (scalar value). JavaScript’s String.length counts UTF-16 code units instead, which can differ from the code point count:

  • “あいう”: 3 code points.
  • “🍎”: 1 code point but 2 UTF-16 code units.

Grapheme clusters (visible characters)

What a human reads as “one character”:

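  • “が” written as か + combining dakuten (U+3099): 1 grapheme cluster (2 code points). The precomposed が (U+304C) is a single code point.
  • “👨‍👩‍👧” (family emoji): 1 grapheme cluster (5 code points: three emoji joined by two zero-width joiners).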

This is usually what users mean by “character count”.

JavaScript’s character count

JavaScript counts UTF-16 code units by default:

'hello'.length; // 5
'あいう'.length; // 3
'🍎'.length; // 2 ← surprise
'👨‍👩‍👧'.length; // 8 ← surrogate pairs + ZWJ

To count an emoji as one character, spread the string, which iterates by code point:

[...'🍎'].length; // 1 (iterator splits by code point)
[...'👨‍👩‍👧'].length; // 5 (still splits the ZWJ-joined sequence)

For grapheme clusters use Intl.Segmenter:

const seg = new Intl.Segmenter('ja', { granularity: 'grapheme' });
[...seg.segment('👨‍👩‍👧')].length; // 1
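The three JavaScript counts above can be reported side by side. The sketch below assumes a runtime that ships Intl.Segmenter (current browsers and recent Node.js versions); the function name characterCounts and the shape of its return value are hypothetical.

```javascript
// Report one string at three granularities.
function characterCounts(text, locale = 'ja') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return {
    codeUnits: text.length,                          // UTF-16 code units
    codePoints: [...text].length,                    // string iterator yields code points
    graphemes: [...segmenter.segment(text)].length,  // user-perceived characters
  };
}

// か followed by the combining dakuten U+3099
characterCounts('か\u3099'); // { codeUnits: 2, codePoints: 2, graphemes: 1 }
```

When a form says “up to N characters”, the graphemes value is the one that usually matches what the user sees.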

Twitter (X) character count

Twitter weights characters rather than counting them one for one:

  • Full-width characters (CJK, hiragana, etc.): count as 2.
  • Half-width characters (ASCII): count as 1.
  • The limit is a weighted total of 280, that is, 140 full-width or 280 half-width characters.

URLs are auto-shortened to a fixed 23 characters (via t.co). Images and videos don’t count.
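A rough sketch of this weighting follows. It is a simplification under stated assumptions, not Twitter’s actual algorithm: code points below U+1100 are treated as half-width and everything else as full-width, and a URL is anything matching a simple http(s) pattern. The real rules also treat some punctuation as weight 1 and handle emoji sequences specially. The name tweetWeight is hypothetical.

```javascript
// Assumption: a URL is any run of non-whitespace starting with http:// or https://.
const URL_PATTERN = /https?:\/\/\S+/g;
const URL_WEIGHT = 23; // fixed weight per URL, as described above

function tweetWeight(text) {
  const urls = text.match(URL_PATTERN) || [];
  const rest = text.replace(URL_PATTERN, '');
  let weight = urls.length * URL_WEIGHT;
  for (const ch of rest) {
    // Assumption: code points below U+1100 weigh 1, all others weigh 2.
    weight += ch.codePointAt(0) < 0x1100 ? 1 : 2;
  }
  return weight;
}

tweetWeight('hello');   // 5
tweetWeight('あいう');  // 6
tweetWeight('see https://example.com/a/very/long/path'); // 4 + 23 = 27
```

A post fits when the weight is at most 280.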

SMS

SMS encoding controls the limit:

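  • GSM 7-bit (basic Latin letters, digits, and common symbols): 160 chars / message.
  • UCS-2 (needed for Japanese, emoji, and anything else outside the GSM alphabet): 70 chars / message.
  • Past the limit, the message splits into concatenated SMS parts. Each part gives up room for a header, leaving 153 (GSM 7-bit) or 67 (UCS-2) chars per part.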

This is where the familiar “70-character limit for Japanese SMS” comes from.
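The encoding choice can be sketched as follows: if every character is in the GSM 7-bit alphabet, the message uses 160-character segments, otherwise it falls back to UCS-2 and 70. Two details here are not from the text above: concatenated parts are assumed to carry 153 (GSM 7-bit) or 67 (UCS-2) characters because of the per-part header, and characters from the GSM extension table are assumed to cost two slots. The function name smsSegments is hypothetical, and real gateways may split differently at part boundaries.

```javascript
// GSM 03.38 basic alphabet: one 7-bit slot each.
const GSM_BASIC =
  '@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !"#¤%&\'()*+,-./0123456789:;<=>?' +
  '¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà';
// Extension table (form feed omitted): escape + character, so two slots each.
const GSM_EXTENDED = '^{}\\[~]|€';

function smsSegments(text) {
  let slots = 0;
  let isGsm = true;
  for (const ch of text) {
    if (GSM_BASIC.includes(ch)) slots += 1;
    else if (GSM_EXTENDED.includes(ch)) slots += 2;
    else { isGsm = false; break; }
  }
  if (isGsm) {
    const segments = slots <= 160 ? 1 : Math.ceil(slots / 153);
    return { encoding: 'GSM-7', units: slots, segments };
  }
  // UCS-2 limits are in 16-bit units, so an emoji (2 code units) costs 2.
  const units = text.length;
  const segments = units <= 70 ? 1 : Math.ceil(units / 67);
  return { encoding: 'UCS-2', units, segments };
}

smsSegments('hello');          // { encoding: 'GSM-7', units: 5, segments: 1 }
smsSegments('あ'.repeat(71));  // { encoding: 'UCS-2', units: 71, segments: 2 }
```

A single Japanese character anywhere in an otherwise ASCII message switches the whole message to UCS-2, which is why mixed messages hit the limit sooner than expected.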

Word’s “character count”

Microsoft Word’s count dialog reports:

  • Words (whitespace-separated).
  • Characters (no spaces).
  • Characters (with spaces).
  • Paragraphs.
  • Lines.

For a limit such as “under 2,000 characters” in academic or business writing, the “characters (with spaces)” figure is usually the one expected.

Publishing / printing (Japan)

  • 400-character manuscript paper — 400 characters per sheet.
  • 20 columns × 20 rows.
  • Punctuation occupies one cell.

“5 sheets” = 2,000 characters. Sheets are a common unit when commissioning manuscripts for magazines and books.
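Because every paragraph starts on a new line of the grid, a text usually needs more sheets than its character count divided by 400 suggests. The sketch below estimates the sheet count under simplifying assumptions: each visible character (grapheme cluster) takes one cell, each line break starts a new line, and paragraph indents and line-breaking (kinsoku) rules are ignored. The function and parameter names are hypothetical.

```javascript
// Estimate manuscript-paper sheets: 20 cells per line, 20 lines per sheet.
function manuscriptSheets(text, cellsPerLine = 20, linesPerSheet = 20) {
  const segmenter = new Intl.Segmenter('ja', { granularity: 'grapheme' });
  let lines = 0;
  for (const paragraph of text.split(/\r\n|\r|\n/)) {
    const cells = [...segmenter.segment(paragraph)].length;
    // An empty paragraph still occupies one blank line.
    lines += Math.max(1, Math.ceil(cells / cellsPerLine));
  }
  return Math.ceil(lines / linesPerSheet);
}

manuscriptSheets('あ'.repeat(2000)); // 5 (exactly fills 5 sheets)
manuscriptSheets('あ'.repeat(401));  // 2 (one character spills onto a second sheet)
```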

SEO and content length

Common SEO rules of thumb (Google’s own guidance does not set a preferred length):

  • Very short (under 300 chars in Japanese, under 200 words in English) — weak SEO signal.
  • Sweet spot — 1,500–2,500 characters / 800–1,500 words.
  • Too long — readers don’t finish.

“Length fits the content.” Padding hurts.

Translation pricing

Translation industry conventions:

  • Source-language character price — JP→EN priced per Japanese source character.
  • Target-language word price — priced per English target word.

At a rate of ¥10 per source character for JP→EN (actual rates vary by vendor and field), a 1,000-character text costs 1,000 × ¥10 = ¥10,000.

Database column lengths

VARCHAR(255)  -- 255 of what?

In MySQL and PostgreSQL, this means characters, not bytes; Oracle’s VARCHAR2 and SQL Server’s varchar default to bytes. Even with a character limit, MySQL sizes the maximum byte length by charset:

  • utf8mb4 — 255 × 4 = 1,020 bytes.
  • latin1 — 255 × 1 = 255 bytes.

Designs need to be clear about character vs byte units.
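An application can validate input against both interpretations before it reaches the database. The sketch below assumes the column stores UTF-8 and that a character limit means code points; the helper name fitsColumn and its return shape are hypothetical.

```javascript
// Check a value against VARCHAR(limit) read as characters and as bytes.
function fitsColumn(value, limit) {
  const chars = [...value].length;                       // code points
  const bytes = new TextEncoder().encode(value).length;  // UTF-8 bytes
  return {
    chars,
    bytes,
    fitsAsChars: chars <= limit,
    fitsAsBytes: bytes <= limit,
  };
}

fitsColumn('あ'.repeat(100), 255);
// { chars: 100, bytes: 300, fitsAsChars: true, fitsAsBytes: false }
```

The same 100-character Japanese string passes a character-based limit of 255 and fails a byte-based one, which is the ambiguity the schema needs to settle.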

Reading speed

Rules of thumb:

  • Japanese — 400–600 chars/min (normal pace).
  • English — 200–250 words/min.
  • Trained speed-reader — 1,000–1,500 chars/min.

“5-minute read” in Japanese ≈ 2,000–3,000 characters.
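The estimate is a division by the reading speed. The sketch below uses the 400–600 chars/min range from the list above as defaults; the function name and return shape are hypothetical.

```javascript
// Estimated reading time as a range, from a fast and a slow reading speed.
function readingTimeRange(charCount, slowPerMinute = 400, fastPerMinute = 600) {
  return {
    minMinutes: charCount / fastPerMinute,
    maxMinutes: charCount / slowPerMinute,
  };
}

readingTimeRange(2400); // { minMinutes: 4, maxMinutes: 6 }
```

For English, pass a word count with 200 and 250 as the two speeds.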

Newlines and paragraphs

Whether they count varies by tool:

  • Newline \n as 1 character?
  • Empty lines included?
  • Paragraphs reported separately?

When a limit of X characters is specified, check which of these rules applies.
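One way to make these choices explicit is to expose them as options. The sketch below counts code points, normalizes Windows line endings so that a line break is always one character, and lets the caller decide whether newlines and empty lines count. The function and option names are hypothetical.

```javascript
function countWithOptions(text, { countNewlines = true, countEmptyLines = true } = {}) {
  // Normalize CRLF and lone CR so every line break is a single "\n".
  const normalized = text.replace(/\r\n?/g, '\n');
  let lines = normalized.split('\n');
  if (!countEmptyLines) {
    lines = lines.filter((line) => line.trim() !== '');
  }
  // Characters inside the lines, counted as code points.
  const body = lines.reduce((sum, line) => sum + [...line].length, 0);
  // One newline between each pair of remaining lines.
  const newlines = countNewlines ? Math.max(lines.length - 1, 0) : 0;
  return { characters: body + newlines, lines: lines.length };
}

countWithOptions('ab\r\ncd');                            // { characters: 5, lines: 2 }
countWithOptions('ab\r\ncd', { countNewlines: false });  // { characters: 4, lines: 2 }
countWithOptions('ab\n\ncd', { countEmptyLines: false }); // { characters: 5, lines: 2 }
```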

Summary

  • English defaults to word count, Japanese to character count.
  • Character count has three layers — bytes, code points, grapheme clusters.
  • Twitter — full-width 2, half-width 1.
  • SMS — 70 or 160 chars depending on encoding.
  • Translation and publishing have their own conventions.

For arbitrary text counting, the word counter on this site reports characters, words, and lines.