Punycode and internationalized domain names: what happens behind a Unicode URL

When you use a domain like 日本.jp, the URL bar shows Japanese, but DNS receives an ASCII representation called Punycode. This article walks through how that conversion works and the security implications it brings.

DNS only handles ASCII

DNS was designed for ASCII:

  • Hostname characters: letters, digits, and hyphens, with dots separating the labels.
  • Case-insensitive.
  • Up to 63 characters per label.
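
The rules above can be checked with a short function. This is a minimal sketch in Python; the name is_ldh_hostname is made up for illustration, and the rule that a label may not start or end with a hyphen comes from the general hostname rules rather than from the list above.

```python
import re

# One DNS label: 1 to 63 ASCII letters, digits, or hyphens.
# Assumption beyond the list above: no hyphen at the start or end of a label.
LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_hostname(name: str) -> bool:
    """Return True if every dot-separated label passes the ASCII-only rules."""
    if name.endswith("."):
        name = name[:-1]  # a single trailing dot (the DNS root) is allowed
    return all(LABEL_RE.fullmatch(label) for label in name.split("."))


print(is_ldh_hostname("example.com"))    # True
print(is_ldh_hostname("xn--wgv71a.jp"))  # True
print(is_ldh_hostname("日本.jp"))        # False
```

Case does not matter to the check, which matches the case-insensitive rule: upper- and lowercase letters are both accepted.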

You cannot put Unicode (Japanese, Cyrillic, accented Latin, …) directly into a hostname. IDNA (Internationalizing Domain Names in Applications) is the workaround, and Punycode is the encoding it relies on.

Punycode: encode Unicode using ASCII

Punycode (RFC 3492) is a reversible encoding that expresses any Unicode string in ASCII letters, digits, and hyphens.

Original domain    Punycode
日本.jp            xn--wgv71a.jp
münchen.de         xn--mnchen-3ya.de
пример.рф          xn--e1afmkfd.xn--p1ai

The xn-- prefix marks a Punycode-encoded label.
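
These conversions can be reproduced with Python's built-in idna codec. Treat this as a sketch: that codec implements the older IDNA 2003 rules, so for some inputs (ß, for example) it can disagree with what current browsers and registries produce. The helper names to_ascii and to_unicode are chosen here for illustration.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain to its ASCII (xn--) form, label by label."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain back to Unicode."""
    return domain.encode("ascii").decode("idna")


for domain in ["日本.jp", "münchen.de", "пример.рф"]:
    ascii_form = to_ascii(domain)
    # Print the original, the encoded form, and the round-tripped result.
    print(domain, "->", ascii_form, "->", to_unicode(ascii_form))
```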

Encoding outline

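Encoding a label takes three steps:

  1. Copy the ASCII characters, in their original order, to the front of the output, followed by a hyphen if there were any.
  2. Encode each non-ASCII character, in ascending code point order, as a compact number that captures both its code point and its position in the string.
  3. Prefix the result with xn--. Strictly speaking, this step belongs to IDNA rather than to Punycode itself.

For mixed input like mañana.com, the ASCII characters ma and ana are kept as maana; the position and code point of ñ are encoded compactly in the suffix, giving xn--maana-pta.com.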
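
Python's standard library exposes the raw algorithm as the punycode codec, which makes the steps visible. A small sketch follows; the helper name to_a_label is hypothetical, and it skips the normalization and validity checks that real IDNA processing performs before encoding.

```python
def to_a_label(label: str) -> str:
    """Encode one label; pure-ASCII labels are returned unchanged."""
    if label.isascii():
        return label
    # Steps 1 and 2: the codec copies the ASCII characters, then appends the
    # encoded code point and position of each non-ASCII character.
    encoded = label.encode("punycode").decode("ascii")
    # Step 3: add the prefix.
    return "xn--" + encoded


print("mañana".encode("punycode"))       # b'maana-pta'
print(to_a_label("mañana"))              # xn--maana-pta
print(b"maana-pta".decode("punycode"))   # mañana
```

The ASCII characters come out first (maana), then the hyphen delimiter, then pta, which carries both the code point of ñ and where to reinsert it.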

Homograph attacks: same shape, different character

Internationalized domain names enable a class of attack: registering a domain that looks visually identical to a real one.

  • apple.com (Latin)
  • аpple.com (Cyrillic а instead of Latin a)

The second is registered as xn--pple-43d.com. A user who thinks they’re visiting Apple ends up somewhere else.
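
The difference is invisible on screen but easy to see in code. The sketch below uses Python's unicodedata module and the built-in idna codec; the Cyrillic letter is written as the escape \u0430 so that it cannot be confused with the Latin one.

```python
import unicodedata

latin = "apple.com"
lookalike = "\u0430pple.com"  # first letter is CYRILLIC SMALL LETTER A

print(latin == lookalike)                 # False
print(unicodedata.name(latin[0]))         # LATIN SMALL LETTER A
print(unicodedata.name(lookalike[0]))     # CYRILLIC SMALL LETTER A
print(lookalike.encode("idna").decode())  # xn--pple-43d.com
```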

Browsers defend against this by showing Punycode form when a single label mixes scripts.

How browsers decide what to display

A simplified version of the rules Chrome and Firefox apply:

  • Pure ASCII, or one script throughout (all Japanese, all Cyrillic) → show Unicode.
  • Multiple scripts mixed within one label → show Punycode.
  • Some specific safe combinations (Japanese + ASCII digits, etc.) → show Unicode.

That’s why 日本.jp displays as 日本.jp, but аpple.com (Cyrillic а) is shown as xn--pple-43d.com.
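
The decision can be sketched in a few lines of Python. This is an illustration of the simplified rules only, not how Chrome or Firefox implement them: real browsers use Unicode script properties plus further checks. Here the script is guessed from the first word of each character's Unicode name, a rough stand-in because the standard library has no script lookup. The names JAPANESE, rough_script, and display_form are hypothetical.

```python
import unicodedata
from typing import Optional

# Scripts that normally appear together in Japanese text (an illustrative
# "safe combination", not a browser's real allow-list).
JAPANESE = {"CJK", "HIRAGANA", "KATAKANA"}


def rough_script(ch: str) -> Optional[str]:
    """Guess a script from the first word of the character's Unicode name."""
    if ch.isdigit() or ch == "-":
        return None  # treat digits and hyphens as script-neutral
    return unicodedata.name(ch).split()[0]


def display_form(label: str) -> str:
    """Return the label as Unicode if it looks single-script, else as Punycode."""
    scripts = {s for s in map(rough_script, label) if s is not None}
    if len(scripts) <= 1 or scripts <= JAPANESE:
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


print(display_form("日本"))         # 日本
print(display_form("\u0430pple"))   # xn--pple-43d
```

A label made entirely of lookalike characters from one script passes this check, which is one reason real browsers layer more rules on top.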

Email Address Internationalization (EAI)

Internationalized email addresses (RFC 6530) follow the same pattern:

  • Local part (left of @) — Unicode allowed.
  • Domain part — IDNA / Punycode.

EAI-aware mail servers are still uncommon, so many systems stick to ASCII.

Where this comes up in code

1. Domain input forms

If users type 日本.jp, the value you store usually needs to be Punycode (xn--wgv71a.jp). In browsers, the URL object already returns the Punycode form from properties such as hostname, host, and href.
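
On the server side, a normalization step might look like the following sketch. It assumes Python's built-in idna codec (IDNA 2003 rules) is acceptable for your inputs; the function name to_storage_form is hypothetical.

```python
from typing import Optional


def to_storage_form(user_input: str) -> Optional[str]:
    """Return the lowercase Punycode form of a typed domain, or None if invalid."""
    domain = user_input.strip().lower()
    try:
        return domain.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty labels, labels longer than 63 characters, and
        # characters the codec does not allow.
        return None


print(to_storage_form(" 日本.JP "))     # xn--wgv71a.jp
print(to_storage_form("example.com"))   # example.com
```

Lowercasing before encoding keeps one stored value per domain, so lookups and uniqueness checks do not depend on how the user typed it.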

2. Email delivery

When an address has a Unicode domain, the domain part must be Punycode-encoded before it goes over SMTP, unless the servers involved support the SMTPUTF8 extension.
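
A sketch of that conversion in Python, assuming the address has already been validated and that only the domain part needs converting. The helper name ascii_domain_address is hypothetical, and the built-in idna codec follows the older IDNA 2003 rules.

```python
def ascii_domain_address(address: str) -> str:
    """Punycode-encode the domain part of an address; leave the local part alone."""
    # Split on the last @, since the domain is always to its right.
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("address has no @")
    return local + "@" + domain.encode("idna").decode("ascii")


print(ascii_domain_address("info@münchen.de"))  # info@xn--mnchen-3ya.de
```

The local part is passed through untouched, so an address with a Unicode local part still needs an EAI-aware server.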

3. TLS certificates

Let’s Encrypt and other ACME-based CAs issue certificates against the Punycode form of IDN domains.

4. Search engine indexing

Google can treat the IDN and Punycode forms of the same page as separate URLs. Point the canonical link at one form to avoid duplication.

Summary

  • DNS only carries ASCII; Punycode encodes Unicode into ASCII.
  • The xn-- prefix marks a Punycode label.
  • Homograph attacks led browsers to display mixed-script labels as Punycode.
  • Code that handles domains, email, TLS, or SEO touches Punycode somewhere.

To convert between Unicode and Punycode forms, the Punycode converter on this site does both directions.