Converting between Markdown and HTML: GFM, sanitization, attribute loss
“Generate blog HTML from Markdown” / “Migrate existing HTML articles to Markdown” — bidirectional conversion comes up routinely. Both directions have pitfalls.
Markdown → HTML: not lossless
Markdown lacks features that HTML has, so conversion can drop information. Markdown variants also behave differently in edge cases.
1. Line breaks
Line 1
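Line 2
- CommonMark: <p>Line 1\nLine 2</p> (the newline is a soft break and renders as whitespace)
- GFM (GitHub Flavored Markdown): optionally produces <p>Line 1<br>Line 2</p> (a renderer setting such as marked's breaks option, not part of the GFM spec itself)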
Force a break with two trailing spaces or an explicit <br>.
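The two behaviors can be sketched in a few lines. This is a minimal illustration, not any renderer's actual implementation: `render_paragraph` and its `hard_breaks` flag are hypothetical names, and inline syntax and HTML escaping are ignored.

```python
def render_paragraph(text: str, hard_breaks: bool = False) -> str:
    """Render one paragraph of plain text, handling only line breaks.

    hard_breaks=False mimics CommonMark soft breaks; hard_breaks=True mimics
    renderers that turn every newline into <br>. Inline syntax and HTML
    escaping are deliberately out of scope for this sketch.
    """
    lines = text.split("\n")
    pieces = []
    for index, line in enumerate(lines):
        is_last = index == len(lines) - 1
        # Two or more trailing spaces force a hard break in either mode.
        forced = line.endswith("  ") and not is_last
        pieces.append(line.strip())
        if not is_last:
            pieces.append("<br>\n" if (forced or hard_breaks) else "\n")
    return "<p>" + "".join(pieces) + "</p>"
```

Calling `render_paragraph("Line 1\nLine 2")` keeps the newline inside the paragraph, while passing `hard_breaks=True` or ending the first line with two spaces inserts a `<br>`.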
2. Tables (GFM-only)
| Col 1 | Col 2 |
| ----- | ----- |
| a | b |
The CommonMark spec has no table syntax; only GFM-style extensions render tables. Verify that your renderer supports GFM.
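One way to verify table support is to feed the renderer a probe table and check whether a `<table>` element comes back. The sketch below assumes your renderer can be wrapped as a function from Markdown text to an HTML string; `supports_gfm_tables` and `PROBE_TABLE` are hypothetical names.

```python
from typing import Callable

# Same table as in the example above, used as a probe document.
PROBE_TABLE = "| Col 1 | Col 2 |\n| ----- | ----- |\n| a | b |\n"


def supports_gfm_tables(render: Callable[[str], str]) -> bool:
    """Return True if `render` turns a pipe table into an HTML <table>."""
    return "<table" in render(PROBE_TABLE).lower()
```

A renderer without the table extension typically returns the pipes inside a plain `<p>`, so the check returns False.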
3. Inline HTML
Markdown lets you embed HTML directly:
This is **bold** and <span style="color: red">red</span>
- Most tools: pass the <span> through to output
- Strict CMS comment areas: strip or escape HTML
- Static site generators: typically pass through
4. Auto-linking
https://example.com ← GFM auto-links this
[text](https://example.com) ← explicit link
GFM converts bare URLs to <a>; CommonMark leaves them as plain text unless they are wrapped in angle brackets (<https://example.com>).
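A rough sketch of bare-URL auto-linking follows. It is a simplification: GFM's actual autolink rules cover more cases (such as www. prefixes and balanced parentheses). The `autolink` function and `URL_RE` pattern are hypothetical, and the input is assumed to be a plain-text run with no existing Markdown links or HTML.

```python
import html
import re

# Simplified pattern: http(s) URLs up to the next whitespace or angle bracket.
URL_RE = re.compile(r"https?://[^\s<>]+")


def autolink(text: str) -> str:
    """Escape plain text and wrap bare http(s) URLs in <a> tags."""
    out = []
    pos = 0
    for match in URL_RE.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?")
        out.append(html.escape(text[pos:match.start()]))
        out.append('<a href="{0}">{0}</a>'.format(html.escape(url)))
        pos = match.start() + len(url)
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Running it on an explicit `[text](URL)` link would double-link the URL, which is why real renderers auto-link during parsing rather than as a post-processing pass.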
5. Math and diagrams (extensions)
LaTeX ($E = mc^2$) and Mermaid are renderer-extension features. GitHub, Notion, and custom sites support different subsets.
HTML → Markdown: more is lost
The reverse is structurally lossier because HTML is more expressive.
1. Cannot be expressed
| HTML | Markdown |
|---|---|
| Inline <style> | Not representable |
| <script> | Not representable |
| <iframe> | Usually not (embed shortcodes only) |
| <form> | Not representable |
| Class names, IDs | Not representable (some extensions add this) |
| <table> colspan / rowspan | Not in GFM tables |
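Before converting, it can help to scan the HTML for constructs from the table above so you know what will be dropped. The following is a minimal sketch using Python's standard `html.parser`; `report_losses` and the two sets are hypothetical names, and the sets mirror the table rather than any converter's real behavior.

```python
from html.parser import HTMLParser

# Elements and attributes from the table above with no Markdown equivalent.
UNREPRESENTABLE_TAGS = {"style", "script", "iframe", "form"}
UNREPRESENTABLE_ATTRS = {"class", "id", "colspan", "rowspan"}


class LossReport(HTMLParser):
    """Collect constructs that a Markdown conversion would drop."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in UNREPRESENTABLE_TAGS:
            self.findings.append("<%s> element" % tag)
        for name, _value in attrs:
            if name in UNREPRESENTABLE_ATTRS:
                self.findings.append("%s attribute on <%s>" % (name, tag))


def report_losses(html_text: str) -> list:
    parser = LossReport()
    parser.feed(html_text)
    parser.close()
    return parser.findings
```

An empty list suggests the document sticks to constructs Markdown can express; anything else is a candidate for manual review.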
2. Partially preserved
| HTML | Markdown |
|---|---|
| <strong> <b> | **bold** |
| <em> <i> | *italic* |
| <a href="..."> | [text](URL) |
| <img src="..."> | ![alt](URL) |
| <code> | `code` |
| <pre><code> | Fenced code block |
| <ul> <ol> | - / 1. |
| <blockquote> | > |
| <h1>–<h6> | # – ###### |
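To make part of the mapping concrete, here is a small converter sketch built on Python's standard `html.parser`. It covers only headings, paragraphs, emphasis, inline code, links, and images. It is not how turndown or any other library works; `html_to_markdown` and `MiniMarkdown` are hypothetical names.

```python
from html.parser import HTMLParser

INLINE_MARKERS = {"strong": "**", "b": "**", "em": "*", "i": "*", "code": "`"}


def heading_level(tag: str) -> int:
    """Return 1-6 for h1-h6, or 0 for any other tag."""
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        return int(tag[1])
    return 0


class MiniMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.hrefs = []  # stack of href values for open <a> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in INLINE_MARKERS:
            self.parts.append(INLINE_MARKERS[tag])
        elif tag == "a":
            # Only href survives; target, class, etc. are silently dropped.
            self.hrefs.append(attrs.get("href") or "")
            self.parts.append("[")
        elif tag == "img":
            self.parts.append("![%s](%s)" % (attrs.get("alt") or "", attrs.get("src") or ""))
        elif heading_level(tag):
            self.parts.append("#" * heading_level(tag) + " ")

    def handle_endtag(self, tag):
        if tag in INLINE_MARKERS:
            self.parts.append(INLINE_MARKERS[tag])
        elif tag == "a" and self.hrefs:
            self.parts.append("](%s)" % self.hrefs.pop())
        elif tag == "p" or heading_level(tag):
            self.parts.append("\n\n")  # blank line between blocks

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html_text: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()
```

Note that a link's `target` attribute disappears without warning in this sketch, which is the kind of quiet loss that makes a review pass necessary.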
3. Always lost
target="_blank" on links, width / height on images — most converters drop these. If you must preserve them, embed raw HTML.
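The raw-HTML fallback can be sketched as a per-link decision: emit Markdown when nothing would be lost, and keep the `<a>` element otherwise. `link_to_markdown` and its `extra_attrs` parameter are hypothetical; a real converter would expose this through its own rule or plugin mechanism.

```python
import html
from typing import Dict, Optional


def link_to_markdown(text: str, href: str, extra_attrs: Optional[Dict[str, str]] = None) -> str:
    """Return [text](href), or raw <a> HTML when extra attributes must survive."""
    if not extra_attrs:
        return "[%s](%s)" % (text, href)
    attrs = {"href": href}
    attrs.update(extra_attrs)
    rendered = " ".join(
        '%s="%s"' % (name, html.escape(value, quote=True)) for name, value in attrs.items()
    )
    return "<a %s>%s</a>" % (rendered, html.escape(text))
```

The trade-off is that the output is no longer pure Markdown, so it only renders correctly where raw HTML is allowed.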
Sanitization: handling untrusted Markdown
User-submitted Markdown is dangerous if inline HTML is allowed:
This is <script>alert('XSS')</script> an attack
Defenses:
- Always sanitize the rendered HTML (DOMPurify, sanitize-html, etc.)
- Pick a Markdown renderer that can forbid raw HTML (markdown-it's html: false, which is its default, etc.)
- Allow-list image URLs (data: URIs can carry SVG XSS payloads)
- Add noopener noreferrer to outbound links automatically
GitHub, Reddit, Stack Overflow — most public-facing services sanitize server-side before storing.
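To show how the defenses fit together, here is an allow-list sanitizer sketch built on Python's standard `html.parser`. It is illustrative only: for production, use a maintained sanitizer such as the ones named above. The tag, attribute, and scheme lists are assumptions chosen for the example, and `sanitize` is a hypothetical name.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed allow-lists for this sketch; tune them to your own content.
ALLOWED_TAGS = {"p", "br", "strong", "em", "code", "pre", "blockquote",
                "ul", "ol", "li", "a", "img"}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
ALLOWED_SCHEMES = {"http", "https"}  # rejects data:, javascript:, relative URLs
DROP_WITH_CONTENT = {"script", "style"}
VOID_TAGS = {"br", "img"}


def is_safe_url(url: str) -> bool:
    try:
        return urlsplit(url.strip()).scheme in ALLOWED_SCHEMES
    except ValueError:
        return False  # unparseable URLs are rejected


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # tag is dropped; its text content is still kept
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, class, and so on
            if name in ("href", "src") and not is_safe_url(value):
                continue
            kept.append(' %s="%s"' % (name, html.escape(value, quote=True)))
        if tag == "a":
            kept.append(' rel="noopener noreferrer"')
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data, quote=False))


def sanitize(html_text: str) -> str:
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The sketch fails closed: an unclosed `<script>` swallows the rest of the input, and any URL without an allowed scheme loses its attribute. It does not rebalance unclosed tags, which a real sanitizer would also handle.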
Behavior across renderers
| Renderer | GFM | Raw HTML | Sanitized | Math |
|---|---|---|---|---|
| GitHub | ✅ | △ (some) | Server-side | MathJax |
| Notion | ✅ | ❌ | Auto | KaTeX |
| Obsidian | ✅ | ✅ | No (local) | MathJax |
| Hugo / Jekyll | ✅ | ✅ (Hugo's Goldmark renderer requires unsafe = true) | Off by default | Plugin |
| marked.js | ✅ | Configurable | Separate | Separate |
| markdown-it | ✅ | Configurable | Separate | Plugin |
Picking the direction
Markdown → HTML
- Choose Markdown when authoring comfort matters
- Reach for raw HTML only when pixel-level control is required
HTML → Markdown
- Common in blog migrations (WordPress → Hugo, etc.)
- Full automation isn’t realistic; expect manual cleanup
- Libraries like turndown get you ~80% of the way
Summary
- Markdown vs HTML information content is asymmetric (HTML carries more)
- GFM extension support changes conversion results
- Always sanitize untrusted Markdown
- HTML → Markdown is lossy by design — automate then review
For ad-hoc conversion in either direction, the Markdown-to-HTML and HTML-to-Markdown tools on this site handle GFM and run entirely in-browser, so internal content stays local while you experiment.