Converting between Markdown and HTML: GFM, sanitization, attribute loss


Tasks like “generate blog HTML from Markdown” and “migrate existing HTML articles to Markdown” come up routinely, so conversion is needed in both directions. Both directions have pitfalls.

Markdown → HTML: not lossless

Markdown has fewer features than HTML, so some intent cannot be expressed without falling back to raw HTML. The same source can also render differently depending on the variant (CommonMark, GFM, and others), especially in edge cases.

1. Line breaks

Line 1
Line 2
  • CommonMark: <p>Line 1\nLine 2</p> (newline becomes whitespace)
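  • GFM (GitHub Flavored Markdown): the spec matches CommonMark here, but GitHub comments and renderers with a hard-break option enabled (for example breaks: true in marked or markdown-it) produce <p>Line 1<br>Line 2</p>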

Force a break with two trailing spaces, a trailing backslash, or an explicit <br>.
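The difference between soft and hard breaks can be sketched in a few lines of Python. This is a minimal illustration rather than a Markdown parser: render_paragraph and its breaks flag are hypothetical names, loosely modeled on the hard-break option some renderers expose, and the function handles line breaks only.

```python
import html


def render_paragraph(text: str, breaks: bool = False) -> str:
    """Render one paragraph, handling line breaks only (no other Markdown)."""
    lines = text.split("\n")
    parts = []
    for i, line in enumerate(lines):
        is_last = i == len(lines) - 1
        # Hard break markers: two trailing spaces or a trailing backslash.
        # At the very end of a paragraph they do not produce a break.
        hard = not is_last and (line.endswith("  ") or line.endswith("\\"))
        if hard and line.endswith("\\"):
            line = line[:-1]
        parts.append(html.escape(line.strip()))
        if not is_last:
            # Soft breaks stay as a newline unless the renderer-style
            # `breaks` flag (an assumption of this sketch) is on.
            parts.append("<br>\n" if (hard or breaks) else "\n")
    return "<p>" + "".join(parts) + "</p>"
```

With the default settings a plain newline survives as whitespace; passing breaks=True, or ending the first line with two spaces or a backslash, yields a <br>.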

2. Tables (GFM-only)

| Col 1 | Col 2 |
| ----- | ----- |
| a     | b     |

The core CommonMark spec has no table syntax; tables only render when a GFM-style extension is enabled. Verify that your renderer supports GFM tables.
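To make the extension concrete, here is a rough Python sketch of what a GFM-aware renderer does with a pipe table. The function names are hypothetical, alignment markers and escaped pipes are ignored, and a renderer without the extension would simply leave the same text as a paragraph.

```python
import html
from typing import List, Optional


def split_row(line: str) -> List[str]:
    """Split '| a | b |' into ['a', 'b'] (escaped pipes are not handled)."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_delimiter_row(line: str) -> bool:
    """True for rows like '| --- | :---: |' that follow the header row."""
    if "|" not in line:
        return False
    return all(set(cell.strip(":")) == {"-"} for cell in split_row(line))


def table_to_html(markdown: str) -> Optional[str]:
    """Return <table> HTML for a pipe table, or None if the text is not one."""
    lines = [ln for ln in markdown.strip().splitlines() if ln.strip()]
    if len(lines) < 2 or not is_delimiter_row(lines[1]):
        return None
    header = split_row(lines[0])
    if len(header) != len(split_row(lines[1])):
        return None  # header and delimiter row must have the same width
    out = ["<table>", "<thead>", "<tr>"]
    out += [f"<th>{html.escape(cell)}</th>" for cell in header]
    out += ["</tr>", "</thead>", "<tbody>"]
    for line in lines[2:]:
        cells = split_row(line)
        # Pad short rows and truncate long ones to the header width.
        cells = (cells + [""] * len(header))[: len(header)]
        row = "".join(f"<td>{html.escape(cell)}</td>" for cell in cells)
        out.append(f"<tr>{row}</tr>")
    out += ["</tbody>", "</table>"]
    return "\n".join(out)
```

Returning None for non-table input mirrors the fallback a reader sees in practice: without table support, the pipes and dashes show up as literal text.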

3. Inline HTML

Markdown lets you embed HTML directly:

This is **bold** and <span style="color: red">red</span>
  • Most tools: pass the <span> through to output
  • Strict CMS comment areas: strip or escape HTML
  • Static site generators: typically pass through

4. Auto-linking

https://example.com ← GFM auto-links this
[text](https://example.com) ← explicit link

GFM converts bare URLs to <a>; CommonMark leaves them as plain text unless they are wrapped in angle brackets (<https://example.com>).
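A simplified Python sketch of bare-URL auto-linking follows. The regular expression and the trailing-punctuation rule are assumptions of this sketch and are looser than the rules in the GFM spec (which, for instance, balances parentheses), and the function expects plain text without existing [text](URL) links.

```python
import html
import re

# Bare http(s) URLs: run until whitespace or an angle bracket.
URL_RE = re.compile(r"https?://[^\s<>]+")
# Punctuation that is treated as sentence text when it ends a URL.
TRAILING = ".,;:!?)"


def autolink(text: str) -> str:
    """Wrap bare URLs in <a> and HTML-escape everything else."""
    out = []
    pos = 0
    for match in URL_RE.finditer(text):
        url = match.group(0)
        link = url.rstrip(TRAILING)
        out.append(html.escape(text[pos:match.start()]))
        out.append(f'<a href="{html.escape(link, quote=True)}">{html.escape(link)}</a>')
        out.append(html.escape(url[len(link):]))  # punctuation stays outside the link
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

A strict CommonMark pipeline would skip this step entirely, which is why the same paragraph can come out with or without a clickable link.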

5. Math and diagrams (extensions)

LaTeX ($E = mc^2$) and Mermaid are renderer-extension features. GitHub, Notion, and custom sites support different subsets.

HTML → Markdown: more is lost

The reverse is structurally lossier because HTML is more expressive.

1. Cannot be expressed

| HTML | Markdown |
| ---- | -------- |
| Inline <style> | Not representable |
| <script> | Not representable |
| <iframe> | Usually not (embed shortcodes only) |
| <form> | Not representable |
| Class names, IDs | Not representable (some extensions add this) |
| <table> colspan / rowspan | Not in GFM tables |
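Before a migration it can help to list what will be dropped. The Python sketch below scans HTML for the constructs named above using the standard library parser; LossAuditor, audit, and the two sets are hypothetical names for illustration, and the sets are a starting point rather than a complete inventory.

```python
from html.parser import HTMLParser
from typing import List

# Constructs with no Markdown equivalent (illustrative, extend as needed).
UNSUPPORTED_TAGS = {"style", "script", "iframe", "form"}
UNSUPPORTED_ATTRS = {"class", "id", "style", "colspan", "rowspan"}


class LossAuditor(HTMLParser):
    """Collect elements and attributes a Markdown conversion would drop."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: List[str] = []

    def handle_starttag(self, tag, attrs):
        line, _offset = self.getpos()  # position of the tag being parsed
        if tag in UNSUPPORTED_TAGS:
            self.findings.append(f"line {line}: <{tag}> element")
        for name, _value in attrs:
            if name in UNSUPPORTED_ATTRS:
                self.findings.append(f"line {line}: {name} attribute on <{tag}>")


def audit(html_text: str) -> List[str]:
    """Return human-readable findings for one HTML document."""
    auditor = LossAuditor()
    auditor.feed(html_text)
    auditor.close()
    return auditor.findings
```

Running audit over each article before conversion gives a checklist for the manual cleanup pass.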

2. Partially preserved

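| HTML | Markdown |
| ---- | -------- |
| <strong>, <b> | **bold** |
| <em>, <i> | *italic* |
| <a href="..."> | [text](URL) |
| <img src="..."> | ![alt](URL) |
| <code> | `code` |
| <pre><code> | Fenced code block |
| <ul>, <ol> | - and 1. |
| <blockquote> | > |
| <h1> to <h6> | # to ###### |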
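The mappings above can be sketched with Python's standard library HTML parser. This is a deliberately small illustration, not a replacement for a real converter: the class and function names are hypothetical, and lists, blockquotes, and fenced code blocks are left out to keep it short.

```python
from html.parser import HTMLParser

# Markdown markers for simple inline elements (same marker opens and closes).
INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*", "code": "`"}
HEADINGS = {f"h{n}": "#" * n + " " for n in range(1, 7)}


class SimpleHtmlToMarkdown(HTMLParser):
    """Convert a small HTML subset; attributes other than href/src/alt are dropped."""

    def __init__(self) -> None:
        super().__init__()
        self.out = []
        self.hrefs = []  # stack of link targets waiting for their closing tag

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag in HEADINGS:
            self.out.append(HEADINGS[tag])
        elif tag == "a":
            self.hrefs.append(attr.get("href") or "")
            self.out.append("[")
        elif tag == "img":
            self.out.append(f"![{attr.get('alt') or ''}]({attr.get('src') or ''})")

    def handle_endtag(self, tag):
        if tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag == "a" and self.hrefs:
            self.out.append(f"]({self.hrefs.pop()})")
        elif tag == "p" or tag in HEADINGS:
            self.out.append("\n\n")  # blank line between blocks

    def handle_data(self, data):
        self.out.append(data)


def to_markdown(html_text: str) -> str:
    parser = SimpleHtmlToMarkdown()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out).strip()
```

Note that any attribute the Markdown syntax has no slot for silently disappears in this sketch, which is the "partially" in "partially preserved".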

3. Always lost

target="_blank" on links and width / height on images have no Markdown syntax, so most converters drop them. If you must preserve them, keep those elements as raw HTML inside the Markdown.
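One way to apply that advice during conversion is to fall back to raw HTML whenever an element carries an attribute Markdown cannot hold. The Python sketch below shows the idea for links and images; the function names and the REPRESENTABLE table are hypothetical, and titles are left out to keep the example short.

```python
import html

# Attributes that basic Markdown link and image syntax can carry.
REPRESENTABLE = {"a": {"href"}, "img": {"src", "alt"}}


def render_attrs(attrs: dict) -> str:
    """Serialize attributes back to HTML, escaping the values."""
    return "".join(f' {k}="{html.escape(v, quote=True)}"' for k, v in attrs.items())


def convert_link(attrs: dict, text: str) -> str:
    """Emit Markdown unless that would drop an attribute; then keep raw HTML."""
    if set(attrs) - REPRESENTABLE["a"]:
        return f"<a{render_attrs(attrs)}>{html.escape(text)}</a>"
    return f"[{text}]({attrs.get('href', '')})"


def convert_image(attrs: dict) -> str:
    """Same rule for images: Markdown when lossless, raw HTML otherwise."""
    if set(attrs) - REPRESENTABLE["img"]:
        return f"<img{render_attrs(attrs)}>"
    return f"![{attrs.get('alt', '')}]({attrs.get('src', '')})"
```

The trade-off is readability: the output stays lossless, but the resulting Markdown only renders correctly where raw HTML is allowed through.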

Sanitization: handling untrusted Markdown

User-submitted Markdown is dangerous if inline HTML is allowed:

This is <script>alert('XSS')</script> an attack

Defenses:

  • Always sanitize (DOMPurify, sanitize-html, etc.)
  • Pick a Markdown renderer configuration that disables raw HTML (markdown-it’s html: false, which is its default, etc.)
  • Allow-list image URLs (data: URIs can carry SVG XSS payloads)
  • Add rel="noopener noreferrer" to outbound links automatically
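The defenses above combine into an allow-list pass over the rendered HTML. The Python sketch below uses only the standard library to show the shape of such a pass; the class name, the allow-lists, and the URL rule are assumptions made for illustration, and a maintained sanitizer such as the libraries named above remains the safer choice in production.

```python
import html
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "br", "strong", "em", "code", "pre", "a", "img",
                "ul", "ol", "li", "blockquote"}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
DROP_CONTENT = {"script", "style"}  # remove these together with their contents
VOID = {"br", "img"}                # elements without a closing tag


def is_safe_url(value: str) -> bool:
    """Accept only http(s) and site-relative URLs; rejects data: and javascript:."""
    return value.strip().lower().startswith(("http://", "https://", "/", "#"))


class AllowListSanitizer(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, and similar
            if name in {"href", "src"} and not is_safe_url(value):
                continue
            kept.append(f' {name}="{html.escape(value, quote=True)}"')
        if tag == "a":
            kept.append(' rel="noopener noreferrer"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data))


def sanitize(html_text: str) -> str:
    sanitizer = AllowListSanitizer()
    sanitizer.feed(html_text)
    sanitizer.close()
    return "".join(sanitizer.out)
```

The key design choice is allow-listing: anything not explicitly permitted is removed, so new attack vectors fail closed instead of slipping past a block-list.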

Most public-facing services, including GitHub, Reddit, and Stack Overflow, sanitize on the server rather than trusting the client.

Behavior across renderers

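| Renderer | GFM | Raw HTML | Sanitized | Math |
| -------- | --- | -------- | --------- | ---- |
| GitHub | | △ (some) | Server-side | MathJax |
| Notion | | | Auto | LaTeX |
| Obsidian | | | No (local) | MathJax |
| Hugo / Jekyll | | | Off by default | Plugin |
| marked.js | | Configurable | Separate | Separate |
| markdown-it | | Configurable | Separate | Plugin |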

Picking the direction

Markdown → HTML

  • Choose Markdown when authoring comfort matters
  • Reach for raw HTML only when pixel-level control is required

HTML → Markdown

  • Common in blog migrations (WordPress → Hugo, etc.)
  • Full automation isn’t realistic; expect manual cleanup
  • Libraries like turndown get you ~80% of the way

Summary

  • Markdown and HTML are asymmetric in what they can express (HTML carries more)
  • GFM extension support changes conversion results
  • Always sanitize untrusted Markdown
  • HTML → Markdown is lossy by design — automate then review

For ad-hoc conversion in either direction, the Markdown-to-HTML and HTML-to-Markdown tools on this site handle GFM and run entirely in-browser, so internal content stays local while you experiment.