6 min read · By MDTool Editorial Team

How to Convert Messy HTML to Clean Markdown

Word exports, Confluence pages, and WYSIWYG editors produce HTML full of inline styles and nested divs. Here's what gets cleaned up automatically, and what doesn't.


What Makes HTML Messy

Most "messy" HTML comes from a small set of repeat offenders, and recognizing which one you're dealing with helps set expectations for how clean the conversion will actually be.

Inline styles on every element. Word, Outlook, Confluence, and most WYSIWYG editors apply style="..." directly to each tag rather than using a separate stylesheet, since the exported document needs to render correctly without an external CSS file attached.

Div soup and span soup. Deeply nested <div> wrappers with no semantic meaning are common output from visual page builders and editors that wrap every element in multiple layout containers, often three or four levels deep for a single paragraph.

Word-generated HTML specifically. Microsoft Word's "Save as HTML" output is notorious even among messy-HTML sources: proprietary CSS properties prefixed with mso-, conditional comments (<!--[if gte mso 9]>), and a redundant <span> wrapped around every few words, none of which carry semantic meaning.

Redundant or duplicate formatting. A bold <span> sitting inside a paragraph that's already wrapped in <strong>, or empty <p></p> tags left behind from editing in a WYSIWYG tool, are common but harmless leftovers.

What Turndown Does to Clean It Up

MDTool's converter, turndown, walks the parsed DOM tree rather than pattern-matching raw text, so it ignores presentation noise and extracts only the semantic content: text, headings, links, and list or table structure.

| Before (messy HTML) | After (Markdown) | Why |
|---|---|---|
| <p style="color:#333;font-size:14px">Text</p> | Text | Inline style attributes are dropped entirely, regardless of what CSS they specify, since Markdown has no syntax to preserve them |
| <span style="color:red">Text</span> (no <strong>/<em> inside) | Text | Redundant spans with no semantic tag collapse to plain text; the color attribute and the wrapper are discarded |
| <p></p> (empty paragraph) | (collapsed) | Empty paragraphs collapse to normal spacing rather than being preserved as stray blank elements |
| <!--[if gte mso 9]>...<![endif]--> | (absent) | Conditional comments and Word-specific metadata aren't part of the visible content tree, so they're never processed as content |
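The rows above can be approximated with a small parser-based sketch. This is not turndown and not MDTool's code: it uses Python's standard html.parser only to illustrate why style attributes, wrapper spans, empty paragraphs, and conditional comments disappear when a converter reads structure instead of raw text. The names SemanticExtractor and to_markdown are invented for this example, and only paragraphs, bold, and italics are handled.

```python
from html.parser import HTMLParser


class SemanticExtractor(HTMLParser):
    """Illustrative only: keeps text and a few semantic tags, nothing else."""

    # Semantic inline tags and the Markdown marker each one maps to.
    MARKERS = {"strong": "**", "b": "**", "em": "_", "i": "_"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished paragraphs
        self.current = []  # pieces of the paragraph being built

    def handle_starttag(self, tag, attrs):
        # attrs (style, class, ...) are never read, so they cannot leak out.
        if tag in self.MARKERS:
            self.current.append(self.MARKERS[tag])

    def handle_endtag(self, tag):
        if tag in self.MARKERS:
            self.current.append(self.MARKERS[tag])
        elif tag in ("p", "div"):
            self.flush()

    def handle_data(self, data):
        self.current.append(data)

    # handle_comment is not overridden, so comments (including Word's
    # conditional comments) are silently dropped.

    def flush(self):
        text = "".join(self.current).strip()
        if text:  # empty paragraphs produce no block at all
            self.blocks.append(text)
        self.current = []


def to_markdown(html):
    parser = SemanticExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text that had no closing tag
    return "\n\n".join(parser.blocks)
```

Feeding it a styled paragraph, an empty paragraph, and a conditional comment in one string returns just the paragraph text, which mirrors the first, third, and fourth rows of the table.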

What Still Needs Manual Cleanup After Conversion

Being honest about the remaining 10% saves more time than pretending it doesn't exist:

Tips for the Cleanest Output

For HTML that's messy specifically because it came out of a CMS export rather than Word, see our guide on converting HTML to Markdown for CMS migration, which covers front matter and image-path cleanup as well.

Ready to try it on your own document? The HTML to Markdown converter runs entirely in your browser: paste or upload your messy HTML and see the cleaned-up Markdown immediately.

Frequently Asked Questions

Q: Why does my converted Markdown have backslashes before some characters?

Markdown uses certain characters (*, _, [, ]) as syntax for emphasis and links. If your source HTML contains those characters as literal text, the converter escapes them with a backslash so they render as plain text instead of being misread as Markdown syntax. This is correct, expected behavior, not an error.
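As a rough illustration of that escaping step, here is a minimal sketch that covers only the four characters named above plus the backslash itself. The function name escape_markdown is hypothetical, and a real converter is likely to apply more context-aware rules than this blanket substitution.

```python
import re

# The characters named in the answer above, plus the backslash itself.
SPECIAL = re.compile(r"([\\*_\[\]])")


def escape_markdown(text):
    """Prefix each special character with a backslash."""
    return SPECIAL.sub(r"\\\1", text)


print(escape_markdown("snake_case [draft] 2*3"))
```

A Markdown renderer shows the escaped string as the original literal text, so the backslashes are visible only in the raw Markdown source.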

Q: Does cleaning up messy HTML lose any actual content?

No. Only presentation (inline styles, redundant wrapper tags, CSS classes) is removed. Visible text, links, headings, and list or table structure are preserved exactly. If text seems to be missing, check whether it was actually presentation-only markup, like a decorative icon span, rather than real content.

Q: Why didn't my Word document's headings convert to Markdown headings?

If Word applied a heading look using font size and bold styling rather than an actual Heading 1/2/3 style, the exported HTML contains a styled <p> or <span>, not a real <h1> to <h6> tag, and the converter can only convert what's actually marked up as a heading. Apply real heading styles in Word before exporting, or add the # syntax manually afterward.
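If you are not sure whether a document has this problem, a quick scan before converting can flag likely candidates. The sketch below is a heuristic, not part of MDTool or turndown: it assumes a fake heading is a <p> that is both bold and set in a large point size, and the 16pt threshold and the names FakeHeadingFinder and find_fake_headings are arbitrary choices for illustration.

```python
import re
from html.parser import HTMLParser

FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)\s*pt", re.I)
BOLD = re.compile(r"font-weight\s*:\s*(bold|[6-9]00)", re.I)


class FakeHeadingFinder(HTMLParser):
    """Collects text of <p> elements styled to look like headings."""

    def __init__(self, min_pt=16.0):
        super().__init__()
        self.min_pt = min_pt      # assumed threshold, tune per source
        self.candidates = []
        self.in_p = False
        self.styles = []          # style strings seen inside the open <p>
        self.bold_tag = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p, self.styles, self.bold_tag, self.text = True, [], False, []
        if self.in_p:
            self.styles.append(dict(attrs).get("style") or "")
            if tag in ("b", "strong"):
                self.bold_tag = True

    def handle_data(self, data):
        if self.in_p:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "p" or not self.in_p:
            return
        self.in_p = False
        style = ";".join(self.styles)
        is_big = any(float(s) >= self.min_pt for s in FONT_SIZE.findall(style))
        is_bold = self.bold_tag or BOLD.search(style) is not None
        text = "".join(self.text).strip()
        if is_big and is_bold and text:
            self.candidates.append(text)


def find_fake_headings(html, min_pt=16.0):
    finder = FakeHeadingFinder(min_pt)
    finder.feed(html)
    finder.close()
    return finder.candidates
```

Each string it returns is a paragraph worth checking by hand: either restyle it as a real heading in Word and export again, or add the # prefix after conversion.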

Q: Can I convert HTML copied directly from a webpage, not just a file?

Yes. HTML copied from a rendered page, from View Source, or from Inspect Element can be pasted the same way as an exported file. The converter processes whatever HTML is in the paste field regardless of where it came from.

Q: Will conditional comments and Word-specific metadata show up in my Markdown?

No. Conditional comments and Word-specific metadata aren't part of the visible content tree, so the converter doesn't process or output them at all.

Q: Is there a way to automate cleanup for many messy HTML files at once?

MDTool is built for one-at-a-time, paste-and-preview conversions. For batch cleanup of many files, scripting the underlying turndown library directly, with custom rules for your specific source format, is more practical than pasting files one by one.
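The batch loop itself is simple whichever converter sits in the middle. The sketch below shows only that loop, under stated assumptions: convert is a placeholder for whatever function you script (for example a wrapper that calls turndown with your custom rules), convert_folder is an invented name, and files are read as UTF-8 with undecodable bytes replaced, which may not suit Word exports saved in another encoding.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, convert):
    """Run `convert` on every .html file under src_dir.

    `convert` is a caller-supplied function taking an HTML string and
    returning Markdown. Returns the list of written .md paths.
    """
    src, out = Path(src_dir), Path(out_dir)
    written = []
    for html_path in sorted(src.rglob("*.html")):
        # Mirror the source folder layout, swapping .html for .md.
        target = out / html_path.relative_to(src).with_suffix(".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        html = html_path.read_text(encoding="utf-8", errors="replace")
        target.write_text(convert(html), encoding="utf-8")
        written.append(target)
    return written
```

Passing the converter in as a parameter keeps the file handling separate from the conversion rules, so the rules can be adjusted per source format without touching the loop.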
