Clean up Word to HTML with Dewordify

At the recent ETUG conference, I attended a session from the BCIT online course development team where they demonstrated a tool they created to make converting Word documents to HTML a cleaner process.

If you have ever tried converting Microsoft Word to HTML, you know that it can be a nasty business to try to get clean HTML from a Word document. Even when you save to HTML or save to filtered HTML in Word, the resulting documents still contain tons of cruft and extraneous code that need further cleaning.

There are tools that do this with varying degrees of success. I have often used Dreamweaver for this, and its clean Word tool works well. But Dreamweaver is overkill, proprietary, and expensive if all you want is clean HTML from Word.

There are online conversion tools like HTML Cleaner, and most LMS interfaces have cut-and-paste functionality that strips Word code. These work well enough for small pieces of content. But if you are doing a lot of converting and want to automate or batch process many Word documents, these tools aren’t really up to the job. For those bigger bulk conversion jobs, you can automate a workflow using a tool like Pandoc, an open source tool that is the Swiss Army knife of document conversion tools.
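To give a sense of what a batch workflow with Pandoc might look like, here is a minimal Python sketch. It assumes Pandoc is installed and on your PATH, and the function names and folder names (word_docs, html_out) are placeholders I made up for illustration. The Pandoc options used are its standard ones for input format (-f), output format (-t), and output file (-o).

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(docx_path, out_dir):
    """Build the pandoc command line for one Word file (hypothetical helper)."""
    html_path = Path(out_dir) / (Path(docx_path).stem + ".html")
    return ["pandoc", str(docx_path), "-f", "docx", "-t", "html", "-o", str(html_path)]


def convert_folder(src_dir, out_dir):
    """Run pandoc on every .docx file in src_dir and return the files processed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on the PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        # check=True raises an error if pandoc fails on a file
        subprocess.run(pandoc_command(docx_path, out_dir), check=True)
        converted.append(docx_path)
    return converted
```

Calling convert_folder("word_docs", "html_out") would then write one HTML file per Word document into html_out. The output is an HTML fragment rather than a full page, so it will likely still need a template wrapped around it.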

Now there is another option for converting Word to HTML, created by developers at BCIT: Dewordify, a Node.js application. I downloaded it and tested it on my computer, and it worked very well at cleaning up a simple Word document I tossed at it. To give you a comparison, here is the Word document I gave it (a piece I wrote on last fall’s EDUCAUSE conference). It’s a fairly simple Word document using level 1 headers (in yellow) and images with captions (in red).

Now, if I save this as a filtered document in Word, I get a file with a ton of crufty CSS and HTML attributes in it.

You have likely seen this kind of code before. It can cause problems when you put it into other tools, like an LMS. The CSS code at the beginning of the document alone goes on for close to 200 lines. That is 200 lines of unnecessary stuff that needs to be cleaned out.

Ideally, what we want to have is clean code with none of those 200 lines of CSS at the top of the HTML document. Here is the same HTML document code after the original Word doc was run through Dewordify.

That is fairly clean code from a Word document. Gone are the lines and lines of CSS and Word attributes in the HTML tags.
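To make that kind of cleanup concrete, here is a rough Python sketch that strips embedded stylesheets, comments, and inline style, class, and lang attributes from Word's HTML. To be clear, this is not how Dewordify works (I haven't read its code), and the function name and sample markup are my own. It is only a crude regex-based illustration of the idea, and a real tool would be better off using a proper HTML parser.

```python
import re


def strip_word_cruft(html):
    """Crudely remove Word styling from an HTML string (illustrative only)."""
    # Drop embedded stylesheets, i.e. the long CSS block at the top of the file
    html = re.sub(r"<style\b.*?</style>", "", html, flags=re.DOTALL | re.IGNORECASE)
    # Drop HTML comments, which includes Word's conditional comments
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Drop style, class, and lang attributes, whether quoted or unquoted
    html = re.sub(
        r"""\s+(?:style|class|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
        "",
        html,
        flags=re.IGNORECASE,
    )
    return html


sample = """<style>p.MsoNormal {margin:0cm;}</style>
<p class=MsoNormal style='margin-bottom:0cm'><span lang=EN-CA>Hello</span></p>"""

cleaned = strip_word_cruft(sample)
print(cleaned.strip())  # prints: <p><span>Hello</span></p>
```

Because the attribute pattern is a plain regex, it could also match look-alike text in the body of a document, which is one more reason to treat this as a sketch rather than a finished cleaner.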

Now, there are a few things about Dewordify that are pretty specific to the BCIT course development workflow. For example, the CSS and JS files link back to files at BCIT that are specific to them. And some CSS and HTML attributes are added to the code, like the div class="container", that again are specific to the BCIT workflow. But the brief conversation I had with Kyle Hunter after their session leads me to believe that this can be changed and customized. I haven’t dug into the Dewordify code yet to know how easy or difficult this is.

I also noticed that the captions for the images didn’t come over. And, like most document conversion processes, GIGO (garbage in, garbage out) applies: the better formatted your Word document is going in, the better the tool works. Poorly formatted source files are the biggest culprit I have discovered when converting ANY document, so learn to use the tool you are using correctly. If you are using Word, use heading styles to make headings, as opposed to changing the font size and colour to make your own visual header. Use the list tool to make lists, as opposed to typing the number 1 and then hitting the space bar 5 times or <shudder> using the tab button. Basic rules, yes. But for people who spend a lot of time converting and cleaning documents, simple steps like this go a long way toward making content more convertible to other file types.

Or, avoid using Word in the first place and build your content thinking web first. That means building in HTML or Markdown and avoiding this kind of conversion altogether. But if you do find yourself converting documents from Word to HTML, Dewordify might be a useful tool for you.

Photo: Meat Grinders Peter Miller CC-BY-NC-ND

 

 

Clint Lalonde

 
