Now that I’m getting closer to going live, I have begun testing moving MS Word docs directly into the site. Most of my original research and analysis stuff is done in Word or Excel because I’m pretty much a power user in quick writing, assembly and formatting. If I could convert docs in a relatively straightforward way to AI3, that would be a real boon. However, as I discovered, there are major, major problems and issues with moving Word documents.
The Big Transfer
My first test involved a beast of an analysis piece I had done on the $3 trillion value of U.S. enterprise document assets. (It was eventually posted here.) In its long form, it is 42 pages long, with many tables, a few figures and more than 100 citations and links. It is over 1 MB (1070 KB) in its original form.
Most of us have created Web pages directly from a Word document, and I tried this first. I did the conversion with the filter option, then pasted the result directly into the editor, and attempted to update the site. The transfer seemed to take forever and then the server hung. My suspicion was that the Word HTML code was too complex.
Cleaning the Word HTML
Before packing it in and splitting the original doc into multiple pieces so that my site could choke it down, I decided to do a bit of investigation on alternative clean-up utilities and approaches. One good review I found was by Laurie Rowell on ‘Clean HTML from Word: Can it be Done?’. I recommend the four-parter.
Laurie’s review suggested that third-party tools could improve somewhat on the MS Web page with filtering option, but the improvements did not appear to be enough to let me proceed without splitting my files. Nonetheless, after following that advice, mostly using the MS Web page with filtering, I again attempted a transfer, with the same results as before.
After this initial cleaning, I used both Word and Composer (the Mozilla HTML editor) to do some search-and-replace removal of further HTML tags. Using pattern replacements, especially with Word, it is also possible to replace line breaks and tab characters, so long as the base file does not have an expected Word extension such as .doc, .rtf or .html (by convention, I always use .txt). This was going well in reducing the sizes of the files (the best I was able to achieve on the 1 MB file was about 450 KB), but the process was laborious even with global search-and-replace. Furthermore, WordPress was still not choking down the smaller files. And, unbeknownst to me at this stage, other issues were being introduced into the files that would make later steps even more difficult.
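For readers who would rather script this kind of pattern replacement than do it by hand, here is a minimal sketch in Python. It is not the procedure I followed in Word or Composer. The function name strip_word_markup and the particular patterns are my own assumptions about what typical Word HTML contains, and regular expressions are a crude tool for HTML, so check the output before trusting it.

```python
import re

def strip_word_markup(html: str) -> str:
    """Remove some common Word-specific markup (illustrative patterns only)."""
    # Drop HTML comments, including the conditional comments Word can emit.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Drop embedded <style> blocks and everything inside them.
    html = re.sub(r"<style\b.*?</style>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop namespaced Office tags such as <o:p> and </o:p>.
    html = re.sub(r"</?\w+:\w+[^>]*>", "", html)
    # Drop class, style and lang attributes, whether quoted or unquoted.
    # Caution: this is crude and would also hit matching text outside tags.
    html = re.sub(
        r"""\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
        "", html, flags=re.IGNORECASE)
    return html

sample = ('<p class=MsoNormal style="margin:0in">'
          '<span lang=EN-US>Hello<o:p></o:p></span></p>')
print(strip_word_markup(sample))
```

The emptied span tags are left in place on purpose: removing closing span tags blindly could unbalance spans that still carry other attributes.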
Splitting Files
I then split the document into six parts, the largest being about 250 KB. As before, they were cut-and-pasted into the editor and then posted. I again got server errors and time-outs. Kevin Klawonn of BrightPlanet was able to determine that the Apache server was timing out after 30 seconds. With a minor parameter change, we were able to get all files uploaded.
However, while the system was now choking down the files, they looked terrible! Line breaks were totally messed up, and editing them within the Xinha editor was close to hopeless. Clearly, and unfortunately, the code was still not clean enough.
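I did the splitting by hand, but it could be sketched in code as follows. The function name split_html, the choice to cut only after a closing paragraph tag, and the 250,000-byte default are all assumptions for illustration. Note that a cut after a paragraph inside a table cell would break the table, so documents with tables need more care than this.

```python
def split_html(html: str, max_bytes: int = 250_000) -> list[str]:
    """Split HTML into parts of roughly max_bytes, cutting only after </p>.

    A single paragraph larger than max_bytes is kept whole in its own part.
    """
    marker = "</p>"
    # Break into paragraph-sized pieces, keeping each closing tag attached.
    pieces = [piece + marker for piece in html.split(marker)]
    # The text after the final </p> never had a closing tag; remove the extra.
    pieces[-1] = pieces[-1][: -len(marker)]

    parts, current, size = [], [], 0
    for piece in pieces:
        if not piece:
            continue
        n = len(piece.encode("utf-8"))
        # Start a new part when adding this piece would exceed the limit.
        if current and size + n > max_bytes:
            parts.append("".join(current))
            current, size = [], 0
        current.append(piece)
        size += n
    if current:
        parts.append("".join(current))
    return parts

doc = "<p>aaaa</p><p>bbbb</p><p>cccc</p>"
for part in split_html(doc, max_bytes=25):
    print(part)
```

Joining the parts back together reproduces the original text exactly, which makes it easy to confirm that nothing was lost in the split.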
Problems with Composer
A natural assumption was that these open source editors were "buggy" and unable to handle more commercial-strength requirements. In response, I turned to Composer, a standard HTML utility in my Mozilla browser.
I had not used Composer much before, but found much to like. It has nice toggles between HTML source and WYSIWYG. It offers menu options for most "standard" activities I would undertake with a Web page. In short, I liked working with it and thought it might become an offline (at least from my blog) standard for doing HTML WYSIWYG editing of large imports. I actually started becoming familiar with the app and its controls and features.
However, upon actual incorporation of the results, I found a nasty truth. Composer introduces forced line breaks at about column 70. As a basis for incorporating into other apps, all of which need to work nicely together, this was fatal. So much for Composer. I was sad …
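One way to undo forced breaks of this kind is sketched below. It is not something I used at the time. The function name unwrap_html and the list of block-level tags are assumptions: the idea is to collapse every run of whitespace to a single space and then start a new line after each closing block tag. It would also flatten any preformatted (pre) content, so it only suits files without such sections.

```python
import re

# Closing block-level tags after which a new line should start (assumed list).
BLOCK_END = re.compile(
    r"(</(?:p|h[1-6]|li|ul|ol|table|tr|div|blockquote)>)\s*",
    re.IGNORECASE)

def unwrap_html(html: str) -> str:
    """Remove forced line breaks, then put each block element on its own line."""
    # Collapse every run of whitespace (newlines, tabs, spaces) to one space.
    flat = re.sub(r"\s+", " ", html).strip()
    # Start a new line after each closing block-level tag.
    return BLOCK_END.sub(r"\1\n", flat).strip()

wrapped = ("<p>This sentence was wrapped\nat a fixed margin by the\n"
           "editor.</p>\n<p>Second\nparagraph.</p>\n")
print(unwrap_html(wrapped))
```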
Problems with Xinha
I have observed line breaks being introduced by Xinha, but have not been able to reproduce the actual steps. In general, the system seems to be OK about not introducing spurious breaks, including when editing moves from full to smaller screen.
Cleaning the Word HTML II – Textism
The entire process of using files in multiple applications with multiple behaviors had worked to create a total HTML nightmare in my test file baseline. Remembering one of the options in Laurie Rowell’s piece (above), I decided to break my normal rule against paid products and check out the Textism site using the Word Clean utility. Dean Allen actually has an interesting pricing model which mixes aspects of free, seduction and low cost.
This is a superior and professional offering:
- It creates files less than one-third the size of already cleaned MS Word HTML
- It has an innovative pricing model, beginning with free and moving on to short-term or annual use. For 5 euros (about $6.20) the system can be used without limit for 24 hours; payment is by PayPal. For a full year of use, perhaps likely after being hooked on the day version, the cost is 20 euros (about $25), though with a limit of 75 conversions per month (about three per work day, which is no constraint even for very aggressive personal use).
- The resulting cleaned code, while produced on the server side and needing to be copied and pasted locally in order to keep a saved copy, is extremely clean and well formatted for later management with other utilities. Paragraphs are separated by breaks, items such as bullets are placed on their own lines, and some lines are indented.
- The system is capable of handling considerable complexity. My test from hell had about 20 tables, 100 citations and endnotes, a table of contents, embedded images, etc. All code produced was correct and very clean and spare.
In short, a total recommendation. Any user needing to move a few files per month from Word to their blog should definitely consider this service.
Final Adopted Process
Depending on the length of the original Word document and its complexity, I recommend one of two approaches given current tools (at least the ones I have tested.)
For shorter Word documents with little complexity and few internal or external references:
- Save the document as a .txt file, and then
- Cut-and-paste into the online editor and re-establish formatting.
For Word documents that do not meet these conditions, the path is tortuous and onerous:
- Make sure the Word file is absolutely complete; you will not want to return to this step!
- Save the file using Word’s Save As command, choosing the Web page, filtered option as the ‘save as type’
- [If images are used, separately locate them, give them logical names, and later embed in the way you normally handle in-line images with your posts]
- Submit the HTML file created by MS Word to the Textism utility. Granted, this utility costs, but the effort saved in its clean HTML procedures is well worth it
- Now, with the re-saved clean version, invoke a standard editor that has two capabilities: 1) no enforced word wraps or line breaks; 2) the ability to display and search-and-replace on formatting characters such as line breaks (^p), tabs (^t), etc. (Word can perform these functions when files are named .txt prior to input, but use of a standard editor with these features may be preferable.) With this editor, you will do some additional code clean-up:
- Removing unnecessary style definitions
- Formatting the file so that paragraphs split
- Removing any other recurring HTML code patterns that have nothing to do with the eventual display of your document on your site.
- Now, paste your fully cleaned code into your editor for posting on the blog, and
- Should you encounter major problems, select all of the code in your blog editor, re-paste it in your standard editor, and do any global replaces and clean-ups.
I know this sounds like a pain, and it is. You should also keep saved versions of interim steps above to have fallbacks if necessary.
Note: There are instances when the size of the file and the degree of final HTML editing and clean-up may suggest offline editing, because server-side editing is slow, updated posts may take forever or experience server time-outs, or they may simply crash the server. If offline editing is necessary, do make sure to use an HTML editor that does not insert those insidious line breaks. If it does, you will spend hours of frustration trying to get everything clean again.
Author’s Note: I actually decided to commit to a blog on April 27, 2005, and began recording soon thereafter my steps in doing so. Because of work demands and other delays, the actual site was not released until July 18, 2005. To give my ‘Prepare to Blog …’ postings a more contemporaneous feel, I arbitrarily changed posting dates on this series one month forward, which means some aspects of the actual blog were better developed than some of these earlier posts indicate. However, the sequence and the content remain unchanged. A re-factored complete guide will be posted at the conclusion of the ‘Prepare to Blog …’ series, targeted for release about August 18, 2005. mkb