.docx Considered Stupid
Recently I received an email with a ".docx" file attached. This is the new format for Microsoft Word 2007 documents. I won't even get started on how annoying it is to get Word documents attached to emails. Unfortunately it's something you have to put up with.
Being a Mac user and not having the latest and *cough* greatest word processor from Microsoft, I had to figure out how to read this document. I found out the .docx file format is actually a zipped directory tree containing xml files.
Ok, first step was to unzip the .docx file via the command line. This resulted in eleven files in a handful of directories. So far, so good.
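For anyone who would rather not unzip by hand, the same peek inside the archive can be sketched in Python with the standard zipfile module. The function name list_docx_parts is made up for this example; the only assumption is that the .docx really is an ordinary zip archive, as described above.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of the files stored inside a .docx archive.

    `path` may be a filename or an open binary file object.
    """
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Calling list_docx_parts("letter.docx") (a hypothetical filename) returns the list of paths inside the archive, which is enough to go hunting for the file that holds the text.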
Next step was to find which file contained the actual text of the document and not just metadata. Looking through the filenames, I found one called "document.xml" which looked promising. So I opened it up in my trusty text editor, TextMate. Suddenly my computer slowed to a crawl as the file was loaded. It turned out that the entire file consisted of two lines: the first line contained the xml declaration, and the second line contained the entire xml markup for the document! No wonder TextMate struggled, since it dutifully created a 70,000+ character line for the document. Why couldn't the xml file have newlines to make it more manageable? Anyway, on to the final step...
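One way around the giant single line is to re-indent the xml before opening it in an editor. Here is a minimal sketch using Python's standard xml.dom.minidom module; the function name pretty_xml is invented for this example, and it assumes the file is well-formed xml that fits comfortably in memory.

```python
from xml.dom import minidom


def pretty_xml(xml_bytes):
    """Re-indent an XML document so each element starts on its own line."""
    document = minidom.parseString(xml_bytes)
    return document.toprettyxml(indent="  ")
```

Writing the returned string to a new file gives something a text editor can load without choking, at the cost of extra whitespace that Word itself would not have written.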
My plan was to write a regexp search-and-replace to strip out all the xml markup so I could read the content of the document. But then I discovered that the markup is peppered with <w:proofErr w:type="spellStart"/> tags around almost every single word! I should mention that the contents of the document were in a foreign language, hence all the spelling "errors". For some bizarre reason, Microsoft Word marks up spelling mistakes in docx files, not just on-screen. Why? Shouldn't it be left to the individual application (and platform) loading the document to decide whether or not words are misspelled? I can accept all the other hassles with the docx format: zipped xml files and incredibly long lines, but the encoding of spelling errors is crazy stuff.
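For what it's worth, an xml parser sidesteps the proofErr clutter more easily than a regexp does, because those tags are empty elements and the visible text lives in w:t elements. The following is a rough sketch of that different approach, not the regexp plan described above. The function name docx_text is made up, and it assumes the main text sits at the usual word/document.xml path inside the archive and uses the standard WordprocessingML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the w: prefix in WordprocessingML documents.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the plain text of a .docx file, one line per paragraph.

    Assumes the main document part is stored at word/document.xml.
    Markup such as w:proofErr carries no text, so it is simply skipped.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # Join every text run (w:t) found inside this paragraph.
        runs = [t.text or "" for t in paragraph.iter(W + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

This ignores tables, footnotes, headers and anything else stored outside the main document part, so treat it as a quick way to read a simple letter rather than a real converter.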
After all that I gave up trying to clean up the xml to read the file. Luckily, I found a web site that offers free conversion of docx files: zamzar.com.
PS: It turns out that Word documents created using MS Office 2007 do not conform to Microsoft's own OOXML standard:
OOXML and Office 2007 Conformance: a Smoke Test
Labels: inefficiency, Microsoft, technology