A couple of months ago my wife completed and defended her Ph.D. thesis in archaeology. To our surprise, she received an offer to turn it into a book. I probably don’t need to say how excited we are, especially since this kind of thing rarely happens in her field.

Unfortunately, it also means lots of work. The thesis is about a thousand pages long and is written in MS Word (docx format). Now my wife must once again go through the whole text and edit it to meet the publisher’s print standards. One of the things that needs to be modified is the format of citations. Currently they look like this:

(M. Mouse, 1901; D. Duck, 1999)

and need to become

(MOUSE, 1901; DUCK, 1999)

All citations in the original text were typed manually, i.e. they were not handled by any sort of bibliography manager (I’m not sure how handy or useful such a tool is, or even whether one is included in MS Word, but that’s a different story).

After some experimentation I managed to write a simple Python script that finds and modifies the citations so they look as desired. I’m going to show you how to get started solving this or a similar problem in Python.

The toolbox

The whole maneuver was possible thanks to the fact that the thesis was saved in the docx format. Essentially it’s a zip file with a bunch of xml files in it. The one with the document text is named (surprise, surprise) document.xml. So all we have to do is unzip the docx file, use Python to parse document.xml and modify its text to our needs, and then zip everything back.

As you can see, our toolbox is very simple: so far it is Python and the zip/unzip commands. In principle we could ditch the external zip commands and use Python’s own zipfile module, but this seems a minor overkill, as the number of zip/unzip operations we need to perform is not that large.
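
For readers who would rather stay inside Python after all, the round trip can be sketched with the standard zipfile module. This is a minimal Python 3 sketch, not the script used for the thesis: the helper name replace_member is made up for illustration, and the demo builds a tiny stand-in archive rather than a real Word document. In a real docx the main text is typically stored under word/document.xml.

```python
import os
import tempfile
import zipfile


def replace_member(src_path, dst_path, member, new_bytes):
    """Copy a zip archive to dst_path, swapping the content of one member.

    Hypothetical helper: every other member is copied unchanged and in
    the original order.
    """
    with zipfile.ZipFile(src_path) as src:
        with zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for name in src.namelist():
                data = new_bytes if name == member else src.read(name)
                dst.writestr(name, data)


# Demo on a stand-in archive (NOT a valid Word document).
workdir = tempfile.mkdtemp()
old_docx = os.path.join(workdir, "old.docx")
new_docx = os.path.join(workdir, "new.docx")
with zipfile.ZipFile(old_docx, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document>old</document>")

replace_member(old_docx, new_docx, "word/document.xml",
               b"<document>new</document>")

with zipfile.ZipFile(new_docx) as archive:
    names = archive.namelist()
    new_text = archive.read("word/document.xml")
print(names, new_text)
```

Writing to a new archive instead of patching the old one in place keeps the original file untouched, which is convenient while experimenting.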

The last thing needed is an xml formatter (pretty printer). The file we want to modify (document.xml) is essentially one line long, so any form of ‘manual’ inspection, e.g. performing a diff, would be impractical. An xml formatter adds line breaks and indentation (i.e. makes the file human readable), so it becomes possible to visually check what changes were made. For this I used xmllint:

xmllint --format old_file.xml > new_file.xml

Once again, it would be possible to do this within Python (you can easily google a solution, e.g. with lxml), but I will stick to the external tool as it worked fine for me. It is worth noting that using the mentioned cli tools does not mean manual operations, since you can call them from your script (an os.system call is good enough for the job).
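
If xmllint is not at hand, a rough equivalent can be sketched with the standard library alone. Note the swap: this uses xml.dom.minidom instead of xmllint or lxml, so whitespace details may differ from what xmllint prints. The function name pretty_xml and the one-line sample are made up for illustration, and the output is meant for inspection and diffing only, not for zipping back into a docx.

```python
from xml.dom import minidom


def pretty_xml(raw_xml):
    """Return an indented, multi-line rendering of an xml string."""
    return minidom.parseString(raw_xml).toprettyxml(indent="  ")


# A one-line stand-in for document.xml (heavily shortened).
one_line = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:p/></w:body></w:document>'
)
formatted = pretty_xml(one_line)
print(formatted)
```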

Start small

My general advice is to start small. If you plan on modifying a long document (as I did), create a new one with a couple of pages copied and pasted into it. Then unzip it, parse the document.xml file and save it to a new file without any modifications. Complete the first iteration by building a new docx file (i.e. overwrite the original document.xml file with the freshly created one). My initial script is below and, surprisingly, it wasn’t working properly:

import xml.etree.ElementTree as ET

tree = ET.parse('document.xml.org')
root = tree.getroot()
for element in root.iter():
    pass

tree.write(open('document.xml', 'wb'), encoding='utf-8')

The resulting docx file opened fine in LibreOffice. It was also OK in the Gmail preview. But MS Word wasn’t happy with the result and refused to open the file. A quick look at the original and the new document.xml files shows where the problem is:

<!-- beginning of the original document.xml file -->
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
  <w:body>
    <w:p w:rsidR="00E961ED" w:rsidRPr="00AE4808" w:rsidRDefault="00AE4808">
<!-- remaining part omitted for brevity -->

<!-- beginning of the document.xml file obtained in the first try -->
<ns0:document xmlns:ns0="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <ns0:body>
    <ns0:p ns0:rsidR="00E961ED" ns0:rsidRDefault="00AE4808" ns0:rsidRPr="00AE4808">
<!-- remaining part omitted for brevity -->

If you scroll both listings sideways you will notice that the second file lacks most of the namespace mappings present in the first (original) file. It seems that ElementTree is not especially user-friendly when it comes to handling namespaces (see this stackoverflow question for details). Fortunately, in our case the fix is quite easy: use a different library to handle xml parsing and creation. Below you can find a working snippet, this time using lxml:

from lxml import etree as ET
tree = ET.parse('document.xml.org')
root = tree.getroot()

for element in root.iter():
    pass

tree.write('document.xml', xml_declaration = True, encoding = "UTF-8", method = "xml", standalone = "yes")

The file created with the above snippet doesn’t differ from the original one, so the resulting docx file opens correctly in Word. Yay!
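
The namespace loss seen with ElementTree can be reproduced without any docx at all. The following Python 3 sketch uses a made-up, two-namespace stand-in for document.xml; it only illustrates the symptom (the renamed prefix and the dropped, unused mapping) and makes no claim about which of the two makes Word reject the file.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"

# Shortened stand-in for document.xml: the wp mapping is declared but unused.
sample = (
    '<w:document xmlns:w="%s" xmlns:wp="%s"><w:body/></w:document>' % (W, WP)
)

root = ET.fromstring(sample)
round_tripped = ET.tostring(root, encoding="unicode")

# The w prefix comes back as ns0 and the wp declaration is gone.
print(round_tripped)
```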

Regex or bust!

Finally we are on track to tackle the problem. Since citations come in a consistent format (which we want to change by deleting groups of characters or making them upper case), this seems a natural place for regex. Unfortunately regex won’t work for us out of the box, since the text is scattered across multiple xml elements. We need to gather it for the whole document and somehow keep track of the link between a given letter and the element it belongs to. On top of that, we need to be able to mark a letter for deletion or for conversion to upper case. This is achieved with the following class:

KILL = 0  # delete the letter
CAPS = 1  # convert the letter to upper case


class ElementsWithText(object):
    """Container for xml elements that carry text, plus per-letter commands."""

    def __init__(self):
        self.elements = []
        # maps a global letter index (position in the joined text) to a command
        self.commands = {}

    def append(self, element):
        self.elements.append(element)

    def __unicode__(self):
        # complete text of the document, in element order
        return u"".join([x.text for x in self.elements])

    def set_command(self, index, command):
        self.commands[index] = command

    def finalize(self):
        iletter = -1  # global index of the current letter
        for element in self.elements:
            final_text_for_this_element = u""
            for letter in element.text:
                iletter += 1
                if iletter not in self.commands:
                    final_text_for_this_element += letter
                elif self.commands[iletter] == CAPS:
                    final_text_for_this_element += letter.upper()
                # KILL: the letter is simply skipped
            element.text = final_text_for_this_element

Essentially, this is a container for all elements, with some utility methods. The __unicode__ method builds the complete text from all elements stored inside the object. This, as mentioned earlier, is crucial if we want to use regex. The set_command method stores the action to be performed (here KILL or CAPS) on a given letter (i.e. at a given “global” index or position inside the text). Finally, the finalize method goes through all elements and modifies their text in accordance with the instructions stored in the self.commands instance attribute.
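
To see the global-index idea in isolation, here is a small Python 3 sketch with the same logic written as a plain function. The function name apply_commands, the tag names and the sample citation split across two elements are all made up for illustration; the commands are set by hand here instead of coming from a regex.

```python
import xml.etree.ElementTree as ET

KILL, CAPS = 0, 1


def apply_commands(elements, commands):
    """Rewrite element texts; commands maps a global letter index to an action."""
    position = -1  # index of the current letter in the joined text
    for element in elements:
        kept = []
        for letter in element.text:
            position += 1
            command = commands.get(position)
            if command is None:
                kept.append(letter)
            elif command == CAPS:
                kept.append(letter.upper())
            # KILL: the letter is dropped
        element.text = "".join(kept)


# One citation, deliberately split across two elements.
root = ET.fromstring("<p><t>(M. Mo</t><t>use, 1901)</t></p>")
elements = [e for e in root.iter() if e.text]
joined = "".join(e.text for e in elements)  # "(M. Mouse, 1901)"

commands = {i: KILL for i in range(1, 4)}        # the letters of "M. "
commands.update({i: CAPS for i in range(4, 9)})  # the letters of "Mouse"
apply_commands(elements, commands)

print([e.text for e in elements])
```

The surname ends up in upper case even though its letters live in two different elements, which is exactly what a regex run on a single element could not achieve.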

The above class can be used in the following way:

from lxml import etree as ET
import re

tree = ET.parse('document.xml.org')
root = tree.getroot()

elements_data = ElementsWithText()
for element in root.iter():
    if element.text:
        elements_data.append(element)

str_text = unicode(elements_data)
for match in re.finditer(r"\(([^ ]+?\. )(\w+)", str_text):
    for i in xrange(match.start(1), match.end(1)):
        elements_data.set_command(i, KILL)

    for i in xrange(match.start(2), match.end(2)):
        elements_data.set_command(i, CAPS)

elements_data.finalize()

tree.write('document.xml', xml_declaration = True, encoding = "UTF-8", method = "xml", standalone = "yes")

After parsing the document.xml file with lxml, we store the elements that have non-empty text inside an instance of our class (i.e. ElementsWithText). Then we build the complete text (the str_text variable above), on which we can run regex matching. The regular expression marks which parts of the text should be removed and which should be capitalized. The call to the finalize method performs those modifications. At the end we are left with a modified document.xml file that we can put inside a new docx file.
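
It may help to look at what the regular expression actually captures. The Python 3 sketch below runs the same pattern on the sample citation from the beginning of the post and records the spans of both groups; the variable names are made up for illustration.

```python
import re

# Same pattern as in the script above, written as a raw string.
CITATION = re.compile(r"\(([^ ]+?\. )(\w+)")

text = "(M. Mouse, 1901; D. Duck, 1999)"
matches = [
    (m.span(1), m.group(1), m.span(2), m.group(2))
    for m in CITATION.finditer(text)
]
print(matches)
```

Group 1 covers the initial with its trailing space (marked KILL) and group 2 the surname (marked CAPS). Because the pattern is anchored on the opening parenthesis, only the first author in each bracket is matched; the author after the semicolon is left alone, which is presumably one of the cases a complete solution has to handle separately.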

Summary

We have learned how to modify text inside a docx (the latest MS Word format) file. A crucial part of the process was understanding how to write a document.xml file that conforms to the docx format. It was also necessary to code some additional infrastructure in order to be able to use regex.

I played with the code that was the basis for this post for a couple of hours before obtaining a final solution (since there were some special cases and exceptions to the general rule of what to change and how). Was it useful for fixing the citations? An honest answer would be “partially”. It turned out that setting letters to upper case was not enough: MS Word has a special style called “Small Caps” that makes things look slightly nicer. So the “delete” part of the program was OK, the “upper case” part not fully. And since at the time I had no possibility to work further on the problem (and the deadline for another manuscript version was close), part of this task had to be performed in a tedious, manual way.



  1. We’ll check those. I’m genuinely interested in how it will work out, as the number of different types of references is astonishing. In the text they look consistent, as I wrote in the post above. In the bibliography part it’s a completely different story. For example: cite a 100-year-old archive placed somewhere in a museum in Russia, add a transliteration (i.e. write the name in our alphabet, next to the original name in Cyrillic) and maybe mention that large parts of it are actually in Germany or Sweden (due to the rich European history of the 20th century). But jokes aside, I really hope those tools will manage to work OK in this case.

    As for regex search directly inside MS Word: is it possible to apply different actions to different groups of an expression inside the standard search & replace dialog (in our case, delete the first group and capitalize the second)? I guess one must go for macros here.
