Optimizing and Refactoring a Python Function

Revisiting old problems and their solutions can be very instructive. Because Python is so quick to write and easy to read, you can refactor with minimal effort, and you should not be afraid to throw away or rewrite entire functions or scripts. What follows is a little story of a function I refactored, then refactored again, because I knew it could be better.

At work, we have large textbooks written in DocBook XML which we translate into HTML when a user wants to read them. At some point during transcription to XML, poorly formed Figure IDs were introduced; when those IDs were referenced by XREF tags (which are translated into HTML anchors), we would get broken links because the ID did not translate into a valid HTML ID. Instead of rewriting the XML-to-HTML translation code (which is big and hairy by itself), I decided to just parse the books and fix the offending links. The script would have to be careful to preserve the relationship between links and their figures.
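
To make the problem concrete, here is a minimal sketch of the kind of markup involved, parsed with lxml. The sample fragment and its IDs are invented for illustration; only the figure/xref tag names and the id/linkend attributes come from the code below.

from lxml import etree

# A DocBook-like fragment: the figure ID contains spaces and parentheses,
# so it does not survive translation into an HTML anchor.
sample = b"""
<chapter>
  <figure id="Fig 1.2 (old)"><title>Throughput</title></figure>
  <para>See <xref linkend="Fig 1.2 (old)"/> for details.</para>
</chapter>
"""

xmlroot = etree.fromstring(sample)
for xref in xmlroot.iter('xref'):
    print(xref.get('linkend'))  # -> "Fig 1.2 (old)", which breaks as an HTML ID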

Version 1

I started with a straightforward approach and made two loops: (a) find the bad links and (b) fix the links and their related figures. I never intended to use this algorithm in production; I just like to write an implementation quickly to be sure I have thought of all the details and edge cases.

def fix_links_and_figures(xmlroot, book=None):
    docids = book._all_ids if book else {}
    xrefs = xmlroot.findall(".//xref")
    xmem = {}  # Map each broken ID to the xref tags that reference it

    # Loop (a): collect the bad links.
    for xref in xrefs:
        link = xref.get('linkend')
        if proper_link_format(link):
            continue
        xmem.setdefault(link, []).append(xref)

    # Loop (b): mint a new ID for each broken one, then fix figures and links.
    for link, taglist in xmem.items():
        newid = generate_unique_id(xmlroot, docids)
        for figure in xmlroot.xpath("//figure[@id='%s']" % link):
            figure.set('id', newid)
        for tag in taglist:
            tag.set('linkend', newid)
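
The helper functions are not shown in this post. Here is a minimal sketch of what proper_link_format (renamed proper_id_format in the later versions) and generate_unique_id might look like; the regular expression and the 'fig-' prefix are assumptions, not the production rules.

import re
import uuid

ID_RE = re.compile(r'^[A-Za-z][\w.-]*$')  # Assumed definition of a well-formed ID

def proper_id_format(elementid):
    """Return True if the ID would survive translation into an HTML anchor."""
    return bool(elementid and ID_RE.match(elementid))

proper_link_format = proper_id_format  # Version 1 calls it by the older name

def generate_unique_id(xmlroot, docids):
    """Mint an ID that collides with nothing in the document or the book."""
    while True:
        newid = 'fig-' + uuid.uuid4().hex[:8]
        if newid not in docids and not xmlroot.xpath("//*[@id='%s']" % newid):
            return newid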

Version 2

Of course, this algorithm is terribly inefficient. To make it faster, I could (a) cache the result of testing each link so duplicates are skipped, and (b) check links and fix tags within the same loop.

def fast_fix(xmlroot, book=None):
    docids = book._all_ids if book else {}
    xrefs = xmlroot.findall(".//xref")

    linkmem = {}  # Links we have tested, mapped to the test result
    xmem = {}     # Map old IDs to new IDs

    for xref in xrefs:
        elementid = xref.get('linkend')
        if elementid in linkmem:
            if linkmem[elementid]:  # ID previously tested correct, so skip it
                continue
        else:
            proper = proper_id_format(elementid)
            linkmem[elementid] = proper
            if proper:
                continue
        if elementid not in xmem:  # First time we see this bad ID: mint a new one
            xmem[elementid] = generate_unique_id(xmlroot, docids)
        newid = xmem[elementid]
        xref.set('linkend', newid)

        for figure in xmlroot.xpath("//*[@id='%s']" % elementid):
            figure.set('id', newid)

I was so proud of myself after building this magical caching beast that I left it alone, thinking it (I) was awesome. Then I looked at the function again after a short break and smacked myself in the forehead. When writing this algorithm, my brain had been stuck in the mindset of “Fix the broken Links”, which is why I parsed the XREF tags and needed a complicated caching system to handle duplicate tag references and to remember which links I had already tested and updated.

Version 3

When I looked at the problem with fresh eyes, I realized the correct mindset was “Fix the broken Figure IDs”. This time I started with the Figure tags; if an ID was incorrect, I changed it along with any corresponding links. By starting with the Figure tags, I could assume that each ID was unique (otherwise the XML document would be invalid) and drop the caching dicts entirely.

def figure_id_fix(xmlroot, book=None):
    """
    An even more efficient method of fixing Figure IDs and XRef Links.
    """
    docids = book._all_ids if book else {}
    figures = xmlroot.findall(".//figure")

    for figure in figures:
        elementid = figure.get('id')
        if proper_id_format(elementid):  # Skip IDs that are already well formed
            continue

        newid = generate_unique_id(xmlroot, docids)  # ID is improper, so mint a new one
        figure.set('id', newid)

        for xref in xmlroot.xpath("//*[@linkend='%s']" % elementid):  # Fix all related links
            xref.set('linkend', newid)
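
Here is a quick end-to-end check of the final version, using the sketched helpers above. The sample fragment is invented, and with book=None the docids dict is simply empty.

from lxml import etree

doc = etree.fromstring(b"""
<chapter>
  <figure id="Fig 1.2 (old)"><title>Throughput</title></figure>
  <para>See <xref linkend="Fig 1.2 (old)"/>.</para>
</chapter>
""")

figure_id_fix(doc)
print(etree.tostring(doc, pretty_print=True).decode())
# The figure and its xref now share a freshly minted, well-formed ID.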

Conclusion

If you were counting, you know that was three versions of the same function. The funniest part was that I was sure I had the best version after the second rewrite. Of course, looking back it is obvious that the function was not just suboptimal but wrongheaded in general, and even now it is not perfect. But just because you solved a problem once does not mean you can’t learn from it again later.
