For a language that likes XML so much, it was rather difficult to find an efficient and elegant way to parse it.

There’s more than one way to parse XML?

There are two ways to work with XML documents: read the full document and build a DOM object representing it, or read the document as a stream, extracting what you need as it goes by. The DOM method is fine for small documents like configuration files but completely falls apart when working with documents of any real size.
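To make the contrast concrete, here is a minimal sketch of the DOM approach using the standard javax.xml.parsers API. The DomDemo class name and the tiny inline document are just for illustration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<siteinfo><sitename>Wikipedia</sitename></siteinfo>";
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The entire document is materialised in memory before we can touch it,
        // which is exactly why this approach fails on multi-gigabyte files.
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(
                doc.getElementsByTagName("sitename").item(0).getTextContent());
    }
}
```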

When it comes to stream parsing in Java, the SAX parser seems to be the most common choice; most Stack Overflow answers and tutorials about parsing large XML files in Java point to it. The only problem with the SAX parser is that its event-driven API is very awkward to work with: defining an event listener (Handler) and waiting for it to be called removes your control of the stream. I like to have control of the execution of my program and have always preferred a pull-based API when working with streams.

Fortunately, like most things in programming, I’m not the first person to experience these pain points. The StAX API was created to address them by providing a pull-based API over an XML stream.

The advantages of the approach are best described by the JavaDocs:

Pull parsing provides several advantages over push parsing when working with XML streams:

  • With pull parsing, the client controls the application thread, and can call methods on the parser when needed. By contrast, with push processing, the parser controls the application thread, and the client can only accept invocations from the parser.
  • Pull parsing libraries can be much smaller and the client code to interact with those libraries much simpler than with push libraries, even for more complex documents.
  • Pull clients can read multiple documents at one time with a single thread.
  • A StAX pull parser can filter XML documents such that elements unnecessary to the client can be ignored, and it can support XML views of non-XML data.
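The first point is the crucial one. As a sketch of what “the client controls the application thread” means in practice (the PullDemo class name and inline document are illustrative only):

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class PullDemo {
    public static void main(String[] args) throws XMLStreamException {
        String xml = "<pages><page><title>Foo</title></page></pages>";
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        // The client drives the parser: each call to nextEvent() pulls
        // exactly one event off the stream, whenever we are ready for it.
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement() && event.asStartElement()
                    .getName().getLocalPart().equals("title")) {
                System.out.println(reader.getElementText());
            }
        }
    }
}
```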

Show me this magic!

So let’s use the Wikipedia dataset as an example; below is a sample of the data:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>testwiki</dbname>
    <base>https://test.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.30.0-wmf.5</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>Wikipedia:Administrators</title>
    <ns>4</ns>
    <id>10</id>
    <revision>
      ...
      <sha1>9sq7h8po7k7dsucxoohgg597z6t0jod</sha1>
    </revision>
  </page>
  ...
</mediawiki>

And here’s the code to extract the name and id of each page:

package me.martinrichards.wikipedia;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.zip.ZipInputStream;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class WikipediaDataParser {
    private static final String ELEMENT_PAGE = "page";
    private static final String ELEMENT_NAME = "title"; // the page name is stored in the <title> element
    private static final String ELEMENT_ID = "id";

    private final String file;
    private final XMLInputFactory factory = XMLInputFactory.newInstance();
    private final List<WikipediaPage> pages;

    public WikipediaDataParser(final String file, final List<WikipediaPage> pages) {
        this.file = file;
        this.pages = pages;
    }

    public void parse() throws IOException, XMLStreamException {
        try(final InputStream stream = this.getClass().getResourceAsStream(file)) {
            try(final ZipInputStream zip = new ZipInputStream(stream)) {
                zip.getNextEntry(); // advance to the first entry before anything can be read
                final XMLEventReader reader = factory.createXMLEventReader(zip);
                while (reader.hasNext()) {
                    final XMLEvent event = reader.nextEvent();
                    if (event.isStartElement() && event.asStartElement().getName()
                            .getLocalPart().equals(ELEMENT_PAGE)) {
                        parsePage(reader);
                    }
                }
            }
        }
    }

    private void parsePage(final XMLEventReader reader) throws XMLStreamException {
        String name = null;
        String id = null;
        while (reader.hasNext()) {
            final XMLEvent event = reader.nextEvent();
            if (event.isEndElement() && event.asEndElement().getName()
                    .getLocalPart().equals(ELEMENT_PAGE)) {
                // Reached </page>: record the completed page before returning control.
                pages.add(new WikipediaPage(id, name));
                return;
            }
            if (event.isStartElement()) {
                final StartElement element = event.asStartElement();
                final String elementName = element.getName().getLocalPart();
                switch (elementName) {
                    case ELEMENT_NAME:
                        name = reader.getElementText();
                        break;
                    case ELEMENT_ID:
                        id = reader.getElementText();
                        break;
                }
            }
        }
    }
}

Simple…

Wait! What just happened?

You can see the full project here; the first little bit extracts the XML file from a zip archive stored in the resources folder. Note that a ZipInputStream must first be positioned at an entry with getNextEntry() before its contents can be read.

public void parse() throws IOException, XMLStreamException {
    try(final InputStream stream = this.getClass().getResourceAsStream(file)) {
        try(final ZipInputStream zip = new ZipInputStream(stream)) {
            zip.getNextEntry(); // advance to the first entry before anything can be read
            ...
        }
    }
}

The next part uses the XMLInputFactory to create a StAX reader from the stream and starts reading the file. The reader provides a pull API for extracting XML events from the stream. This first loop goes through the document looking for <page> start elements and then passes control to the parsePage function to handle the processing of each page element.

final XMLEventReader reader = factory.createXMLEventReader(zip);
while (reader.hasNext()) {
    final XMLEvent event = reader.nextEvent();
    if (event.isStartElement() && event.asStartElement().getName()
          .getLocalPart().equals(ELEMENT_PAGE)) {
        parsePage(reader);
    }
}

The parsePage function continues to loop through the events in the stream. First we check whether we’ve reached the end of the page element and, if so, record the completed page and return, as the scope of this function is complete.

private void parsePage(final XMLEventReader reader) throws XMLStreamException {
    String name = null;
    String id = null;
    while (reader.hasNext()) {
        final XMLEvent event = reader.nextEvent();
        if (event.isEndElement() && event.asEndElement().getName()
                .getLocalPart().equals(ELEMENT_PAGE)) {
            // Reached </page>: record the completed page before returning control.
            pages.add(new WikipediaPage(id, name));
            return;
        }
        if (event.isStartElement()) {
            final StartElement element = event.asStartElement();
            final String elementName = element.getName().getLocalPart();
            switch (elementName) {
                case ELEMENT_NAME:
                    name = reader.getElementText();
                    break;
                case ELEMENT_ID:
                    id = reader.getElementText();
                    break;
            }
        }
    }
}

Otherwise, the start element is examined and action taken based on which one has been reached. This lets us iterate quickly over all the elements contained within the page element, extracting what is needed and ignoring the rest. An additional case could easily be added to the switch statement to handle the revision element, calling another function to handle the details of parsing it.
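Such a revision handler could follow exactly the same pattern as parsePage: consume events until the matching end element, pulling out the fields of interest along the way. A minimal sketch, where the RevisionDemo class and parseRevision helper are hypothetical names and only the <sha1> field is extracted:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class RevisionDemo {
    // Hypothetical helper in the same style as parsePage: consume events
    // until </revision>, extracting the fields we care about.
    static String parseRevision(XMLEventReader reader) throws XMLStreamException {
        String sha1 = null;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isEndElement() && event.asEndElement()
                    .getName().getLocalPart().equals("revision")) {
                return sha1;
            }
            if (event.isStartElement() && event.asStartElement()
                    .getName().getLocalPart().equals("sha1")) {
                sha1 = reader.getElementText();
            }
        }
        return sha1;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<revision><sha1>9sq7h8po7k7dsucxoohgg597z6t0jod</sha1></revision>";
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        System.out.println(parseRevision(reader));
    }
}
```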

Why was that so simple?

Writing a SAX parser feels like instructing a clown how to juggle: it’s awkward, and it makes the relatively simple task of reading an XML file unnecessarily complicated. The StAX parser flips this on its head. Instead of surrendering control and providing functions to be called when events happen, events are requested and handled as needed, leaving control with the caller.
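For comparison, here is roughly what the same title extraction looks like with a SAX DefaultHandler; note how the logic is scattered across callbacks and glued together with mutable state. The SaxDemo class name and inline document are illustrative only:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<page><title>Foo</title><id>10</id></page>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The parser drives execution; we can only react inside callbacks,
        // tracking which element we are in with mutable state.
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    private boolean inTitle;

                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes attrs) {
                        inTitle = qName.equals("title");
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inTitle) System.out.print(new String(ch, start, length));
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        if (qName.equals("title")) {
                            System.out.println();
                            inTitle = false;
                        }
                    }
                });
    }
}
```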

Java is a great language when used well; unfortunately, that is not always the case. The SAX parser is one such example: while it does serve a purpose, that purpose is niche. It should not be the default answer for efficient XML parsing, and why it still is when a much better solution exists in the standard library boggles my mind.