MiniDOM is a tiny subset of the DOM APIs.

PullDOM is a really simple API for working with DOM objects in a streaming (efficient!) manner rather than as a monolithic tree.

The biggest problem with SAX is that it is inconvenient and, to a certain extent, pretty complicated. You have to fill in a lot of methods with predetermined signatures, go through a few pre-parse incantations and organize your code in a callback pattern. You have to take complete control of any state you need to keep. If you don't, you won't even be able to differentiate characters in a title from characters in an emphasis. I'm not saying it's brutally complex, I'm just saying that it isn't as easy as PullDOM. xmllib has most of the same issues.

The biggest problem with the standard DOM is that you must parse the whole document into a random access structure, which typically means that you must have a lot of RAM to process a very big document. I get nervous writing software when I know that a big document could crash it.

PullDOM has 80% of the speed of SAX and 80% of the convenience of the DOM. There are still circumstances where you might need SAX (speed freak!) or DOM (complete random access). But IMO there are a lot more circumstances where the PullDOM middle ground is exactly what you need.

import fileinput

for line in fileinput.input(["abc.txt"]):
    process(line)

I believe that XML processing should be as close to that simple as possible. Only, instead of pulling lines, you are pulling elements, text and processing instruction events.

import pulldom

events = pulldom.parse("file.xml")
for (event, node) in events:
    process(node)

Events are one of:

START_ELEMENT
END_ELEMENT
COMMENT
START_DOCUMENT
END_DOCUMENT
CHARACTERS
PROCESSING_INSTRUCTION
IGNORABLE_WHITESPACE

You will seldom see the last two and can safely ignore them for most applications. The nodes are DOM nodes. If you know the DOM API, they'll be obvious. If you don't, it's a pretty simple API, all in all.
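To make the event/node pairs concrete, here is a minimal sketch that tells the characters in a title apart from the characters in an emphasis, which is the state-keeping chore mentioned above for SAX. It assumes the module is importable as xml.dom.pulldom (where it lives in the Python standard library) rather than as a standalone pulldom module; the sample document and the text_by_element helper are made up for illustration.

```python
from xml.dom import pulldom

# Hypothetical sample document, small enough to read at a glance.
XML = "<doc><title>Pull parsing</title><em>fast</em></doc>"

def text_by_element(xml_text):
    """Map each element name to the character data found directly inside it."""
    open_tags = []   # names of the elements we are currently inside
    text = {}
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            open_tags.append(node.tagName)
        elif event == pulldom.END_ELEMENT:
            open_tags.pop()
        elif event == pulldom.CHARACTERS and open_tags:
            # node is a DOM text node; its .data holds the characters.
            name = open_tags[-1]
            text[name] = text.get(name, "") + node.data
    return text

result = text_by_element(XML)
print(result)
```

The only state is a list of open tag names, and it lives in an ordinary loop instead of being spread across callback methods.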

Of course PullDOM is supposed to be easy, but it must also be very efficient. So the events "list" is not really a list. It's a special object that lazily fetches information from the XML document. If you haven't reached a node in the iteration, it hasn't been parsed yet. If you don't store the node away in some data structure of your choosing, PullDOM will just forget it (just as if you were reading lines from a file!).
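One way to watch the laziness is to hand the parser a stream and check how much of it has been read when the loop stops early. This is only a sketch: it assumes the standard library's xml.dom.pulldom, whose parse function accepts a file-like object and a bufsize argument, and the in-memory document is invented for the demonstration.

```python
import io
from xml.dom import pulldom

# A made-up document with many <item> elements, held in memory for the demo.
DATA = (
    "<items>"
    + "".join(f"<item n='{i}'/>" for i in range(10000))
    + "</items>"
).encode("ascii")

stream = io.BytesIO(DATA)

# A small bufsize makes the effect visible: input is fed to the
# parser a few hundred bytes at a time, and only when more events are needed.
events = pulldom.parse(stream, bufsize=256)

first_item = None
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "item":
        first_item = node
        break   # stop pulling; the rest of the document is never read

consumed = stream.tell()
print(f"read {consumed} of {len(DATA)} bytes before stopping")
```

Only the front of the document is consumed, because nothing past the first item was ever asked for.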

PullDOM never has the entire document in memory unless you ask it to. In fact, it never builds any tree unless you ask it to. Nodes do not typically know about their children or their siblings. They do know about their parents because this can be implemented pretty efficiently. If you want a node to know about its children, you can "expand" it:

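import pulldom

events = pulldom.parse("file.xml")
for (event, node) in events:
    # Compare against the module's event constant rather than a bare string.
    if event == pulldom.START_ELEMENT and node.tagName == "table":
        # Pull in the rest of the <table> subtree and attach it to node.
        events.expandNode(node)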
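Here is a self-contained variation that shows what the expanded node gives you afterwards. It is a sketch that assumes the standard library's xml.dom.pulldom and a made-up sample document: once the table is expanded it behaves like an ordinary DOM subtree, while the paragraphs around it are never built into a tree.

```python
from xml.dom import pulldom

# Hypothetical sample document: one table surrounded by paragraphs.
XML = (
    "<doc>"
    "<p>intro</p>"
    "<table><row>a</row><row>b</row></table>"
    "<p>outro</p>"
    "</doc>"
)

tables = []
events = pulldom.parseString(XML)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "table":
        # Consumes events up to the matching end tag and attaches
        # the children to node, so the usual DOM calls work on it.
        events.expandNode(node)
        tables.append(node)

# The expanded node can now be searched like any DOM element.
rows = [row.firstChild.data for row in tables[0].getElementsByTagName("row")]
print(rows)
```

After expandNode returns, the loop simply carries on with whatever follows the table.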

As you can see, the events stream object is pretty sophisticated. Still, it has limits. You can only expand the current node because events and nodes relating to any other node are probably lost in the mists of time. If you need to go back to them, that is what the standard DOM is for: it parses an XML document into a random access data structure!

That's all you need to know to use PullDOM: the DOM APIs, the parse method, the for-looping convention and the "expandNode" method.

More info is here and here.

Code is here.

You need Python 1.6 or at least a modern version of PyExpat to run this.