This is a rough summary of a few days of experimentation with Pyrex as a technology for improving the runtime performance of Python modules. The key question was whether Pyrex could generate code that was fast enough to replace hand-written C extensions where performance matters (e.g. Pyexpat, heapq, winreg, etc.). I think the answer is a qualified yes. Pyrex code is probably comparable to naively coded C code. But Pyrex code can be trounced by code that makes heavy use of macros like PyList_GET_SIZE or PyList_GET_ITEM.
The bisect module does binary searches of Python lists. It is a purely algorithmic (no IO, database, networking, GUI etc.) so it needs to run fast. The pure Python version runs bisection operations roughly three times slower than the C version on my reference computer. It does insertions roughly 1.8 times slower.
If we merely touch up the syntax (Pyrex doesn't support the new division syntax yet) and recompile with Pyrex, we get very little performance improvement: At best a few percentage points. Let's call this version the Naive Pyrex version.
The next step is to add type declarations to the Pyrex version. That will allow it to avoid creating Python objects for every loop index etc. Let's call this version the idiomatic Pyrext version. On the reference computer, this yields a substantial speedup. Instead of up to 3 times slower than the C code, we are about 1.4-1.6 times slower. Still, it would be nice to close that gap further.
The next step is to note that Pyrex generates PyObject_GetItem calls rather than the PySequence_GetItem calls used in the idiomatic C code. This is inefficient because it requires integers to be converted to PyObjects before doing the call. At the cost of some readability, we can convert item index lookups from a[b] syntax to direct calls to the Python C API function: PySequence_GetItem(a, b). This yields what we might call the obfuscated Pyrex version. It may be ugly but it performs quickly. It has run times of 1.2 to 1.3 times that of the C code (i.e. 20%-30% slower).
Rather than put up with the obfuscation, I patched Pyrex to give it a native concept of "sequence object". I'm not yet an expert on the Pyrex code base so I can't vouch for the quality of that patch. It worked for my test cases. At this point I can rewrite my code back into idiomatic New Pyrex code, using my new "sequence" type declaration. This code runs at about the same speed as the last (20%-30% slower than C).
At this point we've exhausted all of the low-hanging optimizations. Readers are encouraged to experiment further if they like. If I were going to do so myself I would try to make a variety of Pyrex versions that increasingly mimicked the C code. You can call C functions and macros in Pyrex so you can get fairly close to the C code. Once that technique runs out of steam you could hack the Pyrex C output and migrate THAT even closer to the C code. When you find big differences between one version and another, you could fix the Pyrex compiler to generate the faster code. If you are lucky, there are one or two things that make most of the difference. If you are unlucky, it may be a variety of small things that add up.
I could take some wild guesses about things that may be slowing Pyrex down:
Still: the result is pretty good. the final version of the module is 90% identical to the pure Python version, manages all references transparently, is pointer safe, has some error reporting features that are better than the C code, and runs only a little bit slower.
I'll go into less detail on this one. Basically I wrote a small Pyrex module to wrap a subset of the Pyexpat API. Like Pyexpat it supports the creation of startHandlers, endHandlers. It does attributes. It handles Unicode. It uses the same string interning technique that the C parser does. I parsed a large (400MB) XML file read from disk. The Pyrex version runs at basically the same speed as the C version (actually it seems faster). The test is not 100% fair because the C version may do a little bit extra work to handle some features the Pyrex version does but looking at the code I don't see anything major. However you count it, the pyrex version is parsing 400MB of XML, recognizing 5 million elements and doing 10 million callbacks in less than 3 minutes on my 700MHZ laptop. That's pretty damn impressive.
The code of the module admittedly looks like a mix of Python and C because you are after all bridging Python to C. But it has no explicit reference counting, no mallocs, no frees, etc. The only place it "cheated" is in one use of PyDict_GetItem instead of x[y]. This comes back to the same Pyrex problem where it doesn't know that it can use a more efficient function than PyObject_GetItem. It would take only a few hours to add a "dict" type to Pyrex that would do the right thing.
The Heap Queue C code that I was trying to match is highly optimized. It uses PyList_ macros which are much more efficient than anything that Pyrex would generate naturally.
After hours of tweaking I did determine how to make a Pyrex version that is nearly as fast as the C version (around 5-10% slower for the heappush, heapreplace and heapify functions and inexplicably 40-45% slower for heappop). But it is far from idiomatic Python code. It uses many C functions directly: even one instance of Py_INCREF (but only one!). Some of the optimizations could be done by the Pyrex compiler itself if it was just a little bit smarter. For instance Pyrex should have a "list" type and use PyList_ functions for get and set items. That would make a big performance difference and would relieve most of my need for direct C APIs.
Even without those optimizations, one could argue that Pyrex code that looks C-ish is better than true C code because it does many things that would otherwise have to be coded directly: checking the return code of calls, reference counting, exception propagation, tracebacks etc. Plus it uses Python syntax which is nicer in a variety of ways (e.g. docstrings, indentation).
Pyrex users have collected several ideas for how to get Pyrex to generate more optimized code.
Joseph Koshy proposes to Intern constant strings. Pedro Rodriguez has a "hackish" patch for it.
Phillip J. Eby suggests to externalize even basic type declarations into Pyrex syntax so that new types can be optimized faster. He also wishes for the use of PyObject_GetAttr instead of PyObject_GetAttrString. I think that this comes back (sort of) to string interning because if the strings were interned they would be Python objects at load time rather than throughout runtime.
Phillip also proposes that builtins not be overridable. Then Pyrex could generate (e.g.) PyList_GETITEM for len(lst) rather than the very inefficient code it generates now.
Pyrex calls PyArg_ParseTuple even when there are no arguments or a single PyObject argument. Idiomatic C code uses the METH_O and METH_NOARGS declarations and skip the type parsing.
Joachim Saul points out that PyObject_CallObject is a slow way of creating new objects. If Pyrex knows that the object to be "called" at compile time then it could generate optimized PyObject_NEW code.