Saturday, May 30, 2009

Top gear-- how fast can we drive Python?

To judge how well my wrapper is performing, I needed some idea of the theoretical limit on performance I could expect with Python. Specifically, I needed to know how fast I could generate upcalls into Python from an external library via a thread that Python knows nothing about.

Depending on how you use it, LBM can spawn a couple of threads for your process, one of which appears to be in charge of performing upcalls into application code upon receipt of messages. As these threads are outside the set spawned by Python itself, I was concerned as to whether there would be significant cost in Python bestowing the GIL to an alien thread for the upcall. So I figured I'd create a trivial extension module that performs upcalls in a couple of different ways to see how fast we can go.
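For readers without Pyrex at hand, the alien-thread situation can be reproduced with ctypes alone. This is only a sketch and not the extension used for the measurements: it assumes Python 3 on a Linux-like POSIX system where pthread_create and pthread_join are visible through the process's own symbol table and where pthread_t is an unsigned long. The names upcall and run_on_alien_thread are made up for illustration. ctypes acquires the GIL on behalf of the foreign thread before running the callback, which is the same kind of work the test extension asks of Python.

```python
import ctypes
import threading

# Assumption: POSIX system; pthread symbols reachable via the main program handle.
libc = ctypes.CDLL(None)

# C signature of a pthread start routine: void *(*)(void *)
START_ROUTINE = ctypes.CFUNCTYPE(ctypes.c_void_p, ctypes.c_void_p)

# Assumption: pthread_t is an unsigned long (true on Linux).
libc.pthread_create.argtypes = [ctypes.POINTER(ctypes.c_ulong), ctypes.c_void_p,
                                START_ROUTINE, ctypes.c_void_p]
libc.pthread_create.restype = ctypes.c_int
libc.pthread_join.argtypes = [ctypes.c_ulong, ctypes.c_void_p]
libc.pthread_join.restype = ctypes.c_int

seen = {}

def upcall(_arg):
    # Runs on the raw pthread; ctypes has acquired the GIL for us.
    seen["ident"] = threading.get_ident()
    return 0  # converted to a NULL void* result

def run_on_alien_thread():
    callback = START_ROUTINE(upcall)  # keep a reference while the thread runs
    tid = ctypes.c_ulong()
    if libc.pthread_create(ctypes.byref(tid), None, callback, None) != 0:
        raise RuntimeError("failed to start pthread")
    libc.pthread_join(tid, None)  # ctypes releases the GIL during this call
    return seen["ident"]

alien_ident = run_on_alien_thread()
print("main thread:", threading.get_ident(), "alien thread:", alien_ident)
```

The callback runs on a thread that Python never created, so the identifier it records differs from the main thread's.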

The test is composed of three parts: the main line code which sets everything in motion and reports performance stats, a simple extension that does upcalls as fast as possible, and a class on which the upcalls will be performed. The method which is the upcall target does nothing, thus giving me a reasonable baseline for comparison.

I decided to have two variables that I'd modify to see what differences emerge:

  • First, I'll vary what thread we generate upcalls from. The test extension module will be able to perform upcalls from a Python-spawned thread, a thread spawned outside of Python (a raw pthread), and from the main Python thread itself.

  • Second, we'll vary the kind of object we upcall to. I'll implement a pure Python class, and then a second class that will be implemented as a Pyrex extension type. There will be a callback method on each that will do nothing but return.
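As a rough illustration of the measurement pattern only, the following pure-Python sketch times no-op upcalls from the calling thread and from a Python-spawned thread. It cannot cover the alien-thread case or the extension-type target, and the names NoOpTarget, drive and time_upcalls are hypothetical, not part of the real test. The loop itself runs in Python here, so its rates are not comparable to the ones reported later.

```python
import threading
import time

class NoOpTarget(object):
    """Hypothetical upcall target: the callback does nothing but return."""
    def callback(self):
        return

def drive(callback, limit):
    # Invoke the callback as fast as a pure-Python loop allows.
    for _ in range(limit):
        callback()

def time_upcalls(limit, from_thread):
    """Return upcalls/sec for `limit` calls, optionally from a worker thread."""
    target = NoOpTarget()
    start = time.perf_counter()
    if from_thread:
        worker = threading.Thread(target=drive, args=(target.callback, limit))
        worker.start()
        worker.join()
    else:
        drive(target.callback, limit)
    elapsed = time.perf_counter() - start
    return limit / elapsed if elapsed > 0 else float("inf")

for from_thread in (False, True):
    label = "python thread" if from_thread else "calling thread"
    print("%s: %.0f upcalls/sec" % (label, time_upcalls(200000, from_thread)))
```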

Here's the test program's main, which includes the pure Python upcall target:

import time
import cUpcaller

class PyUpcallTarget(object):
    def __init__(self):
        self.kind = "python upcall target"

    def callback(self):
        return

def doit():
    limit = 5000000
    targets = [PyUpcallTarget(), cUpcaller.CUpcallTarget()]
    ucThreadsDict = {cUpcaller.WITH_PYTHON_THREAD:"python thread",
                     cUpcaller.WITH_ALIEN_THREAD:"alien thread",
                     cUpcaller.WITH_CALLING_THREAD:"calling thread"}
    waysToCall = ucThreadsDict.keys()
    for target in targets:
        for callHow in waysToCall:
            upcaller = cUpcaller.Upcaller(callHow, target.callback, limit)
            print ("Timing for %s from a %s"
                   % (target.kind, ucThreadsDict[callHow]))
            start = time.time()
            upcaller.go()
            upcaller.join()
            elapsed = time.time() - start
            print (" %d upcalls took %f secs, averaging %f upcalls/sec"
                   % (limit, elapsed, limit / elapsed))

if __name__ == "__main__":
    doit()

And the extension Pyrex code, which includes the extension type upcall target:

import threading
cdef extern from "pthread.h" nogil:
    ctypedef unsigned long int pthread_t
    cdef enum:
        __SIZEOF_PTHREAD_ATTR_T = 256 #the value here isn't important
    cdef union pthread_attr_t:
        char __size[__SIZEOF_PTHREAD_ATTR_T]
        long int __align
    int pthread_create(pthread_t *__newthread, pthread_attr_t *__attr,
                       void *(*__start_routine) (object), object __arg)
    int pthread_join(pthread_t tid, void **valPtr)

WITH_PYTHON_THREAD = 1
WITH_ALIEN_THREAD = 2
WITH_CALLING_THREAD = 3

cdef class Upcaller

cdef int upcaller1(object theUpcaller) with gil:
    #acquires the GIL on entry, so it's safe to touch Python objects here
    cdef int result
    cdef Upcaller ucRouter
    ucRouter = <Upcaller> theUpcaller
    result = ucRouter.routeUpcall()
    return result

cdef void *upcaller1Agent(object theUpcaller) nogil:
    #stand-in for the external library: loop, upcalling until told to quit
    cdef int quit
    quit = 0
    while quit == 0:
        quit = upcaller1(theUpcaller)
    return NULL


cdef class Upcaller:
    cdef callback
    cdef callHow
    cdef upcallThread
    cdef pthread_t alienThread
    cdef public long upcallLimit
    cdef public long upcallCount
    def __init__(self, callHow, callback, limit):
        self.callback = callback
        self.callHow = callHow
        self.upcallThread = None
        self.upcallLimit = limit
        self.upcallCount = 0

    def go(self):
        cdef int callResult
        cdef pthread_t *alienThread
        if self.callHow == WITH_PYTHON_THREAD:
            self.upcallThread = threading.Thread(target=self._hurtEm,
                                                 args=())
            self.upcallThread.start()
        elif self.callHow == WITH_CALLING_THREAD:
            self._hurtEm()
        elif self.callHow == WITH_ALIEN_THREAD:
            alienThread = &self.alienThread
            with nogil:
                callResult = pthread_create(alienThread, NULL,
                                            upcaller1Agent, self)
            if callResult != 0:
                raise Exception("failed to start pthread")
        else:
            raise Exception("unrecognized callHow value")

    def _hurtEm(self):
        with nogil:
            upcaller1Agent(self)

    cdef int routeUpcall(self):
        #indicate being all done by returning 1
        self.callback()
        self.upcallCount += 1
        #use >= so exactly upcallLimit upcalls are made (> made one extra)
        if self.upcallCount >= self.upcallLimit:
            return 1
        else:
            return 0

    def join(self):
        if self.callHow == WITH_PYTHON_THREAD:
            self.upcallThread.join()
        elif self.callHow == WITH_CALLING_THREAD:
            pass #we blocked in go() so there's nothing to join
        else: #must be WITH_ALIEN_THREAD
            with nogil:
                pthread_join(self.alienThread, NULL)


cdef class CUpcallTarget:
    cdef public char *kind

    def __init__(self):
        self.kind = "c ext upcall target"

    def callback(self):
        return

A few words about the Pyrex code:

  • The second line, which starts “cdef extern from ...”, tells Pyrex a couple of things: first, that the following declarations can be found in the pthread.h file, so Pyrex will need to generate a #include for that header; and second, that any functions listed in this section are to be called without the GIL. This acts as a flag to Pyrex that it's acceptable to invoke the function inside a with nogil: block.

  • The pair of functions “upcaller1()” and “upcaller1Agent()” serve as stand-ins for the glue code to the external library and for the external library itself. The upcaller1() function includes the “with gil” suffix on the function definition to tell Pyrex to generate code that acquires the GIL upon entry to the function. Its analog in the real extension would be the upcall target for LBM, which would acquire the GIL for each upcall, making it safe to subsequently interact with Python objects. The upcaller1Agent() function in essence acts as the whole of the LBM library; it does whatever it does, and when it needs to call out to user code (for instance, to deliver a recently arrived message), it invokes the extension callback function upcaller1(). Since upcaller1Agent() is a stand-in for LBM, it is marked as “nogil” to indicate that it is called without the GIL being held.

  • I probably could have gotten a bit more speed using a straight function rather than a bound method for the upcall target, but since I planned on doing away with the old “client data pointer”, a method on an object seemed like a reasonable choice. Anyway, we're really looking for a ballpark figure here, as a real implementation isn't likely to get anywhere near this performance.
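The function-versus-bound-method question in the last point is easy to probe in isolation. The sketch below uses timeit from pure Python, so it measures call overhead from the interpreter rather than from the extension's C loop, and the names Target, plain_callback and calls_per_second are illustrative only. Results will vary by interpreter version and machine, so no particular outcome is claimed.

```python
import timeit

class Target(object):
    def callback(self):
        return

def plain_callback():
    return

def calls_per_second(fn, number=200000):
    # timeit accepts a zero-argument callable and returns total seconds.
    elapsed = timeit.timeit(fn, number=number)
    return number / elapsed if elapsed > 0 else float("inf")

bound_callback = Target().callback

rates = {
    "plain function": calls_per_second(plain_callback),
    "bound method": calls_per_second(bound_callback),
}
for name in sorted(rates):
    print("%-15s %12.0f calls/sec" % (name, rates[name]))
```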

I ran the test program five times and averaged the results, which are shown in the following table. These rates are upcalls/sec:


                                      Pure Python      Python C extension
                                      upcall target    class upcall target
Upcalling thread known to Python      1,827,283        3,528,114
Upcalling thread unknown to Python      948,044        1,329,958
Upcalls from main thread              2,018,659        3,330,011

The test host contains an Athlon 64 X2 dual core 3800+ processor and 3 GB of RAM.

Pretty interesting numbers, to be sure. The good news is that a user of the LBM extension would have some options if they needed better performance; it's pretty clear that turning your callback objects into Pyrex extension types would give you a significant boost in performance (an interesting data point for Pyrex use in general, too). The bad news is the performance hit encountered when the upcalls are performed by a thread created outside of Python's knowledge (that is, not using threading but rather a raw pthread). I understand Java suffers from a similar effect with alien threads calling up to Java via the JNI.
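To put figures on "significant boost" and "performance hit", the ratios can be worked out directly from the measured rates. The helper names below are just for illustration; the numbers are the ones reported in the table.

```python
# Upcalls/sec from the table above: (upcalling thread, target kind) -> rate
RATES = {
    ("python thread", "pure python"): 1827283,
    ("python thread", "extension type"): 3528114,
    ("alien thread", "pure python"): 948044,
    ("alien thread", "extension type"): 1329958,
    ("main thread", "pure python"): 2018659,
    ("main thread", "extension type"): 3330011,
}

def extension_speedup(thread):
    """How many times faster the extension-type target is on a given thread."""
    return RATES[(thread, "extension type")] / float(RATES[(thread, "pure python")])

def alien_fraction(target):
    """Alien-thread rate as a fraction of the Python-thread rate."""
    return RATES[("alien thread", target)] / float(RATES[("python thread", target)])

for thread in ("python thread", "alien thread", "main thread"):
    print("%-14s extension speedup: %.2fx" % (thread, extension_speedup(thread)))
for target in ("pure python", "extension type"):
    print("%-14s alien/python rate: %.2f" % (target, alien_fraction(target)))
```

The extension type comes out at roughly 1.9x the pure Python target on a Python thread but only about 1.4x on an alien thread, and the alien thread manages roughly half the Python-thread rate for the pure Python target and a bit over a third for the extension type.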

This of course raises an API extension request for 29West. It would be great to expose an optional interface for the user to plug in their own “thread factory”. The default factory would simply be a call to pthread_create() to start up a thread of control at some identified function. However, a user-supplied factory could use whatever means it desired to spawn a thread. In the case of Python, it would be a simple matter to create a new thread with “threading” and have it invoke a “nogil” function that would then execute the 29West thread entry point. This way Python wouldn't have to do all the extra work it otherwise faces whenever an alien thread tries to acquire the GIL, which should let such code run much faster.
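Here is a sketch of what such a pluggable factory might look like from the Python side. Everything in it is hypothetical: FakeLibrary, library_entry_point and python_thread_factory are invented for illustration and do not correspond to any existing 29West or LBM interface. The point is only that the library asks a caller-supplied function to start its worker, so the worker ends up being a thread Python already knows about.

```python
import threading

def library_entry_point(results):
    # Hypothetical stand-in for the library's internal thread body.
    results.append(threading.current_thread().name)

class FakeLibrary(object):
    """Hypothetical library that delegates thread creation to a factory."""

    def __init__(self, thread_factory):
        self._thread_factory = thread_factory
        self._worker = None
        self.results = []

    def start(self):
        # The factory must start entry_point(arg) on some thread and
        # return an object with a join() method.
        self._worker = self._thread_factory(library_entry_point, self.results)

    def stop(self):
        self._worker.join()

def python_thread_factory(entry_point, arg):
    # User-supplied factory: spawn the worker through threading so the
    # interpreter knows about the thread from the start.
    worker = threading.Thread(target=entry_point, args=(arg,), name="lib-worker")
    worker.start()
    return worker

lib = FakeLibrary(python_thread_factory)
lib.start()
lib.stop()
print("library code ran on:", lib.results[0])
```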

Now I have some idea of what to aspire to.
