diary of a wrap: Top gear-- how fast can we drive Python?

To judge how well my wrapper is performing, I needed some idea of the theoretical performance limit I could expect with Python. Specifically, I needed to know how fast I could generate upcalls into Python from an external library via a thread that Python knows nothing about.

Depending on how you use it, LBM can spawn a couple of threads for your process, one of which appears to be in charge of performing upcalls into application code upon receipt of messages. As these threads are outside the set spawned by Python itself, I was concerned as to whether there would be significant cost in Python bestowing the GIL to an alien thread for the upcall. So I figured I'd create a trivial extension module that performs upcalls in a couple of different ways to see how fast we can go.

The test is composed of three parts: the main line code which sets everything in motion and reports performance stats, a simple extension that does upcalls as fast as possible, and a class on which the upcalls will be performed. The method which is the upcall target does nothing, thus giving me a reasonable baseline for comparison.

I decided to have two variables that I'd modify to see what differences emerge:

First, I'll vary what thread we generate upcalls from. The test extension module will be able to perform upcalls from a Python-spawned thread, a thread spawned outside of Python (a raw pthread), and from the main Python thread itself.

Second, we'll vary the kind of object we upcall to. I'll implement a pure Python class, and then a second class that will be implemented as a Pyrex extension type. There will be a callback method on each that will do nothing but return.

Here's the test program's main, which includes the pure Python upcall target:

import time
import cUpcaller

class PyUpcallTarget(object):
   def __init__(self):
       self.kind = "python upcall target"
 
   def callback(self):
       return
 
def doit():
   limit = 5000000
   targets = [PyUpcallTarget(), cUpcaller.CUpcallTarget()]
   ucThreadsDict = {cUpcaller.WITH_PYTHON_THREAD:"python thread",
                    cUpcaller.WITH_ALIEN_THREAD:"alien thread",
                    cUpcaller.WITH_CALLING_THREAD:"calling thread"}
   waysToCall = ucThreadsDict.keys()
   for target in targets:
       for callHow in waysToCall:
           upcaller = cUpcaller.Upcaller(callHow, target.callback, limit)
           print ("Timing for %s from a %s"
                  % (target.kind, ucThreadsDict[callHow]))
           start = time.time()
           upcaller.go()
           upcaller.join()
           elapsed = time.time() - start
           print ("  %d upcalls took %f secs, averaging %f upcalls/sec"
                  % (limit, elapsed, limit / elapsed))
 
if __name__ == "__main__":
   doit()

And the extension Pyrex code, which includes the extension type upcall target:

import threading
cdef extern from "pthread.h" nogil:
   ctypedef unsigned long int pthread_t
   cdef enum:
       __SIZEOF_PTHREAD_ATTR_T = 256 #the value here isn't important
   cdef union pthread_attr_t:
       char __size[__SIZEOF_PTHREAD_ATTR_T]
       long int __align
   int pthread_create(pthread_t *__newthread, pthread_attr_t *__attr,
                      void *(*__start_routine) (object), object __arg)
   int pthread_join(pthread_t tid, void **valPtr)

WITH_PYTHON_THREAD = 1
WITH_ALIEN_THREAD = 2
WITH_CALLING_THREAD = 3

cdef class Upcaller

cdef int upcaller1(object theUpcaller) with gil:
   cdef int result
   cdef Upcaller ucRouter
   ucRouter = <Upcaller> theUpcaller
   result = ucRouter.routeUpcall()
   return result

cdef void *upcaller1Agent(object theUpcaller) nogil:
   cdef int quit
   quit = 0
   while quit == 0:
       quit = upcaller1(theUpcaller)
   return NULL


cdef class Upcaller:
   cdef callback
   cdef callHow
   cdef upcallThread
   cdef pthread_t alienThread
   cdef public long upcallLimit
   cdef public long upcallCount
   def __init__(self, callHow, callback, limit):
       self.callback = callback
       self.callHow = callHow
       self.upcallThread = None
       self.upcallLimit = limit
       self.upcallCount = 0
  
   def go(self):
       cdef int callResult
       cdef pthread_t *alienThread
       if self.callHow == WITH_PYTHON_THREAD:
           self.upcallThread = threading.Thread(target=self._hurtEm,
                                                args=())
           self.upcallThread.start()
       elif self.callHow == WITH_CALLING_THREAD:
           self._hurtEm()
       elif self.callHow == WITH_ALIEN_THREAD:
           alienThread = &self.alienThread
           with nogil:
               callResult = pthread_create(alienThread, NULL,
                                           upcaller1Agent, self)
           if callResult != 0:
               raise Exception("failed to start pthread")
       else:
           raise Exception("unrecognized callHow value")
 
   def _hurtEm(self):
       with nogil:
           upcaller1Agent(self)
             
   cdef int routeUpcall(self):
       #indicate being all done by returning 1
       self.callback()
       self.upcallCount += 1
       if self.upcallCount >= self.upcallLimit:
           return 1
       else:
           return 0

   def join(self):
       if self.callHow == WITH_PYTHON_THREAD:
           self.upcallThread.join()
       elif self.callHow == WITH_CALLING_THREAD:
           pass  #we blocked in go() so there's nothing to join
       else: #must be WITH_ALIEN_THREAD
           with nogil:
               pthread_join(self.alienThread, NULL)


cdef class CUpcallTarget:
   cdef public char *kind
 
   def __init__(self):
       self.kind = "c ext upcall target"
  
   def callback(self):
       return

A few words about the Pyrex code:

The second line, which starts “cdef extern from ...”, tells Pyrex two things: first, that the following declarations can be found in pthread.h, so Pyrex will need to generate a #include for that header; and second, that any functions listed in this section may be called without the GIL. This acts as a flag to Pyrex that it's acceptable to invoke these functions inside a with nogil: block.

The pair of functions “upcaller1()” and “upcaller1Agent()” serve as the stand-ins for the glue code to the external library and the external library itself. The upcaller1() function includes the “with gil” suffix on the function definition to tell Pyrex to generate code that acquires the GIL upon entry to the function. An analog to this function would be the upcall target for LBM in the real extension; it would acquire the GIL for each upcall, making it safe to subsequently interact with Python objects. The upcaller1Agent() function in essence acts as the whole of the LBM library; it does whatever it does, and when it needs to call out to user code (for instance, to deliver a recently arrived message), it activates the extension callback function upcaller1(). Since upcaller1Agent() is a stand-in for LBM, it is marked as “nogil” to indicate that it can be called without holding the GIL.
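To see what an alien thread looks like from the Python side without building an extension, here is a minimal sketch using ctypes. It assumes Python 3 on a POSIX system where loading the process's own symbols exposes pthread_create and pthread_join; the helper name upcall_from_alien_thread is made up for illustration. ctypes acquires the GIL when the callback is entered, in the same spirit as “with gil”. It enters Python only once per thread, so it does not reproduce the per-upcall GIL cost that the Pyrex test measures.

```python
import ctypes
import threading

# Assumption: POSIX; CDLL(None) exposes the pthread symbols already loaded
# into the interpreter process.
_libc = ctypes.CDLL(None)

# C signature of a pthread start routine: void *(*)(void *)
START_ROUTINE = ctypes.CFUNCTYPE(ctypes.c_void_p, ctypes.c_void_p)

# pthread_t is treated as a pointer-sized opaque value here (an assumption
# that holds on common Linux and macOS builds).
_libc.pthread_create.argtypes = [ctypes.POINTER(ctypes.c_void_p),
                                 ctypes.c_void_p,
                                 START_ROUTINE,
                                 ctypes.c_void_p]
_libc.pthread_create.restype = ctypes.c_int
_libc.pthread_join.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
_libc.pthread_join.restype = ctypes.c_int


def upcall_from_alien_thread(callback, limit):
    """Call callback() `limit` times on a raw pthread; return its thread ident."""
    idents = []

    def entry(_arg):
        # Runs on a thread that Python did not create.
        idents.append(threading.get_ident())
        for _ in range(limit):
            callback()
        return 0  # NULL result for the void * return value

    c_entry = START_ROUTINE(entry)  # keep a reference until the join returns
    tid = ctypes.c_void_p()
    rc = _libc.pthread_create(ctypes.byref(tid), None, c_entry, None)
    if rc != 0:
        raise OSError(rc, "pthread_create failed")
    # ctypes releases the GIL around foreign calls, so the new thread can run.
    _libc.pthread_join(tid, None)
    return idents[0]
```

The returned ident differs from that of the calling thread, which confirms that the callback really ran on the raw pthread.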

I probably could have gotten a bit more speed using a straight function rather than a bound method for the upcall target, but since I planned on doing away with the old “client data pointer”, a method on an object seemed like a reasonable choice. Anyway, we're really looking for a ballpark figure here, as a real implementation is unlikely to get anywhere near this performance.
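For anyone curious about the function-versus-bound-method question, a rough way to check it on your own interpreter is sketched below with the standard timeit module. The names Target, plain_callback, and calls_per_second are made up for this illustration, and the result will vary with the Python version, so treat it as a probe rather than a verdict.

```python
import timeit


class Target(object):
    def callback(self):
        return


def plain_callback():
    return


def calls_per_second(func, number=200000):
    """Time `number` calls of func() and return the observed call rate."""
    elapsed = timeit.timeit(func, number=number)
    return number / elapsed


if __name__ == "__main__":
    target = Target()
    print("plain function: %.0f calls/sec" % calls_per_second(plain_callback))
    print("bound method:   %.0f calls/sec" % calls_per_second(target.callback))
```

This measures calls made from Python itself, not upcalls from C, so it only hints at the relative cost of the two kinds of target.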

I ran the test program five times and averaged the results, which are shown in the following table. These rates are upcalls/sec:

Upcalling thread                      Pure Python upcall target   Python C extension class upcall target
Upcalling thread known to Python      1,827,283                   3,528,114
Upcalling thread unknown to Python    948,044                     1,329,958
Upcalls from main thread              2,018,659                   3,330,011

The test host contains an Athlon 64 X2 dual core 3800+ processor and 3 GB of RAM.

Pretty interesting numbers, to be sure. The good news is that a user of the LBM extension would have some options if they needed better performance; it's pretty clear that turning your callback objects into Pyrex extension types would give you a significant boost in performance (an interesting data point for Pyrex use in general, too). The bad news is the performance hit encountered when the upcalls are performed by a thread created outside of Python's knowledge (that is, not using threading but rather a raw pthread). I understand Java suffers from a similar effect with alien threads calling up to Java via the JNI.
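To put rough numbers on both effects, the ratios can be worked out directly from the measured rates. The short sketch below does that arithmetic; the dictionary layout and function names are only for illustration. By my reckoning the extension type comes out at roughly 1.9x, 1.4x, and 1.65x the pure Python rate for the three thread kinds, while the alien thread manages only about half of the Python-thread rate for the pure Python target and well under half for the extension type.

```python
# Measured upcalls/sec: (pure Python target, extension type target)
RATES = {
    "python thread": (1827283, 3528114),
    "alien thread": (948044, 1329958),
    "main thread": (2018659, 3330011),
}


def extension_speedup(thread_kind):
    """How many times faster the extension type target is than pure Python."""
    pure, ext = RATES[thread_kind]
    return ext / float(pure)


def alien_fraction(column):
    """Alien-thread rate as a fraction of the Python-thread rate.

    column 0 is the pure Python target, column 1 the extension type.
    """
    return RATES["alien thread"][column] / float(RATES["python thread"][column])


if __name__ == "__main__":
    for kind in ("python thread", "alien thread", "main thread"):
        print("%-14s extension speedup: %.2fx" % (kind, extension_speedup(kind)))
    print("alien vs python thread, pure Python target: %.2f" % alien_fraction(0))
    print("alien vs python thread, extension target:   %.2f" % alien_fraction(1))
```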

This of course raises an API extension request for 29West. It would be great to expose an optional interface for the user to plug in their own “thread factory”. The default factory would simply be a call to pthread_create() to start up a thread of control at some identified function. However, a user-supplied factory could use whatever means it desired to spawn a thread. In the case of Python, it would be a simple matter to create a new thread with “threading” and have it invoke a “nogil” function that would then execute the 29West thread entry point. This way Python won't have to do all the work it otherwise must face whenever an alien thread tries to acquire the GIL, and such code should run much faster.
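The Python side of such a factory might look something like the sketch below. Everything here is hypothetical: LBM exposes no such hook as far as I know, and python_thread_factory and fake_library_entry_point are invented names, with the latter standing in for the library's real thread entry point. In a real extension the entry point would be a “nogil” function rather than Python code.

```python
import threading


def python_thread_factory(entry_point, arg):
    """Hypothetical factory: start entry_point(arg) on a thread Python created."""
    thread = threading.Thread(target=entry_point, args=(arg,))
    thread.daemon = True
    thread.start()
    return thread


def fake_library_entry_point(state):
    # Stand-in for the library's thread entry point: it simply performs
    # the requested number of upcalls and then returns.
    for _ in range(state["limit"]):
        state["callback"]()


def run_demo(limit=1000):
    """Spawn the stand-in library thread via the factory and count upcalls."""
    calls = []
    state = {"limit": limit, "callback": lambda: calls.append(1)}
    worker = python_thread_factory(fake_library_entry_point, state)
    worker.join()
    return len(calls)
```

Because the thread comes from threading, Python already has a thread state for it, which is exactly what a raw pthread lacks.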

Now I have some idea of what to aspire to.

Saturday, May 30, 2009
