Wednesday, July 8, 2009

Incoming, Part 3-- back to Pythonland

I guess I should just stop apologizing for late posts-- I'm up against various deadlines, and we all know what that means, so I'll just get on with it and hope there's someone still reading.

So the last time around, I talked about building the C receiver that acts as the bridge from C to Python, and putting structures in place to quickly find the correct Python object to call. Now it's time to make it back to Python code, and more importantly user code, and it turns out there's lots to be considered when you hit this realm. I have to say that I was surprised how much the Python code, even trivial stuff, slowed things down, and it underscored for me how important it is, for maximum performance, that the Python receiver be implemented as a Pyrex object as well. But that was only one of the implementation challenges.

One of the design goals I stated early on was that I wanted to have few restrictions as to what sort of callable could be used by a client to handle message callbacks. This means that I couldn't call directly into the user's callback from the _rcvEventCallback() C function; there were a number of potential variations to be accounted for, and taking care of them all seemed to be best served in the various Receiver extension types. So I decided to first route callback control through a C method in the Receiver, and then make further decisions as to what to do there.

The entry point for callbacks into the Receiver object is _routeMsgCallback(), and it's defined like this in the AbstractReceiver class in the Pyrex pxd file:

cdef class AbstractReceiver(coption.OptionMgr):
    cdef object msgCallback
    cdef _routeMsgCallback(self, message.Message msg)
And to recap, here's how that method gets invoked from the C callback function:

cdef int _rcvEventCallback(lbmh.lbm_rcv_t *rcv, lbmh.lbm_msg_t *msg,
                           object clientd) with gil:
    cdef AbstractReceiver receiver
    cdef message.Message pyMessage
    receiver = <AbstractReceiver> clientd
    if receiver is not None:  # still a little paranoid
        pyMessage = Message()._setMsg(msg)
        receiver._routeMsgCallback(pyMessage)
    return lbmh.LBM_OK
Finally, here are three relevant methods from the implementation of Receiver, the basic derived class of AbstractReceiver:

cdef class Receiver(AbstractReceiver):
    def __init__(self, myFactory, msgCallback=None):
        if msgCallback is None:
            msgCallback = self.__class__.handleMsg
        if msgCallback is not None and not callable(msgCallback):
            raise LBMWException("msgCallback isn't callable; "
                                "no way to get messages out")
        super(Receiver, self).__init__(myFactory, msgCallback)

    cdef _routeMsgCallback(self, message.Message msg):
        self.msgCallback(self, msg)

    def handleMsg(self, message.Message msg):
        return
There are some subtle things going on here to support both user-specified callback functions and derived classes of Receiver. Recall from this post that when working with the receiver factory the user can specify either an eventCallback callable or a receiverClass callable that yields Receiver instances; either of these objects is called when messages arrive. The Receiver class therefore has to figure out which kind of callback is being used, and set up its msgCallback attribute in __init__() to be the function or unbound method to call whenever _routeMsgCallback() is invoked from the C callback function. For subclasses of Receiver, I use the unbound handleMsg method because I can then invoke it just as if I were invoking a callback function, and thus no additional logic is needed to determine what kind of object is being called. It probably also has the benefit of being slightly faster, since no bound method object has to be created for each callback.
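To make the unbound-method trick concrete, here's a minimal plain-Python sketch (not the wrapper's actual Pyrex) of the idea: by storing either the user's callback or the class's unbound handleMsg, _routeMsgCallback() can call both with the identical `callback(self, msg)` shape. The Counting class and userCallback function are hypothetical stand-ins.

```python
class Receiver:
    def __init__(self, msgCallback=None):
        if msgCallback is None:
            # Store the *unbound* method (a plain function in Python 3);
            # it is later called as msgCallback(self, msg), exactly like
            # a standalone user callback would be.
            msgCallback = self.__class__.handleMsg
        self.msgCallback = msgCallback

    def _routeMsgCallback(self, msg):
        # One call shape covers both cases: no isinstance() checks needed.
        self.msgCallback(self, msg)

    def handleMsg(self, msg):
        return None

class Counting(Receiver):           # hypothetical subclass
    def __init__(self):
        super().__init__()
        self.count = 0
    def handleMsg(self, msg):
        self.count += 1

seen = []
def userCallback(receiver, msg):    # standalone callback, same signature
    seen.append(msg)

r1 = Counting()
r1._routeMsgCallback("m1")          # dispatches to Counting.handleMsg
r2 = Receiver(msgCallback=userCallback)
r2._routeMsgCallback("m2")          # dispatches to the user's function
```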

So when _routeMsgCallback() calls self.msgCallback(self, msg), that's a call out to user code. There, the user needs to examine the provided Message object (another extension type) and figure out what they want to do based on the message's type. This was the first point where I encountered pure-Python's slowness; checking the type and extracting the data from the message took a surprising amount of time. When I had a derived class's handleMsg() call do no work and return immediately, things sped up appreciably. I needed to provide more assistance in order for the user's code to perform well.
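For context, here's a sketch of the kind of pure-Python type inspection a user's handleMsg() ends up doing on every message; the LBM_MSG_* values and the Msg class are illustrative stand-ins, not the real constants. Each message pays for these attribute lookups and comparisons at Python speed before any real work happens, which is the overhead that motivated the routing classes below.

```python
# Stand-in message-type constants and message object for illustration.
LBM_MSG_DATA = 0
LBM_MSG_EOS = 1

class Msg:
    def __init__(self, mtype, data=b""):
        self.type = mtype
        self.data = data

stats = {"data": 0, "eos": 0, "other": 0}

def handleMsg(msg):
    # Per-message Python-level branching on the type attribute.
    if msg.type == LBM_MSG_DATA:
        stats["data"] += 1
    elif msg.type == LBM_MSG_EOS:
        stats["eos"] += 1
    else:
        stats["other"] += 1

for m in (Msg(LBM_MSG_DATA), Msg(LBM_MSG_DATA), Msg(LBM_MSG_EOS)):
    handleMsg(m)
```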

One thing I could do for the user is determine what type of message they received and provide a way to route the callback more finely to a method. This resulted in a new abstraction, the AbstractRoutingReceiver. This class establishes a protocol for passing messages to specific methods matched to each type of message. This means that I have the opportunity in C to detect the message type and select the right method to invoke, which I can do much faster than in pure Python. Here's a bit of that code to give you an idea of what's going on:

cdef class AbstractRoutingReceiver(AbstractReceiver):
    def __init__(self, myFactory, msgCallback=None):
        if self.__class__ is AbstractRoutingReceiver:
            raise LBMWException("You can't directly make instances of "
                                "AbstractRoutingReceiver, only its subclasses")
        if msgCallback is not None:
            if not typecheck(msgCallback,
                             callbackMixins.RoutingReceiverMixin):
                raise LBMWException("The provided msgCallback isn't an "
                                    "instance of RoutingReceiverMixin")
        else:
            msgCallback = self
        super(AbstractRoutingReceiver, self).__init__(myFactory,
                                                      msgCallback)

    def dataMsg(self, msg):
        return

    def endOfSourceMsg(self, msg):
        return

    def requestMsg(self, msg):
        return
...and so on. The __init__() method here is very similar to the one in Receiver, except that if a msgCallback is provided, it can't be a plain function-- it must be an instance of the RoutingReceiverMixin class (the typecheck() function is a Pyrex addition that tests types in a manner similar to isinstance(), and is the preferred way to verify type in Pyrex code). The RoutingReceiverMixin class defines all of the per-message-type methods that a callback object for this class must support. The _routeMsgCallback() method (defined in derived classes of AbstractRoutingReceiver) then makes the decision as to which method to invoke.

However, finding the right method is another potential time-sink. Not only is method lookup in Python costly, it involves the creation of a new bound method object every time you look up a method. This is well-defined Python behavior, but not the best for high-performance apps. So I created two derived classes of AbstractRoutingReceiver to give users a choice between flexibility and speed: the first, DynamicRoutingReceiver, fully adheres to the dynamic nature of method lookups in Python, and thus operates more slowly:

cdef class DynamicRoutingReceiver(AbstractRoutingReceiver):
    cdef _routeMsgCallback(self, message.Message msg):
        getattr(self, _msgTypeToMethodMap[msg.type])(msg)
The variable _msgTypeToMethodMap is a dict that maps LBM message types to the corresponding name of the handling method; this name is looked up with getattr() on self, and the message is dispatched to the discovered method. This retains all of Python's method lookup semantics (that is, a method could dynamically change between invocations), at the expense of some performance.
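The dynamic routing scheme can be sketched in plain Python like so; the type values, method names, and the map's contents are illustrative stand-ins for the wrapper's real ones.

```python
# Stand-in message types and type-to-method-name map.
LBM_MSG_DATA, LBM_MSG_EOS = 0, 1
_msgTypeToMethodMap = {
    LBM_MSG_DATA: "dataMsg",
    LBM_MSG_EOS: "endOfSourceMsg",
}

class DynamicRouter:
    def __init__(self):
        self.calls = []
    def dataMsg(self, msg):
        self.calls.append(("data", msg))
    def endOfSourceMsg(self, msg):
        self.calls.append(("eos", msg))
    def _routeMsgCallback(self, mtype, msg):
        # getattr() re-resolves the method on every message, creating a
        # fresh bound-method object each time -- fully dynamic, not free.
        getattr(self, _msgTypeToMethodMap[mtype])(msg)

r = DynamicRouter()
r._routeMsgCallback(LBM_MSG_DATA, "payload")
r._routeMsgCallback(LBM_MSG_EOS, None)
```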

To provide better fine-grained method dispatching, I created the StaticRoutingReceiver class. This class trades away a bit of dynamism for better lookup and dispatch performance. It does this by computing a dispatch table during __init__(), capturing the bound method for each message handler in a list at init time. The order of the items in the list is arranged so that the most frequently received message types (most importantly, LBM_MSG_DATA) are at the front of the list. The definition of the class in the pxd file is:

cdef class StaticRoutingReceiver(AbstractRoutingReceiver):
    cdef list dispatchList
    cdef _routeMsgCallback(self, message.Message msg)
And the implementation in the pyx looks like this:

cdef class StaticRoutingReceiver(AbstractRoutingReceiver):

    def __init__(self, myFactory, msgCallback=None):
        super(StaticRoutingReceiver, self).__init__(myFactory,
                                                    msgCallback)
        self.dispatchList = []
        for i in range(len(_pyMsgTypeList)):
            self.dispatchList.append(getattr(self.msgCallback,
                                     _msgTypeToMethodMap[_pyMsgTypeList[i]]))

    cdef _routeMsgCallback(self, message.Message msg):
        cdef int i
        cdef int mtype
        mtype = msg.type
        for 0 <= i < 12:
            if mtype == _msgTypeList[i]:
                self.dispatchList[i](msg)
                return
The variables _pyMsgTypeList and _msgTypeList contain the prioritized list of LBM messages, one containing Python objects and the other C data. Having the C data type list avoids having to do any Python object creation when looking up values in the list. The handler methods for each message type live at the same index in self.dispatchList; so for the message type stored at _msgTypeList[i], the handling method for that type can be found at self.dispatchList[i] for any specific value of i.

So before people get out the pitchforks and torches to run me out of town, let me address the linear search in _routeMsgCallback(). First of all, the syntax of the for loop is a special Pyrex notation that is supposed to provide the best processing performance. But why the for at all? Why not a lookup structure such as a dict? Conventional wisdom holds that if your search list consists of around 10 items, it's pretty hard to come up with a lookup algorithm that performs better on average than linear search, and measurements with an earlier implementation using dicts bear this out. With a list of 12 items, the number of possible message types fits comfortably within the range of this rule of thumb. I've further improved the search's performance by placing the most commonly occurring message types at the front of _msgTypeList, so that in the majority of cases it only takes a compare or two to find the index of the proper handling method. This change provided a significant performance increase in the test code that used this as a Receiver base class.
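The whole static scheme-- bound methods captured once at init, parallel frequency-ordered lists, and the short linear scan-- can be sketched in plain Python as follows. The type values and method names are stand-ins, and the real implementation does this in Pyrex with C ints.

```python
# Stand-in message types; the most frequent type (DATA) goes first in the
# search list so most messages match on the first compare.
LBM_MSG_DATA, LBM_MSG_EOS, LBM_MSG_REQUEST = 0, 1, 2
_msgTypeList = [LBM_MSG_DATA, LBM_MSG_REQUEST, LBM_MSG_EOS]
_msgTypeToMethodMap = {LBM_MSG_DATA: "dataMsg",
                       LBM_MSG_REQUEST: "requestMsg",
                       LBM_MSG_EOS: "endOfSourceMsg"}

class StaticRouter:
    def __init__(self):
        self.calls = []
        # Capture each bound method exactly once; the invariant is that
        # dispatchList[i] handles the type stored at _msgTypeList[i].
        self.dispatchList = [getattr(self, _msgTypeToMethodMap[t])
                             for t in _msgTypeList]

    def dataMsg(self, msg):
        self.calls.append(("data", msg))
    def requestMsg(self, msg):
        self.calls.append(("req", msg))
    def endOfSourceMsg(self, msg):
        self.calls.append(("eos", msg))

    def _routeMsgCallback(self, mtype, msg):
        # Short linear scan over the frequency-ordered type list; no
        # bound-method creation happens here, just an index and a call.
        for i, t in enumerate(_msgTypeList):
            if mtype == t:
                self.dispatchList[i](msg)
                return

r = StaticRouter()
r._routeMsgCallback(LBM_MSG_DATA, "payload")
r._routeMsgCallback(LBM_MSG_EOS, None)
```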

Still, even though I bought significant improvement with this static method routing approach, Python processing in the callbacks was still holding overall performance back. So the final approach was to Pyrex-ize the callback class itself, turning it into an extension type. This has the benefit of using Python's native C API for the objects whose types it knows, while maintaining “plug compatibility” with the pure Python class it replaced.

Here's a snippet from a Pyrex Receiver class that's used to count messages, bytes, and various other statistics on the received messages:

cdef class StatsCatcherReceiver(RoutingReceiverMixin):
    cdef public int msgCount, byteCount, unrecCount, burstLoss
    cdef public int totalMsgCount, totalByteCount
    cdef public int totalUnrecCount, totalBurstLoss
    cdef object owner
    cdef object verbose

    def __init__(self, owner, verbose=False):
        self.msgCount = 0
        self.byteCount = 0
        self.unrecCount = 0
        self.burstLoss = 0
        self.totalMsgCount = 0
        self.totalByteCount = 0
        self.totalUnrecCount = 0
        self.totalBurstLoss = 0
        self.verbose = verbose
        self.owner = owner

    def dataMsg(self, msg):
        self.msgCount += 1
        self.byteCount += msg.len
        if self.verbose is True:
            gwInfo = self.getGWInfo(msg)
            if gwInfo:
                print "[%s][%s][%u], via gateway [%s][%u], %u bytes" \
                      % (msg.topicName, msg.source, msg.seqnum,
                         gwInfo.source, gwInfo.seqnum, msg.len)
            else:
                print "[%s][%s][%u], %u bytes" \
                      % (msg.topicName, msg.source, msg.seqnum, msg.len)
This class was created by porting a pure Python class to Pyrex, giving the Pyrex class the same name, methods, and functionality, thus allowing it to be plug-compatible within the program in which the original class was used. The message receipt rate increased by almost 100% when this extension type was used, putting the overall program's performance levels at about 50% of the equivalent C program's on the same test rig (around 440K msgs/sec). This is in large part because Pyrex turns all self.attr references into struct member accesses, providing the corresponding speed gains. This gets me to a reasonably happy place performance-wise, at least for now.

It's worthwhile to consider a few of the obvious opportunities that still exist for further increases in message receipt performance:

  • In StaticRoutingReceiver, I could do away with the linear search and look into a direct indexing scheme where the message type itself would index into the list where the proper handling method could be found. This would eliminate the loop setup and a few compares even in the best cases with the current algorithm. The LBM message types lend themselves well to this approach-- they are small consecutive integers that start at zero, and so would make fine array indices. I considered this approach, but didn't like the fact that I was leveraging intimate details regarding the nature of what was otherwise supposed to be an opaque symbol, although to be honest I suppose I still am now. 29West would most likely not change these values (in fact, they apparently go to great pains to remain backwards compatible), but it still rubs me the wrong way. Nonetheless, an experimental implementation to see performance differences is definitely in order.

  • Again for StaticRoutingReceiver, it would probably turn out to provide a bigger gain if I changed dispatchList from a Python list to a C array of Python objects. This would save considerable time by avoiding calls through the list's C API. Coupled with direct indexing, this could provide some significant improvement.

  • Finally, it might be useful to pass a different object as the clientd to LBM's lbm_rcv_create() function. Instead of handing over a Python object on which I need to do an attribute lookup to find the _routeMsgCallback() method with each received message, the better thing to do would probably be to pass the bound _routeMsgCallback() method itself. That is, instead of calling:

    result = lbmh.lbm_rcv_create(rcv, context, cTopic,
                                 <lbmh.lbm_rcv_cb_proc> _rcvEventCallback,
                                 pyReceiver, eq)
    It might be better to call it like:

    result = lbmh.lbm_rcv_create(rcv, context, cTopic,
                                 <lbmh.lbm_rcv_cb_proc> _rcvEventCallback,
                                 pyReceiver._routeMsgCallback, eq)
    And then directly call the object passed into _rcvEventCallback(). Depending on the inheritance structure backing the class of pyReceiver, this could save several dictionary lookups. However, it isn't entirely clear that this will be a win at all in the long run. The _routeMsgCallback() method is defined as a C function in Pyrex, and since we cast the returned object to an AbstractReceiver in _rcvEventCallback(), we may very well get to do a direct call to the underlying C function and bypass Python's attribute lookup in this case. Additionally, it isn't clear what kind of object you get with pyReceiver._routeMsgCallback; it could be a bound method, or it could be a pointer to a function. If a bound method, there's a good chance that it will actually be slower to call than in the current approach, especially since it would entail Python's method invocation protocol, which involves building an argument tuple and calling a Python C API function to invoke the method. If, on the other hand, pyReceiver._routeMsgCallback results in a function pointer, it may be no faster than dereferencing an AbstractReceiver pointer to find the _routeMsgCallback member that's a pointer to a function. The only way to know this, of course, is to look at the generated Pyrex code.
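The direct-indexing idea from the first bullet can be sketched in plain Python; the NUM_MSG_TYPES value and type constants are hypothetical, standing in for LBM's actual small consecutive integers.

```python
# Stand-in: three consecutive integer message types starting at zero.
NUM_MSG_TYPES = 3
LBM_MSG_DATA, LBM_MSG_EOS, LBM_MSG_REQUEST = 0, 1, 2

class IndexedRouter:
    def __init__(self):
        self.calls = []
        # Build a table where the message type's numeric value IS the index
        # of its handler; this is what makes the scheme search-free.
        table = [None] * NUM_MSG_TYPES
        table[LBM_MSG_DATA] = self.dataMsg
        table[LBM_MSG_EOS] = self.endOfSourceMsg
        table[LBM_MSG_REQUEST] = self.requestMsg
        self.dispatch = table

    def dataMsg(self, msg):
        self.calls.append(("data", msg))
    def endOfSourceMsg(self, msg):
        self.calls.append(("eos", msg))
    def requestMsg(self, msg):
        self.calls.append(("req", msg))

    def _routeMsgCallback(self, mtype, msg):
        # One index operation replaces the linear scan entirely, at the
        # cost of depending on the numeric values of an "opaque" symbol.
        self.dispatch[mtype](msg)

r = IndexedRouter()
r._routeMsgCallback(LBM_MSG_REQUEST, "ping")
```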
This investigation into improved performance also demonstrated a very worthwhile development process: create pure Python subclasses of StaticRoutingReceiver or RoutingReceiverMixin to hold the data that arrives in the callbacks during development, and when the structures that use the objects have settled down, re-implement these classes using Pyrex in order to get an easy speed boost.

It's not a bad application strategy for an LBM/Python app, either: put the message handling into extension types whose instances can respond rapidly to incoming messages, and use pure Python to organize the operation of these high-bandwidth objects.

So that's the whole of the message receipt stack in my LBM wrapper. As with such things, it's an ever-evolving beast, and even in the course of drafting these posts I've come up with new ideas to try to make the receipt of messages run faster. But that's enough for this part of the system for now; next time it'll be onto topics anew.