Writing Python modules in C with SWIG
The Python C API, while very well put together, is still C and therefore requires a lot of time an effort to understand, let alone use. But every now and then, a programmer comes across something that is either impossible in Python or so prohibitively expensive performance-wise that it is worth dusting off the old C textbook.
I recently began a seemingly simple project that involves counting the number of each character in a potentially large string. The program only needs to consider ascii characters, with a few significant exceptions- smart quotes and ticks in Word documents must be counted as ascii single and double quotes, respectively. Happily, Python’s re is UTF-aware and so can make these substitutions for us. However, iteration in Python, especially over a potentially large string, is expensive, and using regular expressions for this sort of thing is not to be thought of. Regular expressions are expensive to begin with, especially if we are performing 128 of them (one for each ascii character that could potentially appear in a string), each over the entire string. The best performance will be had by iterating over each character in the string and incrementing a dictionary/list/array/whatever containing counts of each character.
I tried this in pure Python. The performance was average. I used the The Project Gutenberg free online copy of War and Peace. On a quad-core G5 PowerMac, it took several seconds just to iterate through the string, let alone access a dictionary to store counts. Using a Python iterator did not increase performance significantly.
So I wrote a simple C module with a function that accepts a string pointer, iterates through each character, incrementing an array (using the int cast of the character as the character’s index in the array). It was as fast as can be expected from C, processing the entire file in under one tenth of a second. Here is the code (without the test function that iterated over the file itself- that will not be necessary from Python). Included is a helper function, count_of_char, which is only there to make access to the TALLY struct more simple from Python (tally.c):
#include #include #include #define CHARS 128 typedef struct { char *text; int counts[CHARS]; } TALLY; TALLY init(char *str); void count_chars(TALLY *t); int count_of_char(TALLY *t, char c); TALLY init(char *str) { int i; TALLY t; t.text = str; for (i = 0; i < CHARS; i++) { t.counts[i] = 0; } count_chars(&t); return t; } void count_chars(TALLY *t) { int i; for (i = 0; i < strlen(t->text); i++) { t->counts[(int) t->text[i]]++; } } int count_of_char(TALLY *t, char c) { return t->counts[(int) c]; }
Now, how to make this accessible to Python? It took me a matter of an hour or so to write the C library. Writing the Python API should not take longer than the library itself, so I decided to use SWIG to generate the necessary code. SWIG uses an interface file rather than parsing the actual C library. Here is the simple one I used for the library above (tally.i):
%module tally
%{
#define CHARS 128
typedef struct
{
char *text;
int counts[CHARS];
} TALLY;
extern TALLY init(char *str);
extern int count_of_char(TALLY *t, char c);
extern TALLY init_from_file(char *file_path);
%}
typedef struct
{
char *text;
int counts[CHARS];
} TALLY;
extern TALLY init(char *str);
extern int count_of_char(TALLY *t, char c);
extern TALLY init_from_file(char *file_path);
That was straight out of the example in the SWIG docs for Python. It doesn’t take much. Often, an interface file is not even required. I then ran swig -python tally.i
. This created a few files (tally_wrapper.c and tally.py). Now, since we are using Python and have access to distutils, we can write a simple setup.py file to do the work of compiling and linking for us:
import distutils from distutils.core import setup, Extension setup(name = "Tally", version = "1.0", ext_modules = [Extension("_tally", ["tally.i","tally.c"])])
Note the naming conventions here. The extension library must end up named _python setup.py install
to compile and install our module. From there, we can use a simple Python CGI (or FastCGI or mod_python app, depending on what your host supports) that reads in the string from a web form and uses our new library, tally, to count the characters.
from operator import itemgetter import tally def tally_string(str, min=0, sort=False): t = tally.init(str) chars = {} for i in xrange(0, 127): x = tally.count_of_char(t, i) if x >= min: chars[i] = x if sort: chars = sorted(chars.items(), key=itemgetter(1), reverse=True) return chars
Note the use of itemgetter to sort the dictionary we created by value. The reason that I decided to create the Python storage from Python rather than C was that I was trying to avoid using the Python API in my library at all to maintain its usefulness in other applications. Now I don’t have to change the source code at all to use this library in another app. Additionally, I had never used SWIG before and did not want to overcomplicate matters by messing with lower-level stuff, especially since the big performance problem (the string iteration) had been solved, and I now had cycles to spare.
To see the code in action, here is a working version. This is a work in progress, obviously, but it demonstrates the speed gain.