Faster string iteration using unpack

In a previous entry, we used a simple C program and SWIG to extend Python. This was because iteration over Python’s strings, being more complex constructs than C’s character arrays, was not speedy enough for large strings. newLisp, being a high level, interpreted language like Python, also suffers from the same problem. However, newLisp’s unpack function provides another option.

Unpack unpacks a binary structure into a variable using a string format to define the pattern of data in the binary structure. There is a predefined syntax for these patterns. This is typically used to extract values from a C struct or union (unpack’s sister function, pack, does the reverse, and uses the same pattern syntax). It can also be used to break apart internal data structures such as the string type as newLisp’s creator, Lutz Mueller, was kind enough to point out to me. Looking at the documentation, we see that “b” is the pattern for an 8-bit, unsigned integer (C represents an ascii character using its ascii code), which is what we wish to extract from the string. To use this pattern to explode a string referred to by ‘text:

(unpack (dup "b" (length text)) text)

Note that dup is used to repeat the pattern as many times as is necessary to get all characters of out ‘text, using (length text) to determine the number of characters in ‘text.

This gives us a list of ascii codes. We can then use the function count to count the instances of each ascii code in the list.

(count (sequence 0 127) (unpack (dup "b" (length text)) text))

This works, but it’s quite slow. Count is in the process of being rewritten to make this type of operation more efficient. Until then, however, we can rely on the same technique employed in the C extension we created using SWIG. Create an array with a length of 128 and initial values of 0 for each index. Iterate over the character list, using the character’s value as the index in the counts array, and increment the value at that index.

To set the array index, we use nth-set. One of the features of nth-set is that it sets the global variable $0 to the current value of the index; this can be used to express the new value. We can use that fact in our algorithm:

(set 'char-list (unpack (dup "b" (length text)) text))
(set 'counts (array 128 (dup 0 128)))
(dolist (c char-list)
  (nth-set (counts c) (+ $0 1)))

Now we have an array where the index corresponds to an ascii value and the value of the index is the count of characters in ‘text. To turn this into something a little more useful, we can make it into a table of ‘((character count) …):

(set 'lst '())
(dotimes (i 128)
  (push (list (char i) (counts i)) list -1))

We use an iterator because mapping would not provide us with the index value of the current application, which we need to determine the character for that index. Pushing a value onto the end of a list in newLisp is optimized like cons. We now have the list we wanted, and the code performs reasonably well for such a high-level implementation – O(n).

Leave a comment | Trackback
Aug 23rd, 2007 | Posted in Programming
Tags:
No comments yet.