Using newLISP’s find-all
newLISP’s find-all
utility is exceptionally powerful, especially when coupled with rich matching functions like match
, regex
, and unify
. find-all
combines search and substitution into a fast, comprehensive function for extracting data from lists and strings.
Basic syntax
Like many newLISP sequence functions, find-all
is defined for both strings and lists. The basic syntax is:
Strings: (find-all pattern target [expression [option]]) Lists: (find-all pattern target [expression [comparison-function]])
String searches and regular expressions
The most basic string search finds occurrences of one string inside of another.
(set 'str "Now is the time for all good men to come to the aid of their country.") (find-all "the" str) ; ("the" "the" "the")
There are three occurrences (including one within the word, ‘their’). To find only occurrences of ‘the’ as a complete word, a regular expression is used:
(find-all {\bthe\b} str $0 0) ; ("the" "the")
The first parameter is a regular expression, defined in curly braces to eliminate the need to double-escape entities (like b). The final parameter, 0
, is the PCRE option (a full list of options is available in the newLISP documents for regex.
The third parameter is key to one of the more powerful features of find-all
– substitution. The expression
parameter is applied to each element found. Inside of this expression, the global variable $0
represents the entire matched element. In a regular expression search, captured matches are available via subsequently enumerated variables: $1
, $2
, etc.
For example, to convert all occurrences of ‘the’ to upper case in the resulting list:
(find-all {\bthe\b} str (upper-case $0) 0) ;("THE" "THE")
Here is a short program to count the number of occurrences of some common words in the text of War and Peace:
(set 'text (read-file "war_and_peace.txt")) (set 'words (find-all {\b\S+\b} text $0 1)) ; split into words (println (format "%12s: %6d" "Total words" (length words))) (letn ((common-words '("and" "or" "the")) (counts (map 'list common-words (count common-words words)))) (dolist (word counts) (println (format "%12s: %6d" (word 0) (word 1)))))
With the results:
Total words: 564345
and: 21023
or: 1542
the: 31702
This isn’t necessarily efficient; finding all words and then counting occurrences of three specific words in a list of more than half a million elements results in a lot of wasted time. The regular expression used also includes opening and closing punctuation in the word (i.e. a quote that begins “The…” would result in a missed occurrence of ‘the’, because the list would contain “The instead. It also does not account for case when counting occurrences.
Here is a more efficient version:
(set 'common-words '("and" "or" "the")) (set 're (format {\b(%s)\b} (join common-words "|"))) (set 'occurrences (find-all re text (lower-case $0) 1)) (println (map 'list common-words (count common-words occurrences))) ; (("and" 22267) ("or" 1582) ("the" 34619))
Note that find-all
is recursive and newLISP has a limited stack size, which can be set during execution with the -s switch:
newlisp -s 500000 common_counts.lsp
The previous example would hit the default stack limit of 2,048. This can be solved by using map instead of find-all
‘s substitution expression:
(set 'common-words '("and" "or" "the")) (set 're (format {\b(%s)\b} (join common-words "|"))) (set 'occurrences (map 'lower-case (find-all re text $0 1))) (println (map 'list common-words (count common-words occurrences))) ; (("and" 22267) ("or" 1582) ("the" 34619))
The substitution expression is also useful to cause side effects, squeezing even more efficiency out of the algorithm. Here, find-all
increments a count of each word and stores it in a dictionary.
(define counts:counts) (set 'common-words '("and" "or" "the")) (set 're (format {\b(%s)\b} (join common-words "|"))) (find-all re text (counts (lower-case $0) (+ 1 (or (counts (lower-case $0)) 0))) 1) (dolist (w common-words) (println (format "%4s: %6d" w (counts w)))) ; and: 22267 ; or: 1582 ; the: 34619
This algorithm is quite fast, although still somewhat weighty in RAM, since it stores the entire text before doing its work.
List searches
find-all
list searches are (arguably) nearly as powerful as regular expressions. By default, find-all
compares elements of target with pattern using match
, and the default expression is the entire matched element ($0
):
(set 'target '((a 1) (b 2) (c 3) (d 4 5))) (find-all '(? ?) target) ; ((a 1) (b 2) (c 3)) (find-all '(d *) target) ; ((d 4 5))
Using the substitution expression, lists may be unified or matched further:
(find-all '(? ?) target (unify '(Letter Number) $0)) ; (((Letter a) (Number 1)) ((Letter b) (Number 2)) ((Letter c) (Number 3)))
Simple list searches using other comparators than match
are straight-forward:
(find-all 4 '(1 2 3 4 5) $0 >) ; (1 2 3) (find-all 4 '(1 2 3 4 5) (+ 1 $0) >) ; (2 3 4)
find-all
is convenient for searching XML. newLISP parses XML into a tree. For example:
(xml-type-tags nil nil nil nil) (set 'xml (xml-parse (get-url "http://www.weather.gov/xml/current_obs/index.xml") (+ 1 2 4 16)))
For more info on parsing XML in newLISP, see this article and the newLISP documentation.
The XML at the url above lists weather stations. Here is an example station entry, under the root element, “wx_station_index”:
> > TAPA>> AG>> Vc Bird Intl Airport Antigua>> 17.117>> -61.783>> http://weather.noaa.gov/weather/current/TAPA.html>> http://weather.gov/xml/current_obs/TAPA.rss>> http://weather.gov/xml/current_obs/TAPA.xml> >
assoc
is useful for finding a path in a parsed XML list:
(set 'stations-xml (assoc (xml "wx_station_index")))
However, assoc
only returns the first element of the list that matches. In the XML article linked above, pop-assoc
was used to collect elements iteratively:
(set 'stations '()) (while (assoc (xml "wx_station_index" "station"))
That is a handy way of collecting all elements, especially in an irregular document. With find-all
, elements may be found just as easily and without modifying the original list:
(find-all '("station" *) (assoc (xml "wx_station_index")))
From there, it is just as simple to aggregate the values from each element:
(set 'station-pattern '("station" ("station_id" ?) ("state" ?) ("station_name" ?) ("latitude" ?) ("longitude" ?) ("html_url" ?) ("rss_url" ?) ("xml_url" ?))) (find-all '("station" *) (assoc (xml "wx_station_index")) (match station-pattern $0))
This returns a list of the node values for each station. The data can be applied to a different associative structure easily using unify
:
(find-all '("station" *) (assoc (xml "wx_station_index")) (unify '(Id State Name Lat Lon Html Rss Xml) (match station-pattern $0)))
Each element in this list this creates looks like:
((Id "TAPA") (State "AG") (Name "Vc Bird Intl Airport Antigua") (Lat "17.117") (Lon "-61.783") (Html "http://weather.noaa.gov/weather/current/TAPA.html") (Rss "http://weather.gov/xml/current_obs/TAPA.rss") (Xml "http://weather.gov/xml/current_obs/TAPA.xml"))
A completely impractical example
There are further implications of find-all
‘s ability to cause side effects in the substitution expression. To download the XML for each individual station, something like this could be done (don’t really do this – it will attempt to spawn more than 2,000 processes):
(set 'processes (find-all '("station" *) (assoc (xml "wx_station_index")) (spawn (sym (lookup "station_name" $0) 'WEATHER) (get-url (lookup "xml_url" $0))))) (sync 30000) ; wait 30 seconds for all downloads to finish
Assuming that the application is able to download the XML file for more than 2,000 stations in 30 seconds (and that there is no hard limit to forked processes, as there is in OSX), the context WEATHER
will contain the XML for all stations.
The infeasability of this example aside, it demonstrates, as an example, how easily a user-supplied XML file could be used to script a newLISP application.
Since version 9.4.7 the stack dependence of ‘find-all’ has been eliminated. Large lists or strings can be searched independent of the configured stack.