Tiny Clojure snippet
December 2, 2009
At the Cambridge Clojure meetup last night I decided to have a go at something work-related that I don’t usually get time for. The OSCAR3 chemistry entity extraction and text-mining tool uses n-grams to create feature vectors for classifying words. It turned out to be pretty short work with loop...recur, but then Nick Day pointed out that I’d just reimplemented partition, which turned the function into:
(defn ngrams [word n]
  (let [pad (repeat (dec n) \#)]
    (partition n 1 (concat pad word pad))))
(ngrams "foobar" 3)
-> ((\# \# \f) (\# \f \o) (\f \o \o) (\o \o \b) (\o \b \a) (\b \a \r) (\a \r \#) (\r \# \#))
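For comparison, here is roughly what a loop...recur version might look like. This is a sketch, not the code written at the meetup; the name ngrams-loop is made up for illustration, and it assumes n is at least 1.

```clojure
;; Hypothetical reconstruction: a hand-rolled sliding window that does
;; what (partition n 1 ...) does for us. Assumes n >= 1.
(defn ngrams-loop [word n]
  (let [pad (repeat (dec n) \#)]
    (loop [chars (concat pad word pad)
           acc   []]
      ;; Stop once fewer than n characters remain, so no short
      ;; trailing windows are emitted (same as partition).
      (if (< (count chars) n)
        (seq acc)
        ;; Take the next window of n characters, then slide along by one.
        (recur (rest chars) (conj acc (take n chars)))))))

(ngrams-loop "foobar" 3)
```

It should give the same trigrams as the partition version above, which is really the point: partition with a step of 1 already is the sliding window.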
December 2, 2009 at 11:12 am
You are not quite there yet – the start and end padding characters should be different.
December 2, 2009 at 12:05 pm
(defn ngrams [word n]
  (let [spad (repeat (dec n) \^)
        epad (repeat (dec n) \$)]
    (partition n 1 (concat spad word epad))))
Why does it matter if they’re chosen such that they’re never going to be in the word?