Tiny Clojure snippet

December 2, 2009

At the Cambridge Clojure meetup last night I decided to have a go at something work-related that I don’t usually get time for. The OSCAR3 chemistry entity extraction and text-mining tool uses n-grams to build the feature vectors it uses to classify words. Generating them turned out to be pretty short work with loop...recur, but then Nick Day pointed out that I’d just reimplemented partition, which turned the function into:

(defn ngrams [word n]
  (let [pad (repeat (dec n) \#)]
    (partition n 1 (concat pad word pad))))
(ngrams "foobar" 3)
-> ((\# \# \f) (\# \f \o) (\f \o \o) (\o \o \b) (\o \b \a) (\b \a \r) (\a \r \#) (\r \# \#))
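
The loop...recur version isn't shown above, so here is a rough sketch of what such a version might look like, for comparison with the partition one-liner. It is a reconstruction under assumptions, not the code written at the meetup: the name ngrams-loop and the use of a vector with subvec are illustrative choices.

```clojure
;; The partition-based version from the post, repeated so this block runs on its own.
(defn ngrams [word n]
  (let [pad (repeat (dec n) \#)]
    (partition n 1 (concat pad word pad))))

;; Hypothetical loop...recur equivalent (assumed shape, not the original code).
(defn ngrams-loop [word n]
  (let [pad    (repeat (dec n) \#)
        padded (vec (concat pad word pad))]   ; vector, so windows can be sliced by index
    (loop [i 0, acc []]
      (if (> (+ i n) (count padded))          ; no full window of n characters left
        acc
        (recur (inc i)
               (conj acc (subvec padded i (+ i n))))))))

;; Vectors and seqs with the same elements compare equal in Clojure.
(= (ngrams "foobar" 3) (ngrams-loop "foobar" 3))
```

The explicit index and accumulator are exactly the bookkeeping that (partition n 1 ...) does for you: windows of size n, advancing one step at a time.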

2 Responses to “Tiny Clojure snippet”

  1. jat45 Says:

    You are not quite there yet – the start and end padding characters should be different.

  2. jimdowning Says:

    (defn ngrams [word n]
      (let [spad (repeat (dec n) \^)
            epad (repeat (dec n) \$)]
        (partition n 1 (concat spad word epad))))

    Why does it matter if they’re chosen such that they’re never going to be in the word?
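
For reference, a small sketch of what the two-marker variant in this comment returns for the same input as the post. The name ngrams-marked is used here only to keep it apart from the original ngrams; it does not appear in the thread.

```clojure
;; Same function as in the comment above, under an illustrative name.
(defn ngrams-marked [word n]
  (let [spad (repeat (dec n) \^)   ; start-of-word padding
        epad (repeat (dec n) \$)]  ; end-of-word padding
    (partition n 1 (concat spad word epad))))

(ngrams-marked "foobar" 3)
;; -> ((\^ \^ \f) (\^ \f \o) (\f \o \o) (\o \o \b) (\o \b \a) (\b \a \r) (\a \r \$) (\r \$ \$))
```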
