I’ve just started experimenting with applying Mahout to analyze text in a Solr index. Mahout is a set of machine-learning tools built on Apache Hadoop, consisting of algorithms and utilities for clustering and classifying text and data. Recent versions of Solr include the Carrot2 clustering engine, which is very cool, but I specifically wanted to get acquainted with Hadoop and MapReduce.
There is already lots of helpful information out there on using Mahout with Solr. Grant Ingersoll’s post, Integrating Apache Mahout with Apache Lucene and Solr – Part I (of 3), got me started, but like many of the commenters, I was pining for the missing sequels. Next came Mayur Choubey’s helpful, straightforward outline Cluster Apache Solr data using Apache Mahout. Finally, the Mahout wiki page Quick tour of text analysis using the Mahout command line filled in the remaining blanks.
Following are the steps and references I used to generate clusters from a BibApp Solr index.
Installation of Mahout and Hadoop was straightforward, although had I read through all the instructions before installing Mahout, I’d have known to skip running the tests and saved myself a good chunk of time.
Step 1: Add termVectors to BibApp’s Solr text field in schema.xml.
<!--====Special Fields====-->
<!--'text' is used as default search field (see below)-->
<field name="text" type="text" indexed="true" stored="false" multiValued="true" termVectors="true"/>
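For context, a term vector is just the per-document mapping of terms to frequencies that Lucene stores when termVectors="true" is set, and it's what Mahout will read out of the index in Step 2. A rough Python sketch of the idea (the naive whitespace tokenizer here is a stand-in, not Solr's analyzer chain):

```python
from collections import Counter

def term_vector(text):
    """Build a toy term vector: term -> frequency for one document.
    Real Lucene term vectors come out of Solr's analyzer chain
    (tokenizing, lowercasing, stemming), not this naive split."""
    tokens = text.lower().split()
    return Counter(tokens)

doc = "Cancer risk increases with patient age; patient outcomes vary."
tv = term_vector(doc)
print(tv["patient"])  # 2
```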
Then, reindex Solr. I’ve slightly modified my BibApp to take better advantage of Solr multicore for reindexing, but it’s still a “standard” Solr index:
$ cd ~/development/BibApp; bundle exec rake solr:refresh_swap_index RAILS_ENV=development
Step 2: Run Mahout against the Solr index to generate a vector file and dictionary file.
$ bin/mahout lucene.vector --dir /Users/jstirnaman/development/BibApp/vendor/bibapp-solr/cores-data/development/core2/data/index/ --output my-data/bibapp-vectors --field text --idField id --dictOut my-data/bibapp-dictionary --norm 2
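The --norm 2 option asks Mahout to L2-normalize each document vector before writing it out, so that documents of different lengths are comparable. A minimal sketch of what that normalization means (plain Python, purely for illustration):

```python
import math

def l2_normalize(weights):
    """Scale a vector so its Euclidean (L2) length is 1,
    which is what lucene.vector's --norm 2 option requests."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```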
Step 3: Run Mahout’s kmeans clustering algorithm against the vector file. Incidentally, this step took the longest to figure out, since I didn’t know anything about providing cluster centroids (the -c parameter). As it turns out, if you supply both the -k and -c parameters, kmeans will put its own random seed vectors into the -c directory. The “Quick Tour of Text Analysis…” Mahout wiki page clued me in. Phew!
$ bin/mahout kmeans -i my-data/bibapp-vectors -c my-data/bibapp-kmeans-centroids -cl -o my-data/bibapp-kmeans-clusters -k 20 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
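To demystify what kmeans does with those random seeds and CosineDistanceMeasure, here's a toy k-means loop in plain Python on made-up data. This is only a sketch of the algorithm's shape; the real Mahout job runs as MapReduce over sequence files:

```python
import math
import random

def cosine_distance(a, b):
    """1 minus cosine similarity, like Mahout's CosineDistanceMeasure."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def kmeans(vectors, k, max_iter=10, seed=42):
    """Toy k-means: pick k random seed centroids (what -k triggers),
    then alternate assign/update for up to max_iter passes (like -x)."""
    random.seed(seed)
    centroids = random.sample(vectors, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: cosine_distance(v, centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster went empty.
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious directions in 2-D: k-means should split them apart.
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids, clusters = kmeans(vecs, k=2)
print([len(c) for c in clusters])  # two clusters of two
```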
Here’s the tail end of the output. 16461 records sounds about right:
12/07/29 07:37:48 INFO mapred.JobClient: Job complete: job_local_0003
12/07/29 07:37:48 INFO mapred.JobClient: Counters: 9
12/07/29 07:37:48 INFO mapred.JobClient:   File Output Format Counters
12/07/29 07:37:48 INFO mapred.JobClient:     Bytes Written=27229499
12/07/29 07:37:48 INFO mapred.JobClient:   File Input Format Counters
12/07/29 07:37:48 INFO mapred.JobClient:     Bytes Read=25709664
12/07/29 07:37:48 INFO mapred.JobClient:   FileSystemCounters
12/07/29 07:37:48 INFO mapred.JobClient:     FILE_BYTES_READ=228198077
12/07/29 07:37:48 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=142916329
12/07/29 07:37:48 INFO mapred.JobClient:   Map-Reduce Framework
12/07/29 07:37:48 INFO mapred.JobClient:     Map input records=16461
12/07/29 07:37:48 INFO mapred.JobClient:     Spilled Records=0
12/07/29 07:37:48 INFO mapred.JobClient:     Total committed heap usage (bytes)=241053696
12/07/29 07:37:48 INFO mapred.JobClient:     SPLIT_RAW_BYTES=138
12/07/29 07:37:48 INFO mapred.JobClient:     Map output records=16461
12/07/29 07:37:48 INFO driver.MahoutDriver: Program took 68717 ms (Minutes: 1.1452833333333334)
Step 4: Use Clusterdump to analyze the clusters. Mind the -dt (dictionary type) parameter: set it to "text" in our case, otherwise the command will fail with an error saying that the dictionary file is not a sequence file.
$ bin/mahout clusterdump -d my-data/bibapp-dictionary -dt text -i my-data/bibapp-kmeans-clusters/clusters-2-final/part-r-00000 -o my-data/bibapp-kmeans-clusterdump -n 20 -b 100 -p my-data/bibapp-kmeans-clusters/clusteredPoints -e
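Conceptually, what clusterdump does for the "Top Terms" listing is map each centroid's dimension indices back through the dictionary file and sort by weight. Here's a hypothetical sketch of that lookup with toy data (the real tool reads Hadoop sequence files and does considerably more):

```python
def top_terms(centroid, dictionary, n=5):
    """Map a centroid's (index -> weight) entries back to terms via the
    dictionary and return the n heaviest, roughly what clusterdump -n prints."""
    ranked = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    return [(dictionary[i], w) for i, w in ranked[:n]]

# Hypothetical toy data: dictionary maps dimension index -> term,
# centroid maps dimension index -> weight.
dictionary = {0: "patient", 1: "cancer", 2: "age", 3: "studi"}
centroid = {0: 0.026, 3: 0.018, 1: 0.022, 2: 0.021}
print(top_terms(centroid, dictionary, n=2))
# [('patient', 0.026), ('cancer', 0.022)]
```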
And here are the Top Terms from my Clusterdump output:
{n=2370 c=[0:0.023, 00:0.001, 000:0.001, 0000:0.000, 000001:0.000, 000005:0.000, 000008:0.00
Top Terms:
    a          => 0.027360996873880702
    patient    => 0.025805192877985494
    0          => 0.023186265779140857
    cancer     => 0.021687143137689265
    age        => 0.021376625634302378
    p          => 0.020841041796736622
    were       => 0.020806778810206115
    diseas     => 0.020105232473833737
    n          => 0.019341989276154777
    r          => 0.01866668922521172
    from       => 0.018322945487318585
    measur     => 0.018318213666784038
    medicin    => 0.01811168751056253
    2          => 0.017878920778544888
    use        => 0.017718353777290936
    studi      => 0.017570063261710112
    signific   => 0.017489906172346706
    health     => 0.01743561515914853
    increas    => 0.017073525048242385
    clinic     => 0.01705333529210705
I need to tweak my clustering for sure, but it's a start. I had forgotten that my text field in Solr contains words truncated by stemming (hence terms like "diseas" and "studi" above). I'll consider adding a new, unstemmed field to generate clusters from.