I’ve just started experimenting with applying Mahout to analyze text in a Solr index. Mahout is a set of machine-learning tools built on Apache Hadoop, consisting of algorithms and utilities for clustering and classifying text and data. Recent versions of Solr include the Carrot2 clustering engine, which is very cool, but I specifically wanted to get acquainted with Hadoop and MapReduce.
There is already lots of helpful information out there on using Mahout with Solr. Grant Ingersoll’s post, Integrating Apache Mahout with Apache Lucene and Solr – Part I (of 3), got me started, but like many of the commenters, I was pining for the missing sequels. Next came Mayur Choubey’s helpful, straightforward outline Cluster Apache Solr data using Apache Mahout. Finally, the Mahout wiki page Quick tour of text analysis using the Mahout command line filled in the remaining blanks.
Following are the steps and references I used to generate clusters from a BibApp Solr index.
Installation of Mahout and Hadoop was straightforward, although had I read through all the instructions before installing Mahout, I’d have known to skip running the tests and saved myself a good chunk of time.
Step 1: Add termVectors to BibApp’s Solr text field in schema.xml.
<!--====Special Fields====-->
<!--'text' is used as default search field (see below)-->
<field name="text" type="text" indexed="true" stored="false" multiValued="true" termVectors="true"/>
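For context, a term vector is just the per-document mapping of terms to frequencies that Lucene stores when termVectors="true" is set, and it's what Mahout will read out of the index in Step 2. A rough Python sketch of the idea (the naive whitespace tokenizer here is a stand-in, not Solr's analyzer chain):

```python
from collections import Counter

def term_vector(text):
    """Build a toy term vector: term -> frequency for one document.
    Real Lucene term vectors come out of Solr's analyzer chain
    (tokenizing, lowercasing, stemming), not this naive split."""
    tokens = text.lower().split()
    return Counter(tokens)

doc = "Cancer risk increases with patient age; patient outcomes vary."
tv = term_vector(doc)
print(tv["patient"])  # 2
```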
Then, reindex Solr. I’ve slightly modified my BibApp to take better advantage of Solr multicore for reindexing, but it’s still a “standard” Solr index:
$ cd ~/development/BibApp; bundle exec rake solr:refresh_swap_index RAILS_ENV=development
Step 2: Run Mahout against the Solr index to generate a vector file and dictionary file.
$ bin/mahout lucene.vector --dir /Users/jstirnaman/development/BibApp/vendor/bibapp-solr/cores-data/development/core2/data/index/ --output my-data/bibapp-vectors --field text --idField id --dictOut my-data/bibapp-dictionary --norm 2
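The --norm 2 option asks Mahout to L2-normalize each document vector before writing it out, so that documents of different lengths are comparable. A minimal sketch of what that normalization means (plain Python, purely for illustration):

```python
import math

def l2_normalize(weights):
    """Scale a vector so its Euclidean (L2) length is 1,
    which is what lucene.vector's --norm 2 option requests."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```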
Step 3: Run Mahout’s kmeans clustering algorithm against the vector file. Incidentally, this step took the longest to figure out, since I didn’t know anything about providing cluster centroids (the -c parameter). As it turns out, if you supply both the -k and -c parameters, kmeans will put its own random seed vectors into the -c directory. The “Quick Tour of Text Analysis…” Mahout wiki page clued me in. Phew!
$ bin/mahout kmeans -i my-data/bibapp-vectors -c my-data/bibapp-kmeans-centroids -cl -o my-data/bibapp-kmeans-clusters -k 20 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
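To demystify what kmeans does with those random seeds and CosineDistanceMeasure, here's a toy k-means loop in plain Python on made-up data. This is only a sketch of the algorithm's shape; the real Mahout job runs as MapReduce over sequence files:

```python
import math
import random

def cosine_distance(a, b):
    """1 minus cosine similarity, like Mahout's CosineDistanceMeasure."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def kmeans(vectors, k, max_iter=10, seed=42):
    """Toy k-means: pick k random seed centroids (what -k triggers),
    then alternate assign/update for up to max_iter passes (like -x)."""
    random.seed(seed)
    centroids = random.sample(vectors, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: cosine_distance(v, centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster went empty.
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious directions in 2-D: k-means should split them apart.
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids, clusters = kmeans(vecs, k=2)
print([len(c) for c in clusters])  # two clusters of two
```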
Here’s the tail end of the output. 16461 records sounds about right:
12/07/29 07:37:48 INFO mapred.JobClient: Job complete: job_local_0003
12/07/29 07:37:48 INFO mapred.JobClient: Counters: 9
12/07/29 07:37:48 INFO mapred.JobClient:   File Output Format Counters
12/07/29 07:37:48 INFO mapred.JobClient:     Bytes Written=27229499
12/07/29 07:37:48 INFO mapred.JobClient:   File Input Format Counters
12/07/29 07:37:48 INFO mapred.JobClient:     Bytes Read=25709664
12/07/29 07:37:48 INFO mapred.JobClient:   FileSystemCounters
12/07/29 07:37:48 INFO mapred.JobClient:     FILE_BYTES_READ=228198077
12/07/29 07:37:48 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=142916329
12/07/29 07:37:48 INFO mapred.JobClient:   Map-Reduce Framework
12/07/29 07:37:48 INFO mapred.JobClient:     Map input records=16461
12/07/29 07:37:48 INFO mapred.JobClient:     Spilled Records=0
12/07/29 07:37:48 INFO mapred.JobClient:     Total committed heap usage (bytes)=241053696
12/07/29 07:37:48 INFO mapred.JobClient:     SPLIT_RAW_BYTES=138
12/07/29 07:37:48 INFO mapred.JobClient:     Map output records=16461
12/07/29 07:37:48 INFO driver.MahoutDriver: Program took 68717 ms (Minutes: 1.1452833333333334)
Step 4: Use Clusterdump to analyze the clusters. Mind the -dt (dictionary type) parameter: set it to "text" in our case, otherwise the command will fail with an error saying that the dictionary file is not a sequence file.
$ bin/mahout clusterdump -d my-data/bibapp-dictionary -dt text -i my-data/bibapp-kmeans-clusters/clusters-2-final/part-r-00000 -o my-data/bibapp-kmeans-clusterdump -n 20 -b 100 -p my-data/bibapp-kmeans-clusters/clusteredPoints -e
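Conceptually, what clusterdump does for the "Top Terms" listing is map each centroid's dimension indices back through the dictionary file and sort by weight. Here's a hypothetical sketch of that lookup with toy data (the real tool reads Hadoop sequence files and does considerably more):

```python
def top_terms(centroid, dictionary, n=5):
    """Map a centroid's (index -> weight) entries back to terms via the
    dictionary and return the n heaviest, roughly what clusterdump -n prints."""
    ranked = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    return [(dictionary[i], w) for i, w in ranked[:n]]

# Hypothetical toy data: dictionary maps dimension index -> term,
# centroid maps dimension index -> weight.
dictionary = {0: "patient", 1: "cancer", 2: "age", 3: "studi"}
centroid = {0: 0.026, 3: 0.018, 1: 0.022, 2: 0.021}
print(top_terms(centroid, dictionary, n=2))
# [('patient', 0.026), ('cancer', 0.022)]
```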
And here are the Top Terms from my Clusterdump output:
{n=2370 c=[0:0.023, 00:0.001, 000:0.001, 0000:0.000, 000001:0.000, 000005:0.000, 000008:0.00
Top Terms:
    a          => 0.027360996873880702
    patient    => 0.025805192877985494
    0          => 0.023186265779140857
    cancer     => 0.021687143137689265
    age        => 0.021376625634302378
    p          => 0.020841041796736622
    were       => 0.020806778810206115
    diseas     => 0.020105232473833737
    n          => 0.019341989276154777
    r          => 0.01866668922521172
    from       => 0.018322945487318585
    measur     => 0.018318213666784038
    medicin    => 0.01811168751056253
    2          => 0.017878920778544888
    use        => 0.017718353777290936
    studi      => 0.017570063261710112
    signific   => 0.017489906172346706
    health     => 0.01743561515914853
    increas    => 0.017073525048242385
    clinic     => 0.01705333529210705
I need to tweak my clustering for sure, but it's a start. I had forgotten that my text field in Solr contains words truncated by stemming (hence terms like "diseas" and "studi" above). I'll consider adding a new, unstemmed field to generate clusters from.