I've finished reading the six chapters of Taming Text that are available through Manning's Early Access Program. See one of my previous posts to find out more about the first three chapters. These are the other chapters I've read:
- Chapter 4 is about fuzzy string matching. I didn't try the examples; I was mainly interested in the principles behind it, for instance how to reply with "Did you mean ...?" when a word in your query is misspelled.
- The same goes for chapter 5, about identifying people, places, and dates. I didn't know anything about the algorithms that can be used to find a proper noun in a text.
- I've learned about many different products in the previous chapters (Solr, Lucene, OpenNLP, ...), and I'm happy that I can now place those names, but for the moment I don't need to use them. I was only reading this MEAP out of interest (and I've learned plenty of new things). In chapter 7, about clustering text, I learned about two more products, Carrot2 and Apache Mahout, but I had more difficulty understanding the examples, maybe because the subject matter is more complex.
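
To make the "Did you mean ...?" principle from chapter 4 a bit more concrete, here is a minimal sketch in Python. It is not taken from the book and it is not how Solr or Lucene implement spell checking; it simply assumes that suggestions are picked by Levenshtein edit distance against a known vocabulary. The names `edit_distance`, `did_you_mean`, and `max_distance` are my own, made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def did_you_mean(word, vocabulary, max_distance=2):
    """Return the closest vocabulary entry, or None when the word is already
    known (distance 0) or nothing is within max_distance edits.
    max_distance is an illustrative threshold, not a recommended value."""
    if not vocabulary:
        return None
    distance, best = min(
        (edit_distance(word.lower(), entry.lower()), entry)
        for entry in vocabulary
    )
    return best if 0 < distance <= max_distance else None


if __name__ == "__main__":
    terms = ["algorithm", "logarithm", "rhythm"]
    print(did_you_mean("algorythm", terms))
```

A real search engine would not compare the query against every term like this; the sketch only shows the idea of ranking candidates by how many edits separate them from what was typed.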
In any case, I think this will be an excellent book for developers who want an introduction to the world of text processing, especially if they have to implement search functionality, match textual data (for instance, matching movies across different DVD databases), or programmatically derive meaning from documents.
As mentioned before, there's still a lot of work to be done on the MEAP's layout, but knowing Manning, I'm confident that this won't be a problem.