This book, Programming Collective Intelligence by Toby Segaran was first published in 2007 and explains chapter after chapter each of the main machine learning algorithms that power (even today still) many of the biggest web services like Google, e-Bay and Facebook.
Despite its publicized social web focus, I can't help thinking of this book as a more general introduction to machine learning algorithms.
Because of that and also because, due to its age and success, this book already has several detailed reviews over the web, I'll make an attempt at a kind of cross-review by explaining how it could be a perfect companion book to the on-line and free Machine Learning class by Andrew Ng.
Disclaimer: I've spent the last 4 years in a small company where I was literally surrounded by machine learning experts. There, with some of my coworkers, I followed Andrew Ng's on-line courses as away to get on common grounds with the researchers, but it's also possible that, coffee breaks helping, some ideas and concepts made their way to my mind, well against my will I assure you :) . Let's just keep that in mind when I say that such and such things are easy to understand.
Same algorithmsFirst of all both the book and the class deal with approximately the same algorithms, including (but not limited to) the following:
- Bayesian classifier
- Multi-dimension linear regression
- Logistic regression
- Neural Nets
- Decision tree
- Kernel methods
- Hierarchical clustering
Different methodsThe progression of each session or chapter is actually quite different.
The Machine Learning classes, even if they have a very low requirements on math skills lean on the mathematical side of things anyway. Their progression (seen by me) is broadly as follows:
- each new problem is described via its geometrical interpretation
- then the corresponding equation is introduced and explained in a very pragmatic way
- the main steps and basic principles of the new algorithm are explained in simple words
- class attendees are then invited to experiment in this algorithm in octave
- during the conclusion a very quick reference is made to existing application of this algorithms in various domains (medical, web etc)
- definition of a social-web related problem
- arguments on the applicability and limits of the algorithms shown in earlier chapters
- introducing the new algorithm with some basic illustrations
- proposal of Python code sample to implement the new algorithm
- summary of the other cases of applications of the new algorithm
Complementary strengthsI've found both the book and the class to be very pedagogical. The readers and attendees are very gently taken by the hand, introduced to important algorithms and led to the point where they can actually implement them. However I've found in each one quite distinctive strengths (which by symmetry are also the weaknesses of the other).
The strength of the book Programming Collective Intelligence is clearly that it proposes working code samples in the Python language that is (at least today) widely used in web applications ((However I must point at the fact that several reviews over the web seem to question to correctness of the said Python code, since I only read it quickly as "pseudo-code" I have to left the careful readers be the judges)), and also to show how to integrate and use the API of well-known web-services. It's also worthwhile to say that for the math addict that 2 final chapters are dedicated to a summary of all seen algorithms and mathematical formulas.
- Learning curves should be computed by plotting the cost function against the iteration (or the number of samples or even a particular parameter) by separating the available data in 3 sets: 60% for train, 20% for cross-validation (used for optimization of parameters) and the remaining 20% is kept as a test set used to get a reliable measure of the system's performance.
- Looking at them helps detect bias (aka underfit) and variance (aka overfit)
- The learning quality can be improved with the following recipes:
- against bias, use fewer constraints and more flexible models: more features, use polynomial features, decrease the regularization parameter, use a more complex model (polynomials of higher degrees, or networks with more neurons)
- against variance, use more restrictive models or more constraints: more training data, fewer features, increase the regularization parameter, use simpler models (polynomials of lower degrees, or networks with less neurons)