The amount of clutter present on most webpages makes discovering what is important a pain. At the observatorium I’ve been using a simple tag-to-text ratio to extract the important sections of text from webpages. The results are good but not great: the method is fast, and it works if one accepts that noise exists and can’t be totally eliminated.
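To make the idea concrete, here is a minimal sketch of one way a tag-to-text ratio could be computed per line of HTML. This is an illustration of the general technique, not the exact implementation used at the observatorium: lines dominated by markup score near 1.0, while content-heavy lines score near 0.0.

```python
import re

def tag_text_ratio(html_line):
    """Ratio of characters inside tags to total characters in a line of HTML.

    A high ratio suggests boilerplate/navigation markup; a low ratio
    suggests the line is mostly readable text.
    """
    tags = re.findall(r"<[^>]*>", html_line)
    tag_chars = sum(len(t) for t in tags)
    total = len(html_line)
    # An empty line carries no text, so treat it as pure markup.
    return tag_chars / total if total else 1.0
```

Thresholding this ratio (or a smoothed version of it across neighboring lines) is what lets the extractor keep content lines and drop markup-heavy ones.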
The other day I found another technique that I think might become my de facto standard for text extraction from webpages, as its first results are better than I expected. The algorithm detects the meaningful sections of a page with high accuracy and also has the benefit of being truly fast.
It is derived from the paper “Boilerplate Detection using Shallow Text Features” by Christian Kohlschütter et al., presented at WSDM 2010, and there’s a Google Code repository available with the Java source and binaries to download.
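The core idea of the paper is that cheap, “shallow” features of a text block, chiefly its word count and its link density (the fraction of words inside anchor text), together with the same features of the neighboring blocks, are enough to separate main content from boilerplate. The sketch below is a toy classifier in that spirit; the thresholds are illustrative placeholders, not the values the paper learns from labeled data, and the block representation is my own simplification.

```python
def link_density(block):
    """Fraction of a block's words that sit inside anchor (<a>) text."""
    if block["num_words"] == 0:
        return 0.0
    return block["anchor_words"] / block["num_words"]

def is_content(prev, curr, nxt):
    """Classify the current block using shallow features of it and
    its neighbors. Thresholds here are illustrative, not trained."""
    # Link-heavy blocks (menus, footers, related-links lists) are
    # almost always boilerplate.
    if link_density(curr) > 0.33:
        return False
    # Long runs of low-link text are very likely main content.
    if curr["num_words"] > 40:
        return True
    # Short blocks are kept only when surrounded by substantial text,
    # which is the role the neighboring-block features play.
    return prev["num_words"] > 8 and nxt["num_words"] > 8
```

What makes this fast is that no DOM analysis or rendering is needed: the page only has to be segmented into blocks once, after which classification is a handful of comparisons per block.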