Google News is my favorite news website. My day doesn’t begin until I have had my daily dose of the world’s happenings from Google News. Over the last month or two, I’ve noticed that stories from one particular site (www.playfuls.com) seem to appear very frequently on Google News. At times, they even have 2-3 stories on the Google News page simultaneously. See screenshot below:
![]()
My guess is that they publish their stories very quickly - Google News probably applies a time decay factor in its algorithm while clustering the news, so fresher stories would carry more weight.
I’m a huge fan of Google News and Techmeme, both of which track news and blogs. The best part of these sites is that they offer a clustered view of stories. This clustering feature greatly enhances the end user experience - if a reader finds a particular cluster interesting, they can very easily read alternative articles about the same story. Besides, it gives a good sense of which stories are currently hot.
So, how does this news clustering actually work? The following are the major steps involved:
1) There is a crawler, which periodically crawls through a list of news sites - this list of sites is crucial and needs to be kept secret to deter spammers. For that reason, neither Google News nor Techmeme publicly discloses its list of sources.
2) Once the crawler has finished crawling through the news sites, the stories are stored in a backend database. Then, a text summarization algorithm needs to be run against each story. The purpose of the text summarization is to extract the keywords or a document summary from the news story. Popular techniques for text summarization are binomial summarization, multinomial summarization etc. Other forms of natural language processing are also applied to the news story - stemming (reducing words to their root form, so that variants such as ‘clusters’ and ‘clustering’ are treated as the same term), stopping (removing stop words like ‘of’, ‘for’ etc.), boosting (adding weight to a sentence depending on where it occurs - e.g. in the title, in the body of the story etc.).
3) Once the summarization is done on the news story, clustering algorithms are applied to the stories based on the summaries obtained in step 2. Popular types of clustering algorithms are the K-means algorithm and hierarchical algorithms.
Now, these algorithms generally do not include a time decay factor, since they are generic clustering algorithms. Since we’re dealing with news stories, which are time-sensitive, it is crucial to include a time factor (‘freshness’) in the algorithm. While searching on the net, I came across a brilliant resource which discusses time-aware clustering algorithms for news clustering. It is the PhD thesis of Antonio Gulli, who now works as the Director of Advanced Search Products for Ask. You can find it here.
4) Once you are done running the clustering algorithm on the news summaries, your clusters are ready for consumption.
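To make steps 2 and 3 a bit more concrete, here is a minimal Python sketch of the idea. Please treat it as an illustration only: it is not how Google News or Techmeme work, and it is not the algorithm from Gulli's thesis. For brevity it uses a simple greedy single-pass clustering instead of K-means or hierarchical clustering, a crude suffix stripper instead of a real stemmer, and a half-life style freshness factor. All names and constants (`STOP_WORDS`, `TITLE_BOOST`, `HALF_LIFE_HOURS`, `THRESHOLD`, `keywords`, `cluster`) are my own assumptions.

```python
import math
import re
from collections import Counter

# All constants below are assumed values chosen for illustration.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "to", "and", "is", "by", "as"}
TITLE_BOOST = 3.0       # boosting: a word in the title counts 3x
HALF_LIFE_HOURS = 12.0  # freshness halves for every 12 hours between two stories
THRESHOLD = 0.3         # minimum time-decayed similarity to join a cluster


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(title, body):
    """Build a weighted term vector: stop words removed, words stemmed, title boosted."""
    vec = Counter()
    for text, weight in ((title, TITLE_BOOST), (body, 1.0)):
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                vec[stem(token)] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def freshness(gap_hours):
    """Time decay factor: 1.0 for simultaneous stories, 0.5 after one half-life."""
    return 0.5 ** (gap_hours / HALF_LIFE_HOURS)


def cluster(stories):
    """Greedy single-pass clustering of (title, body, published_hour) tuples.

    Each story joins the cluster whose first story it matches best, where the
    match is text similarity multiplied by freshness; otherwise it starts a
    new cluster. Returns a list of clusters, each a list of titles.
    """
    clusters = []  # each cluster is a list of (vector, hour, title)
    for title, body, hour in sorted(stories, key=lambda s: s[2]):
        vec = keywords(title, body)
        best, best_score = None, THRESHOLD
        for members in clusters:
            seed_vec, seed_hour, _ = members[0]
            score = cosine(vec, seed_vec) * freshness(abs(hour - seed_hour))
            if score >= best_score:
                best, best_score = members, score
        if best is None:
            clusters.append([(vec, hour, title)])
        else:
            best.append((vec, hour, title))
    return [[title for _, _, title in members] for members in clusters]
```

With this sketch, two differently worded stories about the same event published a couple of hours apart end up in one cluster, while a story with identical text published a week later starts a cluster of its own, because the freshness factor pushes its score below the threshold.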
So does this mean you have to go through all this work if you want to leverage document/news clustering for your startup?
The answer is no - fortunately, there is some good open source software to make your life a bit easier. Combined together, though, these tools are still not a perfect solution - you will need to tweak them and plug in several missing pieces.
Nutch - Nutch is an open source search engine based on Jakarta Lucene
Carrot2 - It’s an open source document clustering engine and includes implementations of several clustering algorithms. They even have a plugin for Nutch. Carrot2 is Java based, but they claim to offer XML-RPC access for PHP and other scripting languages (although I couldn’t find very good documentation about this on their site).
If you liked my post, feel free to subscribe to my RSS feeds.