<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Classification and clustering [part one]</title>
	<atom:link href="http://teofilachirei.wordpress.com/2009/05/17/classification-clustering-1/feed/" rel="self" type="application/rss+xml" />
	<link>http://teofilachirei.wordpress.com/2009/05/17/classification-clustering-1/</link>
	<description>Teofil Achirei's official blog</description>
	<lastBuildDate>Thu, 03 Sep 2009 12:30:43 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: teofilachirei</title>
		<link>http://teofilachirei.wordpress.com/2009/05/17/classification-clustering-1/#comment-43</link>
		<dc:creator>teofilachirei</dc:creator>
		<pubDate>Mon, 18 May 2009 19:23:19 +0000</pubDate>
		<guid isPermaLink="false">http://teofilachirei.wordpress.com/?p=294#comment-43</guid>
		<description>Right now, using the Thesaurus and Link classes, the crawler can only determine whether a document is interesting (i.e. on topic) and whether *some* of its links also look interesting and should be considered for crawling.
The next step (I&#039;ll post about it pretty soon) is to determine as accurately as possible which links are worth following. That&#039;s why the thesaurus is so simple right now.
Also, the number of links extracted from any one page will probably be limited, because I don&#039;t want to crawl only one website.
After this step is complete, I&#039;ll modify the Thesaurus class so it can be used for classifying documents (pages). Simply put: the thesaurus will have &quot;categories&quot;, and each category will have its own list of keywords. A hash map should be enough for this.

To sum up:
- cluster or classify documents, not links
- use the thesaurus just for the Best-First algorithm
- the thesaurus will be improved soon, and I&#039;ll come up with some simple document classification</description>
		<content:encoded><![CDATA[<p>Right now, using the Thesaurus and Link classes, the crawler can only determine whether a document is interesting (i.e. on topic) and whether *some* of its links also look interesting and should be considered for crawling.<br />
The next step (I&#8217;ll post about it pretty soon) is to determine as accurately as possible which links are worth following. That&#8217;s why the thesaurus is so simple right now.<br />
Also, the number of links extracted from any one page will probably be limited, because I don&#8217;t want to crawl only one website.<br />
After this step is complete, I&#8217;ll modify the Thesaurus class so it can be used for classifying documents (pages). Simply put: the thesaurus will have &#8220;categories&#8221;, and each category will have its own list of keywords. A hash map should be enough for this.</p>
<p>To sum up:<br />
- cluster or classify documents, not links<br />
- use the thesaurus just for the Best-First algorithm<br />
- the thesaurus will be improved soon, and I&#8217;ll come up with some simple document classification</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ionut Popa</title>
		<link>http://teofilachirei.wordpress.com/2009/05/17/classification-clustering-1/#comment-42</link>
		<dc:creator>Ionut Popa</dc:creator>
		<pubDate>Mon, 18 May 2009 13:38:13 +0000</pubDate>
		<guid isPermaLink="false">http://teofilachirei.wordpress.com/?p=294#comment-42</guid>
		<description>So are you going to use the classifier class to classify both the page and the links? Or, if the document matches the thesaurus, will you extract all the links from it?</description>
		<content:encoded><![CDATA[<p>So are you going to use the classifier class to classify both the page and the links? Or, if the document matches the thesaurus, will you extract all the links from it?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
