Web content mining is an interesting and wide domain. Almost anyone can modify one or more modules of a simple web mining system (like the one we’re building) and create a whole different approach to a more specific problem. There are two things that are very often modified (improved):
- the link extractor module
- the document clustering/classification module
The most used techniques for link extraction (the crawling algorithm) are: Breadth-First, Best-First, PageRank, Shark-Search, and InfoSpiders. In the simple focused web spider that I’m building I am using the Best-First technique: the links that best match the thesaurus are added to the URL Queue.
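To make the Best-First idea more concrete, here is a minimal Python sketch of a URL queue ordered by score. It is only an illustration: the `BestFirstQueue` class, the `score_link` function, the sample thesaurus and the example.com URLs are all assumptions made for this sketch, not the spider's actual code. The score is simply the number of thesaurus terms found in the link's anchor text.

```python
import heapq
import itertools


def score_link(anchor_text, thesaurus):
    """Count how many thesaurus terms appear in the link's anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & thesaurus)


class BestFirstQueue:
    """Hypothetical URL queue that returns the highest-scoring link first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: keeps insertion order

    def push(self, url, score):
        if url in self._seen or score <= 0:
            return  # skip duplicates and links that match no thesaurus term
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best link first
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Sample data, invented for the example
thesaurus = {"web", "mining", "crawler", "clustering"}
links = [
    ("http://example.com/a", "cooking recipes"),
    ("http://example.com/b", "web mining tutorial"),
    ("http://example.com/c", "a simple web crawler"),
    ("http://example.com/d", "web crawler and clustering"),
]

queue = BestFirstQueue()
for url, text in links:
    queue.push(url, score_link(text, thesaurus))

# Links come out best match first; the off-topic link was never queued
order = [queue.pop() for _ in range(len(queue))]
```

A real crawler would probably score more than the anchor text (the URL itself, the surrounding text), but the queue logic would stay the same.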
The second module that is often improved in a web mining system is the document clustering/classification module. Clustering and classification are not the same thing, so let’s point out the main differences between them.
Cluster analysis or clustering is the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. Often similarity is assessed according to a distance measure.
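As an illustration of what a distance measure between documents can look like, here is a small Python sketch based on cosine similarity over raw word counts. This is just one common choice, picked for the example; the function names and the sample texts are assumptions of the sketch.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a bag of words (word -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Return 1.0 for identical word distributions, 0.0 for no shared words."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Smaller distance means more similar documents."""
    return 1.0 - cosine_similarity(a, b)


# Sample documents, invented for the example
doc1 = term_vector("web mining and web crawling")
doc2 = term_vector("web crawling with a focused spider")
doc3 = term_vector("chocolate cake recipe")

near = cosine_distance(doc1, doc2)  # the two documents share "web" and "crawling"
far = cosine_distance(doc1, doc3)   # the two documents share no words
```

With a measure like this, a clustering algorithm would tend to put `doc1` and `doc2` in the same cluster and `doc3` in another.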
Document classification/categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. Document classification tasks can be divided into two sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification, where the classification must be done entirely without reference to external information.
Clustering
- Unsupervised
- Input:
  - Clustering algorithm
  - Similarity measure
  - Number of clusters
- No specific information for each document
Classification
- Supervised
- Each document is labeled with a class
- Build a classifier that assigns documents to one of the classes
Classification: the task is to learn to assign instances to predefined classes.
Clustering: no predefined classification is required. The task is to learn a classification from the data.
Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other: they share certain properties.
Supervised learning: classification requires supervised learning, i.e., the training data has to specify what we are trying to learn (the classes).
Unsupervised learning: clustering is an unsupervised task, i.e., the training data doesn’t specify what we are trying to learn (the clusters).
One of the most used and widely known clustering algorithms is K-means.
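Here is a minimal K-means sketch in Python, working on plain 2D points to keep it short. It makes some simplifying assumptions that a real implementation would not: the first `k` points are used as the initial centroids (usually they are picked at random), and the points are made-up sample data. For documents, the points would be term vectors instead.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100):
    centroids = list(points[:k])  # naive initialisation: the first k points
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, the algorithm has converged
        assignment = new_assignment
        # Update step: move every centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids


# Sample data, invented for the example: two visibly separate groups
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centroids = k_means(points, k=2)
```

Note that the number of clusters `k` has to be given as input; the algorithm only decides which point goes where.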
Well-known document classification techniques are the naive Bayes classifier and kNN (k-nearest neighbors).
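To show the supervised side, here is a minimal kNN sketch in Python. It is only an illustration under some assumptions: similarity is measured with the Jaccard index over word sets (one possible choice among many), and the training documents and their labels are invented for the example.

```python
from collections import Counter


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_classify(text, training_set, k=3):
    """Label a document by majority vote among its k most similar neighbors."""
    words = set(text.lower().split())
    ranked = sorted(
        training_set,
        key=lambda item: jaccard(words, set(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Labeled training documents, invented for the example
training_set = [
    ("focused web crawler follows links", "web mining"),
    ("web spider extracts links from pages", "web mining"),
    ("clustering web documents by similarity", "web mining"),
    ("bake the cake in the oven", "cooking"),
    ("chocolate cake recipe with butter", "cooking"),
    ("slow oven roast recipe", "cooking"),
]

label = knn_classify("a crawler that extracts links from web pages", training_set)
```

Unlike K-means, nothing works here without the labels: the classes are fixed in advance by whoever prepared the training set.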
So, to sum up, here are the main ideas:
- classification and clustering both work with documents
- with classification we try to assign every document to a class according to well-established criteria, and we (humans) often supervise the process
- with clustering we let the computer determine (discover) what the main classes are in a set of documents and partition the documents between these classes
I hope this article gave you a good idea of the difference between clustering and classification.


So are you going to use the classifier class to classify both the page and the links? Or if the document matches the thesaurus you’ll extract all the links from it?
Right now, using the Thesaurus and Link classes, the crawler can only determine whether a document is interesting (on topic) and whether *some* links also look interesting and should be considered for crawling.
The next step (and I’ll post about it pretty soon) is to determine as accurately as possible which links are worth following. That’s the reason the thesaurus is so simple right now.
Also, the number of links extracted from a certain page will probably be limited, because I don’t want to crawl only one website.
After this step is complete, I’ll modify the Thesaurus class so it can be used for classifying documents (pages). Simply put: the thesaurus will have “categories” and each category will have a certain list of keywords. A hash map should be simple enough for this.
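Just to give an idea of the shape this could take, here is a rough Python sketch of a category thesaurus stored in a hash map (a `dict`). Everything in it is hypothetical: the category names, the keywords and the `classify_page` function are made up for the illustration and are not the actual Thesaurus class.

```python
def classify_page(text, thesaurus):
    """Return the category whose keywords occur most often, or None."""
    words = text.lower().split()
    scores = {
        category: sum(words.count(keyword) for keyword in keywords)
        for category, keywords in thesaurus.items()
    }
    if not scores:
        return None  # empty thesaurus, nothing to match against
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical thesaurus: category -> list of keywords
thesaurus = {
    "web mining": ["crawler", "spider", "clustering", "classification"],
    "databases": ["sql", "index", "query", "transaction"],
}

page = "A focused spider is a crawler that uses classification to stay on topic"
category = classify_page(page, thesaurus)
```

Counting keyword hits is about the simplest classification possible, which is probably fine for a first version.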
To sum up:
- cluster or classify documents, not links
- use the thesaurus just for the Best-First algorithm
- the thesaurus will soon be improved and I’ll come up with some simple document classification