Web content mining is an interesting and wide domain. Almost anyone can modify one or more modules of a simple web mining system (like the one we’re building) and create a whole different approach to a more specific problem. There are two things that are very often modified (improved):
- the link extractor module
- the document [...]
Archive for the ‘web mining’ Category
Classification and clustering [part one]
Posted in programming, web mining, tagged best-first, classification, cluster analysis, clustering, focused crawler, k-means, K-nearest neighbor, kNN, naive Bayes classifier, web crawler, web mining, web spider on May 17, 2009
The Thesaurus
Posted in programming, web mining, tagged crawler, focused crawler, thesaurus, web crawler, web mining, web spider on May 11, 2009
It’s time to make our focused web crawler aware of its topic: the thesaurus. The simplest way is to have a flat thesaurus: a list of keywords that are related to our topic. And this is how we’re going to implement it. In the future we could improve it with a hierarchical thesaurus: a tree [...]
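To make the idea of a flat thesaurus concrete, here is a minimal sketch. It is not the Thesaurus class from the post (which is not shown in this excerpt); the class name FlatThesaurus and the methods contains and countMatches are made up for illustration, and it assumes single-word keywords matched case-insensitively.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical flat thesaurus: just a set of topic keywords (an illustration,
// not the post's own Thesaurus class).
class FlatThesaurus {
    private final Set<String> keywords = new HashSet<>();

    FlatThesaurus(String... words) {
        for (String word : words) {
            keywords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // True if the word is one of the topic keywords (case-insensitive).
    boolean contains(String word) {
        return keywords.contains(word.toLowerCase(Locale.ROOT));
    }

    // Counts how many tokens of the text are topic keywords.
    // Tokens are runs of letters and digits; everything else is a separator.
    int countMatches(String text) {
        int matches = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (keywords.contains(token)) {
                matches++;
            }
        }
        return matches;
    }
}
```

A crawler could use a count like this as a rough relevance score for a downloaded page or for the text of a link. Multi-word keywords would need a different matching strategy.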
The Link
Posted in programming, web mining, tagged crawler, focused crawler, web crawler, web mining, web spider on May 11, 2009
Let’s come back to our simple focused crawler. It’s time to start filtering the links we extract and the pages we download.
We need:
to know more about the links – the Link class
to know our topic – the Thesaurus class
to filter the extracted links
to filter the downloaded pages
The first step is to define the Link class:
package ro.teo.ssc.obj;
public class Link [...]
Web Crawler Architectures
Posted in programming, web mining, tagged architecture, craw, crawler, crawler architecture, large scale web crawler, web, web crawler on April 14, 2009
Let’s see some common web spider architectures:
the big picture
a basic web crawler
and a large-scale web spider
Evil Links
Posted in programming, web mining, tagged CSS, evil links, hidden links, JavaScript, stylesheet extraction on April 1, 2009
Some websites use various techniques to insert links into their pages that are not meant for users, but for web bots.
Some of these techniques include CSS hiding, JavaScript hiding and JavaScript link removal.
Some CSS examples:
<a href="URL1" style="display:none">Hidden Link 1</a>
<a href="URL1" class="hiddenLink">Hidden Link 1</a>
<a href="URL2">
Hidden Link 2
</a>
<div style="display:none;">
..
<a href="URL3">Hidden [...]
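As a rough illustration of how a crawler might spot the simplest of these cases, the sketch below flags links whose own inline style contains display:none. It is only a regex-based approximation, not a real HTML or CSS parser: it does not catch links hidden through a class (like hiddenLink above), through an enclosing hidden element, or through JavaScript. The class name HiddenLinkFinder and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds links hidden with an inline "display:none" style.
class HiddenLinkFinder {
    // An opening <a ...> tag; group 1 is the raw attribute text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // A double-quoted href attribute; group 1 is the URL.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // A double-quoted style attribute that contains display:none.
    private static final Pattern HIDDEN_STYLE =
            Pattern.compile("style\\s*=\\s*\"[^\"]*display\\s*:\\s*none[^\"]*\"",
                    Pattern.CASE_INSENSITIVE);

    // Returns the href values of anchors that carry an inline display:none.
    static List<String> findInlineHiddenLinks(String html) {
        List<String> hidden = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attributes = anchor.group(1);
            Matcher href = HREF.matcher(attributes);
            if (HIDDEN_STYLE.matcher(attributes).find() && href.find()) {
                hidden.add(href.group(1));
            }
        }
        return hidden;
    }
}
```

Catching the class-based and nested cases would require resolving the page's stylesheets and walking the document tree, which this sketch deliberately leaves out.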
A Simple Serial Focused Web Crawler 10
Posted in programming, web mining, tagged crawler, focused crawler, information retreival, web crawler, web mining on March 31, 2009
Let’s see what we’ve covered so far:
as long as there are addresses in URL Queue, repeat:
– get the first address from the queue
– download the page from that address and store it in a temp location
– check [...]
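The visible steps of this loop could be sketched roughly as follows. This is not the crawler code from the series: the class name SerialCrawlLoop is hypothetical, the downloader is passed in as a plain function so the sketch runs without network access, and the steps that are cut off above (starting with the check) are intentionally left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

// Hypothetical sketch of the serial loop: one address at a time, no threads.
class SerialCrawlLoop {
    // Drains the queue and returns the temp files the pages were stored in.
    // The downloader maps an address to the page content (an assumption
    // made here so the download step can be swapped or faked).
    static List<Path> run(Queue<String> urlQueue, Function<String, String> downloader) {
        List<Path> stored = new ArrayList<>();
        // As long as there are addresses in the URL queue, repeat:
        while (!urlQueue.isEmpty()) {
            // Get the first address from the queue.
            String address = urlQueue.poll();
            // Download the page from that address.
            String page = downloader.apply(address);
            try {
                // Store it in a temp location.
                Path temp = Files.createTempFile("page-", ".html");
                Files.write(temp, page.getBytes(StandardCharsets.UTF_8));
                temp.toFile().deleteOnExit();
                stored.add(temp);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // The remaining steps of the loop are not shown in the excerpt
            // and are not sketched here.
        }
        return stored;
    }
}
```

Passing a fake downloader such as `url -> "<html>" + url + "</html>"` is enough to try the loop out without touching the network.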
Errata – New URLs Queue
Posted in programming, web mining, tagged crawler, focused crawler, information retreival, url frontier, web crawler, web mining, web spider on March 31, 2009
If you’ve heard about Agile Programming or Extreme Programming, or you’ve worked on some real projects, you know that specifications change over time.
When I first thought about IURLQueue, I didn’t think I would need to know the actual number of items in the queue.
When I actually needed it, I modified the code:
package ro.teo.ssc.urlfrontier;
public interface INewUrlQueue [...]