Archive for the ‘web mining’ Category

Web content mining is an interesting and wide domain. Almost anyone can modify one or more modules of a simple web mining system (like the one we’re building) and create a whole different approach to a more specific problem. There are two things that are very often modified (improved):
- the link extractor module
- the document [...]

Read Full Post »

It’s time to make our focused web crawler aware of its topic: the thesaurus. The simplest way is to have a flat thesaurus: a list of keywords that are related to our topic. And this is how we’re going to implement it. In the future we could improve it with a hierarchical thesaurus: a tree [...]
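
To make the idea of a flat thesaurus concrete, here is a minimal sketch, assuming the thesaurus is nothing more than a set of single-word keywords matched case-insensitively. The class name FlatThesaurus and its methods are hypothetical and are not the Thesaurus class from the full post.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical flat thesaurus: a plain set of single-word topic keywords. */
final class FlatThesaurus {
    private final Set<String> keywords = new HashSet<>();

    FlatThesaurus(String... words) {
        // Store keywords in lower case so lookups are case-insensitive.
        for (String w : words) {
            keywords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** True if the word is one of the topic keywords. */
    boolean contains(String word) {
        return keywords.contains(word.toLowerCase(Locale.ROOT));
    }

    /** Counts how many tokens of the text are topic keywords. */
    int countMatches(String text) {
        int hits = 0;
        // Split on anything that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (keywords.contains(token)) {
                hits++;
            }
        }
        return hits;
    }
}
```

A crawler could call countMatches on a link's anchor text or on a downloaded page and keep only the items that score above some threshold. Multi-word keywords and stemming are left out of this sketch on purpose.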

Read Full Post »

Let’s come back to our simple focused crawler. It’s time to start filtering the links we extract and the pages we download.
We need:

- to know more about the links: the Link class
- to know our topic: the Thesaurus class
- to filter extracted links
- to filter downloaded pages

The first step is to define the Link class:

package ro.teo.ssc.obj;

public class Link [...]

Read Full Post »

Let’s see some common web spider architectures:

the big picture
a basic web crawler
and a large-scale web spider

Read Full Post »

Some websites use various techniques to insert links into their pages that are meant for web bots rather than for users.
These techniques include CSS hiding, JavaScript hiding and JavaScript link removal.
Some CSS examples:

<a href="URL1" style="display:none">Hidden Link 1</a>

<a href="URL1" class="hiddenLink">Hidden Link 1</a>

<a href="URL2">
Hidden Link 2
</a>

<div style="display:none;">
..
<a href="URL3">Hidden [...]
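
As a rough illustration of how a crawler might flag the simplest case above (an inline display:none style on the link itself), here is a minimal sketch. The class HiddenLinkDetector is hypothetical, and it uses regular expressions only to stay dependency-free. It does not catch links hidden through a CSS class, a hidden parent element, or JavaScript, and a real crawler would more likely use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds links hidden with an inline display:none style. */
final class HiddenLinkDetector {
    // Matches an opening <a ...> tag and captures its attribute text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Captures the value of a double-quoted href attribute.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Matches a double-quoted style attribute that contains display:none.
    private static final Pattern HIDDEN_STYLE =
            Pattern.compile("style\\s*=\\s*\"[^\"]*display\\s*:\\s*none[^\"]*\"",
                    Pattern.CASE_INSENSITIVE);

    /** Returns the href values of anchors whose own style hides them. */
    static List<String> inlineHiddenHrefs(String html) {
        List<String> hidden = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attrs = anchor.group(1);
            Matcher href = HREF.matcher(attrs);
            if (HIDDEN_STYLE.matcher(attrs).find() && href.find()) {
                hidden.add(href.group(1));
            }
        }
        return hidden;
    }
}
```

Whether such links should be skipped or treated as a trap signal is a policy decision for the link filter; the sketch only reports them.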

Read Full Post »

Let’s see what we’ve covered so far:

as long as there are addresses in the URL queue, repeat:
– get the first address from the queue
– download the page from that address and store it in a temp location
– check [...]
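
The first steps of the loop above can be sketched roughly as follows. This is only an illustration: CrawlLoopSketch is a hypothetical class, the downloader is passed in as a plain function so the sketch stays independent of any HTTP code, and the steps from the check onwards are left out because the excerpt does not show them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

/** Hypothetical sketch of the crawl loop: poll, download, store in a temp file. */
final class CrawlLoopSketch {
    /**
     * Drains the URL queue and returns the temp files the pages were stored in.
     * The downloader is an assumption: any function mapping a URL to page content.
     */
    static List<Path> run(Queue<String> urlQueue, Function<String, String> downloader) {
        List<Path> stored = new ArrayList<>();
        // As long as there are addresses in the URL queue, repeat.
        while (!urlQueue.isEmpty()) {
            // Get the first address from the queue.
            String url = urlQueue.poll();
            // Download the page from that address.
            String page = downloader.apply(url);
            try {
                // Store it in a temp location.
                Path tmp = Files.createTempFile("page-", ".html");
                Files.write(tmp, page.getBytes(StandardCharsets.UTF_8));
                stored.add(tmp);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // The remaining steps are not shown in the excerpt and are omitted here.
        }
        return stored;
    }
}
```

Passing the downloader as a function also makes the loop easy to exercise with canned pages before wiring in real network code.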

Read Full Post »

If you’ve heard about Agile Programming or Extreme Programming, or you’ve worked on some real projects, you know that specifications change over time.
When I first thought about IURLQueue, I didn’t think I’d need to know the actual number of items in the queue.
When I actually needed it, I modified the code:

package ro.teo.ssc.urlfrontier;

public interface INewUrlQueue [...]

Read Full Post »
