
Archive for the ‘programming’ Category

Web content mining is an interesting and wide domain. Almost anyone can modify one or several modules of a simple web mining system (like the one we’re building) and create a whole different approach to a more specific problem. There are two things that are very often modified (improved):
- the link extractor module
- the document [...]


I am glad to announce that I am participating in Google Summer of Code™ 2009 with the XWiki organization. My project concerns XOffice, the XWiki integration with MS Office:

- adding style support to XWord
- porting XOffice to MS Office 2003

Remember when I said that “I am going to start a project about style-sheet [...]


It’s time to make our focused web crawler aware of its topic: the thesaurus. The simplest way is to have a flat thesaurus: a list of keywords that are related to our topic. And this is how we’re going to implement it. In the future we could improve it with a hierarchical thesaurus: a tree [...]
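
A flat thesaurus like the one described above could look roughly like the following. This is only a minimal sketch: the class name `FlatThesaurus` and the methods `contains` and `countMatches` are hypothetical names chosen for illustration, not the Thesaurus class from the full post, and counting keyword hits is just one assumed way of using the keyword list.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical flat thesaurus: a plain set of topic keywords (illustrative only). */
final class FlatThesaurus {
    private final Set<String> keywords = new HashSet<>();

    FlatThesaurus(String... words) {
        // Store keywords in lower case so lookups are case-insensitive.
        for (String w : words) {
            keywords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true if the word is one of the topic keywords. */
    boolean contains(String word) {
        return keywords.contains(word.toLowerCase(Locale.ROOT));
    }

    /** Counts how many tokens of the text are topic keywords. */
    int countMatches(String text) {
        int hits = 0;
        // Split on any run of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (keywords.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        FlatThesaurus topic = new FlatThesaurus("crawler", "spider", "indexing");
        System.out.println(topic.countMatches("A web crawler, or spider, downloads pages."));
    }
}
```

A crawler could compare such a count against a threshold to decide whether a page or a link text is on topic; the threshold itself would have to be tuned and is not specified here.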


Let’s come back to our simple focused crawler. It’s time to start filtering the links we extract and the pages we download.
We need:

- more information about the links: the Link class
- knowledge of our topic: the Thesaurus class
- a filter for extracted links
- a filter for downloaded pages

The first step is to define the Link class:

package ro.teo.ssc.obj;

public class Link [...]


Some ideas from The Cathedral and the Bazaar by Eric Steven Raymond:

Every good work of software starts by scratching a developer’s personal itch.
Good programmers know what to write. Great ones know what to rewrite (and reuse).
“Plan to throw one away; you will, anyhow.” (Fred Brooks, The Mythical Man-Month, Chapter 11)
If you have the right attitude, [...]


Let’s see some common web spider architectures:

- the big picture
- a basic web crawler
- a large-scale web spider


Some websites use various techniques to insert links into their pages that are meant for web bots rather than for users.
Some of these techniques include CSS hiding, JavaScript hiding, and JavaScript link removal.
Some CSS examples:

<a href="URL1" style="display:none">Hidden Link 1</a>

<a href="URL1" class="hiddenLink">Hidden Link 1</a>

<a href="URL2">
Hidden Link 2
</a>

<div style="display:none;">
..
<a href="URL3">Hidden [...]
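
The simplest of the cases above, an anchor with an inline `display:none` style, could be detected along these lines. This is a rough sketch under several assumptions: `HiddenLinkDetector` and `inlineHiddenHrefs` are hypothetical names, attribute values are assumed to be double-quoted, and regular expressions stand in for a real HTML parser. It does not handle links hidden through a CSS class, through a hidden parent element, or through JavaScript.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that finds anchors hidden with an inline style (illustrative only). */
final class HiddenLinkDetector {
    // Matches an opening <a ...> tag and captures its raw attribute text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Captures the value of a double-quoted href attribute.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Matches a double-quoted style attribute that contains display:none.
    private static final Pattern DISPLAY_NONE =
            Pattern.compile("style\\s*=\\s*\"[^\"]*display\\s*:\\s*none[^\"]*\"",
                    Pattern.CASE_INSENSITIVE);

    /** Returns the href values of anchors whose own style attribute hides them. */
    static List<String> inlineHiddenHrefs(String html) {
        List<String> hidden = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attrs = anchor.group(1);
            Matcher href = HREF.matcher(attrs);
            if (DISPLAY_NONE.matcher(attrs).find() && href.find()) {
                hidden.add(href.group(1));
            }
        }
        return hidden;
    }
}
```

A crawler that wants to behave like a regular visitor could skip the returned URLs instead of queuing them. Covering the class-based and parent-element cases would require resolving stylesheets and walking the document tree, which this sketch deliberately leaves out.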

