Web content mining is an interesting and wide domain. Almost anyone can modify one or several modules of a simple web mining system (like the one we’re building) and create a whole different approach to a more specific problem. There are two things that are very often modified (improved):
- the link extractor module
- the document [...]
Archive for the ‘programming’ Category
Classification and clustering [part one]
Posted in programming, web mining, tagged best-first, classification, cluster analysis, clustering, focused crawler, k-means, K-nearest neighbor, kNN, naive Bayes classifier, web crawler, web mining, web spider on May 17, 2009 | 2 Comments »
Google Summer of Code 2009
Posted in GSoC, programming, tagged 2009, Google Summer of Code, GSoC, GSoC 2009, microsoft office, microsoft word, MS Office, Office, Office 2003, Office integration, open office, XOffice, XWiki, XWord on May 11, 2009 | 2 Comments »
I am glad to announce that I am participating in Google Summer of Code™ 2009 with the XWiki organization. My project is related to XOffice – XWiki integration with MS Office:
adding style support to XWord
porting XOffice to MS Office 2003
Remember when I said that “I am going to start a project about style-sheet [...]
The Thesaurus
Posted in programming, web mining, tagged crawler, focused crawler, thesaurus, web crawler, web mining, web spider on May 11, 2009 | Leave a Comment »
It’s time to make our focused web crawler aware of its topic: the thesaurus. The simplest way is to have a flat thesaurus: a list of keywords that are related to our topic. And this is how we’re going to implement it. In the future we could improve it with a hierarchical thesaurus: a tree [...]
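As a rough illustration of the flat thesaurus idea (a list of topic keywords), here is a minimal sketch. The class name `FlatThesaurus` and its `contains` and `score` methods are assumptions made for this example, not the code from the full post, and it only handles single-word keywords.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical flat thesaurus: a plain set of single-word topic keywords.
final class FlatThesaurus {
    private final Set<String> keywords = new HashSet<>();

    FlatThesaurus(String... words) {
        for (String w : words) {
            keywords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    // Case-insensitive membership test for a single word.
    boolean contains(String word) {
        return keywords.contains(word.toLowerCase(Locale.ROOT));
    }

    // Fraction of the words in the text that are topic keywords (0.0 for empty text).
    double score(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (keywords.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

A crawler could compare `score(pageText)` against a threshold to decide whether a page is on topic; the threshold itself would have to be tuned by hand.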
The Link
Posted in programming, web mining, tagged crawler, focused crawler, web crawler, web mining, web spider on May 11, 2009 | Leave a Comment »
Let’s come back to our simple focused crawler. It’s time to start filtering the links we extract and the pages we download.
We need
to know more info about the links – the Link class
to know our topic – the Thesaurus class
filter extracted links
filter downloaded pages
The first step is to define the Link
package ro.teo.ssc.obj;
public class Link [...]
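The "filter extracted links" step from the list above could look roughly like the sketch below. Since the post's `Link` class is cut off here, the example uses a separate, hypothetical `CandidateLink` holder (URL plus anchor text) and a hypothetical `LinkFilter`; both are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a link record; this is NOT the Link class from the post.
final class CandidateLink {
    final String url;
    final String anchorText;

    CandidateLink(String url, String anchorText) {
        this.url = url;
        this.anchorText = anchorText;
    }
}

// Hypothetical filter: keeps links whose URL or anchor text mentions a topic keyword.
final class LinkFilter {
    private final List<String> keywords = new ArrayList<>();

    LinkFilter(List<String> topicKeywords) {
        for (String k : topicKeywords) {
            keywords.add(k.toLowerCase(Locale.ROOT));
        }
    }

    // True if at least one keyword occurs as a substring of the URL or anchor text.
    boolean accept(CandidateLink link) {
        String haystack = (link.url + " " + link.anchorText).toLowerCase(Locale.ROOT);
        for (String k : keywords) {
            if (haystack.contains(k)) {
                return true;
            }
        }
        return false;
    }

    // Returns the accepted links in their original order.
    List<CandidateLink> filter(List<CandidateLink> extracted) {
        List<CandidateLink> kept = new ArrayList<>();
        for (CandidateLink link : extracted) {
            if (accept(link)) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

Substring matching is deliberately crude: it is cheap enough to run on every extracted link, but it will also accept accidental matches inside longer words.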
The Cathedral and the Bazaar
Posted in other, programming, tagged Bazaar, Cathedral, Eric Steven Raymond, The Cathedral and the Bazaar on April 23, 2009 | Leave a Comment »
Some ideas from The Cathedral and the Bazaar – Eric Steven Raymond
Every good work of software starts by scratching a developer’s personal itch.
Good programmers know what to write. Great ones know what to rewrite (and reuse).
“Plan to throw one away; you will, anyhow.” (Fred Brooks, The Mythical Man-Month, Chapter 11)
If you have the right attitude, [...]
Web Crawler Architectures
Posted in programming, web mining, tagged architecture, craw, crawler, crawler architecture, large scale web crawler, web, web crawler on April 14, 2009 | 1 Comment »
Let’s see some common web spider architectures:
the big picture
a basic web crawler
and a large-scale web spider
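For readers who prefer code to diagrams, the core loop of a basic web crawler can be sketched as follows. This is a generic breadth-first loop written for illustration, not the architecture from the post; the `BasicCrawler` class and the `fetchAndExtract` function (which stands in for the download and link-extraction modules) are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical minimal crawler: a frontier queue plus a set of already-seen URLs.
final class BasicCrawler {
    // Stand-in for "download the page and extract its links".
    private final Function<String, List<String>> fetchAndExtract;

    BasicCrawler(Function<String, List<String>> fetchAndExtract) {
        this.fetchAndExtract = fetchAndExtract;
    }

    // Visits pages breadth-first from the seed, stopping after maxPages pages.
    List<String> crawl(String seed, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();

        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);
            for (String outLink : fetchAndExtract.apply(url)) {
                // Set.add returns false for URLs that were already queued or visited.
                if (seen.add(outLink)) {
                    frontier.add(outLink);
                }
            }
        }
        return visited;
    }
}
```

Passing the fetch step in as a function keeps the loop testable without network access; a large-scale spider would replace the in-memory queue and set with distributed, persistent structures.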
Evil Links
Posted in programming, web mining, tagged CSS, evil links, hidden links, JavaScript, stylesheet extraction on April 1, 2009 | 7 Comments »
Some websites use various techniques to insert links into their pages that are meant not for users but for web bots.
Some of these techniques include CSS hiding, JavaScript hiding and JavaScript link removal.
Some CSS examples:
<a href="URL1" style="display:none">Hidden Link 1</a>
<a href="URL1" class="hiddenLink">Hidden Link 1</a>
<a href="URL2">
Hidden Link 2
</a>
<div style="display:none;">
..
<a href="URL3">Hidden [...]
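A crawler can catch the simplest of these cases, an inline `display:none` on the anchor itself, with a small check like the sketch below. The `HiddenLinkDetector` class is a hypothetical helper written for this example, and it is regex-based on purpose to stay short: it assumes double-quoted attributes and does not detect links hidden through a CSS class, a hidden parent element, or JavaScript, which would need a real HTML parser and stylesheet extraction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds anchors hidden with an inline "display:none" style.
final class HiddenLinkDetector {
    // Opening <a ...> tag; group 1 captures the raw attribute string.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Double-quoted href attribute; group 1 captures the URL.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Double-quoted style attribute that contains display:none.
    private static final Pattern HIDDEN_STYLE =
            Pattern.compile("style\\s*=\\s*\"[^\"]*display\\s*:\\s*none[^\"]*\"",
                    Pattern.CASE_INSENSITIVE);

    // Returns the href values of all anchors that carry an inline display:none.
    static List<String> inlineHiddenHrefs(String html) {
        List<String> hidden = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attributes = anchor.group(1);
            if (!HIDDEN_STYLE.matcher(attributes).find()) {
                continue; // no inline display:none on this anchor
            }
            Matcher href = HREF.matcher(attributes);
            if (href.find()) {
                hidden.add(href.group(1));
            }
        }
        return hidden;
    }
}
```

For the first example above, `HiddenLinkDetector.inlineHiddenHrefs(...)` would report `URL1`, while the class-based and parent-based examples would pass through undetected.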