May 17, 2009 by Teofil Achirei
Web content mining is an interesting and wide domain. Almost anyone can modify one or more modules of a simple web mining system (like the one we’re building) and create a whole different approach to a more specific problem. Two things are modified (improved) very often:
- the link extractor module
- the document clustering/classification module
The most used techniques for link extraction (the crawling algorithm) are: Breadth-First, Best-First, PageRank, Shark-Search, and InfoSpiders. In the simple focused web spider that I’m building I am using the Best-First technique: best links that match the thesaurus are added to the URL Queue.
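To make the Best-First idea concrete, here is a minimal sketch, assuming Java 8 or later. The class name `BestFirstFrontier` and the scoring rule (count how many keywords appear in the link text) are illustrative assumptions, not the code of the spider described in these posts. The URL queue is a priority queue, so the best-matching link is always fetched next.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.PriorityQueue;
import java.util.Set;

public class BestFirstFrontier {
    /** A queued URL together with the score it got when it was discovered. */
    private static final class Entry {
        final String url;
        final int score;
        Entry(String url, int score) { this.url = url; this.score = score; }
    }

    private final List<String> keywords;
    private final Set<String> seen = new HashSet<>();
    // Highest score first: this ordering is what makes the crawl "best-first".
    private final PriorityQueue<Entry> queue =
        new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.score).reversed());

    public BestFirstFrontier(List<String> keywords) {
        this.keywords = keywords;
    }

    /** Assumed scoring rule: number of keywords found in the link text. */
    int score(String linkText) {
        String text = linkText.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String k : keywords) {
            if (text.contains(k.toLowerCase(Locale.ROOT))) hits++;
        }
        return hits;
    }

    /** Queues a URL at most once; links that match no keyword are dropped. */
    public void offer(String url, String linkText) {
        int s = score(linkText);
        if (s > 0 && seen.add(url)) {
            queue.add(new Entry(url, s));
        }
    }

    /** Returns the best-scoring URL, or null when the queue is empty. */
    public String next() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }
}
```

A Breadth-First crawler would differ only in the queue: a plain FIFO queue instead of one ordered by score.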
The second module that is often improved in a web mining system is the document clustering/classification module. Clustering and classification are not the same thing: clustering groups documents without predefined categories, while classification assigns documents to categories learned from labeled examples.
Posted in programming, web mining | Tagged best-first, classification, cluster analysis, clustering, focused crawler, k-means, K-nearest neighbor, kNN, naive Bayes classifier, web crawler, web mining, web spider | 2 Comments »
May 11, 2009 by Teofil Achirei
Posted in GSoC, programming | Tagged 2009, Google Summer of Code, GSoC, GSoC 2009, microsoft office, microsoft word, MS Office, Office, Office 2003, Office integration, open office, XOffice, XWiki, XWord | 2 Comments »
May 11, 2009 by Teofil Achirei
It’s time to make our focused web crawler aware of its topic: the thesaurus. The simplest way is to have a flat thesaurus: a list of keywords that are related to our topic. This is how we’re going to implement it. In the future we could improve it into a hierarchical thesaurus: a tree or even a graph. Take a look at The 1998 ACM Computing Classification System for an example of a hierarchical thesaurus.
Now let’s implement our flat thesaurus:
// IThesaurus.java
package ro.teo.ssc.thesaurus;

import java.util.List;

/** A topic description: the keywords the crawler should look for. */
public interface IThesaurus {
    void addKeyword(String keyWord);
    List<String> getKeywords();
}

// SimpleThesaurus.java
package ro.teo.ssc.thesaurus;

import java.util.ArrayList;
import java.util.List;

/** A flat thesaurus: a list of unique keywords, shared as a singleton. */
public class SimpleThesaurus implements IThesaurus {
    // Initialization-on-demand holder: the JVM creates INSTANCE lazily and
    // thread-safely, so getInstance() does not need to be synchronized.
    private static class SingletonHolder {
        private static final SimpleThesaurus INSTANCE = new SimpleThesaurus();
    }

    public static SimpleThesaurus getInstance() {
        return SingletonHolder.INSTANCE;
    }

    private List<String> _keywords;

    private SimpleThesaurus() {
        _keywords = new ArrayList<String>();
    }

    @Override
    public void addKeyword(String keyWord) {
        // ignore null, empty and duplicate keywords
        if (keyWord != null && keyWord.length() > 0
                && !_keywords.contains(keyWord)) {
            _keywords.add(keyWord);
        }
    }

    @Override
    public List<String> getKeywords() {
        // we don't want the list to be modified by
        // mistake, so we return a copy of it
        return new ArrayList<String>(_keywords);
    }
}
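The hierarchical thesaurus mentioned above as a future improvement could be sketched as a simple tree, assuming Java 7 or later. The class `ThesaurusNode` and its methods are hypothetical and not part of the project. Each node holds one keyword and a list of narrower terms, and `getKeywords()` flattens the tree so that code written against the flat list keeps working.

```java
import java.util.ArrayList;
import java.util.List;

public class ThesaurusNode {
    private final String keyword;
    private final List<ThesaurusNode> children = new ArrayList<>();

    public ThesaurusNode(String keyword) {
        this.keyword = keyword;
    }

    /** Adds a narrower term under this node and returns it for chaining. */
    public ThesaurusNode addChild(String childKeyword) {
        ThesaurusNode child = new ThesaurusNode(childKeyword);
        children.add(child);
        return child;
    }

    /** Flattens this subtree into a keyword list (parent before children). */
    public List<String> getKeywords() {
        List<String> out = new ArrayList<>();
        collect(out);
        return out;
    }

    private void collect(List<String> out) {
        out.add(keyword);
        for (ThesaurusNode c : children) {
            c.collect(out);
        }
    }

    /** Depth of a keyword below this node: 0 for this node, -1 if absent. */
    public int depthOf(String target) {
        if (keyword.equals(target)) {
            return 0;
        }
        for (ThesaurusNode c : children) {
            int d = c.depthOf(target);
            if (d >= 0) {
                return d + 1;
            }
        }
        return -1;
    }
}
```

The depth could later be used to weight matches, for example by treating deeper (more specific) terms as stronger evidence that a page is on topic. That weighting is a design option, not something the flat implementation provides.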
Posted in programming, web mining | Tagged crawler, focused crawler, thesaurus, web crawler, web mining, web spider | Leave a Comment »
May 11, 2009 by Teofil Achirei
Let’s come back to our simple focused crawler. It’s time to start filtering the links we extract and the pages we download.
We need
- to know more info about the links – the Link class
- to know our topic – the Thesaurus class
- to filter extracted links
- to filter downloaded pages
The first step is to define the Link class:
package ro.teo.ssc.obj;

public class Link {
    // Shared empty instance, created eagerly so that Empty() is thread-safe.
    private static final Link EMPTY = new Link();

    public static Link Empty() {
        return EMPTY;
    }

    /**
     * The URL to which the link is pointing.
     */
    private String href;

    /**
     * The 'title' attribute of the link.
     */
    private String title;

    /**
     * Text marked as link.
     * This is usually the visible text that the user clicks on,
     * or the 'alt' attribute for the image marked as link.
     */
    private String innerText;

    /**
     * Used for ordering links from a web page.
     * This has nothing to do with ranking the page
     * like PageRank or HITS algorithms.
     */
    private int rank;

    public Link() {
        href = "";
        title = "";
        innerText = "";
        rank = 0;
    }

    public Link(String href, String title, String innerHTML) {
        this();
        // the setters replace null with an empty string
        setHref(href);
        setTitle(title);
        setInnerHTML(innerHTML);
    }

    public String getHref() {
        return href;
    }

    public void setHref(String href) {
        // never store null, so toString() cannot throw
        this.href = (href != null) ? href : "";
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = (title != null) ? title : "";
    }

    public String getInnerHTML() {
        return innerText;
    }

    public void setInnerHTML(String innerHTML) {
        this.innerText = (innerHTML != null) ? innerHTML : "";
    }

    public int getRank() {
        return rank;
    }

    public void setRank(int rank) {
        this.rank = rank;
    }

    @Override
    public String toString() {
        return " href(" + href + ")  title(" + title + ")"
                + " innerHTML(" + innerText + ") ";
    }
}
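The next step listed above, filtering extracted links, could look roughly like the sketch below, assuming Java 8 or later. Everything in it is an assumption made for illustration: `LinkFilter`, the nested `SimpleLink` stand-in (used only so the sketch compiles on its own, in place of the `Link` class above), and the rule that a link's rank is the number of thesaurus keywords found in its href, title, or text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LinkFilter {
    /** Minimal stand-in for the Link class, so this sketch is self-contained. */
    static final class SimpleLink {
        final String href;
        final String title;
        final String innerText;
        int rank;

        SimpleLink(String href, String title, String innerText) {
            this.href = href;
            this.title = title;
            this.innerText = innerText;
        }
    }

    /** Counts how many keywords occur in the link's href, title, or text. */
    static int rank(SimpleLink link, List<String> keywords) {
        String haystack = (link.href + " " + link.title + " " + link.innerText)
                .toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String k : keywords) {
            if (haystack.contains(k.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    /** Keeps only links that match at least one keyword, best rank first. */
    static List<SimpleLink> filter(List<SimpleLink> links, List<String> keywords) {
        List<SimpleLink> kept = new ArrayList<>();
        for (SimpleLink link : links) {
            link.rank = rank(link, keywords);
            if (link.rank > 0) {
                kept.add(link);
            }
        }
        kept.sort((a, b) -> Integer.compare(b.rank, a.rank));
        return kept;
    }
}
```

Plain substring matching is the simplest possible choice and will produce false positives (for example, a keyword that appears inside a longer, unrelated word). Tokenizing the text first would be a reasonable refinement.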
Posted in programming, web mining | Tagged crawler, focused crawler, web crawler, web mining, web spider | Leave a Comment »
April 23, 2009 by Teofil Achirei
Some ideas from The Cathedral and the Bazaar – Eric Steven Raymond
- Every good work of software starts by scratching a developer’s personal itch.
- Good programmers know what to write. Great ones know what to rewrite (and reuse).
- “Plan to throw one away; you will, anyhow.” (Fred Brooks, The Mythical Man-Month, Chapter 11)
- If you have the right attitude, interesting problems will find you.
- When you lose interest in a program, your last duty to it is to hand it off to a competent successor.
- Treating your users as co-developers is your least-hassle route to rapid code improvement and effective debugging.
- Release early. Release often. And listen to your customers.
- Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix obvious to someone.
- Smart data structures and dumb code works a lot better than the other way around.
- If you treat your beta-testers as if they’re your most valuable resource, they will respond by becoming your most valuable resource.
- The next best thing to having good ideas is recognizing good ideas from your users. Sometimes the latter is better.
- Often, the most striking and innovative solutions come from realizing that your concept of the problem was wrong.
- Perfection (in design) is achieved not when there is nothing more to add, but rather when there is nothing more to take away.
- Any tool should be useful in the expected way, but a truly great tool lends itself to uses you never expected.
- A security system is only as secure as its secret. Beware of pseudo-secrets.
- To solve an interesting problem, start by finding a problem that is interesting to you.
- Many heads are inevitably better than one.
Posted in other, programming | Tagged Bazaar, Cathedral, Eric Steven Raymond, The Cathedral and the Bazaar | Leave a Comment »
April 16, 2009 by Teofil Achirei
Are we still talking about Java? The language, not the coffee.

Preferred caffeinated beverage
Posted in Uncategorized | Tagged Preferred caffeinated beverage | Leave a Comment »
April 14, 2009 by Teofil Achirei
Let’s see some common web spider architectures:
- the big picture
- a basic web crawler
- and a large-scale web spider

Crawler infrastructure: the big picture

Basic crawler architecture

Large scale web crawler
Posted in programming, web mining | Tagged architecture, craw, crawler, crawler architecture, large scale web crawler, web, web crawler | 1 Comment »