Web content mining is an interesting and wide domain. Almost anyone can modify one or several modules of a simple web mining system (like the one we’re building) and create a whole different approach to a more specific problem. There are two things that are very often modified (improved):
- the link extractor module
- the document clustering/classification module

The most used techniques for link extraction (the crawling algorithm) are: Breadth-First, Best-First, PageRank, Shark-Search, and InfoSpiders. In the simple focused web spider that I’m building I am using the Best-First technique: the links that best match the thesaurus are added to the URL queue first.
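
To make the Best-First idea concrete, here is a minimal sketch of a URL queue that always hands out the highest-scoring link first. The names `BestFirstFrontier`, `ScoredUrl`, `offer` and `next` are illustrative assumptions, not the classes used in this series, and the score is assumed to be something like the number of thesaurus keywords a link matches.

```java
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch of a Best-First URL queue; not the series' actual code.
final class BestFirstFrontier {

	// A URL together with the score that decides its place in the queue.
	static final class ScoredUrl implements Comparable<ScoredUrl> {
		final String url;
		final int score;

		ScoredUrl(String url, int score) {
			this.url = url;
			this.score = score;
		}

		@Override
		public int compareTo(ScoredUrl other) {
			// reversed comparison: the higher score comes out first
			return Integer.compare(other.score, this.score);
		}
	}

	private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<ScoredUrl>();
	private final Set<String> seen = new HashSet<String>();

	// Queues a URL unless it is off-topic (score <= 0) or was queued before.
	boolean offer(String url, int score) {
		if (score <= 0 || !seen.add(url)) return false;
		queue.add(new ScoredUrl(url, score));
		return true;
	}

	// Returns the best URL still waiting, or null when the queue is empty.
	String next() {
		ScoredUrl best = queue.poll();
		return best == null ? null : best.url;
	}

	public static void main(String[] args) {
		BestFirstFrontier frontier = new BestFirstFrontier();
		frontier.offer("http://example.com/a", 0); // off-topic, dropped
		frontier.offer("http://example.com/b", 1);
		frontier.offer("http://example.com/c", 2);
		System.out.println(frontier.next()); // the link with the higher score
		System.out.println(frontier.next());
	}
}
```

A Breadth-First crawler would use a plain FIFO queue in the same place; swapping the queue type is essentially the difference between the two strategies.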

The second module that is often improved in a web mining system is the document clustering/classification module. Clustering and classification are not the same thing: classification assigns documents to predefined categories learned from labeled examples, while clustering groups documents by similarity without any predefined labels.

I am glad to announce that I am participating in Google Summer of Code ™ 2009 with the XWiki organization. My project is related to XOffice – the XWiki integration with MS Office:

  1. adding style support to XWord
  2. porting XOffice to MS Office 2003

Remember when I said that “I am going to start a project about style-sheet extraction, CSS analyzing and CSS optimizing” here? Well, I was talking about the style support for XOffice. This page has more information about this issue.

So, by August, XWiki will have integration with various Office platforms:

  • MS Office 2007 – it exists; I’ll add style support
  • MS Office 2003 – I’ll port XOffice
  • Open Office – Cristina Scheau will do it, also during Google Summer of Code

Please visit http://xoffice.xwiki.org for more information, the roadmap and the current status of the XOffice project.

Some links

  1. Google Summer of Code site
  2. XWiki at Google Summer of Code
  3. XWiki projects for Google Summer of Code
  4. My myxwiki.org wiki

The Thesaurus

It’s time to make our focused web crawler aware of its topic: the thesaurus. The simplest way is to have a flat thesaurus: a list of keywords that are related to our topic. And this is how we’re going to implement it. In the future we could improve it into a hierarchical thesaurus: a tree or even a graph. Take a look at The 1998 ACM Computing Classification System for an example of a hierarchical thesaurus.

Now let’s implement our flat thesaurus:

package ro.teo.ssc.thesaurus;
import java.util.List;

public interface IThesaurus {
	void addKeyword(String keyWord);
	List<String> getKeywords();
}

package ro.teo.ssc.thesaurus;

import java.util.ArrayList;
import java.util.List;

public class SimpleThesaurus implements IThesaurus{
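	// Initialization-on-demand holder: the JVM creates INSTANCE lazily
	// and thread-safely the first time SingletonHolder is loaded,
	// so getInstance() does not need to be synchronized.
	private static class SingletonHolder {
		private static final SimpleThesaurus INSTANCE=new SimpleThesaurus();
	}

	public static SimpleThesaurus getInstance(){
		return SingletonHolder.INSTANCE;
	}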

	private List<String> _keywords;

	private SimpleThesaurus(){
		_keywords=new ArrayList<String>();
	}

	@Override
	public void addKeyword(String keyWord) {
		//ignore null, empty and duplicate keywords
		if (keyWord!=null && keyWord.length()>0
				&& !_keywords.contains(keyWord)){
			_keywords.add(keyWord);
		}
	}

	@Override
	public List<String> getKeywords() {
		//we don't want the list to be modified by
		//mistake, so we return a copy of it
		return new ArrayList<String>(_keywords);
	}

}
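
The hierarchical thesaurus mentioned earlier could be sketched as a simple tree of terms. `ThesaurusNode`, `addChild`, `depthOf` and `flatten` are hypothetical names, and the sample terms are made up for illustration; nothing here is part of the `IThesaurus` interface above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a tree-shaped thesaurus; not part of the code above.
final class ThesaurusNode {
	private final String term;
	private final List<ThesaurusNode> children = new ArrayList<ThesaurusNode>();

	ThesaurusNode(String term) {
		this.term = term;
	}

	// Adds a narrower term below this one and returns the new node.
	ThesaurusNode addChild(String narrowerTerm) {
		ThesaurusNode child = new ThesaurusNode(narrowerTerm);
		children.add(child);
		return child;
	}

	// Depth of a term below this node (0 = this node), or -1 if absent.
	// A deeper match could be treated as a more specific hit.
	int depthOf(String wanted) {
		if (term.equalsIgnoreCase(wanted)) return 0;
		for (ThesaurusNode child : children) {
			int depth = child.depthOf(wanted);
			if (depth >= 0) return depth + 1;
		}
		return -1;
	}

	// Flattens the tree into the keyword list a flat thesaurus would hold.
	List<String> flatten() {
		List<String> all = new ArrayList<String>();
		collect(all);
		return all;
	}

	private void collect(List<String> into) {
		into.add(term);
		for (ThesaurusNode child : children) child.collect(into);
	}

	public static void main(String[] args) {
		ThesaurusNode root = new ThesaurusNode("information systems");
		ThesaurusNode retrieval = root.addChild("information retrieval");
		retrieval.addChild("web crawler");
		retrieval.addChild("clustering");
		System.out.println(root.depthOf("web crawler"));
		System.out.println(root.flatten());
	}
}
```

Because `flatten` yields a plain keyword list, a tree like this could still feed the flat thesaurus until the crawler is ready to use the hierarchy itself.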

The Link

Let’s come back to our simple focused crawler. It’s time to start filtering the links we extract and the pages we download.
We need

  1. to know more info about the links – the Link class
  2. to know our topic – the Thesaurus class
  3. to filter the extracted links
  4. to filter the downloaded pages

The first step is to define the Link class:

package ro.teo.ssc.obj;

public class Link {
	// Shared placeholder for "no link". It is created eagerly so that
	// concurrent callers always get the same instance.
	private static final Link _empty=new Link();
	public static Link Empty(){
		return _empty;
	}

	/**
	 * The URL to which the link is pointing.
	 */
	private String href;
	/**
	 * The 'title' attribute of the link.
	 */
	private String title;
	/**
	 * Text marked as link.
	 * This is usually the visible text that the user clicks on,
	 * or the 'alt' attribute of an image marked as link.
	 */
	private String innerText;

	/**
	 * Used for ordering links from a web page.
	 * This has nothing to do with ranking the page
	 * like PageRank or HITS algorithms.
	 */
	private int rank;

	public Link(){
		href="";
		title="";
		innerText="";
		rank=0;
	}

	public Link(String href, String title, String innerHTML) {
		this();
		if (href!=null)
			this.href = href;
		if (title!=null)
			this.title = title;
		if (innerHTML!=null)
			this.innerText = innerHTML;
		rank=0;
	}

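	public String getHref() {
		return href;
	}

	// Setters ignore null, like the constructor does, so the fields
	// are never null (toString relies on this).
	public void setHref(String href) {
		if (href!=null)
			this.href = href;
	}

	public String getTitle() {
		return title;
	}

	public void setTitle(String title) {
		if (title!=null)
			this.title = title;
	}

	public String getInnerHTML() {
		return innerText;
	}

	public void setInnerHTML(String innerHTML) {
		if (innerHTML!=null)
			this.innerText = innerHTML;
	}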

	public int getRank() {
		return rank;
	}

	public void setRank(int rank) {
		this.rank = rank;
	}

	@Override
	public String toString() {
		// appending an empty string is a no-op,
		// so no length checks are needed
		return " href(" + href + ") "
			+ " title(" + title + ")"
			+ " innerHTML(" + innerText + ") ";
	}

}
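
With a link's text and a flat keyword list in hand, the "filter extracted links" step could look roughly like the sketch below. To stay self-contained it uses a cut-down `SimpleLink` stand-in rather than the `Link` class above, and `LinkFilter`, `rank` and `keepRelevant` are assumed names. Counting keyword hits in the title and link text is just one possible scoring rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: ranks links by keyword hits and drops the rest.
final class LinkFilter {

	// Cut-down stand-in for the Link class, kept here for self-containment.
	static final class SimpleLink {
		final String href;
		final String title;
		final String innerText;
		int rank;

		SimpleLink(String href, String title, String innerText) {
			this.href = href;
			this.title = title;
			this.innerText = innerText;
		}
	}

	// Assumed rule: one point per keyword found in the title or link text.
	static int rank(SimpleLink link, List<String> keywords) {
		String text = (link.title + " " + link.innerText).toLowerCase();
		int hits = 0;
		for (String keyword : keywords) {
			if (text.contains(keyword.toLowerCase())) hits++;
		}
		return hits;
	}

	// Keeps links with at least one hit, best-ranked first.
	static List<SimpleLink> keepRelevant(List<SimpleLink> links, List<String> keywords) {
		List<SimpleLink> kept = new ArrayList<SimpleLink>();
		for (SimpleLink link : links) {
			link.rank = rank(link, keywords);
			if (link.rank > 0) kept.add(link);
		}
		Collections.sort(kept, new Comparator<SimpleLink>() {
			@Override
			public int compare(SimpleLink a, SimpleLink b) {
				return Integer.compare(b.rank, a.rank);
			}
		});
		return kept;
	}

	public static void main(String[] args) {
		List<SimpleLink> links = new ArrayList<SimpleLink>();
		links.add(new SimpleLink("http://example.com/a", "Recipes", "Cooking at home"));
		links.add(new SimpleLink("http://example.com/b", "", "A web crawler"));
		links.add(new SimpleLink("http://example.com/c", "Spider design", "Building a crawler"));
		for (SimpleLink link : keepRelevant(links, Arrays.asList("crawler", "spider"))) {
			System.out.println(link.rank + " " + link.href);
		}
	}
}
```

Plain substring matching is deliberately naive: it would also count "crawler" inside an unrelated longer word, so a real filter would probably tokenize the text first.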

Some ideas from The Cathedral and the Bazaar – Eric Steven Raymond

  1. Every good work of software starts by scratching a developer’s personal itch.
  2. Good programmers know what to write. Great ones know what to rewrite (and reuse).
  3. “Plan to throw one away; you will, anyhow.” (Fred Brooks, The Mythical Man-Month, Chapter 11)
  4. If you have the right attitude, interesting problems will find you.
  5. When you lose interest in a program, your last duty to it is to hand it off to a competent successor.
  6. Treating your users as co-developers is your least-hassle route to rapid code improvement and effective debugging.
  7. Release early. Release often. And listen to your customers.
  8. Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix obvious to someone.
  9. Smart data structures and dumb code works a lot better than the other way around.
  10. If you treat your beta-testers as if they’re your most valuable resource, they will respond by becoming your most valuable resource.
  11. The next best thing to having good ideas is recognizing good ideas from your users. Sometimes the latter is better.
  12. Often, the most striking and innovative solutions come from realizing that your concept of the problem was wrong.
  13. Perfection (in design) is achieved not when there is nothing more to add, but rather when there is nothing more to take away.
  14. Any tool should be useful in the expected way, but a truly great tool lends itself to uses you never expected.
  15. A security system is only as secure as its secret. Beware of pseudo-secrets.
  16. To solve an interesting problem, start by finding a problem that is interesting to you.
  17. Many heads are inevitably better than one.

Preferred what?

Are we still talking about Java? The language, not the coffee.

Preferred caffeinated beverage

Let’s see some common web spider architectures:

  1. the big picture
  2. a basic web crawler
  3. and a large-scale web spider

Crawler infrastructure: the big picture

Basic crawler architecture

Large scale web crawler
