Some websites use various techniques to insert links into their pages that are meant for web bots rather than for users.
These techniques include CSS hiding, JavaScript hiding and JavaScript link removal.
Some CSS examples:
<a href="URL1" style="display:none">Hidden Link 1</a>
<a href="URL1" class="hiddenLink">Hidden Link 1</a>
<a href="URL2">
Hidden Link 2
</a>
<div style="display:none;"> .. <a href="URL3">Hidden Link 3</a> .. </div>
<p style="visibility:hidden;"> .. <a href="URL4">Hidden Link 4</a> </p>
<a style="color:white;background-color:white;" href="URL5">
Hidden Link 5
</a>
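A crawler could catch some of the inline-style cases above with a simple heuristic. The following is only a minimal sketch, assuming Python and its standard `html.parser` module; the names `HiddenLinkFinder`, `find_links`, `parse_style` and `hides_content`, as well as the `VISIBLE` URL in the sample, are made up for illustration. It looks only at `style` attributes, so a link hidden through a class such as `hiddenLink` would go unnoticed unless the stylesheet is analyzed too.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def parse_style(style):
    """Turn an inline style string into a {property: value} dict."""
    props = {}
    for decl in (style or "").split(";"):
        name, sep, value = decl.partition(":")
        if sep:
            props[name.strip().lower()] = value.strip().lower()
    return props


def hides_content(props):
    """Heuristic: True if these inline declarations hide the element."""
    if props.get("display") == "none":
        return True
    if props.get("visibility") == "hidden":
        return True
    color = props.get("color")
    # Same foreground and background color: the text is there but unreadable.
    return color is not None and color == props.get("background-color")


class HiddenLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, hidden?) for every currently open element
        self.hidden = []   # hrefs of links judged hidden
        self.visible = []  # hrefs of all other links

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Simplification: a hidden ancestor hides everything inside it.
        parent_hidden = bool(self.stack) and self.stack[-1][1]
        hidden = parent_hidden or hides_content(parse_style(attrs.get("style")))
        if tag == "a" and attrs.get("href"):
            (self.hidden if hidden else self.visible).append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_links(html):
    """Return (hidden_hrefs, visible_hrefs) for an HTML string."""
    finder = HiddenLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden, finder.visible


sample = """
<a href="URL1" style="display:none">Hidden Link 1</a>
<div style="display:none;"> <a href="URL3">Hidden Link 3</a> </div>
<p style="visibility:hidden;"> <a href="URL4">Hidden Link 4</a> </p>
<a style="color:white;background-color:white;" href="URL5">Hidden Link 5</a>
<a href="VISIBLE">A normal link</a>
"""
hidden, visible = find_links(sample)
print(hidden)   # ['URL1', 'URL3', 'URL4', 'URL5']
print(visible)  # ['VISIBLE']
```

The color check only compares the two values as written, so `white` against `#fff` would slip through; a more serious attempt would have to normalize colors and take inherited backgrounds into account.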
Some JavaScript examples:
<script type="text/javascript">
document.write("<div style='display:none;'>");
</script>
<a href="URL6">Hidden Link 6</a>
<script type="text/javascript">
document.write("</div>");
</script>
<div id="aDiv">
<a href="URL7" id="aLink">Hidden Link 7</a>
</div>
<script type="text/javascript">
var aLink=document.getElementById("aLink");
var aDiv=document.getElementById("aDiv");
aDiv.removeChild(aLink);
</script>
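The JavaScript cases are harder because a crawler that does not execute scripts never sees the hiding happen. As a rough illustration, assuming Python's standard `html.parser` (the names `LinkExtractor`, `extract_links` and `SAMPLE` are invented for this sketch), a static parse of the two examples above still reports both links as ordinary ones:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags without running any script."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# The two JavaScript examples from the post, unchanged.
SAMPLE = """
<script type="text/javascript">
document.write("<div style='display:none;'>");
</script>
<a href="URL6">Hidden Link 6</a>
<script type="text/javascript">
document.write("</div>");
</script>
<div id="aDiv">
<a href="URL7" id="aLink">Hidden Link 7</a>
</div>
<script type="text/javascript">
var aLink=document.getElementById("aLink");
var aDiv=document.getElementById("aDiv");
aDiv.removeChild(aLink);
</script>
"""

# Script bodies are treated as plain text, so the written <div> is never seen.
print(extract_links(SAMPLE))  # ['URL6', 'URL7']
```

Telling these links apart from visible ones would probably require rendering the page with a script-capable engine and comparing the resulting DOM with the raw markup.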
For crawlers this is a very big issue:
- those links may increase the rank of other pages
- following those links consumes resources: network bandwidth, storage space, processing time
And there is little we can do about it.
Our little focused web crawler will not care about these issues. But it’s something we must keep in mind when we design a more elaborate crawler.
I am going to start a project about style-sheet extraction, CSS analyzing and CSS optimizing. Hopefully this project will succeed in August. It will be open source, and when I finish it, I’ll publish some posts. I can’t tell you more about this. Sorry, but it is bad luck for me to talk about projects I haven’t started yet.
If you have any ideas regarding hidden link detection, style extraction, etc., feel free to contact me.
Google and some of the other more popular engines do something like this: they look at your page, identify search terms (synonyms etc.), then group them together for relevance. They do not really know what the terms mean, but they do view them as a set. When someone else links to your site, or your site links to them, the crawlers compare these sets to determine the relevance of the pages between them, and for subsequent ranking calculations. I forget the term for this sort of technique.
Well, I suppose this does not really solve your crawler problem…
@7kittens: … clustering, k-means? I’m not sure this is what you are talking about. Thanks for your comment, I’ll dig a little more into this.
Oh by the way, there is a sourcecode tag to format and color code your example code, like so:
<script type="text/javascript">
document.write("<div style='display:none'>");
</script>
<a href="URL6" rel="nofollow">Hidden Link 6</a>
<script type="text/javascript">
document.write("</div>");
</script>
Hope this helps.
Thank you very much! I had been searching for something like that for a long time. I tried the code tag, but it looked awful with my current theme.
It really helps!
I’ll try to update my posts to replace the old pre tags with sourcecode.
You’re welcome. Post looks great.
You might be interested in learning about Crawling Web 2.0 Applications. I have cited two papers in the wiki article on Web Crawlers. If you get more information, do drop me an email and update the same wiki article for the rest of the world.
[...] am going to start a project about style-sheet extraction, CSS analyzing and CSS optimizing” here? Well, I was talking about the style support for XOffice. This page has more information about this [...]