Even a chimp can write code

Saturday, November 05, 2005

RE: Simpler Urls Get Used More Than Hideous Ones

Rob Relyea blogs about how simple URLs (i.e., those without too many parameters) get used and linked to more than complex, hideous ones. That much is undeniable. I wanted to comment on his post but couldn't (I didn't want to register). Besides, I like to hear myself speak as much as the next guy, so I figured a post was in order.

Apart from simple URLs being easier for humans to remember and type, I believe search engines are a big part of why they are so universally used and linked to. Search engines crawl and rank simple URLs more readily; because those URLs rank well, more people find them and link to them, which in turn pushes them even higher in the rankings.

Search engines employ bots to locate and crawl your web pages. These bots fetch pages over HTTP just like a garden-variety browser does: they read a page, follow the links on it, read those pages, and so on ad infinitum. Google, for instance, says:
We're able to index dynamically generated pages. However, because our web crawler could overwhelm and crash sites that serve dynamic content, we limit the number of dynamic pages we index. In addition, our crawlers may suspect that a URL with many dynamic parameters might be the same page as another URL with different parameters. For that reason, we recommend using fewer parameters if possible. Typically, URLs with 1-2 parameters are more easily crawlable than those with many parameters.
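To make that crawl loop concrete, here is a minimal sketch of a bot in Java. Everything about it is illustrative: the seed URL is hypothetical, links are extracted with a crude regex rather than a real HTML parser, and there is no robots.txt handling or politeness delay.

    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Scanner;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TinyCrawler {
        // Crude link extraction; a real bot would use an HTML parser.
        private static final Pattern LINK = Pattern.compile("href=[\"']([^\"']+)[\"']");

        public static void main(String[] args) throws Exception {
            Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
            Set<String> seen = new HashSet<>();          // URLs already visited
            frontier.add("http://example.com/");         // hypothetical seed page

            while (!frontier.isEmpty()) {
                String url = frontier.remove();
                if (!seen.add(url)) continue;            // skip pages we have crawled
                String html;
                try (Scanner in = new Scanner(new URL(url).openStream(), "UTF-8")) {
                    html = in.useDelimiter("\\A").next(); // slurp the whole page
                } catch (Exception e) {
                    continue;                             // unreachable page: move on
                }
                System.out.println("indexed " + url);
                // Follow every link on the page, resolving relative URLs.
                Matcher m = LINK.matcher(html);
                while (m.find()) {
                    try {
                        frontier.add(new URL(new URL(url), m.group(1)).toString());
                    } catch (Exception ignored) { /* skip malformed links */ }
                }
            }
        }
    }

Notice that nothing bounds the loop: the frontier only ever empties if the site's link graph is finite, which is exactly the problem with the calendar example below.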

There is another problem with indexing dynamic content driven by request parameters: bots getting lost on web sites with an effectively infinite number of links or listings. Say I maintain a calendar application on my website where I enter my meetings and appointments. If my calendar's timeline has no upper limit, you can theoretically navigate from Nov 2005 to Dec 2005 to Jan 2006 and so on for centuries. Only dates in the not-so-distant future will have events in them; those pages are palatable for search engines, but everything beyond them is dynamically generated and nearly empty. A typical bot would waste considerable time mired in this black hole, so it is only natural that search engine bots are disinclined to index pages whose data comes from a backend database.
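Here is a minimal sketch of how such a black hole arises, assuming a Java servlet (the class name and URL pattern are hypothetical). Every month page links to the next month whether or not it contains any events, so a bot like the one sketched above can chase ?month=... links for centuries:

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.time.YearMonth;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical servlet mapped to /calendar.
    public class CalendarServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // Render whichever month was asked for, defaulting to the current one.
            String param = req.getParameter("month");              // e.g. "2005-11"
            YearMonth month = (param == null) ? YearMonth.now() : YearMonth.parse(param);

            resp.setContentType("text/html");
            PrintWriter out = resp.getWriter();
            out.println("<h1>Appointments for " + month + "</h1>");
            // ... events for this month would be listed here; most months are empty ...

            // The trap: every page links to yet another dynamically generated
            // page, with no upper bound on the timeline.
            out.println("<a href=\"calendar?month=" + month.plusMonths(1) + "\">next month</a>");
        }
    }

A bot with no depth limit will happily request century after century of empty months.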

For corporations moving their applications to the web, it also makes sense to have links that aren't tightly coupled to the technology used in the backend. I have personally made decisions like this in past projects. For instance, a URL format like http://www.amazon.com/gp/product/B00006AVRK/104-9835185-4567919 is much better than http://www.amazon.com/gp.jsp?product=B00006AVRK&id=104-9835185-4567919. If you later move from JSP/Struts/Java to ASP/C# or Perl, you don't have to comb through your content and change every URL, and you don't have to notify partners and everyone else who links to you of the change. Now, who wants to sign up for that job? Keeping the URLs technology-neutral really pays off. Plus, search engine bots have no idea the information they are indexing comes from a database. I am willing to bet this is why Amazon and others employ this form of URL rewriting.
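As a sketch of how that decoupling might be wired up in the JSP/Struts world mentioned above, here is a hypothetical servlet filter (all names are illustrative; an Apache mod_rewrite rule could do the same job) that maps the clean path onto the JSP that actually serves it:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;

    // Hypothetical filter mapped to /gp/product/* in web.xml.
    public class CleanUrlFilter implements Filter {
        public void init(FilterConfig config) {}
        public void destroy() {}

        public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
                throws IOException, ServletException {
            String path = ((HttpServletRequest) req).getRequestURI();
            if (path.startsWith("/gp/product/")) {
                // /gp/product/B00006AVRK -> internal forward to the JSP; the
                // clean URL is all the browser (or bot) ever sees.
                String productId = path.substring("/gp/product/".length());
                req.getRequestDispatcher("/gp.jsp?product=" + productId).forward(req, resp);
                return;
            }
            chain.doFilter(req, resp); // everything else passes through untouched
        }
    }

Move the backend to Perl or ASP and only this one mapping changes; every published link keeps working, and no bot ever learns what sits behind it.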
