URL Quote #3: Wikipedia’s URLs a reason for their success?
Wednesday, March 7th, 2007
“Wikipedia’s URL spaces are highly elegant; I suspect it’s one of the reasons Wikipedia is successful”
“…one should take an hour or so and really think about their website’s ‘public face.’”
This is quiz #2 of our ongoing URLQuiz series.
In this quiz, there are 26 pairs of URLs (A..Z) and for each pair the question is: “Are these two URLs equivalent?” i.e. do they return the same resource when dereferenced, and “Can they be cached as the same URL?”
You should answer ‘Yes’, ‘No’, or ‘Maybe’ where ‘Maybe’ means ‘It might return the same resource but should not be cached according to the specs.’
To answer, leave a comment and ideally explain your reasoning for each. Feel free to group answers based on your reasoning and/or the answer given (Yes, No, and Maybe). Print it out and take the quiz with pencil and paper if you are serious about getting it right, and feel free to use a computer or browser or whatever to test your results before answering. Good luck!
About the TLD .foo[1]
Clarification (2007-March-02): Some people have stated that the server could possibly return the same resource for any two given URLs, so they felt the answer could never be ‘No.’ I definitely see their point; for example http://mysite.foo/bar and http://mysite.foo/bazz could possibly return the same thing, but nobody would ever reasonably expect them to do so on their own. So let me clarify: I meant a quiz taker to select ‘No’ in the case where the resource returned would definitely not be the same thing unless the developer or server admin explicitly programmed or configured the URLs to behave that way. On the other hand, ‘Maybe’ would be used in the case where someone might reasonably expect the two URLs to return the same resource even though RFC 3986 would define the two URLs as being different, such as in [footnote 2], or when it depends on the O/S of the server, as in [footnote 3]. Regarding fragments, the question is “In a transaction between a client and a server, is the cache allowed to view them as the same?” Regarding which can be cached, I was looking for what is appropriate per the spec, not necessarily whether any particular software in the cloud (i.e. routers, proxies, browsers, etc.) actually does cache, but instead “Would it be allowed to cache?”
P.S. Don’t stress if you can’t answer them all. It took me months to uncover all these nuances, and if I were taking this quiz I doubt I could get them all right in one sitting.
I was planning to blog something else today, but Mark Nottingham of Yahoo made a statement about URL Design [1] in his post entitled REST Issues, Real and Imagined and I simply could not let his statement pass without comment.
But first let me say I always appreciate Mark’s perspective on issues, and enjoy reading his writings whether on his blog or in the mailing lists. His perspective is typically insightful and prescient, and he hovers above the muckraking and unsupported claims that can occur on mailing lists populated by egos. All in all his involvement is very professional, and I highly respect him for that.
That said, here is the comment he made that really bothered me (2nd and subsequent emphasis mine):
Red Herring: URI Design
When somebody first “gets” REST, they often spend an inordinate amount of time agonising over the exact design of the URIs in their application; take a look over on rest-discuss, for example. In the end, though, URI design is mostly a cosmetic issue; sure, it’s evidence that you’ve thought about good resource modelling, and it makes things more human-intelligible, but it’s seldom worth spending so much time on it. I’d worry a lot more about cacheability, extensibility and well-defined formats before blowing out my schedule on well-designed URIs. For me, the high points are broadly exploiting the hierarchy, allowing relative references, and making sure that tools (e.g., HTML forms) can work well with my URIs; everything else is gravy.
I’ll start by saying I don’t significantly disagree with the overall points I believe he was making, such as the fact that other aspects deserve attention in addition to URL design, and the fact that some people appear to thrash when designing URLs. But to someone who does not appreciate URL design, his statement could easily be misconstrued to mean that URI design is not at all important, especially when he says it is mostly a cosmetic issue. That is false.
URI design is NOT merely a cosmetic issue! It has many tangible ramifications and those who ignore it do so at their own peril. Just scratching the surface, proper URL planning and design provides a framework for good information architecture, can facilitate spontaneous inbound linking via blogs, voting, tagging, and other social media sites, and can guard against broken URLs and subsequent traffic loss, to name but a few. These are issues that concern, or should concern, everyone who publishes a site on the world wide web.
But what worries me most are the people who will certainly latch onto Mark’s words as justification not only for ignoring patterns and best practices but also for antagonistically preaching against URL design. This group includes web and content management system developers who would prefer not to be bothered with usability issues, system administrators who believe in security by obscurity (which itself is a fool’s precaution), dogmatists who misinterpret the principle of URI opacity and preach that web publishers should publish completely opaque URLs, opaque even to the web publishers themselves, and a tiny but vocal contingent that, for reasons I cannot fathom, argues against URI design even within totally unrelated conversations. It is for this reason I think Mark’s statement is potentially very damaging.
And as for his comments about rest-discuss, it is quite possible he was referring to conversations in which I participated. If so, I believe he mistook the crux of the conversation on two levels. The first is that in many cases I was advocating URL design, not agonizing over it. Certainly, URI design is really not that hard; it just takes understanding core principles and best practices, and then applying them. The second is that some people escalate conversations into raging debates when simple questions are asked about proper URL usage in the context of REST and Web Architecture. Mark could easily, yet wrongly, have mistaken those for too much hand-wringing over URL design.
In summary, I don’t really fault Mark’s comments as I understand them. And I think Mark does a great service for the web in his quest to educate people about the value of REST. But as Mark is justifiably well respected, I’m worried his comments may be used to rationalize bad URL design. As such, I definitely hope he updates his post to guard against his words’ misuse.
P.S. Oh, and I also couldn’t help but wonder if Mark was trying to get in a cheap dig when he wrote “…before blowing out my schedule on well-designed URIs”…? But naaaaah, Mark’s a real professional and wouldn’t go for such an underhanded shot. ;-)
If you’ve read many of our other posts here at The Well Designed URLs Initiative, you know that we are strong advocates for User-Centered URL Design as well as for URL Literacy. It’s our contention that the URL is woefully under-appreciated as the most fundamentally important technology of the web, more important than HTTP, and even more important than HTML. The purpose of this post is to provide background for future posts explaining URL design importance from a perspective most website owners can appreciate: search engine rankings!
For those unfamiliar with Google’s core algorithm for determining its search engine results, you can read this article to learn about PageRank in more painstaking detail. Here, I’ll just try to explain the aspects of PageRank that relate to URLs. Note also that my explanations are simply meant to be a conceptual guide and not exacting details. The founders of Google did publish their initial algorithms but have since made tweaks that are as closely held a corporate secret as the formula for Classic Coke!
PageRank is essentially a popularity rating, and a page’s PageRank is determined by the inbound links from other pages on the web. A PageRank can be as low as almost zero (0) or as high as ten (10). Google’s algorithms determine a page’s PageRank by taking the voting value of each inbound linking page (derived from that page’s own PageRank), dividing it by the number of outbound links on that page, and then summing the results for all inbound links. Clear as mud, right? It’s easier to explain with an example, but let’s cover a bit more ground first.
PageRank considers each link a ‘vote’ for the page linked to. But unlike in a democratic “one citizen, one vote” society, Google’s algorithm more closely models the shareholders of a corporation voting their shares; the votes of those with “more” (PageRank or shares) have a greater influence on the outcome. So a link from a page with a PageRank of 7 is more valuable than a link from a page of PageRank 3; probably many orders of magnitude more valuable, as you’ll see next.
Because of the nature of the web, a small number of pages have a huge number of inbound links, and vice versa. So those with more links get more PageRank, but PageRank is reported on a logarithmic scale, so the underlying value increases exponentially with each step. Assuming[1] that the base were five (5), the value a page would get to vote based on its PageRank would look like this:
| PageRank | Value |
|---|---|
| 0 | 0 |
| 1 | 5 |
| 2 | 25 |
| 3 | 125 |
| 4 | 625 |
| 5 | 3,125 |
| 6 | 15,625 |
| 7 | 78,125 |
| 8 | 390,625 |
| 9 | 1,953,125 |
| 10 | 9,765,625 |
Assume a site somehow manages to get a persistent link from MySpace’s home page (www.myspace.com). At the moment MySpace’s home page contains about 70 outbound links and has a PageRank of seven (7). Let’s also assume that there are a total of 50 other inbound links, and let’s say the average PageRank for those pages linking in is three (3) and those pages have an average outbound link count of 10. From this, let’s calculate PageRank:
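A sketch of that arithmetic, assuming the base-five values from the table above and the simplified model in which each linking page splits its value evenly across its outbound links (the symbols $V$ and $T$ are shorthand introduced here for illustration, not part of Google’s actual formula):

```latex
\begin{align*}
V(\mathrm{PR}) &= 5^{\mathrm{PR}} \\
\text{MySpace link:}\quad \frac{V(7)}{70} &= \frac{78{,}125}{70} \approx 1{,}116 \\
\text{50 other links:}\quad 50 \times \frac{V(3)}{10} &= 50 \times \frac{125}{10} = 625 \\
T &\approx 1{,}116 + 625 = 1{,}741 \\
V(4) = 625 \;\le\; T &< 3{,}125 = V(5)
\end{align*}
```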
Looking it up in the table, the resultant PageRank for the home page is four (4).
As with the three ‘L’s of real estate, the three ‘P’s of inbound links are: PageRank, PageRank, PageRank! [2] Note how in the prior example the 50 inbound links of PR3 offered less PageRank than the one (1) inbound link from MySpace with PR7! Of course we don’t know the logarithmic base [1] but Phil Craven says 5 or 6 are what many people believe it to be.
Here is what it would look like with base two (2) through ten (10) (download the full calculations here as a zipped Excel 2003 file [4kb]):
| Logarithmic Base | Value from PR3 * 50 / 10 | Value from PR7 * 1 / 70 | Total Value | Resultant PageRank |
|---|---|---|---|---|
| 2 | 40 | 2 | 42 | 5 |
| 3 | 135 | 31 | 166 | 4 |
| 4 | 320 | 234 | 554 | 4 |
| 5 | 625 | 1,116 | 1,741 | 4 |
| 6 | 1,080 | 3,999 | 5,079 | 4 |
| 7 | 1,715 | 11,765 | 13,480 | 4 |
| 8 | 2,560 | 29,959 | 32,519 | 4 |
| 9 | 3,645 | 68,328 | 71,973 | 5 |
| 10 | 5,000 | 142,857 | 147,857 | 5 |
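For anyone who would rather not open the spreadsheet, here is a minimal Python sketch of the same calculation. It assumes the simplified model used in this post (a page’s voting value is the base raised to its PageRank, with PageRank 0 worth nothing, split evenly across its outbound links); the function names are my own and this is a conceptual guide, not Google’s actual algorithm.

```python
def vote_value(pagerank, base):
    """Voting value under the post's simplified model: base ** PageRank, with PR 0 worth 0."""
    return 0 if pagerank == 0 else base ** pagerank


def resultant_pagerank(total, base):
    """Largest PageRank (capped at 10) whose voting value does not exceed the total."""
    pr = 0
    while pr < 10 and vote_value(pr + 1, base) <= total:
        pr += 1
    return pr


def scenario(base):
    """The hypothetical example: 50 PR3 pages with 10 outbound links each,
    plus one PR7 page (MySpace) with 70 outbound links."""
    others = vote_value(3, base) * 50 / 10
    myspace = vote_value(7, base) * 1 / 70
    total = others + myspace
    return round(others), round(myspace), round(total), resultant_pagerank(total, base)


# One row per logarithmic base, in the same column order as the table.
for base in range(2, 11):
    print(base, *scenario(base))
```

Changing the link counts or PageRanks inside `scenario` is an easy way to see how quickly a single high-PageRank link comes to dominate as the base grows.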
So depending on the logarithmic base, PageRank fluctuates between four (4) and five (5) for this hypothetical example. However, starting with a logarithmic base of five (5) the one MySpace link overpowers the 50 others! And because pages with a PageRank closer to 10 are listed higher in Google’s search engine results page among competing pages, people focused on SEO are always trying to increase their page’s PageRank, often via unscrupulous means.
Of course nobody outside Google knows the exact formula or base exponents used, but hopefully this post illustrates the value of links from high PageRank pages.
However, I would be remiss if I didn’t point out that a single-minded focus on inbound links is fraught with peril, not least because it might cause your pages to be removed from Google’s index! Just as there are people selling weight loss products they claim don’t require dieting or exercise, there are people offering ways to get inbound links that don’t require having real people link to you. However, Google considers these shortcuts to be gaming the system and is ever vigilant to discover the cheaters. If you are caught cheating, Google will ban your pages from their index without notice.
The best way to gain inbound links for your key pages on your website is to do the hard work of creating a site with great content that people want to link to.
So as an epilogue, getting inbound links is clearly necessary for high PageRank and thus good search engine results, but all those inbound links can be squandered without a good architecture and site management plan. To ensure that a site’s great content and popularity get reflected in appropriately high search engine ranking it’s critical to optimize the architecture of the site, the pages, and the URL structure as well as make plans for how the URL structure might change over time.
Personally, I think the most neglected aspect of white hat SEO [3] is URL planning and design, especially for larger websites. There are very few tools [4] besides low-level and rather simplistic URL rewriters like mod_rewrite for creating and maintaining a URL plan, very few articles [4] that discuss URL design, and no articles [4] I am aware of that discuss URL planning.
However, I believe website owners will see huge improvements compared to their prior rankings if they focus on URL design and create a URL management plan. The good news is that URL design is mostly a one-time endeavor, assuming site maintainers adhere to the management plan, at least until there is a full site rearchitecture.
But all of the whys and wherefores regarding URL planning and design are beyond the scope of this post, and instead will be the subject of many posts in the future. Stay subscribed!
And, as I stated at the start of this post, you can learn more about the PageRank formula here, and you can also google for PageRank to get a large list of other resources.
Jon Udell has a post today entitled Divergent citation-indexing paths. Funny that he wrote about this; it seems he and I are on such a parallel trajectory these days. For evidence, take a look at my post from last week that I titled Lessons Learned from Delicious Praise.
In his post Jon states “Del.icio.us, unlike Bloglines, treats the URLs that you feed to its citation counter in a case-sensitive way.” I wonder if Jon is aware of Tim Berners-Lee’s URI Opacity Axiom? A correct reading of the URI Opacity Axiom reveals that Bloglines is in the wrong and Del.icio.us is in the right in this case; i.e. URLs are case-sensitive and programs SHOULD NOT[1] infer that two URLs with different casing point to the same resource. (NOTE: case doesn’t matter with domain names but it DOES matter in URL paths.)
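To make the note about case concrete, here is a minimal Python sketch of how a citation counter might compare two URLs without guessing about path casing. It lowercases only the scheme and host and leaves the path, query, and fragment untouched. The function names are my own invention, and the sketch assumes simple URLs with no userinfo and no IPv6 literals; it is not how Del.icio.us or Bloglines actually work.

```python
from urllib.parse import urlsplit, urlunsplit


def case_normalize(url):
    """Lowercase only the parts of a URL that are case-insensitive (scheme and host)."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit exposes the host already lowercased
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Path, query, and fragment are passed through exactly as given.
    return urlunsplit((parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment))


def same_citation(url_a, url_b):
    """Treat two URLs as the same citation only if they match after case normalization."""
    return case_normalize(url_a) == case_normalize(url_b)


print(same_citation("HTTP://Example.COM/Post", "http://example.com/Post"))
print(same_citation("http://example.com/Post", "http://example.com/post"))
```

The first pair differs only in scheme and host casing, so it compares equal; the second pair differs in path casing, so it is kept apart.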
As a matter of fact, the URI Opacity Axiom is such a closely held belief among those I like to call “the Weborati” that if you even question it on certain W3C or related mailing lists, just to understand it, you’ll be in for a firestorm as if you had blasphemed the messiah! ;-0
All kidding aside, especially since some of those people who hold the URI Opacity Axiom dear read this blog (!), after spending the time to research it and really learn it I came to believe that it is a very good idea for people to follow the URI Opacity Axiom. And I’ll discuss why in the future when I have more time. Unfortunately, as with many principled concepts, some people have elevated the URI Opacity Axiom to the level of dogma, and many of those who preach it preach a distorted version of what it really means.
So Jon identified a real-world problem that following the URI Opacity Axiom introduces, yet it is somewhat of a “catch-22”: following the axiom creates real-world problems, but not following it creates other real-world problems. But longer term, I really don’t think it has to be this way, and I’m working on ideas to address this issue that may turn into draft proposals or recommendations or something else. Basically I think with some “layers” of technology added to the web we could have the best of both worlds.
As an aside, Del.icio.us could update links based upon 301 redirects, and then website owners could 301 redirect to a(n ideally lowercased) canonical URL whenever their server receives a request for a URL that is not in the canonical URL format. This assumes of course that the website owner/server operator has chosen for their URLs to effectively be case-insensitive[2].
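As a rough illustration of that server-side idea, here is a small Python sketch of the redirect decision. It assumes the site owner has chosen an all-lowercase path as the canonical form; the function name is hypothetical, and a real site would wire this into its web server or framework rather than call it directly.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_redirect(requested_url):
    """Return (301, canonical_url) when the path is not already lowercase, else None.

    Assumption: the site treats its paths as case-insensitive and uses
    the lowercased path as the canonical form.
    """
    parts = urlsplit(requested_url)
    canonical_path = parts.path.lower()
    if parts.path == canonical_path:
        return None  # already canonical, serve the page normally
    canonical_url = urlunsplit(
        (parts.scheme, parts.netloc, canonical_path, parts.query, parts.fragment)
    )
    return 301, canonical_url


print(canonical_redirect("http://example.com/Blog/My-Post?id=7"))
print(canonical_redirect("http://example.com/blog/my-post"))
```

The query string is deliberately left alone here, since its values are often case-sensitive data rather than part of the site’s naming scheme.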