URL Quote #2: Think about your website’s “public face.”
Saturday, March 3rd, 2007
“…one should take an hour or so and really think about their website’s ‘public face.’”
If you’ve read many of our other posts here at The Well Designed URLs Initiative, you know that we are strong advocates for User-Centered URL Design as well as for URL Literacy. It’s our contention that the URL is woefully under-appreciated as the most fundamentally important technology of the web, more important than HTTP, and even more important than HTML. The purpose of this post is to provide background for future posts that explain the importance of URL design from a perspective most website owners can appreciate: search engine rankings!
For those unfamiliar with Google’s core algorithm for determining its search engine results, you can read this article to learn about PageRank in more painstaking detail. Here, I’ll just try to explain the aspects of PageRank that relate to URLs. Note also that my explanations are meant to be a conceptual guide, not exacting details. The founders of Google did publish their initial algorithms but have since made tweaks that are as closely held a corporate secret as the formula for Classic Coke!
PageRank is essentially a popularity rating, and a page’s PageRank is determined by the inbound links from other pages on the web. A PageRank can be as low as almost zero (0) or as high as ten (10). Google’s algorithms determine a page’s PageRank by taking the value of each inbound linking page’s PageRank, dividing it by the number of outbound links on that page, and then summing the results for all inbound links. Clear as mud, right? It’s easier to explain with an example, but let’s cover a bit more ground first.
PageRank considers each link a ‘vote’ for the page linked to. But unlike in a democratic “one citizen, one vote” society, Google’s algorithm more closely models the shareholders of a corporation voting their shares; the votes of those with “more” (PageRank or shares) have a greater influence on the outcome. So a link from a page with a PageRank of 7 is more valuable than a link from a page of PageRank 3; probably many orders of magnitude more valuable, as you’ll see next.
Because of the nature of the web, a small number of pages have a huge number of inbound links, and vice versa. So those with more links get more PageRank, but because PageRank is on a logarithmic scale, the underlying value increases exponentially with each step up. Assuming[1] that the base were five (5), the value a page would get to vote based on its PageRank would look like this:
| PageRank | Value |
|---|---|
| 0 | 0 |
| 1 | 5 |
| 2 | 25 |
| 3 | 125 |
| 4 | 625 |
| 5 | 3,125 |
| 6 | 15,625 |
| 7 | 78,125 |
| 8 | 390,625 |
| 9 | 1,953,125 |
| 10 | 9,765,625 |
Assume a site somehow manages to get a persistent link from MySpace’s home page (www.myspace.com). At the moment MySpace’s home page contains about 70 outbound links and has a PageRank of seven (7). Let’s also assume that there are a total of 50 other inbound links, and let’s say the average PageRank for those pages linking in is three (3) and those pages have an average outbound link count of 10. From this, let’s calculate PageRank:
50 links × 125 (the value of PR3) ÷ 10 outbound links = 625
1 link × 78,125 (the value of PR7) ÷ 70 outbound links ≈ 1,116
Total value ≈ 1,741
Looking it up in the table, 1,741 falls between the values for PageRank 4 (625) and PageRank 5 (3,125), so the resultant PageRank for the home page is four (4).
As with the three ‘L’s of real estate, the three ‘P’s of inbound links are: PageRank, PageRank, PageRank! [2] Note how in the prior example the 50 inbound links of PR3 offered less PageRank than the one (1) inbound link from MySpace with PR7! Of course we don’t know the logarithmic base [1] but Phil Craven says 5 or 6 are what many people believe it to be.
Here is what it would look like with base two (2) through ten (10) (download the full calculations here as a zipped Excel 2003 file [4kb]):
| Logarithmic Base | Value from PR3 * 50 / 10 | Value from PR7 * 1 / 70 | Total Value | Resultant PageRank |
|---|---|---|---|---|
| 2 | 40 | 2 | 42 | 5 |
| 3 | 135 | 31 | 166 | 4 |
| 4 | 320 | 234 | 554 | 4 |
| 5 | 625 | 1,116 | 1,741 | 4 |
| 6 | 1,080 | 3,999 | 5,079 | 4 |
| 7 | 1,715 | 11,765 | 13,480 | 4 |
| 8 | 2,560 | 29,959 | 32,519 | 4 |
| 9 | 3,645 | 68,328 | 71,973 | 5 |
| 10 | 5,000 | 142,857 | 147,857 | 5 |
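For anyone who would rather check the arithmetic than open a spreadsheet, here is a minimal Python sketch of the same conceptual model. The function names are my own for illustration, and the model itself (a vote is worth the base raised to the PageRank, divided by the linking page's outbound links) is the simplification used in this post, not Google's actual formula. Note that the first table above lists a value of 0 for PageRank 0, where raising the base to the power zero would give 1; the example does not depend on that row.

```python
def vote_value(pagerank: int, base: int) -> int:
    """Conceptual value of a page's vote: the base raised to its PageRank."""
    return base ** pagerank


def resultant_pagerank(total: float, base: int) -> int:
    """Highest PageRank whose table value does not exceed the total."""
    pr = 0
    while base ** (pr + 1) <= total:
        pr += 1
    return pr


def example_row(base: int) -> tuple:
    """One table row for the hypothetical example in this post."""
    from_pr3 = vote_value(3, base) * 50 / 10  # 50 links from PR3 pages, 10 outbound links each
    from_pr7 = vote_value(7, base) * 1 / 70   # 1 link from a PR7 page with 70 outbound links
    total = from_pr3 + from_pr7
    return (round(from_pr3), round(from_pr7), round(total),
            resultant_pagerank(total, base))


if __name__ == "__main__":
    for base in range(2, 11):
        print(base, *example_row(base))
```

Running it prints one line per base, in the same column order as the table.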
So depending on the logarithmic base, PageRank fluctuates between four (4) and five (5) for this hypothetical example. However, starting with a logarithmic base of five (5) the one MySpace link overpowers the 50 others! And because pages with a PageRank closer to 10 are listed higher in Google’s search engine results page among competing pages, people focused on SEO are always trying to increase their page’s PageRank, often via unscrupulous means.
Of course nobody outside Google knows the exact formula or base exponents used, but hopefully this post illustrates the value of links from high PageRank pages.
However, I would be remiss if I didn’t point out that a single-minded focus on inbound links is fraught with peril, not least because it might cause your pages to be removed from Google’s index! Just as there are people selling weight loss products they claim don’t require dieting or exercise, there are people offering ways to get inbound links that don’t require having real people link to you. However, Google considers these shortcuts to be gaming the system and is ever vigilant to discover the cheaters. If you are caught cheating, Google will ban your pages from its index without notice.
The best way to gain inbound links for your key pages on your website is to do the hard work of creating a site with great content that people want to link to.
So as an epilogue, getting inbound links is clearly necessary for high PageRank and thus good search engine results, but all those inbound links can be squandered without a good architecture and site management plan. To ensure that a site’s great content and popularity get reflected in appropriately high search engine ranking it’s critical to optimize the architecture of the site, the pages, and the URL structure as well as make plans for how the URL structure might change over time.
Personally, I think the most neglected aspect of white hat SEO [3] is URL planning and design, especially for larger websites. There are very few tools [4] for creating and maintaining a URL plan besides low-level and effectively simplistic URL rewriters like mod_rewrite, very few articles [4] that discuss URL design, and no articles [4] I am aware of that discuss URL planning.
However, I believe website owners will see huge improvements compared to their prior rankings if they focus on URL design and create a URL management plan. The good news is that URL design is mostly a one-time endeavor, assuming site maintainers adhere to the management plan, at least until there is a full site rearchitecture.
But all of the whys and wherefores regarding URL planning and design are beyond the scope of this post, and instead will be the subject of many posts in the future. Stay subscribed!
And, as I stated at the start of this post, you can learn more about the PageRank formula here, and you can also google for PageRank to get a large list of other resources.
Technorati Tags: PageRank | Whitehat | SEO | URL Design
As promised, this is the first of what will be many URLQuizzes here at the blog for The Well Designed URLs Initiative. This URLQuiz discusses the convention of using a subdomain with the name ‘www’ to identify a website.
As most everyone knows, many of the first sites on the web started using this convention. Examples include www.amazon.com, www.yahoo.com, www.google.com, and www.ebay.com. However, there is nothing about the web that requires a subdomain be named ‘www’ when selecting the address for a website. To the contrary, many websites use other subdomains as prefixes, such as:
There is even a passionate contingent of web developers that believes the ‘www’ convention is an anachronism and should be deprecated (or ‘eventually abolished’, in layman’s terms).
So how should the base domain and subdomain(s) be handled, and what are the pros and cons of each? Here are the options I’ve identified, but feel free to suggest others that come to mind as well:
So there you go; give your answer(s) in the comments. Though I definitely have my opinions on the subject I will stay out of it unless I don’t see anyone mentioning several of the points I think are relevant. After enough comments come in, I’ll summarize and write a follow up post, just like Dan Cederholm did with SimpleQuiz.
Hint: You might want to consider not only online usage but offline usage as well.
UPDATE: Just days after writing this post Tim Bromhead wrote: Which is better for your site: www or no www? Is that weird or what? Tim must have had some kind of a Vulcan Mind Meld or similar going on… Anywho, great article Tim and thanks for being a URLian!
UPDATE#2: Looks like I picked the right time to discuss this issue! A few days ago Scott Hanselman talked about the downside of ignoring the distinction between ‘www’ and the root domain, Jeff Atwood discussed how to solve it, and Phil Haack then responded with a bit of a rant about the www or lack thereof. Since Jeff and Phil have such strong yet opposite opinions on the subject, maybe we can get both of them to weigh in over here…?
Technorati Tags: URL Design | Subdomains | Canonical Form | www | no-www
As I was going through my Akismet spam filter today, reviewing the 87 spam comments I got during the prior ~24 hours to ensure I didn’t delete any legitimate comments, it occurred to me that maybe there is a simple solution to comment spam.
What if blog apps could simply mark a hyperlink with:
rel="spam"
The simple idea is that rather than delete spam, blogs could start maintaining a special page of links to comment spammers’ websites using rel="spam" on the “A” element. Basically this would be PageRank in reverse. The search engines would then apply negative weighting to anything marked spam and give the spammers the exact opposite of what they were pursuing when they unethically tried to game the system!
For example, Google could give negative PageRank for a spam link compared to positive PageRank for a non-spam link. Google could also weight the relevance of the link text negatively vs. the positive value it would give a non-spam link. This would have the effect of distributing the watch-dogging of spammers out onto the web without requiring any new infrastructure, and it would create a clear disincentive for comment spammers instead of the lack of disincentive from “nofollow.”
Are there problems with this I’m not foreseeing? Probably. I already know that people would try to game the system for negative purposes, and that’s to be expected. Still, I think that for the most part anyone using it to settle a grudge or to harm a competitor would by definition be doing it on such a small scale that it would have no real effect. Comment spammers, by contrast, automate, so they can end up with huge numbers of comment spam links. Even if the search engines gave a spam link only 1/10th the weight of a positive link, applied negatively, it would certainly still be effective.
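To make the scale argument concrete, here is a rough Python sketch. Everything in it is hypothetical: the function name, the per-link value of 12.5 (a PR3 page with 10 outbound links under the base-five model from the PageRank post), and the link counts are assumptions chosen only for illustration. No search engine is known to work this way.

```python
SPAM_WEIGHT = 0.1  # assumption: a spam link counts 1/10th as much as a positive link, negatively


def net_link_value(positive_votes, spam_votes, spam_weight=SPAM_WEIGHT):
    """Sum of ordinary link votes minus the discounted sum of rel="spam" votes."""
    return sum(positive_votes) - spam_weight * sum(spam_votes)


VOTE = 12.5  # hypothetical value of one link from an average page

# A legitimate site with 200 real links and 3 grudge-driven spam links.
legit = net_link_value([VOTE] * 200, [VOTE] * 3)

# A comment spammer with 20 real links whose 5,000 automated comment links got marked as spam.
spammer = net_link_value([VOTE] * 20, [VOTE] * 5000)

print(legit, spammer)
```

Under these made-up numbers the grudge barely dents the legitimate site, while the spammer ends up well below zero, which is the asymmetry the idea relies on.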
Of course the hard-core Linux faithful would immediately spam-link to Microsoft.com just to spite them! But I really don’t (currently) see how that couldn’t be detected and managed via policies and algorithms. For example, if a company has a large number of positive links it could be exempt from the effects of spam links. And I’m sure automated methods or methods using collective intelligence could emerge to resolve these problems the vast majority of the time. The rest could be handled via policy; get caught spam-linking someone inappropriately and get your domain pulled from the index!
What’s more, it would give bloggers a sense of purpose when they review their spam filters instead of them feeling like the time spent was just a waste. I know that if my efforts to detect comment spammers could get them lower PageRank, I’d feel good about monitoring my comments for spam as I would be doing a service for the public good. And I’m sure most other bloggers would feel the same.
Now I know that Microformats.org has the similar VoteLinks proposal, but that is about registering opinion as opposed to calling out gamers of the system. VoteLinks is also much broader than what I’m suggesting. If we keep the focus really narrow (shine a spotlight on spam so that the search engines can eradicate it), then I’m pretty sure it would be a success.
What do you think? Good idea? Filled with holes I’ve not considered? I look forward to your feedback.
I recently had an off-list email conversation with Ian Hickson, the editor of the Web Hypertext Application Technology Working Group (WHATWG) specifications (i.e. HTML5 and WebForms 2.0). I was proposing to him that the current WebForms 2.0 draft specification be amended to allow a URI Template in the “action” attribute of the FORM element. Because I believe so strongly in the benefit of this proposal, and because such things are in line with what the Well Designed URLs Initiative was envisioned to advocate for, I decided to publish it to our blog and reference it in the WHATWG blog. The following is what I sent to Ian in email:
I really want to see WHATWG incorporate URI Templates for Web Form actions[1]. i.e.:
<form
action="http://foo.com/{make}/{model}/"
method="get">
<input type="text" name="make" />
<input type="text" name="model" />
<input type="submit" />
</form>
If I type “Honda” and “Civic”, it will do a GET to:
http://foo.com/Honda/Civic/
Instead of the only current possibility being something like:
<form
method="get"
action="http://foo.com/cars.php">
<input type="text" name="make" />
<input type="text" name="model" />
<input type="submit" />
</form>
Which would produce the following for “Honda” and “Civic”:
http://foo.com/cars.php?make=Honda&model=Civic
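To illustrate the substitution a browser would perform, here is a minimal Python sketch. The function name `expand_action` is hypothetical, and it handles only simple `{name}` placeholders rather than any full URI Template syntax. Appending unused fields as a query string is my own assumption about a sensible default, not something the proposal spells out.

```python
import re
from urllib.parse import quote, urlencode


def expand_action(template: str, fields: dict) -> str:
    """Fill {name} placeholders from form fields; unused fields go to the query string."""
    used = set()

    def substitute(match):
        name = match.group(1)
        used.add(name)
        # Percent-encode the value so it is safe inside a single path segment.
        return quote(str(fields.get(name, "")), safe="")

    url = re.sub(r"\{(\w+)\}", substitute, template)
    leftover = {k: v for k, v in fields.items() if k not in used}
    if leftover:
        url += "?" + urlencode(leftover)
    return url


print(expand_action("http://foo.com/{make}/{model}/",
                    {"make": "Honda", "model": "Civic"}))
```

For the “Honda” and “Civic” example this yields the clean URL shown earlier; a value containing a space, such as “Land Rover”, would be percent-encoded in the path.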
To which Ian replied in two parts. Here is the first part:
“Why not just write a server-side redirector? That’s a trivial one to write. Four lines of code, maybe 10 if you make the recommended security checks first. You could also do it with a little bit of JavaScript.”
Unfortunately, a server-side redirector is not an appropriate solution for one of the use-cases this proposal would address, and it doesn’t work at all for two others:
<form
action="http://www.myblog.com/{topic}/"
method="post">
<select name="topic">
<option value="first">My 1st Post</option>
<option value="second">My 2nd Post</option>
<option value="third">My 3rd Post</option>
</select>
<input type="text" name="comment">
<input type="submit">
</form>
<form
action="http://blog.whatwg.org/{topic}"
method="post">
<select name="topic">
<option value="feed-autodiscovery">
Feed Autodiscovery
</option>
<option value="text-content-checking">
textContent Checking
</option>
<option value="checker-bug-fixes">
Bug Fixes
</option>
<option
value="significant-inline-checking">
Significant Inline Checking
</option>
<option value="charmod-norm-checking">
Charmod Norm Checking
</option>
<option value="proposing-features">
Proposing features
</option>
</select>
<input type="submit">
</form>
So yes, server-side redirection is possible in some cases but by no means all, and for those cases where it’s possible it is not optimal.
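For reference, the kind of redirector Ian has in mind for the first (GET) example might look roughly like the Python sketch below. The function name `redirect_target` is hypothetical, and a real server would wrap it in a handler that answers with a redirect status and a Location header. It covers only the cars.php example from this post.

```python
from urllib.parse import urlsplit, parse_qs, quote


def redirect_target(request_url: str):
    """Map .../cars.php?make=X&model=Y to the clean URL, or None if a field is missing."""
    parts = urlsplit(request_url)
    params = parse_qs(parts.query)  # blank values are dropped by default
    try:
        make = params["make"][0]
        model = params["model"][0]
    except KeyError:
        return None  # the basic check: refuse to redirect without both fields
    return f"{parts.scheme}://{parts.netloc}/{quote(make, safe='')}/{quote(model, safe='')}/"


print(redirect_target("http://foo.com/cars.php?make=Honda&model=Civic"))
```

Even in this best case the visitor pays for an extra round trip, and the query-string URL is what first shows up in the address bar.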
Moving on to Ian’s suggestion to use “a little bit of JavaScript” to meet this use-case, I will admit it is possible to use JavaScript, but these are the drawbacks in viewing JavaScript as the solution for this use-case:
It’s interesting to note that in the preface to the introduction for Section 3 of the WebForms 2.0 Working Draft of 12 October 2006, the following note is made about how everything that repeating form controls offers can already be done in JavaScript and the DOM. The mere fact that they went to the trouble to include something as complex as repeating controls into HTML5 when it can be done with JavaScript and the DOM implies that well-known patterns in web architecture are better implemented declaratively instead of via JavaScript and the DOM:
Occasionally forms contain repeating sections. For example, an order form could have one row per item, with product, quantity, and subtotal controls. The repeating form controls model defines how such a form can be described without resorting to scripting.
Note: The entire model can be emulated purely using JavaScript and the DOM. With such a library, this model could be used and down-level clients could be supported before user agents implemented it ubiquitously. Creating such a library is left as an exercise for the reader.
So yes, it is possible to use JavaScript in many cases, but it is nowhere near optimal. JavaScript should not be considered the solution for patterns as well-defined and obvious as submitting to a clean URL.
To further drive home the value of this proposal, anyone monitoring the REST-discuss list for any length of time will see that most REST experts tend toward using (what I call :) well-designed URLs, i.e. URLs where the resource is identified by path instead of query string. With WebForms 2.0’s pending support of PUT and DELETE, it would be just short of a crime not to include support for posting to clean URLs in WebForms 2.0.
After I had this discussion with Ian via email, Mark Baker pointed out to me on rest-discuss that my proposal as written would break the existing web and so was a non-starter. For some reason I wasn’t thinking about that, probably because I was more concerned about getting Ian (who I like to call: Mr. “No” :) to agree that URI Templates were needed. Still, the solution would be simple.
What follows are my examples from above recast using an optional template attribute that would override the action attribute for WebForms 2.0 compliant browsers. This would of course require the server to accept both query string parameters and clean URLs (and hopefully do a server redirect from the former to the latter), or the submit could be implemented using JavaScript for older browsers when applicable. Note that I didn’t show an example using JavaScript but, as the WebForms 2.0 spec says, “(that) is left as an exercise for the reader”:
<form
action="http://foo.com/model"
template="http://foo.com/{make}/{model}/"
method="get">
<input type="text" name="make" />
<input type="text" name="model" />
<input type="submit" />
</form>
<form
action="http://www.myblog.com/topic"
template="http://www.myblog.com/{topic}/"
method="post">
<select name="topic">
<option value="first">My 1st Post</option>
<option value="second">My 2nd Post</option>
<option value="third">My 3rd Post</option>
</select>
<input type="text" name="comment">
<input type="submit">
</form>
<form
action="http://blog.whatwg.org/topic"
template="http://blog.whatwg.org/{topic}"
method="post">
<select name="topic">
<option value="feed-autodiscovery">
Feed Autodiscovery
</option>
<option value="text-content-checking">
textContent Checking
</option>
<option value="checker-bug-fixes">
Bug Fixes
</option>
<option
value="significant-inline-checking">
Significant Inline Checking
</option>
<option value="charmod-norm-checking">
Charmod Norm Checking
</option>
<option value="proposing-features">
Proposing features
</option>
</select>
<input type="submit">
</form>
So in summary I really hope that Ian, who definitely seems to be the gatekeeper for what goes into HTML5 and what doesn’t go into HTML5, can see his way clear to add this feature to WebForms 2.0. If his main issue with it is needing to have it written up for inclusion in the spec, I’m more than happy to help.
Here at the Well Designed URLs Initiative we plan to address a wide audience and cover a plethora of URL-related topics. If it wasn’t obvious from yesterday’s post, we plan to publish content for a variety of roles, so we will categorize all our posts by the audience we are targeting.
Using our audience categories, you can subscribe to our RSS feed and then configure your feed reader to filter out all but those topics that are likely to appeal to you, so as not to be overwhelmed by the rest. Some of our audience categories, such as Everyone and Internet Professionals, encompass other categories, so we plan to tag only the highest-level category that applies to avoid duplication. For example, if you are a web developer you might want to filter out all but the Everyone, Internet Professionals, and Web Developers categories. Of course, if that’s too much trouble, just subscribe to our entire feed and ignore those posts that don’t interest you.
The following is the list of categories we’ve set up by audience role:
NOTE: If you read this post shortly after it is published, most of the links above will just redisplay this post. Let me explain. Because of the way our WordPress blog software works, those links would have displayed a 404 Not Found error if no posts existed for the given category. To avoid that, I’ve tagged this post with all audience categories, contradicting what I said above: that we would only put a post in its highest-level category. Moving forward we shouldn’t need to do this again.