Why SnipURL’s API is Unsafe a.k.a. How NOT to design your Web API

Sigh.

If you’ve read my blog, you know that I like well designed URLs. You’ve probably also discerned that I like web APIs with well designed URLs. And you might further be aware that I like APIs that are as lightweight as possible; i.e. ones that accept PUT and/or POST content of mime-type “application/x-www-form-urlencoded” and that respond with a body of the “text/html” or “application/json” mime-types. So on the surface you might think that I’d really like the SnipURL API; after all, it is the epitome of simple.

But unfortunately, I don’t like SnipURL’s API. You see, it violates one of the most important tenets of HTTP, which is that GET requests MUST be ‘safe’; i.e. a GET should have no side effects. SnipURL allows programs to use HTTP GET to issue a request that adds a URL to their database of “snips.”

For example, to “snip” the URL “http://blog.welldesignedurls.org/url-design/” and enable the URL “http://snipr.com/urldesign” to redirect to it, simply type the following URL into your browser (your browser will issue a “GET” request to SnipURL’s server API):

http://snipr.com/site/snip?r=simple
&link=http://blog.welldesignedurls.org/url-design/
&title=URL+Design
&snipnick=urldesign

This GET could also have been issued by a programming language, which of course is the whole point of an API. Unfortunately, the fact that the SnipURL server adds a record to their “snip” database on a GET request is a side effect and violates the HTTP spec. But before you think I’m just being a pedantic standardista with my undershorts on too tight, take a gander at the firestorm that resulted when Google released its Google Web Accelerator in early 2005. GWA caught lots of web developers with their pants down, including the golden boys of Rails who evidently hadn’t read the HTTP specs either (to their credit, they’ve made great strides since.)

So when I realized their API was violating the spec I of course emailed their Google group with comments very similar to the ones I left on Douglas Karr’s blog, where I first learned about SnipURL, in hopes they could nip any damage in the bud. I’d point you to my email on the list but unfortunately they moderated my comments. On the other hand they did reply back to me in email. They responded with the following otherwise cordial email that was filled with mistaken assumptions:

Hello,

Thanks for an informative post.

Snipurl’s API is based on requests from our own users who wished to use our API, who stated clearly that they would prefer not to have to go through the rigmarole of charting through XML return values, and instead preferred a simple snip returned to them.

As for the comment: “GET is used for safe interactions and SHOULD NOT have the significance of taking an action other than retrieval.”

This is exactly what Snipurl’s API does. While we “get” the form values, they are rigorously checked for validity (which is in our own interest; otherwise miscreants could ruin our system) and only a value is returned.

The POST method is also significantly more resource-intensive when you talk of non-trivial traffic. Because we are a free service, we chose our approach based on practical reasons rather than puritanical design goals.

Again, it is great to have some feedback from the community. If I can find the time, I’ll be doing a REST API sometime, as has been on the cards anyway.

Many thanks
SnipURL Editor

So first off, it is clear from their reply that their knowledge of HTTP is limited and that they are making assumptions that don’t follow from the comments I sent them. But I’m not posting their email to attack them or make fun of them; I’m posting it to illustrate that many professional web developers today (probably more than half) don’t know the rules of HTTP. Instead, I’m using their email as an example in hopes of educating those who, like me for the first 10+ years of my web development career, just let their tools isolate them from HTTP (yes, I’m talking to you, Visual Studio, IIS, and ASP/ASP.NET.)

So let me address their points one at a time.

First, they said their users wished to use an API without having to “go through the rigmarole of charting through XML return values.” Obviously they were reading my email in haste and assuming that I was advocating XML when I most definitely was not. And it is all the more ironic that they assumed XML advocacy given my recent anti-XML rant on the rest-discuss list. To be clear, I didn’t suggest the use of XML, and what I did suggest (using PUT or POST) is no harder than programming a GET.

Second, their claim that “rigorously check[ing] for validity” makes their GET “safe” demonstrates a fundamental ignorance of what the term “safe” means in relation to HTTP GET. The term has a very precise meaning according to the spec, and for those who don’t know it I highly recommend reading “Safe Interactions” from the W3C Technical Architecture Group’s finding “URIs, Addressability, and the use of HTTP GET and POST.” In addition, the checklist explaining when to use GET vs. POST from the same document is both short and highly readable, so web developers who’ve never learned the HTTP spec in detail should at least read that.

Lastly, their claim that “the POST method is significantly more resource-intensive” than GET for non-trivial traffic shows they really do not understand GET or POST. Let’s look at the difference between the two; the first is a GET and the second is a POST (NOTE: I’ve wrapped the HTTP requests to keep from overflowing the blog’s borders, but you would not wrap them in an actual HTTP request. The wrapped lines start with “>>>”):

GET http://snipr.com/site/snip?r=simple
>>> &link=http://blog.welldesignedurls.org/url-design/
>>> &title=URL+Design
>>> &snipnick=urldesign
>>> HTTP/1.1

POST http://snipr.com/site/snip HTTP/1.1
<blank line goes here>
r=simple
>>> &link=http://blog.welldesignedurls.org/url-design/
>>> &title=URL+Design
>>> &snipnick=urldesign

So where’s the resource intensiveness of the latter? The latter actually transmits fewer characters over the wire (not that a few characters make any difference except at “Yahoo-scale.”) What’s more, the overhead of writing the snip to their database will be orders of magnitude greater than any difference between GET and POST (if there even is a difference), though there may be some trivial differences on some server frameworks.
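To make the point concrete, here is roughly what issuing that same snip request as a POST would look like from PHP with cURL. This is only a sketch of the approach I’m recommending; SnipURL’s actual API accepts only the GET form today, so the endpoint and parameters below are simply carried over from the example above:

<?php
// Sketch only: issue the same "snip" request as a POST using PHP/cURL.
// SnipURL's real API currently accepts only GET; the point is simply that
// programming a POST is no harder (or more "resource-intensive") than a GET.
$fields = array(
   'r'        => 'simple',
   'link'     => 'http://blog.welldesignedurls.org/url-design/',
   'title'    => 'URL Design',
   'snipnick' => 'urldesign'
);

$ch = curl_init('http://snipr.com/site/snip');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// For the PUT variant discussed next, you would target the snip URL itself
// and set: curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
$snip = curl_exec($ch);
curl_close($ch);

echo $snip;
?>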

But better than a POST, I’d recommend a PUT. Note how I PUT to the URL that we want to be our snip URL, thus eliminating the awkward “snipnick” parameter (this PUT’s body is wrapped too):

PUT http://snipr.com/urldesign HTTP/1.1
<blank line goes here>
r=simple
>>> &link=http://blog.welldesignedurls.org/url-design/
>>> &title=URL+Design

So hopefully the SnipURL Editor and anyone else reading this will now realize that it is important to ensure that APIs always use HTTP GETs in a ‘safe’ manner. Of course if they don’t “get” what I’m trying to say (pun intended), then maybe I should just post the following code to a page on my website (or something similar for other HTTP-violating APIs) and let Google’s spiders have at it. :-)

<html>
<head>
   <title>Spidering SnipURL's Naughty API</title>
</head>
<body>

<h1>Spidering SnipURL's Naughty API...</h1>

<dl>
<?php
   // Build a SnipURL "snip" request whose nickname grows by one character
   // (a-z then 0-9) on each click-through, so a spider following these
   // links would keep adding records via unsafe GETs.
   $url = "http://snipr.com/site/snip?r=simple" .
      "&link=http://www.w3.org/2001/tag/doc/whenToUseGet?nick=<<nick>>" .
      "&title=HTTP+GETs+MUST+be+SAFE" .
      "&snipnick=";

   $prefix = isset($_GET['nick']) ? $_GET['nick'] : '';

   for ($i = 0; $i < 36; $i++) {   // 26 letters plus 10 digits
      $nick = $prefix . chr(($i < 26 ? 97 : 22) + $i);
      $nick_url = str_replace('<<nick>>', $nick, $url) . $nick;
      print '<dt><a href="?nick=' . $nick . '">' . "$nick</a></dt>";
      print '<dd><a target="_blank" href="' . $nick_url . '">HTTP GETs MUST be SAFE</a></dd>';
   }
?>
</dl>

</body>
</html>

Whadayathink? Should I post it to my site? ….. Nah, I’ll be nice today.

P.S. After writing this post it’s occurring to me that maybe SnipURL simply thought I was suggesting they use a SOAP API instead? All their rationales would certainly apply against SOAP, as:

  1. SOAP uses XML,
  2. SOAP “safety” implies validation (I think), and
  3. SOAP is a LOT more overhead than using a GET or a PUT/POST.

Think maybe that’s what SnipURL thought I meant?

7 Comments

URL Quote #4: An utter disaster to disable the URL

“Given that social filtering is one of the most powerful mechanisms for information discovery on the Internet, it is an utter disaster to disable the URL as an addressing mechanism. “

-Jakob Nielsen on “Why Frames Suck (Most of the Time)”

Posted in Champions, Frames, URL Quotes, Virality | Leave a comment

Seeing things the way in which one wants them to be (not the way they are)

[Image: DabbleDB’s Really Bad URLs]

Throughout history there have been people who only saw things as they wanted them to be. People with strongly held beliefs whose values guided their actions be they counter-productive, detrimental, or, worse, just plain wrong; nothing mattered but to believe the world was as they wanted it to be. And it’s not just those from the past that are guilty; nay, it seems that people the world over are now more ideological than in any period in my own lifetime. I could give many examples, but whatever examples I gave I’d be sure to offend over half of my readers!

It’s probably pretty obvious that I’m talking mostly about war and religion in the above, but it’s also sad to see the same from technologists. Case in point: the creators of DabbleDB and the Seaside Framework. There is an unfortunate school of thought among some web developers, which Avi Bryant evidently shares[1], that clean URLs are simply not important, that they are just the obsession of overly pedantic developers pursuing unimportant elegance. And those opinions are often rationalized by statements like these (on Mike Pence‘s blog) that clearly exhibit confirmation bias:

“I have not had one single person ever mention in Dabble DB that the URL’s look funny. People are used to it. If you go to Amazon.com, the URL’s have all kinds of opaque identifiers in them. It is just not something that the average user cares about. I think it becomes an obsession for developers to have this sense of having a clean API exposed by their web application, but I think you can have a clean API that does not have to include every single page in your app, and I don’t think that every single page in your app has to be bookmarkable. I think that as long as a bookmark gets you back, roughly, to where you wanted to be, or for really crucial things to have permalinks, then you are fine.”

Well, I guess Avi can’t say that anymore (that he hasn’t had one single person complain about DabbleDB’s URLs.)

That said, why would Avi believe URLs to be unimportant anyway?  There is significant evidence all over the web that URLs are important, not the least of which has been documented on this blog in the past. As best I can tell, Avi’s regrettable belief comes from his desire to be unburdened from dealing with web architecture so that he can hoist highly stateful web apps onto an unknowing and unsuspecting public, simply because that’s what Avi values. Basically, Avi chooses to ignore the importance of URL design for both users and good web architecture and has his framework emit simply awful URLs because doing so makes coding and using his server-side framework so much easier. That’s similar to someone not addressing the unfortunate necessity of security simply because dealing with security is a PITA. (BTW, Amazon’s URLs are some of the worst and they only get away with it because of their early momentum. They are NOT a good example to emulate.)

So you see DabbleDB exhibits some very clear examples of really bad URLs. To see for myself I created a free trial account over at DabbleDB, which gave me my own well-designed URL (itself, not bad):

http://welldesignedurls.dabbledb.com/

Next I created an application called “Sites” and a first “category” that I named “Domains” (evidently in DabbleDB parlance a “category” is like a table to us relational database types.)  This gave me the following URL:

http://welldesignedurls.dabbledb.com/dabble/sites?view=2&_k=ZEiTkHyn

Not bad, but the “/dabble/” is unnecessary, the “view” could have been defaulted, and the “_k” is, well, so gratuitous I doubt I need to criticize it further.  Clearly what I would have preferred to see is this:

http://welldesignedurls.dabbledb.com/sites/

Or at least:

http://welldesignedurls.dabbledb.com/apps/sites/

And I believe anyone would be hard pressed to explain why the actual URL DabbleDB uses is better or why the URL I proposed would not be workable. Still, all is not so bad to this point because it appears DabbleDB will respond appropriately to:

http://welldesignedurls.dabbledb.com/dabble/sites

Of course, anyone bookmarking the URL vs. composing the URL for a blog will be linking to two different URLs as per web architecture, which has its own perils for the owner of the website. (I’m of course assuming public URLs for this use-case, which is possible via DabbleDB Commons, itself having a great URL of http://dabbledb.com/commons, but many usability problems still exist in the closed environments where most DabbleDB databases will be used.)

But matters get much worse when we drill down into the “Domains” category I created. Compare the following two URLs and guess which one I envisioned vs. the one Dabble generated:

http://welldesignedurls.dabbledb.com/dabble/sites?view=2&_k=qPDotnwm

http://welldesignedurls.dabbledb.com/sites/domains/

And if we click on the name of the domain “welldesignedurls.org”, it gets even worse:

http://welldesignedurls.dabbledb.com/dabble/sites?entry=7&view=2&_k=jGMmkZyZ#objectEditor

Again, I would have liked to have seen:

http://welldesignedurls.dabbledb.com/sites/domains/welldesignedurls.org/

Why is this important?  Because humans rely on the meaning of a URL in a significant number of contexts, often in contexts where only recognition (vs. URL construction) is important: in email, in the browser history list, in older bookmarks, in printed communication, and more.  By analogy, imagine how much harder computers would be to use if users had no choice but to always navigate the tree structure of a deeply nested directory instead of simply copying and pasting the path from, for example, Windows’ Explorer or the Mac’s Finder to an Open File dialog[2]. Just imagine what it would be like if the path to the user’s directory was named “C:\%GSkstyrWshs\@9KBHasklp\”. Ye-Gads!

There are still further cases where clean URL design is important. For bloggers composing their links, having the ability to learn a link structure rather than having to navigate to each page they want to link (such as on Wikipedia) is invaluable. For marketers wanting to convey a location in advertising for their customers and prospects to visit. And especially for users of web apps that are heavily data-oriented, where users are involved in editing, navigating to, and communicating various application states (a.k.a. web pages) to their colleagues, such as an app like DabbleDB. If ever there was a category of web apps where good clean URL design is critical, it would be online databases!

So NO Avi, URL Design IS important. I hope you can learn this and make changes to DabbleDB and Seaside before it’s too late for you and, worse, for your users.

Footnotes

  • 1.) How ironic Avi would name his blog “HREF Considered Harmful” as HREFs are truly one of the core foundations of web architecture.
  • 2.) Yes, I know that some people don’t ever copy and paste paths, but many of the more intelligent and/or aware users do.
Posted in Everyone, SoapBox, The Unenlightened, URLs that Suck, Warnings | 17 Comments

Why URL design matters in email

I’ve long believed email provides one of the better justifications for good URL design. Having a well designed URL structure inspires a user to have faith in a site’s URL integrity, making it more likely they will email a URL to their friends. What’s more, a good URL gives hints to what can be found, making it more likely for email recipients to visit the link. And a readable URL provides something to “google” when the emailed URL is mangled or simply mistyped by the sender.

But it simply hadn’t occurred to me just how important URL design can be for marketing emails until today, when I read a post by Mark Brownlow of Email Marketing Reports. Mark’s post, entitled Forget email design, what about URL design?, discusses the immediately obvious benefits of URL design in email marketing and lists several reasons why email marketers should pay particular attention to their URLs.

As Mark effectively states, well designed URLs can (elaborations mine):

  • Reinforce a brand message (when a good domain and/or logical URL path is used),
  • Help orientate the reader (within the website’s structure, and/or regarding the offer),
  • Provide text clues to the destination page’s content and value,
  • Indicate important content relationships (via the URL path’s hierarchy and/or between multiple emailed URLs), and
  • Remain relevant and recognisable over a long period of time (assuming the email marketer has a process in place to manage their site’s URL architecture.)

In addition, Mark gives a few examples that clearly make the case for good URL design in email marketing. He effectively asks: which of these two URLs sends a stronger message to the prospect?

  1. http://www.brandk.com/land.php?123456
  2. http://www.brandk.com/rings/coupon/

I think the preferable one is obvious, don’t you?

Mark also suggests providing your prospect with their own custom call-to-action URL in marketing emails such as:

http://www.brandk.com/rings/coupon/justformark/

I too believe that providing customers with their own personal well designed URL can be an incredibly powerful marketing and SEO strategy. However, I’m not so sure it will work well for unknown prospects.

Well done Mark. Nice to have another URLian on the bandwagon.  :-)

Posted in Best Practices, Champions, Commentary, Email, SEO | 2 Comments

Proposing HTTP Request Forwarding

I’ve been monitoring the ietf-http-wg mailing list and have noticed there is renewed activity around revising RFC 2616, the HTTP/1.1 specification. This renewed activity got me thinking it was time to discuss the need for HTTP Request Forwarding.

I’ll start by saying it is possible that Request Forwarding is already part of the HTTP spec and that I just overlooked it. Lord knows I’ve read enough W3C and IETF specs lately to raze a small forest if printed, so I could easily have missed that part as I read bleary-eyed through the specs. But I’ve asked the question privately of a few people that should know and they all said that HTTP does not support request forwarding; one of them pointed out that that is why VOIP needed SIP.

If the term “HTTP Request Forwarding” isn’t obvious from context, let me give a use-case to illustrate:

USE-CASE: Retaining control of URL Interfaces when outsourcing image hosting

  1. http://example.com
  2. http://image-servers-r-us.net
  3. http://example.com/index.html
  4. http://example.com/images/splash.png
  5. http://image-servers-r-us.net/example.com/splash.png

Assume client “A”, server “B” mapped to [1] and server “C” mapped to [2] (servers “B” and “C” are different computers with different IP addresses probably at different locations.) For this use case client “A” requests the HTML file at [3] which includes an <img> tag pointing to a graphic at [4].

The request from client “A” for the .html file at [3] goes to server “B” and the response is returned to client “A.” After parsing [3], client “A” realizes it also needs to download the image at [4] and requests the .png file [4] from server “B” at [1].

[Figure: HTTP Request Forwarding (note the use of HTTP 1.2 to make the URL in the example explicit)]

However, server “B” knows that the .png file [4] is actually located on server “C” at [5], so it forwards the request to server “C” at [2]. Server “C” responds by returning the .png file directly back to client “A”, and client “A” is none the wiser; i.e. client “A” still thinks that the image was returned by [4].

In addition, if client “A” is a browser and the user inspects the properties of the image, the user would find the URL to be [4], not [5]. Nowhere in the response given to the client would there be information that the image came from [5] instead of [4], except that the responding IP address differs from that of the server the client sent the request to. It is possible that a header could contain the forwarding information, but the client would not need to act on it except potentially for debugging.
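To see what Request Forwarding would buy us, consider the only way server “B” can keep URL [4] today without resorting to a redirect: it has to proxy the bytes itself. A minimal PHP sketch of that status quo (a hypothetical handler, with caching headers omitted for brevity):

<?php
// Status quo WITHOUT Request Forwarding: the /images/splash.png handler on
// server "B" must fetch the image from server "C" and stream it back itself,
// so every image byte crosses "B"'s connection twice.
$upstream = 'http://image-servers-r-us.net/example.com/splash.png';   // [5]

$image = file_get_contents($upstream);   // "B" downloads the image from "C"...
header('Content-Type: image/png');
echo $image;                             // ...then re-sends it to client "A".
?>

With Request Forwarding, server “B” could hand the request off to server “C” and drop out of the data path entirely.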

There are several other use-cases as well, such as making it possible to fully virtualize a domain authority’s URL interface. However, the use-case of outsourcing images, or any large static content, illustrates a benefit that I believe should be obvious to most people. And for those who are interested, especially anyone involved in updating the spec, I could detail the other use-cases as well.

Now it is possible there are security implications with this that I have not considered. If so, I would hope that we could at least explore potential ways to mitigate those issues as opposed to immediately dismissing HTTP Request Forwarding out of hand.

In summary, adding Request Forwarding functionality to HTTP would allow servers to maintain control of their URL interfaces while still being able to distribute loads and services to appropriate servers. Having such a feature could improve consistency in URL design and make it easier for website owners to restructure their websites without the level of broken links seen on the web today.

Interestingly, on Tuesday Scott Hanselman blogged about just this problem from a slightly different perspective; I present Scott’s post for your consideration.

Posted in Call to Action, For Comment, HTTP, Potential, Standards Participants | 3 Comments

URL Quote #3: Wikipedia’s URLs a reason for their success?

“Wikipedia’s URL spaces are highly elegant; I suspect it’s one of the reasons Wikipedia is successful”

-Bill de hOra on “Wikipedia’s Highly Elegant URLs”

Posted in Champions, URIs, URL Quotes | Leave a comment

URL Quote #2: Think about your website’s “public face.”

“…one should take an hour or so and really think about their website’s ‘public face.’”

-Scott Hanselman on “A Website’s Public Face”

Posted in All Organizations, Best Practices, Champions, Everyone, Framework Developers, Hosting Providers, Open-Source Participants, SEO Consultants, Standards Participants, Teachers, URIs, URL Quotes, URL Rewriters, Web App Providers, Web Designers, Web Developers | Leave a comment

URL Quote #1: A URL is like a big “YOU ARE HERE” sign

“A URL is like a big “YOU ARE HERE” sign for each page of your site. It should allow people to get a sense of where they are in your site, even if they decide not to use that information for navigation.”

-Keith Devins on URL Design

Posted in Champions, Everyone, URL Quotes | Leave a comment

URLQuiz #2: URL Equivalence and Cachability

This is quiz #2 of our ongoing URLQuiz series.

In this quiz there are 26 pairs of URLs (A..Z), and for each pair the questions are: “Which of these two URLs are equivalent?”, i.e. which return the same resource when dereferenced, and “Which can be cached as the same URL?”

You should answer ‘Yes‘, ‘No‘, or ‘Maybe‘ where ‘Maybe‘ means ‘It might return the same resource but should not be cached according to the specs.’

To answer, leave a comment and ideally explain your reasoning for each. Feel free to group answers based on your reasoning and/or the answer given (Yes, No, and Maybe.) Print it out and take the quiz with pencil and paper if you’re serious about getting it right, and feel free to use a computer or browser or whatever to test your results before answering. Good luck!

About the TLD .foo[1]

Clarification (2007-March-02): Some people have stated that the server could possibly return the same resource for any two given URLs, so they felt the answer could never be ‘No.’ I definitely see their point; for example, http://mysite.foo/bar and http://mysite.foo/bazz could possibly return the same thing, but nobody would ever reasonably expect them to do so on their own. So let me clarify: I meant a quiz taker to select ‘No‘ in the case where the resource returned would definitely not be the same thing unless the developer or server admin explicitly programmed or configured them to do so. On the other hand, ‘Maybe‘ would be used in the case where someone might reasonably expect the two URLs to return the same resource even though RFC 3986 would define the two URLs as being different, such as in [footnote 2], or when it depends on the O/S of the server, as in [footnote 3]. Regarding fragments, the question is: “In a transaction between a client and a server, is the cache allowed to view them as the same?” Regarding which can be cached, I was looking for what is appropriate per the spec, not necessarily whether any particular software in the cloud (i.e. routers, proxies, browsers, etc.) actually does cache, but instead “Would it be allowed to cache?”
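As a hint for several of the pairs below, RFC 3986 does describe a few normalizations a client or cache may apply without asking the server, such as lowercasing the scheme and host and dropping a scheme’s default port. Here is a rough PHP sketch of just those rules, for illustration only (it is not a complete normalizer, and it ignores fragments entirely):

<?php
// Rough sketch of a few RFC 3986 syntax/scheme-based normalizations.
// Anything these rules cannot equate (path casing, parameter order, empty
// query strings, fragments, etc.) is where the 'Maybe's and 'No's come from.
function normalize_url($url)
{
   $p      = parse_url($url);
   $scheme = strtolower($p['scheme']);                // schemes are case-insensitive
   $host   = strtolower($p['host']);                  // so are host names
   $port   = isset($p['port']) ? $p['port'] : null;
   if (($scheme == 'http'  && $port == 80) ||
       ($scheme == 'https' && $port == 443)) {
      $port = null;                                   // drop the scheme's default port
   }
   $path = (isset($p['path']) && $p['path'] != '') ? $p['path'] : '/';
   return $scheme . '://' . $host . ($port ? ':' . $port : '') . $path .
          (isset($p['query']) ? '?' . $p['query'] : '');
}

// These two normalize to the same URL; different *path* casing would not.
echo normalize_url('HTTP://MySite.foo:80/Index.htm'), "\n";   // http://mysite.foo/Index.htm
echo normalize_url('http://mysite.foo/Index.htm'), "\n";      // http://mysite.foo/Index.htm
?>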

Questions

  1. The ‘www’ domain
    1. http://mysite.foo/
    2. http://www.mysite.foo/
  2. Letter casing in path
    1. http://mysite.foo/Index.htm
    2. http://mysite.foo/index.htm
  3. Letter casing in domain
    1. http://MySite.foo/index.htm
    2. http://mysite.foo/index.htm
  4. Index.htm vs. Default.aspx
    1. http://mysite.foo/Index.htm
    2. http://mysite.foo/Default.aspx
  5. Trailing slash on domain
    1. http://mysite.foo
    2. http://mysite.foo/
  6. Trailing slash on path
    1. http://mysite.foo/path
    2. http://mysite.foo/path/
  7. Empty question mark
    1. http://mysite.foo/
    2. http://mysite.foo/?
  8. Empty parameter
    1. http://mysite.foo/?
    2. http://mysite.foo/?param=
  9. Port 80
    1. http://mysite.foo/
    2. http://mysite.foo:80/
  10. Port 443
    1. http://mysite.foo/
    2. http://mysite.foo:443/
  11. Https vs. Port 443
    1. https://mysite.foo/
    2. http://mysite.foo:443/
  12. Ftp vs. Http
    1. ftp://mysite.foo/
    2. http://mysite.foo/
  13. Letter casing in parameter name
    1. http://mysite.foo/?param=bar
    2. http://mysite.foo/?Param=bar
  14. Letter casing in parameter value
    1. http://mysite.foo/?param=bar
    2. http://mysite.foo/?param=Bar
  15. Hash vs. no hash
    1. http://mysite.foo
    2. http://mysite.foo#
  16. Hash vs. Fragment
    1. http://mysite.foo#frag
    2. http://mysite.foo#
  17. Fragment vs. no Fragment
    1. http://mysite.foo#frag
    2. http://mysite.foo
  18. Plus vs. Space in path
    1. http://mysite.foo/url+design
    2. http://mysite.foo/url design
  19. Space vs. Encoded Space in path
    1. http://mysite.foo/url design
    2. http://mysite.foo/url%20design
  20. Plus vs. Encoded Plus in path
    1. http://mysite.foo/url+design
    2. http://mysite.foo/url%2Bdesign
  21. Slash vs. Encoded Slash in path
    1. http://mysite.foo/top/second
    2. http://mysite.foo/top%2Fsecond
  22. Ampersand vs. Encoded Ampersand in path
    1. http://mysite.foo/abc&xyz
    2. http://mysite.foo/abc%26xyz
  23. Ampersand vs. Encoded Ampersand in parameter value
    1. http://mysite.foo/?q=abc&xyz
    2. http://mysite.foo/?q=abc%26xyz
  24. Equals vs. Encoded Equals in path
    1. http://mysite.foo/abc=xyz/
    2. http://mysite.foo/abc%3Dyxz/
  25. Equals vs. Encoded Equals in parameter value
    1. http://mysite.foo/?q=abc=xyz
    2. http://mysite.foo/?q=abc%3Dyxz
  26. Parameter order
    1. http://mysite.foo/?abc=123&xyz=987
    2. http://mysite.foo/?xyz=987&abc=123

P.S. Don’t stress if you can’t answer them all. It took me months to uncover all these nuances, and if I were taking this quiz I doubt I could get them all right in one sitting.

Footnotes

  1. I’m using the non-existent top-level domain “.foo” to avoid giving any link-love to arbitrary example sites that don’t deserve it! For the purpose of the quiz, just assume that “.foo” is a functioning top level domain.
  2. Question A.
  3. Question B.
Posted in Everyone, URIs, URLQuiz | 12 Comments

Sorry Mark; URL Design DOES matter!

I was planning to blog something else today, but Mark Nottingham of Yahoo made a statement about URL Design [1] in his post entitled REST Issues, Real and Imagined, and I simply could not let his statement pass without comment.

But first let me say I always appreciate Mark’s perspective on issues, and enjoy reading his writings whether on his blog or in the mailing lists. His perspective is typically insightful and prescient, and he hovers above the muckraking and unsupported claims that can occur on mailing lists populated by egos. All in all his involvement is very professional, and I highly respect him for that.

That said, here is the comment he made that really bothered me (2nd and subsequent emphasis mine):

Red Herring: URI Design
When somebody first “gets” REST, they often spend an inordinate amount of time agonising over the exact design of the URIs in their application; take a look over on
rest-discuss, for example. In the end, though, URI design is a mostly a cosmetic issue; sure, it’s evidence that you’ve thought about good resource modelling, and it makes things more human-intelligable, but it’s seldom worth spending so much time on it.

I’d worry a lot more about cacheability, extensibility and well-defined formats before blowing out my schedule on well-designed URIs. For me, the high points are broadly exploiting the hierarchy, allowing relative references, and making sure that tools (e.g., HTML forms) can work well with my URIs; everything else is gravy.

I’ll start by saying I don’t really disagree significantly with his overall points that I believe he was making, such as the fact that there are other aspects that deserve attention in addition to URL design and also the fact some people appear to thrash when designing URLs. But to someone who does not appreciate URL design his statement could be easily misconstrued to mean that URI Design is not at all important, especially when he says it is mostly a cosmetic issue. That is false.

URI design is NOT merely a cosmetic issue! It has many tangible ramifications and those who ignore it do so at their own peril. Just scratching the surface, proper URL planning and design provides a framework for good information architecture, can facilitate spontaneous inbound linking via blogs, voting, tagging, and other social media sites, and can guard against broken URLs and subsequent traffic loss, to name but a few. These are issues that concern, or should concern, everyone who publishes a site on the world wide web.

But what worries me most are the people who will certainly latch onto Mark’s words not only as justification for ignoring patterns and best practices but also for antagonistically preaching against URL Design. This group includes web and content management system developers who would prefer not to be bothered with usability issues, system administrators who believe in security by obscurity (which itself is a fool’s precaution), dogmatists who misinterpret the principle of URI opacity and preach that web publishers should publish completely opaque URLs, opaque even to the web publishers themselves, and a tiny but vocal contingent that for reasons I cannot fathom argue against URI design even within totally unrelated conversations.  It is for this reason I think Mark’s statement is potentially very damaging.

And as for his comments about rest-discuss, it is quite possible he was referring to conversations in which I participated. If so, I believe he mistook the crux of the conversation on several levels. The first is that in many cases I was advocating URL design, not agonizing over it. Certainly, URI design is really not that hard; it just takes understanding core principles and best practices, and then applying them. And secondly, some people escalate conversations into raging debates when simple questions are asked about proper URL usage in the context of REST and Web Architecture. Mark could easily yet wrongly have misinterpreted those as too much hand-wringing over URL design.

In summary, I don’t really fault Mark’s comments, as I appreciate the spirit in which they were intended. And I think Mark does a great service for the web in his quest to educate people about the value of REST. But as Mark is justifiably well respected, I’m worried his comments may be used to rationalize bad URL Design. As such, I definitely hope he updates his post to guard against his words’ misuse.

Footnotes

  1. Mark prefers to use the term URI instead of URL, whereas I obviously prefer the latter when in the appropriate context. By definition, URLs are URIs, with the difference being that a URL is dereferenceable. When discussing REST end-points, the term URL is applicable, and the term URI, which also refers to non-dereferenceable identifiers, is not. And don’t you forget it! ;-)

P.S. Oh, and I also couldn’t help but wonder if Mark was trying to get in a cheap dig when he wrote “…before blowing out my schedule on well-designed URIs“…?  But naaaaah, Mark’s a real professional and wouldn’t go for such an underhanded shot. ;-)

Posted in All Organizations, Internet Professionals, REST, SoapBox, Teachers, URIs | 3 Comments

An Embarrassment of Riches!

Wow. Over the past several days there have been numerous articles about URL design in one format or another, so much so that I’ve not been able to comment on them all in a reasonable amount of time.

Rather than wait, I decided to go ahead and give them all some link love and plan to discuss the issues they raise in due time:

So glad to see so many new URLians appearing on the horizon!

Posted in Everyone, Links | Leave a comment

ASPnix Supports URL Rewriting

After a long thread over at the forums for ASPnix, a web host that specializes in Community Server, in which customers were clamoring for URL rewriting capability, ASPnix finally announced that it will now offer ISAPI Rewrite to its customers on ASPnix’s web servers.

That’s yet another IIS-centric web host who has finally freed its customers from the shackles of poorly designed URL hell! Hooray!

Technorati Tags: ASPnix | ISAPI Rewrite | URL Rewriting | IIS | Web Hosts | CommunityServer


Posted in Enablers, Internet Professionals, URL Rewriters | Leave a comment


SEO: Illuminating the value of URL design

If you’ve read many of our other posts here at The Well Designed URLs Initiative, you know that we are strong advocates for User-Centered URL Design as well as for URL Literacy. It’s our contention that the URL is woefully under-appreciated as the most fundamentally important technology of the web, more important than HTTP, and even more important than HTML. The purpose of this post is to provide background for future posts explaining the importance of URL design from a perspective most website owners can appreciate: search engine rankings!

But they’d have to kill you

For those unfamiliar with Google’s core algorithm for determining its search engine results, you can read this article to learn about PageRank in more painstaking detail. Here, I’ll just try to explain the aspects of PageRank as they relate to URLs. Note also that my explanations are simply meant to be a conceptual guide and not exacting details. The founders of Google did publish their initial algorithms but have since made tweaks that are as closely held a corporate secret as the formula for Classic Coke!

Popularity is the key

PageRank is essentially a popularity rating, and a page’s PageRank is determined by the inbound links from other pages on the web. A PageRank can be as low as almost zero (0) or as high as ten (10). Google’s algorithms determine a page’s PageRank by taking the PageRank value of each inbound linking page, dividing it by the number of outbound links on that page, and then summing the results across all inbound links. Clear as mud, right? It’s easier to explain with an example, but let’s cover a bit more ground first.

Like voting company shares

PageRank considers each link a ‘vote’ for the page linked to. But unlike in a democratic “one citizen, one vote” society, Google’s algorithm more closely models the shareholders of a corporation voting their shares; the votes of those with “more” (PageRank or shares) have a greater influence on the outcome. So a link from a page with a PageRank of 7 is more valuable than a link from a page of PageRank 3; probably many orders of magnitude more valuable, as you’ll see next.

The old 80/20 rule, on steroids

Because of the nature of the web, a small number of pages have a huge number of inbound links, and vice versa. So those with more links get more PageRank, but PageRank is on a logarithmic scale, so the underlying value increases exponentially with each step. Assuming[1] that the base were five (5), the value a page would get to vote based on its PageRank would look like this:

PageRank   Value
0          0
1          5
2          25
3          125
4          625
5          3,125
6          15,625
7          78,125
8          390,625
9          1,953,125
10         9,765,625

An example:

Assume a site somehow manages to get a persistent link from MySpace’s home page (www.myspace.com). At the moment MySpace’s home page contains about 70 outbound links and has a PageRank of seven (7). Let’s also assume that there are a total of 50 other inbound links, and let’s say the average PageRank for the pages linking in is three (3) and those pages have an average outbound link count of 10. From this, let’s calculate PageRank:

MySpace’s Available PageRank per outbound link:
78,125 / 70 => 1,116
PageRank value contributed by 50 other sites:
125 * 50 / 10 => 625
Total PageRank value:
1,116 + 625 => 1,741

Looking it up in the table, the resultant PageRank for the home page is four (4).
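For the curious, here is the same back-of-the-envelope arithmetic as a little PHP function. To be clear, this is only the toy model used in this post, with an assumed logarithmic base of five; it is not Google’s actual (and secret) formula:

<?php
// Toy model only: an assumed logarithmic base of 5, NOT Google's real formula.
define('PR_BASE', 5);

// The "voting value" a linking page passes along, split across its outbound links.
function link_value($page_rank, $outbound_links)
{
   return pow(PR_BASE, $page_rank) / $outbound_links;
}

// Convert an accumulated value back into a 0..10 PageRank.
function resultant_pagerank($value)
{
   return min(10, floor(log($value, PR_BASE)));
}

$value  = link_value(7, 70);        // one link from MySpace's PR7 home page
$value += 50 * link_value(3, 10);   // plus 50 links from average PR3 pages

echo resultant_pagerank($value);    // 4
?>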

The Three ‘P’s of Inbound Links

As with the three ‘L’s of real estate, the three ‘P’s of inbound links are: PageRank, PageRank, PageRank! [2] Note how in the prior example the 50 inbound links from PR3 pages offered less PageRank than the one (1) inbound link from MySpace with PR7! Of course we don’t know the logarithmic base [1], but Phil Craven says 5 or 6 are what many people believe it to be.

Here is what it would look like with base two (2) through ten (10) (download the full calculations here as a zipped Excel 2003 file [4kb]):

Logarithmic Base   Value from PR3 * 50 / 10   Value from PR7 * 1 / 70   Total Value   Resultant PageRank
2                  40                         2                         42            5
3                  135                        31                        166           4
4                  320                        234                       554           4
5                  625                        1,116                     1,741         4
6                  1,080                      3,999                     5,079         4
7                  1,715                      11,765                    13,480        4
8                  2,560                      29,959                    32,519        4
9                  3,645                      68,328                    71,973        5
10                 5,000                      142,857                   147,857       5

So depending on the logarithmic base, PageRank fluctuates between four (4) and five (5) for this hypothetical example. However, starting with a logarithmic base of five (5), the one MySpace link overpowers the 50 others! And because pages with a PageRank closer to 10 are listed higher among competing pages in Google’s search engine results, people focused on SEO are always trying to increase their pages’ PageRank, often via unscrupulous means.

Of course nobody outside Google knows the exact formula or base exponents used, but hopefully this post illustrates the value of links from high PageRank pages.

Don’t game the system

However, I would be remiss if I didn’t point out that a single-minded focus on inbound links is fraught with peril, not least because it might cause your pages to be removed from Google’s index! Just as there are people selling weight loss products they claim don’t require dieting or exercise, there are people offering ways to get inbound links that don’t require having real people link to you. However, Google considers these shortcuts to be gaming the system and is ever vigilant to discover those cheaters. If caught cheating, Google will ban your pages from their index without notice.

The best way to gain inbound links for your key pages on your website is to do the hard work of creating a site with great content that people want to link to.

Architecture Matters

So as an epilogue, getting inbound links is clearly necessary for high PageRank and thus good search engine results, but all those inbound links can be squandered without a good architecture and site management plan. To ensure that a site’s great content and popularity get reflected in appropriately high search engine ranking it’s critical to optimize the architecture of the site, the pages, and the URL structure as well as make plans for how the URL structure might change over time.

The most under-realized aspect of SEO

Personally, I think the most under-realized aspect of white hat SEO [3] is the lack of attention paid to URL planning and design, especially for larger websites. There are very few tools [4] besides the low-level and effectively simplistic URL rewriters like mod_rewrite for creating and maintaining a URL plan, very few articles [4] that discuss URL design, and no articles [4] I am aware of that discuss URL planning.

However, I believe website owners will see huge improvements compared to their prior rankings if they focus on URL design and create a URL management plan. The good news is that URL design is mostly a one time endeavor assuming site maintainers adhere to the management plan, at least until there is a full site rearchitecture.

But all of the whys and wherefores regarding URL planning and design are beyond the scope of this post, and instead will be the subject of many posts in the future. Stay subscribed!

For Further Research

And, as I stated at the start of this post, you can learn more about the PageRank formula here, and you can also google for PageRank to get a large list of other resources.

Footnotes

  1. But remember it’s a secret, so we can’t know for sure.
  2. There’s more to search engine ranking than just PageRank, like applicable content, but PageRank differentiates pages that compete for the same keywords.
  3. To those SEO-haters of the world, please note that I’m referring to those things that you can do with pure white hat techniques, things that if not done can result in a great site being given less credit by the search engines.
  4. Over time, I plan to address the lack of such articles and tools for URL planning and design.


Posted in All Organizations, Basic, Framework Developers, Hosting Providers, Infrastructure Providers, Open-Source Participants, Overview, Search Engines, SEO, SEO Consultants, Teachers, URIs, Web App Providers, Web Developers | 8 Comments

URLQuiz #1: To .WWW or not to .WWW?

As promised, this is the first of what will be many URLQuizzes here at the blog for The Well Designed URLs Initiative. This URLQuiz discusses the convention of using a subdomain with the name ‘www’ to identify a website.

As most everyone knows, many of the first sites on the web started using this convention. Examples include  www.amazon.com, www.yahoo.com, www.google.com, and www.ebay.com. However, there is nothing about the web that requires a subdomain be named ‘www‘ when selecting the address for a website. To the contrary, many websites use other subdomains for prefixes such as:

There is even a passionate contingent of web developers that believes the ‘www’ convention is an anachronism and should be deprecated (or ‘eventually abolished’, in layman’s terms.)

So how should the base domain and subdomain(s) be handled, and what are the pros and cons of each? Here are the options I’ve identified, but feel free to suggest others that come to mind as well:

  1. Establish the ‘www’ form as the implicit canonical form and issue a 404 – Not Found whenever an inbound request attempts to dereference a URL using the root domain (i.e. without ‘www’ or any other subdomain.)
  2. Establish the non-’www’ form as the implicit canonical form and issue a 404 – Not Found whenever an inbound request attempts to dereference a URL using the ‘www’ subdomain.
  3. Establish the ‘www’ form as the implicit canonical form and use a 301 – Moved Permanently (redirect) whenever an inbound request attempts to dereference a URL using the root domain (i.e. without ‘www’ or any other subdomain.)
  4. Establish the non-’www’ form as the implicit canonical form and use a 301 – Moved Permanently (redirect) whenever an inbound request attempts to dereference a URL using the ‘www’ subdomain (see the sketch after this list.)
  5. Do not establish a canonical form and return 200 – Ok for both the ‘www’ form and the non-’www’ form.
  6. Abandon both the ‘www’ form and the non-’www’ form and always use explicit subdomains based on your site organization, like in the examples shown above.
  7. Some combination of 1 through 6 I haven’t already described.
  8. Or, something completely different?
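For what it’s worth, the 301 options don’t require anything exotic. Here is a minimal PHP sketch of option 4 (the non-’www’ form as canonical); option 3 is simply the mirror image, and in practice you would more likely do this in your web server or URL rewriter configuration:

<?php
// Minimal sketch of option 4: treat the non-'www' form as canonical and
// permanently redirect any 'www' request to it.
$host = strtolower($_SERVER['HTTP_HOST']);

if (substr($host, 0, 4) == 'www.') {
   header('HTTP/1.1 301 Moved Permanently');
   header('Location: http://' . substr($host, 4) . $_SERVER['REQUEST_URI']);
   exit;
}
?>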

So there you go; give your answer(s) in the comments. Though I definitely have my opinions on the subject I will stay out of it unless I don’t see anyone mentioning several of the points I think are relevant. After enough comments come in, I’ll summarize and write a follow up post, just like Dan Cederholm did with SimpleQuiz.

Hint: You might want to consider not only online usage but offline usage as well.

UPDATE: Just days after writing this post Tim Bromhead wrote: Which is better for your site: www or no www?  Is that weird or what? Tim must have had some kind of a Vulcan Mind Meld or similar going on… Anywho, great article Tim and thanks for being a URLian!

UPDATE#2: Looks like I picked the right time to discuss this issue! A few days ago Scott Hanselman talked about the downside of ignoring the distinction between ‘www’ and the root domain, Jeff Atwood discussed how to solve it, and Phil Haack then responded with a bit of a rant about the www or lack thereof. Since they have such strong yet opposite opinions on the subject, maybe we can get both Jeff and Phil to weigh in on the subject over here…?

Technorati Tags: URL Design | Subdomains | Canonical Form | www | no-www

Posted in Framework Developers, Internet Professionals, Open-Source Participants, Search Engines, SEO Consultants, Standards Participants, URLQuiz, Web App Providers, Web Designers, Web Developers | 15 Comments

PayPal’s New API: So Close, Yet So Far

I got an email from the PayPal Developer Network today announcing PayPal’s new “NVP” (or “Name-Value Pair“) API. Clearly they’ve learned that the complexity of SOAP is counter productive to adoption. Here’s what the email had to say about their new API:

NVP Is Your Integration MVP
We’re proud to announce that PayPal’s Name-Value Pair API has launched. Complex SOAP structure is now gone. All API methods are supported, except for AddressVerify. Get exclusive sample code – download two SDKs (Java and .NET). Get Details

Taking a look at their examples (in .ASP, .PHP, or ColdFusion) and their SDKs (for Java and ASP.NET [v1.1]) it’s nice to see they are using POST instead of GET. The following is one of their functions from their PHP examples (CallerService.php) that illustrates how their code is calling their NVP API (I edited for line-length only):

function hash_call($methodName,$nvpStr)
{
   //declaring of global variables
   global $API_Endpoint,$version,$API_UserName,
          $API_Password,$API_Signature,$nvp_Header;

   //setting the curl parameters.
   $ch = curl_init();
   curl_setopt($ch, CURLOPT_URL,$API_Endpoint);
   curl_setopt($ch, CURLOPT_VERBOSE, 1);

   //turning off the server and peer verification
   //(TrustManager Concept).
   curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
   curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
   curl_setopt($ch, CURLOPT_POST, 1);

   //if the USE_PROXY constant is set to TRUE in Constants.php,
   //then the proxy will be enabled.
   //Set proxy name to PROXY_HOST and port number to
   // PROXY_PORT in constants.php
   if(USE_PROXY)
      curl_setopt ($ch,
                        CURLOPT_PROXY,
                        PROXY_HOST.":".PROXY_PORT);

   //NVPRequest for submitting to server
   $nvpreq= "METHOD=".urlencode($methodName).
            "&VERSION=".urlencode($version).
            "&PWD=".urlencode($API_Password).
            "&USER=".urlencode($API_UserName).
            "&SIGNATURE=".urlencode($API_Signature).$nvpStr;

   //setting the nvpreq as POST FIELD to curl
   curl_setopt($ch,CURLOPT_POSTFIELDS,$nvpreq);

   //getting response from server
   $response = curl_exec($ch);

   //converting NVPResponse to an Associative Array
   $nvpResArray=deformatNVP($response);
   $nvpReqArray=deformatNVP($nvpreq);
   $_SESSION['nvpReqArray']=$nvpReqArray;

   if (curl_errno($ch)) {
        // moving to display page to display curl errors
        $_SESSION['curl_error_no']=curl_errno($ch) ;
        $_SESSION['curl_error_msg']=curl_error($ch);
        $location = "APIError.php";
        header("Location: $location");
    } else {
        //closing the curl
        curl_close($ch);
     }

return $nvpResArray;
}

Much nicer and simpler than having to go to all the effort of setting up a SOAP call.

Unfortunately, PayPal missed a huge opportunity to make their new API fully RESTful. Instead of designing a URL-centric REST interface (with a hypermedia component to keep the purists or the pure RESTafarians happy), they instead tunneled method calls over HTTP!  They used methods like “DoDirectPayment” and “RefundTransaction”. Sheesh! (Note: these links to their methods load slower than any website I can remember visiting in ages, and while loading, my browser does lots of clicking. What the heck is going on in there I have no idea! You can go to the main docs via a much faster downloading PDF here. Wow, if that isn’t usually an oxymoron!)
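Just to illustrate the difference rather than to put words in PayPal’s mouth, here is a sketch of what a resource-oriented alternative could have looked like. The endpoint and fields below are entirely invented for illustration and are not part of PayPal’s actual API:

<?php
// Purely hypothetical illustration of a resource-oriented design; this URL
// and these fields are invented and are NOT PayPal's actual NVP API.
$ch = curl_init('https://api.paypal.example/payments');   // a payments *collection*
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
   'amount'   => '10.00',
   'currency' => 'USD',
   'card'     => '4111111111111111'
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
curl_close($ch);

// Ideally the response would identify the new payment's own URL, e.g.
// https://api.paypal.example/payments/12345, which you could then GET to
// check its status, or POST to .../payments/12345/refunds to refund it,
// instead of calling method names like DoDirectPayment and RefundTransaction.
?>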

Though much easier than calling SOAP, clearly not very RESTful.  Three steps forward, and two steps back. Sigh…

Posted in Commentary, Framework Developers, Leading Companies, REST, SOAP, SoapBox, Web Developers, Website Owners | 2 Comments

Use rel=”spam” to Fight Comment Spam?

As I was going through my Akismet spam filter today, reviewing the 87 spam comments I got during the prior ~24 hours to ensure I didn’t delete any legitimate comments, it occurred to me that maybe there is a simple solution to comment spam.

What if blog apps could simply mark a hyperlink with:

rel=”spam”

The simple idea is that rather than delete spams, blogs could start maintaining a special page of links to comment spammers’ websites using rel=”spam” on the “A” element. Basically this would be PageRank in reverse. The search engines would then apply negative weighting to anything marked spam and give the spammers the exact opposite of what they were pursuing when they unethically tried to game the system!
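Here’s a trivial sketch of what such a page might look like if it were generated from a blog’s spam queue; the list of spammers is of course just made-up data:

<?php
// Sketch only: render a "wall of shame" page from a blog's spam queue so
// that search engines could apply negative weight to rel="spam" links.
$spammers = array(   // made-up example data
   'http://cheap-meds.example/'   => 'cheap meds',
   'http://fake-watches.example/' => 'replica watches'
);

echo "<ul>\n";
foreach ($spammers as $url => $link_text) {
   echo '<li><a rel="spam" href="' . htmlspecialchars($url) . '">' .
        htmlspecialchars($link_text) . "</a></li>\n";
}
echo "</ul>\n";
?>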

For example, Google could give negative PageRank for a spam link compared to positive PageRank for a non-spam link. Google could also weight the relevancy of the link text negatively vs. the positive value it would give a non-spam link. This would have the effect of distributing the watch-dogging of spammers out onto the web without requiring any new infrastructure, and it would create a clear disincentive for comment spammers instead of the lack of disincentive from “nofollow.”

Are there problems with this I’m not foreseeing?  Probably.  I already know that people would try to game the system for negative purposes, and that’s to be expected. Still, I think that for the most part anyone simply using it to settle a grudge or in an attempt to harm a competitor would by definition be doing it on such a small scale that it would have no real effect. Given that many comment spammers automate, they can end up with huge numbers of comment spam links. Even if the search engines merely weighted a spam link at 1/10th the negative value of a positive link, it would certainly still be effective.

Of course the hard-core Linux faithful would immediately spam-link to Microsoft.com just to spite them! But I really don’t (currently) see why that couldn’t be detected and managed via policies and algorithms. For example, if a company has a large number of positive links it could be exempt from the effects of spam links. And I’m sure automated methods or methods using collective intelligence could emerge to resolve these problems the vast majority of the time. The rest could be handled via policy: get caught spam-linking someone inappropriately and get your domain pulled from the index!

What’s more, it would give bloggers a sense of purpose when they review their spam filters instead of them feeling like the time spent was just a waste. I know that if my efforts to detect comment spammers could get them lower PageRank, I’d feel good about monitoring my comments for spam as I would be doing a service for the public good. And I’m sure most other bloggers would feel the same.

Now I know that Microformats.org has the similar proposal VoteLinks, but that is about registering opinion as opposed to calling out gamers of the system. VoteLinks is also much broader than what I’m suggesting.  If we keep the focus really narrow (shine a spotlight on spam so that the search engines can eradicate it), then I’m pretty sure it would be a success.

What do you think?  Good idea?  Filled with holes I’ve not considered?  I look forward to your feedback.

Posted in Everyone, Framework Developers, Proposing, Search Engines, SEO Consultants, SoapBox, Standards Participants, Web App Providers, Web Developers | 9 Comments

URLs for Multilingual Web Sites

Another URLian has appeared: Brad Fults. Brad just added himself to our wiki and became a signatory; thanks Brad! Better yet, on his user page on our wiki he linked to his post Designing URLs for Multilingual Web Sites; excellent job Brad!

That was a subject I’d been planning to write about for a while, and I’ll probably cover the issue here on the WDUI Blog in the future to further the conversation, but I doubt I could have done as good a job as Brad did for my first post on the subject.

One option he did not cover was using language in filenames, such as #1 – an extension:

example.com/bar/baz.en-US
example.com/bar/baz.en-GB
example.com/bar/baz.de

Or a #2 – suffix to an extension (note I had to add an .html extension for this option):

example.com/bar/baz.html.en-US
example.com/bar/baz.html.en-GB
example.com/bar/baz.html.de

Or as an #3 – extension prefix (also needed an .html extension):

example.com/bar/baz.en-US.html
example.com/bar/baz.en-GB.html
example.com/bar/baz.de.html

Or as an #4 – filename prefix:

example.com/bar/en-US.baz
example.com/bar/en-GB.baz
example.com/bar/de.baz

Or #5 – one level up in the path:

example.com/bar/en-US/baz
example.com/bar/en-GB/baz
example.com/bar/de/baz

Unlike Brad, I didn’t provide an evaluation of these simply because I haven’t researched the subject enough at this time. Maybe he can do a follow up post providing an evaluation of each of these.

However, I can say I don’t really like any of these options, nor do any of the options Brad provides sit well with me, except possibly his “Modified Directory Structure (#2)” combined in creative ways with his “Use of Accept-Language HTTP Header (#6)”, the latter a.k.a. Content Negotiation. Whatever the case, there will be more on this subject in the future, I’m sure, and it’s good to have this discussion taking place.
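To give a rough idea of what combining his #2 and #6 could look like, here is a minimal sketch that inspects the Accept-Language header and redirects a language-less URL to one with a language segment in the path. Real content negotiation (e.g. Apache’s MultiViews) is considerably more thorough than this, and the matching below is deliberately crude:

<?php
// Minimal sketch: send a language-less request to a best-guess language path
// based on the Accept-Language header. The supported languages are assumed.
$supported = array('en-us', 'en-gb', 'de');
$default   = 'en-us';

$accepted = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
          ? strtolower($_SERVER['HTTP_ACCEPT_LANGUAGE']) : '';   // e.g. "de-DE,de;q=0.8,en;q=0.5"

$choice = $default;
foreach ($supported as $lang) {
   if (strpos($accepted, $lang) !== false) { $choice = $lang; break; }
}

header('HTTP/1.1 302 Found');
header('Location: /' . $choice . $_SERVER['REQUEST_URI']);   // e.g. /de/bar/baz
exit;
?>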

Posted in Champions, Commentary, International, Internet Professionals, Website Owners | 1 Comment

Which is Worst: the URL for IE7 Add-ons, Firefox Extensions, or Greasemonkey?

I am working on a project that had me writing about browser plug-ins, and I needed to link to the main page for Microsoft’s Internet Explorer Add-ons, for Firefox’s Extensions, and lastly for Greasemonkey for Firefox.

I actually looked up those three in the opposite order than I have them listed above. Greasemonkey’s URL was pretty good, although it’s a shame it’s not greasemonkey.com/.net/.org; the .com resolves to a 403 forbidden page, the .org resolves to a list of advertising links, and the .net resolves to Grease Monkey International, a franchiser of automotive preventive maintenance centers! Whatever the case, I feel pretty good that this URL is going to have really good persistence. It should be around at least as long as Greasemonkey is relevant, if for no other reason than to return a 301:

http://greasemonkey.mozdev.org/

The second URL, for Firefox extensions, was not so good, but I still think there’s a pretty good chance it will still resolve a year from now:

https://addons.mozilla.org/extensions.php?app=firefox

Then there is Microsoft’s horrific URL for Internet Explorer Add-ons.  What were they thinking?  I’ll bet this URL doesn’t resolve three months from now, let alone in a year or five:

http://www.windowsmarketplace.com/category.aspx?bcatid=834&tabid=1&WT.mc_id=0107_20

URLs like this one from Microsoft are a crying shame. Sadly, Microsoft is one of the few companies that can get away with this without being negatively affected. On the other hand, most companies haven’t a clue how badly URLs like this can affect them.

That said, I’d love to get your input:

  1. Why is Microsoft’s URL so bad?  Help me find and explain all the reasons why companies should care not to be so careless when designing their URLs. Why is it bad for users, and why is it bad for Microsoft?
  2. Design the Ideal URLs. Assume you have no constraints at all – no badly designed content management system and no inflexible server technology — and suggest the ideal URL for each of the above three resources. Heck,  you can even change domain names if you want to. So what would be the best URLs for each of the three above?

Posted in Internet Professionals, Reader Input, URLs that Suck, Website Owners | 8 Comments

Bitten by the URI Opacity Axiom


Jon Udell has a post today entitled Divergent citation-indexing paths. Funny that he wrote about this; it seems he and I are on such a parallel trajectory these days. For evidence, take a look at my post from last week that I titled Lessons Learned from Delicious Praise.

In his post Jon states “Del.icio.us, unlike Bloglines, treats the URLs that you feed to its citation counter in a case-sensitive way.” I wonder if Jon is aware of Tim Berners-Lee’s URI Opacity Axiom?  A correct reading of the URI Opacity Axiom would reveal that Bloglines is in the wrong and Del.icio.us is in the right in this case; i.e. URLs are case-sensitive and programs SHOULD NOT[1] infer that two URLs with different casing point to the same resource. (NOTE: case doesn’t matter with domain names but it DOES matter in URL paths.)

As a matter of fact, the URI Opacity Axiom is such a closely held belief among those I like to call “the Weborati” that if you even question it so as to understand it on certain W3C or related mailing lists, you’ll be in for a firestorm as if you blasphemed the messiah! ;-0

All kidding aside, especially since some of those people who hold the URI Opacity Axiom dear read this blog (!), after spending the time to research it and really learn it I came to believe that it is a very good idea for people to follow the URI Opacity Axiom. And I’ll discuss why in the future when I have more time. Unfortunately, like many principled concepts, some people have elevated the URI Opacity Axiom to the level of dogma, and many of those who preach it preach a distortion of what it really means.

So Jon identified a real-world problem that following the URI Opacity Axiom introduces, yet it is somewhat of a “catch-22”; following the axiom creates real-world problems, but not following it creates other real-world problems. But longer term, I really don’t think it has to be this way, and I’m working on ideas to address this issue that may turn into draft proposals or recommendations or something else. Basically I think that with some “layers” of technology added to the web we could have the best of both worlds.

As an aside, Del.icio.us could update links based upon 301 redirects, and then website owners could 301 redirect to a(n ideally lowercased) canonical URL whenever their server receives a request for a URL that is not in the canonical format. This assumes of course that the website owner/server operator has chosen for their URLs to effectively be case-insensitive[2].
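Here is a sketch of that idea for a site whose owner has deliberately decided its canonical URL paths are all lowercase (again, only appropriate if that choice has actually been made):

<?php
// Sketch: 301 any mixed-case request path to its lowercased canonical form so
// that services like Del.icio.us could, in theory, consolidate their counts.
// A real implementation would also preserve the query string, omitted here.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if ($path !== strtolower($path)) {
   header('HTTP/1.1 301 Moved Permanently');
   header('Location: http://' . $_SERVER['HTTP_HOST'] . strtolower($path));
   exit;
}
?>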

  1. I use the “SHOULD NOT” in the same way RFC 2119 defines the use of the uppercased term.  
  2. In my opinion, running a website with case-sensitive URLs either means the web developer just wasn’t thinking or that they don’t have a clue about the effect case-sensitive URLs have on website usability. Ah, but that’s another subject for another day. :)
Posted in Axioms, Commentary, Internet Professionals, SoapBox, URIs | 2 Comments

Best Practice: Always ID your Heading Tags

Here’s a simple best practice. Always ID your heading tags! For example, if you’ve got an <h2> element, be sure to make it <h2 id="some-heading">.

IDing heading tags is especially important on long documents.

Why? Because if you don’t, someone else can’t reference the part of the document that they want to reference in a blog post or somewhere else. And if they can’t, they just might reference someone else’s web page instead. Or if they do reference it, readers who click over to your URL might give up reading before finding the appropriate section, and never come back to your site when they might otherwise have become an avid reader. How often have you seen a link to a web page where the person linking included the text “Scroll down to the section entitled…“  Blech!

Given the heading tag mentioned in the first paragraph above, and assuming it was contained in a document entitled “whitepaper” in the root of www.foo.com, you can point straight to that heading using a URL fragment like so:

http://www.foo.com/whitepaper#some-heading

Ben Coffey talks about this same problem over at URLs for Specific Portions of Documents.  He also talks about CiteBite, which helps bloggers and others link directly into a part of a document as if there had been an ID there. But publishers, if others start using CiteBite on your content simply because you don’t include the ID attributes they need to link to you directly, guess who will get the Google PageRank?  Not you… ;-)

One more thing. If you are creating content that will be displayed above or below other content, i.e. blog posts that get listed with other blog posts on the same HTML page, you’ll need to make sure your IDs are unique. I personally have started using a convention that appends the date in “YYYYMMDD” format to the end of a meaningful fragment, separated by a dash, as in:

http://www.foo.com/whitepaper#some-heading-20070118

This tends to work for me because I almost never post more than once per day. Also, though I personally dislike the inclusion of dates in URLs because of how difficult it makes things for users to remember or discover the URLs, having the date as a fragment suffix is not quite as bad. People using the browser’s URL auto-complete can still easily find a URL they visited recently enough that it’s still in the browser’s history. YMMV.
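If you generate your headings programmatically, the convention is easy to automate. A quick sketch; the slug rules here are just my guess at something reasonable:

<?php
// Sketch: build an ID like "some-heading-20070118" from a heading's text and
// its post date, per the convention described above.
function heading_id($heading_text, $post_date)
{
   $slug = strtolower(trim($heading_text));
   $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);   // collapse everything else to dashes
   $slug = trim($slug, '-');
   return $slug . '-' . date('Ymd', strtotime($post_date));
}

$id = heading_id('Some Heading', '2007-01-18');
echo '<h2 id="' . $id . '">Some Heading</h2>';   // <h2 id="some-heading-20070118">…
?>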

Lastly, if you are going to ID your heading tags, you probably should also create a table of contents. ;-)

Posted in Best Practices, Call to Action, Publishers, Teachers, Web Designers, Web Developers, Website Owners | 5 Comments