Jun 11, 2007

Canonical Web URLs using the Apache Rewrite Engine

With current browsers and typical ISP web hosting setups, surfers can reach a web site by typing either www.example.com or example.com into their browser’s URL field. Prevailing wisdom in the SEO community is that web administrators should redirect all visits to one form or the other. The argument is that if your site’s visitors split half and half in their choice of URL, Google and other page rankers may see the two variants as separate sites and divide your page visits between them (or even drop one set of pages as duplicates). Matt Cutts from Google recommends canonicalization as a good idea, but doesn’t say what problems arise if you skip it. It seems to me that redirecting all visitors to a canonical URL has no downside and several benefits, so I decided to do it.

Using Apache’s Rewrite Engine to Force the use of one Server Name

We can consolidate all visitors to a single, canonical form of the server name portion of our URL using the Apache rewrite engine. The Apache documentation’s Rewriting Guide presents example code related to this case:

   RewriteCond %{HTTP_HOST}   !^www\.domain\.name [NC]
   RewriteCond %{HTTP_HOST}   !^$
   RewriteRule ^/(.*)         http://www.domain.name/$1 [L,R]

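I tried this code on two different ISP accounts and could not get it to work. After checking all sorts of blind alleys, I realized that none of my other rewriting rules started with a slash, so I took the leading slash out of the rule pattern (changing ^/(.*) to (.*)) and the rule started working. The example in the Apache manual is not wrong; it is written for the main server configuration, where the pattern is matched against the full URL path including its opening slash. In a per-directory .htaccess file, like the ones on my two ISP accounts, the engine strips the directory prefix before matching, so the pattern never sees that slash.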
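If you want one rule that behaves the same in both places, making the slash optional is a common workaround. This is a sketch of the manual's example with only the pattern changed; domain.name is still a placeholder you would edit by hand:

   # ^/? matches with or without the leading slash, so the rule
   # works in httpd.conf and in a .htaccess file alike:
   RewriteCond %{HTTP_HOST}   !^www\.domain\.name [NC]
   RewriteCond %{HTTP_HOST}   !^$
   RewriteRule ^/?(.*)        http://www.domain.name/$1 [L,R]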

Although the example now worked, it required hand-editing for each new site. So I wrote the following rule set which can be inserted into any root .htaccess file:

   # Force top-level "domain.com" requests to "www.domain.com":
   RewriteCond %{HTTP_HOST}   !^www\. [NC]
   RewriteCond %{HTTP_HOST}   ^[^.]+\.(com|edu|net|org)$ [NC]
   RewriteRule (.*)           http://www.%{HTTP_HOST}/$1 [R=permanent,L]
   # NOTE: Other rules must follow the www rule:
   RewriteRule ^sitemap\.xml$ sitemap.php [NC]

The second rewrite condition restricts the rule to bare two-part domain names such as example.com, which keeps intranet, subdomain, and localhost requests from being rewritten. The R=permanent flag forces an external redirect with status 301 (Moved Permanently) instead of the default 302. The L (last) flag tells the engine to stop processing rules and return the redirect at this point. The browser receives the 301 response and re-requests the page using the canonical form. That canonical request then passes through the remaining rewriting rules, which is why they should always follow the renaming rule.
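To make the two conditions concrete, here is how I would expect the rule set to treat a few sample Host headers. The host names are illustrative, and the trace is worked by hand from the patterns, not captured from a server:

   # Host header        !^www\.    ^[^.]+\.(com|edu|net|org)$   Outcome
   # example.com        true       true                         301 to http://www.example.com/...
   # www.example.com    false      (not checked)                no redirect
   # dev.example.com    true       false                        no redirect
   # localhost          true       false                        no redirect
   # example.co.uk      true       false                        no redirect

Conditions are ANDed, so the redirect fires only when both are true. The last row shows the price of the hard-coded list: domains under other suffixes are left alone unless you extend the pattern.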

To Prefix or Not to Prefix, That is the Question

My final rule set above chooses the “www” variant as the canonical form. However, the WWW is Deprecated folks advocate choosing the non-prefixed form — “use of the www subdomain is redundant and time consuming to communicate. The internet, media, and society are all better off without it.”

Excellent point, but small sites must follow the real-world trendsetters in matters like this. A quick survey shows that yahoo.com, google.com, and msn.com all choose “www” as their canonical URL format.
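For anyone who sides with the no-www camp, the same approach can be turned around. The following is a minimal sketch, not a tested recipe; it assumes a root .htaccess file like the one above and uses %1, which refers to the group captured by the last matched RewriteCond:

   # Force "www.domain.com" requests to "domain.com":
   RewriteCond %{HTTP_HOST}   ^www\.(.+)$ [NC]
   RewriteRule (.*)           http://%1/$1 [R=permanent,L]

Use one rule set or the other, never both, or each will redirect to the other in an endless loop.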

posted in Apache by Bozzie

1 Comment to "Canonical Web URLs using the Apache Rewrite Engine"

  1. Keith wrote:

    Too many slashes? I got this recipe to work by removing extraneous slashes, viz:

    # Force “example.com” requests to “www.example.com”:
    RewriteCond %{HTTP_HOST} !^www\. [NC]
    RewriteCond %{HTTP_HOST} ^[^.]+\.(com|edu|net|org)$ [NC]
    RewriteRule (.*) http://www.%{HTTP_HOST}/$1 [R=permanent,L]

    Cheers.

 