More and more site owners are concerned that they might get penalized accidentally or overtly because of duplicate content. Last month Greg Grothaus of the Search Quality Team at Google gave a talk at the Search Engine Strategies San Jose conference on the subject.
The session looked at the issues and explored solutions for people running mirror sites or having multiple listings that are similar in nature. It also asked what happens if you syndicate content through RSS and other feeds. Will other sites be considered the “real” site and rob you of a rightful place in the search results?
Greg began by clearing up the popular myth that Google penalizes sites for having duplicate content. This is not the case. That’s not to say that duplicate content can’t have a negative impact on your rankings, but you are not being penalized by Google itself.
Greg says that when people see messages like the one below, they may think their content is being omitted from Google’s results altogether, when in fact it may only be omitted for that particular query.
In order to show you the most relevant results, we have omitted some entries very similar to the 20 already displayed.
If you like, you can repeat the search with the omitted results included.
“What’s actually happening, is that we’re looking at the query that the user’s doing, and we’re saying that we want diversity in the results we’re going to show a user,” says Grothaus. He says those who think their content is being omitted because it is duplicate will likely find that it shows up in the results after all if they adjust their query to reflect the missing page more specifically.
Google recognizes that most of the time duplicate content doesn’t exist to deceive people. There are of course exceptions, which are considered spam. However, you may find it surprising that Grothaus says even spam sites aren’t being penalized for having duplicate content. Spam sites are being penalized for being spam. Some spammers use bold tags, he says, but Google doesn’t penalize people just for using them, and it doesn’t penalize people just for having duplicate content either.
A classic case of duplicate content within the same site is a page getting indexed both with and without the www. prefix, e.g. www.yoursite.com and yoursite.com. Another example is a shopping site where items can be reached via different routes, e.g. www.myshop.com/category/item and www.myshop.com/item. In both examples Google sees the same page content on different URLs, i.e. duplicate content.
Google will try to pick the right URL to use but sometimes they pick the wrong one.
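To make the idea of “the same page content on different URLs” concrete, here is a minimal Python sketch that groups URLs by a hash of their page body. It only catches byte-for-byte identical pages and is not a description of how Google detects duplicates; the function name find_duplicate_urls and the sample pages are assumptions made up for illustration.

```python
import hashlib
from collections import defaultdict


def find_duplicate_urls(pages):
    """Group URLs whose page bodies are exactly identical.

    `pages` maps a URL to the page body (a string). This is a
    simplification: real pages often differ in small ways
    (timestamps, session IDs) and would need fuzzier matching.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Keep only the bodies that are served from more than one URL.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl output for the shop example above.
pages = {
    "http://www.myshop.com/category/item": "<h1>Item</h1>",
    "http://www.myshop.com/item": "<h1>Item</h1>",
    "http://www.myshop.com/about": "<h1>About</h1>",
}
duplicates = find_duplicate_urls(pages)
```

Running something like this over a crawl of your own site is one way to spot pages that are reachable by more than one route.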
You will not be penalized for using more than one URL but in doing so there are some issues that may negatively affect your rankings. Firstly, an important part of SEO for your site is to have other sites linking to it. If your content can be accessed on multiple URLs then it stands to reason that people linking to your site will end up using different URLs.
Backlinks pointing to several different URL versions of the same content will spread link juice across those URLs, diluting the effectiveness of the links. Greg says that unfriendly URLs in search results may offset branding efforts and decrease usability as well. Plus, with multiple versions of the same thing, Google will spend more time crawling the same content, meaning it will have less time to go deeper into your site, and you run the risk of some content never getting indexed.
The Solutions
The obvious solution is to ensure that each page on your site can only be accessed by one URL by linking consistently within your site.
For sites with dynamic URLs, or to solve the www versus non-www problem, you could use Apache’s mod_rewrite module and an .htaccess file to 301 redirect pages to your preferred URL (a whole other topic). In addition, Google’s Webmaster Tools lets you specify whether you prefer the www. or the non-www. version of your site. 301 redirects are also commonly used when moving sites.
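The decision behind a www versus non-www redirect is simple enough to sketch in Python. This is only an illustration of the logic, not Apache configuration; on Apache it would normally live in mod_rewrite rules. The preferred host www.yoursite.com and the function name redirect_target are assumptions, and the sketch ignores non-default ports.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the site owner prefers the www. version.
PREFERRED_HOST = "www.yoursite.com"


def redirect_target(url, preferred_host=PREFERRED_HOST):
    """Return the URL to 301 redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # lowercased, without any port
    if host == preferred_host:
        return None  # already on the preferred URL
    # Work out the bare domain so both variants can be recognized.
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        # Same site, wrong variant: keep scheme, path and query, swap the host.
        return urlunsplit(
            (parts.scheme, preferred_host, parts.path, parts.query, parts.fragment)
        )
    return None  # some other site entirely; leave it alone


target = redirect_target("http://yoursite.com/page?id=1")
```

A server would answer the request with status 301 and a Location header set to the returned URL, so visitors and search engines both end up on the preferred version.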
You can also use the rel="canonical" link element in your page headers. The canonical tag is supported by Google, Yahoo and Microsoft. Google gives the following example on the Webmaster Central blog:
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
Simply place this link tag in the head section of the duplicate content URLs. There are rules for the rel="canonical" link element to consider.
It should be used between pages that are on the same domain, although it does work across subdomains. For example, blog.mysite.com could suggest www.mysite.com as a canonical URL. It doesn’t work across domains, so www.mysite.com couldn’t suggest www.myothersite.com. Both absolute and relative links are acceptable, but the search engines recommend absolute links. Also, links pointing to any of the duplicate URLs will be credited to the one preferred URL. You can use the element across protocols, such as http:// vs. https://, and you can use it across ports. Pages don’t have to be identical, but they should be similar.
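The rules above can be sketched as a small Python check that reads the canonical link from a page and accepts it only if it stays on the same domain. This is a rough illustration, not how any search engine implements it. The names CanonicalFinder, base_domain and canonical_for are made up for the example, and comparing the last two labels of the hostname is a deliberate simplification that breaks for domains such as example.co.uk.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.href is None:
            self.href = attrs.get("href")


def base_domain(url):
    """Naive domain: the last two labels of the hostname (an assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def canonical_for(page_url, html):
    """Return the canonical URL a page suggests, or None if absent or cross-domain."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    # Relative links are allowed, so resolve against the page's own URL.
    target = urljoin(page_url, finder.href)
    # Subdomains, protocols and ports may differ; the domain may not.
    if base_domain(target) != base_domain(page_url):
        return None
    return target


html = '<head><link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" /></head>'
result = canonical_for("http://www.example.com/product.php?item=swedish-fish&sort=price", html)
```

With this check, blog.mysite.com pointing at www.mysite.com is accepted, while www.mysite.com pointing at www.myothersite.com is ignored, matching the examples above.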
For more details see Greg’s video and presentation on the Webmaster Central Blog.


