Duplicate Content: What it is and How to Make it Go Away
20 Feb
Identical twins, unknown to each other – one cool and laid-back and one prim-and-proper – shunted off into isolation for unsocial behaviour, but guilty of nothing more than wanting to be loved by those who matter the most to them. No, I'm not talking about a career-defining performance from Lindsay Lohan in the seminal movie 'The Parent Trap' (okay I am, but that's not what this article is about); I'm gently easing you in to the much less heart-warming subject of duplicate content.
If you want to pop away now and watch The
Parent Trap feel free. I can fully appreciate the need for a Li-Lo fix.
I’ll still be here when you get back…
Right, time to knuckle down. Duplicate content is an ongoing battle in SEO that involves more than one version of a page being indexed by search engines. It exists in two forms: onsite duplication, in which two or more identical pages or domains exist on your own website; and offsite duplication, whereby the content on your website is identical or near-identical to that on other websites. Both forms play havoc with your SEO.
Help a Robot out
Search engines are brilliant. Get your SEO on point and
you’ll enjoy an endless stream of traffic without having to pay for any of it.
However, they are also robots, and robots need to be told what to do.
With duplicate content, search engines are suddenly faced
with a choice they didn’t want. Their job is to point users in the direction of
quality, unique content. What they don’t want to be doing is sending people to
various pages that all basically say the same thing. If several pages of
content arise that all appear to be identical, a search engine has to choose which version to index or rank – one, both or neither.
For this reason, and because search engines can crush your
bottom line a hell of a lot quicker than they build it up, you need to help a
robot out by keeping your content duplicate free.
Duplicate content in your own backyard
Duplicate content that exists on your own site tends to be the most problematic; however, it's also the easiest to fix. This type of duplication is typically caused by one of the following issues:
Print pages – unless specifically blocked, search engines will find and index additional CMS-created print pages. Chances are they may even favour them over the web alternative due to the lack of peripheral content.
Duplicate product descriptions – a nightmare for e-commerce sites. Generic product copy, typically manufacturer-supplied, used across multiple similar products.
Session IDs – tracking visitors makes sense, not in any kind of NSA way, but just to make sure the user experience is as good as it possibly can be. A session is a brief history of what a visitor did when browsing your website, the best example being storing items in a cart. Session IDs tend to result in systems adding the session data to a URL and, because each ID is unique to a session, creating a new URL. If that sounds rather complicated, it's because it is. Basically, the problem is caused by different session IDs being assigned to the same pages, so identical content ends up living at many different URLs.
URL parameters – click tracking, analytics code or any other parameter that can be added to a URL without changing the content of a page is a problem. You can read about just how much of a problem in Google's Webmaster Tools documentation.
www vs non-www – a classic! Two versions of the same site – one with all the W's and one without.
Pagination – a great technique for organising content, but one that causes duplicated content by adding comment pages to URLs.
Repeated content across a site – on home pages, blog pages, archive pages and meta titles and descriptions.
Site architecture – a common problem, whereby there are multiple paths to the same page.
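Many of the causes above boil down to one page living at several URLs. As a purely hypothetical illustration (the domain and paths are made up), all of the following could serve identical content and be treated as separate pages:

```text
https://example.com/widgets/blue-widget                      <- the page you want indexed
https://www.example.com/widgets/blue-widget                  <- www vs non-www
https://example.com/widgets/blue-widget?sid=8f3a2c           <- session ID appended to the URL
https://example.com/widgets/blue-widget?utm_source=newsletter <- tracking parameter
https://example.com/widgets/blue-widget/print                <- CMS-generated print page
```

To a search engine, each of those is a distinct URL, and each one is a candidate for indexing.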
Oi! That’s my content…what you playing at?
Most duplicate content exists right at home on your own
website, where there’s really no one to blame but yourself. However, you write
good content – much better than Average Joe – and this attracts the vultures.
The more popular your site becomes, the more you’ll find yourself having to
deal with scrapers.
Scrapers are basically a breed of content thief. Not always, but most of the time.
There are occasions when a scraper may take your content with consent (I'm not sure why you'd ever give them the green light?), but most of the time they'll just go ahead and republish your stuff without linking to the original piece, thus leaving search engines with the problem of which article they should be linking to.
Scrapers do what they do simply to attract visitors to their
own webpages and encourage them to click on their ads. They’re annoying and
there isn’t a whole lot you can do if that big scooper comes in and takes
a helping of your content, but they are not the worst offenders. That honour
belongs to the plagiarisers.
There are people out there that will simply take your
content and tell the world that they wrote it. You could call it flattery, I call it taking the Mickey Bliss.
This form of theft is the lowest of the low, and more
damaging than scraping. Typically, plagiarisers come from respected sites and
are able to get more links for your content than you could, meaning they
outrank you. The absolute cheek of it! That’s like the guy from down the market
that sells dodgy Burberry raking in more profits than Burberry themselves.
Not all offsite duplicate content comes as a result of rogues, though. There are occasions when your own strategy trips you up. The best examples are mass article distribution, which gets your content far and wide only to end up taking traffic away from your own site; and syndication, which may often be done on relevant, industry-related sites, but causes problems for search engines when the republished copy doesn't link back to the original.
How do I sort this mess?
First of all you need to find out if duplicate content is causing you problems. There are several ways you can do this:
Check Google Webmaster Tools under Diagnostics >
Search for your content on Google using the URL plus keywords.
Copy and paste a couple of sentences from your
content into Google.
Enter your site URL or content into Copyscape.
As I mentioned earlier, onsite duplicate content is the most
problematic, but it’s also the easiest to fix, so let’s quickly rattle through a few common remedies:
Print pages – don't use them. Use a print style sheet instead.
Session IDs – avoid them. Cookies do a far better job of tracking.
URL parameters – get your programmer to use a URL factory to keep parameters in the same order.
www vs non-www – pick one and redirect the other to it.
Pagination – disable it in your settings.
Right, those were the easy fixes, kept as brief as they deserved to be. Here are several more complex solutions:
If you have two versions of a page but only need one, a good
old 301 redirect has you covered. This method allows you to send all of your
loyal readers with old bookmarked pages over to the new version of a page
without being hit with the dreaded 404. 301s also let search engines know your content has moved so the index can be updated and authority can be transferred.
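By way of a sketch, here's what fixing the www/non-www duplicate with a 301 might look like on an Apache server using an .htaccess file. The domain is hypothetical and your hosting setup may differ, so treat this as an illustration rather than a copy-paste fix:

```apacheconf
# Hypothetical example: permanently redirect www.example.com to example.com
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

The R=301 flag is what tells browsers and search engines that the move is permanent, so bookmarks keep working and authority is passed along.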
The idea behind "rel=canonical" tags is to let search engines know which version of a page you want to appear in search results. This method works by placing the tag in the page header and is best used when you know the URL is wrong, but you can't get rid of the duplicate page itself. According to Google's Webmaster Trends analyst John Mueller, canonical tags are slower than 301 redirects, so you'll probably want to try the redirect first where you can.
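The tag itself is a single line placed in the head of each duplicate page, pointing at the version you want indexed. A minimal sketch, with a made-up URL:

```html
<!-- In the <head> of the duplicate page -->
<link rel="canonical" href="http://example.com/widgets/blue-widget" />
```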
Robots need to be told what to do, and editing the /robots.txt file is the way to do it. Using 'The Robots Exclusion Protocol' you can effectively tell a search engine not to waste its time crawling specific areas of your site. This is a powerful method that can be used to block entire directories, and every bit as daunting as dabbling with .txt files sounds. Useful, but best left to the experts.
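For illustration only, a robots.txt blocking the sort of print and session-driven directories mentioned earlier might look like this (the directory names are invented for the example):

```text
# robots.txt - lives at the root of the site
User-agent: *        # applies to all crawlers
Disallow: /print/    # keep print versions out of the index
Disallow: /cart/     # session-driven pages
```

One typo here can block your whole site (Disallow: / on its own does exactly that), which is why it really is best left to the experts.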
I fully understand that the more complex solutions listed here might be too complex – anything that involves coding or back-end development is rarely ever straightforward. But, hey, this ain't no Lindsay Lohan movie we're dealing with here, this is duplicate content and it doesn't fix itself.
So, if Google is chewing your ear about duplicate content,
here’s what you need to do: dip into your Webmaster Tools, find out what’s
happening, fix the easy stuff and get Pea Soup to take care of the rest.