How Google Calculates Duplicate Content Via Dupe Detection



On November 4, 2020, a team from Google published a podcast episode about dupe detection and canonicalization at Google. John Mueller, Martin Splitt, Gary Illyes, and Lizzi Harvey hosted the episode. They discussed how Google processes the vast amount of content available online and how it keeps search results relevant by surfacing high-quality, original content.

After some light opening conversation, Gary Illyes explained the difference between dupe detection and canonicalization.

What Is Dupe Detection?

To begin the process, Google creates a checksum for each page: a fingerprint computed from the words on that page. By comparing the checksums of many pages, Google can identify pages that have the same content. In general computing, a checksum is a small piece of data derived from a larger block of digital data, originally used to detect errors introduced during transmission or storage. A checksum verifies the integrity of data, but it does not verify its authenticity.
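As a rough illustration of the idea of a page fingerprint, here is a minimal Python sketch. It is not Google's implementation: the function name `page_checksum`, the choice of SHA-256, and the word normalization are all assumptions made for the example.

```python
import hashlib
import re


def page_checksum(text: str) -> str:
    """Return a hex fingerprint of a page's words (illustrative only)."""
    # Assumption: lowercase the text and keep only word characters, so that
    # differences in case, spacing, or punctuation do not change the result.
    words = re.findall(r"\w+", text.lower())
    # Assumption: SHA-256 stands in for whatever hash a real system uses.
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = page_checksum("Hello,  World!")
    b = page_checksum("hello world")
    print(a == b)  # same words, so the fingerprints match
```

Two pages with the same words produce the same fingerprint, so comparing short hashes replaces comparing full texts.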

Gary went on to say that dupe detection and canonicalization are two different things. Dupe detection comes first, followed by canonicalization. In dupe detection, Google clusters pages with similar content together; in canonicalization, it chooses one page from each cluster as the representative, or “leader.” Deduplication as a whole therefore includes both cluster building and canonicalization. Dupe detection relies mainly on hashes, or checksums, produced by reducing the content of each page, which are then compared with one another. Reducing content to a checksum makes dupe detection much cheaper. Gary explained that comparing the full text would take far more resources while producing almost the same results that Google gets from checksums.
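The cluster-building step can be sketched in a few lines of Python. This is a simplified illustration that only groups exact matches after normalization; the names `checksum` and `build_clusters` and the sample URLs are hypothetical, and a real system would also need to handle near-duplicates.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def checksum(text: str) -> str:
    """Hash the normalized words of a page (assumed normalization)."""
    words = text.lower().split()
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


def build_clusters(pages: Dict[str, str]) -> List[List[str]]:
    """Group URLs whose page text reduces to the same checksum."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[checksum(text)].append(url)
    # Only groups with more than one URL are duplicate clusters.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


if __name__ == "__main__":
    pages = {
        "https://example.com/a": "Same text here",
        "https://example.com/b": "same   TEXT here",
        "https://example.com/c": "Different text",
    }
    print(build_clusters(pages))
```

Each cluster would then be handed to canonicalization, which picks the leader.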

In dupe detection, checksums are used to catch both exact duplicates and near-duplicates. Google has several algorithms that detect and remove boilerplate from pages. In other words, Google excludes navigation and footer content from the checksum calculation and examines only the central piece of each page.
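To show why stripping boilerplate matters, the sketch below drops text inside a few structural tags before hashing. It assumes boilerplate lives in `nav`, `header`, and `footer` elements, which is a simplification; the class and function names are hypothetical, and Google's actual boilerplate detection is not described in the podcast at this level of detail.

```python
import hashlib
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text that is not inside assumed boilerplate elements."""

    SKIP = {"nav", "header", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements we are inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def content_checksum(html: str) -> str:
    """Hash only the central content of an HTML page."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.parts).lower().split()
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    page_a = "<nav>Home | About</nav><main><p>Same article.</p></main><footer>Site A</footer>"
    page_b = "<nav>Start | Contact</nav><main><p>Same article.</p></main><footer>Site B</footer>"
    print(content_checksum(page_a) == content_checksum(page_b))
```

With navigation and footer text excluded, two pages that share an article but sit in different site templates reduce to the same checksum.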

What Happens After Dupe Detection?

Once duplicates have been detected and clustered, how does Google handle canonicalization? The canonicalization signals include content, PageRank, HTTPS, presence in a sitemap file, server redirects, and the rel=canonical tag. Machine learning algorithms decide the weight of each signal, and redirects and the canonical tag generally carry more weight. Gary further explained that although ML puts more emphasis on some signals, canonicalization has no effect on ranking. The page that Google chooses as canonical is the one that can rank, but how well it ranks does not depend on these signals.
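The weighted-signal idea can be sketched as a simple scoring function. Everything numeric here is invented for illustration: the weights in `HYPOTHETICAL_WEIGHTS` are not Google's (which are learned by ML and not public), and the signal names and the `Candidate` type are assumptions. The only property taken from the article is that redirects and rel=canonical weigh more than the other signals.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical weights: redirects and rel=canonical count the most.
HYPOTHETICAL_WEIGHTS: Dict[str, float] = {
    "redirect_target": 5.0,
    "rel_canonical": 4.0,
    "https": 1.0,
    "in_sitemap": 1.0,
    "pagerank": 1.0,
}


@dataclass
class Candidate:
    url: str
    # Signal values in the range 0.0 to 1.0; missing signals count as 0.
    signals: Dict[str, float] = field(default_factory=dict)


def score(candidate: Candidate, weights: Dict[str, float]) -> float:
    """Weighted sum of the candidate's signals."""
    return sum(w * candidate.signals.get(name, 0.0) for name, w in weights.items())


def choose_canonical(cluster: List[Candidate],
                     weights: Dict[str, float] = HYPOTHETICAL_WEIGHTS) -> Candidate:
    """Pick the highest-scoring page in a duplicate cluster as the leader."""
    return max(cluster, key=lambda c: score(c, weights))


if __name__ == "__main__":
    cluster = [
        Candidate("http://example.com/page", {"in_sitemap": 1.0, "pagerank": 0.6}),
        Candidate("https://example.com/page", {"https": 1.0, "rel_canonical": 1.0}),
    ]
    print(choose_canonical(cluster).url)
```

The score is only used to choose a leader within one cluster; consistent with Gary's point, it says nothing about how that leader ranks against other pages.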


Smrutri Kakkad, Digital Marketing Manager

After graduating from Be.IT, Smruti decided to build her expertise in digital marketing. Be it SEO or paid advertising, she knows how to make it work like a pro. She also enjoys watching a variety of movies and has great taste in fashion. She is known for leading her team with motivation and inspiration.
