On November 4, 2020, a team of Googlers published a podcast episode about the duplicate detection and canonicalization process at Google. John Mueller, Martin Splitt, Gary Illyes, and Lizzi Harvey hosted the episode. They covered some genuinely fascinating material that everyone should know: how Google processes the enormous volume of content available online, and how it keeps search results relevant for its audience by surfacing top-quality, original content.
The podcast opened in a relaxed, conversational tone, and then Gary Illyes did a fine job of explaining the significant difference between duplicate detection and canonicalization.
To begin the process, Google creates a checksum for each page: a unique fingerprint based on the words of that particular page. By comparing the checksums of multiple pages, Google can identify pages that have similar content. A checksum is a small piece of data derived from a larger set of digital data, originally designed to catch flaws introduced during transmission or storage. A checksum can verify the integrity of data, but it cannot by itself establish the data's authenticity.
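The idea of a word-based fingerprint can be sketched in a few lines. This is a simplified illustration, not Google's actual algorithm; the normalization step (lowercasing, collapsing whitespace) and the choice of MD5 are assumptions made for the example:

```python
import hashlib

def page_checksum(text: str) -> str:
    """Return a short fingerprint derived from a page's words.

    Case and whitespace are normalized first, so trivial formatting
    differences do not change the fingerprint.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Two pages whose words match produce the same checksum,
# even if their spacing and capitalization differ.
a = page_checksum("Fresh pasta recipe with three ingredients.")
b = page_checksum("Fresh  pasta recipe with three INGREDIENTS.")
c = page_checksum("A completely different article about SEO.")
```

Comparing two fixed-length checksums is far cheaper than comparing two full page texts, which is the point Gary makes later in the episode.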
Going further, Gary mentioned that duplicate detection and canonicalization are two different things. Duplicate detection is the primary step, followed by canonicalization. In the duplicate detection process, Google clusters similar-looking content together and then chooses one page from the cluster as the final one, or "leader"; choosing that leader is canonicalization. In other words, duplication handling comprises both cluster building and canonicalization. Duplicate detection relies mainly on the hashes or checksums produced by reducing the content, which are then compared with one another. Converting content into hashes or checksums makes duplicate detection much cheaper. Gary explained further that scanning the full text would take far more resources yet would yield nearly the same results Google already gets from checksums.
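The two-step split described above can be sketched as follows. This is a minimal illustration under the assumption that duplicates share an identical checksum; real systems also cluster near-duplicates, which this sketch does not attempt:

```python
from collections import defaultdict

def cluster_by_checksum(pages):
    """Step 1 (duplicate detection): group (url, checksum) pairs
    into clusters of pages that share the same fingerprint."""
    clusters = defaultdict(list)
    for url, checksum in pages:
        clusters[checksum].append(url)
    return list(clusters.values())

pages = [
    ("https://example.com/recipe", "abc123"),
    ("https://example.com/recipe?ref=home", "abc123"),
    ("https://example.com/about", "def456"),
]
clusters = cluster_by_checksum(pages)
```

Step 2 (canonicalization) would then pick one "leader" URL from each cluster, based on the weighted signals discussed further down.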
In duplicate detection, checksums catch both exact and near-duplicate content. Google has many algorithms that find and exclude boilerplate from pages. Put another way, Google strips out navigation and footer content before calculating the checksum and examines only the central piece of each page.
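Stripping boilerplate before fingerprinting might look roughly like this. The regex-based removal of `<nav>` and `<footer>` blocks is a deliberately crude stand-in for Google's boilerplate-detection algorithms, which are not public:

```python
import re

def strip_boilerplate(html: str) -> str:
    """Remove <nav> and <footer> blocks so only the central
    content feeds into the checksum calculation."""
    html = re.sub(r"<nav\b.*?</nav>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<footer\b.*?</footer>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    return html

page = "<nav>Home | Blog</nav><p>The actual article.</p><footer>© 2020</footer>"
core = strip_boilerplate(page)
```

Because the navigation and footer are identical across a whole site, leaving them in would make every page look partially duplicated; removing them lets the checksum reflect only the content that actually differs.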
After detecting duplicates, how does Google handle canonicalization? Canonicalization signals include the content itself, PageRank, HTTPS, presence in the sitemap file, server redirects, and the rel=canonical annotation. Machine-learning algorithms decide the weight of each signal, generally giving redirects and the canonical tag the highest weight. Gary further explained that although the ML model emphasizes some signals more than others, this weighting has no consequence for rankings: the page Google chooses as canonical is the one that can rank, but its ranking is not based on these canonicalization signals.
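Picking a leader from weighted signals can be sketched as a simple scoring function. The weights below are entirely hypothetical (Google's learned weights are not public); the example only shows the shape of the decision, where redirect and rel=canonical signals dominate:

```python
# Hypothetical weights, for illustration only. In the real system these
# are learned by ML, not hand-set, but redirects and rel=canonical
# reportedly carry the most weight.
WEIGHTS = {
    "rel_canonical": 5.0,
    "server_redirect": 4.0,
    "pagerank": 2.0,
    "https": 1.0,
    "in_sitemap": 1.0,
}

def canonical_score(signals):
    """Weighted sum over whichever signals a page has (values 0 or 1)."""
    return sum(WEIGHTS[name] * value
               for name, value in signals.items() if name in WEIGHTS)

def choose_canonical(cluster):
    """cluster: list of (url, signals) pairs from one duplicate cluster.
    Return the URL with the highest weighted score -- the 'leader'."""
    return max(cluster, key=lambda item: canonical_score(item[1]))[0]

cluster = [
    ("https://example.com/page",
     {"rel_canonical": 1, "https": 1}),          # score 6.0
    ("http://example.com/page?utm=x",
     {"https": 0, "in_sitemap": 1}),             # score 1.0
]
leader = choose_canonical(cluster)
```

Note that this score only decides which URL represents the cluster; as the hosts stress, it is separate from the ranking systems that decide where that leader appears in search results.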