How to prevent and remove duplicate webpages from search results

It's common for websites to have different URLs that load the same or similar content. For example, some Content Management Systems produce a new URL when an existing article is edited, or content editors may change the URL slug. In other cases, websites use directories to categorise content by state or language, e.g. https://mysite.com/mywebpage/qld having the same content as https://mysite.com/mywebpage/vic.

Our crawler visits every unique URL, so this can result in multiple records containing the same or similar content. Returning all of these records in search results may not create the desired user experience.
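Before choosing a fix, it can help to find which of your URLs serve the same content. The following is a minimal sketch, not part of our platform: the function names and the sample data are hypothetical, and it assumes you already have the extracted text of each page. It only catches pages that are identical after whitespace and case are normalised, not pages that are merely similar.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash page text after lowercasing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of URLs whose text is identical after normalisation."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    # Keep only fingerprints shared by more than one URL.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical sample data: URL -> extracted page text.
pages = {
    "https://mysite.com/mywebpage/qld": "Opening hours:  9am to 5pm",
    "https://mysite.com/mywebpage/vic": "opening hours: 9am to 5pm",
    "https://mysite.com/contact": "Contact us",
}
duplicates = find_duplicates(pages)
```

Each group in duplicates is a set of URLs that would likely appear as separate records with the same content.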

If an old or unwanted webpage is within the index then it needs to be removed.

Our recommendation is to use a 301 redirect for old content that you no longer want indexed (https://kb.search.io/KB/How-does-the-crawler-handle-301-or-302-redirect%3F.219152657.html).
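How you configure a 301 redirect depends on your web server or CMS. As an illustration only, here is a minimal sketch of a Python WSGI application that answers retired paths with a 301 and a Location header. The REDIRECTS mapping and the paths in it are hypothetical, and this is not how our platform or any particular CMS implements redirects.

```python
# Hypothetical mapping of retired paths to their replacements.
REDIRECTS = {
    "/old-article": "/new-article",
}


def app(environ, start_response):
    """Minimal WSGI app: 301 for retired paths, 200 for everything else."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # A permanent redirect tells clients the old URL has been replaced.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Live content"]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() can serve the app on port 8000.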

For live content, we recommend using canonical tags, which ensure the crawler only visits and indexes the master page (https://kb.search.io/KB/How-do-canonicals-impact-indexing%3F.219218150.html).
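A canonical tag is a link element with rel="canonical" in the page head, pointing at the master URL. To check which canonical a page declares, a small script can help. The sketch below uses Python's standard html.parser. The class and function names are hypothetical, and the sample page assumes, purely for illustration, that the QLD page is the master version.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values.
        rel = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel:
            self.canonical = attributes.get("href")


def canonical_url(html: str):
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical markup for the VIC page, pointing at the QLD page as master.
page = """<html><head>
<link rel="canonical" href="https://mysite.com/mywebpage/qld">
</head><body>Content</body></html>"""
result = canonical_url(page)
```

If the function returns None for a page that duplicates another, that page is missing a canonical tag.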

Other alternatives are to add data-sj-noindex to pages you don’t want to be crawled (https://kb.search.io/KB/How-do-I-prevent-pages-from-being-crawled%3F.195723517.html), or to set up an exclusion rule within our platform (https://docs.search.io/documentation/guides/content-websites/excluding-documents).

All of the above solutions will also remove the unwanted webpage from the index if the URL has been crawled in the past.