Google has been indexing AJAX pages with hashbangs since 2015: https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html
However, is it possible to exclude specific URLs with a specific hash fragment (because of duplicate content, e.g. sorting parameters)?
- example.com/#!explore/world (is OK to be indexed)
- example.com/#!explore/world:sortby=date (should not be indexed)
Since the page does not reload when the hash fragment changes to a new AJAX page, it does not make sense to use the
<meta name='robots' content='noindex'> tag, since it would apply to ALL AJAX hash URLs...
The best thing you can do is set a canonical meta tag on all pages with filtered views (sort by, ascending, descending, price range, etc.) to let bots know which page is the original one and should be indexed.
So when the URL is example.com/#!explore/world:sortby=date, the canonical meta tag should be set to:
<link rel='canonical' href='example.com/#!explore/world'>
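For an AJAX site the canonical target can be derived from the filtered URL itself. A minimal sketch (the helper name canonicalHashUrl is hypothetical, and it assumes the :key=value filter-suffix convention from the examples above):

```javascript
// Strip a trailing ":key=value" filter suffix (e.g. ":sortby=date") from a
// hashbang URL, leaving the canonical URL that should be indexed.
// Hypothetical helper; assumes the suffix convention shown above.
function canonicalHashUrl(url) {
  return url.replace(/:[^/:]*$/, '');
}

// In the browser, the canonical link element could then be kept in sync:
//   document.querySelector("link[rel='canonical']").href =
//     canonicalHashUrl(location.href);
```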
After implementing the canonical tag, wait some time, maybe a week, to make sure the bots have seen it, and then proceed to block the web crawlers via robots.txt:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /*sortby=
Note 1: /*sortby= will match any URL containing the string sortby=. Do not include the !, as it has a specific meaning in regex-style patterns.
Note 2: it might take more or less than a week; check the SERPs after a while to see if the hash-filtered URLs have been removed.
Note 3: the order is important. Implement the canonical tag, wait, then block via robots.txt. This matters because you need to allow the web crawlers to read the canonical tags first; once access is blocked via robots.txt they won't be able to see them.
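For intuition on how a crawler that honors Google's wildcard extension would evaluate a rule like /*sortby=, here is a small sketch (the function name robotsRuleMatches is hypothetical, and this is an illustration, not a full robots.txt parser): '*' matches any run of characters and a trailing '$' anchors the end of the URL.

```javascript
// Translate a Google-style robots.txt path pattern ('*' wildcard, optional
// trailing '$' end anchor) into a RegExp and test it against a URL path.
// Hypothetical helper for illustration only.
function robotsRuleMatches(pattern, path) {
  const regexSource = '^' + pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\\\$$/, '$')                 // re-enable a trailing '$' anchor
    .replace(/\*/g, '.*');                 // '*' matches any character run
  return new RegExp(regexSource).test(path);
}
```

Note that browsers never send the #! fragment to the server; under Google's (now deprecated) AJAX crawling scheme these URLs were fetched as ?_escaped_fragment_= URLs, and it is that rewritten path and query string the robots.txt rules would actually be matched against.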
- Thank you. But when you set a canonical meta tag for an AJAX site, how does this work? Does Google reload the page when navigating through the AJAX pages so that the canonical tag can update too? Or do you need to change the canonical meta tag every time a new AJAX page is loaded? Would Google even notice that?
Update your robots.txt to disallow the bots from crawling these dynamic pages. Your robots.txt can be something like:
User-agent: *
Disallow: /*sortby=date*
Also, if you have connected your website to Google Webmaster Tools, make sure to run the robots.txt tester on the dashboard.
And yes, the noindex meta tag can be used for the dynamic pages too.
- Do you have documentation that robots.txt can be used for hash fragments? I don't believe that it can.