robots.txt – John Kieken dot com

As you may already know, I’ve been a Web Developer since 1999 and run Website Setup dot net.

I use Google Webmaster Tools for several of my own and my customers’ websites. Recently, under Diagnostics > Crawl Errors, I discovered quite a few 404 (not found) errors pointing to the same non-existent location:

http://baudindentalmission.org/a

Sure, a 404 error is totally expected. After all, the “/a” directory does not exist. However, the question remains: how did the Googlebot get the idea to crawl there in the first place? Perhaps a programming error on my part, or a typo?

Well, this is strange. I’m now seeing it on more than one site, and I’m finding other people complaining about the same thing suddenly appearing in their Webmaster Tools dashboards.

Let’s now examine my Dashboard’s Linked From data:

http://baudindentalmission.org/donate.html
http://baudindentalmission.org/haiti.html
http://baudindentalmission.org/about.html

Hmm, no clues there. Nothing in any of those pages links to a “/a” location.

Let’s dig further into the JavaScript Includes. Searching the first JavaScript file (which happens to be the jQuery JavaScript Library) for “/a”…

<script src="/ajax/jquery/jquery-1.5.min.js" type="text/javascript"></script>

We find this occurrence…

<a style="color: red; float: left; opacity: .55;" href="/a">a</a>

What’s that doing in there? Actually, it doesn’t matter… it’s part of jQuery and very smart programmers spend countless hours developing, troubleshooting and refining jQuery… so, for this discussion, we’ll just trust them. What’s more important is that the Googlebot is actually crawling around inside an external JavaScript file, presumably searching for content. Why? Ask Google.
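To make the mechanism concrete, here is a minimal Python sketch of how a naive link extractor could end up with that URL. This is only a guess at the general technique, not a description of how the Googlebot actually works, which Google does not publish. The JS_SNIPPET string is a hypothetical stand-in modeled on the anchor quoted above, and extract_hrefs and SCRIPT_URL are names made up for this illustration.

```python
import re
from urllib.parse import urljoin

# Matches href='...' or href="..." anywhere in raw text, with no attempt to
# understand whether the text is HTML markup or a JavaScript string literal.
HREF_PATTERN = re.compile(r"""href\s*=\s*(['"])(.*?)\1""", re.IGNORECASE)


def extract_hrefs(text, base_url):
    """Return absolute URLs for every href-looking attribute found in text."""
    return [urljoin(base_url, match.group(2)) for match in HREF_PATTERN.finditer(text)]


# Hypothetical stand-in for the test markup embedded in the jQuery file.
JS_SNIPPET = (
    'div.innerHTML = "<a style=\'color: red; float: left; opacity: .55;\' '
    'href=\'/a\'>a</a>";'
)

# Assumption: the script's own URL is used as the base for relative links.
SCRIPT_URL = "http://baudindentalmission.org/ajax/jquery/jquery-1.5.min.js"

found = extract_hrefs(JS_SNIPPET, SCRIPT_URL)
print(found)
```

The point is only that a pattern-based extractor cannot tell test markup inside a script from a real link on a page, so the root-relative “/a” resolves to a URL on whatever site hosts the file.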

Personally, I fail to see the value in this practice of crawling JavaScript. If it’s searching for malware or some SEO trickery then great, but it shouldn’t be following things it thinks (assumes) are valid links and creating 404 errors. If it’s crawling JavaScript in order to figure out your site navigation, then shame on Google for rewarding such a poor programming practice. (Your content and code should be two separate things!) Malware or some goofy JavaScript navigation system, either way, these things should be penalized with lower rankings or removed from Google altogether.

What’s even more odd is that the particular JavaScript file it’s crawling is jQuery itself. Since jQuery is part of the Google Libraries API, you’d think it would quickly realize crawling around in there is kinda pointless.

Here is a jQuery Bug Report on this very issue. According to notes in that report, this is not something they intend to fix. I can’t say that I blame them for this attitude: Google should not be crawling JavaScript if it doesn’t know how to properly parse it for valid content. (Although, as mentioned before, I can’t imagine how one could argue that JavaScript should contain any content at all. Best practices call for always maintaining a separation between content and code.)

What’s the solution to all this? You don’t want a bunch of 404 errors piling up… although Google is smart enough to drop bad URLs from its index, it can also penalize a site for this by reducing the crawl rate.

Solution 1: Redirect “/a” to your home page with a 301 in your .htaccess file. This approach has two minor issues. One, your server is doing the work of sending the Googlebot back to your home page. Two, the page never existed in any search index, so theoretically there should be no reason to redirect it elsewhere.

Solution 2: Block this location from the Googlebot in your robots.txt file. This puts the responsibility on Google to stay out of someplace they don’t belong.

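# A bare "Disallow: /a" is a prefix match and would also block /about.html.
# Googlebot treats "$" as end-of-URL, so the second rule matches only "/a".
Disallow: /a/
Disallow: /a$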
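Because Disallow values are matched as path prefixes, it is worth checking which URLs a rule actually catches before you deploy it. Here is a minimal sketch using Python’s standard urllib.robotparser module. The blocked_paths helper, the PATHS list and the “User-agent: *” line are assumptions added for illustration (Python’s parser ignores rules that are not preceded by a User-agent line). Python’s parser implements plain prefix matching only and does not understand Google’s “$” end-of-URL extension, so that form is not exercised here.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(rules, paths, agent="Googlebot"):
    """Return the subset of paths that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


# Hypothetical sample of paths, based on the pages mentioned in this post.
PATHS = ["/a", "/a/", "/about.html", "/donate.html"]

# Each rule is tested on its own to show what it catches.
BROAD = "User-agent: *\nDisallow: /a"
NARROW = "User-agent: *\nDisallow: /a/"

broad_blocked = blocked_paths(BROAD, PATHS)
narrow_blocked = blocked_paths(NARROW, PATHS)

print("Disallow: /a  blocks", broad_blocked)
print("Disallow: /a/ blocks", narrow_blocked)
```

Under prefix matching, the bare “/a” rule also catches /about.html, while “/a/” on its own does not catch “/a” at all. For Google specifically, “Disallow: /a$” limits the match to exactly “/a”.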

After several weeks, you should see these erroneous 404 errors disappear. Good luck!

_________________

EDIT: In this article, I’m only assuming this is an issue for sites that host jQuery locally. I cannot imagine the Googlebot trying to crawl scripts hosted on its own CDN!

_________________

EDIT 2: Here is an official response from a Google employee posting in Google Groups:

JohnMu
Google Employee
4/28/11 – 4:39 AM

Hi guys

Just a short note on this — yes, we are picking up the “/a” link for many sites from jQuery JavaScript. However, that generally isn’t a problem, if we see “/a” as being a 404, then that’s fine for us. As with other 404-URLs, we’ll list it as a crawl error in Webmaster Tools, but again, that’s not going to be a problem for crawling, indexing, or ranking. If you want to make sure that it doesn’t trigger a crawl error in Webmaster Tools, then I would recommend just 301 redirecting that URL to your homepage (disallowing the URL will also bring it up as a crawl error – it will be listed as a URL disallowed by robots.txt).

I would also recommend not explicitly disallowing crawling of the jQuery file. While we generally wouldn’t index it on its own, we may need to access it to generate good Instant Previews for your site.

So to sum it up: If you’re seeing “/a” in the crawl errors in Webmaster Tools, you can just leave it like that, it won’t cause any problems. If you want to have it removed there, you can do a 301 redirect to your homepage.

Cheers
John


404 errors (url: /a) in Google Webmaster Tools