Google Crawling and Indexing Experiments 2010

Added: 31.05.2010
When optimising a website I find it useful to know how to let Google into (or keep it out of) different parts of the site, and what it will (or won't) take into account when assessing a page. This article is the result of experiments carried out in April and May 2010 to answer a few questions I had, and to try some things I'd read about and been interested in. It's always good to see for yourself!

Does Google crawl links in forms and Javascript?

Yes. Google will crawl a url if it can find that url in the code on a page. This includes:

Forms

Google announced in April 2008 that it crawled forms in certain circumstances. In a test I was able to confirm this and get Google to index a new page via a form. I find it interesting that Google state that pages discovered by this method are not indexed at the expense of others. A site's index allocation is based on PageRank - does this mean that these pages are exempt from needing PageRank to be indexed? Does this mean that PageRank doesn't flow via forms?

Javascript

It was reported as far back as May 2009 that Google was sending PageRank and anchor text down certain Javascript links (this information is about half way down the article). This information comes from a reliable source who spoke to a Google representative at a conference, but I can find no official pronouncement from Google on the subject.

I can't comment on PageRank being sent down Javascript links, but in my tests Google did not send anchor text down Javascript links, whether these use real text or an image with alt text. 

I tested two types of links:

<a href="javascript:;" onClick="window.open('pop.htm')">anchor text</a> 
and
<img src="image.gif" onClick="window.open('pop.htm')" alt="anchor text">

In both cases, Google indexed pop.htm - however anchor text was not passed. I verified this by using a separate unique phrase for each link. When searching on those phrases Google did not return the popup as a result, which it would have done if anchor text had been passed.

So how do we get Google to completely ignore a link?

You could use the 'nofollow' attribute but Google have said that PageRank is wasted when you do that (whether that is true or not is something we won't get into here).

One solution is to use the Javascript document.write function to create text links; the function should be in an external file that Google is blocked from visiting by robots.txt. My tests showed that Google didn't follow the link, and didn't index the text (or the function name).
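
As a rough sketch of that setup, the external file might look something like the code below. The file name (links.js), the robots.txt path, the function names and the link targets are all made up for illustration; they are not the ones used in my test pages.

```javascript
// links.js: a hypothetical external file. It is assumed that robots.txt
// contains a rule blocking it, for example:
//   User-agent: *
//   Disallow: /js/links.js

// Build the HTML for a list of links as a plain string.
function buildLinkHtml(links) {
  return links
    .map(function (link) {
      return '<a href="' + link.href + '">' + link.text + '</a>';
    })
    .join(' | ');
}

// Write the links into the page at the point where this function is called.
// Only the call to this function appears in the page source.
function writeHiddenLinks() {
  document.write(buildLinkHtml([
    { href: '/terms.htm', text: 'Terms' },
    { href: '/privacy.htm', text: 'Privacy' }
  ]));
}
```

The page itself would then only need a script tag pointing at the blocked file and a call to writeHiddenLinks() at the place where the links should appear.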

Another way of hiding a block of links is to use Ajax to call them from an external file referenced by an empty <div>, which is the only code left in the page.
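
A minimal sketch of the Ajax approach, assuming an empty <div id="footer-links"></div> in the page and a fragment file at /includes/links.htm (both names are hypothetical, as is the function name):

```javascript
// Fetch an HTML fragment and place it inside an (initially empty) element.
// The element id and the url are supplied by the caller; the values used
// in the usage note are illustrative assumptions only.
function loadLinksInto(elementId, url) {
  var request = new XMLHttpRequest();
  request.open('GET', url, true);
  request.onreadystatechange = function () {
    // readyState 4 means the request has finished; 200 means it succeeded.
    if (request.readyState === 4 && request.status === 200) {
      document.getElementById(elementId).innerHTML = request.responseText;
    }
  };
  request.send(null);
}
```

Calling loadLinksInto('footer-links', '/includes/links.htm') once the page has loaded would fill the div; the links themselves never appear in the page source.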

Does Google index the meta description and keywords tags?

For years the answer has been no, and this is still the case.

How does Google see 'alt' and 'title'?

It's well known that 'alt' text on an image link is treated as anchor text. What about 'title'? Whether it is on an image inside an <a href=""> link, or inside the tag itself (i.e. <a title="" href="">), the answer is no.

Whilst testing this I also found that

Remember those 'alt' tags people!

What about 'noindex' and 'nofollow'?

Is the text inside a 'nofollow' link indexed?

Yes. Searching for a unique phrase inside a 'nofollow' link returns the page the link is on as a result. Another reason not to use 'nofollow' for user-generated links.

If a page is 'noindex,follow' can it help other pages to rank?

Yes. Anchor text is passed by links on a 'noindex,follow' page. I verified this by using a unique phrase in a link on Page A (which was 'noindex,follow') to Page B. When searching on that phrase Google returned Page B.

Does Google use 'nofollow' links for discovery?

This idea has been put forward on a few forums, and the fact that Google shows 'nofollow' links from places like Twitter in their Webmaster Tools console has given rise to speculation that some 'nofollow' links carry weight.

I tested this by setting up a 'nofollow' link to a page which then had unique text and a clean link to a further page that was not crawlable any other way. Neither page was indexed. 

Please note however that different results might be seen on a huge 'hub' site. It's at least a possibility that Google deals with certain nofollow links differently according to the site it finds them on (and possibly also where in the page it finds them). Also note comments made in this thread on WebmasterWorld; on Twitter (and therefore possibly other big social networking sites) 'nofollow' links can help with indexing because they get picked up by other sites that then strip off the 'nofollow' attribute.

Does Google index content in <iframes>?

Yes; iframes are indexed and returned as separate urls, and pages linked to from within iframes are crawled and anchor text is passed. I verified this by using a unique phrase for a link from an iframe (page A) to page B. When searching on that phrase Google returned both pages.

You can make Google notice more than one anchor text link from one page to another

I read about this on YouMoz and wanted to try it for myself.

Historically, if you linked from Page A to Page B twice, Google would only pass anchor text from the first link, not the second (or third etc). 

However, it has been confirmed recently that if you set up anchor points on Page B and link to those from Page A, then anchor text is passed for each link. 

I verified this by setting up three links from Page A to Page B on unique phrases. Searching on those phrases returned both pages every time.

The anchor names were not indexed by Google.

How does Google treat 'meta refresh' redirects?

A common feature on many domain control panels is the ability to add a redirect from a parked domain to another site. This is very often done using code like this:

<meta http-equiv="refresh" content="0;url=http://www.example.com/">

Traditionally Google has treated a 'meta refresh' of 0 seconds as a 302 (temporary) redirect, which can be a problem. Google has long-standing problems with 302 redirects, and doesn't interpret them in the same way as 301 (permanent) redirects. For more information read the problem with 302 redirects - an old (2005) thread from WebmasterWorld but still a great introduction to the subject.

So what about 2010? I set up the following tests:

0 second 'meta refresh' works like a 302 hijack - but not always!

Google attributed the content of page 1b to page 1a. Searching on the unique link text used returned no results; in other words page 1a effectively did not exist as far as Google was concerned. However, see the update below.

5 second 'meta refresh' works like a 301 redirect - but not always!

Google didn't index page 2a, but instead only page 2b. It didn't return any results for the unique text on page 2a, so again page 2a effectively did not exist to Google. Again, see below for an update.

Update December 2010: more news on this from WebmasterWorld users. Others running their own tests saw exactly the opposite of what I saw, and one highly experienced SEO reports that they have seen the same meta refresh code treated differently on different sites. The most reliable fix in the long run will be a 301 redirect. 
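
For illustration only, here is one way a 301 could be sent from the parked domain, assuming it is served by a small Node.js script. That is an assumption for the sake of the example; on a typical hosting control panel or Apache server the equivalent setting will look different. The function names are hypothetical and the target is the example.com address used above.

```javascript
var http = require('http');

// Send a permanent (301) redirect to the given url.
function redirectPermanently(res, targetUrl) {
  res.writeHead(301, { Location: targetUrl });
  res.end();
}

// Start a server that redirects every request to the same path on the
// target site (www.example.com is a placeholder).
function startRedirectServer(port) {
  return http.createServer(function (req, res) {
    redirectPermanently(res, 'http://www.example.com' + req.url);
  }).listen(port);
}
```

Unlike a 'meta refresh', the status code here tells Google explicitly that the move is permanent, so nothing is left to interpretation.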