Google Crawling and Indexing Experiments 2010
Yes. Google will crawl a url if it can find that url in the code on a page. This includes
- urls given as the destination for a form (usually where the form is used for navigation or search).
Google announced in April 2008 that it crawled forms in certain circumstances. In a test I was able to confirm this and get Google to index a new page via a form. I find it interesting that Google state that pages discovered by this method are not indexed at the expense of others. A site's index allocation is based on PageRank - does this mean that these pages are exempt from needing PageRank to be indexed? Does this mean that PageRank doesn't flow via forms?
I tested two types of links:
<img src="image.gif" onClick="window.open('pop.htm')" alt="anchor text">
In both cases, Google indexed pop.htm - however anchor text was not passed. I verified this by using a separate unique phrase for each link. When searching on those phrases Google did not return the popup as a result, which it would have done if anchor text had been passed.
So how do we get Google to completely ignore a link?
You could use the 'nofollow' attribute, but Google have said that PageRank is wasted when you do that (whether that is true or not is something we won't get into here).
Another way of hiding a block of links is to use Ajax to call them from an external file referenced by an empty <div>, which is the only code left in the page.
Does Google index the meta description and keywords tags?
For years the answer has been no, and this is still the case.
How does Google see 'alt' and 'title'?
It's well known that 'alt' text on an image link is treated as anchor text. What about 'title'? Whether it is on an image inside an <a href=""> link, or inside the tag itself (i.e. <a title="" href="">), the answer is no.
Whilst testing this I also found that
- the image file name is not indexed by Google images or Google web (no surprise there), but
- 'alt' text for an image that isn't a link is indexed by both - which was a surprise to me.
Remember those 'alt' tags, people!
What about 'noindex' and 'nofollow'?
Is the text inside a 'nofollow' link indexed?
Yes. Searching for a unique phrase inside a 'nofollow' link returns the page the link is on as a result. Another reason not to use 'nofollow' for user-generated links.
If a page is 'noindex,follow' can it help other pages to rank?
Yes. Anchor text is passed by links on a 'noindex,follow' page. I verified this by using a unique phrase in a link on Page A (which was 'noindex,follow') to Page B. When searching on that phrase Google returned Page B.
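If you want to check which robots directives a page actually declares before running a test like this, a small script can help. The following is only a rough sketch using Python's standard library; the function name robots_directives is my own invention, and it only looks at the 'robots' meta tag, not at HTTP headers or robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list, e.g. "noindex,follow".
        for token in (attr.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives
```

For a page like Page A, the result should contain both 'noindex' and 'follow'.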
Does Google use 'nofollow' links for discovery?
This idea has been put forward on a few forums, and the fact that Google shows 'nofollow' links from places like Twitter in their Webmaster Tools console has given rise to speculation that some 'nofollow' links carry weight.
I tested this by setting up a 'nofollow' link to a page which then had unique text and a clean link to a further page that was not crawlable any other way. Neither page was indexed.
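When setting up a test like this it is worth confirming that the only route to the target page really is a 'nofollow' link. The sketch below is an assumption-laden helper, not something I used in the original test: it uses Python's standard html.parser, and the names LinkAuditor and audit_links are made up for illustration. It only inspects the 'rel' attribute on <a> tags and ignores page-level 'nofollow' directives.

```python
from html.parser import HTMLParser


class LinkAuditor(HTMLParser):
    """Splits the <a href> links on a page into followed and nofollow lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel can hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def audit_links(html):
    """Return (followed_links, nofollow_links) for the given HTML."""
    auditor = LinkAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.followed, auditor.nofollow
```

If the target page shows up in the first list for any page on the site, the test is not isolating 'nofollow' discovery.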
Does Google index content in <iframes>?
Yes; content in iframes is indexed and returned as separate urls, and pages linked to from within iframes are crawled and anchor text is passed. I verified this by using a unique phrase for a link from an iframe (page A) to page B. When searching on that phrase Google returned both pages.
You can make Google notice more than one anchor text link from one page to another
I read about this on YouMoz and wanted to try it for myself.
Historically, if you linked from Page A to Page B twice, Google would only pass anchor text from the first link, not the second (or third etc).
However, it has been confirmed recently that if you set up anchor points on Page B and link to those from Page A, then anchor text is passed for each link.
I verified this by setting up three links from Page A to Page B on unique phrases. Searching on those phrases returned both pages every time.
The anchor names were not indexed by Google.
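To make the setup concrete: the three links differ only in the part after the '#', so they all point at the same document. The short sketch below illustrates that with Python's urllib.parse.urldefrag; the file name page-b.htm and the anchor names are placeholders, not the ones used in the test.

```python
from urllib.parse import urldefrag


def group_links_by_target(hrefs):
    """Group hrefs by the document they point to, ignoring the #fragment."""
    groups = {}
    for href in hrefs:
        base, fragment = urldefrag(href)
        groups.setdefault(base, []).append(fragment)
    return groups


# Hypothetical links from Page A, each pointing at a different anchor on Page B.
links = ["page-b.htm#one", "page-b.htm#two", "page-b.htm#three"]
print(group_links_by_target(links))
# {'page-b.htm': ['one', 'two', 'three']}
```

All three hrefs collapse to a single target, which is why the result above is notable: the anchor text of each link was still counted.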
How does Google treat 'meta refresh' redirects?
A common feature on many domain control panels is the ability to add a redirect from a parked domain to another site. This is very often done using code like this:
<meta http-equiv="refresh" content="0;url=http://www.example.com/">
Google has traditionally treated a 'meta refresh' of 0 seconds as a 302 (temporary) redirect, which can be a problem. Google has long-standing problems with 302 redirects, and doesn't interpret them in the same way as 301 (permanent) redirects. For more information read the problem with 302 redirects - an old (2005) thread from WebmasterWorld but still a great introduction to the subject.
So what about 2010? I set up the following tests:
- Page 1a features a meta refresh of 0 seconds to page 1b, and a unique phrase in a text link pointing to 1b.
- Page 2a features a meta refresh of 5 seconds to page 2b, and another unique phrase text link pointing to 2b.
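Since the only difference between the two test pages is the delay, it can be handy to read the delay and target straight out of a page's markup. This is a minimal sketch using Python's standard html.parser, assuming the common "seconds;url=target" form of the content attribute; the name find_meta_refresh is hypothetical, and this says nothing about how Google itself parses the tag.

```python
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Collects (delay_seconds, target_url) for each meta refresh tag."""

    def __init__(self):
        super().__init__()
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        # content looks like "0;url=http://www.example.com/"
        delay_part, _, rest = (attr.get("content") or "").partition(";")
        try:
            delay = float(delay_part.strip())
        except ValueError:
            return  # no usable delay, so skip this tag
        rest = rest.strip()
        url = None
        if rest.lower().startswith("url="):
            url = rest[4:].strip().strip("'\"")
        self.refreshes.append((delay, url))


def find_meta_refresh(html):
    """Return a list of (delay_seconds, target_url) tuples found in the HTML."""
    parser = MetaRefreshParser()
    parser.feed(html)
    parser.close()
    return parser.refreshes
```

Run against the example tag above, this returns a delay of 0 and the target http://www.example.com/; a page like 2a would report a delay of 5.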
0 second 'meta refresh' works like a 302 hijack - but not always!
Google attributed the content of page 1b to page 1a. Searching on the unique link text used returned no results; in other words page 1a effectively did not exist as far as Google was concerned. However, see the update below.
5 second 'meta refresh' works like a 301 redirect - but not always!
Google didn't index page 2a, but instead only page 2b. It didn't return any results for the unique text on page 2a, so again page 2a effectively did not exist to Google. Again, see below for an update.