The Google Sandbox Effect has been discussed at length in our
case study of a new website first crawled in May by Googlebot.
We can now further the case study with indexing comparisons
and discuss interesting Googlebot crawler behavior after
release, at the 75-day mark, of the study website from that
very confining Sandbox.
This case study is not for the faint of heart - those just
launching a new web business on a new domain name with hopes
of instant indexing and immediate traffic may find their
website very lonely for two and a half months - if it is in a
competitive market segment. You may as well plan to stay in
the Google Sandbox for an average of 45 days or more. If some
early release stories are to be believed, search phrases
nobody wants to play with are taken pity on by Google and sent
home for early release.
Those non-competitive or obscure search phrases seem to be
seen as good, quiet little children, playing by themselves in
the Sandbox playground, and are sent home early on good
behavior. Googlebot probably sees good behavior as playing
well with others, like a good little baby domain, and NOT
being competitive, as some young domains can be. Throwing sand
in other children's faces, insisting on having your site
indexed, and tossing sand out of the Sandbox with your bright
plastic toy shovel and bucket will not be allowed.
Now that the site discussed in this study is out of the
Sandbox, it still lingers on the playground, unable to escape
the community park and leave for the business world to play
with the big boys in the outside world. It does indeed take
time to grow up and be the model citizen in this new search
playground. Still, on the first full day after its first week
out of the Sandbox, the site received 68 visitors referred by
searches done at Google, the first search-referred traffic to
come into the site. MSN has sent 8 visitors, Yahoo has sent 6,
4 came from AOL searches, 2 from Netscape and 1 from Dogpile.
The indexing behavior of Yahoo and MSN has been nothing short
of bizarre. Numbers of indexed pages increased rapidly over
the first two months, reaching 6,941 pages indexed at 8 weeks
into this study. As we outlined previously, the numbers also
change as you click through the results pages, first upward,
then downward to about half of the highest total listed along
the top of the results pages.
It appears that Yahoo and MSN are playing on the 'slippery
slide' in this playground, climbing to the top of the ladder
of results at about the 10-week mark, showing 8,210 and 6,941
pages indexed respectively, then sliding down again to 3,510
for Yahoo and 373 for MSN as of this writing, two weeks later,
on August 6. Still, Yahoo will show you only 1,000 of those
results (100 pages) and MSN will show you only 250 results (25
pages), no matter how many they claim to index. MSNbot is
crawling the site faster and more consistently than any of the
other engines, yet shows far fewer pages indexed than the
others.
One of the interesting comparisons between Google and MSN in
our Sandbox study is that Google will show you most of what it
claims to have indexed. A query using the
"site:Publish101.com" operator returns a first page showing
only 3 or 4 results. Go to the bottom of that page and find
the link under the line reading,
"In order to show you the most relevant results, we have
omitted some entries very similar to the 3 already displayed.
If you like, you can repeat the search with the omitted
results included."
Go ahead and click that link, and you'll be presented with
the claimed total of indexed pages. That number has increased
very steadily since Sandbox release, 75 days after the first
crawl of this Sandbox study site. The number of indexed pages
at Google goes upward, and ONLY upward, with VERY distinct
timing patterns noted from raw log files.
Crawling schedules seem to have been established for this site
by Google and indexing changes occur on a very regular
schedule.
The first observation of Sandbox release was at noon on
Thursday, July 28, seventy-five days after the first crawl by
Googlebot, when a search turned up 379 pages indexed with a
"site:Publish101.com" query. That number increased later the
same day to 3,660 pages in a search done around the dinner
hour, Pacific time. Oddly, the next day, Friday, July 29, the
number took a slight hop upward to 3,700 pages and, on the
following Monday, showed 3,770 pages indexed.
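
As a rough aid for reading the first week's Google counts
reported above, the step-by-step changes can be worked out
with a few lines of Python. This is only a sketch: the counts
come from this article, while the labels and the step_changes
helper are names invented for illustration.

```python
# Google "site:" counts for the first week after Sandbox release,
# as reported in this article. Labels are informal, not exact timestamps.
observations = [
    ("Thu Jul 28, noon", 379),
    ("Thu Jul 28, evening", 3660),
    ("Fri Jul 29", 3700),
    ("following Monday", 3770),
]


def step_changes(series):
    """Return (label, count, absolute change, percent change) for each step."""
    rows = []
    for (_, previous), (label, count) in zip(series, series[1:]):
        delta = count - previous
        rows.append((label, count, delta, 100.0 * delta / previous))
    return rows


for label, count, delta, pct in step_changes(observations):
    print(f"{label:<20} {count:>6,} {delta:>+7,} {pct:>+8.1f}%")
```

The same helper can be fed any later series of counts to see
whether the pattern of one large jump followed by small hops
repeats.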
That schedule and pattern repeated in the second week after
Sandbox release, when a "site:Publish101.com" query produced
5,660 results from Google for the site on Thursday, August 4,
just after noon, and then nearly doubled around the dinner
hour to 10,700 pages on that same query. A final check just
now, on Saturday, shows 12,100 pages indexed by
Google. It should be pointed out to those who wonder about the
total number of pages that this is a dynamic site with a very
large archive of articles that increases daily as new
submissions are contributed by member authors at the site.
Those articles are added through a content management system
on a daily basis by an editor who reviews submissions and
processes them for approvals or rejections. Those approved are
made live from the home page nightly. We've started timing
this to the crawlers' schedules, as we've noted very regular
visits to the site home page by Yahoo's Slurp crawler, just
once daily at around 5pm, and by Googlebot, only once nightly
at near 11pm. So we've instituted a midnight activation of
each day's new article submissions on the home page of the
site, so that none of the new pages are missed by those
crawlers. MSNbot seems to hit the home page multiple times
through the day, so timing is less important for MSN.
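
Webmasters who want to check crawler visit times on their own
home page could tally hits by hour from raw logs along these
lines. This is a minimal sketch, not the method used in this
study: it assumes Apache-style Combined Log Format, the
user-agent substrings in CRAWLERS are assumptions to verify
against your own logs, and the sample lines are made up.

```python
import re
from collections import Counter, defaultdict

# Assumed user-agent substrings for each crawler; verify against your logs.
CRAWLERS = {"googlebot": "Google", "slurp": "Yahoo", "msnbot": "MSN"}

# Assumes Apache/NCSA Combined Log Format; the hour is server-local time.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[\d{2}/\w{3}/\d{4}:(?P<hour>\d{2}):\d{2}:\d{2} [^\]]+\] '
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def home_page_hits_by_hour(lines, home_path="/"):
    """Count home page requests per crawler, keyed by hour of day (0-23)."""
    hits = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match.group("path") != home_path:
            continue  # unparseable line, or not a home page request
        agent = match.group("agent").lower()
        for token, engine in CRAWLERS.items():
            if token in agent:
                hits[engine][int(match.group("hour"))] += 1
                break
    return hits


# Made-up sample lines for illustration only, not taken from the study's logs.
sample = [
    '192.0.2.10 - - [01/Jan/2005:23:02:11 -0800] "GET / HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.10 - - [01/Jan/2005:23:02:15 -0800] "GET /articles/1 HTTP/1.1" 200 812 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.20 - - [01/Jan/2005:17:01:40 -0800] "GET / HTTP/1.0" 200 5120 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '192.0.2.30 - - [01/Jan/2005:09:15:00 -0800] "GET / HTTP/1.0" 200 5120 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.30 - - [01/Jan/2005:14:45:09 -0800] "GET / HTTP/1.0" 200 5120 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.99 - - [01/Jan/2005:12:00:00 -0800] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

hits = home_page_hits_by_hour(sample)
for engine in sorted(hits):
    print(engine, sorted(hits[engine].items()))
```

A crawler whose hits cluster in a single hour, day after day,
is a candidate for the kind of scheduled content activation
described above.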
Crawler activity has been heated, with Yahoo crawling the
least and the slowest, barely seeming to attempt any updates.
Its total of indexed pages has not changed for over three
weeks, since it peaked at 8,210 pages indexed and then dropped
to its current level of 3,510. As previously stated, Slurp
seems to be unhindered by any form of consistency in indexing
or crawling behavior. MSNbot has crawled extensively and
fairly regularly for weeks, but that odd indexing behavior is
a serious flaw in MSN's utility as a search tool.
It should be mentioned here that AskJeeves was noted crawling
the site extensively early in this case study, with a very
regular and consistent crawl, but stopped abruptly three weeks
ago, on July 13, after hitting most of the pages then
available on the site. Teoma, their spider, has been absent
ever since, and they have not indexed this domain at all since
first crawling on May 23, over 10 weeks ago. Teoma appears to
have the longest Sandbox of all the search engines.
Much has been learned in this Sandbox case study about crawler
behavior, indexing delays, robots.txt requirements and index
updates at each of the top three search engines. Where that
knowledge leads will, of course, change as algorithms and
crawling schedules are adjusted by MSN, Yahoo and Google. But
valuable information has been shared that may help other
webmasters to better understand each of the factors that
determine the success of any website.
Further findings in follow-up articles at the 3, 6 and 9
month marks will explore search referrals gained as Google
adds more pages and ranking fluctuations begin to level.
Meanwhile, we'd like to encourage others to publicly review
their crawler traffic through logs, compare behavior on new
domains to verify these findings, disclose indexing behavior
and timing for new domains, and further document search engine
indexing as well as crawling behavior.
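
For anyone taking up that invitation, one possible starting
point is a per-crawler summary of first visit, last visit and
distinct URLs fetched. The sketch below is hypothetical and
was not part of this study: it assumes Combined Log Format,
the user-agent substrings are assumptions, and the sample
lines are invented.

```python
import re
from datetime import datetime

# Assumed user-agent substrings for each crawler; verify against your logs.
CRAWLERS = {
    "googlebot": "Google",
    "slurp": "Yahoo",
    "msnbot": "MSN",
    "teoma": "Ask Jeeves",
}

# Assumes Apache/NCSA Combined Log Format.
LOG_LINE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_summary(lines):
    """Map each crawler to its first/last visit date and distinct paths."""
    summary = {}
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue
        agent = match.group("agent").lower()
        engine = next(
            (name for token, name in CRAWLERS.items() if token in agent), None
        )
        if engine is None:
            continue  # not one of the crawlers we are tracking
        day = datetime.strptime(match.group("day"), "%d/%b/%Y").date()
        entry = summary.setdefault(
            engine, {"first": day, "last": day, "paths": set()}
        )
        entry["first"] = min(entry["first"], day)
        entry["last"] = max(entry["last"], day)
        entry["paths"].add(match.group("path"))
    return summary


# Invented sample lines for illustration only, not taken from the study's logs.
sample = [
    '192.0.2.10 - - [01/Jan/2005:23:02:11 -0800] "GET / HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.10 - - [10/Jan/2005:23:05:00 -0800] "GET /a HTTP/1.1" 200 812 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.40 - - [02/Jan/2005:04:10:00 -0800] "GET /a HTTP/1.0" 200 812 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)"',
    '192.0.2.40 - - [05/Jan/2005:04:12:00 -0800] "GET /b HTTP/1.0" 200 640 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)"',
    '192.0.2.40 - - [05/Jan/2005:04:13:00 -0800] "GET /a HTTP/1.0" 200 812 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)"',
]

for engine, entry in sorted(crawler_summary(sample).items()):
    days = (entry["last"] - entry["first"]).days
    print(engine, entry["first"], entry["last"], days, len(entry["paths"]))
```

A last-visit date that stops advancing, as with Teoma in this
study, shows up immediately in a summary like this.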
Copyright © August 6, 2005
Previous Sandbox Case Study Articles:
http://Publish101.com/Sandbox2
http://Publish101.com/Sandbox3
http://Publish101.com/Sandbox4
Mike Banks Valentine is a search engine optimization
specialist