BURNINGBIRD
a node at the edge  


May 08, 2002
Technology
Making Peace with Google

I can't wait until I get up in the morning and pop on to my machine so I can download 50+ spam emails. One of the funnest games of the day is to try and find "real" email among all of the junk. When I find one, I holler out "email whack!"

As you can tell, I am being facetious. I don't know of anyone who likes spam, or wants to spend time on it, or wants to waste email bandwidth on it.

So why do we all like the crazy hits we get from Google?

Dave Winer pointed out a posting from Jon Udell discussing a posting from Dave Sims at O'Reilly. In it, Dave Sims wrote:

    Google's being weakened by its reliance on webloggers and their crosslinks
    ...
    If Google wants to evolve into a functional resource for all users, it will have to work itself off this current path, or it will open up an opportunity for The Next Great Search Engine.

Jon responds with:

    In the long run, the problem is not with Google, but with a world that hasn't yet caught up with the web. I'm certain that in 10 years, US Senators and Inspectors General will leave web footprints commensurate with their power and influence. I hope that future web will, however, continue to even the odds and level the playing field.

Sorry, Jon. I'm with Dave Sims on this one. Weblogs are weakening Google.

When I ported the Burningbird to Movable Type and moved to the new location, I also created a robots.txt file that disallowed any web bot other than the blogdex or Daypop bots. And the Googlebot, being a well-behaved critter, has honored this (as have several other bots; my referrer log is getting sparkly clean).
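For anyone who wants to do the same, a minimal robots.txt along these lines would do it. This is only a sketch -- the user-agent strings for the blogdex and Daypop bots are my guess at what those bots send, so check each bot's documentation for the exact names before relying on them:

    # Let the blogdex and Daypop bots crawl everything
    # (the user-agent names below are illustrative -- verify the exact strings)
    User-agent: blogdex
    Disallow:

    User-agent: daypop
    Disallow:

    # Every other bot, Googlebot included, is kept out of the whole site
    User-agent: *
    Disallow: /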

In the meantime, I've left my old site as is, bot-beaten poor little thing that it is. As a result, in the referrer log I've found the following searches:

rufus wainright shrek
devonshire tea graphics
missouri point system drivers license
bill gates popular science
entrenched in hatred
richard ashcroft money to burn
shelley bird
pictures of terrorists burning american flags
south carolina state patrol fishing
pictures of women in afghanistan
we start fire billy joel
fairy tale blue bird
beautiful outlook pictures
fighting fishies
high blood pressure burning
hacking statistics in Australia
lord of the rings pictures and drawings sting sword
add morpheus node

...and on and on

And all of these Google searches happened in three days' time. Three days.

Comparing usage estimates, Google was effectively chewing up over 30% of my web site's CPU and bandwidth on searches that were accurate, on average, 3% of the time.

My regular web sites (Dynamic Earth, YASD, P2P Smoke, and Burningbird Network) have, on average, seven times the traffic of my weblog, with half the Google traffic and an accuracy of over 98%. In other words, Google searches that result in hits to the regular web sites are finding resources that match the search. People may still continue looking at other sites, but the topic of the search is being met by the topic covered in the page.

Weblogs -- might as well call us Google Viruses.

This isn't to say that Google and weblogs can't work together, but it isn't up to Google to make this happen. Google is a web bot and an algorithm; we're supposed to be the ones with the brains.

Weblogs that focus on one specific topic are ideal candidates for Google scanning. For instance, zem's weblog focuses on cryptography, security, and copyright. Because he consistently stays on topic, he's increasing his accuracy ratio -- people are going to find data on the page that meets their search.

Victor, who's as interested in Google as I am, is trying to work with Google by creating a new weblog that focuses purely on web development resources, Macromedia products, and browser development. It's early days yet, but as time goes by and more people discover Victor's weblog, he should increase his Google page rank, resulting in an increase in both the number and accuracy of his Google hits.

So what's a weblogger who just wants to have fun to do? Well, if you don't mind the crazy searches and the waste of your bandwidth and CPU, don't do anything. Let all those little bots just crawl all over your weblog's butt. Google's bandwidth and accuracy are Google's problem (time for smarter algorithms, perhaps).

However:

- if you're saving up to add some nice graphics or MP3 files to your weblog and your bandwidth is restricted, as it is on most servers, or

- if you're getting tired of crawling through the bizarre Google searches, or

- if you're getting tired of not being able to put "xxx" on your weblog page

then you might want to consider providing a few helpful aids to Google.

Google Helpful Aids

1. Create a robots.txt file and restrict Googlebot to specific areas of your weblog web site -- not including your weblog page or archives.

2. If possible, create individual archive pages for each post. Otherwise, for all posts that deserve to stand alone, copy the generated HTML into a separate file.

3. For your weblog posts that you think will make a great resource, and that stay on topic and don't meander all over the place, copy or hard link them (if you're using Unix) to a directory that allows bots to crawl (see the sketch after this list).

4. Avoid the use of 'xxx' in any shape or form in any of your Googlized pages.
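To illustrate aids 1 and 3 together, here's a rough sketch. The directory names (/weblog/ for the weblog itself, /googlized/ for the bot-friendly copies) and the file names are made up for the example, not a recommendation:

    # robots.txt fragment: keep Googlebot away from the weblog page and archives,
    # while leaving the stand-alone copies under /googlized/ open to crawling
    User-agent: Googlebot
    Disallow: /weblog/

    # On Unix, hard link a good stand-alone post into the crawlable directory
    # (paths are hypothetical)
    ln /weblog/archives/000123.html /googlized/making-peace-with-google.html

The hard link means the Googlized copy stays in step with the original file without doubling the disk space.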

Over time, we'll add to these aids.

Now, if only I can figure out what to do with all these XML and RDF aggregators that are now crawling all over my server....



Posted by Bb at May 08, 2002 10:09 PM




Comments

Shelley, if you don't include your weblog page or archives, what's left for Google to index?

Posted by: Jonathon Delacour on May 9, 2002 01:15 AM

I'm glad you've expounded on why you don't want Google crawling through. Admittedly, I find my referrer logs highly amusing when I see an odd search come in - although I don't have nearly the traffic you do, so it's not a problem for me.

By the way, +3.

Posted by: Bill Simoni on May 9, 2002 03:30 AM

Jonathon, good question that points to a clarification:

Your category and multi-posting archives -- anything that mixes content from multiple postings. You will want to Googlize your single-posting archive files (or separate postings, if you split them manually).

One other thing to consider -- de-localize the postings. For instance, don't say something such as "Today Jonathon wrote...", which implies that the reader knows Jonathon. This works for Weblogging, but not Google. Instead, when you Googlize a page, use something such as, "A friend of mine wrote something once..." or along those lines -- add context.

more to come...

Posted by: Bb aka Shelley aka Weblog Bosswoman on May 9, 2002 08:10 AM

Sorry, you don't want to googlize your monthly or category archives.

Bill, thanks for the update. Will add.

Posted by: Bb aka Shelley aka Weblog Bosswoman on May 9, 2002 08:11 AM

Shelley, it's not just staying on topic that increases Google's accuracy on my weblog. Archiving each article on a separate page - rather than daily or weekly or monthly pages with "#anchors" for individual posts - has made a huge difference.

There are still a fair number of bogus search hits on my monthly archive pages (although Google seems to be gradually 'learning' to prefer the individual articles). I'm planning a few subtle changes using robots.txt and some archive indexes designed specifically to help search robots, which will hopefully cut the bogus hits even further. If it works well, I'll write up the results.

BTW, one other thing to consider: many of those search terms you list are so unfocused that Burningbird is probably as good a place as any for them to find. Not everyone knows what they're looking for.

Posted by: zem on May 9, 2002 09:34 PM

