Help search engines index your site

This entry was published at least two years ago (originally posted on July 30, 2003). Since that time the information may have become outdated or my beliefs may have changed (in general, assume a more open and liberal current viewpoint). A fuller disclaimer is available.

We all know that Google is god. Chances are you’ve used Google when doing a search on the ‘net at least once, if not daily, or many times a day. If not, then I’ve heard rumors that there are other search engines out there — though I haven’t used any in so long, I can’t really vouch for the veracity of that rumor. ;)

I wanted to share a few tricks I use here to help Google (and other search engines) index my site, and to try to ensure that searches that hit my site get the most useful results.

All of the following tips and tricks do require access to your source HTML templates (in TypePad, you’ll need to be using an Advanced Template Set). While I’m writing this for an Advanced TypePad installation, the tips will work just as well in any other website or weblog application where you have access to the HTML code.

Specify which pages get indexed, and which don’t

What? One of the most important pages on a weblog from a user’s point of view is the main page. It has all your latest posts, all the links to your archives, your bio, other sites you enjoy reading, webrings, and who all knows what else. However, from the perspective of a search engine, the main page of a weblog is most likely the single least important page of the entire site!

This is simply because the main page of a weblog is always changing, but search engines can only give good results when the information that they index is still there the next time around. I’ve run into quite a few situations where I’ve done a search for one term or another, and one of the search results leads to someone’s weblog. Unfortunately, when I go to their page, the entry that Google read and indexed is no longer on the main page. At that point, I could start digging through their archives and trying to track down what I’m looking for — but I’m far more likely to just bounce back to Google and try another page.

Thankfully enough, though, there’s an extremely easy fix for this that keeps everyone happy.

How? One short line of code at the top of some of your templates is all it takes to solve the problem. We’re going to be using the robots meta tag in the head of the HTML document. The tag was designed specifically to give robots (or spiders, or crawlers — the automated programs that search engines use to read websites) instructions on what pages should or shouldn’t be indexed.

For the purposes of a weblog, with one constantly changing index page and many static archive pages, the best possible situation would be to tell the search engine to read and follow all the links on an index page (so that it finds all the other pages of a site), but not to index that page. The rest of the site, it will be free to read and index normally.

That’s very easy to set up, as it turns out. The robots meta tag allows four possible arguments:

INDEX: Read and index a page normally
NOINDEX: Do not index any of the text of the page
FOLLOW: Follow all the links on a page to read linked pages
NOFOLLOW: Ignore all links on a page

So, in order to do what we want, we add the following meta tag to our document, in the head section, right next to the meta tags that are already there:

<meta name="robots" content="noindex,follow" />

Now, when a search engine robot visits the index page of the site, it knows that it should not index the page and add it to its database, but that it should still follow any links on that page to find other pages within the site. This way, searches that return hits for the site will point to the archive pages that hold the requested information, rather than to your front page, which may not have the information anymore.
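To make the four arguments a little more concrete, here’s a rough Python sketch of how a robot might read the tag’s content value. This is purely an illustration, not code from any real search engine: the function name parse_robots_content is made up, and I’m assuming that a robot indexes and follows by default unless told otherwise.

```python
def parse_robots_content(content):
    """Return (index, follow) flags for a robots meta tag's content value.

    Hypothetical helper for illustration only.
    """
    # Assumption: a page is indexable and its links followable
    # unless a directive says otherwise.
    index, follow = True, True
    for directive in content.split(","):
        directive = directive.strip().lower()
        if directive == "noindex":
            index = False
        elif directive == "index":
            index = True
        elif directive == "nofollow":
            follow = False
        elif directive == "follow":
            follow = True
    return index, follow


# The tag suggested for a weblog's main page:
index, follow = parse_robots_content("noindex,follow")
print("index this page:", index)
print("follow its links:", follow)
```

With "noindex,follow" the page itself is skipped, but every link on it is still fair game, which is exactly the behavior we want for a constantly changing main page.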

Update: It turns out that this technique may have some side effects that I hadn’t considered, and might possibly not work at all. For more details, please scroll down to Anode’s comment and my reply in the comment thread for this post. Hopefully I’ll be able to dig up more information on this soon.

Fine tune what sections of a page get indexed

What? There is a proposed extension to the robots meta tag that allows you to not just designate which pages of a site get indexed, but also which sections of a page get indexed. I discovered this when I was setting up a shareware search engine for my old website, and have since gotten in the habit of using it. Now, this is not a formal standard, and I don’t know for sure which search engines support it and which don’t — the creator of this technique has suggested it to the major search sites, but it is not known what the final result was.

Now, why would you want to do this? Simply this: on many weblogs, including TypePad sites, the sidebar information is repeated on every page of the site. There is also certain informational text repeated on every page (for instance, the TrackBack data, the comments form, and so on). This creates a lot of extraneous, mostly useless data — doubly so when that information changes regularly.

By using these proposed tags, any search engine that supports them will only index the sections of a page that we want indexed, and will disregard the rest of the page.

How? Because this is based on the robots meta tag discussed above, it uses the same four arguments (INDEX, NOINDEX, FOLLOW, and NOFOLLOW). Instead of using a meta tag, though, we use HTML comment syntax to designate the different sections of our document.

For instance, every individual archive page on a TypePad weblog that has TrackBack enabled will have the following text (or something very similar):

Trackback
TrackBack URL for this entry:
http://www.typepad.com/t/trackback/(number)

Listed below are links to weblogs that reference (the name of the post)

In order to mark this out as a section that we want the search engine neither to index nor to follow (as the only link is to the page that the link is on), we would surround it with the following specialized tags:

<!-- robots content="noindex,nofollow" -->
<!-- /robots -->

For example, I would change the code in the TypePad Individual Entry template to look like this:

<MTEntryIfAllowPings>
<!-- robots content="noindex,nofollow" -->
<h2><a id="trackback"></a>TrackBack</h2>
TrackBack URL for this entry:<br /><$MTEntryTrackbackLink$>
Listed below are links to weblogs that reference <a href="<$MTEntryPermalink$>"><$MTEntryTitle$></a>:
<!-- /robots -->
<MTPings>

The same technique can be used wherever you have areas in your site with content that doesn’t really need to be indexed.
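For the curious, here’s a minimal Python sketch of what a search engine that honored these comment tags might do before indexing a page: throw away anything wrapped in a section whose content value includes noindex. Since I don’t know how (or whether) any real engine implements this, treat the function name strip_noindex_sections and the regular expression as my own assumptions. It also doesn’t attempt to handle nested sections.

```python
import re

# Matches any section wrapped in the proposed comment tags and captures
# its content value. DOTALL lets "." span line breaks inside the section.
ROBOTS_SECTION = re.compile(
    r'<!--\s*robots\s+content="([^"]*)"\s*-->.*?<!--\s*/robots\s*-->',
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex_sections(html):
    """Remove sections marked noindex; leave every other section alone.

    Hypothetical helper for illustration only.
    """
    def replace(match):
        directives = [d.strip().lower() for d in match.group(1).split(",")]
        # Drop the whole section if it asks not to be indexed,
        # otherwise return it unchanged.
        return "" if "noindex" in directives else match.group(0)

    return ROBOTS_SECTION.sub(replace, html)


page = (
    "<p>The actual post text.</p>\n"
    '<!-- robots content="noindex,nofollow" -->\n'
    "<h2>TrackBack</h2>\n"
    "TrackBack URL for this entry:\n"
    "<!-- /robots -->\n"
    "<p>Comments</p>"
)
print(strip_noindex_sections(page))
```

Only the post text and the comments survive, so the repeated TrackBack boilerplate never makes it into the index.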

Now, as I stated above, this is only a proposed specification, and it is not known which (if any) search engines support it. It also requires a healthy chunk of mucking around with your template code. Because of these two factors, it may not be an approach that you want to take, instead simply using the “sledgehammer” approach of the page-level robots meta tag discussed above.

However, I do think that the possible benefits of this being used more widely would be worth the extra time and trouble (at least, for those of us obsessive about our code), and I’d also suggest that, should TypePad gain search functionality, these codes be recognized and followed by the (purely theoretical, at this point) TypePad search engine.

Put the entry excerpt to use

What? The entry excerpt is another very handy field to use in fine tuning your site. I believe that the field is turned off on the post editing screen by default, but it can be enabled by clicking on the ‘Customize the display of this page’ link at the bottom of the post editing screen.

By default, the entry excerpt is used for two things in TypePad: when you send a TrackBack ping to another weblog, the excerpt is sent along with the ping as a short summary of your post; and it is used as the post summary in your RSS feed if you have selected the ‘excerpts only’ version of the feed in your weblog configuration. It can come in handy in a few other places too. One that I’ve discussed previously is in your archive pages. Another is helping out search engines.

You may have noticed that when you do a search on Google, rather than simply returning the link and page title, Google also returns a short snippet of each page that the search finds. Normally, this snippet is just a bit of text from the page being referenced, intended to give you some context and a better idea of how successful your search was. There is a meta tag, though, that lets us determine exactly what text Google displays for the summary, which is where the entry excerpt field comes in.

How? We’re adding another meta tag here, so this will go up in the head section of your Individual Archives template. Next to any other meta tags you have, add the following line:

<meta name="description" content="<$MTEntryExcerpt$>" />

Then save, and republish your Individual Archives, and you’re done. Now, the next time that Google indexes your site, the excerpt will be saved as the summary for that page, and will display beneath the link when one of your pages comes up in a Google search.
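One thing to watch for (a commenter on this post pointed it out): a double quotation mark inside an excerpt ends the content attribute early and breaks the tag. The Python sketch below shows the problem and one way around it, using the standard library’s html.escape. It is an illustration of the idea only, not something TypePad runs, and the function name description_meta is made up.

```python
from html import escape


def description_meta(excerpt):
    """Build a description meta tag that survives quotes in the excerpt.

    Hypothetical helper for illustration only.
    """
    # escape() with quote=True turns " into &quot; (and also escapes
    # &, <, > and '), so the attribute value cannot be cut short.
    safe_excerpt = escape(excerpt, quote=True)
    return '<meta name="description" content="{}" />'.format(safe_excerpt)


print(description_meta('In which we find out that "simple" is relative.'))
```

If you can’t escape the quotes in your templates, the low-tech alternative is to stick to single quotes when writing excerpts.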

So what happens if you don’t use the entry excerpt field? Well, TypePad is smart enough to do its best to cover for this — if you use the <$MTEntryExcerpt$> tag in a template, and no excerpt has been added to the post, TypePad automatically pulls the first 20 words of your post to be the excerpt. While this works to a certain extent, it doesn’t create a very useful excerpt (unless you’re in the habit of writing extremely short posts). It’s far better to take a moment to create an excerpt by hand, whether it’s a quick cut and paste of relevant text in the post, or whether it’s more detailed (“In which we find out that…yadda yadda yadda.”). In the end, of course, it’s your call!
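To see why the automatic excerpt tends to fall short, here’s a small Python sketch of a “first 20 words” fallback. I’m assuming simple whitespace splitting; how TypePad actually counts words or trims the result is not something I know, and the name fallback_excerpt is my own.

```python
def fallback_excerpt(post_text, word_count=20):
    """Return the first word_count words of a post as a stand-in excerpt.

    Hypothetical helper; assumes words are separated by whitespace.
    """
    words = post_text.split()
    return " ".join(words[:word_count])


post = (
    "I wanted to share a few tricks I use here to help Google (and other "
    "search engines) index my site, and to try to ensure that searches "
    "that hit my site get the most useful results."
)
print(fallback_excerpt(post))
```

The cut lands wherever word twenty happens to fall, usually in the middle of a sentence, which is why a hand-written excerpt almost always reads better.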

Use the Keywords

What? Keywords are short, simple terms that are either used in a page, or relate to the page. The original intent was to place a line in the head of an HTML page that listed keywords for that page, which search engines could read in addition to the page content to help in indexing.

Unfortunately, keywords have been heavily abused over the years. ‘Search Engine Optimizers’ started putting everything including the kitchen sink into their HTML pages for keywords in an effort to drive their pages’ rankings higher in the search engines. Because of this, some of the major search engines (Google included) now disregard the ‘keywords’ meta tag. Not all of them do, however, and used correctly, keywords can be a helpful additional resource for categorizing and indexing pages.

How? One of the various fields you can use for data in each TypePad post is the ‘Keywords’ field. I believe that it is turned off by default, but you can enable it by clicking on the ‘Customize the display of this page’ link at the bottom of your TypePad ‘Post an Entry’ screen.

Once you have the ‘Keywords’ field available, you can add specific keywords for each post. You can either use words that actually appear in the post, or words that relate closely to it. For instance, I’ve had posts where I’ve used the acronym WMD in the body of the post, then added the three keywords ‘weapons mass destruction’ to the keywords field. You never know exactly what terms someone will use in their search, so you might as well give them the best shot at success, right?

Okay, so now you have keywords in your posts. What now? By default, TypePad’s templates don’t actually use the data in the Keywords field at all. This is fairly easy to fix, however.

In your Individual Archives template, add the following line of code just after the meta tags that are already there:

<meta name="keywords" content="<$MTEntryKeywords$>" />

Then save your template, republish your site (you can republish everything, but doing just the Individual Archives is fine, too, as that’s all that changed), and you’re done! Now, the next time that a search engine that reads the keywords meta tag reads your site, you’ve got that much more information on every individual post to help index your site correctly.

Conclusion

So there we have it. One extremely long post from me, with four hopefully handy tips for you on how you can help Google, and the rest of the search engines out there, index your site more intelligently. If you find this information of use, wonderful! If not…well, I hope you didn’t waste too much of your day reading it. ;)

Feel free to leave any questions, comments, or words of wisdom in the comments below!

10 thoughts on “Help search engines index your site”

Laura

July 31, 2003 at 9:08 am

Great post! I’m not currently using a version of TP with the capability for advanced templates, but the advice on using excerpts is useful to me. I haven’t been utilizing them, but I can see now why they’d be helpful.

Oh, and slightly off-topic, but amusing: See this “User Friendly” cartoon, and the five that follow chronologically.
lane

July 31, 2003 at 10:41 am

Good tips. I implemented the robots/follow idea, but got an error from TypePad when i tried to add the Entry Excerpt meta tag. The error read as follows:

Build error in template ‘Archive Index Template’: Error in tag: You used an ‘MTEntryExcerpt’ tag outside of the context of an entry; perhaps you mistakenly placed it outside of an ‘MTEntries’ container?

If anyone else runs into this, let me know. All i did was copy and paste the meta tag as listed above into my Archive Index Template.

Splat.
Michael Hanscom

July 31, 2003 at 10:56 am

Lane – the EntryExcerpt meta ‘description’ tag should be put in the Individual Entry Archive template (for the individual entry archives), rather than the Archive Index template (for the main index page of the archives).

Because the archives main page (which is what the Archive Index Template controls) can contain multiple posts, a single excerpt won’t work there.
Anode

August 1, 2003 at 3:56 am

Only problem is, if you don’t let Google index your front page, you’ll kill all the PageRank for it. Since a plurality of the links to your blog will likely be pointed at the front page, this is a BadThing ™ if you care about that sort of stuff (and if you’re trying to make it easier for Google to index you, you probably do.)
Michael Hanscom

August 1, 2003 at 10:40 am

Hmmm. I never thought about that as a possible side effect. It never seemed to affect me, though. It was always very easy for me to bring up my site in Google through searches for my name or the site’s name.

Of course, now that I actually try that, I see that Google does have the index page to my site indexed and cached (it’s the first result for the search on my name). As I understand the meta robots tag, that shouldn’t be the case. Now, I’m confused…guess it’s time to hit the books again. Harrumpf.

Thanks for bringing that up, this’ll be worth looking into.
Scott Johnson

April 19, 2006 at 4:12 pm

I just found this post via Google and have a little something to add. When using MTEntryExcerpt in the description META tag, beware of quotation marks. I use the MT Regex plugin to correct this:

<MTAddRegex name="dequote">s|"|'|g</MTAddRegex>

<meta name="description" content="<$MTEntryExcerpt regex="dequote"$>" />
Michael Hanscom

April 19, 2006 at 5:42 pm

Thanks Scott, that’s a good point. I’ve gotten in the habit of ensuring that I only use single quotes rather than double quotes in my excerpts, but the regex trick is another good way to do it.
