Prior art for ‘nofollow’ blocking

This entry was published at least two years ago (originally posted on January 19, 2005). Since that time the information may have become outdated or my beliefs may have changed (in general, assume a more open and liberal current viewpoint). A fuller disclaimer is available.

With the addition of rel=“nofollow” to our arsenal of anti-spam tools, there’s a certain level of chatter about the ability to add a block element to a webpage to delineate certain areas of the page that should not be indexed by Google or other search engines.

Most of the time I see this mentioned, credit has gone to Brad Choate’s post from Feb. 2002 for first advancing the idea. However, the idea itself dates as far back as Jan. 2001 in Zoltan Milosevic’s Fluid Dynamics Search Engine, a shareware site-specific search engine.

I used the FDSE on my site for a while (starting Feb. 6, 2002), and found its support for blocking sections of pages from the search engine to be incredibly useful.

For instance, the sidebar on my site changes frequently: on the front page, the linklog updates often, somtimes multiple times a day; and on the individual pages, the ‘related entries’ list can change over time as new entries are added and the pages are rebuilt. Because of this, it’s not uncommon for me to see people arrive through Google searches for terms that were in the sidebar of a particular page when Google’s spider crawled my site, but have since disappeared.

In another situation, try using Google to search my site for an instance of when I’m actually talking about TrackBack: as the term “TrackBack” is on every single individual entry page, the noise to content ratio is weighted in entirely the wrong direction. If I had the ability to block off the sidebar and the TrackBack section header, these problems could be avoided.

FDSE allowed me to do just that — and part of what I liked about it was that it used the same syntax as the standard robot commands used in robots.txt files or meta tags. From the FDSE Help Pages:

FDSE supports the proprietary “robots” comment tag. This tag allows a web author to apply robots exclusion rules to arbitrary sections of a document. The tag has one attribute, content, with the following possible values:

noindex – the text enclosed in the tag is not saved in the index

nofollow – links are not extracted from the text enclosed

none – enclosed text is not indexed nor searched for links

Values “index”, “follow”, and “all” are also valid. In practice they are ignored since they are the unspoken defaults.

This feature is expected to fit the customer need of preventing certain parts of a document – such as a navigational sidebar – from being included in the search.

Example:

<HTML>
<BODY>

    This text will be indexed.
    <a href="foo.html"> this link will be followed </A>

    <!-- robots content="none" -->

        This text will NOT be indexed.
        <a href="bar.html"> this link will NOT be followed </A>

    <!-- /robots -->

    <!-- robots content="noindex" -->

        This text will NOT be indexed.
        <a href="bar1.html"> this link WILL be followed </A>

    <!-- /robots -->

    <!-- robots content="nofollow" -->

        This text WILL be indexed.
        <a href="bar1.html"> this link will NOT be followed </A>

    <!-- /robots -->

    la la la

</BODY>
</HTML>

For the example of a navigational sidebar, the “noindex” vale would be the best choice.

This syntax was designed to match the robots META tag.

For documents which have both the “robots” META tag and the “robots” comment tag, the most restrictive interpretation will be made, always erring on the side on not indexing or not following.

According to the above cited help documentation, Milosevic introduced this functionality in v2.0.0.0031 of the FDSE, and a quick check of FDSE’s version history dates that release to Jan. 26th, 2001 — four years before even a hint of its functionality was added to the major search engines, and just over a year before Brad’s post went up (no disrespect at all is meant to Brad here — different people have the same ideas fairly often, after all, and it’s an equally good idea no matter who came up with it — I’m just trying to give credit where credit is due, since this is a technique I’m actually familiar with).

Obviously, I’m fairly happy about seeing rel=“nofollow” gain support with Google and the other search engines. Equally obviously by this point, I’m sure, I’d love to see a block-level implementation made available, and I think Milosevic had a good approach. It’s easy to implement, follows already established conventions (robots.txt and meta tags), validates (as it’s simply an HTML comment), and allows for a little more control than a simple on/off ignore switch would.