All Request Saturday

Here’s an interesting idea, stolen from Terrance, who stole it from Stay of Execution: an all-request day.

Something about me you’d like to know? Something you’d like me to ramble on about? Pick a topic, any topic, and drop it in the comments. Come Saturday, I’ll go through what (if anything) is there and start babbling.

Of course, if nothing appears, I still reserve the right to go on about whatever I damn well please, so don’t think that by not suggesting anything you’re any more likely to get me to shut up. :)

iTunes: “Situation (The English Breakfast)” by Yaz from the album Don’t Go/Situation (1999, 9:04).

Prior art for ‘nofollow’ blocking

With rel="nofollow" added to our arsenal of anti-spam tools, there’s been some chatter about a related idea: a block-level element that marks off areas of a page that Google and other search engines should not index.

Most of the time I see this mentioned, credit has gone to Brad Choate’s post from Feb. 2002 for first advancing the idea. However, the idea itself dates as far back as Jan. 2001 in Zoltan Milosevic’s Fluid Dynamics Search Engine, a shareware site-specific search engine.

I used the FDSE on my site for a while (starting Feb. 6, 2002), and found its support for blocking sections of pages from the search engine to be incredibly useful.

For instance, the sidebar on my site changes frequently: on the front page, the linklog updates often, sometimes multiple times a day; and on the individual pages, the ‘related entries’ list can change over time as new entries are added and the pages are rebuilt. Because of this, it’s not uncommon for me to see people arrive through Google searches for terms that were in the sidebar of a particular page when Google’s spider crawled my site, but have since disappeared.

In another situation, try using Google to search my site for an instance of when I’m actually talking about TrackBack: as the term “TrackBack” is on every single individual entry page, the noise to content ratio is weighted in entirely the wrong direction. If I had the ability to block off the sidebar and the TrackBack section header, these problems could be avoided.

FDSE allowed me to do just that — and part of what I liked about it was that it used the same syntax as the standard robot commands used in robots.txt files or meta tags. From the FDSE Help Pages:

FDSE supports the proprietary “robots” comment tag. This tag allows a web author to apply robots exclusion rules to arbitrary sections of a document. The tag has one attribute, content, with the following possible values:

  • noindex – the text enclosed in the tag is not saved in the index
  • nofollow – links are not extracted from the text enclosed
  • none – enclosed text is not indexed nor searched for links

Values “index”, “follow”, and “all” are also valid. In practice they are ignored since they are the unspoken defaults.

This feature is expected to fit the customer need of preventing certain parts of a document – such as a navigational sidebar – from being included in the search.

Example:

<HTML>
<BODY>

    This text will be indexed.
    <a href="foo.html"> this link will be followed </A>

    <!-- robots content="none" -->

        This text will NOT be indexed.
        <a href="bar.html"> this link will NOT be followed </A>

    <!-- /robots -->

    <!-- robots content="noindex" -->

        This text will NOT be indexed.
        <a href="bar1.html"> this link WILL be followed </A>

    <!-- /robots -->

    <!-- robots content="nofollow" -->

        This text WILL be indexed.
        <a href="bar1.html"> this link will NOT be followed </A>

    <!-- /robots -->

    la la la

</BODY>
</HTML>

For the example of a navigational sidebar, the “noindex” value would be the best choice.

This syntax was designed to match the robots META tag.

For documents which have both the “robots” META tag and the “robots” comment tag, the most restrictive interpretation will be made, always erring on the side of not indexing or not following.

According to the above cited help documentation, Milosevic introduced this functionality in v2.0.0.0031 of the FDSE, and a quick check of FDSE’s version history dates that release to Jan. 26th, 2001 — four years before even a hint of its functionality was added to the major search engines, and just over a year before Brad’s post went up (no disrespect at all is meant to Brad here — different people have the same ideas fairly often, after all, and it’s an equally good idea no matter who came up with it — I’m just trying to give credit where credit is due, since this is a technique I’m actually familiar with).

Obviously, I’m fairly happy to see rel="nofollow" gain support from Google and the other search engines. Equally obviously by this point, I’d love to see a block-level implementation made available, and I think Milosevic had a good approach. It’s easy to implement, follows already established conventions (robots.txt and meta tags), validates (as it’s simply an HTML comment), and allows a little more control than a simple on/off ignore switch would.
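To illustrate how a search tool might honor these comment tags, here is a minimal Python sketch. It is not FDSE’s actual code: the function name `apply_robots_comments` and the regular expressions are made up for this example, and regex-based HTML handling is only good enough for a demonstration. It ignores nested blocks and any interaction with the robots META tag.

```python
import re

# Matches one FDSE-style block:
#   <!-- robots content="VALUE" --> ... <!-- /robots -->
ROBOTS_BLOCK = re.compile(
    r'<!--\s*robots\s+content="(?P<value>[^"]*)"\s*-->'
    r'(?P<body>.*?)'
    r'<!--\s*/robots\s*-->',
    re.IGNORECASE | re.DOTALL,
)
LINK = re.compile(r'<a\s[^>]*href="([^"]*)"', re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def apply_robots_comments(html):
    """Return (indexable_text, followable_links) for one page.

    Hypothetical helper: text inside "noindex" or "none" blocks is dropped,
    and links inside "nofollow" or "none" blocks are not collected.
    """
    text_parts = []
    links = []

    def keep(fragment, index=True, follow=True):
        if follow:
            links.extend(LINK.findall(fragment))
        if index:
            # Replace tags with spaces so only the visible text remains.
            text_parts.append(TAG.sub(" ", fragment))

    position = 0
    for match in ROBOTS_BLOCK.finditer(html):
        # Everything before the block is unmarked: index and follow it.
        keep(html[position:match.start()])
        value = match.group("value").strip().lower()
        keep(
            match.group("body"),
            index=value not in ("noindex", "none"),
            follow=value not in ("nofollow", "none"),
        )
        position = match.end()
    keep(html[position:])

    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(" ".join(text_parts).split()), links
```

Values such as “index”, “follow”, and “all” fall through to the defaults here, which matches the help text’s note that they are ignored in practice.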

Battling the spammers

Over the past few days, I’ve noticed off and on that my webserver has been extremely slow to respond — less obviously when just browsing pages, but attempting to connect to the Movable Type interface was increasingly difficult, often resulting in nothing but timeouts and connection failures.

I had a hunch that I knew what was going on, but I wasn’t entirely sure at first. I logged in to the server locally — something I haven’t had to do in a while — and realized just how badly the machine was bogged down when the OS X user interface was almost as unresponsive as Movable Type. Not a good sign. Once I made it in and got a terminal window up, I ran top -u 15 to see what was going on.

Not surprisingly, every entry that top displayed was a perl process, with mysqld occasionally clawing its way to the top for a moment or two. Now I was almost entirely sure that one or more of the sites I host was under a major automated comment spam attack, as even with MT-Blacklist installed and refusing the majority of the submitted comments, it would require a certain amount of processing for each request, and while I’m not sure just how many a minute were being submitted, it was obviously enough to bring my server to its knees.

So, seeing if I could kill two birds with one stone, I renamed all the comment and trackback scripts on the webserver, figuring that this would kill any in-progress attack and in doing so, confirm that it was a spam attack. Sure enough, as the multitudes of perl processes slowly worked their way through to completion, top started running faster (it had been updating once every 6-10 seconds, rather than once a second) and other processes started to show up on the display. After about two minutes, there wasn’t a single perl process on top’s list, top was updating at its standard once-per-second frequency, and the computer’s UI was responding as it should.

The downside to this technique is that it breaks comment and trackback ability. Easy enough to fix, though, with a quick change to MT’s config file and a rebuild of the sites. So, the comment scripts have been renamed, and I’m in the process of rebuilding the sites to reflect the new script locations.

And you know what?

Even in mid-rebuild, I’m already starting to watch the number of perl processes climb. One or two I’d expect while rebuilding the site, but I’m currently seeing anywhere from two to ten at a time. I’ve got a really bad feeling that whatever spammer has me targeted has a script smart enough to scrape the pages to find the script locations, no matter what they are named.

This — in a word — sucks. Outside of turning comments off entirely for the targeted sites, which really doesn’t thrill me, I’m not sure where to go next.

Guess for now I’ll just have to keep an eye on things and see how they go.

rel="nofollow": Massive weblog anti-spam initiative

Wow. Straight from Jay Allen:

Six Apart has announced in co-operation with Google, Yahoo, MSN Search and other blog vendors a massive joint anti-spam initiative based on the HTML link type rel="nofollow".

The initiative is based upon the idea of taking away the value of user-submitted links in determining search rankings. By placing rel="nofollow" into the hyperlink tags of user-submitted feedback, search engines will ignore those links for the purposes of ranking (e.g. PageRank) and will not follow them when spidering a site.

[…]

It is important to note that while the links will no longer count for PageRank (and other search engines’ algorithms), the content of user-submitted data will still be indexed along with the rest of the contents of the page. Forget all of those silly ideas of hiding your comments from the GoogleBot. Heck, the comments in most blogs are more interesting than the posts themselves. Why would you want to do that to the web?

Now, the astute will point out that because links in comments/TrackBacks are ignored by the search bots, the PageRank of bloggers all around the blooog-o-sphere will suffer because hundreds of thousands of comments linking back to their own sites will no longer count in the rankings. And that is most likely true. But that inflated PageRank, which was a problem created by the search engines themselves, is the rotting flesh that the maggots sought out in the first place. If you ask me, I say fair trade.

In the end, of course, this isn’t the end of weblog spam. But because it completely takes away the incentive for the type of spamming we’re seeing today in the weblog world, you will probably see steady decline as many spammers find greener pastures elsewhere. That decline combined with better tools should help to make this a non-issue in the future. Every little step counts, some count more than others, and history will be the judge of all.

Very cool. Also very similar to a technique I was using a couple years back, though that was geared to blocking off areas of the site to ignore rather than affecting individual links. Either way, though, it’s a big step forward. I’m especially heartened to see the list of competing companies and weblogging systems that are participating in this.
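As a rough illustration of what “placing rel="nofollow" into the hyperlink tags of user-submitted feedback” could look like, here is a small Python sketch. It is not how Movable Type or any other participating system implements it; the function name `add_nofollow` and the regex approach are assumptions for the example, and a real implementation would more likely use a proper HTML parser.

```python
import re

# Opening <a ...> tags only; "\b" keeps tags such as <abbr> from matching.
ANCHOR_OPEN = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# A double-quoted rel attribute inside the tag's attribute string.
REL_ATTR = re.compile(r'\brel\s*=\s*"([^"]*)"', re.IGNORECASE)


def add_nofollow(comment_html):
    """Return comment_html with rel="nofollow" on every link (hypothetical helper)."""

    def rewrite(match):
        attrs = match.group(1)
        rel = REL_ATTR.search(attrs)
        if rel is None:
            # No rel attribute yet: append one.
            return '<a%s rel="nofollow">' % attrs
        values = rel.group(1).split()
        if "nofollow" in (value.lower() for value in values):
            # Already marked: leave the tag exactly as it was.
            return match.group(0)
        # Keep the existing rel values and add nofollow to them.
        values.append("nofollow")
        new_rel = 'rel="%s"' % " ".join(values)
        return "<a%s>" % (attrs[:rel.start()] + new_rel + attrs[rel.end():])

    return ANCHOR_OPEN.sub(rewrite, comment_html)
```

The text of the comment is left alone, which fits the point made above: only the links lose their ranking value, while the content is still indexed.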

Technorati Tags

Change of plans as far as my keywords/tags project goes.

This past week, Technorati introduced a tag search to their weblog-centric search engine. Searching for a particular tag on Technorati returns a result page that aggregates recent weblog posts, Flickr photos, and del.icio.us links from across the web that use the same tag. Very nice.

This works well for me. One of the potential downsides I’d been running into with my prior plan — integrating ishbadiddle’s local keyword search — was simply that I’d gotten very used to the Flickr/del.icio.us method of separating tags with spaces, while the local keyword search required that the tags be separated with commas. As I was starting to work my way through cleaning up the keywords for my entries here, then, I’d been using spaces within keywords on the weblog (for instance, a tag of my name would be “michaelhanscom” or “michael.hanscom” on Flickr or del.icio.us, but be “Michael Hanscom” here on my weblog). I’m anal enough, though, that this bugged me — I’d rather have one consistent tagging methodology across all the systems.

As Technorati also uses the space-separated tag format, and expects multiple words to be ‘smooshed’ together (just as Flickr and del.icio.us do), I’ve decided to use that system for all my tagging, forgoing ishbadiddle’s system (sorry, M E-L! But if your system can be tweaked to read space-delimited lists rather than comma-delimited, I can look back into it again…).

Thanks to George’s TechnoratiTags plugin for Movable Type, I’m now listing tags in the metadata for each post, just underneath the title. The tags are drawn from the (space-separated) keywords for each entry, and clicking on any one of them will take you to that tag’s Technorati search page.
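The idea of turning a space-separated keyword field into tag links can be sketched in a few lines of Python. This is not the plugin’s code (which isn’t shown here); the function name `tag_links` is invented for the example, and the `http://technorati.com/tag/` URL pattern is an assumption about where a tag’s search page lives.

```python
from html import escape
from urllib.parse import quote

# Assumed URL pattern for a tag's search page; adjust if the real one differs.
TAG_SEARCH_BASE = "http://technorati.com/tag/"


def tag_links(keywords):
    """Turn a space-separated keyword string into a comma-separated list of links."""
    links = []
    for tag in keywords.split():
        # Percent-encode the tag for the URL, and HTML-escape what goes in the page.
        url = TAG_SEARCH_BASE + quote(tag, safe="")
        links.append('<a href="%s">%s</a>' % (escape(url), escape(tag)))
    return ", ".join(links)
```

Because the keywords are split on whitespace, a multi-word tag only survives if it is written as one ‘smooshed’ word, which is exactly the convention described above.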

Just another way the web is getting more and more classified. Pretty cool, in my world.

iTunes: “James Brown Is Dead (Wide Awake)” by L.A. Style from the album James Brown Is Dead (1991, 5:25).

Moving to del.icio.us

As I’ve been more and more interested in using tag-based taxonomies to categorize and track things, I’ve been looking more and more often at using del.icio.us as a bookmark manager and potential replacement for my linklog.

Admittedly, when I first looked at del.icio.us a while back, I didn’t really understand what the deal was, or why it was so special. After spending time bouncing around Flickr and finding all sorts of interesting photographic work by exploring the tags people had used to categorize their photos, though, it finally clicked — del.icio.us was using the same concept to classify virtually the entire web. Oh! Now I get it!

So the old linklog has been removed from my sidebar (though the archives still exist), and has been replaced with a list of the most recent fifteen items added to my del.icio.us page. There’s an RSS feed available too, though as I’ll be using FeedBurner’s link splicing ability to add my del.icio.us links to my main RSS feeds (just as I do for my Flickr photos), subscribing to that is definitely optional.

It may be a day or so before the links get spliced in, though — for some reason, FeedBurner keeps telling me that ‘djwudi’ isn’t a valid del.icio.us ID. Funny, del.icio.us thinks it is…I’m going to have to work on that.

Update: FeedBurner tracked down the issue they were having with connecting to del.icio.us, and I’ve updated my feeds. Both the ‘full posts’ and ‘full posts with comments’ feeds have the links spliced in, and the ‘eclinkticism’ feed has been switched over to my del.icio.us links (if you were subscribed to either of my full post feeds and the linklog feed, you’ll be able to delete the linklog feed now). The ‘excerpts only’ feed has been left as-is (it doesn’t include my Flickr photos, either).

Update 2: Well, it seemed like a good idea. However, that was a bit too much all in one feed. Links have been taken back out of the full-post and full-post-with-comments feeds, in favor of leaving them in their own separate feed. I’m also wondering if I should pull my Flickr photos out of the main streams, in favor of making everything mix-and-match. Seems better to let people pick and choose what they want to pay attention to rather than forcing everything on them all at once….

The part I’m happiest about was figuring out a very easy way to integrate my del.icio.us links into my site without having to deal with extra Movable Type plugins, installing extra software, or the like. del.icio.us provides an HTML feed of recent links, so I just set up a simple shell script, then use cron to run it every hour on the hour. Here’s the script in question:

#!/bin/sh

curl -s -f -d rssbutton=no -d tags=no -d extended=body http://del.icio.us/html/djwudi -o /Library/WebServer/Documents/eclecticism/delicious.tmp

mv -f /Library/WebServer/Documents/eclecticism/delicious.tmp /Library/WebServer/Documents/eclecticism/delicious.inc

echo "del.icio.us linklog successfully updated!"

The curl command retrieves the HTML feed of my links and saves it to a file, which mv then renames (this ensures that there won’t be an issue if the file is being updated at the same time that my webserver is expecting to be able to read from it), and echo returns a short message letting me know that the operation concluded successfully (cron e-mails me the confirmation message each time it runs…and I may turn that off soon now that I know everything’s working). Then, anytime someone loads my site, a simple PHP include loads the delicious.inc file into the page. Quick and simple.
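The fetch-then-rename trick described above is worth spelling out, since it is what keeps the webserver from ever reading a half-written file. Here is the same idea as a Python sketch rather than the shell script itself. The function names (`fetch_links`, `publish`, `update_linklog`) are invented for the example, the URL, form fields, and path are copied from the script, and unlike the shell version this one skips the rename when the download fails.

```python
import os
import tempfile
import urllib.parse
import urllib.request

FEED_URL = "http://del.icio.us/html/djwudi"
OPTIONS = {"rssbutton": "no", "tags": "no", "extended": "body"}
TARGET = "/Library/WebServer/Documents/eclecticism/delicious.inc"


def fetch_links(url=FEED_URL, options=OPTIONS):
    """POST the form fields (as curl -d does) and return the response body."""
    body = urllib.parse.urlencode(options).encode("ascii")
    # urlopen raises on HTTP error statuses, much like curl's -f flag fails.
    with urllib.request.urlopen(url, data=body, timeout=30) as response:
        return response.read()


def publish(data, target=TARGET):
    """Write data to a temp file beside target, then rename it into place."""
    directory = os.path.dirname(target) or "."
    handle, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(handle, "wb") as temp_file:
            temp_file.write(data)
        # mkstemp creates the file owner-only; let the webserver read it.
        os.chmod(temp_path, 0o644)
        # Readers see either the old file or the new one, never a partial write.
        os.replace(temp_path, target)
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


def update_linklog():
    """Fetch first, publish second, so a failed download leaves the old file alone."""
    publish(fetch_links())
    print("del.icio.us linklog successfully updated!")
```

Calling `update_linklog()` from an hourly cron job would play the same role as the shell script. The temp file is created in the target’s own directory because the rename is only a single-step swap when both paths are on the same filesystem.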

Oh, and the name of the shell script?

deli.sh

iTunes: “867-5309 Jenny (Hot Tracks)” by Tutone, Tommy from the album Edge, The Level 2 (1995, 5:31).

Malicious Software Removal

Sure, I knew Microsoft was evil, but I never expected them to actually brag about it. Today brings the release of their Malicious Software Removal Tool, though, so I guess I was wrong.

I’m curious just who they expect to be excited about this announcement. Malicious software removal? It’s bad enough that so much of their software is fairly malicious in standard day to day operations, but now they’re actively promoting a product that, judging by its name, will gleefully and with great gusto go rampaging through your computer, removing the most useful pieces of software it can find?

What hubris! What unmitigated gall!

 

What?

It’s a tool to remove malicious software?

Oh. Well, that’s different.

(via /.)


Who are you?

So.

You’re the head of a highly secretive company.

You’re known for being temperamental and very mysterious.

The goods your company produces are highly popular, but they’re developed in secret.

When they’re introduced, they’re invariably accompanied by much anticipation, a media blitz, and fans worldwide salivating over the newest products.

Who are you?


6 year old webserver

While talking with Prairie about how Macs generally tend to have long lifespans, I looked up the original introduction date for the 350MHz Blue and White G3 that acts as the webserver for my site, and found out that it was originally introduced on January 5th, 1999.

That’s just a few days over six years that this machine has been around, and it’s been running pretty much 24 hours a day, 7 days a week for the four years that I’ve owned it (I bought it used in February of ’01). It’s still going strong, too, chugging right along day after day.

Not bad…not bad at all.

_Wow_, I’m a dork. I really don’t know how I managed to confuse ’99 and ’95 when I was looking up the date, but apparently I did. Thanks to Dan for pointing that out.

Six years still isn’t bad, though…

iTunes: “Walking on the Sun (Geek In Highwaters)” by Smashmouth from the album DJ Goodies (1995, 6:12).

Mac mini (and more)

Only the sketchiest details so far, ganked from MacRumors’ live update page, but…

Mac Mini

  • Mac Mini
  • very tiny
  • quiet, fw, usb2, video out, ethernet – very very tiny
  • pizza box style
  • analog, digital video out
  • comes with Panther & iLife 05
  • half as high as an iPod Mini, surface of a little dish
  • coming 1st half of 05
  • 1.25 Ghz G4
  • another at 1.4 Ghz
  • most important new mac ‘ever’
  • available Jan 22
  • prices for mac mini: $499 and $599
  • $499 with 1.25 G4, 256, 40 gig, Combo

Update:

Okay, everything’s done, and Apple’s website has been updated.

Funny: On the iPod shuffle page, there’s a picture of the iPod shuffle next to a pack of gum, with a caption that reads “Smaller than a pack of gum and much more fun.[2]” When you go to footnote 2 at the bottom of the page:

  2. Do not eat iPod shuffle.

Funny: On the Mac mini Design page is the notation, “Keyboard, iPod mini, dock, hands, AirPort, Bluetooth and PC sold separately.”