Too Americanized?

Pentagon officials say Americanizing Iraq is difficult because Iraqis have had little to no reliable information for the past 35 years, and have lived on a diet of innuendo, rumor, conspiracy theories, fear, and propaganda. Sounds like the problem is they’re too Americanized.

Bill Maher (No permalink, July 29th entry)

It looks to me like Bill’s weblog is using Movable Type. Who can we contact to at least get him (or his webmaster) to turn on permalinks?

Help wanted: Apache/PHP

I’m planning on sticking with TypePad as my weblog host once everything opens up officially (tomorrow, from the looks of it). However, this poses a bit of a problem. While I’m slowly moving all of my old posts from my old weblog to this new site, there are still lots of links scattered throughout the ‘net that point to the old addresses.

I think I know of a solution; however, I’m not well enough versed in the intricacies of Apache and PHP to pull it off on my own. So, I’m asking for help!

Here’s what I’d like to do…

All of my old posts reside at my personal server at http://www.djwudi.com/longletter/. It’s a Mac OS X computer running Apache, with PHP enabled.

I know that Apache can handle redirects, based on rules set up in the httpd.conf file. I also know that pattern matching and text string munging can be carried out in PHP.

All of my old individual entry pages are stored in my webserver with the following directory structure:

http://www.djwudi.com/longletter/archives/year/month/day/dirified_post_title.php
http://www.djwudi.com/longletter/archives/2003/07/31/help_wanted_apache_php.php

All of the pages on this new site are stored using a similar, but slightly different directory structure:

http://djwudi.typepad.com/eclecticism/year/month/truncated_title.html
http://djwudi.typepad.com/eclecticism/2003/07/help_wanted_apa.html

What I’m envisioning for the final system is this:

  • Anytime my webserver receives a request for a page that resides within the ‘/longletter/archives/’ directory, Apache redirects to a customised PHP script on my server.
  • That script does three things:
    1. Presents a simple page to the user with wording to the effect of “This site has moved, one moment while we redirect you…”.
    2. Looks at the requested URI and converts it to what the new URI should be. As I’ve kept post titles consistent, and the directory structures are similar, this should be fairly easy with the right regular expressions.
      1. Parse the requested URI.
      2. Remove everything before the 4-digit year and replace it with the new base address.
      3. Remove the 2-digit day.
      4. Truncate the post title to fifteen characters.
      5. Remove the .php extension and replace it with .html.
    3. Redirects the user’s browser to the new, correct URI.
  • Hey presto, we’re done — no matter which page was linked to at my old site, the user has been redirected to the corresponding page at my new site.
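The URI conversion described in the steps above can be sketched in a few lines. This is written in Python purely to illustrate the pattern matching, not as the PHP script itself; the same regular expression should port to PHP's `preg_match`. The names `OLD_POST_PATTERN`, `NEW_BASE`, and `convert_old_uri` are made up for this example, and it assumes the new title is always exactly the first fifteen characters of the old one.

```python
import re

# Hypothetical names; the URL layouts are the ones shown in the post.
OLD_POST_PATTERN = re.compile(
    r"/longletter/archives/(?P<year>\d{4})/(?P<month>\d{2})/\d{2}/(?P<title>[^/]+)\.php$"
)
NEW_BASE = "http://djwudi.typepad.com/eclecticism"


def convert_old_uri(old_uri):
    """Return the new-site URI for an old individual-entry URI, or None."""
    match = OLD_POST_PATTERN.search(old_uri)
    if match is None:
        return None  # not an individual entry page
    # Drop the day, keep year and month, truncate the title to 15 characters,
    # and swap the .php extension for .html.
    title = match.group("title")[:15]
    return f"{NEW_BASE}/{match.group('year')}/{match.group('month')}/{title}.html"


old = "http://www.djwudi.com/longletter/archives/2003/07/31/help_wanted_apache_php.php"
print(convert_old_uri(old))
```

Run against the sample address above, this produces the `help_wanted_apa.html` address on the new site.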

More brainstorming:

  • The above method works well for links going to individual pages, but what about category archives or the main index page itself?
  • Could the PHP script be made smarter? For instance…
    1. If the requested URI contains the year/month/day/title.php string, then the above transformation and redirect is processed.
    2. If the requested URI contains any other string (in other words, it doesn’t point to a specific post), then a page is presented that says something along the lines of “This site has moved, one moment while we redirect you to the new site…”, and a redirect is passed to the user’s browser that points to the index page of the new weblog.
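As a rough sketch of that "smarter" behaviour, here is the decision logic in Python (again only an illustration; the real thing would be PHP behind an Apache redirect). The function names are invented for the example, and the address of the new index page is assumed to be the new base address with a trailing slash. The "one moment" page uses a standard meta refresh, which is one possible way to do the redirect, not necessarily the best one.

```python
import html
import re

# Hypothetical names; the new index URL is an assumption based on the base address.
POST_PATTERN = re.compile(r"/longletter/archives/(\d{4})/(\d{2})/\d{2}/([^/]+)\.php$")
NEW_BASE = "http://djwudi.typepad.com/eclecticism"


def choose_redirect(requested_uri):
    """Pick the new address: the matching post if there is one, else the index."""
    match = POST_PATTERN.search(requested_uri)
    if match:
        year, month, title = match.groups()
        return f"{NEW_BASE}/{year}/{month}/{title[:15]}.html"
    return f"{NEW_BASE}/"  # category archives, main index, anything else


def interstitial_page(target):
    """Build a tiny 'this site has moved' page that refreshes to the target."""
    safe = html.escape(target, quote=True)
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="2;url={safe}" />'
        "</head><body>"
        f'<p>This site has moved, one moment while we redirect you to <a href="{safe}">{safe}</a>.</p>'
        "</body></html>"
    )


print(interstitial_page(choose_redirect("/longletter/archives/cat_politics.php")))
```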

Anyway, that’s what I’d like to do. It all seems straightforward enough in my brain, and I think that the technology I have available should be able to handle it all without a problem — I just don’t have the faintest idea how to code it.

Any and all advice, hints, tips, or straight-up solutions would be greatly appreciated. I’m not rich enough to offer untold wealth or cool prizes or anything, but I can offer much gratitude, public thanks and kudos, and probably pizza and beer (or a PayPal donation to a ‘pizza and beer’ fund, or some such thing).

And you won’t even have to fight me for the beer — I can’t stand the stuff. ;)

Help search engines index your site

We all know that Google is god. Chances are you’ve used Google when doing a search on the ‘net at least once, if not daily, or many times a day. If not, then I’ve heard rumors that there are other search engines out there — though I haven’t used any in so long, I can’t really vouch for the veracity of that rumor. ;)

I wanted to share a few tricks I use here to help Google (and other search engines) index my site, and to try to ensure that searches that hit my site get the most useful results.

All of the following tips and tricks do require access to your source HTML templates (in TypePad, you’ll need to be using an Advanced Template Set). While I’m writing this for an Advanced TypePad installation, the tips will work just as well in any other website or weblog application where you have access to the HTML code.

Specify which pages get indexed, and which don’t

What? One of the most important pages on a weblog from a user’s point of view is the main page. It has all your latest posts, all the links to your archives, your bio, other sites you enjoy reading, webrings, and who all knows what else. However, from the perspective of a search engine, the main page of a weblog is most likely the single least important page of the entire site!

This is simply because the main page of a weblog is always changing, but search engines can only give good results when the information that they index is still there the next time around. I’ve run into quite a few situations where I’ve done a search for one term or another, and one of the search results leads to someone’s weblog. Unfortunately, when I go to their page, the entry that Google read and indexed is no longer on the main page. At that point, I could start digging through their archives and trying to track down what I’m looking for — but I’m far more likely to just bounce back to Google and try another page.

Thankfully enough, though, there’s an extremely easy fix for this that keeps everyone happy.

How? One short line of code at the top of some of your templates is all it takes to solve the problem. We’re going to be using the robots meta tag in the head of the HTML document. The tag was designed specifically to give robots (or spiders, or crawlers — the automated programs that search engines use to read websites) instructions on what pages should or shouldn’t be indexed.

For the purposes of a weblog, with one constantly changing index page and many static archive pages, the best possible situation would be to tell the search engine to read and follow all the links on an index page (so that it finds all the other pages of a site), but not to index that page. The rest of the site, it will be free to read and index normally.

That’s very easy to set up, as it turns out. The robots meta tag allows four possible arguments:

INDEX
Read and index a page normally
NOINDEX
Do not index any of the text of the page
FOLLOW
Follow all the links on a page to read linked pages
NOFOLLOW
Ignore all links on a page

So, in order to do what we want, we add the following meta tag to our document, in the head section, right next to the meta tags that are already there:

<meta name="robots" content="noindex,follow" />

Now, when a search engine robot visits the index page of the site, it knows that it should not index the page and add it to its database, but that it should follow any links on that page to find other pages within the site. This way, searches that return hits for the site will lead to your archive pages, which hold the requested information, rather than your front page, which may not have the information anymore.
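To make the four arguments concrete, here is a small Python sketch of how a robot might read the tag's content value. It is only an illustration of the logic, not how any particular search engine works; `parse_robots` is a made-up name, and it assumes that a page with no instructions is treated as "index, follow".

```python
def parse_robots(content):
    """Turn a robots meta content string into index/follow flags."""
    # Assumed default: index the page and follow its links.
    directives = {"index": True, "follow": True}
    for token in content.lower().split(","):
        token = token.strip()
        if token in ("index", "follow"):
            directives[token] = True
        elif token in ("noindex", "nofollow"):
            directives[token[2:]] = False  # strip the leading "no"
    return directives


# The index page: do not index it, but do follow its links.
print(parse_robots("noindex,follow"))
```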

Update: It turns out that this technique may have some side effects that I hadn’t considered, and might possibly not work at all. For more details, please scroll down to Anode’s comment and my reply in the comment thread for this post. Hopefully I’ll be able to dig up more information on this soon.

Fine tune what sections of a page get indexed

What? There is a proposed extension to the robots meta tag that allows you to not just designate which pages of a site get indexed, but also which sections of a page get indexed. I discovered this when I was setting up a shareware search engine for my old website, and have since gotten in the habit of using it. Now, this is not a formal standard, and I don’t know for sure which search engines support it and which don’t — the creator of this technique has suggested it to the major search sites, but it is not known what the final result was.

Now, why would you want to do this? Simply this: on many weblogs, including TypePad sites, the sidebar information is repeated on every page of the site. There is also certain informational text repeated on every page (for instance, the TrackBack data, the comments form, and so on). This creates a lot of extraneous, mostly useless data — doubly so when that information changes regularly.

By using these proposed tags, any search engine that supports them will only index the sections of a page that we want indexed, and will disregard the rest of the page.

How? Because this is based on the robots meta tag discussed above, it uses the same four arguments (INDEX, NOINDEX, FOLLOW, and NOFOLLOW). Instead of using a meta tag, though, we use HTML comment syntax to designate the different sections of our document.

For instance, every individual archive page on a TypePad weblog that has TrackBack enabled will have the following text (or something very similar):

Trackback
TrackBack URL for this entry:
http://www.typepad.com/t/trackback/(number)

Listed below are links to weblogs that reference (the name of the post)

In order to mark this out as a section that we wanted the search engine not to index and not to follow (as the only link is to the page that the link is on), we would surround it with the following specialized tags:

<!-- robots content="noindex,nofollow" -->
<!-- /robots -->

For example, I would change the code in the TypePad Individual Entry template to look like this:

<mtentryIfAllowPings>
<!-- robots content="noindex,nofollow" -->
<h2><a id="trackback"></a>TrackBack</h2>
TrackBack URL for this entry:<br /><$MTEntryTrackbackLink$>
Listed below are links to weblogs that reference <a href="<$MTEntryPermalink$>"><$MTEntryTitle$></a>:
<!-- /robots -->
<mtpings>

The same technique can be used wherever you have areas in your site with content that doesn’t really need to be indexed.
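For the curious, here is roughly what a search engine that honoured these comments might do with them, sketched in Python. This is purely hypothetical: `indexable_text` and the regular expression are my own invention for illustration, and it only handles the "noindex" half of the instruction.

```python
import re

# Matches an opening robots comment, the section it wraps, and the closing comment.
ROBOTS_SECTION = re.compile(
    r'<!--\s*robots\s+content="([^"]*)"\s*-->(.*?)<!--\s*/robots\s*-->',
    re.DOTALL | re.IGNORECASE,
)


def indexable_text(page_html):
    """Drop sections marked noindex; keep the contents of any other section."""
    def keep_or_drop(match):
        directives = [d.strip().lower() for d in match.group(1).split(",")]
        return "" if "noindex" in directives else match.group(2)

    return ROBOTS_SECTION.sub(keep_or_drop, page_html)


page = (
    "<p>The post itself.</p>"
    '<!-- robots content="noindex,nofollow" -->'
    "<h2>TrackBack</h2>TrackBack URL for this entry:"
    "<!-- /robots -->"
    "<p>Comments.</p>"
)
print(indexable_text(page))
```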

Now, as I stated above, this is only a proposed specification, and it is not known which (if any) search engines support it. It also requires a healthy chunk of mucking around with your template code. Because of these two factors, it may not be an approach that you want to take, instead simply using the “sledgehammer” approach of the page-level robots meta tag discussed above.

However, I do think that the possible benefits of this being used more widely would be worth the extra time and trouble (at least, for those of us obsessive about our code), and I’d also suggest that should TypePad gain a search functionality, that these codes be recognized and followed by the (purely theoretical, at this point) TypePad search engine.

Put the entry excerpt to use

What? The entry excerpt is another very handy field to use in fine tuning your site. I believe that the field is turned off on the post editing screen by default, but it can be enabled by clicking on the ‘Customize the display of this page’ link at the bottom of the post editing screen.

By default, the entry excerpt is used for two things in TypePad: when you send a TrackBack ping to another weblog, the excerpt is sent along with the ping as a short summary of your post; and it is used as the post summary in your RSS feed if you have selected the ‘excerpts only’ version of the feed in your weblog configuration. It can come in handy in a few other places too. One that I’ve discussed previously is in your archive pages; another is helping out search engines.

You may have noticed that when you do a search on Google, rather than simply returning the link and page title, Google also returns a short snippet of each page that the search finds. Normally, this snippet is just a bit of text from the page being referenced, intended to give you enough context to judge how successful your search was. There is a meta tag that lets us determine exactly what text is displayed by Google for the summary, though, which is where the entry excerpt field comes in.

How? We’re adding another meta tag here, so this will go up in the head section of your Individual Archives template. Next to any other meta tags you have, add the following line:

<meta name="description" content="<$MTEntryExcerpt$>" />

Then save, and republish your Individual Archives, and you’re done. Now, the next time that Google indexes your site, the excerpt will be saved as the summary for that page, and will display beneath the link when one of your pages comes up in a Google search.

So what happens if you don’t use the entry excerpt field? Well, TypePad is smart enough to do its best to cover for this — if you use the <$MTEntryExcerpt$> tag in a template, and no excerpt has been added to the post, TypePad automatically pulls the first 20 words of your post to be the excerpt. While this works to a certain extent, it doesn’t create a very useful excerpt (unless you’re in the habit of writing extremely short posts). It’s far better to take a moment to create an excerpt by hand, whether it’s a quick cut and paste of relevant text in the post, or whether it’s more detailed (“In which we find out that…yadda yadda yadda.”). In the end, of course, it’s your call!
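To see why the automatic version tends to be unhelpful, here is a Python sketch of a "first 20 words" fallback and of the resulting description tag. This is not TypePad's actual code; the function names are invented, and details such as how TypePad handles HTML inside the post or quotation marks in the excerpt are unknown to me. The escaping step is simply what you would want in hand-built HTML so that a quotation mark in the excerpt cannot end the attribute early.

```python
import html


def fallback_excerpt(post_text, word_limit=20):
    """Approximate an automatic excerpt: the first `word_limit` words of the post."""
    return " ".join(post_text.split()[:word_limit])


def description_meta(excerpt):
    """Build a description meta tag, escaping quotes in the excerpt."""
    return f'<meta name="description" content="{html.escape(excerpt, quote=True)}" />'


post = (
    "Earlier this evening, I got an e-mail from Pops asking me how I created "
    "the little tooltip-style comment text that appears when you hover over "
    "links in my posts."
)
print(description_meta(fallback_excerpt(post)))
```

The automatic excerpt stops mid-sentence, which is exactly the problem with relying on it.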

Use the Keywords

What? Keywords are short, simple terms that are either used in a page, or relate to the page. The original intent was to place a line in the head of an HTML page that listed keywords for that page, which search engines could read in addition to the page content to help in indexing.

Unfortunately, keywords have been heavily abused over the years. ‘Search Engine Optimizers’ started putting everything including the kitchen sink into their HTML pages for keywords in an effort to drive their pages’ rankings higher in the search engines. Because of this, some of the major search engines (Google included) now disregard the ‘keywords’ meta tag. However, not all of them do, and used correctly, keywords can be a helpful additional resource for categorizing and indexing pages.

How? One of the various fields you can use for data in each TypePad post is the ‘Keywords’ field. I believe that it is turned off by default; however, you can enable it by clicking on the ‘Customize the display of this page’ link at the bottom of your TypePad ‘Post an Entry’ screen.

Once you have the ‘Keywords’ field available, you can add specific keywords for each post. You can either use words that actually appear in the post, or words that relate closely to it. For instance, I’ve had posts where I’ve used the acronym WMD in the body of the post, then added the three keywords ‘weapons mass destruction’ to the keywords field. You never know exactly what terms someone will use in their search, so you might as well give them the best shot at success, right?

Okay, so now you have keywords in your posts. What now? By default, TypePad’s templates don’t actually use the data in the Keywords field at all. This is fairly easy to fix, however.

In your Individual Archives template, add the following line of code just after the meta tags that are already there:

<meta name="keywords" content="<$MTEntryKeywords$>" />

Then save your template, republish your site (you can republish everything, but doing just the Individual Archives is fine, too, as that’s all that changed), and you’re done! Now, the next time that a search engine that reads the keywords meta tag reads your site, you’ve got that much more information on every individual post to help index your site correctly.

Conclusion

So there we have it. One extremely long post from me, with four hopefully handy tips for you on how you can help Google, and the rest of the search engines out there, index your site more intelligently. If you find this information of use, wonderful! If not…well, I hope you didn’t waste too much of your day reading it. ;)

Feel free to leave any questions, comments, or words of wisdom in the comments below!

Our friend, the humble 'title' attribute

Earlier this evening, I got an e-mail from Pops asking me how I created the little tooltip-style comment text that appears when you hover over links in my posts. I ended up giving him what was probably far more information than he was expecting, but I also figured that it was information worth posting here, on the off chance it might help someone else out.

It’s actually a really easy trick, though not one built into TypePad. Simply add a title declaration to the link itself. For instance, if I wanted the text “Three martinis and a cloud of dust” to appear when someone hovered over a link to Pops’ site, I’d code it like this:

<a href="http://2hrlunch.typepad.com/" title="Three martinis and a cloud of dust">Two Hour Lunch</a>

The end result looks like this (hover over the link to see the title attribute in action):

Two Hour Lunch

That little title attribute comes in wonderfully handy, too, as it can be applied to just about any HTML tag there is.

For instance, good HTML coding includes alt text for all images, so that if someone has image loading turned off in their browser, or if the image fails to load for any other reason, there will be some descriptive text to tell them what gorgeous vistas they are missing. However, in most browsers the only time that text shows is if the image doesn’t load. Using the title attribute in addition to the alt attribute when adding images, we can create that same style of comment when someone hovers over the image. For example:

<img src="lalala.gif" width="360" height="252" alt="NOTICE: I'm not listening!" title="La la la la la la!" />

That way, when displayed in the browser, if the image didn’t load, the text ‘NOTICE: I’m not listening!’ would show instead. In addition, the text ‘La la la la la la!’ will appear if someone lets their cursor pass over the image. Not a necessary thing, but it can be fun for quick, pithy little comments. Here’s the example:

NOTICE: I'm not listening!

Another place I use title attributes fairly regularly is when I make changes to a post after it’s first posted. HTML includes two tags (<ins> and <del>, for insert and delete, respectively) for marking up changes to text. When I go back in to edit a post after it first appears on my site, I use those tags with a title attribute to indicate when the change was made.

For example, suppose I posted the following:

Pops is a screaming loony, who shouldn’t be allowed within twenty yards of anyone who isn’t equipped with body armor and a machete.

Later, coming to my senses, I could change that like this:

Pops is a <del title="7/30/03 10pm: I think I was on drugs when I wrote this.">screaming loony, who shouldn't be allowed within twenty yards of anyone who isn't equipped with body armor and a machete</del> <ins title="Here's what I meant to say...">great guy, whose website has pointed me to some fascinating tidbits on a regular basis</ins>.

(I hope Pops doesn’t mind the sample text here.) ;)

On screen, after the update, the deleted text would display as struck through, and the inserted text would display underlined (standard editing notation), with the comments displaying on a cursor hover, like this:

Pops is a screaming loony, who shouldn’t be allowed within twenty yards of anyone who isn’t equipped with body armor and a machete great guy, whose website has pointed me to some fascinating tidbits on a regular basis.
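If you make this kind of edit often, the markup is easy to generate. Here is a small Python sketch (the helper name `mark_edit` is made up for this example) that wraps the old and new text in <del> and <ins> tags with title attributes, escaping the notes so that a stray quotation mark cannot break the attribute:

```python
import html


def mark_edit(deleted, inserted, deleted_note, inserted_note):
    """Wrap old and new text in <del>/<ins> tags carrying title attributes."""
    return (
        f'<del title="{html.escape(deleted_note, quote=True)}">'
        f"{html.escape(deleted, quote=False)}</del> "
        f'<ins title="{html.escape(inserted_note, quote=True)}">'
        f"{html.escape(inserted, quote=False)}</ins>"
    )


print(mark_edit(
    "screaming loony",
    "great guy",
    "7/30/03 10pm: I think I was on drugs when I wrote this.",
    "Here's what I meant to say...",
))
```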

So there ya go — more information on the humble little ‘title’ attribute than you probably ever wanted or needed to know. I hope it helps!

Update: (See? There’s a title attribute right there!) As of this writing, the title attribute is barely supported in Apple’s new web browser, Safari. Titles on links will appear in the status bar at the bottom of the window if the status bar is turned on, but that’s it. No other title text will be visible. I’m hoping that this is fixed in a later update to Safari, but for the moment, that’s what we have to work with.

BuyMusic ripping off artists

Another good reason to avoid the BuyMusic service: apparently, through snapping up the back catalog of a very shady distributor, much of the BuyMusic catalog comes from independent artists who are not receiving any compensation for their work.

This from recording artist Jody Whitesides:

I did a search for one of my old CD’s that will be going onto iTunes and It turns out my CD was there on BuyMusic.com. As were the CD’s of several other bands that I’m friends with. All of whom were not contacted about being placed for sale there.

Here’s what I’ve deduced… BuyMusic.com (which I will refer to as BM) got their “vast” music library of 300,000 plus songs from a company called the Orchard. The Orchard is a distribution company that has consistently shafted artists by not paying them for CD’s sold nor returning unsold CD’s or cancelling contracts. So, without the express consent of what is likely lots of the Orchards catalog, BM has put it up for sale at the bargain price of $.79 a song.

So now, they can tout they’re selling tracks at $.79 and they can say they have a library of music of over 300,000 songs. But what they don’t tell you is that it comes from musicians/bands that were not asked for permission, and who will likely not see a penny of any sale made through BM. By their very own site policy they are committing copyright infringement. They have done this to lure PC/windows users to their site in hopes to sell the few major label acquired songs they do have, at a price that is much higher than Apple’s $.99.

I’m currently looking into legal means to have my music removed from their site and strongly encourage users to not browse BM’s site nor purchase from it.

(via MacSlash)