The Washington Post has published an article looking at the websites used to train “Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA.” If you scroll down far enough, there’s a section titled “Is your website training AI?” that lets you drop in a URL to see if it was scraped and included in the data set.
I checked three strings: “michaelhans” (to cover both this site and its prior address at michaelhanscom.com), “djwudi” (for my DJing blog), and “norwescon” (for norwescon.org, where I’ve written or edited much of the content). All three are represented.
- norwescon.org: 45k tokens, 0.00003% of all tokens, rank 528,147
- michaelhanscom.com: 37k tokens, 0.00002% of all tokens, rank 635,948
- djwudi.com: 3.7k tokens, 0.000002% of all tokens, rank 4,002,025
For the record, I’m not terribly excited about this. I’m also under no illusion that anything can be done; this stuff is all out on the open web, and since it’s free for actual people to browse and read, it’s also free for bots to scrape and ingest into whatever databases they keep. Sometimes that’s a good thing, as with projects like the Internet Archive. Sometimes it’s unwittingly helping to train our new AI overlords.