{"id":49154,"date":"2023-04-19T11:36:18","date_gmt":"2023-04-19T18:36:18","guid":{"rendered":"https:\/\/michaelhans.com\/eclecticism\/?p=49154"},"modified":"2023-04-19T11:36:18","modified_gmt":"2023-04-19T18:36:18","slug":"im-training-ai-chat-bots-non-consensually","status":"publish","type":"post","link":"https:\/\/michaelhans.com\/eclecticism\/2023\/04\/19\/im-training-ai-chat-bots-non-consensually\/","title":{"rendered":"I&#8217;m Training AI Chat Bots (Non-Consensually)"},"content":{"rendered":"<div class='__iawmlf-post-loop-links' style='display:none;' data-iawmlf-post-links='[{&quot;id&quot;:1256,&quot;href&quot;:&quot;https:\\\/\\\/www.washingtonpost.com\\\/technology\\\/interactive\\\/2023\\\/ai-chatbot-learning&quot;,&quot;archived_href&quot;:&quot;&quot;,&quot;redirect_href&quot;:&quot;&quot;,&quot;checks&quot;:[],&quot;broken&quot;:false,&quot;last_checked&quot;:null,&quot;process&quot;:&quot;done&quot;}]'><\/div>\n<p>The Washington Post has published <a href=\"https:\/\/www.washingtonpost.com\/technology\/interactive\/2023\/ai-chatbot-learning\/\">an article looking at the websites<\/a> that make up &#8220;Google\u2019s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google\u2019s T5 and Facebook\u2019s LLaMA.&#8221; If you scroll down far enough, there&#8217;s a section titled &#8220;Is your website training AI?&#8221; that lets you drop in a URL to see if it was scraped and included in the data set.<\/p>\n<p>I checked three strings &#8212; &#8220;michaelhans&#8221; (to cover both this site and its prior address at michaelhanscom.com), &#8220;djwudi&#8221; (for my DJ&#8217;ing blog), and &#8220;norwescon&#8221; (for which I&#8217;ve written, or tweaked and edited, much of the content). 
All three of them are represented.<\/p>\n<ul>\n<li>norwescon.org: 45k tokens, 0.00003% of all tokens, rank 528,147<\/li>\n<li>michaelhanscom.com: 37k tokens, 0.00002% of all tokens, rank 635,948<\/li>\n<li>djwudi.com: 3.7k tokens, 0.000002% of all tokens, rank 4,002,025<\/li>\n<\/ul>\n<p>For the record, I&#8217;m not terribly excited about this. I&#8217;m also under no illusion that anything can be done; this stuff is all out on the open web, and as it&#8217;s free for actual people to browse through and read, it&#8217;s also free for bots to scrape and ingest into whatever databases they keep. Sometimes this is a good thing, for projects like the Internet Archive. Sometimes it&#8217;s unwittingly helping to train our new AI overlords.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Sometimes scraping the open web is a good thing, for projects like the Internet Archive. Sometimes it&#8217;s unwittingly helping to train our new AI overlords.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2040],"tags":[1764,132],"class_list":["post-49154","post","type-post","status-publish","format-standard","hentry","category-blog","tag-ai","tag-google"],"_links":{"self":[{"href":"https:\/\/michaelhans.com\/eclecticism\/wp-json\/wp\/v2\/posts\/49154","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/michaelhans.com\/eclecticism\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michaelhans.com\/eclecticism\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michaelhans.com\/eclecticism\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/michaelhans.com\/eclecticism\/wp-json\/wp\/v2\/comments?post=49154"}],"version-history":[{"count":0,"href":"https:\/\/michaelhans.com\/eclecticism\/wp-json\/wp\/v2\/posts\/49154\/revisions"}],"wp:attachment":[{"href":"https:\/\/mic
haelhans.com\/eclecticism\/wp-json\/wp\/v2\/media?parent=49154"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michaelhans.com\/eclecticism\/wp-json\/wp\/v2\/categories?post=49154"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michaelhans.com\/eclecticism\/wp-json\/wp\/v2\/tags?post=49154"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}