
LLM training data, or, a broken virtuous cycle

Have you used ChatGPT?

I have, and it’s amazing.

From the whimsical (I asked it to write a sonnet about SAML and OIDC) to the helpful (I asked for an example of a script with async calls using TypeScript), it’s impressive.
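
For the curious, here’s a minimal sketch of the kind of script I asked for. This is my own reconstruction, not ChatGPT’s actual output, and the endpoints are hypothetical:

    // Fetch JSON from two (hypothetical) endpoints concurrently using async/await.
    async function fetchJson<T>(url: string): Promise<T> {
      const response = await fetch(url);
      if (!response.ok) {
        throw new Error(`Request failed: ${response.status}`);
      }
      return response.json() as Promise<T>;
    }

    async function main(): Promise<void> {
      // Fire both requests at once, then wait for both to resolve.
      const [users, posts] = await Promise.all([
        fetchJson<unknown[]>("https://example.com/api/users"),
        fetchJson<unknown[]>("https://example.com/api/posts"),
      ]);
      console.log(`Fetched ${users.length} users and ${posts.length} posts`);
    }

    main().catch(console.error);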

However, one issue I haven’t seen mentioned is training data. Right now, there are sets of training data used to teach these large language models (LLMs) what is correct about the world and what is not. This podcast is a great intro to the model families and training process.

But where does that training data come from? I mentioned that question here, but the answer is that humans provide it. Human effort and knowledge are gathered on Reddit, Wikipedia, and other places.

Why did humans spend so much time and effort publishing that knowledge? Lots of reasons, but some include:

  • Making money (establishing yourself as an expert can lead to jobs and consulting)
  • Helping other humans (feels good)
  • Internet points (also feels good)

In each case, the contributing human is acknowledged in some way. Maybe not by the end user, who doesn’t, for example, read through a Wikipedia article’s edit history. But someone knows. Wikipedia editors know and celebrate each other. Here’s a list of folks who have edited that site for a decade or more.

What about search engines? Google reifies knowledge in a manner similar to ChatGPT. But, answer cards notwithstanding, Google offers a reputational reward to publishers. It may come as money (AdWords) or as site authority. Tools like Ahrefs help you understand that authority, and I can tell you as a devrel that high search engine ranking is valuable.

ChatGPT offers none of that, at least not out of the box. You can ask for links to sources, but the end user must choose to do so. I doubt most do, and, in my limited experience, the links are often broken or made up.

This fact breaks the fundamental virtuous cycle of internet knowledge sharing.

Before, with search engines:

  • Publisher/author writes good stuff
  • Search engine discovers it
  • User reads/enjoys/learns from it on the publisher’s site
  • Publisher/author gains value, so publishes more
  • Search engine “sees” people are enjoying the publisher’s content, so promotes it
  • More users read it
  • Back to step one

After, with LLMs:

  • Publisher writes good stuff
  • LLM trains on it
  • User reads/enjoys/learns from it via ChatGPT
  • … crickets …

The feedback loop is broken.

Now, some say that the feedback loop is already broken because Google over-optimized AdWords. Content farms, SEO-focused garbage sites, and tricks to rank are hard to stomach, but they do make money from Google’s traffic. This is especially acute with products and product reviews because the path to monetization is so clear: end users are looking to buy, and being on page 1 will result in money. I agree with this critique; I’m not sure the current knowledge sharing experience is optimal, but humans have been working around Google’s limitations.

More human labor helps with this. I’ve seen this happen in two ways, especially around products.

  • Social media, where searchers rely on curation from experts. Here end users aren’t searching so much as browsing a curated subset of answers.
  • Reddit, where searchers rely on the moderators and groups of redditors to keep spam out of the system. Who among us hasn’t searched for “<product name> review reddit” to avoid trash SEO sites? This also works with other sites like Stack Overflow (for programming expertise).

In contrast, the knowledge product disintermediation of ChatGPT is complete. I’ll never know who helped me with TypeScript. Perhaps I can’t know, because it was one million little pieces of data all broken up and coalesced by the magic of matrix algebra.

This will work fine for now, because a large corpus of training data is out there and available. But will it work forever? I dunno. The cycle has been broken, and we will eventually feel the effects.

In the short term, I predict that within the next three months, there will be a Creative Commons-style license which prohibits the use of published content for training LLMs.