LLM training data, or, a broken virtuous cycle

Have you used ChatGPT?

I have, and it’s amazing.

From the whimsical (I asked it to write a sonnet about SAML and OIDC) to the helpful (I asked for an example of a script with async calls using Typescript), it’s impressive.

However, one issue I haven’t seen mentioned is training data. Right now, there are sets of training data used to teach these large language models (LLMs) what is correct about the world and what is not. This podcast is a great intro to the model families and training process.

But where does that training data come from? I mentioned that question here, but the answer is that humans provide it. Human effort and knowledge are gathered on Reddit, Wikipedia, and other places.

Why did humans spend so much time and effort publishing that knowledge? Lots of reasons, but some include:

  • Making money (establishing yourself as an expert can lead to jobs and consulting)
  • Helping other humans (feels good)
  • Internet points (also feels good)

In each case, the human contributing is acknowledged in some way. Maybe not by the end user who doesn’t, for example, read through a Wikipedia page’s edit history. But someone knows. Wikipedia editors know and celebrate each other. Here’s a list of folks who have edited that site for a decade or more.

What about search engines? Google reifies knowledge in a manner similar to ChatGPT. But, cards notwithstanding, Google offers a reputational reward to publishers. It may be in money (AdWords) or site authority. Other applications like Ahrefs help you understand that authority, and I can tell you, as a devrel, that a high search engine ranking is valuable.

ChatGPT offers none of that, at least not out of the box. You can ask for links to sources, but the end user must choose to do so. I doubt most do, and, in my minimal experience, the links are often broken or made up.

This fact breaks the fundamental virtuous cycle of internet knowledge sharing.

Before, with search engines:

  • Publisher/author writes good stuff
  • Search engine discovers it
  • User reads/enjoys/learns from it on the publisher’s site
  • Publisher/author gains value, so publishes more
  • Search engine “sees” people are enjoying publisher, so promotes it
  • More users read it
  • Back to step one

After, with LLMs:

  • Publisher writes good stuff
  • LLM trains on it
  • User reads/enjoys/learns from it via ChatGPT
  • … crickets …

The feedback loop is broken.

Now, some say that the feedback loop is already broken because Google over-optimized AdWords. Content farms, SEO-focused garbage sites and tricks to rank are hard to stomach, but they do make money from Google’s traffic. This is especially acute with products and product reviews because the path to monetization is so clear; end users are looking to buy and being on page 1 will result in money. I agree with this critique; I’m not sure the current knowledge sharing experience is optimal, but humans have been working around Google’s limitations.

More human labor helps with this. I’ve seen this happen in two ways, especially around products.

  • Social media, where searchers are relying on curation from experts. Here end users aren’t searching so much as browsing from a subset of answers.
  • Reddit, where searchers are relying on the moderators and groups of redditors to keep spam out of the system. Who among us hasn’t searched for “<product name> review reddit” to avoid trash SEO sites? This also works with other sites like Stack Overflow (for programming expertise).

In contrast, the knowledge product disintermediation of ChatGPT is complete. I’ll never know who helped me with Typescript. Perhaps I can’t know, because it was one million little pieces of data all broken up and coalesced by the magic of matrix algebra.

This will work fine for now, because a large corpus of training data is out there and available. But will it work forever? I dunno. The cycle has been broken, and we will eventually feel the effects.

In the short term, I predict that within the next three months, there will be a Creative Commons type license which prohibits the usage of published content by LLMs.

Workers in the Gig Economy Have Tremendous Autonomy

This essay by Bill Gurley, “The Thing I Love Most About Uber”, is well worth reading. In it he discusses the insane level of flexibility working for Uber (or, though he doesn’t state it, Lyft) gives the drivers. He also goes into some great details about the typical driver and earnings.

I have been a contractor for much of my career and when I was, I placed a large value on freedom. Freedom to choose clients, freedom to take time off, freedom to work when I needed to. As a software developer, if you are willing to accept the associated risks, you’ve been able to choose autonomy for decades. (Some contractors are even more autonomous.)

But that level of autonomy still requires large blocks of contiguous time, some level of marketing capability, and specialized knowledge.

In the USA today, the ability to drive and car ownership are ubiquitous (88%). And Uber/Lyft take care of the marketing. And the demand is such that you don’t necessarily need large blocks of time. So the autonomy provided is at a much higher level than any previous type of contractor.

This is amazing. I can’t think of another market where the demand and supply pools are so large and the time and skill commitment are so small.

Blockchain now in the trough of disillusionment?

Looks like the blockchain may be headed into the trough of disillusionment. See also Kevin Owocki’s thoughts on ICOs.

This happens to every technology. There is wild optimism over the usefulness (remember Iridium?), which is overblown. It even happened in the 1800s with trains. That said, I’m sure there are uses for blockchain technology beyond stores of value, and am looking forward to seeing those emerge. For a fascinating read on the rollout of technology, see The Deployment Age.

Book Review: Working With Coders

Software is so integral to business processes and relatively inexpensive compared to labor that I believe every company is going to be a custom software company, in the same way that every company is an accounting company or every company uses paper. I happened on an interesting blog post and saw the author had written a book, “Working With Coders”. How non-technical folks interact with coders is a topic of perennial interest to me, so I picked it up after reading the first few pages on Amazon. The book is written for clients, CEOs or project managers who are going to be working with developers to deliver applications that will provide business value.

Frankly, I couldn’t put it down.

The author, Patrick, is an engaging, opinionated writer. He breaks down complicated concepts into easily digestible pieces. Where there’s more to the story, there’s a footnote with a snarky comment or a link to more information. Patrick also provides nuts and bolts examples to show why something that seems simple to change is not (scaling text in a browser, for example). He also covers how big decisions like language, frameworks and library choices at the beginning of a project constrain freedom and choices further down.

Patrick covers what developers do, how they think, and why projects often fail. I thought his explanation of the benefits of agile development was darn good, and his explanation that even agile projects fail more often than they succeed was pretty depressing. He also discusses how the house construction metaphor for building software is just a big fat untruth.

I also enjoyed the section about testing in general, the various types of testing, and where they make sense. There’s also a section on finding coders, including a good explanation of why not to hire them as employees (you might be better off just hiring a development shop, depending on your needs). The chapter on how to deal with common issues (“the team hates each other”, “we’re behind schedule”) was worthwhile. His solutions won’t work for everyone. Maybe you’ll want to deal with these issues differently, but considering them before they happen will only help you prepare.

Of course, I also enjoyed the chapter on how to keep coders happy (continuous learning, quiet, a fast computer). In general the author is careful to avoid stereotypes, but does do a good job of covering common themes. I haven’t met too many developers who love working in bullpen environments.

I am definitely not the target audience. Neither is someone who is an experienced manager of developers. However, I am a subject of the book, so it resonated with me and I definitely found myself nodding along. There aren’t too many books I have wanted to distribute copies of (the two others are “The Hard Thing About Hard Things” and “Climate Wars”), but this is one.

If you work in a consulting practice with inexperienced clients or if you work in a product company with an owner or higher-up who isn’t technical, reading this book will give you insights into their questions and thought processes. And if you can find a way to give them this book without being condescending (“hey, I found this book fascinating for helping facilitate conversation, maybe you will too”), both they and you will benefit.

A follow up to “Deeply problematic but practical advice”

I linked to the first article Charity wrote, and wanted to link to her follow on piece/”post mortem”. (In technical terms, a post mortem is an examination of a problem or system failure in hopes of avoiding the situation in the future.) From the post, she encountered some very harsh words from the Internet:

I have never received textual scrutiny of this type before, where every single word was turned over and macerated and peered at for evidence of traitorous views. It sucks.

Lots of good stuff there about the reactions to her original post, her takeaways and how she would do some things differently next time. Worth the read.

“Deeply problematic but practical advice”

This post from Charity, who I believe I first started following when she presented at a Gluecon about Parse, is excellent and speaks to some of the strategies she’s used to succeed in technology.

If you feel like table flipping out of tech, just remember the rest of the world is at LEAST as sexist as tech is, but without the money and power and ridiculous life-coddling. Where exactly do you think you’re going to go?

Several hundred words of zero bs. Worth a read. And if you’re not a member of a marginalized group, it’s a great read to give you a taste of what it must be like. At least that’s what I took from it.

Update 3/5/18: My wife pointed out that she was offended by this post (a bit tone deaf, to paraphrase). I just want to be clear that I’m not a member of any marginalized group and just wanted to call attention to what I thought was a post documenting strategies, which I inferred were in response to some problems that have been publicized recently.

Blast from the past: 5 worlds

5 Worlds is a Joel Spolsky classic. This article needs to be updated (it’s from 2002, when shrinkwrap software was still A Thing) but it still has a lot of wisdom and illustrates just how large the scope of work available to software developers is (even more so now that software is eating the world).

Whenever you read one of those books about programming methodologies written by a full time software development guru/consultant, you can rest assured that they are talking about internal, corporate software development. Not shrinkwrapped software, not embedded software, and certainly not games. Why? Because corporations are the people who hire these gurus. They’re paying the bill.

Note that assuming a software developer is a webdev is like assuming a lawyer is a trial attorney. Just like there are lots of ways to practice law, there are lots of ways to build software. And, to be honest, this is probably true of every profession. If you go to a party and ask someone “what do you do” and really really listen, chances are you’ll get a startling view of the world, because everyone does something interesting.

Boulder Blockchain Meetup

I went to the Boulder Blockchain Meetup a few days ago. It was fascinating. The entire room was full to standing, and they went around and asked everyone to do a quick intro. Then we separated into three groups:

  • beginners
  • developers
  • everyone else

The beginners group, where I went, was 10ish folks in a room discussing all different aspects of the blockchain, from who might be interested in using it to what a particular coin might be used for to ‘buying the dip’. I was surprised at how many non-developers were there (40-60%). There was a lot of talk about ‘trading’ crypto currencies. To be honest, it felt a bit like the wild west, with plenty of interesting work and some scams all mixed together.

However, it was interesting enough to me to take a deeper look into Ethereum (there are so many crypto currencies, but this seems like a good one to investigate, if you are a developer). This looked useful, as did this.

Finally, if you’d like a two minute intro into why this is worth investigating, here’s a video from the Meetup website (otherwise, you should totally check out the next meetup):