
LLM training data, or, a broken virtuous cycle

Have you used ChatGPT?

I have and it’s amazing.

From the whimsical (I asked it to write a sonnet about SAML and OIDC) to the helpful (I asked for an example of a script with async calls using TypeScript), it’s impressive.

However, one issue I haven’t seen mentioned is training data. Right now, there are sets of training data used to teach these large language models (LLMs) what is correct about the world and what is not. This podcast is a great intro to the model families and training process.

But where does that training data come from? I mentioned that question here, but the answer is that humans provide it. Human effort and knowledge are gathered on Reddit, Wikipedia, and other places.

Why did humans spend so much time and effort publishing that knowledge? Lots of reasons, but some include:

  • Making money (establishing yourself as an expert can lead to jobs and consulting)
  • Helping other humans (feels good)
  • Internet points (also feels good)

In each case, the human contributing is acknowledged in some way. Maybe not by the end user, who doesn’t, for example, read through a Wikipedia article’s edit history. But someone knows. Wikipedia editors know and celebrate each other. Here’s a list of folks who have edited that site for a decade or more.

What about search engines? Google reifies knowledge in a manner similar to ChatGPT. But, cards notwithstanding, Google offers a reputational reward to publishers. It may be in money (AdWords) or site authority. Other applications like Ahrefs help you understand that authority, and I can tell you, as a devrel, that high search engine ranking is valuable.

ChatGPT offers none of that, at least not out of the box. You can ask for links to sources, but the end user must choose to do so. I doubt most do, and, in my minimal experience, the links are often broken or made up.

This fact breaks the fundamental virtuous cycle of internet knowledge sharing.

Before, with search engines:

  • Publisher/author writes good stuff
  • Search engine discovers it
  • User reads/enjoys/learns from it on the publisher’s site
  • Publisher/author gains value, so publishes more
  • Search engine “sees” people are enjoying the publisher’s content, so promotes it
  • More users read it
  • Back to step one

After, with LLMs:

  • Publisher writes good stuff
  • LLM trains on it
  • User reads/enjoys/learns from it via ChatGPT
  • … crickets …

The feedback loop is broken.

Now, some say that the feedback loop is already broken because Google over-optimized AdWords. Content farms, SEO-focused garbage sites, and tricks to rank are hard to stomach, but they do make money from Google’s traffic. This is especially acute with products and product reviews, because the path to monetization is so clear: end users are looking to buy, and being on page one results in money. I agree with this critique; I’m not sure the current knowledge sharing experience is optimal, but humans have been working around Google’s limitations.

More human labor helps with this. I’ve seen this happen in two ways, especially around products.

  • Social media, where searchers are relying on curation from experts. Here end users aren’t searching so much as browsing from a subset of answers.
  • Reddit, where searchers are relying on the moderators and groups of redditors to keep spam out of the system. Who among us hasn’t searched for “<product name> review reddit” to avoid trash SEO sites? This also works with other sites like Stack Overflow (for programming expertise).

In contrast, the knowledge product disintermediation of ChatGPT is complete. I’ll never know who helped me with TypeScript. Perhaps I can’t know, because it was one million little pieces of data all broken up and coalesced by the magic of matrix algebra.

This will work fine for now, because a large corpus of training data is out there and available. But will it work forever? I dunno. The cycle has been broken, and we will eventually feel the effects.

In the short term, I predict that within the next three months there will be a Creative Commons-style license which prohibits the use of published content by LLMs.

Books and other resources to level up as a software developer

A while ago there was an HN post asking for suggestions on [r]eading material on how to be a better software engineer.

Here’s my list, based in part on a comment I made there.

First, books:

  • Secrets of Consulting by Gerald Weinberg, because every problem is a people problem.
  • Refactoring by Martin Fowler et al. discusses how and why to refactor, as well as providing a nomenclature for the process.
  • Code Complete by Steve McConnell is a bit dated (the last version I could find was from 2004) but a great overview of the entire software process, from requirements to maintenance.
  • The Mythical Man-Month by Fred Brooks covers best practices about software development, written about a project from the 1960s and 1970s. Nothing new under the sun.
  • The Joel On Software Strategy Letters cover different aspects of software strategy. The link is the first one, but all of them (I think there are five) are great.
  • Letters to a New Developer is a collection of essays helpful to new developers. Note: I wrote this book, but I think it does a good job of discussing the “soft skills” in an easily digestible format.
  • The Pragmatic Programmer by Dave Thomas and Andy Hunt. I haven’t read the revised 20th anniversary edition, but the first one opened my eyes to the craft of software.
  • High Output Management by Andy Grove illustrates how to think about throughput.
  • The Phoenix Project by Gene Kim et al. is a fun novel(!) about applying lean management principles to software engineering.
  • Good to Great by Jim Collins focuses on what great companies bring to the table. Helps me evaluate where to work.
  • Managing Humans by Michael Lopp. The whole Rands site is worth reading, but I enjoyed this book about how to manage teams and build software. See above.
  • Don’t Make Me Think by Steve Krug shows ways to think about usability, focusing on webapps. Short and easy.
  • Badass: Making Users Awesome by Kathy Sierra helps you put yourself in the shoes of your users and think about how to build software they will love. Short and easy.

Then, podcasts and videos:

  • Mastery, Autonomy and Purpose, a great video about what people really want in work.
  • The Manager’s Toolbox podcast focuses on nuts-and-bolts skills for managing people. You didn’t say you wanted to be a people manager, but knowing what managers think about will make you more effective in any org.
  • Screaming in the Cloud is useful for keeping up with AWS and other cloud provider offerings.
  • SE Radio is a bit dry, but has a great back catalog of software engineering focused episodes.

A few other resources:

  • The Rands Leadership Slack has over 10,000 engineering leaders discussing all kinds of software related topics.
  • CTO lunches is an email list of engineering leaders. The discussions aren’t consistent, but when they happen, they’re great. Plus, it comes to your email.
  • Hacker News is a great way to burn time, but also a great way to keep on top of topics that are top of mind for some of the best developers in the world.

Reading up on software practices can help you level up as a software engineer because you’ll be able to avoid mistakes others have made before you. It can also offer a view of the big picture; knowing how your software helps your organization or company will only make you more valuable as a developer.

GitHub Actions Are Amazingly Easy

GitHub Workflows are automated jobs that can be triggered by various events against a GitHub repository. They are pretty awesome.

GitHub Actions encapsulate configuration and functionality in a form that can be easily reused in GitHub Workflows.

I was thinking it’d be fun to create some GitHub Actions (yes, I’m the life of the party), so I sat down a few mornings ago to do this. I was shocked at how easy it was.

I followed the first few steps of this tutorial to create a workflow. Then I created an action by following this tutorial. Finally, I edited my workflow to use the new action. That was it.

It was amazingly simple and took me about 30 minutes. I ran into one unrelated issue: to set the executable bit on a shell script on Windows, I had to modify the shell script’s contents to ensure the change was pushed to the remote repo.

If you take a look, you’ll see these are both toy repositories, to be sure. However, the ability to write jobs which will be executed on a git push, a pull request, or other events is great and removes toil. Being able to extract common functionality to an action is even better. Finally, the ability to share the action publicly by adding it to the GitHub Marketplace is fantastic.

I’ve liked CircleCI for a long time, but if I were them I’d be worried.

One issue I found is that the testing/release cycle is pretty tedious (I’ve found action debugging to be an issue for a while).

While I was troubleshooting my executable bit error, I had to do the following every time I wanted to test a change:

  • make a change in the action repository
  • create a new tag
  • push it to the remote
  • switch to the workflow repository
  • bump the action version
  • push to the remote
  • wait for the workflow to complete
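
In shell terms, each iteration looked something like the following sketch (the repository paths, version numbers, and workflow file name are all hypothetical):

```bash
# In the action repo: commit the change and cut a new tag
cd ~/code/my-action                 # hypothetical path
git commit -am "another attempt at the fix"
git tag v1.0.3                      # hypothetical next version
git push origin main --tags

# In the workflow repo: point the workflow at the new tag and push
cd ~/code/my-workflow               # hypothetical path
sed -i 's/my-action@v1.0.2/my-action@v1.0.3/' .github/workflows/ci.yml
git commit -am "bump action to v1.0.3"
git push origin main
# ...then wait for the workflow run to finish in the Actions tab
```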

Not horrific, but pretty tedious. I don’t know if there are other options, such as local deployment, which would reduce that cycle, but that would be swell.

Other than that, 10 out of 10, would write more actions.

Announcing: Letters to a New Developer

I’ve been working on a new project that I’m excited to share here. It’s called Letters to a New Developer. It’s a blog with over forty posts full of advice for people who are just starting out in the field of development.

So much of development is not about code. It’s about teams and processes and organizations and questions. I feel like the traditional education system (whether a four-year college or a bootcamp) does a good job of teaching folks coding fundamentals (what you need to be a programmer), but there is so much more to being a successful developer than coding. In my experience, delivering code is a necessary but not sufficient activity for success in a software company.

That’s why I’ve spent the last five months writing up some of my experiences and learnings.

The posts are mostly mine, but some are from guest posters who bring a different perspective. I also excerpt posts from around the Internet that I find helpful, so that I can stand on the shoulders of some of the giants who have come before.

Here are some of my favorite Letters to a New Developer posts: Always Be Journaling from Brooke Kuhlmann, Get Used to Failure, and Learning to Read Code is More Important Than Learning to Write It.

I hope you find this project as fun to read as I’ve found it to write.

Throwback Thursday: What is the difference between a programmer and a developer?

One of the nice things about blogging for so long is that you get to see the wheat separated from the chaff. If I’m still thinking the same thoughts five years or a decade later, that means that I’ve stumbled on one of my truths. (That or I’m just echoing my thoughts because I haven’t learned anything new. I believe it is the former, especially when it comes to development, something I spend a lot of time thinking about.)

Here’s my 2012 post about the difference between a programmer and a developer. A programmer can be a great coder, but stops when the code ends. A developer uses code to build a business solution. (Pssst, a programmer is much more likely to be replaced by an offshore “resource”.)

One of my favorite analogies for developers is the bike messenger. She may not be as fast as a car or as flexible as a pedestrian, but she can weave back and forth between the street and the sidewalk in a way that neither of those two other modes of transportation can. And in the end she’s all about the destination, just like great developers are.

What can you cut out?

[Fractal image] Perhaps we could have made the site map a bit simpler?

“I have made this [letter] longer than usual because I have not had time to make it shorter.” – Blaise Pascal

I was a mentor for Go Code Colorado over the weekend (mentioned previously). It was a good experience. About 10 teams, 30 mentors, and a couple of hours. I had a lot of fun chatting with the teams, which were all using open data provided on the Colorado Information Marketplace to build an app that will serve a need. They divided the mentors up into functional areas (data science, marketing, developer, startup vet, etc) and let us wander amongst the teams. Sometimes I felt a bit useless (one team was trying to debug a Meteor app that would run locally but failed when deployed to a web server) and other times I felt like I was a bit of a bother (since the teams were also trying to get stuff done while being “mentored”). But for the most part I had interesting conversations about what the teams were trying to accomplish and the best means of doing so from a technical perspective.

One thing that came up again and again was “what can you cut out”. The teams have a fixed timeline (they are only allowed to work until the final competition in early June) and some of the ideas were pretty big. My continuing refrain was:

  • capture all the big ideas on a roadmap (you can always implement them later)
  • cut what you can
  • build a basic “something” and extend it as you have time
  • choose boring technology

For example, one project was going to capture some data and use the blockchain for data storage. I totally get wanting to explore new technology, but for their initial MVP they could just as easily have used a plain old boring database. Or, frankly, a spreadsheet.

Lots of developers don’t want to hear it, but when you are in the early stages of a startup, technology, while an important enabler, can get in the way of what is really important: finding customers, talking to them, and giving them something to pay for.

PS This is a great read on all the hardships of building a startup and how it is so so so important to minimize any unnecessary difficulties.

Navigating new systems

Here are some tips and tricks I have for navigating new software systems, which can sometimes be like navigating a maze. If you’re truly unlucky, it’s a maze, but you’re blindfolded and the walls are covered in randomly placed razors.

The first is to get a clear set of expectations. Will I own the system? Who owns it now? How long have they owned it? How often is it modified? When will it need to be modified again? Is it shaky or stable? Getting these questions answered helps me understand and refine further steps.

The next step is to gain access. There are a lot of different pieces of most modern systems, so access can mean different things. Here are some kinds of access which it may be worth seeking:

  • shell
  • version control
  • database
  • ftp
  • http
  • app level (admin, user)
  • documentation
  • project planning
  • different CI environments (prod, UA, staging)
  • build system
  • admin users
  • end users

After I have access, I like to look at the front end and the back end. By the front end, I mean the user interface. And by the back end I mean the data store. Just looking around and seeing what tables and pages an application has can help.
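
For the back end, even a quick listing helps. Here’s a sketch (the database and table names are placeholders):

```bash
# List the tables, then peek at the shape of an interesting one
mysql -e 'SHOW TABLES' mydb        # "mydb" is a placeholder database name
mysql -e 'DESCRIBE users' mydb     # hypothetical table name
```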

If the system in question is not entirely custom built, googling for the user guide for the default version of the application can be helpful. Finding that user guide and skimming through it can give me more high level understanding, as well as teaching me key nomenclature. Of course if there is any local documentation, that’s helpful too, but I read that with a skeptical eye, as it doesn’t always keep pace with the system.

I also like to look at logfiles. This can help me determine something as simple as whether I’m on the correct server (if I reload the page and the access log file doesn’t change, I am looking at the wrong log file or am on the wrong server). Even better if the system aggregates logs into something like an ELK stack or Papertrail.
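
That sanity check looks something like this (the log path is hypothetical; adjust for your web server):

```bash
# Watch the access log while reloading the page in a browser.
# If no new lines appear, I'm looking at the wrong file or the wrong server.
tail -f /var/log/nginx/access.log   # hypothetical path
```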

Setting up a local development environment can help. Again, this lets me gain an understanding of the big picture components, and also lets me poke at various parts of a system, possibly breaking them, without affecting other developers or, worse, customers.

Asking questions is really important, but this can be hard because often the folks with the most know-how are the busiest.

I also like to see what files or database tables change as I move through the system. With a modest-sized database, I do this by taking a database dump before taking some action, and then after. Then I diff the files, sometimes using sed to break the dump file apart even further (replacing all commas with commas and newlines, for example). If using mysqldump, you can target individual tables; also make sure not to use extended inserts, as they make diffing harder.
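
Concretely, that loop looks something like this sketch (the database name is a placeholder):

```bash
# Dump before and after taking the action in the application.
# --skip-extended-insert writes one row per INSERT line, which diffs much better.
mysqldump --skip-extended-insert mydb > before.sql   # "mydb" is a placeholder
# ... take the action in the app ...
mysqldump --skip-extended-insert mydb > after.sql

# Optionally break rows apart further by replacing commas with comma-plus-newline
# (GNU sed; BSD sed handles \n in the replacement differently):
sed 's/,/,\n/g' before.sql > before.split
sed 's/,/,\n/g' after.sql > after.split

diff before.split after.split
```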

For the filesystem, it’s even easier. I touch a file (ts) and then take the action, then run find . -newer ts -print. This command will show me all the files the system has written that are newer than ts.

Hopefully some of these tips will be helpful to you as you navigate your next new system.

“Choose boring technology”

This article on choosing boring technology is so good. It’s from 2015, so some of the tech it references is more mature now, but the thesis is that the technology underlying a business should be well known and boring to implement. It also discusses how important cohesive technology can be (I love the final footnote).

I think this is an interesting contrast to microservices, which essentially solve repeatedly for local optimizations and then rely on the API or message boundary to allow scaling. Microservices, in my experience, often get a bit hand-wavy when it comes to operationalization (“just put it on Kubernetes, it’ll be fine”).

I have definitely dealt with this in previous companies where I had to put the kibosh on new technologies being introduced (“nope, we’re going to do it in Java”). This meant that as a small team we had a standard deployment stack and process and could leverage our logging knowledge and debugging tools.

Here are some arguments I hear for using the latest and greatest technologies:

  • “Our developers will get bored and we’ll have a hard time recruiting”
    • This is a great reason to have regular hackfests and/or to support developers working on open source or side projects.
  • “New tech will help us do things faster”
    • It’s definitely worth evaluating new technology periodically and adding it to your stack (especially if it is outsourced, a la RDS, or boring, a la memcached). As the original post mentions, if you get substantial benefits from the new technology, then use it full bore and plan for a migration. Don’t end up in a state where you are half in/half out. Also consider tooling and processes to get things done faster–these may have quicker iteration times with lower operational risk.
  • “Choose the right tool for the job”
    • Most Turing-complete languages can be used for any purpose. And you shouldn’t be optimizing for the right tool for the job, but rather the right tool for the business for the job.

Really, you should just go read the post. It’s good.


“The future is already here, but it’s only available as a managed AWS service”

This entire post about how Kubernetes could become the distributed operating system of choice is worth reading. But one statement really struck me:

Well, as they say, the future is already here, but it’s only available as an AWS managed service.

The “they” in this is apparently not William Gibson, as I thought. More details here.

For the past couple of years the cloud providers have matured and moved from offering infrastructure as a service (disk, compute) to platform as a service offerings (SQS, a managed message queue like ActiveMQ, or Kinesis, a managed data ingestion system like Kafka, etc). Whenever you think about installing a proprietary or open source package, you should include the cloud provider offerings in your evaluation matrix. Of course, the features you need may not be there, or the cost may be prohibitive, but including them in an evaluation makes sense because of the speed of deployment and the scaling available.

If you think a system architecture can benefit from a message queuing system, do you want to spend time setting up and maintaining such a system, or do you want to spin up an SQS queue in a few minutes?
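
For comparison, standing up a queue is a couple of CLI calls. Here’s a sketch using the AWS CLI, assuming configured credentials (the queue name is made up):

```bash
# Create the queue, then send it a test message
aws sqs create-queue --queue-name orders-queue      # hypothetical queue name
QUEUE_URL=$(aws sqs get-queue-url --queue-name orders-queue \
  --query QueueUrl --output text)
aws sqs send-message --queue-url "$QUEUE_URL" --message-body "hello"
```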

And the cost may not be prohibitive, depending on the skillset of your internal team and your team’s desire to run such plumbing services. It can be really hard to estimate the running costs of infrastructure services, though you can approximate them by looking at similar services internal teams already run and how much money that takes. The nice thing about cloud services is that the costs are very transparent. The Kinesis data streams pricing example walks through a scenario and concludes:

For $1.68 per day, we have a fully-managed streaming data infrastructure that enables us to continuously ingest 4MB of data per second, or 337GB of data per day in a reliable and elastic manner.
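
(That arithmetic checks out: 4 MB/second × 86,400 seconds/day is 345,600 MB, which is about 337 GB at 1,024 MB per GB.)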

Another AWS instructor made the point that AWS and other cloud services invert the running costs of IT infrastructure. In a typical enterprise, the running costs of your data center and infrastructure are like an iceberg: 10% is explicit (server costs, electricity, etc.) and 90% is implicit (payroll, time spent upgrading and integrating systems). In the cloud world those numbers are reversed, and far more of your infrastructure cost is explicitly laid out for you and your organization.

Truly the future.

Thoughtworks Radar

Last night at the Boulder Ruby Meetup, Marty from Haught Codeworks walked through the Thoughtworks Technology Radar. This is an informative way to keep on top of the latest technology trends without spending a lot of time researching and reading. Sure, you could get the same information from reading Hacker News regularly, but the Radar is less haphazard.

The Radar is published every six months or so, and pulls from experts inside Thoughtworks (a leading software consultancy). You can view techniques, languages and frameworks, tools, and platforms. Each area has a number of technologies and the proposed level of adoption (hold, assess, trial, adopt). Within each radar you can dive in and learn more about a technology, like, say, weex. You can also see the movement of a technology through the proposed levels of adoption over time.

You can also search for a given technology, if you are interested in seeing what its status is. Sometimes technologies are archived, like Elm.

Note that this is not perfect. The lag is real, but may help you avoid chasing shiny objects. The Radar is also inherently biased because of its methodology, but given Thoughtworks’ size, scope, and leadership, it’s probably one of the best technology summaries. It’s certainly the best one I’ve run across.