
Settle the Routine, Focus on the Important

[Photo: my bike lock]

This is a picture of my bike lock.

There are many bike locks, but this one is mine.

I always put my bike lock on as seen.  My helmet hangs with the lock running through the styrofoam (because, on the off chance someone wants to steal a bike helmet, they'd have to destroy mine to get it).  The lock runs through both the front wheel and the frame, so that the wheel can't be stolen–nothing is sadder than a locked bike with a missing front wheel, and a quick-release wheel comes off fast.  The actual lock (where I put my key) is sheltered from the elements, both because the helmet protects it and because it faces down.

After commuting on my bicycle hundreds of times and leaving it overnight a handful of times, I've determined this is the optimal lock procedure.  Not that I'm not open to new ideas–a few years ago I took my helmet with me, but then I saw someone lock their helmet like this and thought "what a wonderful idea!".  But once I arrived at what seemed an optimal solution, I put it on autopilot.  This frees up my mind to pursue other things.  It also means I hadn't analyzed the way I lock my bike in any depth until I started to write this blog post, and once this post is done, I won't think about it for the foreseeable future.  The bike lock setup is settled.  I'm always looking for other aspects of my life to settle, while still being open to possible improvements (if you lock your bike differently, let me know!).

There's a similar tension in software development.  On the one hand, I want to be open to new ideas, frameworks, concepts and solutions.  On the other hand, it's easy to chase the newest 'shiny thing' every time, spin my wheels, and not accomplish what I need to accomplish.  One way to avoid the latter is to settle some decisions in my work life.  These can range from the trivial–I have used the same aliases file for over a decade–to the foundational–I only work on the unix stack.

Sometimes the settling is imposed by the project–it's a RoR app that needs to be upgraded, or there are already extensive Java libraries that the application will use (and refusing such constraints can mean you won't get the gig or the job).  Time can also be a limiting factor–deadlines have a wonderful way of focusing you on the problem at hand, rather than running off to explore that new library.  It's also important to realize that you can settle the trivial–what shell you work in, what OS you use, how you lock your bike–which lets you focus on the important–the app you are writing, the restaurant you biked to.

(Of course, what is trivial for some projects may be foundational for others.)

When I'm working on a side project, settling on a technology can be hard–it is more exciting to explore that new XYZ than to grind away at a bug or a new feature in an app I've already written.  But, in the end, extending and shipping an existing app is almost always more rewarding.

We software developers live on a knife edge–on one side is irrelevance, though it may be profitable (COBOL); on the other is flitting from new technology to new technology (whether Erlang, Haskell, NodeJS or something newer) without ever mastering any of them.

Fun stuff I’ve done

[Amusement park photo by Loozrboy]

One of the companies I've met with wanted a bit more context around work I've done, so I wrote up some of the 'greatest hits' of tech work I've done in the past couple of years. It was so much fun that I thought I'd post it here so I'll remember it in a few years.  These are just some of the tech highlights, and don't include other things I've learned (managing folks, softer development skills, etc).

  • Colorado CSAs.info is a directory of Colorado CSAs. The site received 28k visits in 2013, and 26k visits so far in 2014. You can view the source, which has some quick-and-dirty aspects, since this is a side project.
  • Home valuation processing software output. This was a fairly simple algorithm (linear regression among geographic datasets) but involved some interesting processing because of the number of records involved (1M+) and the variety of data sources. The graph shown here was generated by code I wrote; the content is human generated, however. I worked directly with the CEO on requirements for this v1 project.
  • I wrote an ebook about an aspect of mobile app development last year.
  • I built large chunks of COhomefinder (which has not been touched in two years, because the business decided to move forward with an outsourced home search solution). The last code I touched was replacing Google Maps with MapQuest. I also built a widget for featured home listings that could be dropped into other websites. You can see both on the search results page.
  • I wrote a hybrid mobile app that displays news and info about neighborhoods in the Front Range. This never really got much traction, unfortunately.

It’s fun to look back.

Parents In Tech Interview

[Baby photo by paparutzi]

A few months ago I was contacted by Morgan, who had read a comment I made on Hacker News about reshuffling my work-life balance.  He was starting a site for parents who work in technology, and was looking to interview such people for tips on parenting.  After a flurry of emails, we finally found a time that worked for both of us and were able to Skype for half an hour.

My interview is up here.  Morgan doesn’t do a ton of editing, so it is a little rough, but you get a sense of my thought process:

M: Has having a Baby changed your worldview, beliefs, or how you treat other people? How so?


D: Sometimes I wonder how my parents can take me seriously, given that they saw me as an infant. You put it nicely, getting some empathy, starting out as something that just cries, poops and sleeps.

Full post here.

If you are a parent who works in technology and want to chat with Morgan, let me know and I’ll do an intro.

#TBT: Precision and Accuracy in Software

[Photo by Guerrilla Futures | Jason Tester]

I originally wrote this in Dec of 2004. I still think that having someone who can answer engineers' questions authoritatively increases the engineers' productivity. However, now I'd emphasize that developers need to spend some time learning their domain to gain some intuition, and that truly great business software engineers learn when to bump a question up to a business person and when their intuition can be trusted.

————————————————————-

Back in college, when I took first year physics lab, there was a section of the course that focused on teaching the difference between precision and accuracy in measurement. This distinction is crucial in experimental physics, since measurement is the bedrock of such experimentation. Basically, precision is how many digits of a measurement actually mean something. If I measure the length of a room with my stride (and find it to be 30 feet long), the precision is less than if I measure the length of the room with a tape measure (and find it to be 33 feet, 6 and ¾ inches long). However, it's possible that the stride measurement is more accurate than the length found with the tape measure, that is, that it better reflects how long the room actually is. (Perhaps there's clothing on the floor which adds to the tape measurement, but which I stride over.)

These concepts aren’t just valid in physics; I think they’re also useful in software. When building a piece of software, I am precise if I build what I say I am going to build, and I am accurate if what I build actually meets the client’s business needs, that is, it solves the business problem. Almost every development tool either makes development more precise or more accurate.

The concept of precision lends itself easily to automation. For example, unit testing is rapidly gaining credence as a useful software technique. With unit testing, a developer writes test cases for each part of their code (often at the method level). Running these tests ensures that the code is actually doing what the developer thinks it is doing. I like writing unit tests; it gives me comfort to know that corner cases are taken care of and that changes to code can be fairly easily regression tested. (A minimal example of such a test follows the list below.) Other techniques besides unit testing that help ensure precision include:

Round tripping: using a tool like TogetherJ, I can ensure that the model (often described in UML) and the code are in sync. This makes it easier for me to verify my mental model against the code.

Specification writing: The more precise a spec is, the easier it is to translate into code.

Compilers: the checking that occurs at compilation time can be very helpful in ensuring that the code is doing what I think it is doing–at a very low level. Obviously, this technique depends on the language used.
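To make the unit testing point concrete, here is a sketch of a test guarding a corner case. The greet method is a made-up example (JUnit 4 syntax), not code from any project mentioned here:

    // Toy example: a method under test plus two unit tests,
    // one of which pins down a corner case.
    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class GreetingTest {

        // Hypothetical method under test.
        static String greet(String name) {
            if (name == null || name.trim().isEmpty()) {
                return "Hello, stranger";
            }
            return "Hello, " + name;
        }

        @Test
        public void greetsByName() {
            assertEquals("Hello, Jane", greet("Jane"));
        }

        @Test
        public void handlesBlankName() {
            // The corner case that makes future changes less scary.
            assertEquals("Hello, stranger", greet("  "));
        }
    }

Once tests like these exist, they act as a safety net: any change that breaks the corner case fails immediately rather than surfacing later as a bug.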

Now, precision is needed, because if I'm not confident that I understand what the code is doing, then I'm in real trouble. However, accuracy is much more important. Having a customer onsite is a great example of a technique to ensure accuracy: you have a business domain expert available all the time for developers' questions. In this situation, when developers stumble across a part of the business problem that they don't quite understand, they don't do what developers normally do (in order of decreasing accuracy):

  1. Ask another developer, which works great if the target audience is developers, but not so well otherwise.
  2. Best approximation (read: guess at the correct answer).
  3. Ignore the issue. (‘I’ve got a lot more code to write before I can go home today, and we’re shipping in two weeks. We’ll just let the customer discover it and deal with it as a bug.’)

Instead, they have a real live business person, to whom this software really matters (hopefully), whom they can ask. Doing this makes it much more likely that the final solution will actually solve the business problem. Other techniques that help improve accuracy include:

Issue tracking software (I use Bugzilla): Having a place where questions and conversations are recorded is truly helpful in making sure the mental models of the business user and the programmer stay in sync. Using a web based tool means that non-technical users can participate and contribute.

Specification writing: A well written spec allows both the business user and developer to have a sense of what is being built, which means that the business user can correct invalid notions at an early stage. However, if a spec is too detailed, it can be used to justify precision at the cost of accuracy (‘hey, the code does exactly what’s specified’ is the excuse you’ll hear).

Spring and other dependency injection tools, as well as IDEs: These tools help accuracy by decreasing the costs of changing code.

Precision and accuracy are both important in software engineering. Perhaps the best way to characterize the two concepts is that precision is the mapping of the programmer’s model of the problem to the computer’s model, whereas accuracy is the mapping of the business’ needs to the programmer’s model. However, though both are needed, accuracy is much harder to obtain. Knowing that I’m building precisely what I think I’m building is beneficial only insofar as what I think I’m building is actually what the customer needs.

Get Developers Going with Vagrant

[Computer photo by jurvetson]

I just ran across Data Science at the Command Line, a new book, website and VM that gives you many, many unix tools to manipulate large data sets.

I have played around with the tools, but what really stood out to me was how easy it was to get them installed. While the author does provide install instructions, the VM is ready to go in only five steps (two if you already have Vagrant and VirtualBox installed).

I have used Vagrant in the past and it makes managing VMs very easy. If you are distributing a developer tool, make it easy to get up and running by letting the user download a virtual environment that is correctly configured. Then trying out your software becomes a 30 second test, not a 10 minute test.
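As a sketch of what that looks like from the user's side (the box name below is a guess from memory; check the tool's site for the real one):

    vagrant init data-science-toolbox/dst   # fetch a Vagrantfile pointing at the published box
    vagrant up                              # download the box and boot the VM
    vagrant ssh                             # log in; the tools are already installed

Three commands, and the user is inside a fully configured environment.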

Of course, the author of Data Science at the Command Line is not alone.  I noticed Ionic is providing a VM for Android development too.  It looks like Vagrantcloud lets you share virtual instances with the world as well.

Frankly, I used to download and install Cygwin first thing on every Windows computer I bought. But VMs are so easy now that it's simpler to just install VirtualBox, Vagrant and a version of Linux (CentOS or Ubuntu).

Gardening and software development

It’s the end of spring/early summer in the northern hemisphere, so it’s time to plan the vegetable garden again. I was putting some tomatoes in the other day and musing about the similarities between gardening and software development. To wit:

  • I have a lot of hesitancy about planting–especially perennials.  It feels so permanent, and I might screw things up, and maybe I should go back to the drawing board, or maybe just do it next weekend….  But just starting makes the problem so much easier–it loses its weight.  Your garden will never be perfect, but an imperfect garden is 100% better than no garden.  Similarly, when confronted with a new project or feature, half the battle is just starting.
  • You will have ample opportunity to make mistakes in both gardening and software development, so feel free to learn from them.  I don’t know where I heard it, but “it’s fine to make mistakes, just try not to make the same ones.”
  • Automate, automate, automate.  The more you can rely on machinery to free you from the drudgery of gardening, the more you can rest assured that you will have a great crop.  Similarly, the more you can rely on automated testing and scripts, the more complex you can make your systems, and the more freely you can change them.
  • Trying something different is fun.  I planted artichokes this year.  I also played around with easyrec.  I can't speak for the artichokes yet, but exploring a new tool was interesting and fun.  Look up from your code once in a while and visit Hacker News (thanks to Jeff Beard for turning me on to that resource) to find something new to learn about.

Many software developers are obsessed with passive income, but gardening is the original passive income stream–food grows for you while you are doing something else!

Testing with Pentaho Kettle – adding new logic

We finally have a working test suite, so let’s break some code.  We have a new requirement that we greet users who are under the age of 30 with ‘howdy’ because that’s how the kids are saying ‘hello’ nowadays.

You just jumped into a series of blog posts on testing ETL transformations written with Pentaho Data Integration. Previous posts covered why you should test, the business logic under test, and the construction of the test harness.

The first thing we should do is write a test that exercises the logic we are trying to write.  We make a directory with a name descriptive of the behavior we are trying to test, and add a row to the tests.csv driver file pointing to the files in that directory. Here’s how the additional line will look:

agebasedgreeting,agebasedgreeting/input.txt,agebasedgreeting/expected.txt

And we will copy over the data files from the first test case we had (simplerun) and modify them to exhibit the expected behavior (a new greeting for users under 30).  We don't have to modify the input file, since it has people both under 30 and over 30 in it, but just to catch any crazy boundary conditions, we will add someone who is 30 and someone who is 31 (we already have Jane Doe, who is 29).
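For illustration only–the real data files are in the github project–if the input were a simple CSV of names and ages, the boundary additions might look like:

    name,age
    Jane Doe,29
    Barry Boundary,30
    Olive Overthirty,31

The 29/30/31 trio brackets the boundary, so an off-by-one error in the age comparison will show up as a test failure.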

Then we need to modify the expected output file to reflect the howdyification of the greeting. You can check out both source files on github.

Then we run the tests.
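Assuming the test suite runner job is named something like test_suite_runner.kjb (the actual file name is in the github project), the command line invocation looks roughly like:

    sh kitchen.sh -file=test_suite_runner.kjb -level=Basic

kitchen.sh is Kettle's standard command line job runner; -level controls logging verbosity.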

[Screenshot: the failing agebasedgreeting test in the kettle log]

You can see the failure in the log file that kettle generates and in the build/results directory.  You can also see that we added a job entry to clean up the build directory so that when we run tests each time, we have a clean directory into which to write our output file.


Now that we have a failing test, we can modify the core logic to make the test pass. Writing the logic is an exercise left to the reader. (Or you could look at the github project :).

We re-run the tests to see if they pass, but it looks like simplerun fails before we can even test agebasedgreeting:

[Screenshot: the simplerun test failing after the logic change]

We can diff the expected and output files and see that, whoops, the simplerun test case had some users who were under 30 and were affected by the logic change.

This points out two features of this method of testing:

  1. Regression testing is built in.
  2. Because of the way we abort tests, TestSuiteRunner only runs until the first failure.

The easiest way to fix this issue is to inspect output.txt and verify that it is as expected for the simplerun test.  If so, we can simply copy it over to simplerun/expected.txt and use that file as the new golden copy.
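Concretely (the paths here are illustrative; adjust them to the project layout), that looks something like:

    diff simplerun/expected.txt build/results/output.txt   # eyeball the differences
    cp build/results/output.txt simplerun/expected.txt     # bless the new output

This 'review, then bless' step is the one place a human judgment call is required; everything else stays automated.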

We also realize that we are passing in the hello field to the output.txt file and that doing so is no longer required.  So we can update the expected.txt in both directories to reflect that.  Running the tests again gives us success.

[Screenshot: all tests passing]

Now that we’ve added code, I’ll look at some next steps you can take if you are interested in further testing your ETL processes.

Sign up for my infrequent emails about pentaho testing.

Testing with Pentaho Kettle – business logic

So, the first step in building the test harness is to create a skeleton of the transformations we will need to run.  These transforms contain the business logic of your ETL process.

Pssssst. This article is part of a series.  The previous post covered why you should test your Pentaho Kettle ETL processes.

Typically, I find that my processing jobs break down into 4 parts:

  • setup (typically job entries)
  • loading data to a stream (extract)
  • processing that data (transform)
  • saving that data to a persistent datastore (load)

Often, I combine the last two steps into a single transformation.

So, for this sample project (final code is here), we will create a couple of transformations containing business logic.  (All transformations are built using Spoon on Windows with Pentaho Data Integration version 4.4.0.)

The business needs to greet people appropriately, so our job will take a list of names and output that same list with a greeting customized for each person.  This is the logic we are going to be testing.

First, the skeleton of the code that takes our input data and adds a greeting.  This transformation is called ‘Greet The World’.

[Screenshot: the ‘Greet The World’ transformation]

I also created a ‘Load People to Greet’ transformation that is just a text file input step and a copy rows to result step.

[Screenshot: the ‘Load People to Greet’ transformation]

The last piece is the ‘GreetFolks’ job, which merely strings together these two transformations.  This would be the real job run regularly to serve the business's needs.

[Screenshot: the ‘GreetFolks’ job]

This logic is not complicated, but could grow to be quite complex.  Depending on the data we are being passed in, we could grow the logic in the ‘Greet The World’ transformation to be quite complex–the variety of greetings could depend on the time of year, any special holidays happening, the gender or age or occupation of the person, etc, etc.

Astute observers may note that I didn't write a test first.  The reason is that getting the test harness right before you write these skeletons is hard.  It's easier to write the simplest skeleton, add a test to it, and then, for all future development, write a failing test first.

As a reminder, I’ll be publishing another installment of this tutorial in a couple of days.  But if you can’t wait, the full code is on github.

Sign up for my infrequent emails about pentaho testing.

Testing with Pentaho Kettle/PDI

Pentaho Kettle (more formally called Pentaho Data Integration) is an ETL tool for working with large amounts of data.  I have found it a great solution for building data loaders that integrate external data sources into applications.  I've used it to pull data from remote databases, flat files and web services, munge that data, and then push it into a local data store (typically a SQL database).

However, transforming data can be complex.  This is especially true when the transformation process builds up over time–new business rules come into play, exceptions are made, and the data transformation process slowly becomes more and more complex.  This is true of all software, but data transformation has built-in complexity and a time component that other software processes can minimize.

This complexity in turn leads to a fear of change–the data transformation becomes fragile and brittle.  You have to spend a lot of time thinking about changes.  Testing such changes becomes a larger and larger effort, because you need to cover all the edge cases.  In some situations, you may want to let your transform run for a couple of days in a staging environment (you have one of those, right?) to see what effect changes to the processing have on the data.

What is the antidote for that fear?  Automated testing!

While automated testing for Pentaho Kettle is not as easy as it is with JUnit or Ruby on Rails, it can be done.  There are four major components.

  • First, the logic you are testing.  This is encapsulated in a job or transform.  It should take data in, transform it and then output it.  If you depend on databases, files or other external resources for input, mock them up using text files and the Get rows from result step.  Depending on your output, you may want to mock up some test tables or other resources.
  • Second, the test case runner.  This is a job that is parameterized and sets up the environment in which your logic runs, including passing the needed input data.  It also should check that the output is expected, and succeed or fail based on that.
  • Third, the test suite runner.  This takes a list of tests, sets up parameters and passes them to the test case runner.  This is the job that is run using kitchen.sh or kitchen.bat.  You will need a separate test case and test suite runner for each type of logic you are testing.
  • Finally, some way to run the job from the command line so that it can be automated.  Integration into a CI environment like Hudson or Jenkins is highly recommended (a sketch of such an invocation follows this list).
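For example (the job name and path here are illustrative), a CI build step can boil down to a single kitchen invocation, relying on the fact that kitchen exits with a non-zero status when the job fails:

    # run the Kettle test suite; a non-zero exit code fails the CI build
    sh kitchen.sh -file=/path/to/test_suite_runner.kjb -level=Basic

Hook that one line into Jenkins or Hudson and every commit gets your data logic regression tested.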

It can be time consuming and frustrating to set up these test harnesses, because you are essentially setting up a separate job to run your logic, and therefore doing twice the work.  In addition, true unit testing, as provided by the frameworks mentioned above, is impossible with Kettle due to the way columns are treated–if you modify your data structure, you have to make those changes across all the tests.  However, setting up automated testing will save you time in the long run because:

  • the speed at which you can investigate “interesting” data (aka data that breaks the system) is greatly increased, as is your control (“I want to see what happens if I change this one field” becomes a question you can ask)
  • regression tests become far easier
  • if you run into weird data and need to add special case logic, you can also add a test to ensure that this logic hangs around
  • you can test edge cases (null values, large fields, letters where numbers are expected) without running the entire job
  • you can mimic time lag without waiting by varying inputs

I hope I’ve convinced you to consider looking at testing for Pentaho Kettle.  In the next couple of posts I will examine various facets of testing with Pentaho Kettle, including building out code for a simple test.

Sign up for my infrequent emails about pentaho testing.

JFreeChart for the win

[Chart: a scatter plot with a fitted regression line]

I wanted to make some charts showing linear regressions, but they needed to be images (so libraries like d3.js were out).  My first inclination was to use Google’s Image Charts API, but it has been deprecated.  I looked around and couldn’t find anything like it.

So it looked like I'd need to code something up.  Some searching turned up some great open source projects.  Using JFreeChart and Apache Commons Math, some blog posts (this one about regression, this one about drawing dashed lines) and a post on StackOverflow about scatter plot basics, I was able to put together the above regression line chart in about half a day.
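For the curious, here is a minimal sketch of the approach–not my exact code–assuming JFreeChart 1.0.x and Commons Math 3.x, with made-up sample data:

    import java.io.File;

    import org.apache.commons.math3.stat.regression.SimpleRegression;
    import org.jfree.chart.ChartFactory;
    import org.jfree.chart.ChartUtilities;
    import org.jfree.chart.JFreeChart;
    import org.jfree.chart.plot.PlotOrientation;
    import org.jfree.chart.plot.XYPlot;
    import org.jfree.chart.renderer.xy.XYLineAndShapeRenderer;
    import org.jfree.data.xy.XYSeries;
    import org.jfree.data.xy.XYSeriesCollection;

    public class RegressionChart {
        public static void main(String[] args) throws Exception {
            // Made-up data points; real input would come from your data source.
            double[][] points = {{1, 2.1}, {2, 3.9}, {3, 6.2}, {4, 8.1}, {5, 9.8}};

            // Build the scatter series and fit a regression over the same points.
            XYSeries scatter = new XYSeries("observations");
            SimpleRegression regression = new SimpleRegression();
            for (double[] p : points) {
                scatter.add(p[0], p[1]);
                regression.addData(p[0], p[1]);
            }

            JFreeChart chart = ChartFactory.createScatterPlot(
                    "Regression example", "x", "y",
                    new XYSeriesCollection(scatter),
                    PlotOrientation.VERTICAL, false, false, false);

            // Draw the fitted line as a second dataset with a lines-only renderer.
            XYSeries fit = new XYSeries("fit");
            fit.add(1.0, regression.predict(1.0));
            fit.add(5.0, regression.predict(5.0));
            XYPlot plot = chart.getXYPlot();
            plot.setDataset(1, new XYSeriesCollection(fit));
            plot.setRenderer(1, new XYLineAndShapeRenderer(true, false));

            // Write the chart out as a static image.
            ChartUtilities.saveChartAsPNG(new File("regression.png"), chart, 600, 400);
        }
    }

The key design point: because JFreeChart renders to a plain image, the same code works in a batch job or on a server with no browser involved, which is exactly what ruled out d3.js.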