

An open letter to Robert Reich about BDNT

Hi Robert,

I attended another BDNT on June 4, as I do every quarter or two.  You asked some questions tonight of the community that I think deserve a more measured response than I could muster yelling out in the auditorium.  Questions like: why do you come here?  What does the future of BDNT look like?  Jeez, won’t anyone volunteer to take video?  How can we leverage all the great people at BDNT during the time when we aren’t all in the same room?

First, I want to thank you, Robert, and all the many volunteers and sponsors of BDNT that make it possible.  I have been to a number of them, presented at two, and know some of the volunteers.  I can’t say I’ve met friends there, but it is a great place to go with existing friends to get pumped up about the Colorado tech scene, and new technology in general.

BDNT is, and has long been, a fantastic presentation venue and gathering place for the local tech community.  The focus has always been on building community and helping presenters (and their companies) get better (check out the second to last question on the ‘submit a presentation’ form).

I see two major BDNT constituencies: fly by nighters and regulars.  I’m a fly by nighter–I won’t attend when I get busy or BDNT falls off my radar, so I make it about 2-4 times a year.  Beyond speaking and attending, I have posted a few jobs on the job board, some reviews on my blog, and some tweets, and I have exchanged cards and emails with people I’ve met there, but that’s been the limit of my involvement.

The quality and diversity of the presentations are BDNT’s biggest strength–the five minute format and enforced time limits (as well as the coaching) make presentations so tight.  And if a snoozer slides in, the audience only waits for five minutes.  Therefore, BDNT is a quality, time-efficient event where I can check on the pulse of the tech community (is technology XXX going to be big?  how many jobs were mentioned for technology YYY?).

Because the presentations are so important, the biggest service BDNT could provide to us fly by nighters is to videotape the presentations.  I understand, Robert, that BDNT is a shoestring operation and that video takes time and money.  I don’t know exactly how to tackle that–two ideas come to mind: ask a local video production company (People Productions, for example) for sponsorship, or set up an ipad, share to youtube, and accept cheaper, lower quality video.

As for the regulars, I don’t have the faintest idea of what they need.  Robert, you or the other volunteers probably do–they reach out to you with requests for features, help, etc.  So, I’ll have to rely on you to guide BDNT to serve their needs.

A caution: please don’t turn BDNT into another local, professional social network.  I already have too many ‘networks’.  I also fear that BDNT doesn’t have the mass to avoid being a ghost town.  (How many of those 10k members have only been to one meetup?  How many people who are not recruiters post to the message boards?)  We have all seen digital ghost towns before and they aren’t much fun to be around.  And I don’t want another place to keep a profile up to date–please ask to pull from LinkedIn and StackOverflow all you want, but please don’t make me fill out another skills list.  (I just joined the BDNT LinkedIn group (well, I applied for permission to join) because that’s the right place to do professional social networking.)

I will say that I’ve enjoyed the various experiments I’ve been a part of through BDNT (e.g., the twitter backplane, the non profit hack fest, the map of tech in Colorado).  Robert, if you want to experiment with a social network because of what the regulars or your gut is saying, do so!  Just don’t be surprised if us fly by nighters don’t really participate.  But whatever you do, please don’t stop experimenting.

It is worth asking how BDNT could be better, but, Robert, don’t forget that being ‘only’ the premier technology meetup in Colorado and a place where many, many people come to check in on the tech community, present ideas, meet peers, and learn is quite an achievement.  Ask Ignite and Toastmasters about being ‘just’ a successful presentation organization–it is a success in this world of infinite opportunity and limited attention.

Bask in the glory of creating a successful community.

Finally, for everyone who wasn’t there, some fun facts from the June 4 2013 BDNT:

  • The unemployment rate for software engineers in the USA is 0.2%
  • The New Tech Meetup site code is available on github (no license I could see, however)
  • There was a really cool robot company (Taleus Robotics?  I couldn’t find a website for them) that is selling the computer needed to drive robots for $299; it exposes servos and motors as linux devices.

Gardening and software development

It’s the end of spring/early summer in the northern hemisphere, so it’s time to plan the vegetable garden again. I was putting some tomatoes in the other day and musing about the similarities between gardening and software development. To wit:

  • I have a lot of hesitancy about planting–especially perennials.  It feels so permanent, and I might screw things up, and maybe I should go back to the drawing board, or maybe just do it next weekend….  But just starting makes the problem so much easier–it loses its weight.  Your garden will never be perfect, but an imperfect garden is 100% better than no garden.  Similarly, when confronted with a new project or feature, half the battle is just starting.
  • You will have ample opportunity to make mistakes in both gardening and software development, so feel free to learn from them.  I don’t know where I heard it, but “it’s fine to make mistakes, just try not to make the same ones.”
  • Automate, automate, automate.  The more you can rely on machinery to free you from the drudge of gardening, the more you can rest assured that you will have a great crop.  Similarly, the more you can rely on automated testing and scripts, the more complex you can make systems, and the more freely you can change them.
  • Trying something different is fun.  I planted artichokes this year.  I also played around with easyrec.  I can’t speak for the artichokes yet, but exploring a new tool was interesting and fun.  Look up from your code once in a while and visit hackernews (thanks to Jeff Beard for turning me on to that resource) to find something new to learn about.

I think that many software developers are obsessed with passive income, but gardening is the original passive income stream–food grows for you while you are doing something else!

Testing with Pentaho Kettle – next steps

So, to review, we’ve taken a (very simple) ETL process and written the basic logic, constructed a test case harness around it, built a test suite harness around that test case, and added some logic and a new test case to the suite.  In normal development, you’d continue on, adding more and more test cases and then adding to your core logic to make those test cases pass.

This is the last in a series of blog posts on testing Pentaho Kettle ETL transformations. Past posts include:

Here are some other production-ready ETL testing framework enhancements:

  • use database tables instead of text files for your output steps (both regular and golden), if the main process will be writing to a database.
  • run the tests using kitchen instead of spoon, driven by ant or whatever build system is best for your operation (see the sketch after this list)
  • integrate with a continuous integration system like hudson or jenkins to be aware when changes break the system
  • mock up external resources like database tables and web services calls
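
For the kitchen and continuous integration items above, here is a minimal sketch of a wrapper script a build tool or CI server could call.  The Kettle install location and the .kjb file path are assumptions based on this project’s layout; adjust them for your environment.

    #!/bin/sh
    # Run the test suite job headlessly with kitchen rather than spoon.
    # KETTLE_DIR and the job path below are hypothetical.
    KETTLE_DIR=/opt/pentaho/data-integration
    "$KETTLE_DIR/kitchen.sh" -file=/path/to/project/TestSuiteRunner.kjb -level=Basic
    # kitchen exits non-zero when the job fails, so a CI server such as
    # jenkins or hudson can call this script and mark the build broken
    # when any test fails.
    exit $?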

If you are interested in setting up a test of your ETL processes, here are some tips:

  • use a file based repository, and version your kettle files.  Job and transformation files are XML and don’t diff well, but a file based repository is still far easier to version than a database repository.  You may want to try an XML aware diff tool to help with versioning difficulties (see the sketch after this list).
  • let your testing infrastructure grow with your code–don’t try to write your entire harness in a big upfront effort.
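
On the diff point above, one low-tech approach is to pretty-print both versions of a job or transformation file before comparing them, so formatting noise doesn’t drown out the real change.  The file names here are hypothetical.

    #!/bin/sh
    # Normalize two versions of a kettle transformation, then diff them.
    xmllint --format old/GreetTheWorld.ktr > /tmp/old-formatted.ktr
    xmllint --format new/GreetTheWorld.ktr > /tmp/new-formatted.ktr
    diff -u /tmp/old-formatted.ktr /tmp/new-formatted.ktr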

By the way, testing isn’t cost free.  I went over some of the benefits in this post, but it’s worth examining the costs.  They include:

  • additional time to build the harness
  • hassle when you add fields to the output, because you have to go back and add them to all the test data as well
  • additional thought required to decide what to test
  • running the tests takes time (I have about 35 tests in one of my kettle projects and it can take about 10 minutes to run them all)

However, I still think that for any ETL project of decent size (more than one transformation) or lifespan (around long enough for requirements to evolve), an automated testing approach makes sense.

Unless you can guarantee that business requirements won’t change (and I have news for you, you can’t!), testing can give you the ability to explore data changes and the confidence to make logic changes.

Happy testing!

Sign up for my infrequent emails about pentaho testing.

Testing with Pentaho Kettle – adding new logic

We finally have a working test suite, so let’s break some code.  We have a new requirement that we greet users who are under the age of 30 with ‘howdy’ because that’s how the kids are saying ‘hello’ nowadays.

You just jumped into a series of blog posts on testing ETL transformations written with Pentaho Data Integration. Previous posts have covered:

The first thing we should do is write a test that exercises the logic we are trying to write.  We make a directory with a name descriptive of the behavior we are trying to test, and add a row to the tests.csv driver file pointing to the files in that directory. Here’s how the additional line will look:

agebasedgreeting,agebasedgreeting/input.txt,agebasedgreeting/expected.txt

We will copy over the data files from the first test case we had (simplerun) and modify them to exhibit the expected behavior (a new greeting for users under 30).  We don’t have to modify our input file, since it has people both under 30 and over 30 in it, but just to catch any crazy boundary conditions, we will add someone who is 30 and someone who is 31 (we already have Jane Doe, who is 29).

Then we need to modify the expected output file to reflect the howdyification of the greeting. You can check out both source files on github.

Then we run the tests.

[screenshot: pentaho-failed-test-75]

You can see the failure in the log file that kettle generates and in the build/results directory.  You can also see that we added a job entry to clean up the build directory, so that each time we run the tests we have a clean directory into which to write our output file.

[screenshot: pentaho-failed-test-75]

Now that we have a failing test, we can modify the core logic to make the test pass. Writing the logic is an exercise left to the reader. (Or you could look at the github project :).

We re-run the tests to see if they pass, but it looks like simplerun fails before we can even test agebasedgreeting:

[screenshot: pentaho-failed-test-2-75]

We can do a diff of the expected and output files and see that, whoops, the simplerun testcase had some users who were under 30 and were therefore affected by the logic change.

This points out two features of this method of testing:

  1. Regression testing is built in.
  2. Because of the way we abort tests, the TestSuiteRunner only runs until the first failure.

The easiest way to fix this issue is to inspect output.txt and verify that it is as expected for the simplerun test.  If so, we can simply copy it over to simplerun/expected.txt and use that file as the new golden file.
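
From the command line, that inspection and update might look like this (the directory names are assumptions based on this project’s layout):

    # review the differences, then promote the new output to golden data
    diff src/test/data/simplerun/expected.txt build/results/output.txt
    cp build/results/output.txt src/test/data/simplerun/expected.txt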

We also realize that we are still passing the hello field through to the output.txt file, and that doing so is no longer required.  So we can update the expected.txt in both directories to reflect that.  Running the tests again gives us success.

[screenshot: pentaho-success-75]

Now that we’ve added code, I’ll look at some next steps you can take if you are interested in further testing your ETL processes.

Sign up for my infrequent emails about pentaho testing.

Companies to come out of XOR

I read Startup Communities by Brad Feld a few months ago. I found it to be interesting even for me–someone who is only on the periphery of the VC/startup community in Boulder. I especially enjoyed his first chapter, where he examined the startup history of Boulder, from StorageTek to Celestial Seasonings.

I cut my teeth working as an employee of a startup in Boulder, XOR. We were a consulting company, and I was able to watch, fresh out of college and wet behind the ears, as we went from a small profitable company of 60 to a VC funded agglomeration of 500 employees spread across the country, and through some of the layoffs and consolidation.

I was talking to another XOR employee who co-founded the company I currently work for about companies that spun out of XOR, and thought it’d be fun to collect a list.

To make this list, a company has to meet the following criteria:

  • founded by someone who worked at XOR
  • had at least one employee or two founders–I started a sole proprietorship myself, but it is hard to distinguish between freelancing (which is hard, but not as hard as running a company) and a one-person company

To make the list, a company does not have to still be alive nor profitable–I’m interested in the failures as well as the successes. In addition, it doesn’t matter if the founding happened a few years or jobs after the person worked at XOR–again, I’m interested in lineage, not in direct causation.

Here are the companies I know (including XOR founders where known–there may have been other founders not listed).  In no particular order…

If you know one that is not listed, please contact me and I’ll add your suggestion.

Testing with Pentaho Kettle – the test suite runner

After we can run one test case, the next step is to run a number of test cases at one time.

Heads up, this article is part of a series on testing ETL transformations written with Pentaho Kettle. Previous posts covered:

Running multiple tests lets us exercise the logic in our transformations by adjusting the input and expected output files, which makes it easy to test a number of edge cases.

[screenshot: pentaho-testsuite-runner-75]

First, we need to build a CSV to drive the test cases.  Here is the test list file.  This file is read by a transformation that loads the rows and passes them to the next job entry, the TestCaseRunner we saw in the last post, which is run once for each line in the csv file.  As you can see below, we filter out any rows that start with a #.  This behavior helps immensely when you are developing a new test and don’t want to run all the other tests in your suite (typically because of how long it can take).
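
If you don’t have the linked file handy, the driver file contents look roughly like this (each row is test name, input file, expected output file); the second line is purely illustrative and shows a commented-out test that the filter will skip:

    simplerun,simplerun/input.txt,simplerun/expected.txt
    #anothertest,anothertest/input.txt,anothertest/expected.txt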

[screenshot: pentaho-load-tests-from-file-75]

In order to drive each test case from the rows output by the Load Tests From File transformation, we need to modify the job settings of the TestCaseRunner.  Below we’ve checked the ‘copy previous results to parameters’ checkbox, which takes the result rows loaded from the tests.csv file by the previous transformation and uses them as parameters for the TestCaseRunner job.  We also checked the ‘execute for every input row’ checkbox, which will execute the testcase once for each row.  This lets us add a new test by adding a line to the file.

[screenshot: pentaho-testsuite-runner-drive-testcase-75]

Obviously, taking these parameters requires modifications to the TestCaseRunner job.  Rather than have the input.file and expected.file variables hardcoded as we did previously, we need to take them as parameters:

[screenshot: pentaho-testcase-runner-modified-setvars-75]

We also pass a test.name parameter so that we can distinguish between tests that fail and those that succeed.  In addition, we create a directory for test results that we don’t delete after the test suite is run, and we output a success or failure marker file after each test is run.

[screenshot: pentaho-testcase-runner-modified-for-suite-75]

You can run the TestSuiteRunner job in Spoon by hitting the play button or f9.

As a reminder, I’ll be publishing another installment of this tutorial in a couple of days–we’ll cover how to add new logic.  But if you can’t wait, the full code for the Pentaho ETL Testing Example is on github.

Sign up for my infrequent emails about pentaho testing.

Testing with Pentaho Kettle – the test case runner

Now that we have our business logic, we need to build a test case that can exercise that logic.

FYI, this article is part of a series. Previous posts covered:

First, build out a job that looks almost like our regular job, but has a few extra steps.  Below I’ll show you screen captures from spoon as we build out this test case runner, but you can view the complete set of code on github.

[screenshot: pentaho-testcase-runner-75]

It sets some variables for input, output and expected files.  You can see below that we also set a base.job.dir variable which is used as a convenience elsewhere in the TestCaseRunner (for pulling in sample data, for example).
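
For readers without the screenshot, the variables set in this step are along these lines.  The exact values are assumptions based on this project’s layout, but the names match the ones used throughout the TestCaseRunner:

    input.file    = ${base.job.dir}/src/test/data/simplerun/input.txt
    expected.file = ${base.job.dir}/src/test/data/simplerun/expected.txt
    output.file   = ${base.job.dir}/build/results/output.txt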

[screenshot: pentaho-testrunner-variables-75]

The job also creates a temp directory for output files, then calls the two transformations that are at the heart of our business logic.  After that, the TestCaseRunner compares the output and expected files, and signals either success or failure.

To make the business logic transformations testable, we have to be able to inject test files for processing. At the same time, in the main job/production, we obviously want to process real data. The answer is to modify the transformations to read the file to process from named parameters.  We do this on both the job entry screen:

[screenshot: pentaho-parameter-job-screen-75]

and on the transformation settings screen:

[screenshot: pentaho-parameter-transformation-screen-75]

We also need to make sure the main GreetFolks job passes the needed parameters into the updated transformations.

Once these parameters are passed to the transformations, you need to modify the steps inside to use the parameters rather than hardcoded values.  Below we show the modified Text File Input step in the Load People To Greet transformation.

[screenshot: pentaho-parameter-text-input-file-75]
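
If the screenshot isn’t visible, the essence of the change is that the file name field in the Text File Input step now references the named parameter rather than a hardcoded path, using Kettle’s variable syntax (the field label below is from memory, so double-check it in your version of spoon):

    File or directory:  ${input.file}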

The input and expected files are added to our project in the src/test/data directory and are placed under source control.  These are the data sets we vary to test interesting conditions in our code.  The output file is sent to a temporary directory.

So, now we can run this single test case in spoon and see if our expected values match the output values.  You can see from the logfile below that this particular run was successful.

[screenshot: pentaho-testcase-runner-success-75]

The compare step at the end is our ‘assert’ statement.  In most cases, it will be comparing two files: the expected output file (also called the ‘golden’ file) and the output of the transformation.  The File Compare job step works well if you are testing a single file.  If the comparison is between two database tables, you can use a Merge Rows step and fail if all rows aren’t identical.

You can run the TestCaseRunner job in spoon by hitting the play button or f9.

Next time we will look at how to run multiple tests via one job.

Sign up for my infrequent emails about pentaho testing.

Testing with Pentaho Kettle – business logic

So, the first step in building the test harness is to create a skeleton of the transformations we will need to run.  These transforms contain the business logic of your ETL process.

Pssssst. This article is part of a series.  Previous posts covered:

Typically, I find that my processing jobs break down into 4 parts:

  • setup (typically job entries)
  • loading data to a stream (extract)
  • processing that data (transform)
  • saving that data to a persistent datastore (load)

Often, I combine the last two steps into a single transformation.

So, for this sample project (final code is here), we will create a couple of transformations containing business logic.  (All transformations are built using Spoon on Windows with Pentaho Data Integration version 4.4.0.)

The business needs to greet people appropriately, so our job will take a list of names and output that same list with a greeting customized for each person.  This is the logic we are going to be testing.
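
Purely as an illustration (the real file formats are in the github project), a row passing through this job might look something like this before and after the greeting is added:

    input row:   Jane Doe,29
    output row:  Jane Doe,29,Hello Jane Doe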

First, the skeleton of the code that takes our input data and adds a greeting.  This transformation is called ‘Greet The World’.

[screenshot: pentaho-basic-logic-75]

I also created a ‘Load People to Greet’ transformation that is just a text file input step and a copy rows to results step.

[screenshot: pentaho-basic-logic-load-75]

The last piece is the ‘GreetFolks’ job, which merely strings together these two transformations.  This would be the real job that would be run regularly to serve the business’s needs.

[screenshot: pentaho-basic-logic-job-75]

This logic is not complicated, but could grow to be quite complex.  Depending on the data we are being passed in, we could grow the logic in the ‘Greet The World’ transformation to be quite complex–the variety of greetings could depend on the time of year, any special holidays happening, the gender or age or occupation of the person, etc, etc.

Astute observers may note that I didn’t write a test first.  The reason for this is that getting the test harness right before you write these skeletons is hard.  It’s easier to write the simplest skeleton, add a test to it, and then, for all future development, write a failing test first.

As a reminder, I’ll be publishing another installment of this tutorial in a couple of days.  But if you can’t wait, the full code is on github.

Sign up for my infrequent emails about pentaho testing.

Testing with Pentaho Kettle – current options

Before we dive into writing a custom test suite harness, it behooves us to look around and see if anyone else has solved the problem in a more general fashion.  This question has been asked in the kettle forums before as well.

This article is part of a series.  Here’s the first part, explaining the benefits of automated testing for ETL jobs , and the second, talking about what parts of ETL processes to test.

Below are the options I was able to find.  (If you know of any others, let me know and I’ll update this list.)

Other options outlined on a StackOverflow question include using DBUnit to populate databases.

A general purpose framework for testing ETL transformations runs into a few hindrances:

  • it is easy to have side effects in a transformation, and in general transformations are a higher level of abstraction than java classes (which is why we can be more productive using them)
  • inputs and outputs differ for every transform
  • correctness is a larger question than the set of assert statements a unit testing framework provides can answer

As we build out a custom framework for testing, we’ll follow these principles:

  • mock up outside data sources as CSV files
  • break apart the ETL process into a load and a transform process
  • use golden data that we know to be correct as our “assert” statements

As a reminder, I’ll be publishing another installment in a couple of days.  But if you can’t wait, the full code is on github.

Sign up for my infrequent emails about pentaho testing.

Testing with Pentaho Kettle – what to test

This article is part of a series.  Here’s the first part, explaining the benefits of automated testing for ETL jobs.

Creating a test suite takes effort, especially since you have to manually create a test harness for each type of transformation.  So, what should you test?

You should test ETL code that is:

  • complex
  • likely to change over time
  • key to what you are doing
  • likely to fail in subtle ways

So, for instance, I don’t test code that loads data from a file.  I do test business logic.  I don’t test code that reads from a database or writes to a database.  I do test anything that has a Filter rows step in it.  I don’t test connectivity to needed resources, because I think a failure there would be spectacular enough that our ops team will notice.  I do test anything I think might change in the future.

It’s a balancing act, and choosing what to test or not to test can become an excuse for not testing at all.

So, if this decision is overwhelming, but you want to try automated testing, pick a transform with logic that you currently maintain, refactor it to accept input from a Get rows from result step (or if your dataset is large enough that you get OutOfMemory errors with this step, serialize/de-serialize the data) and wrap it with a test suite.  When you think of another “interesting” set of data, add that to the suite. See if this gives you more confidence to change the transformation in question.

In the next post, we’ll start building out such a testing framework.

Sign up for my infrequent emails about pentaho testing.