Testing with Pentaho Kettle – next steps

So, to review, we’ve taken a (very simple) ETL process and written the basic logic, constructed a test case harness around it, built a test suite harness around that test case, and added some logic and a new test case to the suite.  In normal development, you’d continue on, adding more and more test cases and then adding to your core logic to make those test cases pass.

This is the last in a series of blog posts on testing Pentaho Kettle ETL transformations. Past posts include:

Here are some other enhancements that help make an ETL testing framework production ready.

  • use database tables instead of text files for your output steps (both regular and golden), if the main process will be writing to a database.
  • run the tests using kitchen instead of spoon, using ant or whatever build system is best for your operation
  • integrate with a continuous integration system like Hudson or Jenkins so that you find out when changes break the system
  • mock up external resources like database tables and web services calls
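To make the kitchen suggestion a little more concrete, here is a minimal Python sketch of how a build script might launch the test suite job from the command line. The install path and the job file location are hypothetical, and the function names are mine. `-file`, `-level` and `-param:` are Kitchen command line options, but check them against your version of Pentaho Data Integration before relying on this.

```python
import subprocess


def build_kitchen_command(kitchen, job_file, params=None, level="Basic"):
    """Build the argument list for running a Kettle job file with Kitchen."""
    command = [str(kitchen), f"-file={job_file}", f"-level={level}"]
    # Named parameters are passed as -param:NAME=VALUE, sorted for repeatability.
    for name, value in sorted((params or {}).items()):
        command.append(f"-param:{name}={value}")
    return command


def run_job(kitchen, job_file, params=None):
    """Run the job and report whether Kitchen exited with status 0."""
    completed = subprocess.run(build_kitchen_command(kitchen, job_file, params))
    return completed.returncode == 0


if __name__ == "__main__":
    # Hypothetical locations; adjust to your own install and project layout.
    print(build_kitchen_command(
        "/opt/pentaho/data-integration/kitchen.sh",
        "src/test/jobs/TestSuiteRunner.kjb",
    ))
```

A continuous integration server can then fail the build whenever `run_job` returns False.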

If you are interested in setting up a test of your ETL processes, here are some tips:

  • use a file based repository, and version your kettle files.  Being XML, job and transformation files don’t handle diffs well, but a file based repository is still far easier to version than a database repository. You may want to try an XML aware diff tool to help with versioning difficulties.
  • let your testing infrastructure grow with your code–don’t try to write your entire harness in a big upfront effort.

By the way, testing isn’t cost free.  I went over some of the benefits in this post, but it’s worth examining the costs.  They include:

  • additional time to build the harness
  • hassle when you add fields to the output, because you have to go back and add them to all the test data as well
  • additional thought required to decide what to test
  • running the tests takes time (I have about 35 tests in one of my kettle projects and it can take about 10 minutes to run them all)

However, I still think that an automated testing approach makes sense for any ETL project of decent size (more than one transformation) or one that will be around for a while (any time long enough to evolve).

Unless you can guarantee that business requirements won’t change (and I have news for you, you can’t!), testing can give you the ability to explore data changes and the confidence to make logic changes.

Happy testing!

Signup for my infrequent emails about pentaho testing.

Testing with Pentaho Kettle – adding new logic

We finally have a working test suite, so let’s break some code.  We have a new requirement that we greet users who are under the age of 30 with ‘howdy’ because that’s how the kids are saying ‘hello’ nowadays.

You just jumped into a series of blog posts on testing ETL transformations written with Pentaho Data Integration. Previous posts have covered:

The first thing we should do is write a test that exercises the logic we are trying to write.  We make a directory with a name descriptive of the behavior we are trying to test, and add a row to the tests.csv driver file pointing to the files in that directory. Here’s how the additional line will look:


And we will copy over the data files from the first test case we had (simplerun) and modify them to exhibit the expected behavior (a new greeting for users under 30). We don’t have to modify the input file, since it has people both under 30 and over 30 in it, but just to catch any crazy boundary conditions, we will add someone who is 30 and someone who is 31 (we already have Jane Doe, who is 29).

Then we need to modify the expected output file to reflect the howdyification of the greeting. You can check out both source files on github.
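To see why those three ages matter, here is the rule written out as a small Python sketch. This is not the Kettle transformation itself, just a stand-in for the logic the new test pins down. I'm assuming the default greeting is ‘hello’ and that "under 30" means strictly less than 30; the function name is made up for illustration.

```python
def greeting_for(age):
    """Return the greeting for a person of the given age.

    Assumption: 'hello' is the default greeting, and only people
    strictly under 30 get 'howdy'.
    """
    return "howdy" if age < 30 else "hello"


if __name__ == "__main__":
    # The boundary ages in the test data: 29 (Jane Doe), 30 and 31.
    for age in (29, 30, 31):
        print(age, greeting_for(age))
```

If the comparison were accidentally written as `<=`, the person who is exactly 30 would be the only row in the expected file to catch it.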

Then we run the tests.


You can see the failure in the log file that kettle generates and in the build/results directory.  You can also see that we added a job entry to clean up the build directory so that when we run tests each time, we have a clean directory into which to write our output file.


Now that we have a failing test, we can modify the core logic to make the test pass. Writing the logic is an exercise left to the reader. (Or you could look at the github project. :)

We re-run the tests to see if they pass, but it looks like simplerun fails before we can even test agebasedgreeting:


We can do a diff of the expected and output files and see that, whoops, the simplerun testcase had some users that were under 30 and affected by the logic change.
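Any diff tool will do here. If you want something scriptable, this is a minimal sketch using Python's standard `difflib` module. The row format in the example is made up for illustration and won't match the real data files.

```python
import difflib


def diff_lines(expected_lines, output_lines):
    """Return a unified diff between the golden rows and the actual rows."""
    return list(difflib.unified_diff(
        expected_lines,
        output_lines,
        fromfile="expected.txt",
        tofile="output.txt",
        lineterm="",
    ))


if __name__ == "__main__":
    # Hypothetical rows: the golden file predates the 'howdy' change.
    expected = ["Jane Doe,29,hello", "John Smith,45,hello"]
    output = ["Jane Doe,29,howdy", "John Smith,45,hello"]
    for line in diff_lines(expected, output):
        print(line)
```

An empty result means the two files agree. Anything else shows exactly which rows the logic change touched.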

This points out two features of this method of testing:

  1. Regression testing is built in.
  2. Because of the way we abort tests, TestSuiteRunner only runs until our first failure.

The easiest way to fix this issue is to inspect output.txt and verify that it is as expected for the simplerun test.  If so, we can simply copy it over to simplerun/expected.txt and use that file as the new golden table.

We also realize that we are passing in the hello field to the output.txt file and that doing so is no longer required.  So we can update the expected.txt in both directories to reflect that.  Running the tests again gives us success.


Now that we’ve added code, I’ll look at some next steps you can take if you are interested in further testing your ETL processes.

Companies to come out of XOR

I read Startup Communities by Brad Feld a few months ago. I found it to be interesting even for me–someone who is only on the periphery of the VC/startup community in Boulder. I especially enjoyed his first chapter, where he examined the startup history of Boulder, from StorageTek to Celestial Seasonings.

I cut my teeth working as an employee of a startup in Boulder, XOR. We were a consulting company, and I was able to watch, fresh out of college and wet behind the ears, as we went from a small profitable company of 60 to a VC funded agglomeration of 500 employees spread across the country, and through some of the layoffs and consolidation.

I was talking to another XOR employee who co-founded the company I currently work for about companies that spun out of XOR, and thought it’d be fun to collect a list.

To make this list, you have to meet the following criteria:

  • founded by someone who worked at XOR
  • had at least one employee or two founders–I started a sole proprietorship myself, but it is hard to distinguish between freelancing (which is hard, but not as hard as a company) and a one person company

To make the list, a company does not have to still be alive nor profitable–I’m interested in the failures as well as the successes. In addition, it doesn’t matter if the founding happened a few years or jobs after the person worked at XOR–again, I’m interested in lineage, not in direct causation.

Here are the companies I know (including XOR founders where known–there may have been other founders not listed).  In no particular order…

If you know one that is not listed, please contact me and I’ll add your suggestion.

Testing with Pentaho Kettle – the test suite runner

After we can run one test case, the next step is to run a number of test cases at one time.

Heads up, this article is part of a series on testing ETL transformations written with Pentaho Kettle. Previous posts covered:

Running multiple tests allows us to exercise logic in transformations by adjusting the input and expected output files. This allows you to test a number of edge cases easily.


First, we need to build a CSV to drive the test cases.  Here is the test list file.  This file is read by a transformation that loads the rows and passes them to the next job entry.  The next job entry is the TestCaseRunner we saw in the last post, which is executed once for each line in the CSV file.  As you can see below, we filter any rows that start with a #.  This behavior helps immensely when you are developing a new test, and don’t want to run all the other tests in your suite (typically because of how long it can take).
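The comment filtering is simple enough to show outside of Kettle. Here is a rough Python sketch of what the loading transformation does. The column layout of the driver file in the example is an assumption, so the sketch returns the raw rows rather than naming any fields.

```python
import csv
import io


def load_test_cases(csv_text):
    """Return the rows of the driver file, skipping blank and '#' rows."""
    cases = []
    for row in csv.reader(io.StringIO(csv_text)):
        # A row whose first field starts with '#' is a disabled test.
        if not row or row[0].lstrip().startswith("#"):
            continue
        cases.append(row)
    return cases


if __name__ == "__main__":
    # Hypothetical driver file contents; the real column layout may differ.
    sample = (
        "#simplerun,simplerun/input.txt,simplerun/expected.txt\n"
        "agebasedgreeting,agebasedgreeting/input.txt,agebasedgreeting/expected.txt\n"
    )
    for case in load_test_cases(sample):
        print(case)
```

Commenting out every row except the one you are working on gives you a one-test suite without touching the job itself.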


In order to drive each test case from the rows output by the Load Tests From File transformation, we need to modify the job settings of the TestCaseRunner.  Below we’ve checked the ‘copy previous results to parameters’ checkbox, which takes the rows loaded from the tests.csv file by the previous transformation and uses them as parameters for the TestCaseRunner job.  We also checked the ‘execute for every input row’ checkbox, which will execute the test case once for each row. This lets us add a new test by adding a line to the file.


Obviously, taking these parameters requires modifications to the TestCaseRunner job.  Rather than have the input.file and expected.file variables hardcoded as we did previously, we need to take them as parameters:


We also pass a parameter, so that we can distinguish between tests that fail and those that succeed.  We also create a directory for test results that we don’t delete after the test suite is run, and output a success or failure marker file after a test is run.
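If it helps to see the control flow in one place, here is a Python sketch of the same idea: run each test case, leave a marker file behind, and stop at the first failure. The marker file naming and the `run_case` callback are hypothetical. In the real project this work is done by Kettle job entries, not by a script.

```python
import tempfile
from pathlib import Path


def record_result(results_dir, test_name, passed):
    """Write an empty marker file named after the test and its outcome."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    suffix = "success" if passed else "failure"
    marker = results_dir / f"{test_name}.{suffix}"
    marker.write_text("")
    return marker


def run_suite(test_names, run_case, results_dir):
    """Run test cases in order, stopping at the first failure."""
    for name in test_names:
        passed = run_case(name)
        record_result(results_dir, name, passed)
        if not passed:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as results:
        # A stand-in test case that always passes.
        ok = run_suite(["simplerun"], lambda name: True, results)
        print(ok, sorted(p.name for p in Path(results).iterdir()))
```

Because the results directory survives the run, you can tell afterwards which tests ran and how each one ended.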


You can run the TestSuiteRunner job in Spoon by hitting the play button or F9.

As a reminder, I’ll be publishing another installment of this tutorial in a couple of days–we’ll cover how to add new logic.  But if you can’t wait, the full code for the Pentaho ETL Testing Example is on github.

Testing with Pentaho Kettle – the test case runner

Now that we have our business logic, we need to build a test case that can exercise that logic.

FYI, this article is part of a series. Previous posts covered:

First, build out a job that looks almost like our regular job, but has a few extra steps. Below I’ll show you screen captures from spoon as we build out the business logic, but you can view the complete set of code on github.


It sets some variables for input, output and expected files.  You can see below that we also set a base.job.dir variable which is used as a convenience elsewhere in the TestCaseRunner (for pulling in sample data, for example).


The job also creates a temp directory for output files, then calls the two transformations that are at the heart of our business logic.  After that, the TestCaseRunner compares the output and expected files, and signals either success or failure.

To make the business logic transformations testable, we have to be able to inject test files for processing. At the same time, in the main job/production, we obviously want to process real data. The answer is to modify the transformations to read the file to process from named parameters.  We do this on both the job entry screen:


and on the transformation settings screen:


We also need to make sure to change the main GreetFolks job to pass the needed parameters into the updated transformations.

Once these parameters are passed to the transformations, you need to modify the steps inside to use the parameters rather than hardcoded values.  Below we show the modified Text File Input step in the Load People To Greet transformation.


The input and expected files are added to our project in the src/test/data directory and are placed under source control.  These are the data sets we vary to test interesting conditions in our code.  The output file is sent to a temporary directory.

So, now we can run this single test case in spoon and see if our expected values match the output values.  You can see from the logfile below that this particular run was successful.


The compare step at the end is our ‘assert’ statement.  In most cases, it will be comparing two files: the expected output file (also called ‘golden’) and the output of the transformation.  The File Compare job entry works well if you are testing a single file.  If the comparison is between two database tables, you can use a Merge Rows step and fail if any rows differ.
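For comparison, here is roughly what that ‘assert’ looks like in plain Python. `files_match` plays the role of the file comparison, and `rows_match` is a simplified stand-in for the table case. It is not how the Merge Rows step works internally, and both function names are mine.

```python
import filecmp
import tempfile
from pathlib import Path


def files_match(expected_path, output_path):
    """Compare the golden file and the output file byte for byte."""
    return filecmp.cmp(expected_path, output_path, shallow=False)


def rows_match(expected_rows, output_rows):
    """Compare two sets of rows, ignoring order; any difference fails."""
    return sorted(expected_rows) == sorted(output_rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        expected = Path(tmp) / "expected.txt"
        output = Path(tmp) / "output.txt"
        # Hypothetical file contents, for illustration only.
        expected.write_text("Jane Doe,howdy\n")
        output.write_text("Jane Doe,howdy\n")
        print(files_match(expected, output))
```

Note that a byte for byte comparison is strict: a trailing newline or a different row order in the output is enough to fail the test.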

You can run the TestCaseRunner job in Spoon by hitting the play button or F9.

Next time we will look at how to run multiple tests via one job.
