Testing with Pentaho Kettle – what to test

This article is part of a series.  Here’s the first part, explaining the benefits of automated testing for ETL jobs.

Especially since you have to manually create a test harness for each type of transformation, it is an effort to create a testsuite.  So, what should you test?

You should test ETL code that is:

  • complex
  • likely to change over time
  • key to what you are doing
  • will fail in subtle ways

So, for instance, I don’t test code that loads data from a file.  I do test business logic.  I don’t test code that reads from a database or writes to a database.  I do test anything that has a Filter rows step in it.  I don’t test connectivity to needed resources, because I think a failure there would be spectacular enough that our ops team will notice.  I do test anything I think might change in the future.

It’s a balancing act, and choosing what to test or not to test can become an excuse for not testing at all.

So, if this decision is overwhelming, but you want to try automated testing, pick a transform with logic that you currently maintain, refactor it to accept input from a Get rows from result step (or if your dataset is large enough that you get OutOfMemory errors with this step, serialize/de-serialize the data) and wrap it with a test suite.  When you think of another “interesting” set of data, add that to the suite. See if this gives you more confidence to change the transformation in question.

In the next post, we’ll start building out such a testing framework.

Signup for my infrequent emails about pentaho testing.


Testing with Pentaho Kettle/PDI

Pentaho Kettle (more formally called Pentaho Data Integration) is an ETL tool for working with large amounts of data.  I have found it a great solution for building data loaders to integrate external data sources into .  I’ve used it to pull data from remote databases, flat files and web services, and munge that data, and then push it into a local data store (typically a SQL database).

However, transforming data can be complex.  This is especially true when the transformation process builds up over time–new business rules come into play, exceptions are made, and the data transformation process slowly becomes more and more complex.  This is true of all software, but data transformation has built in complexity and a time component that other software processes can minimize.

This complexity in turn leads to a fear of change–the data transformation becomes fragile and brittle.  You have to spend a lot of time thinking about changes.  Testing such changes becomes a larger and larger effort, because you need to cover all the edge cases.  In some situations, you may want to let your transform run for a couple of days in a staging environment (you have one of those, right?) to see what effect changes to the processing have on the data.

What is the antidote for that fear?  Automated testing!

While automated testing for Pentaho Kettle is not as easy as using Junit or Ruby on Rails, it can be done.  There are four major components.

  • First, the logic you are testing.  This is encapsulated in a job or transform.  It should take data in, transform it and then output it.  If you depend on databases, files or other external resources for input, mock them up using text files and the Get rows from result step.  Depending on your output, you may want to mock up some test tables or other resources.
  • Second, the test case runner.  This is a job that is parameterized and sets up the environment in which your logic runs, including passing the needed input data.  It also should check that the output is expected, and succeed or fail based on that.
  • Third, the test suite runner.  This takes a list of tests, sets up parameters and passes them to the test case runner.  This is the job that is run using kitchen.sh or kitchen.bat.  You will need a separate test case and test suite runner for each type of logic you are testing.
  • Finally, some way to run the job from the command line so that it can be automated.  Integration into a CI environment like Hudson or Jenkins is highly recommended.

It can be time consuming and frustrating to set up these test harnesses, because you are essentially setting up a separate job to run your logic and therefore doing twice the work.  In addition, true unit testing, like the frameworks mentioned above, is impossible with Kettle due the way columns are treated–if you modify your data structure, you have to make those changes for all the tests.  However, setting up automated testing will save you time in the long run because:

  • the speed at which you can investigate “interesting” data (aka data that breaks the system) is greatly increased, as is your control (“I want to see what happens if I change this one field” becomes a question you can ask)
  • regression tests become far easier
  • if you run into weird data and need to add special case logic, you can also add a test to ensure that this logic hangs around
  • you can test edge cases (null values, large fields, letters instead of where numbers are) without running the entire job
  • you can mimic time lag without waiting by varying inputs

I hope I’ve convinced you to consider looking at testing for Pentaho Kettle.  In the next couple of posts I will examine various facets of testing with Pentaho Kettle, including building out code for a simple test.

Signup for my infrequent emails about pentaho testing.


Running async tests with qunit and sinon

I ran into an issue running async tests with qunit and sinon.  For a primer on doing this, see this great article.

Basically, the asyncTest never returned.  This manifested itself in the html view of the tests like this (left is success, right is where the test never returns–note the white):

This gallery contains 2 photos.

Turns out that sinon-qunit adapter has sinon fake the browser timer, so setTimeout doesn’t work as expected.

To fix, just turn off the timer faking for a single test: this.clock.restore() or sinon.config.useFakeTimers = false to disable this for all tests.



© Moore Consulting, 2003-2019