Pentaho Kettle (more formally known as Pentaho Data Integration) is an ETL tool for working with large amounts of data. I have found it a great solution for building data loaders that integrate external data sources: I've used it to pull data from remote databases, flat files and web services, munge that data, and then push it into a local data store (typically a SQL database).
However, transforming data can be complex. This is especially true when the transformation process builds up over time: new business rules come into play, exceptions are made, and the process slowly grows more and more complex. This is true of all software, but data transformation has a built-in complexity and a time component that other kinds of software can minimize.
This complexity in turn leads to a fear of change: the data transformation becomes brittle. You have to spend a lot of time thinking through changes, and testing them becomes a larger and larger effort because you need to cover all the edge cases. In some situations, you may want to let your transform run for a couple of days in a staging environment (you have one of those, right?) to see what effect changes to the processing have on the data.
What is the antidote for that fear? Automated testing. A test harness for Kettle has four components:
- First, the logic you are testing. This is encapsulated in a job or transform. It should take data in, transform it, and output it. If you depend on databases, files or other external resources for input, mock them up with text files and the Get rows from result step. Depending on your output, you may want to mock up some test tables or other resources as well.
- Second, the test case runner. This is a parameterized job that sets up the environment in which your logic runs, including passing in the needed input data. It should also check that the output is as expected, and succeed or fail accordingly.
- Third, the test suite runner. This takes a list of tests, sets up parameters and passes them to the test case runner. This is the job that is run using kitchen.sh or kitchen.bat. You will need a separate test case and test suite runner for each type of logic you are testing.
- Finally, some way to run the job from the command line so that it can be automated. Integration into a CI environment like Hudson or Jenkins is highly recommended.
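To make the last step concrete, here is a minimal sketch of a CI build script that invokes a test suite runner job with Kitchen. The installation path, job file name, and parameter name are assumptions for illustration; only the kitchen.sh command and its -file, -param, and -level options come from Kettle itself.

```shell
#!/bin/sh
# Run the test suite runner job via Kitchen.
# KITCHEN path, the .kjb file, and TEST_LIST are hypothetical examples.
KITCHEN=${KITCHEN:-/opt/pentaho/data-integration/kitchen.sh}

if [ ! -x "$KITCHEN" ]; then
  echo "kitchen.sh not found at $KITCHEN" >&2
  exit 1
fi

"$KITCHEN" \
  -file=tests/test_suite_runner.kjb \
  -param:TEST_LIST=tests/test_cases.csv \
  -level=Basic

status=$?
if [ $status -ne 0 ]; then
  echo "Kettle test suite FAILED (exit $status)"
fi
exit $status
```

Because Kitchen exits non-zero when the job fails, a CI server like Jenkins can gate the build on this script's exit code with no extra plumbing.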
It can be time consuming and frustrating to set up these test harnesses, because you are essentially setting up a separate job to run your logic, and therefore doing twice the work. In addition, true unit testing, like the frameworks mentioned above, is impossible with Kettle due to the way columns are treated: if you modify your data structure, you have to make those changes for all the tests. However, setting up automated testing will save you time in the long run because:
- the speed at which you can investigate “interesting” data (aka data that breaks the system) is greatly increased, as is your control (“I want to see what happens if I change this one field” becomes a question you can ask)
- regression tests become far easier
- if you run into weird data and need to add special case logic, you can also add a test to ensure that this logic hangs around
- you can test edge cases (null values, large fields, letters where numbers are expected) without running the entire job
- you can mimic the passage of time by varying inputs, rather than waiting for it
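As a sketch of the edge-case point above, a test input file can encode the problem rows directly. The file name and column layout (id, amount, name) are hypothetical; the idea is simply that the test case runner feeds this file to the logic under test instead of the real data source.

```shell
#!/bin/sh
# Build a hypothetical edge-case input file for the test case runner.
# Row 1: null value; row 2: normal row; row 3: letters where a number belongs.
cat > edge_cases.csv <<'EOF'
id,amount,name
1,,null amount
2,19.99,normal row
3,abc,letters where a number is expected
EOF

# Show what the transform under test will receive.
cat edge_cases.csv
```

Checking a handful of rows like these takes seconds, versus re-running the full job against production-sized data to see how it handles bad input.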
I hope I’ve convinced you to consider looking at testing for Pentaho Kettle. In the next couple of posts I will examine various facets of testing with Pentaho Kettle, including building out code for a simple test.
Sign up for my infrequent emails about Pentaho testing.