We finally have a working test suite, so let’s break some code. We have a new requirement that we greet users who are under the age of 30 with ‘howdy’ because that’s how the kids are saying ‘hello’ nowadays.
You just jumped into a series of blog posts on testing ETL transformations written with Pentaho Data Integration. Previous posts have covered:
- the benefits of automated testing for ETL jobs
- what parts of ETL processes to test
- current options and frameworks for testing Kettle
- writing testable business logic
- running one test using TestCaseRunner
- running multiple tests using TestSuiteRunner
The first thing we should do is write a test that exercises the logic we are trying to write. We make a directory with a name descriptive of the behavior we are trying to test, and add a row to the tests.csv driver file pointing to the files in that directory. Here’s how the additional line will look:
Next, we copy over the data files from our first test case (simplerun) and modify them to exhibit the expected behavior (a new greeting for users under 30). Strictly speaking, we don’t have to modify the input file, since it already contains people both under and over 30. But to catch any crazy boundary conditions, we will add someone who is 30 and someone who is 31 (we already have Jane Doe, who is 29).
Then we need to modify the expected output file to reflect the howdyification of the greeting. You can check out both source files on GitHub.
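To make the boundary cases concrete, here is a minimal Python sketch of the rule the new test is checking. It is only an illustration: the real logic lives in the Kettle transformation, and the function name `greeting_for` and the ‘hello’ fallback for everyone 30 and over are assumptions made for this example.

```python
# Illustrative sketch only: the real rule lives in the Kettle transformation.
# greeting_for and the "hello" fallback are assumptions for this example.

def greeting_for(age: int) -> str:
    """Return 'howdy' for users under 30, otherwise 'hello'."""
    return "howdy" if age < 30 else "hello"


# Boundary ages from the test data: 29 (Jane Doe), 30, and 31.
boundary_cases = {29: "howdy", 30: "hello", 31: "hello"}

for age, expected in boundary_cases.items():
    assert greeting_for(age) == expected, (age, expected)
```

The person who is exactly 30 is the interesting row: an off-by-one mistake (`<=` instead of `<`) would only show up there.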
Then we run the tests.
You can see the failure in the log file that Kettle generates and in the build/results directory. You can also see that we added a job entry to clean up the build directory, so that every test run starts with a clean directory into which to write our output file.
Now that we have a failing test, we can modify the core logic to make the test pass. Writing the logic is an exercise left to the reader. (Or you could look at the GitHub project :).
We re-run the tests to see if they pass, but it looks like simplerun fails before we can even test agebasedgreeting:
We can do a diff of the expected and output files and see that, whoops, the simplerun testcase had some users that were under 30 and affected by the logic change.
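Any diff tool will do here. If you would rather script the comparison, a rough sketch using Python’s standard `difflib` module might look like the following. The rows and the semicolon-separated layout are made up for illustration and are not the actual contents of the simplerun files.

```python
import difflib


def diff_lines(expected: list[str], actual: list[str]) -> list[str]:
    """Return a unified diff between two lists of lines (no trailing newlines)."""
    return list(
        difflib.unified_diff(
            expected,
            actual,
            fromfile="expected.txt",
            tofile="output.txt",
            lineterm="",
        )
    )


# Made-up rows for illustration; the real files are in the GitHub project.
expected = ["Jane Doe;29;hello", "John Roe;45;hello"]
actual = ["Jane Doe;29;howdy", "John Roe;45;hello"]

for line in diff_lines(expected, actual):
    print(line)
```

Lines starting with `-` come from the expected file and lines starting with `+` come from the actual output, so a changed greeting shows up as a `-`/`+` pair for the same user.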
This points out two features of this method of testing:
- Regression testing is built in.
- Because of the way we abort tests, TestSuiteRunner only runs until the first failure.
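The second point is a trade-off worth seeing side by side. The sketch below is not how TestSuiteRunner is built (that is a Kettle job); it is a hypothetical Python stand-in that contrasts stopping at the first failure with running every test regardless. The function names and the hard-coded pass/fail results are assumptions for illustration.

```python
from typing import Callable

TestTable = dict[str, Callable[[], bool]]


def run_fail_fast(tests: TestTable) -> list[tuple[str, bool]]:
    """Run tests in order and stop at the first failure."""
    results = []
    for name, test in tests.items():
        passed = test()
        results.append((name, passed))
        if not passed:
            break  # later tests never run
    return results


def run_all(tests: TestTable) -> list[tuple[str, bool]]:
    """Run every test, recording each result."""
    return [(name, test()) for name, test in tests.items()]


# Hard-coded outcomes that mimic the situation described above.
tests = {
    "simplerun": lambda: False,  # the regression breaks this one
    "agebasedgreeting": lambda: True,
}

print(run_fail_fast(tests))
print(run_all(tests))
```

Failing fast keeps the log short and points straight at the first problem, at the cost of hiding the status of every test after it.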
The easiest way to fix this issue is to inspect output.txt and verify that it is as expected for the simplerun test. If so, we can simply copy it over to simplerun/expected.txt and use that file as the new golden table.
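If you find yourself promoting output files to golden files often, the copy step is easy to script. Here is a small sketch, assuming you have already inspected the output by hand. The helper name `promote_to_golden` and the file contents are made up, and the demo runs in a temporary directory rather than the real test directories.

```python
import shutil
import tempfile
from pathlib import Path


def promote_to_golden(output: Path, expected: Path) -> None:
    """Overwrite the expected file with an output file that has been reviewed."""
    shutil.copyfile(output, expected)


# Demonstrate in a throwaway directory; paths and contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp) / "output.txt"
    expected = Path(tmp) / "expected.txt"
    output.write_text("Jane Doe;29;howdy\n")
    expected.write_text("Jane Doe;29;hello\n")

    promote_to_golden(output, expected)
    promoted = expected.read_text()
```

The manual inspection is the important part: copying an unreviewed output file over the golden file would make the test pass while proving nothing.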
We also realize that we are still passing the hello field through to output.txt, and that doing so is no longer required. So we can update expected.txt in both directories to reflect that. Running the tests again gives us success.
Now that we’ve added code, I’ll look at some next steps you can take if you are interested in further testing your ETL processes.