Before we dive into writing a custom test suite harness, it behooves us to look around and see if anyone else has solved the problem in a more general fashion.  This question has been asked in the Kettle forums before as well.

This article is part of a series.  Here’s the first part, explaining the benefits of automated testing for ETL jobs, and the second, talking about what parts of ETL processes to test.

Below are the options I was able to find.  (If you know of any others, let me know and I’ll update this list.)

Other options, outlined in a Stack Overflow question, include using DBUnit to populate databases.

A general purpose framework for testing ETL transformations suffers from a few hindrances:

  • it is easy to have side effects in a transform, and transformations are in general a higher level of abstraction than Java classes (which is why we can be more productive using them)
  • inputs and outputs differ for every transform
  • correctness is a larger question than the set of assert statements that unit testing frameworks provide can answer

As we build out a custom framework for testing, we’ll follow these principles:

  • mock up outside data sources as CSV files
  • break apart the ETL process into a load and a transform process
  • use golden data that we know to be correct as our “assert” statements
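
To make the golden data idea concrete, here is a minimal sketch in Python of what such an “assert” could look like.  It is not the code from this series: the function names, the `id` key column, and the sample rows are all hypothetical, and inline strings stand in for the CSV files that a transform would actually read and write.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def diff_against_golden(actual_rows, golden_rows, key):
    """Return human-readable differences between actual and golden rows.

    An empty list means the transform output matches the golden data.
    `key` is the (hypothetical) column that uniquely identifies a row.
    """
    actual = {row[key]: row for row in actual_rows}
    golden = {row[key]: row for row in golden_rows}
    problems = []
    # Rows the golden data expects but the transform did not produce.
    for k in sorted(golden.keys() - actual.keys()):
        problems.append(f"missing row {k}")
    # Rows the transform produced that the golden data does not contain.
    for k in sorted(actual.keys() - golden.keys()):
        problems.append(f"unexpected row {k}")
    # Rows present in both, compared column by column.
    for k in sorted(golden.keys() & actual.keys()):
        if actual[k] != golden[k]:
            problems.append(
                f"row {k} differs: expected {golden[k]}, got {actual[k]}"
            )
    return problems


# Stand-ins for a golden CSV file and the CSV a transform wrote.
golden = read_rows("id,total\n1,10\n2,20\n")
actual = read_rows("id,total\n1,10\n2,25\n3,5\n")

for problem in diff_against_golden(actual, golden, "id"):
    print(problem)
```

Reporting every difference, rather than stopping at the first failed comparison, is a deliberate choice here: when a transform goes wrong it usually affects many rows, and seeing them together makes the cause easier to spot.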

As a reminder, I’ll be publishing another installment in a couple of days.  But if you can’t wait, the full code is on GitHub.


© Moore Consulting, 2003-2017