Before we dive into writing a custom test suite harness, it behooves us to look around and see if anyone else has solved the problem in a more general fashion. This question has been asked in the Kettle forums before as well.
This article is part of a series. Here’s the first part, explaining the benefits of automated testing for ETL jobs, and the second, talking about what parts of ETL processes to test.
Below are the options I was able to find. (If you know of any others, let me know and I’ll update this list.)
- In chapter 11, Pentaho Kettle Solutions gives an overview of testing and debugging ETL transformations.
- TestKitchen, a framework that combines some other tools with PDI to help test. This hasn’t been updated since 2010. I have not had a chance to download this and play around with it, but it is probably worth a look.
- PDI Black Box Testing is an article from 2007 talking about a framework for PDI testing, but it includes no code. Here’s a blog post with some comments about this framework.
- The data grid step lets you enter reference or test data, so it could play a part in a test.
- Here is a blog post describing building a test harness around ETL transformations using Hibernate.
Other options outlined on a StackOverflow question include using DBUnit to populate databases.
A general purpose framework for testing ETL transformations suffers from a few hindrances:
- it is easy to have side effects in a transform, and in general transformations are a higher level of abstraction than Java classes (which is why we can be more productive using them)
- inputs and outputs differ for every transform
- correctness is a larger question than the set of assert statements that unit testing frameworks provide can answer
As we build out a custom framework for testing, we’ll follow these principles:
- mock up outside data sources as CSV files
- break apart the ETL process into a load and a transform process
- use golden data that we know to be correct as our “assert” statements
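To make the golden data idea concrete, here is a minimal sketch in Python of comparing a transformation’s CSV output against a golden file. It is not the harness built later in this series; the function names (`read_rows`, `diff_against_golden`) and the choice to ignore row order are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def read_rows(csv_text):
    """Parse CSV text into a list of row tuples (header row included)."""
    return [tuple(row) for row in csv.reader(io.StringIO(csv_text))]


def diff_against_golden(actual_text, golden_text):
    """Compare actual output to golden data, ignoring row order.

    Hypothetical helper: returns (missing, unexpected), where `missing`
    holds rows present in the golden data but absent from the output,
    and `unexpected` holds rows in the output that the golden data lacks.
    Duplicate rows are counted, so an extra copy of a row is reported.
    """
    actual = Counter(read_rows(actual_text))
    golden = Counter(read_rows(golden_text))
    missing = sorted((golden - actual).elements())
    unexpected = sorted((actual - golden).elements())
    return missing, unexpected


if __name__ == "__main__":
    golden = "id,name\n1,alice\n2,bob\n"
    actual = "id,name\n2,bob\n1,alice\n"
    missing, unexpected = diff_against_golden(actual, golden)
    # The "assert" is simply that nothing differs from the golden data.
    assert not missing and not unexpected
```

In a real run, the actual text would be read from whatever file the transformation writes after being fed the mocked CSV inputs. Reporting both missing and unexpected rows, rather than a single pass/fail, makes it easier to see what a change to the transformation actually broke.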
As a reminder, I’ll be publishing another installment in a couple of days. But if you can’t wait, the full code is on GitHub.