This article is part of a series. Here’s the first part, explaining the benefits of automated testing for ETL jobs.
Creating a test suite takes effort, especially since you have to manually create a test harness for each type of transformation. So, what should you test?
You should test ETL code that is:
- likely to change over time
- key to what you are doing
- likely to fail in subtle ways
So, for instance, I don’t test code that loads data from a file. I do test business logic. I don’t test code that reads from or writes to a database. I do test anything that has a Filter rows step in it. I don’t test connectivity to needed resources, because a failure there would be spectacular enough that our ops team would notice. I do test anything I think might change in the future.
It’s a balancing act, and choosing what to test or not to test can become an excuse for not testing at all.
So, if this decision is overwhelming but you want to try automated testing, pick a transformation with logic that you currently maintain. Refactor it to accept input from a Get rows from result step (or, if your dataset is large enough that this step causes OutOfMemory errors, serialize/de-serialize the data), and wrap it with a test suite. When you think of another “interesting” set of data, add it to the suite. See if this gives you more confidence to change the transformation in question.
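To make the shape of such a suite concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `run_transformation` is a stand-in for actually executing your .ktr (in a real suite you would feed each input dataset through the transformation’s Get rows from result step, e.g. by invoking it in batch, and capture the output rows), and the filter logic and field names are invented for illustration.

```python
# Hypothetical stand-in for the real transformation. In practice you would
# run the .ktr here, feeding input_rows through its Get rows from result
# step and returning the rows it emits.
def run_transformation(rows):
    # Invented business logic, standing in for a Filter rows step:
    # keep only rows whose status is "active".
    return [r for r in rows if r["status"] == "active"]

# Each test case pairs an "interesting" input dataset with the rows we
# expect the transformation to produce. Add a new pair whenever you think
# of another interesting dataset.
TEST_CASES = [
    (
        [{"id": "1", "status": "active"}, {"id": "2", "status": "closed"}],
        [{"id": "1", "status": "active"}],
    ),
    (
        [],  # empty input should produce empty output
        [],
    ),
]

def run_suite():
    for i, (input_rows, expected) in enumerate(TEST_CASES):
        actual = run_transformation(input_rows)
        assert actual == expected, f"case {i}: expected {expected}, got {actual}"
    print("all cases passed")

run_suite()
```

The point is the structure, not the stand-in logic: once the transformation accepts rows as input, each test case is just data, so growing the suite means adding pairs to `TEST_CASES` rather than building a new harness.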
In the next post, we’ll start building out such a testing framework.
Sign up for my infrequent emails about Pentaho testing.