
Testing with Pentaho Kettle – business logic

So, the first step in building the test harness is to create a skeleton of the transformations we will need to run.  These transforms contain the business logic of your ETL process.

Pssssst. This article is part of a series.  Previous posts covered the benefits of automated testing for ETL jobs, what parts of an ETL process to test, and the testing options currently available.

Typically, I find that my processing jobs break down into 4 parts:

  • setup (typically job entries)
  • loading data to a stream (extract)
  • processing that data (transform)
  • saving that data to a persistent datastore (load)

Often, I combine the last two steps into a single transformation.

So, for this sample project (final code is here), we will create a couple of transformations containing business logic.  (All transformations are built using Spoon on Windows with Pentaho Data Integration version 4.4.0.)

The business needs to greet people appropriately, so our job will take a list of names and output that same list with a greeting customized for each person.  This is the logic we are going to be testing.

First, the skeleton of the code that takes our input data and adds a greeting.  This transformation is called ‘Greet The World’.

[Image: the ‘Greet The World’ transformation]
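
To make the business rule concrete, here is a minimal Java sketch of the kind of logic ‘Greet The World’ encapsulates.  It is purely illustrative: the incoming ‘name’ field and the greeting wording are my own assumptions, not the transformation’s actual configuration.

    // Illustrative sketch only: the greeting rule as plain Java.
    // The "name" input and the "Hello, ..." wording are assumptions.
    public class Greeter {

        // Given a person's name, produce the customized greeting.
        public String greet(String name) {
            return "Hello, " + name + "!";
        }

        public static void main(String[] args) {
            Greeter greeter = new Greeter();
            for (String name : new String[] { "Alice", "Bob" }) {
                System.out.println(greeter.greet(name));
            }
        }
    }

Inside Kettle this rule might live in a step like Modified Java Script Value or User Defined Java Class; the point is simply that this is the piece the tests need to pin down as it grows.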

I also created a ‘Load People to Greet’ transformation that is just a text file input step and a copy rows to result step.

[Image: the ‘Load People to Greet’ transformation]

The last piece is the ‘GreetFolks’ job, which merely strings together these two transformations.  This would be the real job that would be run regularly to serve the business’ needs.

[Image: the ‘GreetFolks’ job]

This logic is not complicated now, but it could grow to be.  Depending on the data we are passed, the logic in the ‘Greet The World’ transformation could become quite complex: the variety of greetings could depend on the time of year, any special holidays happening, the gender or age or occupation of the person, and so on.

Astute observers may note that I didn’t write a test first.  The reason for this is that getting the test harness right before you write these skeletons is hard.  It’s easier to write the simplest skeleton, add a test to it, and then, for all future development, write a failing test first.

As a reminder, I’ll be publishing another installment of this tutorial in a couple of days.  But if you can’t wait, the full code is on GitHub.

Sign up for my infrequent emails about Pentaho testing.

Testing with Pentaho Kettle – current options

Before we dive into writing a custom test harness, it behooves us to look around and see if anyone else has solved the problem in a more general fashion.  This question has been asked in the Kettle forums before as well.

This article is part of a series.  Here’s the first part, explaining the benefits of automated testing for ETL jobs, and the second, talking about what parts of ETL processes to test.

Below are the options I was able to find.  (If you know of any others, let me know and I’ll update this list.)

Other options outlined on a StackOverflow question include using DBUnit to populate databases.
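
I have not tried DBUnit against a Kettle transform myself, but the idea is to put a database into a known state before the transform runs.  Here is a minimal sketch, where the JDBC driver, connection URL, credentials and dataset file name are all placeholder values:

    import java.io.File;

    import org.dbunit.IDatabaseTester;
    import org.dbunit.JdbcDatabaseTester;
    import org.dbunit.dataset.IDataSet;
    import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;

    public class SeedTestDatabase {
        public static void main(String[] args) throws Exception {
            // Placeholder driver, URL and credentials: point these at a throwaway test database.
            IDatabaseTester tester = new JdbcDatabaseTester(
                    "org.h2.Driver", "jdbc:h2:mem:etltest", "sa", "");

            // Load known seed rows from a flat XML dataset (hypothetical file name).
            IDataSet dataSet = new FlatXmlDataSetBuilder().build(new File("seed-data.xml"));
            tester.setDataSet(dataSet);

            // onSetup() runs CLEAN_INSERT by default: it clears the tables named in
            // the dataset and inserts the seed rows.
            tester.onSetup();

            // ... run the Kettle transform against the seeded database here ...

            tester.onTearDown();
        }
    }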

A general purpose framework for testing ETL transformations faces a few hindrances:

  • it is easy to have side effects in a transformation, and transformations are in general a higher level of abstraction than Java classes (which is why we can be more productive using them)
  • inputs and outputs differ for every transform
  • correctness is a larger question than the set of assert statements that unit testing frameworks provide can answer

As we build out a custom framework for testing, we’ll follow these principles:

  • mock up outside data sources as CSV files
  • break apart the ETL process into a load and a transform process
  • use golden data that we know to be correct as our “assert” statements (a minimal sketch of this comparison follows the list)
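
The golden data check is conceptually simple: run the transform, capture its output, and compare it against a file you have verified by hand.  Here is a minimal Java sketch of that comparison, with both file paths being placeholders; in the harness itself this check will live inside a Kettle test runner job rather than in outside code.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class GoldenDataCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder paths: the transform's output and the hand-verified golden file.
            List<String> actual = Files.readAllLines(
                    Paths.get("output/greetings.csv"), StandardCharsets.UTF_8);
            List<String> golden = Files.readAllLines(
                    Paths.get("golden/greetings.csv"), StandardCharsets.UTF_8);

            if (!actual.equals(golden)) {
                System.err.println("Output does not match golden data");
                System.exit(1);  // non-zero exit marks the test as failed
            }
            System.out.println("Output matches golden data");
        }
    }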

As a reminder, I’ll be publishing another installment in a couple of days.  But if you can’t wait, the full code is on GitHub.

Sign up for my infrequent emails about Pentaho testing.

Testing with Pentaho Kettle – what to test

This article is part of a series.  Here’s the first part, explaining the benefits of automated testing for ETL jobs.

Creating a test suite is an effort, especially since you have to manually create a test harness for each type of transformation.  So, what should you test?

You should test ETL code that is:

  • complex
  • likely to change over time
  • key to what you are doing
  • likely to fail in subtle ways

So, for instance, I don’t test code that loads data from a file.  I do test business logic.  I don’t test code that reads from a database or writes to a database.  I do test anything that has a Filter rows step in it.  I don’t test connectivity to needed resources, because I think a failure there would be spectacular enough that our ops team will notice.  I do test anything I think might change in the future.

It’s a balancing act, and choosing what to test or not to test can become an excuse for not testing at all.

So, if this decision is overwhelming but you want to try automated testing, pick a transform with logic that you currently maintain, refactor it to accept input from a Get rows from result step (or, if your dataset is large enough that you get OutOfMemory errors with this step, serialize/de-serialize the data instead), and wrap it with a test suite.  When you think of another “interesting” set of data, add that to the suite.  See if this gives you more confidence to change the transformation in question.

In the next post, we’ll start building out such a testing framework.

Sign up for my infrequent emails about Pentaho testing.

Testing with Pentaho Kettle/PDI

Pentaho Kettle (more formally called Pentaho Data Integration) is an ETL tool for working with large amounts of data.  I have found it a great solution for building data loaders to integrate external data sources into applications.  I’ve used it to pull data from remote databases, flat files and web services, munge that data, and then push it into a local data store (typically a SQL database).

However, transforming data can be complex.  This is especially true when the transformation process builds up over time–new business rules come into play, exceptions are made, and the data transformation process slowly becomes more and more complex.  This is true of all software, but data transformation has built-in complexity and a time component that other software processes can minimize.

This complexity in turn leads to a fear of change–the data transformation becomes fragile and brittle.  You have to spend a lot of time thinking about changes.  Testing such changes becomes a larger and larger effort, because you need to cover all the edge cases.  In some situations, you may want to let your transform run for a couple of days in a staging environment (you have one of those, right?) to see what effect changes to the processing have on the data.

What is the antidote for that fear?  Automated testing!

While automated testing for Pentaho Kettle is not as easy as using JUnit or Ruby on Rails, it can be done.  There are four major components.

  • First, the logic you are testing.  This is encapsulated in a job or transform.  It should take data in, transform it and then output it.  If you depend on databases, files or other external resources for input, mock them up using text files and the Get rows from result step.  Depending on your output, you may want to mock up some test tables or other resources.
  • Second, the test case runner.  This is a job that is parameterized and sets up the environment in which your logic runs, including passing the needed input data.  It also should check that the output is expected, and succeed or fail based on that.
  • Third, the test suite runner.  This takes a list of tests, sets up parameters and passes them to the test case runner.  This is the job that is run using kitchen.sh or kitchen.bat.  You will need a separate test case and test suite runner for each type of logic you are testing.
  • Finally, some way to run the job from the command line so that it can be automated (see the sketch after this list).  Integration into a CI environment like Hudson or Jenkins is highly recommended.
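
To make that last point concrete: kitchen exits with a non-zero status when the job it runs fails, so anything that can launch a process and inspect the exit code (a CI build step, a cron job, or the small Java wrapper sketched below) can automate the suite.  All paths and options here are illustrative; check the kitchen documentation for your PDI version for the exact flags.

    import java.io.File;

    public class RunTestSuite {
        public static void main(String[] args) throws Exception {
            // Illustrative paths: adjust for your PDI install and your test suite runner job.
            ProcessBuilder pb = new ProcessBuilder(
                    "/opt/pentaho/data-integration/kitchen.sh",
                    "-file=/path/to/test-suite-runner.kjb",
                    "-level=Basic");
            pb.directory(new File("/opt/pentaho/data-integration"));
            pb.inheritIO();  // stream kitchen's log to this process's console

            int exitCode = pb.start().waitFor();

            // Kitchen signals failure with a non-zero exit code; exiting with the
            // same code lets a CI server like Jenkins mark the build as failed.
            System.exit(exitCode);
        }
    }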

It can be time consuming and frustrating to set up these test harnesses, because you are essentially setting up a separate job to run your logic and therefore doing twice the work.  In addition, true unit testing, as provided by the frameworks mentioned above, is impossible with Kettle due to the way columns are treated–if you modify your data structure, you have to make those changes for all the tests.  However, setting up automated testing will save you time in the long run because:

  • the speed at which you can investigate “interesting” data (aka data that breaks the system) is greatly increased, as is your control (“I want to see what happens if I change this one field” becomes a question you can ask)
  • regression tests become far easier
  • if you run into weird data and need to add special case logic, you can also add a test to ensure that this logic hangs around
  • you can test edge cases (null values, large fields, letters where numbers are expected) without running the entire job
  • you can mimic time lag without waiting by varying inputs

I hope I’ve convinced you to consider looking at testing for Pentaho Kettle.  In the next couple of posts I will examine various facets of testing with Pentaho Kettle, including building out code for a simple test.

Sign up for my infrequent emails about Pentaho testing.

Solution for Time Machine error “Unable to complete backup. An error occurred while copying files to the backup volume”

My SO has a Mac, and she was using Time Machine to back it up.  As someone who cut his teeth on Amanda backups back in the day, I find Time Machine to be a beautiful, intuitive backup solution.  Really, it is.

However, a while ago the backups stopped working.  She saw this error message: “Unable to complete backup. An error occurred while copying files to the backup volume”.

I googled around for it and saw this KB article from Apple, which wasn’t too helpful, as the only troubleshooting suggestion was a reboot (hmmm, sounds a bit like Windows).  I tried doing that multiple times, and still saw the error message.

So, we tried a different hard drive.  That still didn’t seem to work–same error message.

Finally, I did some more googling and ran across this forum post (yes Jeff Atwood, forums are indeed the dark matter of the web), which gave me additional troubleshooting tips.

Basically, if you are seeing this error with Time Machine:

  1. connect your Time Machine disk drive to your Mac
  2. turn off Time Machine by opening the Time Machine prefs and selecting ‘none’ for the backup disk
  3. open up your Console
  4. click on ‘system.log’
  5. click ‘clear display’
  6. turn on Time Machine by opening the Time Machine prefs and selecting your disk drive
  7. watch the system log for errors that look like: Mar 9 12:14:14 computer-name /System/Library/CoreServices/backupd[905]: Error: (-36) SrcErr:YES Copying /path/to/file/file.name to (null)
  8. remove those files
  9. restart Time Machine by repeating steps 2 and 6.

I am not sure how to fix the issue if you can’t remove those files.  The files that were causing issues for our Mac were IMAP mailbox files from Thunderbird, so I just uninstalled Thunderbird and removed the mailbox files.