Pentaho Data Integration

Why Use an ETL Tool?

transformation photo — Photo by AlicePopkorn

I’m a big fan of ETL tools. The one with which I am most familiar is Kettle, aka Pentaho Data Integration. When I was working for 8z, we used it heavily to pull data from other systems, process it, and update our databases. While ETL systems are not without their flaws, I think their strengths are such that everyone who is moving data around should consider them. This is more true now than in the past because there is a lot more data flowing everywhere, and there are several viable open source ETL tools, so you don’t have to spend thousands or tens of thousands of dollars to get started.

What are the benefits of ETL tools?

There are pre-built components for common data tasks (connecting to a database, parsing a flat file) that have been tested and debugged by many, many people. It’s hard to overemphasize how much time this can save, allowing you to focus on business logic.
You operate at a higher level of abstraction.
There is support for performance features, such as parallel jobs, that you can configure.
The GUI makes data flow obvious.
You can write your own components that leverage existing libraries.

What are the detriments?

Possible to version control, impossible to merge.
Limits of components mean you sometimes have to contort your data flows, or drop down to write your own component.
Some components (at least for Kettle) are not open source.
You have to roll your own testing framework. I did.
You have to learn another tool.

Don’t re-invent the wheel! Your data movement problem may very well be a super special snowflake, but chances are it isn’t. Every line of code you write is another you have to maintain. When you are confronted with a data movement problem, take a look at an ETL tool like Kettle and see if you can stand on the shoulders of giants. Here’s a list of open source ETL tools to evaluate.

Databases Pentaho Data Integration | moore | November 24, 2014

Slides from my ‘Transforming Data with Kettle’ talk

Here is a PDF of my slides from the ‘Transforming Data with Kettle’ talk I did last night at the Boulder Java Users’ Group.

I really enjoyed some of the discussion, which centered around topics like how to version control the output of graphical code generation tools, error handling and transactions, and who had interacted with the EDI data format. As for my audience survey questions, I’d say around 40% of people had written scripts to munge data from one datasource to another, about 15% had used Java code, and about 20% had used an ETL tool.

Thanks to everyone who attended.

BJUG Pentaho Data Integration | moore | August 14, 2013

An Intro to Pentaho Data Integration

I’m presenting on transforming data with Pentaho Data Integration (aka Kettle) at the Boulder Java Users Group this Tuesday.

I hope you can join us as I present a gentle introduction to Kettle, including a couple of demos. I hear there will be pizza as well.

BJUG Pentaho Data Integration | moore | August 9, 2013

Results from the PDI/Kettle user survey

I just wanted to post results from the PDI/Kettle user survey I ran a few weeks ago.

I marketed this on the Pentaho forums and Twitter (got a nice retweet from @pentaho). I wasn’t sure how to reach more of the PDI audience.

I received almost 20 responses, but I know that this is only a tiny portion of the PDI userbase (after all, as of last week, Kettle 4.4.0 had been downloaded over 110,000 times since it was released last November). So these results are mostly of anecdotal interest.

Here are the results.

The ‘other’ category ranged from “I do everything” to “Data Warehouse Architect”. It was interesting to me that so many folks were developers, but that the resources for testing PDI (and version control and other kinds of developer type tasks) are so sparse. Or maybe I’m just not aware of these resources.

Again, it is interesting to me that most of the people (over 75%) who responded to my survey have over two years of PDI experience. I would have thought that there would be a higher percentage of new users, at least from the forums. And I’m bummed Matt Casters didn’t fill it out!

Thanks to everyone who participated!

If you’d like to get on a low traffic list and hear more of my thoughts about Pentaho Data Integration Development, sign up for the PDI development newsletter.

Pentaho Data Integration | moore | July 21, 2013

Survey on pentaho data integration/kettle users

As I posted in the Pentaho Forums last week, I’m running a survey on who uses Kettle to help with a presentation I’m doing for the Boulder Java Users’ Group. It’s only 2 questions long, so if you use Pentaho Data Integration/Kettle, please consider filling the survey out.

Pentaho Data Integration | moore | July 8, 2013

Testing time dependent kettle transformations

Testing transformations that depend on the date will often be required when you only want to process new data, or if you want to treat events that happened in the past differently depending on how long ago they occurred.

I have handled the time dimension in one of two ways.

The first is to have a SQL statement that is pulled in via a ‘Get Variables’ step. This statement is then executed. For the production job, the statement simply pulls the current date from the database: "select curdate()" for MySQL. For testing, the statement returns some known date: "select str_to_date('2012-05-27','%Y-%m-%d')" for MySQL.

The benefit to this is that you can make this SQL call in your transformation, and everything stays tidily in there. The disadvantage is that you’re making another database call and mostly just for testing purposes.

The second is just to have a variable that is set previously in the job and is passed in to a transformation as a named parameter. This date can be pulled from a file (for test), or using the ‘Get System Info’ step, or a database lookup (for production). The benefit to this is that you aren’t necessarily making another database call and it is more understandable. I can’t think of any downside, so this is my recommended method.

After this setup is done, you can pivot your test data around the hardcoded test date. For example, if your data should change state one year after insertion, you can set the dates in your input data rows to 364, 365 and 366 days before your test date. This kind of boundary condition testing ensures that when the logic changes (say, the state should now change two years after insertion), your test will fail, and you will know about the issue before your users do.
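As a rough illustration of pivoting test data around a fixed date, here is a minimal Python sketch. It is not Kettle code: the function names are hypothetical, and it assumes the state changes once a row is at least 365 days old (the inclusive boundary is my assumption, not something stated above).

```python
from datetime import date, timedelta

# Fixed test date, matching the example date used in the SQL above.
TEST_DATE = date(2012, 5, 27)
# Assumed rule: a row changes state once it is at least this many days old.
STATE_CHANGE_AFTER_DAYS = 365


def has_changed_state(inserted_on, today, threshold_days=STATE_CHANGE_AFTER_DAYS):
    """Return True when the row is at least threshold_days old as of today."""
    return (today - inserted_on).days >= threshold_days


def boundary_rows(test_date, threshold_days=STATE_CHANGE_AFTER_DAYS):
    """Build insertion dates one day short of, on, and one day past the threshold."""
    offsets = (threshold_days - 1, threshold_days, threshold_days + 1)
    return [test_date - timedelta(days=offset) for offset in offsets]


rows = boundary_rows(TEST_DATE)
results = [has_changed_state(row, TEST_DATE) for row in rows]
print(results)
```

Because the dates are computed from the fixed test date rather than from the system clock, the same rows give the same answer every time the test runs, and a change to the threshold shows up as a changed result.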

This is content from my email newsletter about Pentaho Kettle Testing. To receive similar emails in your inbox, sign up below.

Signup for my infrequent emails about pentaho testing.

Pentaho Data Integration Testing | moore | June 16, 2013

Testing with Pentaho Kettle – next steps

So, to review, we’ve taken a (very simple) ETL process and written the basic logic, constructed a test case harness around it, built a test suite harness around that test case, and added some logic and a new test case to the suite. In normal development, you’d continue on, adding more and more test cases and then adding to your core logic to make those test cases pass.

This is the last in a series of blog posts on testing Pentaho Kettle ETL transformations.

Here are some other production-ready ETL testing framework enhancements.

use database tables instead of text files for your output steps (both regular and golden), if the main process will be writing to a database.
run the tests using kitchen instead of spoon, using ant or whatever build system is best for your operation
integrate with a continuous integration system like hudson or jenkins to be aware when changes break the system
mock up external resources like database tables and web services calls
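To make the second suggestion concrete, here is a small Python sketch of a wrapper that a build system could call to run a job through kitchen. The job path and parameter values are hypothetical, and the sketch assumes kitchen.sh is on the PATH and that an exit status of 0 means the job succeeded; check the options against your own Kettle version before relying on them.

```python
import subprocess


def kitchen_command(job_file, params, kitchen="kitchen.sh", level="Basic"):
    """Build the argument list for running a job file with named parameters."""
    command = [kitchen, "-file=" + job_file, "-level=" + level]
    for name, value in sorted(params.items()):
        command.append("-param:{}={}".format(name, value))
    return command


def run_job(job_file, params):
    """Run the job and return True when kitchen exits with status 0."""
    return subprocess.call(kitchen_command(job_file, params)) == 0


# Hypothetical job location and parameters, for illustration only.
command = kitchen_command(
    "src/test/jobs/TestCaseRunner.kjb",
    {"test.name": "simplerun", "input.file": "simplerun/input.txt"},
)
print(" ".join(command))
```

Building the command separately from running it keeps the wrapper easy to check without a Kettle installation.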

If you are interested in setting up a test of your ETL processes, here are some tips:

use a file based repository, and version your kettle files. Being XML, job and transformation files don’t handle diffs well, but a file based repository is still far easier to version than a database repository. You may want to try an XML aware diff tool to help with versioning difficulties.
let your testing infrastructure grow with your code–don’t try to write your entire harness in a big upfront effort.

By the way, testing isn’t cost free. I went over some of the benefits in an earlier post, but it’s worth examining the costs. They include:

additional time to build the harness
hassle when you add fields to the output, because you have to go back and add them to all the test data as well
additional thought required to decide what to test
running the tests takes time (I have about 35 tests in one of my kettle projects and it can take about 10 minutes to run them all)

However, I still think, for any ETL project of decent size (more than one transformation) or that will be around for a while (any time long enough to evolve), an automated testing approach makes sense.

Unless you can guarantee that business requirements won’t change (and I have news for you, you can’t!), testing can give you the ability to explore data changes and the confidence to make logic changes.

Happy testing!


Pentaho Data Integration Testing | moore | May 11, 2013

Testing with Pentaho Kettle – adding new logic

We finally have a working test suite, so let’s break some code. We have a new requirement that we greet users who are under the age of 30 with ‘howdy’ because that’s how the kids are saying ‘hello’ nowadays.

You just jumped into a series of blog posts on testing ETL transformations written with Pentaho Data Integration.

The first thing we should do is write a test that exercises the logic we are trying to write. We make a directory with a name descriptive of the behavior we are trying to test, and add a row to the tests.csv driver file pointing to the files in that directory. Here’s how the additional line will look:

agebasedgreeting,agebasedgreeting/input.txt,agebasedgreeting/expected.txt

And we will copy over the data files from the first test case we had (simplerun) and modify them to exhibit the expected behavior (a new greeting for users under 30). We don’t have to modify the input file, since it has people both under 30 and over 30 in it, but just to catch any crazy boundary conditions, we will add someone who is 30 and someone who is 31 (we already have Jane Doe, who is 29).

Then we need to modify the expected output file to reflect the howdyification of the greeting. You can check out both source files on github.

Then we run the tests.

You can see the failure in the log file that kettle generates and in the build/results directory. You can also see that we added a job entry to clean up the build directory so that when we run tests each time, we have a clean directory into which to write our output file.

Now that we have a failing test, we can modify the core logic to make the test pass. Writing the logic is an exercise left to the reader. (Or you could look at the github project :).
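For readers who want the rule spelled out before building it in Kettle, here is a minimal Python sketch of the requirement. It is only an illustration of the rule, not the transformation from the github project, and it assumes everyone aged 30 or over keeps the ‘hello’ greeting.

```python
def greeting_for(age):
    """Greet users under 30 with 'howdy' and everyone else with 'hello'."""
    return "howdy" if age < 30 else "hello"


# The three boundary ages from the test data: 29, 30 and 31.
ages = [29, 30, 31]
greetings = [greeting_for(age) for age in ages]
print(list(zip(ages, greetings)))
```

The three ages are exactly the boundary cases added to the input file, so a mistake such as using "less than or equal to 30" would change the greeting for the 30 year old.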

We re-run the tests to see if they pass, but it looks like simplerun fails before we can even test agebasedgreeting.

We can do a diff of the expected and output files and see that, whoops, the simplerun testcase had some users that were under 30 and affected by the logic change.
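Any diff tool will do for this comparison. As a sketch, the same check can be done with Python’s standard difflib module; the file contents here are made up for illustration (only Jane Doe and her age come from the test data described above, and the column layout is an assumption).

```python
import difflib


def diff_lines(expected_text, output_text):
    """Return unified diff lines between the expected and actual output."""
    return list(difflib.unified_diff(
        expected_text.splitlines(),
        output_text.splitlines(),
        fromfile="expected.txt",
        tofile="output.txt",
        lineterm="",
    ))


# Hypothetical file contents; John Smith is an invented row.
expected = "Jane Doe,29,hello\nJohn Smith,45,hello\n"
output = "Jane Doe,29,howdy\nJohn Smith,45,hello\n"

changes = diff_lines(expected, output)
print("\n".join(changes))
```

An empty result means the files match; any lines starting with a minus or plus sign show which rows the logic change touched.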

This points out two features of this method of testing:

Regression testing is built in.
Because of the way we abort tests, TestSuiteRunner only runs until the first failure.

The easiest way to fix this issue is to inspect output.txt and verify that it is as expected for the simplerun test. If so, we can simply copy it over to simplerun/expected.txt and use that file as the new golden table.

We also realize that we are passing in the hello field to the output.txt file and that doing so is no longer required. So we can update the expected.txt in both directories to reflect that. Running the tests again gives us success.

Now that we’ve added code, I’ll look at some next steps you can take if you are interested in further testing your ETL processes.


Pentaho Data Integration Programming Testing | moore | May 8, 2013

Testing with Pentaho Kettle – the test suite runner

After we can run one test case, the next step is to run a number of test cases at one time.

Heads up, this article is part of a series on testing ETL transformations written with Pentaho Kettle.

Running multiple tests allows us to exercise logic in transformations by adjusting the input and expected output files. This allows you to test a number of edge cases easily.

First, we need to build a CSV file to drive the test cases. This test list file is read by a transformation that loads the rows and passes them to the next job entry. The next job entry is the TestCaseRunner we saw in the last post, which is executed once for each line in the CSV file. We filter out any rows that start with a #. This behavior helps immensely when you are developing a new test and don’t want to run all the other tests in your suite (typically because of how long it can take).
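The loading and filtering behavior can be sketched in a few lines of Python. This is not the Kettle transformation itself, just an illustration of the idea; the three-column layout (test name, input file, expected file) is assumed from the sample row shown in the ‘adding new logic’ post, and the simplerun row is written by analogy.

```python
import csv
import io


def load_tests(lines):
    """Parse test rows, skipping blank lines and lines commented out with #."""
    active = (
        line for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )
    return [tuple(row) for row in csv.reader(active)]


# Stand-in for tests.csv; the second test is temporarily disabled with a #.
sample = io.StringIO(
    "simplerun,simplerun/input.txt,simplerun/expected.txt\n"
    "#agebasedgreeting,agebasedgreeting/input.txt,agebasedgreeting/expected.txt\n"
)

tests = load_tests(sample)
print(tests)
```

Commenting out a row is a cheap way to focus on one test while developing it, without deleting anything from the driver file.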

In order to drive each test case from the rows output by the Load Tests From File transformation, we need to modify the job settings of the TestCaseRunner. We’ve checked the ‘copy previous results to parameters’ checkbox, which takes the rows loaded from the tests.csv file by the previous transformation and uses them as parameters for the TestCaseRunner job. We also checked the ‘execute for every input row’ checkbox, which will execute the test case once for each row. This lets us add a new test by adding a line to the file.

Obviously, taking these parameters requires modifications to the TestCaseRunner job. Rather than have the input.file and expected.file variables hardcoded as we did previously, we need to take them as parameters.

We also pass a test.name parameter, so that we can distinguish between tests that fail and those that succeed. We also create a directory for test results that we don’t delete after the test suite is run, and output a success or failure marker file after a test is run.
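Here is a rough Python sketch of the same control flow: one run per row, a marker file per test, and a stop at the first failure. The marker file names and the run_case callable are hypothetical stand-ins for the TestCaseRunner job, and the stop-on-first-failure behavior reflects how this suite aborts rather than a general requirement.

```python
import os
import tempfile


def run_suite(tests, run_case, results_dir):
    """Run each test in order, write a marker file, and stop at the first failure."""
    os.makedirs(results_dir, exist_ok=True)
    for name, input_file, expected_file in tests:
        passed = run_case(input_file, expected_file)
        marker = "{}.{}".format(name, "success" if passed else "failure")
        with open(os.path.join(results_dir, marker), "w"):
            pass  # an empty file is enough to record the outcome
        if not passed:
            return False
    return True


results_dir = tempfile.mkdtemp()
tests = [
    ("simplerun", "simplerun/input.txt", "simplerun/expected.txt"),
    ("agebasedgreeting", "agebasedgreeting/input.txt", "agebasedgreeting/expected.txt"),
]

# A stand-in test case that always passes, just to exercise the loop.
all_passed = run_suite(tests, lambda input_file, expected_file: True, results_dir)
print(all_passed, sorted(os.listdir(results_dir)))
```

Keeping the results directory around after the run means you can see at a glance which tests ran and how each one ended.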

You can run the TestSuiteRunner job in Spoon by hitting the play button or f9.

As a reminder, I’ll be publishing another installment of this tutorial in a couple of days–we’ll cover how to add new logic. But if you can’t wait, the full code for the Pentaho ETL Testing Example is on github.


Pentaho Data Integration Testing | moore | May 5, 2013

Testing with Pentaho Kettle – the test case runner

Now that we have our business logic, we need to build a test case that can exercise that logic.

FYI, this article is part of a series.

First, build out a job that looks almost like our regular job, but has a few extra steps. Below I’ll show you screen captures from spoon as we build out the business logic, but you can view the complete set of code on github.

It sets some variables for input, output and expected files. You can see below that we also set a base.job.dir variable which is used as a convenience elsewhere in the TestCaseRunner (for pulling in sample data, for example).

The job also creates a temp directory for output files, then calls the two transformations that are at the heart of our business logic. After that, the TestCaseRunner compares the output and expected files, and signals either success or failure.

To make the business logic transformations testable, we have to be able to inject test files for processing. At the same time, in the main job/production, we obviously want to process real data. The answer is to modify the transformations to read the file to process from named parameters. We do this on both the job entry screen and the transformation settings screen.

We also need to make sure to change the main GreetFolks job to pass the needed parameters into the updated transformations.

Once these parameters are passed to the transformations, you need to modify the steps inside to use the parameters rather than hardcoded values. Below we show the modified Text File Input step in the Load People To Greet transformation.

The input and expected files are added to our project in the src/test/data directory and are placed under source control. These are the data sets we vary to test interesting conditions in our code. The output file is sent to a temporary directory.

So, now we can run this single test case in spoon and see if our expected values match the output values. You can see from the logfile below that this particular run was successful.

The compare step at the end is our ‘assert’ statement. In most cases, it will be comparing two files: the expected output file (also called ‘golden’) and the output of the transformation. The File Compare job entry works well if you are testing a single file. If the comparison is between two database tables, you can use a Merge Rows step and fail if any rows aren’t identical.
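The idea behind the file comparison can be sketched outside of Kettle with Python’s standard filecmp module. This is an analogy for the ‘assert’, not what the File Compare job entry does internally; the file names and contents are made up for the example.

```python
import filecmp
import os
import tempfile


def files_match(expected_path, output_path):
    """Return True when both files have identical contents, byte for byte."""
    return filecmp.cmp(expected_path, output_path, shallow=False)


def write_file(directory, name, text):
    """Write text to a file in directory and return the file's path."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write(text)
    return path


# Hypothetical golden file and two candidate outputs.
work_dir = tempfile.mkdtemp()
expected = write_file(work_dir, "expected.txt", "Jane Doe,hello\n")
same = write_file(work_dir, "output_same.txt", "Jane Doe,hello\n")
different = write_file(work_dir, "output_different.txt", "Jane Doe,howdy\n")

print(files_match(expected, same))
print(files_match(expected, different))
```

A byte-for-byte comparison is strict: differences in row order or trailing whitespace count as failures, which is usually what you want from a golden file check.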

You can run the TestCaseRunner job in spoon by hitting the play button or f9.

Next time we will look at how to run multiple tests via one job.


Pentaho Data Integration Testing | moore | May 2, 2013
