
Companies to come out of XOR

I read Startup Communities by Brad Feld a few months ago. I found it to be interesting even for me–someone who is only on the periphery of the VC/startup community in Boulder. I especially enjoyed his first chapter, where he examined the startup history of Boulder, from StorageTek to Celestial Seasonings.

I cut my teeth working as an employee of a startup in Boulder, XOR. We were a consulting company, and I was able to watch, fresh out of college and wet behind the ears, as we went from a small, profitable company of 60 to a VC-funded agglomeration of 500 employees spread across the country, and then through some of the layoffs and consolidation that followed.

I was talking with another former XOR employee (who co-founded the company I currently work for) about companies that spun out of XOR, and thought it’d be fun to collect a list.

To make this list, you have to meet the following criteria:

  • founded by someone who worked at XOR
  • had at least one employee or two founders–I started a sole proprietorship myself, but it is hard to distinguish between freelancing (which is hard, but not as hard as running a company) and a one-person company

To make the list, a company does not have to still be alive or profitable–I’m interested in the failures as well as the successes. In addition, it doesn’t matter if the founding happened a few years or a few jobs after the person worked at XOR–again, I’m interested in lineage, not direct causation.

Here are the companies I know (including XOR founders where known–there may have been other founders not listed).  In no particular order…

If you know one that is not listed, please contact me and I’ll add your suggestion.

Testing with Pentaho Kettle – the test suite runner

After we can run one test case, the next step is to run a number of test cases at one time.

Heads up, this article is part of a series on testing ETL transformations written with Pentaho Kettle. Previous posts covered:

Running multiple tests lets us exercise the logic in our transformations by varying the input and expected output files, which makes it easy to cover a number of edge cases.

[Screenshot: the TestSuiteRunner job]

First, we need to build a CSV to drive the test cases.  Here is the test list file.  This file is read by a transformation that loads the rows and passes them to the next job entry, the TestCaseRunner we saw in the last post, which is run once for each line in the CSV file.  As you can see below, we filter out any rows that start with a #.  This behavior helps immensely when you are developing a new test and don’t want to run all the other tests in your suite (typically because of how long that can take).
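Here is a sketch of what such a test list file might contain.  The column headers are illustrative assumptions rather than the exact names used in the example project; note the commented-out row, which the filter step will skip:

  test.name,input.file,expected.file
  greet-single-person,src/test/data/input-single.csv,src/test/data/expected-single.csv
  greet-empty-file,src/test/data/input-empty.csv,src/test/data/expected-empty.csv
  #greet-many-people,src/test/data/input-many.csv,src/test/data/expected-many.csv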

[Screenshot: the Load Tests From File transformation]

In order to drive each test case from the rows output by the Load Tests From File transformation, we need to modify the job settings of the TestCaseRunner.  Below we’ve checked the ‘copy previous results to parameters’ checkbox, which takes the rows output from the tests.csv file loaded by the previous transformation and uses them as parameters for the TestCaseRunner job.  We also checked the ‘execute for every input row’ checkbox, which will execute the test case once for each row.  This lets us add a new test simply by adding a line to the file.

[Screenshot: the TestCaseRunner job entry settings within the TestSuiteRunner]

Obviously, taking these parameters requires modifications to the TestCaseRunner job.  Rather than have the input.file and expected.file variables hardcoded as we did previously, we need to take them as parameters:

[Screenshot: the TestCaseRunner job modified to take input.file and expected.file as parameters]

We also pass a test.name parameter so that we can distinguish between tests that fail and those that succeed.  In addition, we create a directory for test results that is not deleted after the test suite runs, and output a success or failure marker file after each test is run.
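As a sketch of how those marker files could be used in automation–the results directory and the marker naming below are assumptions, not the conventions of the example project–a build script could fail the run if any failure marker exists:

  # Hypothetical check: fail the build if any test wrote a failure marker.
  # Adjust the directory and naming to match your own suite.
  if ls /tmp/test-results/*.failed >/dev/null 2>&1; then
    echo "Failed Kettle tests:"
    ls /tmp/test-results/*.failed
    exit 1
  fi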

[Screenshot: the TestCaseRunner job modified for the test suite]

You can run the TestSuiteRunner job in Spoon by hitting the play button or F9.

As a reminder, I’ll be publishing another installment of this tutorial in a couple of days–we’ll cover how to add new logic.  But if you can’t wait, the full code for the Pentaho ETL Testing Example is on github.

Sign up for my infrequent emails about Pentaho testing.

Testing with Pentaho Kettle – the test case runner

Now that we have our business logic, we need to build a test case that can exercise that logic.

FYI, this article is part of a series. Previous posts covered:

First, build out a job that looks almost like our regular job, but has a few extra steps. Below I’ll show you screen captures from Spoon as we build out this test case runner, but you can view the complete set of code on github.

[Screenshot: the TestCaseRunner job]

It sets some variables for input, output and expected files.  You can see below that we also set a base.job.dir variable which is used as a convenience elsewhere in the TestCaseRunner (for pulling in sample data, for example).

[Screenshot: the TestCaseRunner variable settings]

The job also creates a temp directory for output files, then calls the two transformations that are at the heart of our business logic.  After that, the TestCaseRunner compares the output and expected files, and signals either success or failure.

To make the business logic transformations testable, we have to be able to inject test files for processing. At the same time, in the main job/production, we obviously want to process real data. The answer is to modify the transformations to read the file to process from named parameters.  We do this on both the job entry screen:

[Screenshot: the job entry parameters screen]

and on the transformation settings screen:

[Screenshot: the transformation settings parameters screen]

We also need to make sure to change the main GreetFolks job to pass the needed parameters into the updated transformations.

Once these parameters are passed to the transformations, you need to modify the steps inside to use the parameters rather than hardcoded values.  Below we show the modified Text File Input step in the Load People To Greet transformation.

[Screenshot: the parameterized Text File Input step]
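For reference, Kettle resolves parameters and variables with the ${...} syntax, so the filename field in the step above simply becomes ${input.file} instead of a hardcoded path.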

The input and expected files are added to our project in the src/test/data directory and are placed under source control.  These are the data sets we vary to test interesting conditions in our code.  The output file is sent to a temporary directory.

So, now we can run this single test case in Spoon and see if our expected values match the output values.  You can see from the log file below that this particular run was successful.

[Screenshot: a successful TestCaseRunner run]

The compare step at the end is our ‘assert’ statement.  In most cases, it compares two files: the expected output file (also called the ‘golden’ file) and the output of the transformation.  The File Compare job entry works well if you are testing a single file.  If the comparison is between two database tables, you can use a Merge Rows step and fail if all rows aren’t identical.
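One way to wire up that table comparison (a sketch, not the approach used in the example project): the Merge Rows (diff) step adds a flag field to each row with a value of ‘identical’, ‘changed’, ‘new’ or ‘deleted’, so a Filter rows step on that flag feeding an Abort step gives you the same pass/fail behavior for table data.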

You can run the TestCaseRunner job in Spoon by hitting the play button or F9.

Next time we will look at how to run multiple tests via one job.

Sign up for my infrequent emails about Pentaho testing.

Testing with Pentaho Kettle – business logic

So, the first step in building the test harness is to create a skeleton of the transformations we will need to run.  These transforms contain the business logic of your ETL process.

Pssssst. This article is part of a series.  Previous posts covered:

Typically, I find that my processing jobs break down into 4 parts:

  • setup (typically job entries)
  • loading data to a stream (extract)
  • processing that data (transform)
  • saving that data to a persistent datastore (load)

Often, I combine the last two steps into a single transformation.

So, for this sample project (final code is here), we will create a couple of transformations containing business logic.  (All transformations are built using Spoon on Windows with Pentaho Data Integration version 4.4.0.)

The business needs to greet people appropriately, so our job will take a list of names and output that same list with a greeting customized for each person.  This is the logic we are going to be testing.
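As a hypothetical illustration (the actual sample data and greeting format live in the github project), the input file and its golden output for one test might look like:

  input.csv (a list of names):
    name
    Alice
    Bob

  expected.csv (the same list with a greeting added):
    name,greeting
    Alice,"Hello, Alice!"
    Bob,"Hello, Bob!"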

First, here’s the skeleton of the code that takes our input data and adds a greeting.  This transformation is called ‘Greet The World’.

[Screenshot: the Greet The World transformation]

I also created a ‘Load People to Greet’ transformation that is just a text file input step and a copy rows to results step.

[Screenshot: the Load People to Greet transformation]

The last piece is the ‘GreetFolks’ job, which merely strings together these two transformations.  This is the real job that would be run regularly to serve the business’ needs.

[Screenshot: the GreetFolks job]

This logic is not complicated now, but it could grow to be.  Depending on the data being passed in, the logic in the ‘Greet The World’ transformation could become quite complex–the variety of greetings could depend on the time of year, any special holidays happening, the gender or age or occupation of the person, and so on.

Astute observers may note that I didn’t write a test first.  The reason for this is that getting the test harness right before you write these skeletons is hard.  It’s easier to write the simplest skeleton, add a test to it, and then, for all future development, write a failing test first.

As a reminder, I’ll be publishing another installment of this tutorial in a couple of days.  But if you can’t wait, the full code is on github.

Sign up for my infrequent emails about Pentaho testing.

Testing with Pentaho Kettle – current options

Before we dive into writing a custom test suite harness, it behooves us to look around and see if anyone else has solved the problem in a more general fashion.  This question has been asked in the Kettle forums before as well.

This article is part of a series.  Here’s the first part, explaining the benefits of automated testing for ETL jobs, and the second, talking about what parts of ETL processes to test.

Below are the options I was able to find.  (If you know of any others, let me know and I’ll update this list.)

Other options outlined on a StackOverflow question include using DBUnit to populate databases.

A general purpose framework for testing ETL transformations suffers from a few hindrances:

  • it is easy to have side effects in a transformation, and in general transformations are a higher level of abstraction than Java classes (which is why we can be more productive using them)
  • inputs and outputs differ for every transform
  • correctness is a larger question than a set of assert statements that unit testing frameworks provide

As we build out a custom framework for testing, we’ll follow these principles:

  • mock up outside data sources as CSV files
  • break apart the ETL process into a load and a transform process
  • use golden data that we know to be correct as our “assert” statements

As a reminder, I’ll be publishing another installment in a couple of days.  But if you can’t wait, the full code is on github.

Sign up for my infrequent emails about Pentaho testing.

Testing with Pentaho Kettle – what to test

This article is part of a series.  Here’s the first part, explaining the benefits of automated testing for ETL jobs.

Since you have to manually create a test harness for each type of transformation, building a test suite takes real effort.  So, what should you test?

You should test ETL code that is:

  • complex
  • likely to change over time
  • key to what you are doing
  • likely to fail in subtle ways

So, for instance, I don’t test code that loads data from a file.  I do test business logic.  I don’t test code that reads from a database or writes to a database.  I do test anything that has a Filter rows step in it.  I don’t test connectivity to needed resources, because I think a failure there would be spectacular enough that our ops team will notice.  I do test anything I think might change in the future.

It’s a balancing act, and choosing what to test or not to test can become an excuse for not testing at all.

So, if this decision is overwhelming but you want to try automated testing, pick a transform with logic that you currently maintain, refactor it to accept input from a Get rows from result step (or, if your dataset is large enough that you get OutOfMemory errors with this step, serialize/de-serialize the data), and wrap it with a test suite.  When you think of another “interesting” set of data, add it to the suite.  See if this gives you more confidence to change the transformation in question.

In the next post, we’ll start building out such a testing framework.

Sign up for my infrequent emails about Pentaho testing.

Testing with Pentaho Kettle/PDI

Pentaho Kettle (more formally called Pentaho Data Integration) is an ETL tool for working with large amounts of data.  I have found it a great solution for building data loaders that integrate external data sources.  I’ve used it to pull data from remote databases, flat files and web services, munge that data, and then push it into a local data store (typically a SQL database).

However, transforming data can be complex.  This is especially true when the transformation process builds up over time–new business rules come into play, exceptions are made, and the data transformation process slowly becomes more and more complex.  This is true of all software, but data transformation has built-in complexity and a time component that other software processes can minimize.

This complexity in turn leads to a fear of change–the data transformation becomes fragile and brittle.  You have to spend a lot of time thinking about changes.  Testing such changes becomes a larger and larger effort, because you need to cover all the edge cases.  In some situations, you may want to let your transform run for a couple of days in a staging environment (you have one of those, right?) to see what effect changes to the processing have on the data.

What is the antidote for that fear?  Automated testing!

While automated testing for Pentaho Kettle is not as easy as it is with JUnit or Ruby on Rails, it can be done.  There are four major components.

  • First, the logic you are testing.  This is encapsulated in a job or transform.  It should take data in, transform it and then output it.  If you depend on databases, files or other external resources for input, mock them up using text files and the Get rows from result step.  Depending on your output, you may want to mock up some test tables or other resources.
  • Second, the test case runner.  This is a job that is parameterized and sets up the environment in which your logic runs, including passing the needed input data.  It also should check that the output is expected, and succeed or fail based on that.
  • Third, the test suite runner.  This takes a list of tests, sets up parameters and passes them to the test case runner.  This is the job that is run using kitchen.sh or kitchen.bat (an example invocation is sketched just after this list).  You will need a separate test case and test suite runner for each type of logic you are testing.
  • Finally, some way to run the job from the command line so that it can be automated.  Integration into a CI environment like Hudson or Jenkins is highly recommended.
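Here is a sketch of what that kitchen invocation might look like.  The job file names, paths and parameter values are illustrative assumptions; the parameter names match the ones used elsewhere in this series:

  # Run the whole suite:
  ./kitchen.sh -file=/path/to/test-suite-runner.kjb -level=Basic

  # Or run a single test case directly by passing named parameters:
  ./kitchen.sh -file=/path/to/test-case-runner.kjb \
    -param:input.file=src/test/data/input.csv \
    -param:expected.file=src/test/data/expected.csv \
    -param:test.name=greet-single-person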

It can be time consuming and frustrating to set up these test harnesses, because you are essentially setting up a separate job to run your logic and therefore doing twice the work.  In addition, true unit testing, like that provided by the frameworks mentioned above, is impossible with Kettle due to the way columns are treated–if you modify your data structure, you have to make those changes for all the tests.  However, setting up automated testing will save you time in the long run because:

  • the speed at which you can investigate “interesting” data (aka data that breaks the system) is greatly increased, as is your control (“I want to see what happens if I change this one field” becomes a question you can ask)
  • regression tests become far easier
  • if you run into weird data and need to add special case logic, you can also add a test to ensure that this logic hangs around
  • you can test edge cases (null values, large fields, letters where numbers are expected) without running the entire job
  • you can mimic time lag without waiting by varying inputs

I hope I’ve convinced you to consider looking at testing for Pentaho Kettle.  In the next couple of posts I will examine various facets of testing with Pentaho Kettle, including building out code for a simple test.

Sign up for my infrequent emails about Pentaho testing.

Solution for Time Machine error “Unable to complete backup. An error occurred while copying files to the backup volume”

My SO has a Mac, and she was using Time Machine to back it up.  As someone who cut his teeth with Amanda backups back in the day, I find Time Machine to be a beautiful, intuitive backup solution.  Really, it is.

However, a while ago the backups stopped working.  She saw this error message: “Unable to complete backup. An error occurred while copying files to the backup volume”.

I googled around for it and saw this KB article from Apple, which wasn’t too helpful, as the only troubleshooting suggestion was a reboot (hmmm, sounds a bit like Windows).  I tried doing that multiple times, and still saw the error message.

So, we tried a different hard drive.  That still didn’t seem to work–same error message.

Finally, I did some more googling and ran across this forum post (yes Jeff Atwood, forums are indeed the dark matter of the web), which gave me additional troubleshooting tips.

Basically, if you are seeing this error with Time Machine:

  1. connect your time machine disk drive to your Mac
  2. turn off time machine by opening the time machine prefs and select ‘none’ for the backup disk
  3. open up your console
  4. click on ‘system.log’
  5. click ‘clear display’
  6. turn on time machine by opening the time machine prefs and selecting your disk drive
  7. watch the system log for errors that look like: Mar 9 12:14:14 computer-name /System/Library/CoreServices/backupd[905]: Error: (-36) SrcErr:YES Copying /path/to/file/file.name to (null) (a command-line alternative for finding these is sketched after this list)
  8. remove those files
  9. restart time machine by repeating steps 2 and 6.
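If you’d rather search the log than watch it scroll by, something like the following should turn up the offending files (assuming your system log lives at /var/log/system.log):

  # Look for the backupd copy errors in the system log
  grep backupd /var/log/system.log | grep "Error: (-36)"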

I am not sure how to fix the issue if you can’t remove those files.  The files that were causing issues for our Mac were IMAP mailbox files from Thunderbird, so I just uninstalled Thunderbird and removed the mailbox files.

Easyrec: a recommendation engine worth looking at

I love recommendation engines.  This is the software behind the “users who bought this also bought” recommendations that Amazon shows everywhere.

I love them because they are an easy way to leverage the wisdom of the crowd to help users.  They also get better the more data you feed into them, so once you set one up, it just makes your site better and better.

For a while, I’ve wanted to explore Mahout as a recommendation engine solution, but felt intimidated by how much work integration would be.  Luckily, I did a bit of searching and turned up this StackOverflow question about Java recommendation engines.

Looking at some of the alternatives, I dug up easyrec, an open source recommendation engine.  Rather than solving a couple of different machine learning problems like Mahout does, easyrec focuses solely on recommendations.

It also has a JavaScript API (for both sending information and displaying recommendations) and a demo installation you can use on your site, so it is trivial to integrate into a website to see if it works for you.  I did run into an issue with the demo server, but a post to the forums got it resolved in a few days.

Easyrec has support for generating recommendations for more than one kind of item (so if you want to display different recommendations within specific categories of an ecommerce site, that is possible) and is self-hostable in any Java container (which is recommended if you are going to use it in any commercial capacity).  You can also build the recommendations off of the following actions: views, ratings, or purchases.

You can also customize easyrec with Java plugins, though Mahout definitely offers far more options for configuration.

I haven’t noticed any speed changes to my site with the JavaScript installed, though I’m sure adding more remote JavaScript code didn’t speed up page rendering.  I did notice a small uptick in time on site after I installed it (on the order of 5%).

If you have a set of items that are viewed together, easyrec can leverage the wisdom of the crowds with not much effort on your part.  It’s not as powerful or configurable as the alternatives, but it is drop-dead simple to get started with.  It’s worth a look.

My ODesk experience

A few months ago, I had a friend who mentioned that he was investigating ODesk to find help for his software project.  I’d heard of ODesk before and was immediately interested.  I have a directory of Colorado farm shares which requires a lot of data entry in the early months of each year, as I get updated information from farmers.  So, I thought I’d try ODesk and see if someone could help with this task.

Because this was my first time, I was cautious.  I worked with only one contractor, and only used about 17 hours of her time.  We worked off and on for about 3 months.  She was based in the Philippines, so everything was asynchronous.  We communicated only through the ODesk interface (which was not very good for a messaging system).

I chose her based on her hourly rate (pretty cheap), skillset (data entry) and reviews (very good).  I made the mistake of inviting her to apply for the job while also letting others apply, and in the space of 3 days I had over 90 applicants for the position.

After I selected her and she accepted my offer, I created a data entry account and described what I wanted.  This was actually good for me, as it forced me to spell out in detail how to add, update or remove an entry, which is the start of the operations manual for my site.

After that, I managed each task I’d assigned to her through a separate email thread.  I did a light review of her work after she told me she was done updating, and we went back and forth a couple of times over some of the tasks.  In general, she was very good at following instructions, and OK at taking initiative (updating fields beyond what I specified, for example).  There were some bugs in my webapp that caused her some grief, and some updates I just did myself, as it was quicker to do them than to describe how to do them.

The variety of work you can get done via ODesk is crazy, and the overall ODesk process was very easy.  You just need to have a valid credit card.  If you are looking to start on a project right away, be aware that some lead time is required (they charge your card $10 to validate your account, and that takes some time to process).

Even though it didn’t save me a ton of time, it was a useful experiment and I’ll do it again next year.  For simple tasks that can be easily documented and outsourced, it’s a worthwhile option to explore.  Though be careful you don’t outsource too much!