RailsConf 2008

Reproducible ETL (aka: "Evaluating ActiveWarehouse-ETL vs. a bunch of GUI bloatware")

Posted by rick Mon, 23 Jul 2007 13:29:00 GMT

Our data migration problem

At my day job (large mental healthcare non-profit) we’re pushing out a Rails-based electronic health record. We have a number of partners who will be using the same system, hosted separately, and contributing data into a common anonymized warehouse of clinical research data. Every installation will end up with not only a production database for day-to-day transaction use, but also a read-only reporting database to offload longer-running report queries onto. Of course current and ongoing development requires database instances for developers, for testing, and for staging prior to production release. Some of the data which feeds the production and development instances comes from static sources (fixture files or similar), while other data comes from conversion of data from a legacy database with significantly (bizarrely, some (e.g., I) might say) different layouts. All of this constitutes a significant data management problem (each of the arrows in the below diagram comprises a data migration task that varies in some way from each of the others).

We have been converting data on a nightly basis for well over a year now, but a couple of months ago, the multiplication of data sources and sinks, along with schema migrations, began to get overwhelming enough that we couldn’t keep up and the ball was, as they say, “dropped”. Realizing the scope of the data management problem, we sketched out what it would take to manage our data if we had to write a tool to do it ourselves. We took a look around at what products (open source and proprietary) were available for data migration (or as the martketroids call it, “ETL”—“extract, transform, and load”). In the midst of all this a few of us went to RailsConf, where I was able to sit in on Anthony Eden’s talk Data Warehouses with ActiveWarehouse. Part of ActiveWarehouse is an ETL tool, ActiveWarehouse-ETL, which looked interesting from we come from (Ruby development). If we had to write our own we would’ve written it in Ruby, with an eye towards testing, automation, reproducibility, and version control of the ETL specs, so finding a Ruby tool for ETL already out there seemed like a step in the right direction.

Learning via reproducible data migration

In doing our research on ActiveWarehouse-ETL we noticed that the documentation is a bit weak but getting better. After grovelling around the web for a while we decided that it was unclear whether AW-ETL was mature enough to handle the data migration work we wanted to throw at it, and we had questions as to whether the ETL tool was too geared for data warehousing to do data migration in a mostly transaction database environment. That is, only one of the many databases with which we’ll be working will have any data warehouse “cubes”, “dimensions”, “facts”, etc. The rest are Rails-ish or reporting-ish standard databases. With this in mind we set out to evaluate AW-ETL systematically.

Thought there’s been some talk of how to unit test an AW-ETL transformation (specified as a “control file”), nothing sufficiently concrete was available. We set about writing an automation tool to allow us to try different types of data migrations. We have gone TDD (and some might say we are sometimes BDD) on our project, so we wanted to be able to specify the behavior of the migration via tests before writing our control files. We mocked something up with some shell scripts and then quickly moved to a Rakefile which would manage our databases and run our tests for ETL, throwing an exception whenever we have a test failure.

We group our tests into “scenarios”, which are simply directories with enough information to specify database schemas, table data, the ETL transformation(s) to use, and the tests we are trying to satisfy:

(If you are interested in downloading and playing with/using this little AW-ETL testing framework, you can find a tarball here)

This turned out to be a valuable methodology, as the documentation was even weaker than it appeared at first glance. While some areas are sufficiently documented, when we would need an example most none were easy to find. We did a lot of spelunking through the AW-ETL code to get us by. We covered some of the basic ground over the course of a day of testing, but ultimately were stumped when it came to figuring out how to preserve a foreign key relationship in a migration. We contacted the AW mailing list, which seems to be responsive, judging from the archives, but so far haven’t made any progress towards figuring out how to specify a foreign key relationship in an AW-ETL control file.

How do we represent a foreign key relationship?

Here’s the core problem ,which is pretty critical for us to be able to solve if we are going to be able to add new “core data” to systems which have been running in production for some time already. Given two tables in a source database and the same two tables in a destination database, where one of the source tables has a column which references the primary key (“id” column in a normal Rails table schema) of the other table, we want to copy our source records into the destination database, but we want to make sure the foreign key relationship between the two tables remains intact, even if the primary keys on the destination database vary wildly. Maybe a picture would help:

So, put a couple of records for each table in the source database, as well as putting an existing record in the destination database. What we want to see after the ETL process is the following:

That is, the source database stays the same, but the destination database now has 5 records in total: the existing 1 record that we started with, and the 4 new records we brought over from the source database.

In our little framework we can specify this with a test like so:

require File.dirname(__FILE__) + '/../test_helper.rb'

class ConversionTest < Test::Unit::TestCase
  def test_conversion_should_copy_data_from_source_to_destination
    with_db 'data_destination_development' do
      assert_equal 2, ActiveRecord::Base.connection.select_value("select count(*) from foos").to_i
      assert_equal 3, ActiveRecord::Base.connection.select_value("select count(*) from bars").to_i
    end
  end

  def test_conversion_should_leave_source_data_intact
    with_db 'data_source_development' do
      assert_equal 2, ActiveRecord::Base.connection.select_value("select count(*) from foos").to_i
      assert_equal 2, ActiveRecord::Base.connection.select_value("select count(*) from bars").to_i
    end
  end

  def test_primary_keys_should_be_distinct
    with_db 'data_destination_development' do
      assert_equal [1, 2], ActiveRecord::Base.connection.select_values("select id from foos").sort.map(&:to_i)
      assert_equal [1, 2, 3], ActiveRecord::Base.connection.select_values("select id from bars").sort.map(&:to_i)
    end
  end

  def test_additional_data_should_be_converted
    with_db 'data_destination_development' do
      assert_equal ['chronic', 'buddha', 'four twenty'], ActiveRecord::Base.connection.select_values("select name from bars order by id")
    end
  end

  def test_foreign_key_relationship_should_be_preserved
    with_db 'data_destination_development' do
      assert_equal [3, 2], ActiveRecord::Base.connection.select_values("select bar_id from foos order by id").to_i
    end
  end
end

After extensive code groveling and experimentation (the documentation is weak in this particular area), the best we’ve been able to get is the following:

Which, isn’t going to suffice since the foreign key references are now misdirected. What we want is to be able to say “hey, the natural key to look up the bar on the other end of this foo.bar_id is the bar.name column, so when you find a matching record with that key, use its id for the foo.bar_id” ... but, we’ve been unable to figure out how to do that in AW-ETL, or if it can be done.

Why bother with all this?

We’re in the processing of choosing a tool for our data migration needs. During the development process (which, as long as the application is in use, will never truly end) developers are highly concerned with issues relating to data—ultimately they want to know what “buckets” data falls into, what are the steps they need to follow to make sure they don’t trash development or production data, and how do we minimize the complexity of the data we have to manage.

The organization is concerned with a bigger issue—with this system running in many instances, how do we have confidence that our data is correct, that reporting and warehousing have clearly defined (and well optimized) schemas, that the data we have can answer the questions we ask, and that we won’t suffer downtime due to data migration problems.

These two perspectives are related but have differing priorities, and different needs which may even be irreconcilable. We’re finding it difficult, in surveying the ETL tools on the market, to find a tool which is usable by the developers on a daily basis, but which is also usable at the organization level in a comprehensible way. Or, put another way, the way that a decent-sized organization (> 1000 employees) tends to use data tools is very different from the way a development team of a dozen tends to use data tools.

What am I talking about? We’ve seen something about how we can take a tool like AW-ETL (whether or not it can handle a foreign key relationship properly—my guess is that it can but the documentation is too sparse and the code base too unfamiliar for us to find it) and subject it to modern agile methodologies: test-driven development, version control, pairing, continuous integration (after all, we have a single command that we can use to launch all our ETL tasks: rake). Given that AW-ETL’s control files are essentially code, and most likely, can even be generated by code in an appropriate ActiveRecord-aware Ruby DSL, we can apply all the techniques we have adopted for the management of code in our team. Productivity and reproducibility are clear benefits.

On the other hand, when we examine other ETL tools on the market we find a different set of features and tradeoffs. Of the general-purpose ETL tools we can find we quickly rule out database-specific tools that tie us to a single vendor’s database technology. In the next pass we curb those which are clearly immature or unworkable after a quick evaluation. Remaining are the set of ETL products which typically support a wide range of extractions (XML, database tables from any number of different databases, various file formats, HTTP requests, SOA services, etc.), a wide range of transformations, and a wide range of load targets (including various non-database targets). Some of these tools have various automation concepts built in, ranging from job sequencing to load clustering to script execution. In other words, they are leagues ahead in terms of overall ETL functionality than anything we would write ourselves, as well as anything AW-ETL is likely to have in the near future, just due to the disparity in the number of paid developers working on these bigger tools (these tools are typically part of a suite of other “business intelligence” (heh), reporting or data mining tools, and there’s mad loot at stake in there somewhere).

But, as we look at how these tools can be used, we tend to find the same things over and over. The primary interface is an all-powerful GUI, sometimes a Windows-only interface, but usually a Java interface that runs on a number of platforms, including the Linux/FreeBSD/OSX platforms we use in-house. There will sometimes be some sort of scripting interface built into the GUI, which suffers from the same problems scripting interfaces tend to suffer from: the authors think Visual Basic is frikkin’ HOT, their language is completely home-grown, there are no good editors, no good testing or version-control tools, the libraries you must draw on are weird and subject to change at every release, etc. These tools will typically have some sort of “automation” functionality, either built into the GUI itself (which presumes you run a server somewhere on your network and let the ETL beast run its own little kingdom), or a very limited “start me as a batch script / cron job” hook somewhere. Here is one of the more compelling (in a “please don’t let Zeldman or Jason Fried see this” sort of way) scheduling interfaces I ran across last week:

That’s what I’m talking about.

Anyway, most of the tools are written in Java, and therefore some of them expose a “Java API”. Downsides: the APIs are typically horrible, the APIs are brittle (i.e., when version 5.3 comes out the stuff you used in 5.2 doesn’t typically work—hope you had a test suite, thx), and often the gap between the basic concepts you’d want to automate and how they are represented in the GUI is so large as to be impassible. Here’s an example from one of the better APIs, and maybe I’m just spoiled by Ruby, but doesn’t it seem like that should be, oh, 5 lines in a decent Ruby DSL?

Back to the point (and you thought there wasn’t one!)... The tools we find, if they have the functionality necessary to handle non-trivial migrations, are geared towards an obsolete, unworkable, and unproductive process of managing data. Enterprise(y) data management is geared around a very small number (ideally, 1) of “experts” using a GUI interface to manage large amounts of data graphically. This means that, without massive effort, the migration definitions are mostly invisible, the process is mostly unmeasurable, and the migrations are mostly untestable (before running, and after running). Without massive effort, the user of the tool has little knowledge of state: what the current state of data is, what would happen if we ran a migration, and what happened the last time we ran one. The changes between one migration set and the next (i.e., the output of the user’s work) are not version-controlled, which means that errors go undetected, it is nearly impossible to “branch” the migration sets for releases or other purposes, rollback is a snapshot-at-best process (a friend of ours calls this “save-game”/”load-game” functionality, and we’ve considered giving a PlayStation + joystick interface to any tools we encounter which have this level of quality assurance).

In short, without applying massive effort, using these tools in your organization has the warm and fuzzy feel of adopting the 1995-era “let’s bet our company’s future on having our file clerks hack our workflow out in a gnarly Visual Basic app that will one day hang like a lead albatross around our collective necks” so-called “strategy”. It should go without saying that an organization willing to grab such a tool and run with it is not typically the organization which sees the point in applying massive additional effort to data migration.

What we’re looking to do is to find a sweet spot, some middle ground. How can we apply modern software methodologies to the process of data management, while using a tool which is powerful enough to do the job, but affords us the ability to actually run it in a lightweight manner and to write a data migration specification that is short, visible, and readable? Perhaps this is too tall an order. If I have good news I’ll let you know :-/

Tags , , , , , ,  | no comments


banned vocabulary
"+1" (and "-1")
"existential"
"onboarding"
"ferret"
"finesse"
"yeah."
<blank> <units> thin
<blank> warriors
<blank> years young
<blank>'s team
<x> of the moment
(it|)'s all good
(noo|new)b(|ie|y)
I.T.
ROI
P.D.I.
[web] portal
accountab(le|ility)
actuate
advocate (v. and n.)
anyhoo
assessment
belief system
best practice(s)
best practice(s)
blog(|ger|ging)
business rules
cautiously optimistic
celebrate
closure
construct (n.)
creative(s) (n.)
dialogue
divers(e|ity)
diversity
document (n.)
emerg(ing|ent)
emoticon(s)
enabler
eponymous
everyday heroes
extreme <blank>
facilitate
faith-based
foment
gestalt
git 'r done
gradation(|s) [sic]
guiding vision
hoi polloi
human drama
ill-fated
incentivise
jejune
kerfuffle
killer app
kudos
leverage
marginalize(d)
matriculate
merch
monetize
mouth-feel
multitask(|ing|er) (n.t.)
n(oo|ew)b(ie|)
network(|ing) (n.t.)
nexus
outsider art
podcast(ing)
proggy
protocol
quantum leap
reflect (v.)
repurpose
revamp
river system
schadenfreude
sea change
shopping (etc.) culture
shout-out
some <blank>-action
sophomore effort
strategic repositioning
synergy
team members / partners
the <blank> arena
the <x> Street
tix
under 30-set
value system
vertical
where('|i)s the outrage?
win-win
winders
Weltanschauung