Category Archives: Data Integration (Kettle)

Simple Chart from CSV File

There were some requests on the Pentaho Data Integration forums for an example of how to generate a simple chart from a CSV file.

Sometimes people get so lost in the technology that it’s tough to just do something pretty simple.  I totally get that.  Pentaho still has plenty of room for improvement on usability, especially for people coming to the platform for the first time.

Well, here ’tis.

Unzip examplecsv.zip to pentaho-demo/pentaho-solutions/samples/etl/ in the sample server.

Basically, the idea is to turn a CSV file (example.csv):

Year,PresentsNickReceived,PresentsRequested
2003,7,4
2004,9,8
2005,8,9
2006,11,8

into this chart

The confusing part, I’m guessing from the thread, was how data gets from KETTLE to PENTAHO.

Not hard at all actually.

Pentaho initiates the Kettle transform (i.e., calls Kettle’s API) and then “slurps” in-memory records from a specified step.  It’s the equivalent of the UNIX “tee” utility, where you’re just watching data arriving at a certain place.  In this example, I’ve made it even more explicit by naming the “dummy” step “for_pentaho” so that it’s clear which step Pentaho is “slurping” the data from.

After that, it’s just a matter of building a chart like any other in the platform.

Command line ETL Job Execution

I know this might seem pretty obvious to those who use Kettle frequently, but there’s a VERY easy way to execute Kettle jobs at the command line. Kitchen is the command line interface and is quite convenient for executing that ETL job you’ve built. Crontab anyone?

kitchen.sh -file=/mnt/pentaho-professional/pentaho-solutions/software-quality/data/etl/jira_do_everything.kjb
-OR-
kitchen.bat -file=c:\dir\jira_do_everything.kjb
kitchen.bat /file:"c:\dir\jira_do_everything.kjb" (from comments below, thanks!!!)
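
To schedule it with cron, an entry along these lines would do the trick (just a sketch; the 2 AM schedule, install path, and log location are placeholders for whatever you use):

# hypothetical crontab entry: run the job nightly at 2:00 AM and append the output to a log
0 2 * * * /opt/pentaho/kettle/kitchen.sh -file=/opt/etl/jira_do_everything.kjb >> /var/log/kettle/jira_etl.log 2>&1

Depending on your install, you may need to cd into the Kettle directory first so kitchen.sh can find its libraries.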

Does anyone use kitchen or pan and have any best practices or suggestions to offer?

Kettle and Pentaho: 1+1=3

Like all great open source products, Pentaho Data Integration (Kettle) is a functional product in and of itself.  It has a very productive UI and delivers exceptional value as a standalone tool.  Most pieces of the Pentaho platform reflect a desire to keep the large communities around the original projects (Mondrian, JFree, etc.) engaged; they are complete components in and of themselves.

When used together, their value for building solutions increases and exceeds their use independently.  I’ll be the first to admit that Pentaho is still fairly technical, but we’re rapidly building more and more graphical interfaces and usability features on top of the platform (many in the open source edition, but much is in the professional edition).  Much of this work involves making the "whole" (Pentaho) work together to exceed the value of the pieces (Mondrian, Kettle, JFree, …).

A few reasons immediately come to mind why Pentaho and Kettle together provide exceptional value compared to using them individually or with another open source reporting library:

  1. Pentaho abstracts data access (optionally) from report generation, which gives report developers the full POWER of Kettle for building reports.

    There are some things that are tough, if not downright impossible, to do in SQL.  Ever do an HTTP retrieval of an XML doc, slurp in a custom lookup from Excel, do a few database joins, and perform analytical calculations all in one SQL statement?  I bet not.  Report developers are smart data dudes; having access to a tool that allows them to sort/pivot/group/aggregate/lookup/iterate (the list goes on and on) empowers report developers in a way that a simple "JDBC" or "CSV" or "XQuery" source alone cannot.
    How is this made possible?
    Pentaho abstracts (optionally, it isn’t forced on customers) the data retrievals to lookup components.  This allows BI developers to use either a SQL lookup (DB), XQuery lookup (XML), MDX lookup (OLAP), or Kettle lookup (EII) to populate a "ResultSet."  Here’s the beauty: reports are generated off a result set instead of directly accessing the sources.  This means that a user can use the same reporting templates, framework, designer, etc. and feed/calculate data from wherever they desire.  It truly opens a world of possibility where before there was "just SQL" or "ETL into DB tables."

  2. Ability to manage the entire solution in one place

    Pentaho has invested greatly in the idea of the solution being a set of "things" that make up your BI, reporting, and DW solution.  This means you don’t have ETL in one repository, reports managed somewhere else, scheduling managed by a third party, etc.  It’s open source so that’s obviously a choice, but we can add much value by ensuring that someone who has to transform data, schedule it, email and monitor, secure, build reports, administer email bursting, etc. can do so from one "solution repository."  Managing an entire BI solution from one CVS repository?  Now that’s COOL (merge/diff/patch anyone?).

  3. Configuration Management

    Kettle is quite flexible; the 2.3.0 release extends the scope and locations where you can use variable substitution.  From a practical standpoint this means that an entire Chef job can be parameterized and called from a Pentaho action sequence.  For instance, because you can do your DW load from inside Pentaho action sequences, you can secure it, schedule it, monitor it, initiate it from an outside workflow via web service, etc.  In one of my recent Kettle solutions ALL OF THE PHYSICAL database, file, and security information was managed by Pentaho, so the Kettle mappings could literally be moved from place to place and still work inside of Pentaho.

  4. Metadata and Additional Integration

    Pentaho is investing in making the tools more seamless.  In practice (this is not a roadmap or product direction statement), this means being able to interact with tables, connections, and business views inside of Kettle in an identical (or at least similar) way in the report designer.  For example, if you’ve defined the business name for a column to be "Actual Sales," Kettle and the Report Designer can now key off that same metadata and present a "consistent" view to the report/ETL developer instead of expecting them to know that "ACT_SL_STD_CURR" is actual sales.
    Another example is the plan to do some additional Mondrian/Kettle integration to make the building of Dimensions, Cubes, and Aggregates easier.

Roland Bouman on Pentaho Data Integration

I just came across Roland Bouman’s blog.

He has an excellent write up on how to get started with Pentaho Data Integration (aka Kettle):

Kettle is a free, open source (LGPL) ETL (Extraction, Transformation and Loading) tool. The product name should actually be spelled as K.E.T.T.L.E, which is a recursive acronym for "Kettle Extraction, Transport, Transformation and Loading Environment".

….

An interesting feature of Kettle is that it is model-driven. Both Spoon and Chef offer a graphical user interface to define the ETL processes on a high level. Typically, this involves no actual programming at all – rather, it’s a purely declarative task which results in a model.

He wonders at the end about how Kettle is deployed inside of Pentaho.  That’s a great question and something Pentaho/Matt have been working on over the past few months.  Perhaps I can shed a bit of light on this.

In the latest build of Pentaho (I used 1.1.6 for the screenshots below) we ship an example of a Kettle mapping returning a result set, which demonstrates the great architectures of both Kettle and Pentaho.  Kettle provides an easy way to create plugins and interfaces for steps, allowing Pentaho to access data at the "end" of a transformation.  Pentaho has a multipurpose result set object which allows reports to key off of "data" instead of a SQL result or an MDX result, or in this case, a real-time data slurp (EII result?!?).

The transformation in spoon looks like this:

It reads data from the Pentaho sample database (QUADRANT_ACTUALS), filters it, does some calculations, and places the results in an XML Output operator.  This output operator is superfluous; the KettleComponent (from Pentaho) sources the data DIRECTLY from the in-memory XML Output object in Kettle.

The ETL example is basic; it doesn’t even place the data slurped from Kettle into a nicely formatted report.

Just to be clear on what you’re seeing…

Pentaho users gain the full power of the Pentaho Data Integration tool (proper ETL sorts, groups, XML, CSV, filters, calculations, database lookups, aggregations, etc.) when it is used as a SOURCE for their reports.  A full ETL tool where you don’t HAVE to persist the data to tables or files or anything.

If it suits your needs, you can simply run your ETL every time your report is run, and you never have to build a staging area, data warehouse, or OLAP cubes.  Your mileage may vary, and I don’t think this is a wholesale replacement for a data warehouse at all!  Just a great way to provide additional options for BI developers.

Going beyond a single transformation, we are currently augmenting this component to allow you to execute Chef jobs from inside Pentaho.  I have no idea when that code will be included in a release, but I’ll be sure to blog about it here.  That means you’ll be able to schedule your "DW load process" or "Data Integration Routines" to run from inside Pentaho.  I’ll post an example of that shortly…

Again, check out Roland’s article!  It’s a great addition to the Pentaho blogosphere.

UPSERTS using Kettle/MySQL

Oracle Readers: Think MERGE Statement. 🙂

It’s common to need to generate surrogate keys as part of a Data Warehouse / Business Intelligence solution. Protecting your users from source system changes and using time-variant data are just a couple of the reasons why one needs to generate a surrogate key when loading data into a DW.

Here is a simple example. I’m loading Product data from an XML file into a MySQL table. The transformation in Kettle looks like this:

upsert_kettle_transform.png

The output coming out of the XML file has been flattened using XQuery so that it looks like a flat table structure. Here’s a screen capture of a nifty feature of Kettle where you can preview the output of a step:

upsert_xml_data.png

OK, this is our source data. The ID you’re seeing is the OLTP key. We’ll refer to it as the Natural Key. This is the ORDER_ID, or CUSTOMER_ID, or well, you get the idea. This is the key that your source application uses in its relational structures.

What we want is to INSERT or UPDATE (aka UPSERT) this record into our DW. We’ll INSERT if we don’t already have this NATURAL key in the DW and generate a surrogate primary key. We’ll UPDATE if we already have this NATURAL key in the warehouse.
Our table looks like:
upsert_product_erd.png

The DDL to create the TABLE is important:

CREATE TABLE INT_PRODUCT (
  PRODUCT_ID INTEGER NOT NULL AUTO_INCREMENT,
  PRODUCT_NAT_ID INTEGER NOT NULL,
  TITLE TINYTEXT,
  CONSTRAINT INT_PRODUCT_PK PRIMARY KEY (PRODUCT_ID),
  UNIQUE INT_PRODUCT_NK (PRODUCT_NAT_ID)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Notice that our PRIMARY KEY is the surrogate we’ve set up as an AUTO_INCREMENTing column. This means that MySQL will generate a surrogate when we INSERT a record. All we have to do is get Kettle to match our unique constraint (INT_PRODUCT_NK) and either INSERT or UPDATE.
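
For the curious, here’s roughly the hand-written MySQL equivalent of what we want for a single incoming row (just a sketch with a made-up title value; the Kettle Insert / Update step described below does its own lookup rather than using this syntax):

-- Hypothetical single-row UPSERT against INT_PRODUCT:
-- if PRODUCT_NAT_ID 1 is new, MySQL inserts the row and auto-generates PRODUCT_ID;
-- if it already exists, the UNIQUE key INT_PRODUCT_NK fires and only TITLE is updated.
INSERT INTO INT_PRODUCT (PRODUCT_NAT_ID, TITLE)
VALUES (1, 'Some Product Title')
ON DUPLICATE KEY UPDATE TITLE = VALUES(TITLE);

This is also the MySQL cousin of the Oracle MERGE mentioned above.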

I’ve used the INSERT / UPDATE operator in Kettle. The configuration looks like this:
upsert_upsert_config.png
What we’re doing is configuring Kettle to do a lookup on the table to see if the natural key already exists. If it does, Kettle UPDATEs the fields listed in the lower window; if it doesn’t, it INSERTs a new record.

After running the transformation, my INT_PRODUCT table looks like this.
upsert_surrogate.png
Now, I realize I didn’t choose very illustrative natural keys (they’re all 1, 2, 3), but the PRODUCT_ID comes from MySQL as a surrogate.

Subsequent executions of this Kettle transformation will INSERT new records as they arrive or UPDATE existing ones. MySQL and Kettle in combination make quick work of generating these surrogates!

I invite Kettle experts to chime in on any optimizations/quirks.