Monthly Archives: May 2006

OLAP Survey

I noticed some other bloggers have already posted about this, but I’ll encourage people as well. The more people who respond, the more accurately the survey represents reality!
“We would very much welcome your participation in The OLAP Survey 6. This is the largest independent survey of business intelligence users worldwide. The Survey will obtain input from a large number of users to better understand their buying decisions, the implementation cycle and the business success achieved. Both business and technical respondents are welcome.

The OLAP Survey is strictly independent. While Oracle, Microsoft, Cognos and other vendors assist by inviting users to participate in the Survey, the vendors do not sponsor the survey, nor influence the questionnaire design or survey results. As a participant, you will not only have the opportunity to ensure your experiences are included in the analyses, but you will also receive a summary of the results from the full survey. You will also have a chance of winning one of ten $50 Amazon vouchers. Click here to complete the survey on-line.”

Mark Rittman on Oracle BI Suite EE

Mark Rittman has worked up an excellent article on the new Oracle BI Landscape. He basically covers the whole range of products and technologies and how they’re being repackaged and rebranded into a new product set. Great read.
I thought the following was of interest:

The Analytic Server is effectively a ROLAP (Relational OLAP) engine, that connects to data sources but does not have any local storage in the traditional, database sense, although it does have a cache which holds copies of data previously retrieved, so that subsequent requests for the same data are returned faster.

What struck me about this statement is that, from a technology perspective, this is nearly identical to Mondrian‘s (the open source OLAP engine) architecture. Proprietary vendors (Oracle included) have long touted the benefits of a true MOLAP solution over plain ROLAP. However, with processing headed to mid-tier servers (instead of OLAP crunching happening on a Pentium I desktop), one can leverage the mid tier for caching and analytical processing (MOLAP-esque stuff). In other words, you can get MOST of the benefits of MOLAP with a well designed and performant ROLAP server proactively caching/crunching on your behalf! Cool stuff!

UPSERTS using Kettle/MySQL

Oracle Readers: Think MERGE Statement. 🙂

It’s common to need to generate surrogate keys as part of a Data Warehouse / Business Intelligence solution. Protecting your users from source system changes and supporting time-variant data are just a couple of the reasons why one needs to generate a surrogate key when loading data into a DW.

Here is a simple example. I’m loading Product data from an XML file into a MySQL table. The transformation in Kettle looks like this:

[Screenshot: upsert_kettle_transform.png]

The output coming out of the XML file has been flattened using XQuery so that it looks like a flat table structure. Here’s a screen capture of a nifty feature of Kettle where you can preview the output of a step:

[Screenshot: upsert_xml_data.png]

OK, this is our source data. The ID you’re seeing is the OLTP key; we’ll refer to it as the Natural Key. This is the ORDER_ID, or CUSTOMER_ID, or, well, you get the idea. This is the key that your source application uses in its relational structures.

What we want is to INSERT or UPDATE (aka UPSERT) this record into our DW. We’ll INSERT if we don’t already have this NATURAL key in the DW and generate a surrogate primary key. We’ll UPDATE if we already have this NATURAL key in the warehouse.
Our table looks like:
[Screenshot: upsert_product_erd.png]

The DDL to create the TABLE is important:

CREATE TABLE INT_PRODUCT (
  PRODUCT_ID     INTEGER NOT NULL AUTO_INCREMENT, -- surrogate key, generated by MySQL
  PRODUCT_NAT_ID INTEGER NOT NULL,                -- natural key from the source system
  TITLE          TINYTEXT,
  CONSTRAINT INT_PRODUCT_PK PRIMARY KEY (PRODUCT_ID),
  UNIQUE INT_PRODUCT_NK (PRODUCT_NAT_ID)          -- at most one row per natural key
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Notice that our PRIMARY KEY is the surrogate we’ve set up as an AUTO_INCREMENT column. This means that MySQL will generate a surrogate key when we INSERT a record. All we have to do is get Kettle to match on our unique constraint (INT_PRODUCT_NK) and either INSERT or UPDATE.
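
Incidentally, if you wanted the same effect in raw MySQL, INSERT … ON DUPLICATE KEY UPDATE gives you equivalent behavior. This isn’t what Kettle does under the covers (the step performs a lookup and then an INSERT or UPDATE), and the values below are made up, but it’s a handy sketch of the same upsert logic:

-- Upsert by hand: insert a new row, or update TITLE when PRODUCT_NAT_ID
-- already exists. The duplicate is detected via the INT_PRODUCT_NK unique
-- constraint, and PRODUCT_ID is filled in by AUTO_INCREMENT on insert.
INSERT INTO INT_PRODUCT (PRODUCT_NAT_ID, TITLE)
VALUES (1, 'Some Product Title')
ON DUPLICATE KEY UPDATE TITLE = VALUES(TITLE);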

I’ve used the Insert / Update step in Kettle. The configuration looks like this:
[Screenshot: upsert_upsert_config.png]
What we’re doing is configuring Kettle to do a lookup on the table to see if the natural key already exists. If it does, we’ll perform DML (insert or update) on the fields in the lower window.

After running the transformation, my INT_PRODUCT table looks like this:
[Screenshot: upsert_surrogate.png]
Now, I realize I didn’t choose very illustrative natural keys (all are 1, 2, 3), but the PRODUCT_ID comes from MySQL as the generated surrogate.

Subsequent executions of this Kettle transformation will INSERT new records as they arrive or UPDATE existing ones. MySQL and Kettle in combination make quick work of generating these surrogates!

I invite Kettle experts to chime in on any optimizations/quirks.

Connect to your own Database

I get asked frequently by people who evaluate Pentaho: how do I get started? Many people (tens of thousands on a monthly basis) download the Pre-Configured application and immediately get a sense of the benefits of the Pentaho open source project.

What people almost always want to do next is get started with their OWN data. You, the reader of this blog, might find a pre-canned demo interesting, but you’ll want to reserve judgment on the leading open source BI project until you’ve seen your own data flowing into beautiful PDFs/Excel/OLAP pivot tables. I totally understand…

This is a very SIMPLE, BASIC, LIGHTWEIGHT guide on how to get the downloadable DEMO connected to YOUR DATABASE and executing YOUR SQL Query using the lynchpin of the Pentaho platform, an Action Sequence (refer to this blog for more on what Action Sequences are).

Disclaimer: Pentaho has development tools that, once installed, make the development of Action Sequences more productive than editing XML as below. In other words, this is just rough and dirty to get you going and connected… I highly suggest downloading the Development Workbench to actually build Action Sequences.

Step 1:
Collect the following information for your database.

  • Vendor (Oracle/MySQL) provided JDBC Jar file (classes12.zip, mysql-connector-java-3.1.12-bin.jar)
  • JDBC Connection URL (jdbc:mysql://localhost/database)
  • JDBC User Name (scott)
  • JDBC Password (tiger)
  • A SQL query that you are certain works in your database (select id, title from status)

This is the information needed for ANY Java Application to connect to a database, and is not really Pentaho specific.

Step 2:
Create (using notepad or vi) a file named myfirst-ds.xml in $DEMO_BASE/jboss/server/default/deploy/

Put the following into that file to create a JNDI datasource with the JNDI-NAME “myfirst”

[Screenshot: myfirst_001.png]
Note: You need to replace the above items with the corresponding JDBC information you collected in the first step.
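
For reference, a JBoss 4.x datasource descriptor for MySQL generally takes the following shape. Treat this as a sketch only (the screenshot above shows what the demo actually expects), and swap in the connection URL, user name, and password you collected in Step 1:

<?xml version="1.0" encoding="UTF-8"?>
<!-- myfirst-ds.xml: binds a JNDI datasource named "myfirst" in JBoss.
     The connection details below are placeholders from Step 1. -->
<datasources>
  <local-tx-datasource>
    <jndi-name>myfirst</jndi-name>
    <connection-url>jdbc:mysql://localhost/database</connection-url>
    <driver-class>com.mysql.jdbc.Driver</driver-class>
    <user-name>scott</user-name>
    <password>tiger</password>
  </local-tx-datasource>
</datasources>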

Step 3:
Edit (using notepad or vi) the file $DEMO_BASE/jboss/server/default/deploy/pentaho.war/WEB-INF/web.xml.
Add the following right below the Shark Connection entry

[Screenshot: myfirst_002.png]
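
The web.xml entry follows the standard J2EE resource-ref pattern. As a rough sketch (the screenshot is authoritative, web.xml is picky about element order, and I’m assuming the jdbc/myfirst naming here), it looks something like:

<!-- web.xml: declare the "myfirst" datasource so the Pentaho web app can look it up -->
<resource-ref>
  <description>myfirst</description>
  <res-ref-name>jdbc/myfirst</res-ref-name>
  <res-type>javax.sql.DataSource</res-type>
  <res-auth>Container</res-auth>
</resource-ref>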

Step 4:
Edit (using notepad or vi) the file $DEMO_BASE/jboss/server/default/deploy/pentaho.war/WEB-INF/jboss-web.xml.
Add the following right below the Shark entry
[Screenshot: myfirst_003.png]
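
As a similar sketch, the jboss-web.xml entry maps that resource-ref onto the JNDI name JBoss bound in Step 2 (again, the screenshot is the real reference, and jdbc/myfirst mirrors the assumption above):

<!-- jboss-web.xml: map the resource-ref to the JBoss JNDI binding from Step 2 -->
<resource-ref>
  <res-ref-name>jdbc/myfirst</res-ref-name>
  <res-type>javax.sql.DataSource</res-type>
  <jndi-name>java:/myfirst</jndi-name>
</resource-ref>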

Step 5:
Copy your vendor-provided JDBC driver library (classes12.zip) into the following directory:
$DEMO_BASE/jboss/server/default/lib

Congratulations, you’ve set up your JNDI datasource! Whew… Almost there.

Step 6:
Let’s create your first action sequence!
Create (using notepad or vi) a file named MyFirst.xaction in $DEMO_BASE/pentaho-solutions/samples/datasources
[Screenshot: myfirst_004.png]
Make sure to replace the SELECT I have above with a SELECT statement that you KNOW works in your database.
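
If you’re curious what the XML in the screenshot boils down to, the heart of a SQL-query Action Sequence is an action-definition that wires the SQLLookupRule component to your JNDI datasource. I’m writing this from memory, so take it as an approximation (element names have shifted between platform versions) and let the workbench generate the real thing:

<!-- inside the <actions> block of the .xaction: run a query against "myfirst" -->
<action-definition>
  <component-name>SQLLookupRule</component-name>
  <component-definition>
    <jndi>myfirst</jndi>
    <query><![CDATA[select id, title from status]]></query>
    <live>true</live>
  </component-definition>
</action-definition>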

Step 7:
Restart your server using the stop-pentaho/start-pentaho commands in $DEMO_BASE.

Step 8:
Open up Firefox (cough IE) and hit the following URL:
http://localhost:8080/pentaho/ViewAction?&solution=samples&path=datasources&action=MyFirst.xaction

You should see the basic output of the Action Sequence which, in this case, is the results of the SQL you provided.

Action Successful


id title
1 Open

That’s it! You’re connected to your own datasource and now you can start to explore the immense power of Action Sequences with your data… There are approximately 50 example .xactions in the demo, so use them as a reference when exploring the features of the platform. I HIGHLY SUGGEST downloading the workbench as well. It’s a visual environment for authoring .xaction files. Trust me, you’ll appreciate the GUI!

Move to WordPress

I’ve made the switch from Movable Type to WordPress.  I’ve used Movable Type for more than two years now, quite happily actually, but thought it high time to move to WordPress.  I’m rather impressed with WordPress thus far; very competitive product all in all.  Some things in the installation were even EASIER than MT.

I turned on comments/trackbacks on my MT blog approximately 7 weeks ago.  In those 7 weeks, I received more than 2,000 spam comments, trackbacks, etc.  I plan to use a spam service on this blog to prevent such abuse in the future.

If you are encountering any feed or viewing issues please do let me know.