
Pentaho Tech Tips: Call for prioritization

Open source is democratic, open, real.

While I have a good sense of which Tech Tips would be useful, I’d also like to ask the community which tips they’d like to see written up:

  • Mondrian: Star Schema to OLAP cubes
    Using a very basic Star Schema with a Fact and two Dimensions, show how this is built into a Mondrian cube and how to build a “Pivot view” Pentaho report.
  • Mondrian: Advanced MDX
    Sets, top, running totals, etc
  • Kettle: Portable ETL
    Showing how to use parameter injection to make your Kettle solution (Jobs and Transforms) executable inside of Pentaho.
  • Kettle: Custom rollups using Excel
    Showing how to build a dimension, reporting table, etc using a very easy to use interface for business users.
  • Reporting: List of Values
    Show how to use the most unfortunately named Secure Filter component to do list of values (even though you are not REQUIRED to do any security).  Not very eloquent but the suggestion has been to call it a “Prompt For” component (see below).  Think “parameter page” driven by “select distinct name from my_reporting_table.”
  • Report Designer: How to build reports with Charts
    The latest release included the charting expressions so now one can build reports with lovely looking charts.
  • Report Designer: How to pass “Pentaho” parameters to reports
    This allows the building of drill thru parameters, titles, and other “context” from the server
  • Pentaho Spreadsheet Services: Your data looking sexy in Excel
    A quick how-to on getting an instant Excel analytic interface into ANY database.  Example with Oracle XE.

Comments are ON… vote, have your say.  I WANT to do all of these, and will, eventually.  What do YOU want to see?

Microsoft doing good things with their money!

I’ll pay some praise to the gorilla from Redmond:

Brilliant and Hilarious shorts featuring Ricky Gervais of Office fame:

David Brent rules!

Great use of those profits!  🙂

Open Source is agile

I’m not talking about the methodology in particular; I’m just saying that open source is agile compared to traditional software engineering practices, with customer advisory boards vetting major features, rounds of marketing approvals of features, etc.

For instance, I submitted a Jira case to the Pentaho development staff to include a jar in our demo application that is needed to run certain Pentaho Data Integration mappings.  Within 20 hours the jar had been included (already vetted for license since it’s part of another project) and is now part of the daily builds.  This is the oil that makes the open source machine great: the ability for software (Pentaho as a project) to respond to real customer needs (from me).  It’s awesome!

Now that reminds me, I hadn’t highlighted some of the cool new “open source — eee” things at Pentaho yet:

  • Public Issue/Feature Roadmap:
    We have launched Jira as a place to track new feature requests, bug submissions, etc.  I greatly encourage you to register and begin using it to submit bugs / suggestions.  Can’t always say they’ll get fixed in 20 hours but they have a MUCH GREATER chance of being fixed if they’re in Jira in addition to the forums.
  • Public Source Control:
    While we’ve always published our source with every release, the source repository wasn’t available to anyone on an anonymous basis.  We’re now hosting a Subversion repository that allows easier access and contribution from our always valued community.  Consider this an open invitation to dig in, build a cool plugin, etc.

I’m glad these two things have happened; I think they make communication easier, more effective, and more transparent.  What do you think?

Finally, not on a lame-oh, music-devoid desktop

I’ve recently made the switch to Linux, as many of you have read in my previous blogs on the matter.

One of the things that I missed dearly, but was not a critical priority, was getting streaming MP3 (shoutcast) on my headphones.  Too many higher priority things on my plate, but I finally got XMMS and the MP3 codecs.  What a pain those pesky patents have caused for end users like me. 

977 the Kickin Country Channel never sounded so good!

Windows never looked so GOOD!

In my last blog entry I was clear: Windows had crashed on me for the last time. I was through with the operating system from Redmond…

Except…

It’s a Microsoft world and I’m pragmatic enough to understand that there are simply SOME things that can not be done from Linux (device drivers for my all in one printer/scanner/fax are non existent for example). VMWare is invaluable in this regard and while I’ve raved about it before, I’ll say it again. It’s about the best 150 USD you can spend if you’re a developer.

So… Here’s how I’m using Windows in a way that suits me just fine, because a) it’s in VMWare so I only fire it up when need be and b) I’m using XGL and even Windows looks cool on the side of a 3D cube desktop.

Windows Looks Good

Off Topic: "OK to discriminate" referendum defeated

The great state of Washington passed a law adding "sexual orientation" to the list of groups provided anti-discrimination protection.  It’s a sad state of affairs when these measures, of any form, are needed to ensure that people are civil to other people; however, there are clear needs for such measures.

There’s this politician who thought it would be a grand idea to sponsor a referendum to put to ballot a measure that specifically excludes these protections for gay and lesbian citizens.  Sad to say, more than 100,000 of my fellow Washingtonians signed the measure, but calm, rational heads prevailed:

Referendum 65 will not appear on the ballot.

OWB 10gR2 : Real Time Data Warehousing

There’s lots of talk about real time, right time, periodic batch, and message-based processing in Data Warehousing and BI circles these days. I think this is driven by quite a few things: the need for fresh data, the need for unified reporting interfaces for users, etc. Mostly, I think it comes down to TCO for IT assets. As the EAI/EII/ETL tools start to converge, along with the increased SOA-ee-ness of databases and middleware products, there is quite a bit of overlap between the different product sets. Managing “one product” that does this data integration, calculation, and movement between systems costs less than maintaining “multiple products.” Truthfully, I see little strategic (ie, warehouse and marts) data that needs to be computed in real time. Those cases do exist, though, and OWB 10gR2 has some new features for those who do have Real Time DW/BI needs.

There are two major flavors of mappings in support of Real Time Data Warehousing in OWB:

  • PERIODIC BATCH: This is basically a batch process that runs frequently (say every minute or so) that reads data from a QUEUE or STREAM. While the data is pushed into the DW (real time), the system only processes when run (batch). These are regular mappings that use a Stream or Queue as a source instead of traditional Tables/Views/etc.
  • TRICKLE FEED: This is much closer to what most people think of when we refer to real time data warehouse. Trickle feeds involve processing each individual record as it arrives, instead of waiting for them to collect. These are a special kind of OWB mapping called Real Time Mappings that run continuously and process records as they arrive.
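
To make the difference between the two flavors concrete, here is a minimal Python sketch. It is not OWB code and says nothing about how OWB implements either mapping type: the standard library `queue.Queue` stands in for an Oracle Queue or Stream, and the names `drain_batch`, `trickle_feed`, `STOP`, and `demo` are made up for illustration.

```python
import queue

STOP = object()  # hypothetical sentinel that ends the trickle feed in this demo


def drain_batch(source, load):
    """PERIODIC BATCH style: load whatever has piled up since the last run."""
    batch = []
    while True:
        try:
            batch.append(source.get_nowait())
        except queue.Empty:
            break  # nothing left; the next scheduled run picks up newer records
    if batch:
        load(batch)  # one load call for the whole batch
    return len(batch)


def trickle_feed(source, load, stop=STOP):
    """TRICKLE FEED style: block on the queue and load each record on arrival."""
    count = 0
    while True:
        record = source.get()  # waits until a record arrives
        if record is stop:
            break
        load([record])  # one load call per record
        count += 1
    return count


def demo():
    """Feed the same three records through both styles and record the load calls."""
    q = queue.Queue()
    batch_loads, trickle_loads = [], []

    for record in ("a", "b", "c"):
        q.put(record)
    drain_batch(q, batch_loads.append)

    for record in ("a", "b", "c"):
        q.put(record)
    q.put(STOP)
    trickle_feed(q, trickle_loads.append)

    return batch_loads, trickle_loads
```

In the sketch the periodic batch makes a single load call covering all three records, while the trickle feed makes three load calls, one per record. In a real system the batch function would be run on a schedule and the trickle function would run continuously.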

Truthfully, I’ve only kicked the tires on both of these types of mappings. I tested some of the features back in OWB Beta2 and built a conceptual mockup of how it would work for a customer of mine. What I’m presenting is a conceptual, partially working mockup built using an early beta release. In other words, do not use it as reference or consider it a blueprint for how you should proceed. If there is enough interest I might submit an article to OTN on the subject. Anyone like the idea? Better yet, if you’re not one of my customers please do consider contacting me! I’d love to help build a Real Time DW solution with OWB!

OWB now includes the ability to define, deploy, and set up Streams, Queues, Queue Tables, UserDefinedTypes, and propagations within the GUI. There’s a whole set of screens that you’ll see when the community preview hits the shelves. Unlike regular OWB deployments, there are some additional requirements around streams administration locations, permissions, etc, but they are easily surmountable. Also, if you’re going to be doing real time DW you need to understand a bit about the underlying technology anyhow (not tons, but enough to know why you need to have Archive Logging turned on, etc).

Refer to the following PDF for greater detail on the concept, but here’s a not so good screenshot:

I’ve created a mockup of a BI solution that is fed by a CRM (Customer Data Hub perhaps) and a Subscription Management Application for this example. You can see that conceptually this involves both systems sending messages either from the APPLICATION LEVEL (JMS or some other messaging technology) or the DATABASE LEVEL (with DML Stream Captures running in Oracle). In other words, we have multiple places we can get different pieces of data and the application doesn’t necessarily have to be “REAL TIME ENABLED” to send real time data. Oracle can do that on its behalf using the Streams technology!

Overall, what this looks like is that we set up the various Streams, Capture Processes (DML), Queue Tables, and Types (based on our source tables) to support our real time system. Note that the screenshot does not include the Streams on the source system or the Capture Process definitions. This only includes the DW side Streams, Queues, Dimensions, etc.

I’ve built three real time mappings (TRICKLE FEED) which in concert receive messages to add Dimension records (SCD2) and insert new Cube records (transactions). Notice this is a greatly simplified example, entirely ignoring what I consider a best practice of loading into a normalized warehouse, then updating marts based on the warehouse (a la CIF methodology). Also, these are all assuming no changes (ie, record corrections), just straight clean data! We should all be so lucky!

One receives updates from the CRM application and performs SCD on the appropriate Dimension objects.

The others receive event messages from a transaction based system and insert records into Cubes.
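
For readers who haven’t worked with Type 2 slowly changing dimensions, here is a rough Python sketch of what those two kinds of mappings do to the data. It is plain in-memory Python, not OWB or anything OWB generates; the function names, the row layout (`surrogate_key`, `natural_key`, `valid_from`, `valid_to`), and the sample customer are all assumptions made up for illustration.

```python
from datetime import date


def current_row(dimension, natural_key):
    """Return the open (not yet expired) dimension row for a key, or None."""
    for row in dimension:
        if row["natural_key"] == natural_key and row["valid_to"] is None:
            return row
    return None


def apply_scd2(dimension, natural_key, attributes, effective):
    """SCD2 update: expire the current row if it changed, then add a new version."""
    row = current_row(dimension, natural_key)
    if row is not None:
        if row["attributes"] == attributes:
            return row["surrogate_key"]  # nothing changed, keep the current version
        row["valid_to"] = effective  # close out the old version
    new_row = {
        "surrogate_key": len(dimension) + 1,  # simplistic key generator for the demo
        "natural_key": natural_key,
        "attributes": dict(attributes),
        "valid_from": effective,
        "valid_to": None,
    }
    dimension.append(new_row)
    return new_row["surrogate_key"]


def insert_fact(facts, dimension, natural_key, amount, event_date):
    """Transaction message: attach the fact to the current dimension version."""
    row = current_row(dimension, natural_key)
    if row is None:
        raise KeyError(f"no current dimension row for {natural_key!r}")
    fact = {"customer_sk": row["surrogate_key"], "amount": amount, "date": event_date}
    facts.append(fact)
    return fact


# Usage: one CRM update, one transaction, then a CRM change and another transaction.
customers, sales = [], []
apply_scd2(customers, "C1", {"city": "Seattle"}, date(2006, 1, 1))
insert_fact(sales, customers, "C1", 25.0, date(2006, 2, 1))
apply_scd2(customers, "C1", {"city": "Portland"}, date(2006, 6, 1))
insert_fact(sales, customers, "C1", 40.0, date(2006, 7, 1))
```

After these four messages the dimension holds two versions of customer C1, and each sale points at the version that was current when it arrived. Like the mockup, the sketch assumes clean data and ignores record corrections.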

This isn’t quite as much detail as I would like to have gone into, and I’ll quickly repeat my warning… These are just some mockups and conceptual work, so don’t expect them to be accurate come OWB 10gR2 production time! I have some more thoughts on how to use this with Partition Exchange Loading to get a day’s “Cubes” built in real time throughout the day, and then at the end of the day move them over to the full history, but that’s a whole nother article.

This blog is part of the OWB Paris Early Review series which reviews and comments on several new Paris features.