Category Archives: Open Source

Pentaho Visit : Day 1

Those who regularly visit the site know that I focus my professional consulting hours on Data Warehousing, specifically with Oracle and Oracle Warehouse Builder. However, I spend much of my R&D time researching, downloading, and kicking the tires of Open Source projects I find cool and interesting. Pentaho clearly sits at the intersection of what puts bread on the table and what stimulates the mind, so I accepted their invitation for a week of training in Orlando, FL.

Day 1, like many first days, was mostly introductory: architecture blueprints, team, background, etc. We get into more technical details in the subsequent days, but today was mostly an overview. I am continually impressed by this company; not by the product per se, because it is a 1.0 product in a very mature market that can only now start to be compared with the “big boys.” However, I think the company has the mojo to pull this off. They are on a first name basis with CEOs of other key open source companies. Their ranks are filled with former Business Objects and Hyperion recruits. Their board members were senior VPs at Oracle, and on and on. They are building a solid company that is healthy and, from what I can see, can deliver on their vision of an open source BI stack.

What is Pentaho? Pentaho is an open source BI stack that provides the full range of BI components: Reports, Bursting, OLAP analysis, Dashboards, BAM, etc. Lofty goal indeed… CEO Richard Daley puts it quite simply (paraphrased): We don’t want to be a disruptive technology in just Open Source BI, we want to disrupt the entire BI marketplace with our technology… It’s a lot of fun… The platform is process centric (workflow driven) and concedes that it won’t be a silo at the center of the universe. It pragmatically embraces the idea that BI should be part of an overall business process, and that if it is not, then you’re not getting the full value of your BI assets.

This makes profound sense, yes? If analyzing order fulfillment efficiency is the end of your business process (ok, looks like the warehouse in Toronto is 3 times slower than anyone else), then you’re hosed. The process must continue: notify someone of the result and, if the intelligence is actionable, collaborate on a solution.
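To make that concrete, here is a minimal sketch in Python of a process that does not stop at the analysis step. Everything in it is an assumption for illustration: the fulfillment times, the "3 times slower" threshold check, and the recipient address are invented, and none of this is Pentaho code.

```python
from statistics import median

# Hypothetical average order fulfillment times, in hours, per warehouse.
fulfillment_hours = {
    "Toronto": 72.0,
    "Chicago": 24.0,
    "Dallas": 22.0,
    "Seattle": 26.0,
}


def find_slow_warehouses(hours_by_site, factor=3.0):
    """Return sites at least `factor` times slower than the median of the other sites."""
    slow = []
    for site, hours in hours_by_site.items():
        others = [h for s, h in hours_by_site.items() if s != site]
        if others and hours >= factor * median(others):
            slow.append(site)
    return slow


def build_notifications(slow_sites, owner="ops-manager@example.com"):
    """Turn the analysis result into messages someone can act on."""
    return [
        f"To {owner}: {site} fulfillment is running far slower than its peers."
        for site in slow_sites
    ]


# The analysis feeds the next step of the process instead of ending in a report.
alerts = build_notifications(find_slow_warehouses(fulfillment_hours))
for line in alerts:
    print(line)
```

The point is only the shape: the query result is an input to a notification step, which in a real deployment would be a workflow action rather than a print statement.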

I continue to be critical of their regard for ETL/Data Warehouse as secondary to their platform. I think they have BI covered, and are comprehensive in this regard. What I see, like others, is that the other key piece of “doing BI” is the information integration, cleansing, and transformation. If the data is unintegrated and the business context is difficult to infuse from a “straight SQL query,” then you’re in the ETL and Data Warehousing business. That being said, the architecture I’ve seen put forth (I haven’t been in the details yet) allows for this relatively easily. Perhaps this will be more robust than it first appears as the details of their product are run through this week.

They released a 1.0 GA in December. If you haven’t checked it out, DO! It installs in about 2 minutes on Windows (no kidding, it’s one of the best open source demo installations I’ve ever seen).

KETL != KETTLE

I was having a discussion with the CEO of an Open Source company recently and we started discussing our opinions of the “kett uhl” Open Source ETL project. It quickly became clear that we had different ideas about the project and its sponsorship, until a clearly identifying difference emerged… An “ah ha” moment.

Kinetic ETL (open source ETL project) is not the same as KETTLE (open source ETL project).

There are only a handful of Open Source ETL projects currently. It seems silly that there should already be brand confusion for such a small group of players… Anyone from either of these projects care to comment on their choice of such similar names? Pure coincidence?

Open Source BI getting real

I don’t want to get too far into it, because there have been some significant developments since I last surveyed the scene about 6 months ago. However, I did want to point out that Pentaho has made significant progress in building its Open Source BI stack and is starting to build 1.x release candidates. They’ve sorted out $5 million in financing to really allow them to take this to the next level.

Congratulations on the progress, and I personally look forward to seeing more great work coming from Orlando, FL.

Perhaps I'm mistaken on BizGres

I wasn’t exactly flattering to GreenPlum in this blog. I basically said that anything interesting they were doing was going to be in their MPP (commercial) product instead of BizGres (Open Source). I’ve just looked (not used, just read in the docs) at the features for BizGres 0.7 and there are some interesting features in there.

  • Bitmap Index Scan Performance Enhancement
  • Table Partitioning
  • KETL – extract, transform and load (ETL) technology from Kinetic Networks
  • JasperReports from JasperSoft

Again, I’m too busy at the mo’ to have a look, although I would love to see what KETL is all about. We’ll see what the uptake and community acceptance is… It’ll be interesting to see what happens with the new three way partnership between Jasper/Kinetic/GreenPlum as well.

Pentaho Milestone 2 release

Since I probably piqued some interest with this blog, I figured I should post an update…

The folks at Pentaho have released some actual software. I’m head deep in an OWB Paris project so I’ve had ZERO time to have a look. I’d love for anyone who’s had a look to email me and let me know their impressions.

From their release briefing:

Using this release, you will be able to experience the streamlined install process and interact with a number of components and samples.

  • Reporting
    how to run reports, burst different content to different users, and parameterize reports.
  • Business rules
    how to include and use business rules in the creation and delivery of content.
  • Email
    how to send the results of a business rule or report creation to an email address, and how to do email bursting.
  • Printing
    how to print a report to a selected printer, how to do batch printing, and how to print bursting (applying different report parameters to individual printers).
  • Workflow
    how to initiate a workflow and pass parameters to it.
  • Bursting
    how to deliver customized versions of a generic report to different email addresses or printers
  • Scheduler
    how to schedule the actions of the Pentaho BI Platform
  • Web Services
    how to access the actions of the Pentaho BI Platform using web services
  • Navigation
    how to organize and describe content to users using Java Server Pages or portlets  

Many of the visual features, such as wizards, that you may have heard discussed or seen demonstrated are not scheduled for delivery until the next milestone release. Please bear this in mind as you use the product.
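
For readers who haven’t run into the term, bursting as described in the briefing can be sketched in a few lines of Python. This is only an illustration of the idea under my own assumptions: the report rows, the `burst` function, and the addresses are hypothetical, and this is not how the Pentaho BI Platform implements or exposes it.

```python
from collections import defaultdict

# Hypothetical rows of one generic sales report.
rows = [
    {"region": "East", "product": "Widgets", "sales": 1200},
    {"region": "West", "product": "Widgets", "sales": 950},
    {"region": "East", "product": "Gadgets", "sales": 400},
]

# Hypothetical mapping of a report parameter value to the person who should receive it.
recipients = {
    "East": "east-manager@example.com",
    "West": "west-manager@example.com",
}


def burst(rows, recipients, key="region"):
    """Split one generic report into a customized version per recipient."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)

    deliveries = {}
    for value, address in recipients.items():
        lines = [f"{r['product']}: {r['sales']}" for r in grouped.get(value, [])]
        deliveries[address] = f"Sales report for {value}\n" + "\n".join(lines)
    return deliveries


deliveries = burst(rows, recipients)
for address, body in deliveries.items():
    print(f"--- to {address} ---\n{body}")
```

Swap the email address for a printer name and the same split gives you the print bursting the briefing mentions.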

Headless VNC is MUCH faster

I’ve used RealVNC for quite some time, and find that it is a quick and easy method for occasional remote access. Did I mention it’s free?

Since I typically only need it for Windows machines I had only used the Windows VNC Server version. With Linux, “ssh -l user -X myhostname.company.com” would usually suffice. The Windows version polls to check for updates to the window, or screen, or underneath the mouse, etc.

I’ve been duly impressed by the headless VNC server that comes with my White Box Linux. I was expecting a similar experience with delays, pixelation, and screen refresh issues, but I’ve experienced NONE of that. When I full screen my VNC client I notice little difference from being at the console. Anyhow, I just thought I’d mention that the headless Linux VNC is much much better than the Windows polling VNC. Happy network computing!

Open Source BI – I like Pentaho

Business Intelligence software, databases, and their supporting hardware are expensive. I mean really, really expensive (hundreds of thousands to millions of dollars). Many people working in the Business Intelligence/Data Warehousing fields have seen their “operational application” colleagues adopting open source solutions (Linux, JBoss, Eclipse, Apache, etc.) but have seen little attention paid to the software required to build and deliver Business Intelligence. That is beginning to change.

I’ve blogged about this before, specifically my experiences with downloading and testing Mondrian, an open source ROLAP server written in Java. It appears that projects suitable for BI are gaining momentum and maturity in the Open Source (OS) world. I’ve felt for some time that the open source community had not embraced BI in quite the same way they have other applications of technology. It is, in earnest, a technology stack to make bigger companies bigger and smart companies smarter. While these precepts aren’t in opposition to open source ideals, they aren’t what typically motivates communities of developers to band together to make software for free (i.e., change the world, provide a framework used by 10,000 websites, etc.).

The state of open source BI was relatively slim not too long ago. There were a variety of possible tool sets one could use for ETL (Clover, Enhydra Octopus), some initial OLAP components (Mondrian, JPivot), some portal frameworks for dashboards (JetSpeed, JBoss Portal), and some databases with maturity for DW situations with smaller volumes (MySQL, Postgres). Things have been heating up this past year, and we should review what’s going on in the Open Source BI realm. The lead is buried: make sure you check out Pentaho at the bottom.

CA’s Open Source release of Ingres
Albeit under a funny OSI approved license (there are many provisions which will scare away the OS purists, and make others at least think twice about including it in their products or service), Ingres is officially open source and free. Ingres has some pretty significant “enterprise” features including replication, partitioning, and “in the works” Linux clustering (a la RAC). This is great news because Ingres is a rather mature database and is better suited for large DW volumes than MySQL and Postgres. It is noticeably (and perhaps critically) lacking the vibrant community required to increase uptake. At this point it feels like CA is still the only one “interested” in Ingres. This might change, but I believe the funny CATOSL has hindered acceptance from open source communities.

Netezza/DATAllegro are using open source
These two providers of DW appliances are using open source databases as part of their solution. It’s a mixed technology stack, which means that unless you purchase the appliances you will benefit from none of the work that these two companies have put into their implementations. One uses Postgres, the other uses Ingres. There must be quite a bit of technology surrounding it to make it actually work for corporate DW environments. Netezza is actually doing rather well I believe, and some of the bigger vendors are starting to “see them on the radar” as a player in the space.

GreenPlum (aka Metapa) takes another shot
When Metapa wasn’t getting traction marketing their inexpensive proprietary Clustered DB implementation, they figured they needed something more. Open Source is powerful enough that even a few years into the hype it still attracts attention. They relaunched themselves as an Open Source solution and are sponsoring the BizGres project (a few extensions to Postgres that are useful for BI environments), along with allowing the single instance version of their product to be used for free. I don’t think they’ll get the OS community embrace they desire, because people are discerning these days: the only interesting work GreenPlum is doing is related to their MPP and shared nothing clustering technology, which is very much NOT open source. They are only opening their kimono an inch, not even to a halfway mark.

Mondrian/JPivot releases
These two projects underwent new releases this year that gave the most visible part of an open source DW/BI system its legs. While not comparable to commercial OLAP interfaces, they are certainly suited for ISVs/Developers to embed in their applications. These are great components for including in a project, and if your report consumers don’t really care to write their own reports (a la graphical report builder) and just want to pivot and page, this could be an excellent, inexpensive solution.

BIRT and JasperReports are actually pretty good
These are two commercially backed projects (one by Actuate, the other by JasperSoft) that are building the basis for business quality reports. Don’t turn off your Crystal installation yet because these both have a way to go, but they’re improving at a steady pace.

Pentaho Nation
This is truly the most exciting thing I’ve found in the Open Source BI space, and they’ve just begun their work so I’m running on faith at this point. Industry veterans who are passionate about BI and open source have pooled their minds and money (they’ve made $$ from previous entrepreneurial activities) to build a pure, 100% open source distribution for BI. They are collecting various open source projects, building their own components and releasing the whole thing as open source. A partial list of the projects they are planning to include (no official distro yet): Mondrian OLAP server, JPivot, Firebird RDBMS, Enhydra ETL, Shark and JaWE, JBoss, Hibernate, JBoss Portal, Weka Data Mining, Eclipse, BIRT, JOSSO, Mozilla Rhino.
The company will follow in Red Hat’s footsteps and make money on support, training, and consulting. Their plans are ambitious, but they are focused on assembling and configuring all these disparate projects into a comprehensive platform that will be at least comparable to the “big boys” at Hyperion, Cognos, Microstrategy, etc.


They are engaging the community, clearly understand the need in the space, and are committed to the ideals of getting paid for solutions instead of software. They are certainly strong in the presentation, dashboard, BPM/workflow, OLAP end of the spectrum but don’t appear to be including much in the ETL/DW end (there is some, but it appears to be for data movement and loading as opposed to building a DW). I’m not sure if it’s strategic or not, but it might make sense. Most people adopting an open source BI platform for their reporting users will feel comfortable rolling their own ETL/DW for the backroom. It should also be noted that they haven’t made any releases yet, so what we’re seeing is all conceptual for now, but they’ll be rolling something out sometime in 2005. It appears as if the founders have a track record of “doing what they say they’ll do.”

What does this all mean?
There are three things that will happen as the Open Source and BI worlds start dating.

  1. Hardly anything will change for your current BI project and technologies. Open source BI is still emerging and is just now being utilized by early adopters.
  2. Cost pressure on the “big boys” will occur as the maturity of these components provides at least comparable options. The small number of vendors and their constantly increasing prices will show up as an area to be trimmed (ironically enough, probably in a financial report produced by the software in question). I don’t believe the impact will be significant, but there will be a small impact over the next 3-5 years. It will also affect prices of BI OEM and inclusion of BI capabilities in vertical applications (more BI in existing products).
  3. Increased adoption of BI at small and mid sized businesses that can now afford to enter the BI space. Previously inhibited by the exorbitant software costs, businesses can now spend a few thousand dollars to start their foray into BI.

Open Source OLAP

Every month I review the web traffic reports for my blog, and I’ve always found something rather interesting. Even though I post more information about Oracle and OWB than any other subject, Google seems to send me more traffic from queries like “open source ETL” and “open source OLAP.” You know what they say, customer is king and you gotta give ’em what they want!

In other words, all I needed was just a teentsee weentsee bit of an excuse to take some time to really kick the tires of the open source OLAP server Mondrian.

Some basics… Mondrian is an open source OLAP server, written in Java. It implements an MDX engine, and also exposes an XML/A interface to clients. Mondrian uses a ROLAP architecture, and ends up issuing SQL statements to a JDBC data source to retrieve and calculate results. Mondrian works with Access, MySQL, Postgres, and Oracle. Refer to the Mondrian architecture pages to get some more information about the architecture.

My overall impressions were positive; it’s a good core set of functionality and performs rather well. Like any Open Source project it is an alphabet soup of supporting libraries, environment variables, generators, frameworks, and takes more than the usual 10 minute commercial product install. We’re not building a kernel here, but it’s not trivial to get the examples up and running.

Mondrian works closely (and appropriately) with an open source implementation of a JSP based Pivot and Charting project, JPivot. The demo for Mondrian includes an example with JPivot querying Mondrian and uses the well known Food Mart demo. I was expecting a bit less from JPivot and was pleasantly surprised that it’s actually rather functional (not commercial product easy to use, but really quite commendable).

JPivot allows for drilling down hierarchies (I don’t think you can use multiple hierarchies) and pivoting and exchanging the columns and rows. It has a “CUBE” editor that allows you to edit the report. It’s not drag and drop, but definitely works if you “grok” the interface.

Also pleasantly surprising were some pretty decent charting capabilities.

There are some decent selections of charts

In order to provide an OLAP view of your data you have to define some metadata about your Dimensional model (Cubes, Measures, Dimensions) and how they map to your underlying Relational Schema. Check out the samples on the Mondrian site to see how to write your own schema.
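
As a rough picture of what that metadata does, here is a minimal sketch in Python. It is not Mondrian’s schema format (real Mondrian schemas are XML files, so do follow the samples on their site); the `Cube`, `Dimension`, and `Measure` classes, the `sales_fact` table, and its columns are all hypothetical names I made up to show how a dimensional definition maps onto relational SQL.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str    # name the report user sees
    column: str  # relational column that backs it


@dataclass
class Measure:
    name: str
    column: str
    aggregator: str  # SQL aggregate function, e.g. "SUM" or "COUNT"


@dataclass
class Cube:
    name: str
    fact_table: str
    dimensions: list
    measures: list

    def sql_for(self, dimension_name, measure_name):
        """Translate 'measure by dimension' into a GROUP BY over the fact table."""
        dim = next(d for d in self.dimensions if d.name == dimension_name)
        mea = next(m for m in self.measures if m.name == measure_name)
        return (
            f"SELECT {dim.column}, {mea.aggregator}({mea.column}) "
            f"FROM {self.fact_table} GROUP BY {dim.column} ORDER BY {dim.column}"
        )


# Hypothetical cube definition: the dimensional model plus its relational mapping.
sales = Cube(
    name="Sales",
    fact_table="sales_fact",
    dimensions=[Dimension("Store", "store_city")],
    measures=[Measure("Unit Sales", "unit_sales", "SUM")],
)

# A throwaway in-memory database standing in for the JDBC data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (store_city TEXT, unit_sales INTEGER)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?)",
    [("Toronto", 5), ("Seattle", 3), ("Toronto", 7)],
)

result = conn.execute(sales.sql_for("Store", "Unit Sales")).fetchall()
print(result)
```

The user asks for “Unit Sales by Store”; the metadata is what lets the engine turn that into SQL against tables the user never sees.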

A couple of interesting things to point out: Mondrian implements a cache of “relations” used to increase performance. This is interesting because of consistency questions (some fragments are cached, but others are current) but also because it is WICKED fast once it’s loaded into memory. There are some interesting possibilities here, including some research on distributed P2P OLAP caching.
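
The consistency question is easy to see in a toy version. The sketch below is my own illustration in Python, not Mondrian’s cache: `CachingAggregator` is a hypothetical class that remembers aggregate results by SQL text, and the table is invented.

```python
import sqlite3


class CachingAggregator:
    """Hypothetical cache that keeps aggregate results in memory, keyed by SQL text."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.db_hits = 0

    def query(self, sql):
        if sql not in self.cache:
            self.db_hits += 1  # only cache misses touch the database
            self.cache[sql] = self.conn.execute(sql).fetchall()
        return self.cache[sql]

    def flush(self):
        self.cache.clear()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (store_city TEXT, unit_sales INTEGER)")
conn.execute("INSERT INTO sales_fact VALUES ('Toronto', 5)")

agg = CachingAggregator(conn)
sql = "SELECT SUM(unit_sales) FROM sales_fact"

first = agg.query(sql)  # cache miss: goes to the database
conn.execute("INSERT INTO sales_fact VALUES ('Toronto', 7)")
stale = agg.query(sql)  # cache hit: fast, but does not see the new row
agg.flush()
fresh = agg.query(sql)  # cache miss again: now reflects the new row

print(first, stale, fresh, agg.db_hits)
```

The second query is where the speed comes from, and also where a cached answer and the underlying data can disagree until something flushes the cache.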

BAD DEVELOPER, SIT IN THE DARK

I ran across a post by Andrej Koelewijn on orablogs.com that made reference to an open source project named CruiseControl.

CruiseControl is a framework for a continuous build process. It includes, but is not limited to, plugins for email notification, Ant, and various source control tools. A web interface is provided to view the details of the current and previous builds.
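
The core loop of a continuous build process is small enough to sketch. This is a simulation in Python under my own assumptions, not CruiseControl’s configuration or API: the revision names, `fake_build`, and `record_notification` are hypothetical stand-ins for a source control check, an Ant build, and an email plugin.

```python
def run_build_cycle(last_built, latest_revision, build, notify):
    """One polling cycle: build only when there is a new revision, then notify."""
    if latest_revision == last_built:
        return last_built, None  # nothing new in source control
    ok = build(latest_revision)
    notify(latest_revision, ok)
    return latest_revision, ok


def fake_build(revision):
    """Stand-in for the real build step; pretend revision r102 breaks the build."""
    return revision != "r102"


messages = []


def record_notification(revision, ok):
    """Stand-in for email notification; just collects the messages."""
    status = "passed" if ok else "FAILED"
    messages.append(f"Build of {revision} {status}")


last = "r100"  # revision that was already built
for revision in ["r100", "r101", "r102"]:  # what each poll of source control finds
    last, outcome = run_build_cycle(last, revision, fake_build, record_notification)

print(messages)
```

Every commit gets built and reported on shortly after it lands, which is what makes a broken build visible while the change is still fresh in someone’s mind.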

I’ve worked in environments with automated build processes and think they are absolutely wonderful. There are some significant advantages to an automated build process:

  • Less time spent tagging and building code to servers.
  • Predictable process for build and deploy (you do it with technology, rather than admins typing commands) so that your deploys are also “managed” across environments.
  • Mitigates big unknowns during integration. I’m not saying this will decrease the integration time spent on a project, but rather it will increase the likelihood of finding a show-stopping issue early.

Imagine a progressive work environment (a la Dot Com) where engineers are kindred spirits. They are working late hours, ordering pizza, playing with Nerf guns. This is the type of environment where the following extension would be useful.

If you break the build, you have to sit in the dark all day.

Oracle on VMWare Linux, Part 2

I’ve written about Oracle on VMWare before, but thought I would share an alternative perspective with readers. Howard Rogers has published a nice article on how to install VMWare and White Box Linux, and he suggests it works perfectly well. While I wouldn’t now suggest that production environments consider it suitable, I’m glad that someone has had a better experience with this product than I’ve had.
Note: There’s also an interesting discussion about his favorite font, for those looking for a digression.