Monthly Archives: June 2011

PDI Loading into LucidDB

By far, the most popular way for PDI users to load data into LucidDB is to use the PDI Streaming Loader. The streaming loader is a native PDI step that:

Enables high performance loading, directly over the network without the need for intermediate IO and shipping of data files.
Lets users choose more interesting (from a DW perspective) loading type into tables. In particular, in addition to simple INSERTs it allows for MERGE (aka UPSERT) and also UPDATE. All done, in the same, bulk loader.
Enables the metadata for the load to be managed, scheduled, and run in PDI.

However, we’ve had some known issues. In fact, until PDI 4.2 GA and LucidDB 0.9.4 GA it’s pretty problematic unless you run through the process of patching LucidDB outlined on this page: Known Issues.

In some ways, we have to admit, that we released this piece of software too soon. Early and often comes with some risk, and many have felt the pain of some of the issues that have been discovered with the streaming loader.

In some ways, we’ve built an unnatural approach to loading for PDI: PDI wants to PUSH data into a database. LucidDB wants to PULL data from remote sources, with it’s integrated ELT and DML based approach (with connectors to databases, salesforce, etc). Our streaming loader “fakes” a pull data source, and allows PDI to “push” into it.

There’s mutliple threads involved, when exceptions happen users have received cruddy error messages such as “Broken Pipe” that are unhelpful at best, frustrating at worse. Most all of these contortions will have sorted themselves out and by the time 4.2 GA PDI and 0.9.4 GA of LucidDB are released the streaming loader should be working A-OK. Some users would just assume avoid the patch instructions above and have posed the question: In a general sense, if not the streaming loader how would I load data into LucidDB?

Again, LucidDB likes to “pull” data from remote sources. One of those is CSV files. Here’s a nice, easy, quick (30k r/s on my MacBook) method to load a million rows using PDI and LucidDB:

This transformation outputs to a Text File 1 million rows, waits for that to complete then proceeds to the load that data into a new table in LucidDB. Step by Step the LucidDB statements

— Points LucidDB to the directory with the just generated flat file
— LucidDB has some defaults, and we can “guess” the datatypes by scanning the file
CREATE or replace SERVER csv_file_server FOREIGN DATA WRAPPER SYS_FILE_WRAPPER OPTIONS ( DIRECTORY ‘?’ );
— Let’s create a foreign table for the data file (“DATA.txt”) that was output by PDI
>create foreign table applib.data server csv_file_server;
— Create a staging, and load the data from the flat file (select * from applib.data)
CALL APPLIB.CREATE_TABLE_AS (‘APPLIB’, ‘STAGING_TABLE’, ‘select * from applib.data’, true);

We hope to have the streaming loader ready to go in 0.9.4 (LucidDB) and 4.2 (PDI). Until then, consider this easy, straight forward method of loading data that’s high performance, proven, and stable for loading data from PDI into LucidDB.

Example file: csv_luciddb_load.ktr

Pushdown Query access to Hive/Hadoop data

SQL access to CouchDB views : Easy Reporting

0.9.4 did not hit the 1 year mark!

4 Replies

Our last LucidDB release was now, just more than 12 months ago on June 16, 2010. We were really really trying to beat the 1 year mark for our 0.9.4 release but we just couldn’t. A tenet of good, open source development is early and often and we need to do better. Since the 0.9.3 release we’ve:

Built out an entire Web Services infrastructure
Developed a wicked cool Admin user interface
Developed cool connectors to Hive, CouchDB
Built a whole ton of extensions (auto indexing, DDL generation, improved load routines)
Scriptable functions, and procedures
Updated our connectors (JDBC, Salesforce, etc)

All in this is a VERY exciting release… I apologize it’s taken this long, but please bear with us. We’ll be release in the next couple of weeks!

SQL access to CouchDB views

A different vision for the value of Big Data

4 Replies

UPDATE: Think we’re right? Think we’re wrong? Desperate to figure out a more elegant solution to self service BI on top of CouchDB, Hive, etc? Get in touch, an d please let us know!

There’s a ton of swirling about Hadoop, Big Data, and NoSQL. In short, these systems have relaxed the relational model into schema(less/minimal) to do a few things:

Achieve massive scalability, resiliency and redundancy on commodity hardware for data processing
Allow for flexible evolution and disparity in content of data, at scale, over time
Process semi-structured data and algorithms on these (token frequencies, social graphs, etc)
Provide analytics and insights into customer behaviors using an exploding amount of data now available about customers (application reports, web logs, social networks, etc)

I won’t spend that much more time delving into the specifics of all the reasons that people are going so ape-sh*t over Big Data.

Big Data has a secret though:

It’s just a bunch of technology that propeller heads (I am one myself) sling code with that crunch data to get data into custom built reporting type applications. Unlike SQL databases, they’re NOT ACCESSIBLE to analysts, and reporting tools for easy report authoring and for businesses to quickly and easily write reports.

Until businesses get to ACTUALLY USE Big Data systems (and not via proxy built IT applications) it’s value to the business will be minimal. When businesses get to use Big Data systems directly; there will be dramatic benefit to the business in terms of timeliness, decision making, and insights.

And don’t get me wrong, there’s HUGE value in what these systems can do for APPLICATION developers. Sure sure sure. There’s Hive, and Pig, and all these other pieces but here’s the deal: Not a single set of technology has assembled, from start the finish, the single system needed to quickly and easily build reports on top of these Big Data systems.

It’s starting to get real confusing since vendors see Hadoop/Big Data exactly how they want you to see it:

If you’re a BI vendor, it’s a new datasource that people can write apps and stream data into your reports.
If you’re an ETL vendor, it’s a new datasource and you’ll simply make it practical.
If you’re an EII vendor, it’s a new target for federating queries.
If you’re an analytic DB vendor, it’s an extension point to do semi-structured or massive log processing.

We have a different vision for doing BI on Big Data; it’s different than the “our product now works with Big Data too” you’ve been told from other vendors:

Metadata matters: Build a single location/catalog for it!
Where did this report on my dashboard actually come from? When I see this thing called “Average Bid Amount” on my report, which fields back in my NoSQL datastore was that calculated from? Why bother with a separate OLAP modeling tool when you already know your aggregation methods and data types. Current solutions where “it’s just another source” of data that shove summarized Big Data into MySQL databases for reporting miss a huge opportunity for data lineage, management, and understanding.
Summary levels require different “types” of data storage and access
The total number of events can be represented and aggregated to many many different levels of aggregation. Some, very highly summarized figures (such as daily event counts) should be kept in memory and accessible extremely fast for dashboards. Relatively summarized figures (10s of thousands) should be kept in a database. Datamarts (star schemas) that represent some first or second level of aggregation (100m rows) should be kept in a fast column store database. The detail transaction data, and its core processing technologies (M/R, graph analytics, etc) are in the Big Data system. Current solutions provide only tools for data slinging between these levels of aggregation; none provide the ability to access and manage them in a single system.
Inexpensive BI tools allow for cheaper, quicker and better analytic solution development
The “Business Intelligence” industry has been driving the application developer out of the required chain of events for building dashboards/analytic for years. In short, BI is a huge win for companies because it’s cheaper, helps get insights faster, and ultimately allows analysts to be somewhat self sufficient to do their job. Big Data has taken the industry back 10-15 years by making not just complicated reports but literally EVERY report be built by a developer! Current solutions allow for direct connect in reports to Big Data systems but you still have to write programs to access the data. Other solutions simply pull data out of Big Data systems and shove it at MySQL because of this exact reason!
SQL is familiar, well known, and matches with the “easy” data analytics
How easy is it to find and hire someone who can write SQL and build reports (in Crystal/Pentaho/Actuate/etc)? There are literally millions of people that can know SQL. Like it? Who knows. Know it, can use it? MILLIONS! How about hiring someone to build reports who knows the ins and outs of cluster management in Hadoop, knows how to write multi step programs and write application code to push that data into visualization components in the application? 10s of thousands, maybe. And 70% of them right now are in Silicon Valley. Trust me; these skills won’t spread outside of Silicon Valley in great numbers (or at least quickly).
Reporting Only data matters! Make it easy to access/integrate
Simple reporting rollups (say categories of products/web pages, etc) have no place being “pushed” into the Big Data system. Having a system that is doing BI on top of Big Data needs a way to easily, in a metadata driven fashion, match up the Big Data with other reporting only data. Current solutions require complex ETL development and assemble it as part of that stream of data to shove at a MySQL database.
Hot or Cold reporting, let the user choose via SQL
For dashboards the user is almost certainly willing to use the previous load (last hours) data instead of waiting 17minutes to run in the Big Data system. For Ad Hoc analysis reports need to be speed of thought; big data systems can do OK here, but on smaller datasets the best choice is a columnar, high performance BI database (ahem!). For small summaries, or consistently accessed datasets it should be stored in memory. Current solutions require someone building reports to KNOW where the data that they want is, and then connect to an entirely different system (with different query languages such as MDX, SQL, and M/R) to build a report.

We’re building a system, with LucidDB at the center, that is the most complete solution for doing real, day to day, full featured (adhoc, metadata, etc), inexpensive analytics on top of Big Data. Ok, Yawn to that. Since Hadoop and Big Data is hype-du-jour I don’t expect you to believe my bold statements. We’re releasing LucidDB 0.9.4 in the coming weeks, and this core database release will be the infrastructure for the new UI, extensions, and pieces of our solution.

In short DynamoBI’s Big Data solution provides:

Ability to use inexpensive, commodity BI tools on top of Big Data directly and cached (Pentaho, BIRT, etc). Connect with your existing BI/DW system (via JDBC/ODBC).
Ability to connect, and make accessible via SQL Big Data systems (currently Hive/HDFS, CouchDB)
Easily define and schedule incremental updates of local, fast column store caches of data from Big Data systems for Ad Hoc analysis
Ability to quickly link and include data from other places (Flat File, JDBC, Salesforce, Freebase, etc) with Big Data
Define aggregations, rollups, and reporting metadata (hierarchies, aggregations, etc)
Drag and drop reports (via Saiku, Pentaho Analyzer, BIRT)
Easy, RESTful interaction with the server so our solution fits nicely with your existing systems.

Looking forward to blogging more about it in the coming weeks; if you’re sick and tired of hacking together and spending tons of developer time on building reports on top of Big Data systems please get in touch with us.

Goodman on BI

Thoughts on Open Source, Analytics