Yesterday, I covered how you can do an initial “replication” of data from MySQL to DynamoDB and how this can improve performance and save storage space. The follow-on question becomes:
That's great, Nick. But how do I keep my data up to date?
We've got data in our Airline Performance dataset through 31-DEC-2007. I loaded one year, all of 2007, for the previous example. What happens when the FAA publishes its January 2008 results and we've loaded the new month's worth of data into MySQL?
MySQL:
select count(*) from otp.ontime; 8061223
select count(*) from ontime where FlightDate > '2007-12-31'; 605765
select count(*) from ontime where FlightDate <= '2007-12-31'; 7455458
DynamoDB:
select count(*) from FASTER."ontime"; 7455458
So, we’ve added approximately 600k new records to our source system that we don’t have in our reporting system. How do we incrementally insert these records and get just the 600k new rows into our DynamoDB reporting instance?
Easy Easy Easy.
We've already done all the work; all we have to do is get the records we haven't processed yet! It should take just a few minutes to get our current table "up to date" with the one over in MySQL.
DynamoDB:
select max("FlightDate") from FASTER."ontime"; 2007-12-31
insert into FASTER."ontime" select * from MYSQL_SOURCE."ontime" where "FlightDate" > DATE '2007-12-31'; 605765
In other words, let’s select from MySQL any records whose date is beyond what we have currently (2007-12-31).
select count(*) from FASTER."ontime"; 8061223
select count(*) from FASTER."ontime" where "FlightDate" > DATE '2007-12-31'; 605765
MySQL:
While the DynamoDB INSERT statement was running, the following SQL was being run on MySQL.
Running SHOW PROCESSLIST shows a session executing the following SQL:
SELECT * FROM `ontime` WHERE `FlightDate` > DATE '2007-12-31';
A single SQL statement (insert into target select * from source where date > last load) has you up to date for reporting! Longer term we may look at working with Tungsten to keep our data up to date using replication binlog records, but for now this simple pull-based approach does the job.
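For a routine refresh you wouldn't hard-code the date, of course. If your engine allows a subquery in the WHERE clause, the two steps above (find the high-water mark, then pull past it) collapse into a single statement. A sketch of that pattern, using the same FASTER and MYSQL_SOURCE schemas as above:

-- Incremental pull in one statement: grab everything newer than what we already hold.
-- Assumes FlightDate in the source only ever moves forward.
insert into FASTER."ontime"
select *
from MYSQL_SOURCE."ontime"
where "FlightDate" > (select max("FlightDate") from FASTER."ontime");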
Hi Nick,
great post, thanks!
One of the things I'm always paranoid about in a scenario like this is the possibility that the source may have added more records for the same date after the moment we took the snapshot during the previous load. So date > max(date_of_last_load) might pull in slightly fewer records than you need to be fully synchronized.
Of course, in the particular case you are describing, you can trust the source and you know nothing was added on 2007-12-31, so you're safe. But do you have any tips in case you're not entirely sure? My approach is usually to sync up to the date of loading, but exclude any rows added that day. But this is only acceptable if you can afford to run one day late. I can think of ways to pull everything up to the exact moment of loading and deal with any rows added after that at the time of the next load, but I am curious to hear your insights on this matter.
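For what it's worth, against the tables in the post that "run one day late" approach might be sketched as: pull everything newer than the high-water mark, but stop short of the current date, leaving today's rows to the next run.

-- Exclude anything dated today; more rows for the current date may still arrive,
-- and tomorrow's run will pick them up.
insert into FASTER."ontime"
select *
from MYSQL_SOURCE."ontime"
where "FlightDate" > (select max("FlightDate") from FASTER."ontime")
  and "FlightDate" < CURRENT_DATE;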
kind regards,
Roland Bouman
I agree with Roland here. Think of these two transactions in InnoDB:
1. Transaction 1 occurs at 11:59. Does not commit.
2. Transaction 2 occurs at 12:00. Commits.
3. Pull data into the reporting solution.
4. Transaction 1 commits.
If the system records the last load date as 12:00, the first transaction will be missing. My general solution is to ignore the last 10 minutes or so of data to let transactions commit, but this depends very much on the timeout the insert/update code is using. If you have long-running transactions you may need to exclude more than 10 minutes, or, perhaps better in the long run, break the long-running transactions into shorter ones.
And Roland, your solution of just pulling the previous day can run into issues if run too close to midnight, for the above reason.
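A minimal sketch of that lag-window idea, assuming the source rows carry a commit/insert timestamp (a hypothetical loaded_at column here; the ontime table in the post has no such column, and interval syntax varies by engine):

-- Skip rows stamped within the last 10 minutes; the transactions that wrote them
-- may not have committed yet, and the next run will pick them up once they have.
insert into FASTER."ontime"
select *
from MYSQL_SOURCE."ontime"
where loaded_at > (select max(loaded_at) from FASTER."ontime")
  and loaded_at <= now() - interval '10' minute;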
Hi Brooks!
“And Roland, your solution of just pulling the previous day can run into issues if run too close to midnight for above reason.”
Good point! thanks!
Brooks and Roland,
Thanks for the comments.
The example was simple and, yes, the data was expected to be settled. In fact, the example was more of a “triage” situation where MySQL was already acting as the DW (with ETL flowing into MySQL) and we were positioning DynamoDB as the “reporting appliance” on the back side of it. That's a perfectly comfortable topology for us…
If we were talking about data that could be changing at the source, and the transaction time was not a reasonable way to detect old records, you're both correct: there could be issues.
DynamoDB has capabilities for this too. If you know that data can change underneath you for up to 24 hours afterwards, you can do something like:
merge into table select * from source_table where date > max_date - 1 day
Our intelligent MERGE operation (insert/update) can pick up any changes and new data as long as we have a key to match on. All in, I'd love to see an approach explored with Tungsten operating on binary/row change records. 🙂
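Expanded into ANSI-style MERGE, the pseudocode above might look roughly like this. It assumes a usable unique key on the rows (a hypothetical "FlightKey" column) and shows only a couple of columns; the exact MERGE and interval syntax will vary by engine.

-- Re-pull the trailing day and let MERGE decide between update and insert.
merge into FASTER."ontime" t
using (select * from MYSQL_SOURCE."ontime"
       where "FlightDate" > DATE '2007-12-31' - interval '1' day) s
  on (t."FlightKey" = s."FlightKey")
when matched then
  update set t."ArrDelay" = s."ArrDelay"          -- repeat for every column that can change
when not matched then
  insert ("FlightKey", "FlightDate", "ArrDelay")  -- list the full column set in practice
  values (s."FlightKey", s."FlightDate", s."ArrDelay");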
The issue identified by Roland is a common one in ETL. Even when working with microsecond timestamps there is the possibility of missing a row that shares the latest timestamp. The easy solution: get the latest timestamp, drop the associated record(s), then upload everything >= that last timestamp (or the next-latest timestamp, whichever is faster in your set-up).
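Against the example tables in the post, that would be something like the following, with the high-water mark (2007-12-31) taken from the earlier max query:

-- Drop the rows that share the current high-water mark, then reload everything at
-- or after it, so late arrivals carrying that same date are not missed.
delete from FASTER."ontime" where "FlightDate" >= DATE '2007-12-31';

insert into FASTER."ontime"
select *
from MYSQL_SOURCE."ontime"
where "FlightDate" >= DATE '2007-12-31';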