When testing and improving performance of a system, it's often crucial to compare data generated by the system before and after an intervention was made. For example consider a End-of-Day (EOD) processing batch which generates reports. If this process runs slowly and you make performance improvements, it's key to make sure that the data generated is the same as before.
Data reconciliation is another name for comparing two data tables. It’s often used as part of migration from one system to another or when testing a system data against expected results. Let’s imagine some theoretical requirements - I want to:
compare two large data tables
use keys to match data
allow for use of tolerances
see resulting breaks in summary and in detail
schedule the comparison and/or run on demand
have a central place to see the results easily
Let’s imagine we have two data tables which differ slightly:
There are several simple and out-of-the-box solutions we could consider:
Unix diff command
Not bad but not great. It compares the data as text relies on the data being sorted the same way.
Beyond Compare
This is a bit better - we can now compare column by column, specify keys etc. You can get a time-limited free version here.
Excel
It’s doable in Excel. You could either do a lookup from one sheet to another which is quite painful or you could put the two sheets into one and add a Source column identifying where the data came from:
Then pivot the data on the key and add a diff column:
You can then drill down to the actual difference using the drill-down feature of Excel pivot table. This is however not ideal - you have to manually include all the columns you want to compare etc.
Code it
It’s not very difficult, for example in Python there is the datacompy library which does this pretty efficiently but then you’d need to know python and NumPy to do it. Here are two articles on how to do it:
Datascade
You may not have heard about Datascade. It's a fairly new tool but I've used it on several projects already - it's free for basic usage.
In Datascade it’s all quite simple. Once you have defined a datasource (of which there are many types eg files system, database, sftp etc):
You just need to reference the files in a new job like so:
Once you’ve created references to both files - here I’ve also renamed them to SheetA and SheetB, you can create a new sheet with the comparison results:
Once that’s done, you can create a comparison summary:
It will look something like this:
It works similar to a pivot table - you can drill down by double-clicking - in this case I have double-clicked on Total Mismatches cell to see:
You can find out more about Datascade here
Conclusion
So how did we do regarding the requirements we set above? Let's use a scale of 0-9:
There are obviously many more solutions to the problem. I wanted to show a few which allow you to start quickly, be able to do all of the work yourself and be free or free to start with.
コメント