Data migration is common. Moving data from one database to another, whether via Data Pump, GoldenGate, unload/load scripts or some other method, carries risk.
- Did you move everything?
- How do you know everything was captured?
- How do you ensure the source and target data is the same?
- Exactly the same?
You are going to have to check. Frequently, I see the rows being counted. And nothing else. If we have 10 rows in the source, and 10 rows in the destination, we must have copied everything! Right?
So, what’s the easiest way to do that? Well, if it’s numeric you could just add all of the values in a particular column and see if the totals match in the source and destination. That would work, but it’s not going to work for characters. What if the input is just binary data? How will you deal with that?
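For reference, here is what that numeric check might look like. This is a minimal sketch, assuming a table with a numeric column (I’m using the tab_check table and its OBJECT_ID column from later in this post); run it against both the source and the target and compare the results:

SELECT COUNT(*)       AS row_count,
       SUM(object_id) AS id_total
FROM   tab_check;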
The easiest solution is usually to create a hash of the data set. So what is hashing?
Hashing
Hashing basically converts input values into a fixed-length text string. It doesn’t really matter what the input is: numeric, character or any old binary. It does this by chopping the input into fixed-length chunks and hashing each chunk, folding in the hash of the previous chunk, to come out with a deterministic output.
MD5 Hash
MD5 is a common hashing algorithm, and it’s easy to check out how it works on Linux using the md5sum command. MD5 chops the input up into 512-bit blocks and hashes the first block. It then hashes the second block, plus the hash from the first block, and so on until it has hashed the entire file. It doesn’t matter how big the file is, you will end up with a single fixed-length hash. If the hashes from different files do not match, the files are different.
Important! I know MD5 is largely deprecated and should not be used for cryptography now. However, it’s simple, it’s relatively fast, and importantly all we are doing here is checking whether the source and destination data match within our organization. Security may insist you use SHA-384 or stronger, but that will be slower and burn more CPU.
If we look at the following 2 files, it’s tricky to see if they are different. However, if we run each file through the MD5 algorithm, we can see very clearly that the files are different.
$ cat text1
The yolk of an egg is white
$ cat text2
The yolk of an egg is White
$ md5sum text1
16d9638de9adb5d3ca2dd8d11db0b7aa text1
$ md5sum text2
1048d02bc43b27d993ace89822bd649e text2
It is worth noting that whilst we can tell that the 2 files are not identical (capital W in the 2nd file), we cannot tell if the data is correct. The hash function does not know that the yolk of an egg is yellow.
It doesn’t matter how big the files are, we get a single fixed-length output. Here we have 3 almost identical files (textA, textB and textC), each of which is over 256K in size. Which is different? It may be impossible to tell visually, but using md5sum…
$ wc -c textA
282828 textA
$ wc -c textB
282828 textB
$ wc -c textC
282828 textC
$ md5sum textA
0c72b53d19f8a97a83b2dd0c1f098737 textA
$ md5sum textB
0c72b53d19f8a97a83b2dd0c1f098737 textB
$ md5sum textC
40b81609195acc5b1cb57aba6a881861 textC
(note: textC has a 1-character difference - a capital W on one row)
In Oracle?
So, how can we do this within Oracle? Fortunately, Oracle provides a function that lets us compare data in 2 Oracle databases as easily as on Linux: dbms_sqlhash.
DBMS_SQLHASH
Oracle provides a remarkably useful function to allow us to generate a hash from a SQL statement. We can supply a SQL statement directly to the DBMS_SQLHASH.GETHASH packaged function to get a hashed output.
The algorithm used is specified by the DIGEST_TYPE parameter.
As we can see, "tab_check" has a variety of data types, including raw binary data.
desc tab_check
Name                    Null?    Type
----------------------- -------- ----------------
OBJECT_ID               NOT NULL NUMBER
OWNER                            VARCHAR2(128)
OBJECT_NAME                      VARCHAR2(128)
SUBOBJECT_NAME                   VARCHAR2(128)
OBJECT_TYPE                      VARCHAR2(23)
CREATED                          DATE
BIGGIE                           CLOB
BIGGIE2                          PUBLIC.XMLTYPE STORAGE BINARY
BIGGIE4                          BLOB
SELECT DBMS_SQLHASH.GETHASH
(SQLTEXT => 'SELECT * FROM tab_check ORDER BY 1',
DIGEST_TYPE => 2) AS hash
FROM dual;
HASH
--------------------------------
2BE06D9F4F81D2AB13F8909D298AD7D4
Run this SQL in the source and the target databases, and you can deterministically demonstrate that the data in the table is identical.
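If the two databases can reach each other, you can even do the comparison from one place over a database link. This is a sketch only: it assumes a database link called target_db pointing at the migrated database, the tab_check table at both ends, execute privilege on dbms_sqlhash in both databases, and SERVEROUTPUT ON in your session.

DECLARE
  l_source RAW(32);
  l_target RAW(32);
BEGIN
  -- same statement, hashed locally and on the remote (target) database
  l_source := DBMS_SQLHASH.GETHASH('SELECT * FROM tab_check ORDER BY 1', 2);
  l_target := DBMS_SQLHASH.GETHASH@target_db('SELECT * FROM tab_check ORDER BY 1', 2);
  IF l_source = l_target THEN
    DBMS_OUTPUT.PUT_LINE('Hashes match: ' || RAWTOHEX(l_source));
  ELSE
    DBMS_OUTPUT.PUT_LINE('MISMATCH! source=' || RAWTOHEX(l_source) ||
                         ' target=' || RAWTOHEX(l_target));
  END IF;
END;
/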
The DBMS_SQLHASH function does have some limitations, notably around the buffer size used, so if you have particularly wide tables, large LOBs, or just lots of data per row, you may find you need to pull specific columns for input into the function, as sketched below.
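One workaround is to hash a narrower projection, for example just the scalar columns, and deal with the LOB columns separately. A sketch against the same tab_check table, using the column names from the describe above:

SELECT DBMS_SQLHASH.GETHASH(
         sqltext     => 'SELECT object_id, owner, object_name, object_type
                         FROM tab_check ORDER BY object_id',
         digest_type => 2) AS scalar_hash
FROM   dual;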
You will need to grant execute on dbms_sqlhash to <query-user>; it is not granted by default for security reasons.
WARNING! For the hash to be deterministic, you MUST ensure that you access the same data in the same order with the same National Language Support (NLS) settings. If you access it in a different order (e.g. by using an index rather than a full table scan), or if you change any formats (e.g. date formats) between runs, you will get different values!
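One way to take the ambiguity out of it is to spell everything out: fix the session NLS settings, list the columns explicitly, convert dates with an explicit format, and order by a unique key. A sketch along those lines (the format mask is just an example), to be run identically in both databases:

ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS';

SELECT DBMS_SQLHASH.GETHASH(
         sqltext     => 'SELECT object_id, owner, object_name,
                                TO_CHAR(created, ''YYYY-MM-DD HH24:MI:SS'')
                         FROM tab_check ORDER BY object_id',
         digest_type => 2) AS hash
FROM   dual;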
SQL> desc DBMS_SQLHASH
FUNCTION GETHASH RETURNS RAW
Argument Name             Type                    In/Out Default?
------------------------- ----------------------- ------ --------
SQLTEXT                   VARCHAR2                IN
DIGEST_TYPE               BINARY_INTEGER          IN
CHUNK_SIZE                NUMBER                  IN     DEFAULT
Note: the chunk size, the amount of data processed in each pass, defaults to 128M. Changing this will change the hash!
Hash Functions from DBMS_CRYPTO
HASH_MD4 CONSTANT PLS_INTEGER := 1;
HASH_MD5 CONSTANT PLS_INTEGER := 2;
HASH_SH1 CONSTANT PLS_INTEGER := 3;
HASH_SH256 CONSTANT PLS_INTEGER := 4;
HASH_SH384 CONSTANT PLS_INTEGER := 5;
HASH_SH512 CONSTANT PLS_INTEGER := 6;
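The DIGEST_TYPE argument is simply one of these integers, so digest_type => 2 in the earlier examples means MD5. From PL/SQL you can pass the named constant instead, which is easier to read. A sketch only; it assumes execute privilege on dbms_crypto as well as dbms_sqlhash, and SERVEROUTPUT ON:

DECLARE
  l_hash RAW(64);   -- big enough for SHA-512
BEGIN
  l_hash := DBMS_SQLHASH.GETHASH(
              sqltext     => 'SELECT * FROM tab_check ORDER BY 1',
              digest_type => DBMS_CRYPTO.HASH_SH256);  -- same as passing 4
  DBMS_OUTPUT.PUT_LINE(RAWTOHEX(l_hash));
END;
/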
Cross Database Migrations
Well, obviously if you are migrating from one database type to another, you are not going to have the same functions available to you. However, there is nothing stopping you from writing your own hashing checks against the data. Most relational databases have built-in hashing functions [e.g. Postgres has MD5()]. Just make sure you are using exactly the same algorithms, pulling the data from the same character sets in the same order with the same data masks!
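As a rough sketch of what that might look like, here is a per-row MD5 in PostgreSQL next to an Oracle equivalent, using the simple table and c1 column from the demo below (the Oracle version requires execute on dbms_crypto). The two only line up if the character data is byte-for-byte identical and the rows come back in the same order:

-- PostgreSQL
SELECT c1, md5(c1) FROM simple ORDER BY c1;

-- Oracle: 2 = MD5
SELECT c1,
       LOWER(RAWTOHEX(DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW(c1), 2)))
FROM   simple ORDER BY c1;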
Compare a Linux File with Oracle
By way of demonstration, let’s see if the md5sum in Linux matches the data in an Oracle database. Note that we are using the “simpler” DBMS_CRYPTO method to generate the hash. Using DBMS_SQLHASH would generate a different hash.
SQL > select c1,
lower(dbms_crypto.hash(rawtohex(c1),2)) c1_hash
from simple;
C1                             C1_HASH
------------------------------ --------------------------------
The yolk of an egg is white    42fabcc1aa41c79d6f9c33ddf51e011b
For visual purposes, let’s check the RAW output from C1.
SQL > select rawtohex(c1) c1_raw from simple;
C1_RAW
-------
54686520796F6C6B206F6620616E20656767206973207768697465
The same data in a text file on Linux:
$ xxd text_file
0000000: 5468 6520 796f 6c6b 206f 6620 616e 2065 The yolk of an e
0000010: 6767 2069 7320 7768 6974 65 gg is white
$ md5sum text_file
42fabcc1aa41c79d6f9c33ddf51e011b text_file
And the MD5 hashes match!
Important! The text file *does not* contain a line feed! That would add "0a" to the data, and change the hash.
$ xxd text_file_edited
0000000: 5468 6520 796f 6c6b 206f 6620 616e 2065 The yolk of an e
0000010: 6767 2069 7320 7768 6974 650a gg is white.
$ diff text_file text_file_edited
1c1
< The yolk of an egg is white
\ No newline at end of file
---
> The yolk of an egg is white
or… you could always pay for Oracle Veridata 🙂
or… look at Open Source options like this (Thanks Neil!):
https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/docs/examples.md#run-a-row-hash-validation-for-all-rows
Happy Hashing!