Data migration is common. Moving data from one database to another, whether via Data Pump, GoldenGate, unload/load scripts or some other method, carries risk.
- Did you move everything?
- How do you know everything was captured?
- How do you ensure the source and target data is the same?
- Exactly the same?
You are going to have to check. Frequently, I see the rows being counted. And nothing else. If we have 10 rows in the source, and 10 rows in the destination, we must have copied everything! Right?
So, what’s the easiest way to do that? Well, if it’s numeric you could just add all of the values in a particular column and see if the totals match in the source and destination. That would work, but it’s not going to work for characters. What if the input is just binary data? How will you deal with that?
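For reference, here is what that numeric check might look like. This is a minimal sketch, assuming a table with a numeric column (I’m using the tab_check table and its OBJECT_ID column from later in this post); run it against both the source and the target and compare the results:

SELECT COUNT(*)       AS row_count,
       SUM(object_id) AS id_total
FROM   tab_check;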
The easiest solution is usually to create a hash of the data set. So what is hashing?
Hashing
Hashing basically converts input values into a fixed-length text string. It doesn’t really matter what the input is: numeric, character or any old binary. It does this by chopping the input into fixed-length chunks and hashing each chunk, folding in the hash of the previous chunk, to come out with a deterministic output.
MD5 Hash
MD5 is a common hashing algorithm, and it’s easy to check out how it works on Linux using the md5sum command. MD5 chops the input up into 512-bit blocks and hashes the first block. It then hashes the second block, plus the hash from the first block, and so on until it has hashed the entire file. It doesn’t matter how big the file is, you will end up with a single fixed-length hash. If the hashes from different files do not match, the files are different.
Important! I know MD5 is largely deprecated and should not be used for cryptography now. However, it’s simple, it’s relatively fast, and importantly all we are doing here is checking whether the source and destination data match within our organization. Security may insist you use SHA-384 or stronger, but that will be slower and burn more CPU.
If we look at the following 2 files, it’s tricky to see if they are different. However, if we run each file through the MD5 algorithm, we can see very clearly that the files are different.
$ cat text1
The yolk of an egg is white
$ cat text2
The yolk of an egg is White
$ md5sum text1
16d9638de9adb5d3ca2dd8d11db0b7aa text1
$ md5sum text2
1048d02bc43b27d993ace89822bd649e text2
It is worth noting that whilst we can tell that the 2 files are not identical (capital W in the 2nd file), we cannot tell if the data is correct. The hash function does not know that the yolk of an egg is yellow.
It doesn’t matter how big the files are, we get a single fixed-length output. Here we have 3 almost identical files (textA, textB and textC), each of which is over 256K in size. Which is different? It may be impossible to tell visually, but using md5sum…
$ wc -c textA
282828 textA
$ wc -c textB
282828 textB
$ wc -c textC
282828 textC
$ md5sum textA
0c72b53d19f8a97a83b2dd0c1f098737 textA
$ md5sum textB
0c72b53d19f8a97a83b2dd0c1f098737 textB
$ md5sum textC
40b81609195acc5b1cb57aba6a881861 textC
(note: textC has a 1-character difference - a capital W on one row)
In Oracle?
So, how can we do this within Oracle? Fortunately, Oracle provides a function that lets us compare data in 2 Oracle databases as easily as on Linux: dbms_sqlhash.
DBMS_SQLHASH
Oracle provides a remarkably useful function to allow us to generate a hash from a SQL statement. We can supply a SQL statement directly to the DBMS_SQLHASH.GETHASH packaged function to get a hashed output.
The algorithm used is specified by the DIGEST_TYPE parameter.
As we can see, "tab_check" has a variety of data types, including raw binary data.
desc tab_check
Name                    Null?    Type
----------------------- -------- ----------------
OBJECT_ID               NOT NULL NUMBER
OWNER                            VARCHAR2(128)
OBJECT_NAME                      VARCHAR2(128)
SUBOBJECT_NAME                   VARCHAR2(128)
OBJECT_TYPE                      VARCHAR2(23)
CREATED                          DATE
BIGGIE                           CLOB
BIGGIE2                          PUBLIC.XMLTYPE STORAGE BINARY
BIGGIE4                          BLOB
SELECT DBMS_SQLHASH.GETHASH
(SQLTEXT => 'SELECT * FROM tab_check ORDER BY 1',
DIGEST_TYPE => 2) AS hash
FROM dual;
HASH
--------------------------------
2BE06D9F4F81D2AB13F8909D298AD7D4
Run this SQL in the source and the target databases, and you can deterministically demonstrate that the data in the table is identical.
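If the two databases can reach each other, you can even do the comparison from one place over a database link. This is a sketch only: it assumes a database link called target_db pointing at the migrated database, the tab_check table at both ends, execute privilege on dbms_sqlhash in both databases, and SERVEROUTPUT ON in your session.

DECLARE
  l_source RAW(32);
  l_target RAW(32);
BEGIN
  -- same statement, hashed locally and on the remote (target) database
  l_source := DBMS_SQLHASH.GETHASH('SELECT * FROM tab_check ORDER BY 1', 2);
  l_target := DBMS_SQLHASH.GETHASH@target_db('SELECT * FROM tab_check ORDER BY 1', 2);
  IF l_source = l_target THEN
    DBMS_OUTPUT.PUT_LINE('Hashes match: ' || RAWTOHEX(l_source));
  ELSE
    DBMS_OUTPUT.PUT_LINE('MISMATCH! source=' || RAWTOHEX(l_source) ||
                         ' target=' || RAWTOHEX(l_target));
  END IF;
END;
/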
The DBMS_SQLHASH function does have some limitations, notably around the buffer size used, so if you have particularly wide tables, large LOBs, or just lots of data per row, you may find you need to pull specific columns for input into the function, as sketched below.
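One workaround is to hash a narrower projection, for example just the scalar columns, and deal with the LOB columns separately. A sketch against the same tab_check table, using the column names from the describe above:

SELECT DBMS_SQLHASH.GETHASH(
         sqltext     => 'SELECT object_id, owner, object_name, object_type
                         FROM tab_check ORDER BY object_id',
         digest_type => 2) AS scalar_hash
FROM   dual;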
You will need to grant execute on dbms_sqlhash to <query-user>; it is not granted by default for security reasons.
WARNING! For the hash to be deterministic, you MUST ensure that you access the same data in the same order with the same National Language Support (NLS) settings. If you access it in a different order (e.g. by using an index rather than a full table scan), or if you change any formats (e.g. date formats) between runs, you will get different values!
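One way to take the ambiguity out of it is to spell everything out: fix the session NLS settings, list the columns explicitly, convert dates with an explicit format, and order by a unique key. A sketch along those lines (the format mask is just an example), to be run identically in both databases:

ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS';

SELECT DBMS_SQLHASH.GETHASH(
         sqltext     => 'SELECT object_id, owner, object_name,
                                TO_CHAR(created, ''YYYY-MM-DD HH24:MI:SS'')
                         FROM tab_check ORDER BY object_id',
         digest_type => 2) AS hash
FROM   dual;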
SQL> desc DBMS_SQLHASH
FUNCTION GETHASH RETURNS RAW
Argument Name             Type                    In/Out Default?
------------------------- ----------------------- ------ --------
SQLTEXT                   VARCHAR2                IN
DIGEST_TYPE               BINARY_INTEGER          IN
CHUNK_SIZE                NUMBER                  IN     DEFAULT
Note: the chunk size, the amount of data processed in each pass, defaults to 128M. Changing this will change the hash!
Hash Functions from DBMS_CRYPTO
HASH_MD4 CONSTANT PLS_INTEGER := 1;
HASH_MD5 CONSTANT PLS_INTEGER := 2;
HASH_SH1 CONSTANT PLS_INTEGER := 3;
HASH_SH256 CONSTANT PLS_INTEGER := 4;
HASH_SH384 CONSTANT PLS_INTEGER := 5;
HASH_SH512 CONSTANT PLS_INTEGER := 6;
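The DIGEST_TYPE argument is simply one of these integers, so digest_type => 2 in the earlier examples means MD5. From PL/SQL you can pass the named constant instead, which is easier to read. A sketch only; it assumes execute privilege on dbms_crypto as well as dbms_sqlhash, and SERVEROUTPUT ON:

DECLARE
  l_hash RAW(64);   -- big enough for SHA-512
BEGIN
  l_hash := DBMS_SQLHASH.GETHASH(
              sqltext     => 'SELECT * FROM tab_check ORDER BY 1',
              digest_type => DBMS_CRYPTO.HASH_SH256);  -- same as passing 4
  DBMS_OUTPUT.PUT_LINE(RAWTOHEX(l_hash));
END;
/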
Cross Database Migrations
Well, obviously if you are migrating from one database type to another, you are not going to have the same functions available to you. However, there is nothing stopping you from writing your own hashing checks against the data. Most relational databases have built-in hashing functions [e.g. Postgres has MD5()]. Just make sure you are using exactly the same algorithms, pulling the data from the same character sets in the same order with the same data masks!
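As a rough sketch of what that might look like, here is a per-row MD5 in PostgreSQL next to an Oracle equivalent, using the simple table and c1 column from the demo below (the Oracle version requires execute on dbms_crypto). The two only line up if the character data is byte-for-byte identical and the rows come back in the same order:

-- PostgreSQL
SELECT c1, md5(c1) FROM simple ORDER BY c1;

-- Oracle: 2 = MD5
SELECT c1,
       LOWER(RAWTOHEX(DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW(c1), 2)))
FROM   simple ORDER BY c1;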
Compare a Linux File with Oracle
By way of demonstration, let’s see if the md5sum in Linux matches the data in an Oracle database. Note that we are using the “simpler” DBMS_CRYPTO method to generate the hash. Using DBMS_SQLHASH would generate a different hash.
SQL > select c1,
lower(dbms_crypto.hash(rawtohex(c1),2)) c1_hash
from simple;
C1                             C1_HASH
------------------------------ --------------------------------
The yolk of an egg is white    42fabcc1aa41c79d6f9c33ddf51e011b
For visual purposes, let’s check the RAW output from C1.
SQL > select rawtohex(c1) c1_raw from simple;
C1_RAW
-------
54686520796F6C6B206F6620616E20656767206973207768697465
The same data in a text file on Linux:
$ xxd text_file
0000000: 5468 6520 796f 6c6b 206f 6620 616e 2065 The yolk of an e
0000010: 6767 2069 7320 7768 6974 65 gg is white
$ md5sum text_file
42fabcc1aa41c79d6f9c33ddf51e011b text_file
And the MD5 hashes match!
Important! The text file *does not* contain a line feed! That would add "0a" to the data, and change the hash.
$ xxd text_file_edited
0000000: 5468 6520 796f 6c6b 206f 6620 616e 2065 The yolk of an e
0000010: 6767 2069 7320 7768 6974 650a gg is white.
$ diff text_file text_file_edited
1c1
< The yolk of an egg is white
\ No newline at end of file
---
> The yolk of an egg is white
or… you could always pay for Oracle Veridata 🙂
or… look at Open Source options like this (Thanks Neil!):
https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/docs/examples.md#run-a-row-hash-validation-for-all-rows
Happy Hashing!