Grid Infrastructure Disk Space Problem – CHM DB file: crfclust.bdb
29/09/2014
The Grid Infrastructure filesystem was reporting that it was rather full today (release 11.2.0.4). This was tracked down to the “crfclust.bdb” file, which records cluster health information for monitoring purposes. It was 26GB. It’s not supposed to grow beyond 1GB, so this is probably a bug, but let’s resolve the space issue now and search Oracle Support afterwards. Worst case, the bdb (Berkeley Database) files get regenerated when the CHM (ora.crf) resource is restarted. You only lose the OS statistics that CHM has gathered so far; deleting the bdb files has no other impact, and CHM simply starts collecting OS statistics again.
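For reference, that worst-case clean-up would look something like the sketch below. This is not the route taken in the rest of this post, and it is only a sketch: run as root, paths as per this cluster, and test it in a sandpit first.
crsctl stop res ora.crf -init             # stop CHM on this node
rm /u01/app/11g/grid/crf/db/node01/*.bdb  # remove the Berkeley DB files; only gathered OS statistics are lost
crsctl start res ora.crf -init            # restart CHM; it recreates the bdb files and resumes collection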
df -h /u01
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 48G 36G 9.0G 81% /u01
pwd
/u01/app/11g/grid/crf/db/node01
ls -lh
total 29G
-rw-r--r-- 1 root root 2.1M Jul 22 12:12 22-JUL-2014-12:12:03.txt
-rw-r--r-- 1 root root 1.3M Apr 23 14:28 23-APR-2014-14:28:04.txt
-rw-r--r-- 1 root root 1.2M Apr 23 14:33 23-APR-2014-14:33:34.txt
-rw-r--r-- 1 root root 1.3M Jul 23 12:53 23-JUL-2014-12:53:02.txt
-rw-r--r-- 1 root root 946K Apr 26 03:57 26-APR-2014-03:57:21.txt
-rw-r----- 1 root root 492M Aug 26 10:33 crfalert.bdb
-rw-r----- 1 root root 26G Aug 26 10:33 crfclust.bdb  <- 26G!
-rw-r----- 1 root root 8.0K Jul 23 12:52 crfconn.bdb
-rw-r----- 1 root root 521M Aug 26 10:33 crfcpu.bdb
-rw-r----- 1 root root 513M Aug 26 10:33 crfhosts.bdb
-rw-r----- 1 root root 645M Aug 26 10:33 crfloclts.bdb
-rw-r----- 1 root root 418M Aug 26 10:33 crfts.bdb
-rw-r----- 1 root root 24K Aug 1 16:07 __db.001
-rw-r----- 1 root root 392K Aug 26 10:33 __db.002
-rw-r----- 1 root root 2.6M Aug 26 10:33 __db.003
-rw-r----- 1 root root 2.1M Aug 26 10:34 __db.004
-rw-r----- 1 root root 1.2M Aug 26 10:33 __db.005
-rw-r----- 1 root root 56K Aug 26 10:34 __db.006
-rw-r----- 1 root root 16M Aug 26 10:17 log.0000008759
-rw-r----- 1 root root 16M Aug 26 10:33 log.0000008760
-rw-r----- 1 root root 8.0K Aug 26 10:33 repdhosts.bdb
-rw-r--r-- 1 root root 115M Jul 22 12:12 node01.ldb
Let’s see how big the repository is…
oclumon manage -get repsize
CHM Repository Size = 1073736016
Wow. That seems a bit oversized. Resize the repository by specifying the desired retention in seconds, anywhere between 3600 (1 hour) and 259200 (3 days):
oclumon manage -repos resize 259200
node01 --> retention check successful
node02 --> retention check successful
New retention is 259200 and will use 4524595200 bytes of disk space
CRS-9115-Cluster Health Monitor repository size change completed on all nodes.
If we now check the size, we get an error as the repository is bigger than the max allowed size.
oclumon manage -get repsize
CRS-9011-Error manage: Failed to initialize connection to the Cluster Logger Service
So we need to stop and start the ora.crf service to get everything working again. It should be OK to do this on a running system with no impact, but I’d test it in your sandpit first. Don’t take my word for it!
Check for process:
node01:/u01/app/11g/grid/bin>ps -ef |grep crf
root 26983 1 0 10:44 ? 00:00:00 /u01/app/11g/grid/bin/ologgerd -m node02 -r -d /u01/app/11g/grid/crf/db/node01
Stop service:
node01:/u01/app/11g/grid/bin>crsctl stop res ora.crf -init
CRS-2673: Attempting to stop 'ora.crf' on 'node01'
CRS-2677: Stop of 'ora.crf' on 'node01' succeeded
Start Service:
node01:/u01/app/11g/grid/bin>crsctl start res ora.crf -init
CRS-2672: Attempting to start 'ora.crf' on 'node01'
CRS-2676: Start of 'ora.crf' on 'node01' succeeded
Check for Process:
node01:/u01/app/11g/grid/bin>ps -ef |grep crf
root 28000 1 5 10:49 ? 00:00:00 /u01/app/11g/grid/bin/ologgerd -m node02 -r -d /u01/app/11g/grid/crf/db/node01
Check the size – as specified:
node01:/u01/app/11g/grid/bin>oclumon manage -get repsize
CHM Repository Size = 259200
Done
Finally, check that the space has been released and reclaimed:
node01:/u01/app/11g/grid/bin>df -h /u01
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 48G 7.7G 38G 18% /u01
The space has been returned. Marvellous.
Now repeat the stop/start on each node.
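If you have a few nodes, something like the loop below saves a little typing. It is only a sketch: it assumes root ssh access between the nodes and the same Grid home as above, so adapt it to your environment (or simply run the two crsctl commands locally on each node in turn).
GRID_HOME=/u01/app/11g/grid
for node in node01 node02; do
  ssh root@${node} "${GRID_HOME}/bin/crsctl stop res ora.crf -init && \
                    ${GRID_HOME}/bin/crsctl start res ora.crf -init"
done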
UPDATE: From Oracle Support: very large bdb files (greater than 2GB) are likely due to a bug, since the default size limits the bdb file to 1GB unless the CHM data retention time is increased. One such bug is 10165314.
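Since the bug can cause the file to grow again after the restart, it may be worth keeping an eye on it. A minimal check (the 1GB threshold and the path are assumptions based on the defaults above) that could be run from cron:
find /u01/app/11g/grid/crf/db/node01 -name 'crfclust.bdb' -size +1G -exec ls -lh {} \;   # prints the file only if it has grown past 1GB again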