Extending an ACFS filesystem dynamically

To extend an ACFS cluster filesystem dynamically, we need to use the acfsutil command:

node01:/u01/grid>/sbin/acfsutil size +10G /u02
acfsutil size: ACFS-03008: The volume could not be resized.  The volume expansion limit has been reached.
acfsutil size: ACFS-03216: The ADVM compatibility attribute for the diskgroup was below the required
                           version (11.2.0.4.0) for unlimited volume expansions.

Oh dear, not 11.2.0.4, so you can only extend the volume dynamically a limited number of times (5) before the global bitmap runs out of extents. So now it’s an offline change. 😦
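Before going down the offline route, it's worth confirming what the disk group's ADVM compatibility actually is. A minimal check, assuming the disk group is called ACFSDISK and asmcmd is run as the grid owner with the ASM environment set:

# show the current ADVM compatibility attribute for the disk group
asmcmd lsattr -l -G acfsdisk compatible.advm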

Check what is accessing /u02 and stop it:

node01:/opt/oracle>sudo -s
[root@node01 oracle]# lsof /u02

COMMAND   PID   USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
bash     5566 oracle  cwd    DIR 252,50177    12288   78 /u02/goldengate/bin11
su      29509   root  cwd    DIR 252,50177    12288   78 /u02/goldengate/bin11

erm. kill -9 5566 29509 :-)

DO THIS ON EVERY RAC NODE!

[root@node01 oracle]# umount -t acfs /u02
[root@node02 oracle]# umount -t acfs /u02
[root@node-n oracle]# umount -t acfs /u02
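If the filesystem is registered with the clusterware (rather than only in /etc/fstab), a sketch of an alternative, using the volume device from this example, is to let CRS unmount and remount it on all nodes in one go:

# stop (unmount) the ACFS filesystem on every node via clusterware
srvctl stop filesystem -d /dev/asm/acfsdisk_u02-98
# ...and later, bring it back everywhere
srvctl start filesystem -d /dev/asm/acfsdisk_u02-98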

Once unmounted, we can “repair” the global bitmap:

[root@node02 oracle]# fsck -y -t acfs  /dev/asm/acfsdisk_u02-98
fsck from util-linux-ng 2.17.2
version                   = 11.2.0.4.0
*****************************
********** Pass: 1 **********
*****************************
Oracle ASM Cluster File System (ACFS) On-Disk Structure Version: 39.0
 ACFS file system created at: Thu Jan  2 17:08:02 2014
 checking primary file system
 Files checked in primary file system: 25%
 Files checked in primary file system: 100%

 fsck.acfs: ACFS-07728: The Global_BitMap file has reached the maximum number of extents (5).
 The file system can no longer be expanded. 

 Running fsck.acfs in fixer mode will attempt to consolidate the storage bitmap into 
 fewer extents which would allow for file system expansion

 Checking if any files are orphaned...
 0 orphans found
 Checker completed with no errors.

So let’s fix it – output seriously trimmed, but with the important bit:

[root@node02 oracle]# /sbin/fsck.acfs -a -v /dev/asm/acfsdisk_u02-98
fsck from util-linux-ng 2.17.2
version                   = 11.2.0.4.0
 *****************************
 ********** Pass: 1 **********
 *****************************
 Oracle ASM Cluster File System (ACFS) On-Disk Structure Version: 39.0
 ACFS file system created at: Thu Jan  2 17:08:02 2014

 checking primary file system
 Files checked in primary file system: 25%
 Files checked in primary file system: 100%

 fsck.acfs: ACFS-07729: The Global_Bitmap file has been
 consolidated into 2 extents.
 This may allow for file system expansion via the 'acfsutil size' command.
  
 Checking if any files are orphaned...
 0 orphans found
 Checker completed with no errors.

So, we’re done and can re-mount ON EVERY NODE. Given it’s now 2 extents, and the max we can have is 5, we have 3 more dynamic extensions before we need to do this again.

mount -t acfs /dev/asm/acfsdisk_u02-98 /u02
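To sanity-check that each node has the filesystem back before resizing, something like acfsutil info fs should do (it reports the mount state and current size):

# confirm the filesystem is mounted and report its current size
/sbin/acfsutil info fs /u02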

And re-attempt to expand the filesystem

node01:/u01/grid>df -h /u02
Filesystem              Size  Used Avail Use% Mounted on
/dev/asm/acfsdisk_u02-98  325G   36G  290G  12% /u02

node01:/u01/grid>/sbin/acfsutil size +10G /u02
acfsutil size: new file system size: 359703511040 (343040MB)

node01:/u01/grid>df -h /u02
Filesystem              Size  Used Avail Use% Mounted on
/dev/asm/acfsdisk_u02-98  335G   36G  300G  11% /u02
node01:/u01/grid>

Yey! Bigger filesystem! Let’s minimise the number of times it needs to be extended in the future by doing it in big lumps. Might just save a planned outage.

 

Alternatively, upgrade Grid Infrastructure to at least 11.2.0.4 and set the ADVM compatibility attribute to 11.2.0.4, and the restriction will be gone for good:

ALTER DISKGROUP acfsdisk SET ATTRIBUTE 'compatible.asm' = '11.2.0.4', 'compatible.rdbms' = '11.2.0.4', 'compatible.advm' = '11.2.0.4';
(or ASMCMD [+] > setattr -G acfsdisk compatible.advm 11.2.0.4)
(or right-click on the disk group in asmca and click "edit attributes")

Grid Infrastructure Disk Space Problem – CHM DB file: crfclust.bdb

The Grid Infrastructure filesystem was reporting that it was a bit full today (release 11.2.0.4). This was tracked down to the “crfclust.bdb” file, which records information about cluster health for monitoring purposes. It was 26GB. It’s not supposed to grow beyond 1GB, so this is probably a bug, but let’s explicitly resolve the size issue right now and search Oracle Support later. Worst case, the bdb (Berkeley DB) files get regenerated when the CHM (ora.crf) resource is restarted. You only lose the OS statistics that CHM has gathered; deleting the bdb files has no other impact, and CHM simply starts collecting OS statistics again.
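If you need to confirm where these bdb files live on your system, oclumon can tell you. A minimal check, assuming the grid environment is set for the user running it:

# report the directory holding the CHM repository (Berkeley DB) files
oclumon manage -get reppath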

 

df -h /u01

Filesystem                Size  Used Avail Use% Mounted on
/dev/sdc1                  48G   36G  9.0G  81% /u01

pwd
/u01/app/11g/grid/crf/db/node01

ls -lh
total 29G

-rw-r--r-- 1 root root 2.1M Jul 22 12:12 22-JUL-2014-12:12:03.txt
-rw-r--r-- 1 root root 1.3M Apr 23 14:28 23-APR-2014-14:28:04.txt
-rw-r--r-- 1 root root 1.2M Apr 23 14:33 23-APR-2014-14:33:34.txt
-rw-r--r-- 1 root root 1.3M Jul 23 12:53 23-JUL-2014-12:53:02.txt
-rw-r--r-- 1 root root 946K Apr 26 03:57 26-APR-2014-03:57:21.txt
-rw-r----- 1 root root 492M Aug 26 10:33 crfalert.bdb
-rw-r----- 1 root root  26G Aug 26 10:33 crfclust.bdb   <- 26G!
-rw-r----- 1 root root 8.0K Jul 23 12:52 crfconn.bdb
-rw-r----- 1 root root 521M Aug 26 10:33 crfcpu.bdb
-rw-r----- 1 root root 513M Aug 26 10:33 crfhosts.bdb
-rw-r----- 1 root root 645M Aug 26 10:33 crfloclts.bdb
-rw-r----- 1 root root 418M Aug 26 10:33 crfts.bdb
-rw-r----- 1 root root  24K Aug  1 16:07 __db.001
-rw-r----- 1 root root 392K Aug 26 10:33 __db.002
-rw-r----- 1 root root 2.6M Aug 26 10:33 __db.003
-rw-r----- 1 root root 2.1M Aug 26 10:34 __db.004
-rw-r----- 1 root root 1.2M Aug 26 10:33 __db.005
-rw-r----- 1 root root  56K Aug 26 10:34 __db.006
-rw-r----- 1 root root  16M Aug 26 10:17 log.0000008759
-rw-r----- 1 root root  16M Aug 26 10:33 log.0000008760
-rw-r----- 1 root root 8.0K Aug 26 10:33 repdhosts.bdb
-rw-r--r-- 1 root root 115M Jul 22 12:12 node01.ldb

Let’s see how big the repository is…

oclumon manage -get repsize
CHM Repository Size = 1073736016

Wow. Seems a bit oversized. Change the repository retention to the desired number of seconds, between 3600 (1 hour) and 259200 (3 days):

oclumon manage -repos resize 259200

node01 --> retention check successful
node02 --> retention check successful

New retention is 259200 and will use 4524595200 bytes of disk space
CRS-9115-Cluster Health Monitor repository size change completed on all nodes.

If we now check the size, we get an error as the repository is bigger than the max allowed size.

oclumon manage -get repsize
CRS-9011-Error manage: Failed to initialize connection to the Cluster Logger Service

So we need to stop and start the ora.crf service to get everything working again. It should be OK to do this on a running system with no impact, but I’d start with your sandpit to test it. Don’t take my word for it!

Check for process:

node01:/u01/app/11g/grid/bin>ps -ef |grep crf
root     26983     1  0 10:44 ?        00:00:00 /u01/app/11g/grid/bin/ologgerd -m node02 -r -d /u01/app/11g/grid/crf/db/node01

Stop service:
node01:/u01/app/11g/grid/bin>crsctl stop res ora.crf -init

CRS-2673: Attempting to stop ‘ora.crf’ on ‘node01’
CRS-2677: Stop of ‘ora.crf’ on ‘node01’ succeeded

Start Service:
node01:/u01/app/11g/grid/bin>crsctl start res ora.crf -init
CRS-2672: Attempting to start ‘ora.crf’ on ‘node01’
CRS-2676: Start of ‘ora.crf’ on ‘node01’ succeeded

Check for Process:
node01:/u01/app/11g/grid/bin>ps -ef  |grep crf
root     28000     1  5 10:49 ?        00:00:00 /u01/app/11g/grid/bin/ologgerd -m node02 -r -d /u01/app/11g/grid/crf/db/node01

Check the size – as specified:
node01:/u01/app/11g/grid/bin>oclumon manage -get repsize

CHM Repository Size = 259200

Done

And the space is released and reclaimed.

node01:/u01/app/11g/grid/bin>df -h /u01

Filesystem                Size  Used Avail Use% Mounted on
/dev/sdc1                  48G  7.7G   38G  18% /u01

The space has been returned. Marvellous.
Now repeat the stop/start on each node.
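A quick way to do that from one node, as a sketch assuming passwordless ssh as root and that your nodes are node01 and node02, is a simple loop:

# bounce ora.crf on each node in turn (adjust the node list and grid home to suit)
for h in node01 node02; do
  ssh root@$h /u01/app/11g/grid/bin/crsctl stop  res ora.crf -init
  ssh root@$h /u01/app/11g/grid/bin/crsctl start res ora.crf -init
done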

 

UPDATE: From Oracle Support: Having very large bdb files (greater than 2GB) is likely due to a bug since the default size limits the bdb to 1GB unless the CHM data retention time is increased.  One such bug is 10165314.

SCAN listener logs – where are they?

These are a bit of a pain as their location isn’t intuitive. So to remind me where they are:

Log in to the server(s) as the grid owner and check the SCAN listener status. This will show you the location of the listener log. cd to the directory just above diag (the ADR base), fire up adrci, and you’re off:

server-name:/u01/grid>ps -ef | grep SCAN
 grid 8542 8282 0 10:20 pts/0 00:00:00 grep SCAN
 grid 9349 1 0 Mar07 ? 00:07:33 /u01/app/11g/grid/bin/tnslsnr LISTENER_SCAN1 -inherit

server-name:/u01/grid>lsnrctl status LISTENER_SCAN1
LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 28-MAY-2014 10:20:12
Copyright (c) 1991, 2013, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1)))
 STATUS of the LISTENER
 ------------------------
 Alias LISTENER_SCAN1
 Version TNSLSNR for Linux: Version 11.2.0.4.0 - Production
 Start Date 07-MAR-2014 17:27:50
 Uptime 81 days 15 hr. 52 min. 21 sec
 Trace Level off
 Security ON: Local OS Authentication
 SNMP OFF
 Listener Parameter File /u01/app/11g/grid/network/admin/listener.ora
 Listener Log File /u01/app/11g/grid/log/diag/tnslsnr/server-name/listener_scan1/alert/log.xml
 Listening Endpoints Summary...
 (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN1)))
 (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.6.148.141)(PORT=1521)))
 Services Summary...
 Service "FRONT_APP_DB_SVC.WORLD" has 2 instance(s).
 Instance "ORCL1", status READY, has 1 handler(s) for this service...
 Instance "ORCL2", status READY, has 1 handler(s) for this service...
 Service "ORCL.WORLD" has 2 instance(s).
 Instance "ORCL1", status READY, has 1 handler(s) for this service...
 Instance "ORCL2", status READY, has 1 handler(s) for this service...
 Service "ORCLXDB.WORLD" has 2 instance(s).
 Instance "ORCL1", status READY, has 1 handler(s) for this service...
 Instance "ORCL2", status READY, has 1 handler(s) for this service...
 The command completed successfully

server-name:/u01/app/11g/grid/log>cd /u01/app/11g/grid/log
server-name:/u01/app/11g/grid/log>adrci
ADRCI: Release 11.2.0.4.0 - Production on Wed May 28 10:22:47 2014
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
ADR base = "/u01/app/11g/grid/log"
 adrci> show homes
 ADR Homes:
 diag/asmcmd/user_grid/server-name
 diag/tnslsnr/server-name/listener_scan3
 diag/tnslsnr/server-name/listener_scan2
 diag/tnslsnr/server-name/listener_scan1
 diag/asmtool/user_root/host_3797755080_80
adrci> show alert
Choose the alert log from the following homes to view:
1: diag/asmcmd/user_grid/server-name
 2: diag/tnslsnr/server-name/listener_scan3
 3: diag/tnslsnr/server-name/listener_scan2
 4: diag/tnslsnr/server-name/listener_scan1
 5: diag/asmtool/user_root/host_3797755080_80
 Q: to quit
Please select option: 4

And there we are… Remember to be on the correct host for each SCAN listener, otherwise the alert (listener) log file you are looking at will be out of date.
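To work out which node is currently hosting each SCAN listener before you log in, a quick check from any node as the grid owner is:

# shows which node each SCAN VIP and SCAN listener is currently running on
srvctl status scan
srvctl status scan_listener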

 

The trouble with Timezones & Grid Infrastructure

When installing Oracle Grid Infrastructure 11.2 (and all other releases), you need to make sure that all of the server settings are correct and to standard before you do the install. One that bit me recently was the timezone setting. The Red Hat 6.4 server(s) in question had the correct file in /etc/localtime (copied from /usr/share/zoneinfo/whatever). If I typed in date, I got the reply in the correct timezone (GMT/BST, as I’m in London), so all seemed correct.

However, the slack Unix sysadmin (which might or might not have been me) had not put the correct setting in /etc/sysconfig/clock. Unfortunately, when you install Grid Infrastructure, the setting is read from /etc/sysconfig/clock and embedded into a Grid Infrastructure config file: $GRID_HOME/crs/install/s_crsconfig_<hostname>_env.txt


### This file can be used to modify the NLS_LANG environment variable, which determines the charset to be used for messages.
### For example, a new charset can be configured by setting NLS_LANG=JAPANESE_JAPAN.UTF8
### Do not modify this file except to change NLS_LANG, or under the direction of Oracle Support Services

TZ=Europe/London
NLS_LANG=AMERICAN_AMERICA.AL32UTF8
TNS_ADMIN=
ORACLE_BASE=

If you change this entry (and you should check with Oracle Support that this is OK for your site), you will need to restart Grid Infrastructure. The one thing about this that I really don’t like is that Oracle is storing a runtime configuration file in an install directory. Does it do that anywhere else?
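A quick way to spot the mismatch is to compare the two settings side by side. A minimal sketch, assuming GRID_HOME is set and that the file name uses the short node name (as it did here):

# the system-wide timezone setting that the GI installer reads
grep ZONE /etc/sysconfig/clock
# what Grid Infrastructure embedded at install time
grep '^TZ=' $GRID_HOME/crs/install/s_crsconfig_$(hostname -s)_env.txt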

RACCheck

Running RAC? (Why? No, really, WHY? Never heard of Data Guard? With a broker?)

Running RAC?
Not sure if you’ve configured it correctly?
Not sure if you have all of the recommended initialisation parameters set?
All recommended RPM’s installed?
All daemons running?
etc, etc, etc,

Well, as of Oracle 11.2.0.4 there’s a new feature provided by default called RACCheck. You can find it installed in the directory $ORACLE_HOME/suptools/raccheck (or you can download it from MOS article 1268927.1), and it’s called “raccheck”. With a little sudo configuration, or the root passwords, you can check the configuration of every node in a few minutes per node (run it at a sensible time). All the basics appear to be covered, and you get a nice list of anomalies out of the system in HTML format.
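Running it is straightforward. A typical invocation, as a sketch (the -a flag asks for all checks; you’ll be prompted for root access details on each node):

cd $ORACLE_HOME/suptools/raccheck
# run all checks (best practice and recommended patch checks)
./raccheck -a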

I don’t necessarily agree with some of the errors/warnings produced (you might want the “problems” it’s finding!), but it gives you cause to re-think an element of the system that may be configured in a non-standard way, and you get lots of relevant and useful links to MOS articles.

e.g. One problem: 

WARNING SQL Check Some user sessions lack proper failover mode (BASIC) and method (SELECT) All Databases

Can be happily ignored as I’m using a SCAN listener, which renders this WARNING irrelevant.

But I would recommend that you use the utility and accept/understand any exceptions. It should help stabilise any RAC installations you may have.
