Data Guard Unexpected Lag

When configuring a physical standby database for Oracle using Data Guard, you need to create Standby Redo Logs to allow the redo to be applied to the standby in (near) real time. Without standby redo logs, Oracle will wait for an entire archive log to be filled and copied across to the standby before it applies the changes, which can take quite a while.
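You can confirm that real-time apply is actually in use by checking the recovery mode on the primary (a quick check; I’m assuming your standby is archive destination 2):

SQL> select dest_id, recovery_mode from v$archive_dest_status where dest_id = 2;

This reports MANAGED REAL TIME APPLY when standby redo logs are in place and being used.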

Which leads me to a problem I encountered a while ago and, being forgetful, still encounter today when creating standby redo logs in Oracle 12c and 18c.

I have a Primary database with 3 Redo logs, and I have created a Physical Standby database with 3 Redo logs and then 4 Standby Redo logs created as follows:

alter database add standby logfile group 4 '/u01/.../redo_standby04.dbf' size 200M;
alter database add standby logfile group 5 '/u01/.../redo_standby05.dbf' size 200M;
alter database add standby logfile group 6 '/u01/.../redo_standby06.dbf' size 200M;
alter database add standby logfile group 7 '/u01/.../redo_standby07.dbf' size 200M;

And this is what they look like:

SQL> select group#,thread#,sequence#,bytes/1024/1024 MB,status from v$log order by group#;
GROUP#     THREAD#    SEQUENCE#  MB         STATUS
---------- ---------- ---------- ---------- ----------------
1                   1          0        200 UNUSED
2                   1          0        200 UNUSED
3                   1          0        200 UNUSED

SQL> select group#,thread#,sequence#,bytes/1024/1024 MB,status from v$standby_log order by group#;
GROUP#     THREAD#    SEQUENCE#  MB         STATUS
---------- ---------- ---------- ---------- ----------
4                   1        111        200 ACTIVE
5                   0          0        200 UNASSIGNED
6                   0          0        200 UNASSIGNED
7                   0          0        200 UNASSIGNED

SQL> select * from v$logfile order by group#;
GROUP#     STATUS  TYPE    MEMBER 
---------- ------- ------- --------------------------------------------------------------------------------
         1         ONLINE  /u01/.../redo01.dbf
         2         ONLINE  /u01/.../redo02.dbf
         3         ONLINE  /u01/.../redo03.dbf
         4         STANDBY /u01/.../redo_standby04.dbf
         5         STANDBY /u01/.../redo_standby05.dbf
         6         STANDBY /u01/.../redo_standby06.dbf
         7         STANDBY /u01/.../redo_standby07.dbf

So what’s the problem?

Well, Data Guard works as expected for a while, applying changes immediately, then it lags for ages and you get an OEM alert that it’s lagging. Then it works fine for a while, and then it lags again.

The problem is that when I created the Standby Redo log files, I did not specify the thread. But why should I need to specify the thread? This is not RAC. It does not have multiple redo threads, one for each node. We have Thread 1 and that’s it.

When I used to do this in Oracle 11.2, the Standby Redo logs got Thread 1 by default. Always Thread 1.

From Oracle 12.1, and also in 12.2.0.1, 18c (internally 12.2.0.2) and probably 19c (internally 12.2.0.3), Oracle creates the Standby Redo logs with Thread 0 by default, although it sometimes creates them on Thread 1 (this seems to happen only in 12.1, but let me know if you know better). I don’t know why.

Standby Redo Logs MUST belong to the SAME thread as the Online Redo Logs for them to be used. With (in this case) 1 standby redo log on Thread 1 and 3 on Thread 0, I have only 1 standby redo log usable by my online redo.
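A quick way to spot the mismatch is to count standby redo logs by thread:

SQL> select thread#, count(*) from v$standby_log group by thread#;

Anything sitting on Thread 0 is effectively wasted.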

The solution is to stop the apply, DROP the Thread 0 Standby Redo Logs, create new Standby Redo Logs explicitly stating Thread 1, and restart the apply.

dgmgrl> edit database orcl12dr set state='APPLY-OFF';

sql> alter database drop standby logfile          group 4;
sql> alter database add  standby logfile thread 1 group 4 '/u01/.../redo_standby04.dbf' size 200M;
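(Repeat the drop and re-create for each Thread 0 group.)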

dgmgrl> edit database orcl12dr set state='APPLY-ON';
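Then check that every standby redo log now shows Thread 1:

SQL> select group#, thread#, status from v$standby_log order by group#;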

And all is good with the world.

It is worth saying that MOS note 1595034.1 gets the corrective advice EXACTLY WRONG, with the incorrect snippet reproduced here:

"The standby redo logs can't be used, since they are assigned to thread 0. 
Standby RedoLogs should not be assigned to a specific Thread. 
Please use fallowing statements:
...
SQL> ALTER DATABASE DROP STANDBY LOGFILE GROUP <Group#>; 
and create new SRLs with
SQL> ALTER DATABASE ADD STANDBY LOGFILE GROUP <Group#> SIZE ;"

No, MOS analyst! Standby Redo Logs MUST be assigned to a specific thread. Thread 0 Standby Redo Logs are ignored!

And don’t forget, if you are using RAC you need to have Standby Redo Logs for EVERY thread accessible to EVERY instance, just like the Online Redo logs.

If you have a single instance, ensure you specify Thread 1, and ensure you have 1 more Standby Redo Log than you have Online Redo Logs.
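For example, on a 2-node RAC with 3 online redo log groups per thread, that means 4 standby redo logs per thread. A sketch only (the group numbers are illustrative, and I’m assuming OMF so no file names are needed):

alter database add standby logfile thread 1 group 11 size 200M;
alter database add standby logfile thread 1 group 12 size 200M;
alter database add standby logfile thread 1 group 13 size 200M;
alter database add standby logfile thread 1 group 14 size 200M;
alter database add standby logfile thread 2 group 15 size 200M;
alter database add standby logfile thread 2 group 16 size 200M;
alter database add standby logfile thread 2 group 17 size 200M;
alter database add standby logfile thread 2 group 18 size 200M;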

As a final note on Data Guard, from Oracle 12.1 you really should be using the Data Guard Broker to configure and manage Data Guard. It’s simple, it’s much better and safer than configuring Data Guard manually, and it has some nice validation and monitoring features too.
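For example, VALIDATE DATABASE in DGMGRL checks, among many other things, that you have sufficient standby redo logs on the right threads:

dgmgrl> validate database orcl12dr;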

Addendum: For clarity, not specifying a thread means that Oracle is supposed to adjust the thread automatically to match whichever thread needs it, using the Thread 0 logs as a pool, so you presumably need fewer standby redo logs defined in a RAC environment. It just doesn’t work very well (or at all), leading to the fast/slow/fast/slow lag symptoms. It’s also a pretty pointless saving, as redo logs are small relative to the database size. If you have a LOT of nodes in the cluster, meaning a lot of standby redo logs, and they aren’t all committing redo heavily (when they would all need all of their standby redo logs anyway), why do you have so many nodes? And validating the setup will report that you don’t have enough standby redo logs. It looks like a solution to something that isn’t a problem for anyone.

So this is an undocumented, problem-introducing enhancement in Oracle 12 that solves no practical real-world problem.

I’ll let you know if it is fixed in Oracle 19C.

Here’s some real-world output showing the apply lag with Thread 0 SRLs, and how it disappears once Thread 1 SRLs are explicitly created. Please note the units and dates. I can’t provide the alert log output in this blog to prove my annotations.

SQL> select * from v$standby_event_histogram order by LAST_TIME_UPDATED;

NAME                       TIME UNIT                  COUNT LAST_TIME_UPDATED        CON_ID
-------------------- ---------- ---------------- ---------- -------------------- ----------
apply lag                    43 seconds                   3 01/03/2019 17:22:38           0
apply lag                     9 minutes                  58 01/03/2019 17:30:54           0
apply lag                    11 minutes                  60 01/03/2019 17:32:55           0
apply lag                    12 minutes                  59 01/03/2019 17:33:54           0
apply lag                    13 minutes                  57 01/03/2019 17:34:55           0
apply lag                    14 minutes                  60 01/03/2019 17:35:55           0
apply lag                    15 minutes                  47 01/03/2019 17:36:42           0
apply lag                    10 minutes                  61 01/03/2019 17:36:44           0
-- First SRL Thread 1 created : 2019-01-03T17:39:43.890329+00:00
apply lag                    27 seconds                   9 01/03/2019 17:45:39           0
apply lag                    30 seconds                   5 01/03/2019 17:45:42           0
apply lag                    31 seconds                   7 01/03/2019 17:45:44           0
apply lag                    32 seconds                   6 01/03/2019 17:45:45           0
apply lag                    33 seconds                  10 01/03/2019 17:45:46           0
apply lag                    34 seconds                   8 01/03/2019 17:45:47           0
apply lag                    35 seconds                   9 01/03/2019 17:45:48           0
apply lag                    36 seconds                   7 01/03/2019 17:45:49           0
apply lag                    37 seconds                   5 01/03/2019 17:45:50           0
apply lag                    38 seconds                   8 01/03/2019 17:45:51           0
apply lag                    39 seconds                   7 01/03/2019 17:45:52           0
apply lag                    40 seconds                   3 01/03/2019 17:45:53           0
apply lag                    41 seconds                   3 01/03/2019 17:45:54           0
apply lag                    42 seconds                   4 01/03/2019 17:45:55           0
apply lag                    44 seconds                   4 01/03/2019 17:45:56           0
apply lag                    45 seconds                   7 01/03/2019 17:45:57           0
apply lag                    46 seconds                   5 01/03/2019 17:45:58           0
apply lag                    47 seconds                   7 01/03/2019 17:45:59           0
apply lag                    48 seconds                   4 01/03/2019 17:46:00           0
apply lag                    49 seconds                   8 01/03/2019 17:46:01           0
apply lag                    50 seconds                   5 01/03/2019 17:46:02           0
apply lag                    51 seconds                   6 01/03/2019 17:46:03           0
apply lag                    52 seconds                   8 01/03/2019 17:46:04           0
apply lag                    53 seconds                   7 01/03/2019 17:46:05           0
apply lag                    54 seconds                   6 01/03/2019 17:46:06           0
apply lag                    55 seconds                  10 01/03/2019 17:46:07           0
apply lag                    56 seconds                  11 01/03/2019 17:46:09           0
apply lag                    57 seconds                   8 01/03/2019 17:46:10           0
apply lag                    58 seconds                   6 01/03/2019 17:46:11           0
apply lag                    59 seconds                   7 01/03/2019 17:46:12           0
-- Final SRL Thread 1 created 2019-01-03T17:51:58.694308+00:00
apply lag                     8 minutes                 103 01/03/2019 17:52:55           0
apply lag                     7 minutes                 120 01/03/2019 17:52:56           0
apply lag                     6 minutes                 130 01/03/2019 17:52:58           0
apply lag                     5 minutes                 180 01/03/2019 17:53:00           0
apply lag                     4 minutes                 181 01/03/2019 17:53:03           0
apply lag                     3 minutes                 183 01/03/2019 17:53:06           0
apply lag                     2 minutes                 260 01/03/2019 17:53:11           0
-- All SRL Thread 0's deleted, then ALTER SYSTEM ARCHIVE LOG issued. 2019-01-03T18:19:33.757994+00:00
-- No more APPLY lag for 3 days (any lag reported below, of a few seconds, is due to other reasons)
apply lag                    23 seconds                  10 01/06/2019 18:14:59           0
apply lag                    29 seconds                   8 01/06/2019 18:15:03           0
apply lag                    26 seconds                  10 01/06/2019 18:15:09           0
apply lag                    28 seconds                  10 01/06/2019 18:15:11           0
apply lag                    24 seconds                  12 01/06/2019 18:15:12           0
apply lag                    25 seconds                   9 01/06/2019 18:15:14           0
apply lag                    16 seconds                  33 01/07/2019 11:02:37           0
apply lag                    20 seconds                  17 01/09/2019 11:01:37           0
apply lag                    21 seconds                  29 01/09/2019 11:01:40           0
apply lag                    22 seconds                  19 01/09/2019 11:01:42           0
apply lag                    14 seconds                  50 01/10/2019 00:08:19           0
apply lag                    15 seconds                  44 01/10/2019 00:08:25           0
apply lag                    17 seconds                  17 01/10/2019 00:08:27           0
apply lag                    18 seconds                  23 01/10/2019 00:08:28           0
apply lag                    19 seconds                  21 01/10/2019 00:08:30           0
apply lag                    13 seconds                  73 01/10/2019 00:30:51           0
apply lag                     9 seconds                 170 01/10/2019 00:33:11           0
apply lag                     6 seconds                 426 01/10/2019 01:34:29           0
apply lag                     4 seconds                1304 01/10/2019 11:00:42           0
apply lag                     7 seconds                 294 01/10/2019 11:00:54           0
apply lag                     8 seconds                 222 01/10/2019 11:00:55           0
apply lag                    10 seconds                 117 01/10/2019 11:00:56           0
apply lag                    11 seconds                 109 01/10/2019 11:00:57           0
apply lag                    12 seconds                  90 01/10/2019 11:00:58           0
apply lag                     5 seconds                 726 01/10/2019 11:01:38           0
apply lag                     3 seconds                2911 01/10/2019 11:12:46           0
apply lag                     2 seconds                7866 01/10/2019 11:56:28           0
apply lag                     1 seconds               58544 01/10/2019 11:59:15           0
apply lag                     0 seconds              758505 01/10/2019 12:00:07           0

SAN Migration and moving files with ASM

Here’s a quick post to help with LUN, SAN or File migration when using Oracle ASM.

Please note that this is just an example method, and is not a definitive guide.
You need to ensure your approach is appropriate for your environment.

Did you know that you can add and drop disks on a diskgroup in a single command?
This saves Oracle from doing 2 REBAL actions and speeds things up considerably, using far fewer resources.

sqlplus / as sysasm

SQL> ALTER DISKGROUP data
2   ADD  DISK 'ORCL:disk101','ORCL:disk102' 
3   DROP DISK 'disk042'     ,'disk043' 
4   REBALANCE POWER n;

Diskgroup altered.

You can monitor the REBAL operation:

select * from v$asm_operation;

GROUP_NUMBER OPERA PASS      STAT POWER ACTUAL SOFAR EST_WORK EST_RATE EST_MINUTES ERROR_CODE CON_ID
------------ ----- --------- ---- ----- ------ ----- -------- -------- ----------- ---------- ------
           6 REBAL REBALANCE  RUN     1      1    68    20397     3030           8                 0
           6 REBAL COMPAT    WAIT     1      1     0        0        0           0                 0
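If the rebalance is going too slowly (or is too aggressive), you can change the power on the fly; the value of 8 here is just an example:

alter diskgroup data rebalance power 8;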

If you are just migrating from one SAN to another, without needing to move files between diskgroups, this is a really easy way to achieve it.

From Oracle 12.1 onwards, if you are moving Datafiles from one Diskgroup to a new Diskgroup, that’s straightforward too:

export ORACLE_SID=ORCL
sqlplus  / as sysdba

alter database move datafile '+DATA_DG/ORCL/DATAFILE/system.101.902468275' to '+NEW_DG';
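If you have a lot of datafiles to move, you can generate the move statements; a quick sketch, so review the output before running it:

select 'alter database move datafile '''||name||''' to ''+NEW_DG'';' from v$datafile;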

Moving logfiles involves a drop and re-create, but it is still fully online.
Take care if using Data Guard that you do not try to drop logfiles before they are archived and applied to the standby:

alter database  add logfile member '+NEW_DG' to group 1;
alter database drop logfile member '+DATA_DG/ORCL/ONLINELOG/group_1.102.902049284';
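Note that the group whose member you are dropping must not be CURRENT or ACTIVE, so check first and issue a log switch if necessary:

select group#, status from v$log;
alter system switch logfile;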

Moving tempfiles can be tricky. The default temporary tablespace gets used very quickly after startup.

You need to add a tempfile and then get all processes to not be using the old tempfile before you can drop it [select * from gv$sort_usage]. Worst-case, this may involve a SHUTDOWN, then a STARTUP RESTRICT to stop processes connecting. Alternatively, create a new default TEMP tablespace and convince processes to use that one.

alter tablespace temp add  tempfile '+NEW_DG';
alter tablespace temp drop tempfile '+DATA_DG/ORCL/TEMPFILE/temp.204.992482742';
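To see which instances and sessions are still holding space in TEMP (a rough check, using the view mentioned above):

select inst_id, tablespace, count(*) from gv$sort_usage group by inst_id, tablespace;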

Moving controlfiles requires a stop and start of the database. There are no online options here. I tend to do the following:

srvctl stop database -d ORCL

sqlplus / as sysdba
startup nomount
show parameter control_files


rman target /
restore controlfile to '+NEW_DG' from '+DATA_DG/ORCL/CONTROLFILE/current.291.939274434';
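# note the new OMF name reported by the restore output - you will need it for CONTROL_FILES below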

sqlplus / as sysdba
alter system set control_files='+NEW_DG/ORCL/CONTROLFILE/current.992.346677889'
                              ,'+FRA_DG/ORCL/CONTROLFILE/current.???.?????????'
                               scope=spfile sid='*';
shutdown

srvctl start database -d ORCL

To move your spfile in ASM is a 2-stage process, which will automatically update your database configuration in Grid Infrastructure:

sqlplus / as sysdba
create  pfile='/tmp/ORCL.ora' from spfile;
create spfile='+NEW_DG'       from pfile='/tmp/ORCL.ora';
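You can confirm that the registration now points at the new spfile:

srvctl config database -d ORCL | grep -i spfile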

Moving a password file in ASM is straightforward too:
NOTE: you are not allowed to copy OMF files by their OMF name; you must copy them via their alias.

as grid:
asmcmd
cp +DATA_DG/ORCL/orapworcl +NEW_DG/ORCL/orapworcl
exit

as oracle:
srvctl modify database -d ORCL -pwfile +NEW_DG/ORCL/orapworcl
srvctl config database -d ORCL | grep Password

If you are using Block Change Tracking, you may need to move that file too:

alter database disable block change tracking;
alter database enable block change tracking using file '+NEW_DG';
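Note that disabling and re-enabling block change tracking resets the tracking information, so the next incremental backup will have to scan all blocks. You can confirm the new location afterwards:

select filename, status from v$block_change_tracking;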

Finally, if you’re dropping disk groups, don’t forget to modify the disk group dependencies in Grid Infra so your DB isn’t dependent upon groups you have now removed:

srvctl config database -d ORCL | grep Group
srvctl modify database -d ORCL -diskgroup 'NEW_DG,FRA_DG,other_DG'

I hope you found this quick guide useful.

Goldengate Log Rotation

GoldenGate 12 has some excellent commands to keep your log files in check, plus one glaring omission (scheduled for a future enhancement).

Each extract, datapump and replicat will be writing to report (.rpt) and discard (.dsc) files in the dirrpt directory. (If you aren’t specifying a discard file, you should; they are very useful for troubleshooting.)
If your system is up for a long time, these files are going to get large. Oracle has realised this and provides some lovely in-built log rotation commands. To keep my parameter (.prm) files nice, neat and consistent, I use include files, and this is a perfect case for a standard include.

report.prm:

-- Standard include commands for ALL extracts and replicats to ensure they are aligned
-- Write the days stats out to the file at the end of every day.
-- Roll the file over every week
-- Report just how much throughput we have every 15 minutes
-- History: 14.04.2015 N Chandler 
-- place the command INCLUDE dirprm/report.prm in your parameter files

STATOPTIONS REPORTDETAIL, RESETREPORTSTATS
REPORT AT 23:59
REPORTROLLOVER AT 00:01 ON MONDAY
REPORTCOUNT EVERY 15 MINUTES, RATE
-- or if you would rather by volume...
--REPORTCOUNT EVERY 10000000 RECORDS, RATE
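Each extract or replicat parameter file then just pulls the standard include in, something like this (the replicat name is illustrative):

REPLICAT rorcl01
INCLUDE dirprm/report.prm
-- ...the rest of the parameters follow as normal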

I think it’s worth pointing out here what the 2 bracketed throughput numbers output by the REPORTCOUNT command mean, as you’ll struggle to find this in the documentation:

Rate  = the total number of records processed since startup, divided by the total time since startup of the extract/replicat (i.e. records per second over the life of the process)
Delta = the number of records processed since the last report, divided by the time since the last report (in this case, 15 minutes)
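For example, if a replicat has processed 9,000,000 records in the 10,000 seconds since startup, Rate = 900 records/sec; if it processed 450,000 records in the last 15-minute interval (900 seconds), Delta = 500 records/sec.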


There is 1 notable growing file which you cannot rotate using GoldenGate commands: ggserr.log

This is a significant oversight by Oracle and will be rectified in a future release, but as of 12.1 you have to sort this out manually. You have 2 main options:

1. Stop the manager, rename the file, and restart the manager.
2. Copy the file to a new file, then empty the in-place file by catting /dev/null into it. (I’m sure there’s a Windows equivalent of this, but I mainly work on Unix.)

DO NOT simply delete the file while the manager is running: all future error output will drop into a “black hole” until the manager is restarted.

Option 2 tends to be preferable, so here’s part of a bash script I use to perform this action:

#!/bin/bash
# rotate_ggserr_log.sh - copies the logfile to one side with a date suffix and empties the current file,
# but leaves it in place so we can continue to write to it with the manager.
# Neil Chandler 14.04.2015 created
#
today=`date +%Y%m%d`

# Check to see if we have already rolled over today
if [ -e /u99/gg/bin/ggserr.log.${today} ]
then
 echo "File /u99/gg/bin/ggserr.log.${today} exists already. Stopping."
else
 # copy the log file preserving attributes
 /bin/cp -pnv /u99/gg/bin/ggserr.log /u99/gg/bin/ggserr.log.${today}

 # See if there is a difference - did you copy it successfully?
 diff /u99/gg/bin/ggserr.log /u99/gg/bin/ggserr.log.${today}
 RC=$?

 # If there is no difference, wipe the ggserr.log file out
 # otherwise stop!
 if [ ${RC} -eq 0 ]
 then
  echo "clear the file /u99/gg/bin/ggserr.log"
  cat /dev/null > /u99/gg/bin/ggserr.log
  exit ${RC}
 else
  echo "Error - cannot clear file /u99/gg/bin/ggserr.log as it's not the same as the copied version. Stopping."
  exit ${RC}
 fi
fi
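You can then schedule it from cron just before midnight, something like this (paths illustrative):

59 23 * * * /u99/gg/bin/rotate_ggserr_log.sh >> /u99/gg/bin/rotate_ggserr_log.out 2>&1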