Data Guard Unexpected Lag

When configuring a physical standby database for Oracle using Data Guard, you need to create Standby Redo logs to allow the redo to be applied in (near) real time to the Standby. Without standby redo logs, Oracle will wait for an entire Archive Log to be filled and copied across to the standby before it will apply changes, which could take quite a while.

Which leads me to the problem I encountered a while ago and, being forgetful, still encounter today when creating standby redo logs in Oracle 12C and 18C.

I have a Primary database with 3 Redo logs, and I have created a Physical Standby database with 3 Redo logs and then 4 Standby Redo logs created as follows:

alter database add standby logfile group 4 '/u01/.../redo_standby04.dbf' size 200M;
alter database add standby logfile group 5 '/u01/.../redo_standby05.dbf' size 200M;
alter database add standby logfile group 6 '/u01/.../redo_standby06.dbf' size 200M;
alter database add standby logfile group 7 '/u01/.../redo_standby07.dbf' size 200M;

And this is what they look like:

SQL> select group#,thread#,sequence#,bytes/1024/1024 MB,status from v$log order by group#;
GROUP#     THREAD#    SEQUENCE#  MB         STATUS
---------- ---------- ---------- ---------- ----------------
1                   1          0        200 UNUSED
2                   1          0        200 UNUSED
3                   1          0        200 UNUSED

SQL> select group#,thread#,sequence#,bytes/1024/1024 MB,status from v$standby_log order by group#;
GROUP#     THREAD#    SEQUENCE#  MB         STATUS
---------- ---------- ---------- ---------- ----------
4                   1        111        200 ACTIVE
5                   0          0        200 UNASSIGNED
6                   0          0        200 UNASSIGNED
7                   0          0        200 UNASSIGNED

SQL> select * from v$logfile order by group#;
GROUP#     STATUS  TYPE    MEMBER 
---------- ------- ------- --------------------------------------------------------------------------------
         1         ONLINE  /u01/.../redo01.dbf
         2         ONLINE  /u01/.../redo02.dbf
         3         ONLINE  /u01/.../redo03.dbf
         4         STANDBY /u01/.../redo_standby01.dbf
         5         STANDBY /u01/.../redo_standby02.dbf
         6         STANDBY /u01/.../redo_standby03.dbf
         7         STANDBY /u01/.../redo_standby04.dbf

So what’s the problem?

Well, Data Guard works as expected for a while – applying changes immediately – then it lags for ages and you get an OEM alert that it’s lagging. Then it works fine for a while again, and then there’s another lag.

The problem is that when I created the Standby Redo log files, I did not specify the Thread. But why do I need to specify the thread? This is not RAC. It does not have multiple threads for redo – one for each node. We have Thread 1 and that’s it.

When I used to do this in Oracle 11.2, the Standby Redo logs got Thread 1 by default. Always Thread 1.

From Oracle 12.1, and also in 12.2.0.1, 18C (12.2.0.2) and probably 19C (12.2.0.3), Oracle creates the Standby Redo logs with Thread 0 by default. Sometimes it creates them on Thread 1, although this seems to happen only in 12.1 (let me know if you know better). I don’t know why.

A Standby Redo log MUST belong to the SAME Thread as the Online Redo logs for it to be used. Having (in this case) 1 x Standby Redo on Thread 1 and 3 x Standby Redo on Thread 0, I only have 1 x Standby Redo usable for my Online Redo.
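
A quick way to see how the standby redo logs are spread across threads is to group the same view queried earlier. This is only a sketch using V$STANDBY_LOG on the standby; the column aliases are my own.

```sql
-- Run on the standby: count standby redo log groups per thread.
-- Any row with THREAD# = 0 is a group that was created without a thread.
select thread#,
       count(*)                as srl_groups,
       min(bytes) / 1024 / 1024 as min_mb
from   v$standby_log
group  by thread#
order  by thread#;
```

With the configuration shown above, this should return one group on Thread 1 and three on Thread 0.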

The solution is to stop the Apply, DROP the Standby Redo logs on Thread 0, create new Standby Redo logs explicitly stating Thread 1, and restart the Apply.

dgmgrl> edit database orcl12dr set state='APPLY-OFF';

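sql> -- repeat for each Thread 0 group (5, 6 and 7 in the listing above)
sql> alter database drop standby logfile group 5;
sql> -- the drop leaves a non-OMF file on disk, hence REUSE when re-adding the same name
sql> alter database add  standby logfile thread 1 group 5 '/u01/.../redo_standby05.dbf' size 200M reuse;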

dgmgrl> edit database orcl12dr set state='APPLY-ON';

And all is good with the world.

It is worth saying that MOS note 1595034.1 gets the corrective advice EXACTLY WRONG, with the incorrect snippet reproduced here:

"The standby redo logs can't be used, since they are assigned to thread 0. 
Standby RedoLogs should not be assigned to a specific Thread. 
Please use fallowing statements:
...
SQL> ALTER DATABASE DROP STANDBY LOGFILE GROUP <Group#>; 
and create new SRLs with
SQL> ALTER DATABASE ADD STANDBY LOGFILE GROUP <Group#> SIZE ;"

No, MOS analyst! Standby Redo Logs MUST be assigned to a specific thread. Thread 0 Standby Logs are ignored!

And don’t forget, if you are using RAC you need to have Standby Redo Logs for EVERY thread accessible to EVERY instance, just like the Online Redo logs.

If you have a single instance, ensure you specify Thread 1, and ensure you have 1 more Standby Redo log than you have Online Redo logs.
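
The "one more than the online redo logs" rule can be checked per thread with a query along these lines. It is a sketch, not an official validation: the aliases and the OK / ADD SRLs verdict text are my own, and it compares the V$LOG and V$STANDBY_LOG of whichever database you run it on.

```sql
-- For each redo thread, compare online redo log groups with
-- standby redo log groups assigned to that same thread.
-- Thread 0 standby logs deliberately do not count towards any thread.
select t.thread#,
       t.orl_groups,
       nvl(s.srl_groups, 0) as srl_groups,
       case
         when nvl(s.srl_groups, 0) >= t.orl_groups + 1 then 'OK'
         else 'ADD SRLs'
       end as verdict
from  (select thread#, count(*) as orl_groups
       from   v$log
       group  by thread#) t
left join
      (select thread#, count(*) as srl_groups
       from   v$standby_log
       group  by thread#) s
on     s.thread# = t.thread#
order  by t.thread#;
```

On RAC this returns one row per thread, so a thread with no standby redo logs of its own stands out straight away.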

As a final note on Data Guard, from Oracle 12.1 you really should be using the Data Guard Broker to configure and manage Data Guard. It’s simple and it’s much better and safer than configuring Data Guard manually and has some nice validation and monitoring features too.

Addendum: For clarity, not specifying a Thread means that Oracle is supposed to assign the standby redo log to whichever thread needs it, treating Thread 0 as a pool, so you presumably need fewer standby redo logs defined in a RAC environment. But it just doesn’t work very well (or at all), leading to the fast/slow/fast/slow lag symptoms. It’s also a pretty pointless saving, as redo logs are small relative to the database size. If you have a LOT of nodes in the cluster, meaning a lot of standby redo logs, and they aren’t all generating redo heavily (when they would all need all of their standby redo logs anyway), why do you have so many nodes? And validating the setup will report that you don’t have enough standby redo logs. It looks like a solution to something that isn’t a problem for anyone.

So this is an undocumented problem-introducing enhancement in Oracle 12 solving no practical real-world problem.

I’ll let you know if it is fixed in Oracle 19C.

Here’s some real-world output showing the apply lag with Thread 0 SRLs, and how it disappears when Thread 1 SRLs are explicitly created. Please note the units and dates. I can’t provide the alert log output in this blog to prove my annotations.

SQL> select * from v$standby_event_histogram order by LAST_TIME_UPDATED;

NAME                       TIME UNIT                  COUNT LAST_TIME_UPDATED        CON_ID
-------------------- ---------- ---------------- ---------- -------------------- ----------
apply lag                    43 seconds                   3 01/03/2019 17:22:38           0
apply lag                     9 minutes                  58 01/03/2019 17:30:54           0
apply lag                    11 minutes                  60 01/03/2019 17:32:55           0
apply lag                    12 minutes                  59 01/03/2019 17:33:54           0
apply lag                    13 minutes                  57 01/03/2019 17:34:55           0
apply lag                    14 minutes                  60 01/03/2019 17:35:55           0
apply lag                    15 minutes                  47 01/03/2019 17:36:42           0
apply lag                    10 minutes                  61 01/03/2019 17:36:44           0
-- First SRL Thread 1 created : 2019-01-03T17:39:43.890329+00:00
apply lag                    27 seconds                   9 01/03/2019 17:45:39           0
apply lag                    30 seconds                   5 01/03/2019 17:45:42           0
apply lag                    31 seconds                   7 01/03/2019 17:45:44           0
apply lag                    32 seconds                   6 01/03/2019 17:45:45           0
apply lag                    33 seconds                  10 01/03/2019 17:45:46           0
apply lag                    34 seconds                   8 01/03/2019 17:45:47           0
apply lag                    35 seconds                   9 01/03/2019 17:45:48           0
apply lag                    36 seconds                   7 01/03/2019 17:45:49           0
apply lag                    37 seconds                   5 01/03/2019 17:45:50           0
apply lag                    38 seconds                   8 01/03/2019 17:45:51           0
apply lag                    39 seconds                   7 01/03/2019 17:45:52           0
apply lag                    40 seconds                   3 01/03/2019 17:45:53           0
apply lag                    41 seconds                   3 01/03/2019 17:45:54           0
apply lag                    42 seconds                   4 01/03/2019 17:45:55           0
apply lag                    44 seconds                   4 01/03/2019 17:45:56           0
apply lag                    45 seconds                   7 01/03/2019 17:45:57           0
apply lag                    46 seconds                   5 01/03/2019 17:45:58           0
apply lag                    47 seconds                   7 01/03/2019 17:45:59           0
apply lag                    48 seconds                   4 01/03/2019 17:46:00           0
apply lag                    49 seconds                   8 01/03/2019 17:46:01           0
apply lag                    50 seconds                   5 01/03/2019 17:46:02           0
apply lag                    51 seconds                   6 01/03/2019 17:46:03           0
apply lag                    52 seconds                   8 01/03/2019 17:46:04           0
apply lag                    53 seconds                   7 01/03/2019 17:46:05           0
apply lag                    54 seconds                   6 01/03/2019 17:46:06           0
apply lag                    55 seconds                  10 01/03/2019 17:46:07           0
apply lag                    56 seconds                  11 01/03/2019 17:46:09           0
apply lag                    57 seconds                   8 01/03/2019 17:46:10           0
apply lag                    58 seconds                   6 01/03/2019 17:46:11           0
apply lag                    59 seconds                   7 01/03/2019 17:46:12           0
-- Final SRL Thread 1 created 2019-01-03T17:51:58.694308+00:00
apply lag                     8 minutes                 103 01/03/2019 17:52:55           0
apply lag                     7 minutes                 120 01/03/2019 17:52:56           0
apply lag                     6 minutes                 130 01/03/2019 17:52:58           0
apply lag                     5 minutes                 180 01/03/2019 17:53:00           0
apply lag                     4 minutes                 181 01/03/2019 17:53:03           0
apply lag                     3 minutes                 183 01/03/2019 17:53:06           0
apply lag                     2 minutes                 260 01/03/2019 17:53:11           0
-- All SRL Thread 0's deleted, then ALTER SYSTEM ARCHIVE LOG issued. 2019-01-03T18:19:33.757994+00:00
-- No more APPLY lag for 3 days (any lag reported below -of a few seconds- is due to other reasons)
apply lag                    23 seconds                  10 01/06/2019 18:14:59           0
apply lag                    29 seconds                   8 01/06/2019 18:15:03           0
apply lag                    26 seconds                  10 01/06/2019 18:15:09           0
apply lag                    28 seconds                  10 01/06/2019 18:15:11           0
apply lag                    24 seconds                  12 01/06/2019 18:15:12           0
apply lag                    25 seconds                   9 01/06/2019 18:15:14           0
apply lag                    16 seconds                  33 01/07/2019 11:02:37           0
apply lag                    20 seconds                  17 01/09/2019 11:01:37           0
apply lag                    21 seconds                  29 01/09/2019 11:01:40           0
apply lag                    22 seconds                  19 01/09/2019 11:01:42           0
apply lag                    14 seconds                  50 01/10/2019 00:08:19           0
apply lag                    15 seconds                  44 01/10/2019 00:08:25           0
apply lag                    17 seconds                  17 01/10/2019 00:08:27           0
apply lag                    18 seconds                  23 01/10/2019 00:08:28           0
apply lag                    19 seconds                  21 01/10/2019 00:08:30           0
apply lag                    13 seconds                  73 01/10/2019 00:30:51           0
apply lag                     9 seconds                 170 01/10/2019 00:33:11           0
apply lag                     6 seconds                 426 01/10/2019 01:34:29           0
apply lag                     4 seconds                1304 01/10/2019 11:00:42           0
apply lag                     7 seconds                 294 01/10/2019 11:00:54           0
apply lag                     8 seconds                 222 01/10/2019 11:00:55           0
apply lag                    10 seconds                 117 01/10/2019 11:00:56           0
apply lag                    11 seconds                 109 01/10/2019 11:00:57           0
apply lag                    12 seconds                  90 01/10/2019 11:00:58           0
apply lag                     5 seconds                 726 01/10/2019 11:01:38           0
apply lag                     3 seconds                2911 01/10/2019 11:12:46           0
apply lag                     2 seconds                7866 01/10/2019 11:56:28           0
apply lag                     1 seconds               58544 01/10/2019 11:59:15           0
apply lag                     0 seconds              758505 01/10/2019 12:00:07           0
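
The histogram above is cumulative. For a point-in-time view of the lag, V$DATAGUARD_STATS on the standby is the usual place to look. A minimal sketch:

```sql
-- Run on the standby. VALUE is returned as a day-to-second interval string;
-- TIME_COMPUTED shows when the figure was last calculated.
select name, value, unit, time_computed
from   v$dataguard_stats
where  name in ('transport lag', 'apply lag');
```

If the apply lag keeps climbing between log switches on the primary and then drops back, that matches the fast/slow pattern described in this post.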

Running RAC in a Virtual Environment

I was in a discussion at an Oracle Meetup this week, and the subject of RAC in a virtualized environment – specifically Oracle Virtual Machine (OVM) – came up.
Here’s a couple of points which were discussed.

pingtarget

There was a lack of awareness of a common problem, which has a solution built into Oracle 12.1.0.2 Grid Infrastructure and later. In a virtualized environment, the network components are also virtualized. Sometimes, network failures on the host may not be raised up to the guests. As a result, the guest O/S commands can fail to detect the network failures and the Virtual NIC remains up. Grid Infrastructure (GI) will not perform a VIP fail-over as it can’t see the failure despite the network being unavailable.

To resolve this, Oracle has added an option of a “pingtarget” for each public network defined in GI. This will perform a keep-alive to an external device, usually something like the default gateway. This is just like the heartbeat on the cluster interconnect.

Before

srvctl config network

Network 1 exists

Subnet IPv4: 192.168.0.160/255.255.255.224/eth1, static
Subnet IPv6:
Ping Targets:
Network is enabled
Network is individually enabled on nodes:
Network is individually disabled on nodes:

The default gateway makes a good ping target. For this IP and subnet, it’s 192.168.0.161

srvctl modify network -k 1 -pingtarget 192.168.0.161

After

srvctl config network

Network 1 exists

Subnet IPv4: 192.168.0.160/255.255.255.224/eth1, static
Subnet IPv6:
Ping Targets: 192.168.0.161
Network is enabled
Network is individually enabled on nodes:
Network is individually disabled on nodes:

All safe!

Server Pools

A second item we discussed was the Server Pools in OVM. Each RAC guest should be on a different host, otherwise you have not eliminated that as a Single Point Of Failure. A second less obvious SPOF is the Server Pool disk.

A Server Pool is a filesystem LUN (and IP address prior to release 3.4) used to group a logical collection of servers with similar CPU models, within which we can create and migrate VM guests. For a RAC installation, each RAC node should be within a different server pool, as well as on different physical hardware.

[Image: ovs]

In this image, RAC nodes within the same cluster should be created within each server pool. This configuration can safely support a 2 node cluster despite having 4 servers, with one node created in “OVS-Pool-2” on server “ovs02”. The second node should be in “OVS-Pool-1” and can be on “ovs01”, “ovs11” or “ovs12”.

It is possible to live migrate guests between these 3 servers.

UKOUG Technology Day

We have organized a UKOUG Technology Summit in Leeds on 16th May.

It’s really a small Conference!

This is available to UKOUG Members and will cost 1 SIG Pass – even Bronze memberships come with 1 SIG pass, so all members could attend at no cost.
Non-members can pay to attend, but you should take out a Bronze Membership instead as it’s about the same price!

Come along, network with your peers, and listen to some amazing talks about all aspects of the Oracle technology product set; the Database, RAC, Engineered Systems, Development, Apex, DevOps, Solaris 11.4!, and more. Learn how Oracle interacts and works with Kubernetes, Docker, Apache Kafka, Apache Impala, Chatbots, and  just about everything else! 30 amazing talks!

Plus a keynote address from Oracle Executive John Abel and a social afterwards too!

All that, just 2 hours from London on the train!

You can find the Agenda here 

UKOUG 2-Stream London SIG

We now have the agenda online for the UKOUG 2-stream London SIG, due to take place on 23rd February. There will be an RDBMS Stream and a RAC, Cloud, Infrastructure and Availability Stream.

Mark Rittman will be giving us a keynote talk, with an emphasis on kettles or eating dinner in the dark or the Hadoop cluster in his garage or, well, probably something IoT-related.

There are some really cool other speakers too, from Oracle, Leading Consultancies and the odd Independent too (like me).

The full Agenda and Registration Info can be seen here:
https://www.ukoug.org/events/ukoug-database-and-rac-cia-joint-sig-23022017/

It would be lovely to see you there, and maybe for a beer afterwards too.

Call for Papers – UKOUG 2017 SIGs

All,

The time has come for me to plan the upcoming 2017 UKOUG Special Interest Groups.

I am chairman of the RAC, Cloud, Infrastructure and Availability (RAC-CIA) SIG and I’m after presentations for the 3 joint SIGs we will be putting on with the RDBMS SIG, plus the stand-alone SIG which we will be having in the autumn.

A SIG is a single day one-or-two stream conference which we take around the UK to make it as accessible as possible for UKOUG members and non-members alike (NOTE: there is a charge for non-members).

This year we are looking at running:

Thursday 23rd February – RDBMS + RAC, London 2 streams
Thursday 27th April – TBC: Northern Day – RDBMS + RAC + ? , Manchester (probably)
Thursday 15th June – 12.2 Themed event – TVP Reading – 2 stream event RDBMS + RAC
Wednesday 27th September – RAC-CIA SIG, London 1 stream

If you are interested in speaking, you can submit an abstract here:

http://ukoug.formstack.com/forms/submit_an_abstract

or you can email me directly : neil@chandler.uk.com

Be quick! The next event is in 10 weeks’ time…

Oracle Audit Control part 3 – OS files

Just realized that I wrote a follow-up to Part 2 two years ago and never posted it! So here’s the stripped-out key facts to round off auditing control a bit more:

Deleting O/S audit files is, like, totally necessary. In Oracle 12C (12.1), they are created by default as AUDIT_SYS_OPERATIONS defaults to TRUE now, so you are going to be filling up your AUDIT_FILE_DEST whether you realized it or not.

[NOTE: In theory, this could end up being a performance problem as well as a space problem if you have many millions of files in the AUDIT_FILE_DEST. In Oracle 11G, when allocating an AUDIT file, it would check to see if a file for that SPID already existed and add an incremental number. The act of calling vx_dirbread and vx_dirscan can get terribly slow if there are lots of files to wade through. This mechanism is different in 12C but I have not traced the kernel calls to see if the vx_ calls have gone away.]

To clean up the O/S audit files, we use the same commands as we did in Part 2, but changed to use an AUDIT_TRAIL_TYPE => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS

You also need to be aware of the RAC_INSTANCE_NUMBER. For database-level audit, it’s irrelevant. For single instance, it is also irrelevant, but with RAC, this parameter tells you which node to clean up as they are all individual installs with their own audit files (assuming a Grid Infrastructure install per node.)

Commands:

-- Show the BEFORE sizes
host ssh server01 du -m -s /u01/app/oracle/admin/ORCL/a*
host ssh server02 du -m -s /u01/app/oracle/admin/ORCL/a*
begin

  -- Loop around every instance in the RAC cluster and run the command
  for i in (select inst_id from gv$instance)
  loop

    -- cleanup AUDIT_FILE_DEST (operating system audit files)
    -- set cleanup for 60 days
    DBMS_AUDIT_MGMT.SET_LAST_ARCHIVE_TIMESTAMP(
      AUDIT_TRAIL_TYPE    => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,
      LAST_ARCHIVE_TIME   => systimestamp-60,
      RAC_INSTANCE_NUMBER => i.inst_id );

    -- And cleanup based upon that date
    DBMS_AUDIT_MGMT.clean_audit_trail(
      audit_trail_type        => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,
      use_last_arch_timestamp => TRUE);

  end loop;

end;
/

-- And check the sizes AFTER
host ssh server01 du -m -s /u01/app/oracle/admin/ORCL/a*
host ssh server02 du -m -s /u01/app/oracle/admin/ORCL/a*
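
To confirm which timestamp was recorded for each instance, the audit management dictionary view can be queried after the block has run. A small sketch:

```sql
-- One row per audit trail type and instance for which
-- SET_LAST_ARCHIVE_TIMESTAMP has been called.
select audit_trail, rac_instance, last_archive_ts
from   dba_audit_mgmt_last_arch_ts
order  by audit_trail, rac_instance;
```

On a two node cluster you would expect two OS audit trail rows, one per instance, each showing the timestamp set in the loop.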

Output:

15:58:56 SYS @ ORCL1 > host ssh server01 du -m -s /u01/app/oracle/admin/ORCL/a*
776 /u01/app/oracle/admin/ORCL/adump
15:58:56 SYS @ ORCL1 > host ssh server02 du -m -s /u01/app/oracle/admin/ORCL/a*
694 /u01/app/oracle/admin/ORCL/adump
15:58:56 SYS @ ORCL1 >
15:58:56 SYS @ ORCL1 >
15:58:56 SYS @ ORCL1 > begin
15:58:56   2
15:58:56   3   for i in (select inst_id from gv$instance)
15:58:56   4   loop
15:58:56   5
15:58:56   6      -- cleanup AUDIT_FILE_DEST (operating system audit files)
15:58:56   7      -- set cleanup for 60 days
15:58:56   8      DBMS_AUDIT_MGMT.SET_LAST_ARCHIVE_TIMESTAMP(
15:58:56   9      AUDIT_TRAIL_TYPE => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,
15:58:56  10      LAST_ARCHIVE_TIME => systimestamp-50,
15:58:56  11      RAC_INSTANCE_NUMBER => i.inst_id );
15:58:56  12
15:58:56  13      -- And cleanup based upon that date
15:58:56  14      DBMS_AUDIT_MGMT.clean_audit_trail(
15:58:56  15        audit_trail_type        => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,
15:58:56  16        use_last_arch_timestamp => TRUE);
15:58:56  17
15:58:56  18   end loop;
15:58:56  19
15:58:56  20  end;
15:58:56  21  /
PL/SQL procedure successfully completed.

15:58:56 SYS @ ORCL1 >
15:58:56 SYS @ ORCL1 >
15:58:56 SYS @ ORCL1 > host ssh server01 du -m -s /u01/app/oracle/admin/ORCL/a*
354 /u01/app/oracle/admin/ORCL/adump
15:58:57 SYS @ ORCL1 > host ssh server02 du -m -s /u01/app/oracle/admin/ORCL/a*
364 /u01/app/oracle/admin/ORCL/adump
15:58:57 SYS @ ORCL1 >
 
And just one final bit. If you change the default

AUDIT_TRAIL=DB to
AUDIT_TRAIL=DB, EXTENDED

the audit writes the first 2,000 characters of the SQL command to DBA_AUDIT_TRAIL.SQL_TEXT, so you get the text of the SQL command that was audited, rather than just the action. Very handy!
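
As a sketch of making that change and then reading the result: AUDIT_TRAIL is a static parameter, so it has to go into the spfile and needs an instance restart. The one day window in the query is just an example value, and this assumes traditional (not pure unified) auditing.

```sql
-- Static parameter: set in the spfile for all instances, then restart.
alter system set audit_trail=db,extended scope=spfile sid='*';

-- After the restart, the statement text is recorded alongside the action.
select username, action_name, sql_text
from   dba_audit_trail
where  timestamp > sysdate - 1
order  by timestamp;
```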

UKOUG RDBMS and RAC-CIA Special Interest Groups

On Thursday 21st April, there is a dual UKOUG Database and RAC, Cloud, Infrastructure and Availability special interest group.

For the first time, this event is being held in the fabulous Northern city of Manchester!

There are a dozen interesting, career-assisting, educational talks from end users, Oracle employees and a number of well known Oracle ACEs at all levels, including Carl Dudley, Jonathan Lewis, Phil Brown and myself.

I will be talking about how to troubleshoot GoldenGate, showing optimal configurations to assist with problem determination and a bit of staring at Hex dumps for the brave.

There are only a few places left for this popular dual-stream event. Click Here for more details about the talks and speakers, and details for registration.

See you there!
