Unexpected Performance Issue with Unified Audit and OEM

Oracle 19c. January 2023.

EM Event: Critical: database-server.domain - Disk Device sda is 98.01% busy.

A customer was experiencing excessive I/O against the operating system disk (sda), which indicated problems with /u01 or /u03.

There was nothing obviously writing a lot of data to any “sda”-mounted filesystem (and no swapping), so process tracing was initiated to review I/O per process (pidstat). A process was identified and linked to a database session via GV$PROCESS and GV$SESSION, and this query against the audit trail emerged as the primary culprit.

(warning: pidstat can be very resource hungry – please use with caution!)
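For reference, a minimal sketch of that tracing approach, assuming a Linux host with the sysstat package installed (the sampling interval and the SPID literal are illustrative placeholders, not values from this incident):

# Report per-process disk I/O every 5 seconds, 6 samples;
# look for the process with the highest kB_rd/s or kB_wr/s.
pidstat -d 5 6

-- Map the offending OS process id (SPID) back to a database
-- session and its current SQL_ID ('12345' is a placeholder).
SELECT s.inst_id, s.sid, s.serial#, s.username, s.sql_id
  FROM gv$process p
  JOIN gv$session s
    ON s.paddr   = p.addr
   AND s.inst_id = p.inst_id
 WHERE p.spid = '12345';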

Bad SQL - sql_id=3wybn83gkuc4k

SELECT TO_CHAR(current_timestamp AT TIME ZONE 'GMT','YYYY-MM-DD HH24:MI:SS TZD') AS curr_timestamp
      ,SUM(failed_count) AS failed_count
      ,TO_CHAR(MIN(first_occur_time),'yyyy-mm-dd hh24:mi:ss') AS first_occur_time
      ,TO_CHAR(MAX(last_occur_time),'yyyy-mm-dd hh24:mi:ss') AS last_occur_time
FROM
(
SELECT COUNT(db_user) AS failed_count
      ,MIN(extended_timestamp) AS first_occur_time
      ,MAX(extended_timestamp) AS last_occur_time
  FROM sys.dba_common_audit_trail
 WHERE action BETWEEN 100 AND 102
   AND returncode != 0
   AND statement_type = 'LOGON'
   AND extended_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
UNION
SELECT COUNT(dbusername) AS failed_count
      ,MIN(event_timestamp) AS first_occur_time
      ,MAX(event_timestamp) AS last_occur_time
  FROM unified_audit_trail
 WHERE action_name = 'LOGON'
   AND return_code <> 0
   AND event_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
)

The problem presented itself a couple of months after the implementation of Unified Audit (with a long retention period).
The unified audit table AUDSYS.AUD$UNIFIED is partitioned by EVENT_TIMESTAMP, which is one of the predicates, and the code is only looking at the last 30 minutes of the data, so partition pruning should make the table access itself cheap.
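You can confirm the time-based partitioning in the data dictionary; a quick sketch (partition names and boundaries will vary with your interval and retention settings):

SELECT partition_name, high_value
  FROM dba_tab_partitions
 WHERE table_owner = 'AUDSYS'
   AND table_name  = 'AUD$UNIFIED'
 ORDER BY partition_position;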

BUT, we aren’t looking just at the partitioned audit table. If the database cannot write to the audit table (e.g. if it is a physical standby), there is an overflow location of $ORACLE_BASE/audit/$instance_name. This contains “.bin” files which are parsed and read by the view UNIFIED_AUDIT_TRAIL and combined with the rows in AUDSYS.AUD$UNIFIED.

As the “.bin” files cannot be indexed, every file for each instance is translated and read every time the code is invoked.
Which is every 30 minutes.

pwd
/u01/app/oracle/audit/<instance_name>

ls -1 
ora_audit_1127534399_3323516267.bin
ora_audit_1127554851_1290166488.bin
ora_audit_1127556228_1457226874.bin
ora_audit_1127570671_3180208504.bin
etc... 

many many files, each up to 10MB in size
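To gauge how big that backlog is (same directory as above):

ls -1 *.bin | wc -l     # how many spillover files there are
du -sh .                # how much data gets parsed every 30 minutes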

But where is this SQL coming from?

There is a metric within Oracle Enterprise Manager (OEM) called “Failed Login Count”. This runs every 30 minutes. Coincidence? Nope.

How does this metric determine the number of failed logins (not) connecting to the database every 30 minutes? It queries your audit trail. To do this, it uses a pre-processor Perl program to determine what sort of auditing you are doing (Unified/OS/DB/XML/etc.) and runs the relevant query against the database to pull the audit records. Remember: unless you have explicitly disabled it, if you are on Oracle 12.1 or above you are auditing failed logons by default. Handy.
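You can check which unified audit policies are enabled for yourself; on a default 12.1-or-later database the list should include ORA_LOGON_FAILURES, the policy that captures failed logons (SELECT * is used here because the column names vary slightly between releases):

SELECT *
  FROM audit_unified_enabled_policies;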

The pre-processor for this code is called “failLogin.pl” – I discovered this in MOS note 1265699.1 (which covers a related problem from Oracle 12.1, when Unified Audit didn’t work properly and had significant performance issues).

You can find failLogin.pl under $AGENT_HOME/*/scripts/ (I’m being deliberately vague here as it seems to have moved around in different versions!)

$ cd $AGENT_HOME (you can find this by looking at the background process with: ps -ef | grep agent )
$ find . -name failLogin.pl -print

Fixing This

So, what’s the solution to this excessive disk usage, processing all of those .bin files every 30 minutes?

  1. Switch off the metric in OEM (but this is a really useful metric which can highlight hacking attempts, password problems, and deployment failures)
  2. Create a nice new segregated and very very fast mount point for the .bin files
  3. Keep the “.bin” audit directory really tidy. This means losing audit records of course, which you may not want.
  4. Load the “.bin” records into the database using DBMS_AUDIT_MGMT.LOAD_UNIFIED_AUDIT_FILES, and remove them – see the sketch after the amended SQL below (this may not be possible if the database is a Physical Standby open for READ)
  5. Copy failLogin.pl to a safe place, and amend the SQL to read only the table and not the “.bin” files. We can do this by accessing the table directly, rather than via the view UNIFIED_AUDIT_TRAIL (see the amended SQL below)
  6. I’m sure you can think of something else
SELECT TO_CHAR(current_timestamp AT TIME ZONE 'GMT','YYYY-MM-DD HH24:MI:SS TZD') AS curr_timestamp
      ,SUM(failed_count) AS failed_count
      ,TO_CHAR(MIN(first_occur_time),'yyyy-mm-dd hh24:mi:ss') AS first_occur_time
      ,TO_CHAR(MAX(last_occur_time),'yyyy-mm-dd hh24:mi:ss') AS last_occur_time
FROM (
    SELECT COUNT(db_user) AS failed_count
          ,MIN(extended_timestamp) AS first_occur_time
          ,MAX(extended_timestamp) AS last_occur_time
     FROM sys.dba_common_audit_trail
    WHERE action BETWEEN 100 AND 102
          AND returncode != 0
          AND statement_type = 'LOGON'
          AND extended_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
    UNION
    SELECT COUNT(dbid) AS failed_count
          ,MIN(event_timestamp) AS first_occur_time
          ,MAX(event_timestamp) AS last_occur_time
     FROM audsys.aud$unified
    WHERE action = 100
      AND return_code <> 0
      AND event_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
);

NOTE: This code ran in a couple of seconds, rather than the many minutes taken by the version accessing the “.bin” files. Of course, you are no longer reading any of the spillover audit records, which makes the metric fairly useless on a READ-ONLY Physical Standby.

Of course, if you change the code there’s a very good chance it will be overwritten the next time you patch, or a different SQL will be used if you change how you are auditing, and you will have to remember to maintain the change manually. Not ideal!
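For completeness, here is a sketch of option 4 – loading the spillover files into the table so the view no longer has to parse them. This needs a read-write database and audit administration privileges; the follow-up purge is optional and shown purely as an illustration:

-- Load the ".bin" spillover files into the unified audit trail
-- table; Oracle removes the files once they have been loaded.
EXEC DBMS_AUDIT_MGMT.LOAD_UNIFIED_AUDIT_FILES;

-- Optionally purge old audit records afterwards, honouring the
-- last archive timestamp, to keep the long retention in check.
BEGIN
  DBMS_AUDIT_MGMT.CLEAN_AUDIT_TRAIL(
    audit_trail_type        => DBMS_AUDIT_MGMT.AUDIT_TRAIL_UNIFIED,
    use_last_arch_timestamp => TRUE);
END;
/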


OEM and monitoring the -MGMTDB GIMR Database

When you install Grid Infrastructure 12.1.0.2, you are presented with a new (annoyingly named) “-MGMTDB” database.

This database is a standard 12c CDB/PDB storing transient performance and other information (the Grid Infrastructure Management Repository, or GIMR). If it is lost, no biggie – just re-create it in your voting-disk diskgroup. (Aside: create a new MGMTDB_DG, move your voting disk there, re-create -MGMTDB, then move your voting disk back out to the proper multiple voting volumes.)

However, Oracle Enterprise Manager Cloud Control 12.1.0.5.0 and earlier sees this Oracle database – PDB, listener and all – and decides to discover it. This is incorrect and should not happen: the database is supposed to be “masked off” from OEM. Monitoring it will only lead to false positives and problems where none really exist, and all of its targets should be ignored (as per the picture below).

[Screenshot: OEM-MGMTDB – the discovered -MGMTDB targets]

In a future release of Oracle Enterprise Manager Cloud Control, these targets will no longer be discovered and will automatically remain hidden from view within OEM, once the team have fixed the bug which – I was very reliably informed – they discovered the root cause of today.

Cloud Control 12c Discovery Problem

You know when things just don’t quite work as they should, and you spend far longer than you should sorting them out? Well, this was one of those times. I burned 20 minutes on this!

I’m running Oracle Enterprise Linux 6.6 with Oracle Enterprise Manager Cloud Control 12.1.0.4, and I’m trying to discover some lovely new targets. Discovery from within OEM is simplicity itself: delegate privileges from the OEM server, allow unfettered sudo access from the installation account (oracle) to root, and auto-discover by IP range.

Except that in OEL 6.6 it doesn’t work. The discovery uses nmap, and nmap is looking for an older version of libpcap than the one shipped with 6.6, so the discovery fails:

/u01/app/oracle/agent12c/agent_inst/discovery/nmap/bin/nmap: error while loading shared libraries: libpcap.so.0.9.4: cannot open shared object file: No such file or directory

But I have a newer version of libpcap installed!

-bash-4.1$ rpm -qa | grep libpcap
libpcap-1.4.0-1.20130826git2dbcaa1.el6.x86_64
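You can see exactly which libpcap the bundled nmap binary is after by running ldd against it (path taken from the error above; the output shown is illustrative):

ldd /u01/app/oracle/agent12c/agent_inst/discovery/nmap/bin/nmap | grep libpcap
        libpcap.so.0.9.4 => not found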

So, what to do? You can try to get your hands on an old version of libpcap, or you can cheat and symbolically link the new version to the old version name!

ln -s /usr/lib64/libpcap.so.1.4.0 /usr/lib64/libpcap.so.0.9.4

And all is well with the world again. And I’ve discovered all of my targets from within OEM. Nice.

P.S. Ensure you have nmap on your system!

yum install nmap
Setting up Install Process
Package 2:nmap-5.51-4.0.1.el6.x86_64 already installed and latest version