Unexpected Performance Issue with Unified Audit and OEM

Oracle 19C. January 2023.

EM Event: Critical: database-server.domain - Disk Device sda is 98.01% busy.

A customer was experiencing excessive I/O against the operating disk (sda), which indicated problems with /u01 or /u03.

There was nothing obviously writing a lot of data to any “sda” mounted filesystems (and no swapping), so process tracing was initiated to review I/O against processes (pidstat), a process identified and linked to a query via GV$PROCESS and GV$SESSION, and this query against the audit trail was identified as a primary culprit.

(warning: pidstat can be very resource hungry – please use with caution!)

Bad SQL - sql_id=3wybn83gkuc4k

SELECT TO_CHAR(current_timestamp AT TIME ZONE 'GMT','YYYY-MM-DD HH24:MI:SS TZD') AS curr_timestamp
      ,SUM(failed_count) AS failed_count
      ,TO_CHAR(MIN(first_occur_time),'yyyy-mm-dd hh24:mi:ss') AS first_occur_time
      ,TO_CHAR(MAX(last_occur_time),'yyyy-mm-dd hh24:mi:ss') AS last_occur_time
FROM
(
SELECT COUNT(db_user) AS failed_count
      ,MIN(extended_timestamp) AS first_occur_time
      ,MAX(extended_timestamp) AS last_occur_time
  FROM sys.dba_common_audit_trail
 WHERE action BETWEEN 100 AND 102
   AND returncode != 0
   AND statement_type = 'LOGON'
   AND extended_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
UNION
SELECT COUNT(dbusername) AS failed_count
      ,MIN(event_timestamp) AS first_occur_time
      ,MAX(event_timestamp) AS last_occur_time
  FROM unified_audit_trail
 WHERE action_name = 'LOGON'
   AND return_code <> 0
   AND event_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
)

The problem presented itself a couple months after the implementation of Unified Audit (with a long retention).
The unified audit table AUDSYS.AUD$UNIFIED is partitioned by EVENT_TIMESTAMP, which is one of the predicates, and the code is only looking at the last 30 minutes of the data.

BUT, we aren’t looking just at the partitioned audit table. If the database cannot write to the audit table (e.g. if it is a physical standby), there is an overflow location of $ORACLE_BASE/audit/$instance_name. This contains “.bin” files which are processed and read by the view UNIFIED_AUDIT_TRAIL, and joined to AUDSYS.AUD$UNIFIED.

As the “.bin” files cannot be indexed, every file for each instance is translated and read every time the code is invoked.
Which is every 30 minutes.

pwd
/u01/app/oracle/audit/<instance_name>

ls -1 
ora_audit_1127534399_3323516267.bin
ora_audit_1127554851_1290166488.bin
ora_audit_1127556228_1457226874.bin
ora_audit_1127570671_3180208504.bin
etc... 

many many files, each up to 10MB in size

But where is this SQL coming from?

There is a metric within Oracle Enterprise Manager (OEM) called “Failed Login Count”. This runs every 30 minutes. Coincidence? Nope.

How does this metric work to determine the amount of failed logins (not) connecting to the database every 30 minutes? It queries your audit. To do this, it uses a pre-processor PERL program to determine what sort of auditing you are doing (unified/OS/DB/XML/etc) and runs the relevant query against the database to pull the audit records. Remember, unless you explicitly have disabled it, if you are on Oracle 12.1 or above you are auditing failed logons by default. Handy.

The preprocessor for this code is called “failLogin.pl” – I discovered this in MOS 1265699.1 (which is about a related problem from Oracle 12.1, when Unified Audit didn’t work properly and had significant performance issues.)

You can find failLogin.pl in the $AGENT_HOME/*/scripts/ (I’m being deliberately vague here as it seems to have moved around in different versions!)

$ cd $AGENT_HOME (you can find this by looking at the background process with: ps -ef | grep agent )
$ find . -name failLogin.pl -print

Fixing This

So, what’s the solution to this excessive disk usage, processing all of those .bin files every 30 minutes?

Switch off the metric in OEM (but this is a really useful metric and can highlight hacking attempts, password problems, and deployment failure)
Create a nice new segregated and very very fast mount point for the .bin files
Keep the “.bin” audit directory really tidy. This means losing audit records of course, which you may not want.
Load the “.bin” records into the database using DBMS_AUDIT_MGMT.LOAD_UNIFIED_AUDIT_FILES, and remove them (which may not be possible if this is a Physical Standby open for READ)
Copy the failLogin.pl to a safe place, and amend the SQL to only read the table and not the .bin files. We can do this by accessing via the table directly, and not via the view UNIFIED_AUDIT_TRAIL (see amended SQL below)
I’m sure you can think of something else

SELECT TO_CHAR(current_timestamp AT TIME ZONE 'GMT','YYYY-MM-DD HH24:MI:SS TZD') AS curr_timestamp
      ,SUM(failed_count) AS failed_count
      ,TO_CHAR(MIN(first_occur_time),'yyyy-mm-dd hh24:mi:ss') AS first_occur_time
      ,TO_CHAR(MAX(last_occur_time),'yyyy-mm-dd hh24:mi:ss') AS last_occur_time
FROM (
    SELECT COUNT(db_user) AS failed_count
          ,MIN(extended_timestamp) AS first_occur_time
          ,MAX(extended_timestamp) AS last_occur_time
     FROM sys.dba_common_audit_trail
    WHERE action BETWEEN 100 AND 102
          AND returncode != 0
          AND statement_type = 'LOGON'
          AND extended_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
    UNION
    SELECT COUNT(dbid) AS failed_count
          ,MIN(event_timestamp) AS first_occur_time
          ,MAX(event_timestamp) AS last_occur_time
     FROM audsys.aud$unified
    WHERE action = 100
      AND return_code <> 0
      AND event_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
);

NOTE: This code ran in a couple of seconds, rather than the many minutes of the version accessing the .bin files. Of course, you are missing reading all of the audit making the metric fairly useless on a READ-ONLY Physical Standby.

Of course, if you change the code there’s a very good chance it will get overwritten the next time you patch, or use a different SQL if you change how you are auditing, and you will have to remember to keep the code maintained manually. Not ideal!