Unexpected Performance Issue with Unified Audit and OEM

Oracle 19C. January 2023.

EM Event: Critical: database-server.domain - Disk Device sda is 98.01% busy.
Photo of flat tyre on a car

A customer was experiencing excessive I/O against the operating disk (sda), which indicated problems with /u01 or /u03.

There was nothing obviously writing a lot of data to any “sda” mounted filesystems (and no swapping), so process tracing was initiated to review I/O against processes (pidstat), a process identified and linked to a query via GV$PROCESS and GV$SESSION, and this query against the audit trail was identified as a primary culprit.

(warning: pidstat can be very resource hungry – please use with caution!)

Bad SQL - sql_id=3wybn83gkuc4k

SELECT TO_CHAR(current_timestamp AT TIME ZONE 'GMT','YYYY-MM-DD HH24:MI:SS TZD') AS curr_timestamp
      ,SUM(failed_count) AS failed_count
      ,TO_CHAR(MIN(first_occur_time),'yyyy-mm-dd hh24:mi:ss') AS first_occur_time
      ,TO_CHAR(MAX(last_occur_time),'yyyy-mm-dd hh24:mi:ss') AS last_occur_time
FROM
(
SELECT COUNT(db_user) AS failed_count
      ,MIN(extended_timestamp) AS first_occur_time
      ,MAX(extended_timestamp) AS last_occur_time
  FROM sys.dba_common_audit_trail
 WHERE action BETWEEN 100 AND 102
   AND returncode != 0
   AND statement_type = 'LOGON'
   AND extended_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
UNION
SELECT COUNT(dbusername) AS failed_count
      ,MIN(event_timestamp) AS first_occur_time
      ,MAX(event_timestamp) AS last_occur_time
  FROM unified_audit_trail
 WHERE action_name = 'LOGON'
   AND return_code <> 0
   AND event_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
)

The problem presented itself a couple months after the implementation of Unified Audit (with a long retention).
The unified audit table AUDSYS.AUD$UNIFIED is partitioned by EVENT_TIMESTAMP, which is one of the predicates, and the code is only looking at the last 30 minutes of the data.

BUT, we aren’t looking just at the partitioned audit table. If the database cannot write to the audit table (e.g. if it is a physical standby), there is an overflow location of $ORACLE_BASE/audit/$instance_name. This contains “.bin” files which are processed and read by the view UNIFIED_AUDIT_TRAIL, and joined to AUDSYS.AUD$UNIFIED.

As the “.bin” files cannot be indexed, every file for each instance is translated and read every time the code is invoked.
Which is every 30 minutes.

pwd
/u01/app/oracle/audit/<instance_name>

ls -1 
ora_audit_1127534399_3323516267.bin
ora_audit_1127554851_1290166488.bin
ora_audit_1127556228_1457226874.bin
ora_audit_1127570671_3180208504.bin
etc... 

many many files, each up to 10MB in size

But where is this SQL coming from?

There is a metric within Oracle Enterprise Manager (OEM) called “Failed Login Count”. This runs every 30 minutes. Coincidence? Nope.

How does this metric work to determine the amount of failed logins (not) connecting to the database every 30 minutes? It queries your audit. To do this, it uses a pre-processor PERL program to determine what sort of auditing you are doing (unified/OS/DB/XML/etc) and runs the relevant query against the database to pull the audit records. Remember, unless you explicitly have disabled it, if you are on Oracle 12.1 or above you are auditing failed logons by default. Handy.

The preprocessor for this code is called “failLogin.pl” – I discovered this in MOS 1265699.1 (which is about a related problem from Oracle 12.1, when Unified Audit didn’t work properly and had significant performance issues.)

You can find failLogin.pl in the $AGENT_HOME/*/scripts/ (I’m being deliberately vague here as it seems to have moved around in different versions!)

$ cd $AGENT_HOME (you can find this by looking at the background process with: ps -ef | grep agent )
$ find . -name failLogin.pl -print

Fixing This

So, what’s the solution to this excessive disk usage, processing all of those .bin files every 30 minutes?

  1. Switch off the metric in OEM (but this is a really useful metric and can highlight hacking attempts, password problems, and deployment failure)
  2. Create a nice new segregated and very very fast mount point for the .bin files
  3. Keep the “.bin” audit directory really tidy. This means losing audit records of course, which you may not want.
  4. Load the “.bin” records into the database using DBMS_AUDIT_MGMT.LOAD_UNIFIED_AUDIT_FILES, and remove them (which may not be possible if this is a Physical Standby open for READ)
  5. Copy the failLogin.pl to a safe place, and amend the SQL to only read the table and not the .bin files. We can do this by accessing via the table directly, and not via the view UNIFIED_AUDIT_TRAIL (see amended SQL below)
  6. I’m sure you can think of something else
SELECT TO_CHAR(current_timestamp AT TIME ZONE 'GMT','YYYY-MM-DD HH24:MI:SS TZD') AS curr_timestamp
      ,SUM(failed_count) AS failed_count
      ,TO_CHAR(MIN(first_occur_time),'yyyy-mm-dd hh24:mi:ss') AS first_occur_time
      ,TO_CHAR(MAX(last_occur_time),'yyyy-mm-dd hh24:mi:ss') AS last_occur_time
FROM (
    SELECT COUNT(db_user) AS failed_count
          ,MIN(extended_timestamp) AS first_occur_time
          ,MAX(extended_timestamp) AS last_occur_time
     FROM sys.dba_common_audit_trail
    WHERE action BETWEEN 100 AND 102
          AND returncode != 0
          AND statement_type = 'LOGON'
          AND extended_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
    UNION
    SELECT COUNT(dbid) AS failed_count
          ,MIN(event_timestamp) AS first_occur_time
          ,MAX(event_timestamp) AS last_occur_time
     FROM audsys.aud$unified
    WHERE action = 100
      AND return_code <> 0
      AND event_timestamp >= current_timestamp - to_dsinterval('0 0:30:00')
);

NOTE: This code ran in a couple of seconds, rather than the many minutes of the version accessing the .bin files. Of course, you are missing reading all of the audit making the metric fairly useless on a READ-ONLY Physical Standby.

Of course, if you change the code there’s a very good chance it will get overwritten the next time you patch, or use a different SQL if you change how you are auditing, and you will have to remember to keep the code maintained manually. Not ideal!

Advertisement

SAN Migration: When modern SANs “Fail”

… or at least, when they don’t behave like you were expecting them to.

I recently performed a major migration of a dozen missions critical Oracle and SQL Server systems from a pair of old EMC CX700’s to 2 brand new shiny HP XP 20000’s. This blog post is intended to show, at a very high level, what steps were taken to ensure the new SAN could cope with the volume of traffic. It also shows that, despite the new SAN being much more modern and intelligent, how you can still get caught out by unexpected behaviour.

The databases to be migrated varied from high throughput OLTP, low throughput – fast response time OLTP, a small Data Warehouse and Grid Control. The systems are 24 x 7 x 365, but a nightly maintenance window can be arranged on some days of the week between midnight and 3am. Not a large window, but invaluable.

What are we migrating?

For a successful SAN migration, we first need to understand what we are migrating. To do this we need to profile the I/O requirement of each Oracle database. Oracle provides us with all of the tools we need to be able to do this. If you have the Tuning and Diagnostics Pack licensed (which all Oracle sites should have – it should really be part of base product.) then you can pull the information you need directly out of the AWR table DBA_HIST_SYSTEM_EVENT. By default your AWR stats gather every hour and only last a week. Hourly averages of system event wait times will give a good starting point, but I would not want to use them to determine peak volumes. If you capacity plan to your averages, you will run out of resources before you hit your peaks. A lot can happen in an hour when you are looking at millisecond response times. You need to be much more granular than an hourly average, but you might not want change AWR to gather all of its information as frequently as you need for the I/O events. You need to develop some code to gather the information from v$system_event directly, so I did [I will post this code up shortly]. You can be as granular as you want, but I balanced peaks against volume of information and the impact of gathering it and was running every 5 minutes. Mostly. I also took some much more frequent samples to ensure that I wasn’t wildly wrong with the 5 minute averages, and that I wasn’t experiencing very short peaky loads. This method also has the advantage of not needing any additional pack licensing from Oracle. Sorry Larry.

What are the resources available on the new SAN?

The second requirement for a SAN migration is to understand the capabilities of your new SAN. There were already some systems on the new SAN so I was unable to test to destruction. However, some dedicated resources were available on the shared SAN, namely dedicated disks, cache and front-end ports. I needed to throw substantial workloads at the SAN to stress the resources and find the tipping point.  There are many workload tools available, such as Hammerora and RAT. Some are expensive, such as RAT. However, for my purposes I was only looking to stress the SAN and get a comfortable feeling about the IOPS and throughput performance. This was achieved using Oracle’s free SAN stress tool, Orion [I will hopefully post more details about this shortly]. By taking all of the profiled I/O rates, adding them up and comparing against the new SAN and it’s much greater cache capacity (16GB dedicated to my systems, up from 4GB on the EMC’s), we were able to see that the footprint of the databases to be migrated fitted comfortably into the capabilities of the new SAN. And so the migrations began.

Thin Provisioning

To provide resilience for the databases, LUNs were presented from each SAN to each cluster node, and were mirrored at the Operating System level, either using native mirroring on the Unix platforms, or Veritas on the Windows platforms. This allows either SAN (or SAN location) to fail with no loss of service. The LUNs were either RAID-10 or RAID-5, and were carved out of the storage pools using Thin Provisioning.

Thin Provisioning (or using Meta-Luns or similar techniques from your SAN provider) is a way to allow LUN’s to be easily carved out of storage pools and extended at a later date. They are wonderful for SAN administrators and people who have not capacity planned their systems due to their dynamic nature and minimization of space wastage. However, they work on the basis of providing a bit of storage to your LUN from every disk in the array. This gives a maximum concurrency, maximum contention model. Every LUN is on the same disk competing for the same resource. Lots of IOPS, but everyone wants them. The Thin Provisioning also adds a small overhead to the I/O processing. Not much. Not even a whole milli-second.

Migration Problem

The migrations progressed well, from least critical to most critical system. As each system was migrated, we kept monitoring and measuring the I/O response times. Migrating the Data Warehouse showed that the new SAN was performing as expected. Processing times were lower. Not much, but in line with expectation and allocated resources. However, Grid Control was exhibiting some strange response times. Despite theoretically having more resources, some I/O response time metrics were worse on average, and much more variable than before. With the EMC SAN’s, we  had experienced a good, consistent level of performance for each LUN. The XP’s were proving to be more erratic.

I did a lot of investigation, moving resources around (e.g. dedicated front-end ports for the Grid Control LUN’s.) and measuring the affect. The level of I/O associated with Grid Control, with only a couple of hundred servers under management, was substantially lower than that of systems competing for shared resources (e.g. the Data Warehouse, which was exponentially more demanding). It seemed that Grid Control’s LUNs were not able to survive in the SAN cache. The old EMC cx700’s were relatively “dumb” compared to the new XP20000’s, and had effectively dedicated an amount of  cache to each LUN. The more intelligent XP’s would not be so potentially wasteful of resources, and so deemed the (relatively) low level of activity from the Grid Controls LUN’s to be unimportant and aged the blocks out.

We could live with this for Grid Control. Blisteringly fast response time is not a critical factor in the success of Grid Control. The same could not be said for one of our mission critical applications. It requires a very low end-to-end response time for each transaction. We carefully measure each step – client, calculation, render, network latency, database response time, etc. It needs to be fast and consistent. From the database, it really needs to hit cache or SSD’s. Spinning rust (ordinary disk) is simply too slow. However, the level of activity for this system, in terms of MB/s and IOPS, is relatively low compared to all of the other systems competing for SAN resources. About 3% of the whole. Add to this the overhead of Thin Provisioning and we have a problem. When a 2ms average turns into a nearly 4ms average with much higher peaks, as it did for db_file_sequential_read on Grid Control, that would be a major problem for this application.

Solution

Talking to HP, they would only guarantee the response time we demand by using SSD’s and a cost that made my eyes spin and think of buying a Supercar instead. A more practical (i.e. cheaper) solution had to be found, and it was (with the help amd support of HP, just no guarantees 🙂 ). We bought some more conventional disk and configured it into a traditional Raid-10 storage pool. No Thin Provisioning. We then partitioned the 16GB SAN cache into a 4GB cache dedicated to this system, and left the 12GB cache behind for everything else. We were migrating from 4GB, and understood that we had plenty of wriggle room in this configuration.

And the result? After migration, the performance of the mission critical low throughput system improved substantially, with consistent log_file_parallel_write times of less than 2ms and db_file_sequential_read times of less than 1ms, better than we were experiencing on the old reliable EMC’s. I mean, less than 1ms? That’s fast.

I/O Response Times – Graph

So how big is that disk?

I’m doing a fair bit of SAN-based work at the moment, migrating a bunch of Oracle databases from EMC to HP (posts to follow regarding Orion testing and other related topics). One thing that annoys me is the way SAN Manufacturers have changed the meaning of technical terms over time to suit themselves. Firstly, they changed the meaning of RAID from “Redundant Array of Inexpensive Disk” to “Redundant Array of Independent Disk”. Apparently it’s hard to put a $1,000,000 price tag on “inexpensive” disk.

Also, the industry changed what a Megabyte is. This really annoys me as it can lead to unexpected shortfalls in the LUN allocations if your traditional Oracle Megabyte is different to the SAN Admin’s Megabyte.

In our industry, the terms “kilo”, “mega”, “giga“, “tera”, “peta”, and “exa” are commonly used prefixes for computing performance and capacity. SAN manufacturers use the terms defined in “powers of ten” measurement units:

  • A kilobyte (KB) is equal to 1,000 (103) bytes.
  • A megabyte (MB) is equal to 1,000,000 (106) bytes.
  • A gigabyte (GB) is equal to 1,000,000,000 (109) bytes.
  • A terabyte (TB) is equal to 1,000,000,000,000 (1012) bytes.
  • A petabyte (PB) is equal to 1,000,000,000,000,000 (1015) bytes
  • An exabyte (EB) is equal to 1,000,000,000,000,000,000 (1018) bytes

Most operating systems components and (importantly for us) Oracle use “powers of two” measurement units rather than “power of ten” units. They are defined as:

  • A kibibyte (KiB) is equal to 1,024 (210) bytes.
  • A mebibyte (MiB) is equal to 1,048,576 (220) bytes.
  • A gibibyte (GiB) is equal to 1,073,741,824 (230) bytes.
  • A tebibyte (TiB) is equal to 1,099,511,627,776 (240) bytes.
  • A pebibyte (PiB) is equal to 1,125,899,906,842,624 (250) bytes.
  • An exbibyte (EiB) is equal to 1,152,921,504,606,846,976 (260) bytes.

That means that 1 terabyte is about 9% smaller than a tebibyte. Thanks, SAN manufacturers, for making our lives just a tiny bit more difficult than it needed to be so you can market drives as being seemingly larger than they really are (in my head anyway.)