Stats Collection Time Anomaly

Jonathan Lewis (@JLOracle) recently published a short post about Stats Collection Time, talking about the view dba_optstat_operations (and dba_optstat_operation_tasks), which reminded me about (what I regard as) an anomaly in the output of the NOTES column in Oracle 12C.

I won’t repeat why it’s useful to check these views, as Jonathan’s note and @MDWidlakes’s comment here should give you all you need to know.

The DBA_OPTSTAT_OPERATIONS.NOTES column contains the parameters passed into the DBMS_STATS command, so you know what was done. It also reports the DEFAULTs used by the DBMS_STATS job. Well, it does if you call DBMS_STATS explicitly, but the standard overnight auto job just says “default”. Why doesn’t it expand on that the way the explicit call does? If a default was changed between runs, you may end up with very different results but with no indication why. Am I missing something?
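If you want to know what those DEFAULTs currently resolve to (remembering they may have changed since the run you are looking at), a quick check against DBMS_STATS.GET_PREFS does the job. A minimal sketch, using preference names that match the parameters shown in the NOTES output:

-- What do the current global DBMS_STATS preferences resolve to?
SELECT DBMS_STATS.GET_PREFS('METHOD_OPT')       AS method_opt,
       DBMS_STATS.GET_PREFS('ESTIMATE_PERCENT') AS estimate_percent,
       DBMS_STATS.GET_PREFS('GRANULARITY')      AS granularity,
       DBMS_STATS.GET_PREFS('DEGREE')           AS degree_pref
  FROM dual;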

The following 2 rows of data show the output from each run. Note that the DEFAULT for METHOD_OPT in this database has been changed from “FOR ALL COLUMNS SIZE AUTO” to “FOR ALL COLUMNS SIZE REPEAT”** but was not explicitly passed-in for either run.
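The rows below were pulled with something along these lines (a sketch; the record-style formatting shown here was tidied up by hand):

-- Recent stats operations and the parameters recorded against them.
SELECT operation, target, start_time, end_time, status, job_name, notes
  FROM dba_optstat_operations
 WHERE start_time > SYSTIMESTAMP - INTERVAL '1' DAY
 ORDER BY start_time;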

DBMS_STATS.GATHER_SCHEMA_STATS – decodes the DEFAULTs

OPERATION : gather_schema_stats            
TARGET    : MYSCHEMA     
START_TIME: 15-SEP-16 07.04.47 
END_TIME  : 15-SEP-16 07.09.02 
STATUS    : COMPLETED                                   
JOB_NAME  : 
NOTES     : <params>
            <param name="block_sample" val="FALSE"/>
            <param name="cascade" val="NULL"/>
            <param name="concurrent" val="FALSE"/>
            <param name="degree" val="NULL"/>
            <param name="estimate_percent" val="DBMS_STATS.AUTO_SAMPLE_SIZE"/>
            <param name="force" val="FALSE"/>
            <param name="gather_fixed" val="FALSE"/>
            <param name="gather_temp" val="FALSE"/>
            <param name="granularity" val="AUTO"/>
            <param name="method_opt" val="FOR ALL COLUMNS SIZE REPEAT"/>
            <param name="no_invalidate" val="NULL"/>
            <param name="options" val="GATHER"/>
            <param name="ownname" val="MYSCHEMA"/>
            <param name="reporting_mode" val="FALSE"/>
            <param name="statid" val=""/>
            <param name="statown" val=""/>
            <param name="stattab" val=""/>
            <param name="stattype" val="DATA"/>
            </params>

Autotask Overnight Gather – doesn’t decode the DEFAULTs

OPERATION : gather_database_stats (auto)   
TARGET    : AUTO       
START_TIME: 15-SEP-16 22.01.20 
END_TIME  : 15-SEP-16 22.38.40 
STATUS    : COMPLETED            
JOB_NAME  : ORA$AT_OS_OPT_SY_1212  
NOTES     : <params>
            <param name="block_sample" val="FALSE"/>
            <param name="cascade" val="NULL"/>
            <param name="concurrent" val="FALSE"/>
            <param name="degree" val="DEFAULT_DEGREE_VALUE"/>
            <param name="estimate_percent" val="DEFAULT_ESTIMATE_PERCENT"/>
            <param name="granularity" val="DEFAULT_GRANULARITY"/>
            <param name="method_opt" val="DEFAULT_METHOD_OPT"/>
            <param name="no_invalidate" val="DBMS_STATS.AUTO_INVALIDATE"/>
            <param name="reporting_mode" val="FALSE"/>
            <param name="stattype" val="DATA"/>
            </params>

**as it should be in EVERY ORACLE DATABASE EVER from the start, to allow you to control which histograms you need and maintain on your schema. The Oracle default approach of “everything is skewed, thousands of histograms everywhere please” is particularly painful for OLTP databases using bind variables. I’m sure some of Oracle’s Adaptive Optimization exists to work around the bad things that happen under this particular scenario.

 

Accessing STATUS columns efficiently

A frequently recurring design problem with relational databases is locating the unprocessed rows in a large table, so that we know which rows of data are still to be processed.

The problem with a STATUS column is that it generally has low cardinality; there are probably only a handful of distinct values [(C)omplete, (E)rror, (U)nprocessed or something like that]. Most records will be (C)omplete. This makes STATUS a poor candidate for standard B-Tree indexing. In a high-throughput OLTP database, using bitmap indexes is probably not an option due to concurrency.

[Aside: When coding flag columns in Oracle, ALWAYS use a VARCHAR2(1 CHAR) {or CHAR(1 CHAR) if you prefer, but a CHAR is a VARCHAR2 under the covers and occupies the same number of bytes}. This is in preference to a NUMBER(1), which occupies more bytes for a “1” than a “0”, so when you update it you run the risk of row migration, chained rows and a performance hit. Frequently, ORMs like Hibernate code flags as NUMBER by default. Override this!]
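You can see the size difference for yourself with VSIZE. A quick sketch (run it on your own version to confirm; a single-character flag is always 1 byte, whereas NUMBER values vary in size):

-- Byte sizes of flag representations: 'Y' = 1 byte, 0 = 1 byte, 1 = 2 bytes.
SELECT VSIZE('Y') AS char_flag,
       VSIZE(0)   AS num_zero,
       VSIZE(1)   AS num_one
  FROM dual;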

So what are my options? There’s a short list of possible table accesses for a low cardinality column.

1. Table scan. In an OLTP database where you only want a tiny fraction of the rows in the table, this would be a bad choice.
2. Index the accessed columns and accept the inevitable INDEX_SCAN or FAST_FULL_INDEX_SCAN. This is not great and you probably need a Histogram on the column to convince the optimizer to use the index for your low frequency values. Otherwise you may be back to the table scan.
3. Make the “Complete” status “NULL”.
4. Use a function-based index which makes the Complete status appear to be NULL for a specific query.

So what’s with options 3 and 4, why are they good, and how do we use them?

Unlike some RDBMSs, Oracle does not store NULL values in its simple (non-composite) B-tree indexes. Therefore, if you choose Option (3) and make your “Complete” status be represented by a NULL, you will maintain an index on STATUS in which the only values stored are the values you are interested in. This makes the index very sexy to the optimizer as it will generally be very tiny. However, we face one small problem: convincing Developers that having a NULL as a valid status is a good idea can be difficult. A NULL is a non-representative value. It is not supposed to represent anything. It means “I don’t know”. It doesn’t behave the same as normal values. This tends to freak out Developers and designers sometimes.
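A minimal sketch of Option 3 (the table and column names are made up for illustration); the index only ever contains entries for the rows that still need attention:

-- NULL means "complete", so completed rows never appear in the index.
CREATE TABLE work_queue
( id      NUMBER            NOT NULL PRIMARY KEY
, status  VARCHAR2(1 CHAR)            -- NULL = complete, 'U' = unprocessed, 'E' = error
, payload VARCHAR2(100 CHAR)
);

CREATE INDEX work_queue_status ON work_queue (status);

-- Find the outstanding work: the tiny index serves this.
SELECT id FROM work_queue WHERE status = 'U';

-- Mark a row complete: its index entry disappears because the key is NULL.
UPDATE work_queue SET status = NULL WHERE id = 42;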

That’s where Option 4 comes in. If we wrap the index definition in a CASE statement, to produce a function-based index, we have a highly specific, tailored index on our table. If the SQL predicate matches the index expression exactly, we get a serious performance payoff.

But don’t take my word for it. Here’s a worked example from my laptop:

 
Here’s the table and its data distribution (16m rows, and a handful we care about):

NEIL @ ORCL01 > desc test_table
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ID                            NOT NULL NUMBER
 STATUS                        NOT NULL VARCHAR2(1 CHAR)
 DESCRIPTION                   NOT NULL VARCHAR2(100 CHAR)

NEIL @ ORCL01 > select status,count(*) from test_table group by status

S   COUNT(*)
- ----------
E         16
C   16777216
Y         32

 
Here are the indexes on the table, and their sizes. As you can see, the function-based index is absolutely tiny, making it as attractive to storage admins as it is to the optimizer.

- alter table test_table add constraint test_table_pk primary key (id);
- create index test_table_CASE on test_table (case status when 'Y' then status else null end);
- create index test_table_COVER_COMP on test_table (status, id) compress 1;
- create index test_table_STATUS on test_table (status) compress 1;



NEIL @ ORCL01 > select segment_name,segment_type,sum(bytes/1024) kb from user_extents 
where segment_name like 'TEST_TABLE%' 
group by segment_type,segment_name order by 2 desc,1;

SEGMENT_NAME               SEGMENT_TYPE               KB
-------------------------- ------------------ ----------
TEST_TABLE                 TABLE                  555008
TEST_TABLE_CASE            INDEX                      64
TEST_TABLE_COVER_COMP      INDEX                  658432
TEST_TABLE_PK              INDEX                  319488
TEST_TABLE_STATUS          INDEX                  413696

Some Index stats:
INDEX_NAME                DISTINCT_KEYS AVG_LEAF_BLOCKS_PER_KEY AVG_DATA_BLOCKS_PER_KEY CLUSTERING_FACTOR STATUS     NUM_ROWS SAMPLE_SIZE LAST_ANAL
------------------------- ------------- ----------------------- ----------------------- ----------------- -------- ---------- ----------- ---------
TEST_TABLE_CASE                       1                       1                       6                 6 VALID            32          32 21-FEB-16
TEST_TABLE_COVER_COMP          16748149                       1                       1            125447 VALID      16748149      234974 21-FEB-16
TEST_TABLE_PK                  17003239                       1                       1             91391 VALID      17003239      492287 21-FEB-16
TEST_TABLE_STATUS                     3                   13828                   32011             96034 VALID      16257590      363295 21-FEB-16

 
Where we have a choice of useful indexes, we get a FAST FULL SCAN with a hefty cost. A histogram could have given us an index RANGE SCAN, which can be very good.
With no Histogram:

select id from test_table where status = 'Y';

Plan hash value: 1140618830

----------------------------------------------------------------------------------------------
| Id  | Operation            | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                       |       |       | 18753 (100)|          |
|*  1 |  INDEX FAST FULL SCAN| TEST_TABLE_COVER_COMP |  5592K|    42M| 18753   (1)| 00:00:01 |
----------------------------------------------------------------------------------------------

 
With a histogram in place on STATUS, you get a much better plan as the covering index avoids the need for the table look-up. You also get the risk that the optimizer may have bind variable peeking issues and other complications should we have lots of table joins.

select id from test_table where status = 'Y'

Plan hash value: 2912582684

------------------------------------------------------------------------------------------
| Id  | Operation        | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |                       |       |       |     3 (100)|          |
|*  1 |  INDEX RANGE SCAN| TEST_TABLE_COVER_COMP |    32 |   256 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

NOTE: Ditching the covering index and just using the index on STATUS is pretty efficient too when combined with a histogram:

select id from test_table where status = 'Y'

Plan hash value: 2416598805

---------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |       |       |     4 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TEST_TABLE        |    32 |   256 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | TEST_TABLE_STATUS |    32 |       |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------


 
And now with the function-based index; the CASE statement removes all of the statuses we are not interested in, giving a tiny, tidy index.

NOTE: The Predicate in the query must EXACTLY match the function-based index for it to be used.

select id from test_table where case status when 'Y' then status else null end = 'Y'

Plan hash value: 2073004851

-------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |       |       |     7 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TEST_TABLE      |    32 |   256 |     7   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | TEST_TABLE_CASE |    32 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Conclusion: For a highly skewed STATUS column you need a histogram, which is something you should mostly avoid in OLTP systems using bind variables. A highly focussed function-based index gives you a tiny, self-maintaining index which is guaranteed to only be used for queries that you want it to be used for.
 

NOTE: The original idea behind using NULLS to minimise index size came from the performance expert, Jonathan Lewis. I have implemented both NULL-as-complete design and case-based indexes at several clients, in varying forms, and always to great success.

Pre-creating Interval Partitions

One of the major problems with interval partitioning is statistics. Partitions appear dynamically as needed, based upon data being inserted or updated, and any partition magically appearing isn’t going to have any statistics.

This is generally a stability issue as you will, at best, be using dynamic statistics for your optimizations. So how do we deal with it? My preferred method is to pre-create the partitions and copy statistics from a good partition into the pre-created partition. But how do we get the partitions to appear? Here are 2 options:

  1. Insert a row with the partition key for the partition we wish to create, and roll back. This can be tricky, especially with tables containing many NOT NULL columns, and is subject to failure as the table evolves.
  2. Lock the relevant partition in share mode using the command LOCK TABLE <table> PARTITION FOR (<partition-key>) IN SHARE MODE;

    This will place a shared lock on the non-existent partition, which Oracle will create so that it can lock it. A much less messy solution, and not one I had thought of until shown the light by Dan Jankowski.

So does option 2 work? Here’s a quick example:

10:51:55 NEIL @ ORCL01 > CREATE TABLE interval_table (id NUMBER, created_date DATE)
10:51:55   2             PARTITION BY RANGE (created_date) INTERVAL (NUMTOYMINTERVAL(1,'MONTH'))
10:51:55   3           ( PARTITION part_01 values LESS THAN (TO_DATE('01-JAN-2015','DD-MON-YYYY')))
10:51:55   4  /
Table created.

10:51:55 NEIL @ ORCL01 > select table_name, partition_name,high_value from user_tab_partitions order by 1,2;
TABLE_NAME                     PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------ --------------------------------------------------------------------------------
INTERVAL_TABLE                 PART_01                        TO_DATE(' 2015-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA


use a shared lock to generate a new partition

10:51:55 NEIL @ ORCL01 > lock table interval_table partition for (to_date('01-JAN-2016','DD-MON-YYYY')) in share mode;
Table(s) Locked.

10:51:55 NEIL @ ORCL01 > select table_name, partition_name,high_value from user_tab_partitions order by 1,2;
TABLE_NAME                     PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------ --------------------------------------------------------------------------------
INTERVAL_TABLE                 PART_01                        TO_DATE(' 2015-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P647                       TO_DATE(' 2016-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA

10:51:55 NEIL @ ORCL01 > lock table interval_table partition for (to_date('01-FEB-2016','DD-MON-YYYY')) in share mode;
Table(s) Locked.
10:51:55 NEIL @ ORCL01 > lock table interval_table partition for (to_date('01-MAR-2016','DD-MON-YYYY')) in share mode;
Table(s) Locked.
10:51:55 NEIL @ ORCL01 > lock table interval_table partition for (to_date('01-APR-2016','DD-MON-YYYY')) in share mode;
Table(s) Locked.
10:51:55 NEIL @ ORCL01 > select table_name, partition_name,high_value from user_tab_partitions order by 1,2;
TABLE_NAME                     PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------ --------------------------------------------------------------------------------
INTERVAL_TABLE                 PART_01                        TO_DATE(' 2015-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P647                       TO_DATE(' 2016-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P648                       TO_DATE(' 2016-03-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P649                       TO_DATE(' 2016-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P650                       TO_DATE(' 2016-05-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA

10:51:55 NEIL @ ORCL01 > -- and release the locks... the rollback doesn't rollback the creation of the new partitions.
10:51:55 NEIL @ ORCL01 > rollback;
Rollback complete.

So now what? To get the stats right, I follow this rule set:

Firstly, lock the table stats when you create the table and have a dedicated, focused stats job. This stops the general stats job from getting in the way of this important partitioned table.

  • Create a new partition
  • Copy stats using DBMS_STATS.COPY_TABLE_STATS from your fullest partition (with FORCE=TRUE to override the locked stats). Always try to pretend you have more data than you really have when faking it – it’s safer, as an optimized plan that expects a large data set but processes a small one tends to work much better than one whose stats assume a small data set and which is then surprised by lots of data. Consider using SCALE_FACTOR if you have a growing dataset. Don’t rely upon Optimizer magic, such as Adaptive Execution Plans, to dig you out of a hole. (There’s a sketch of the lock-and-copy after this list.)
  • Once a partition becomes “old” (like a no-longer used date-range partition), gather actual stats to get all your stats for that partition accurate. That may even become your new baseline stats copy partition. You will possibly never need to gather stats again for that partition.
  • Be careful if you are copying stats when you have an object-level change. For example, if you put a new index on, you need to re-copy stats post change to any pre-created partitions. Empty pre-created partitions will have accurate (i.e. empty) stats and that’s really not what you want.
  • Repeat each time you pre-create a new partition.
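A minimal sketch of the lock-and-copy approach, with made-up owner, table and partition names (and assuming the target partition has already been created as above):

BEGIN
  -- Lock the stats once, at table creation time, so the general
  -- maintenance job leaves this table alone.
  DBMS_STATS.LOCK_TABLE_STATS(ownname => 'MYSCHEMA', tabname => 'MY_PART_TABLE');

  -- Copy stats from a full, representative partition into the pre-created,
  -- empty one. FORCE => TRUE overrides the lock; SCALE_FACTOR can inflate
  -- the copied figures for a growing data set.
  DBMS_STATS.COPY_TABLE_STATS(
      ownname      => 'MYSCHEMA',
      tabname      => 'MY_PART_TABLE',
      srcpartname  => 'SYS_P647',   -- the "good" source partition
      dstpartname  => 'SYS_P650',   -- the pre-created target partition
      scale_factor => 1,
      force        => TRUE);
END;
/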

Auditing Read-Only Standbys

If your company has a passing interest in database security, you are probably running some sort of session auditing at the very least [audit session] (although this can also be useful for troubleshooting connectivity issues). There’s a reasonable chance you’re also running some level of object auditing, or even DML access auditing if your employer so dictates:

Check access/change of objects in the DB
  1  select audit_option, success, failure
  2  from dba_stmt_audit_opts
  3  union
  4  select privilege, success, failure
  5  from dba_priv_audit_opts
  6* order by 2,1
AUDIT_OPTION                                       SUCCESS                                  FAILURE
-------------------------------------------------- ---------------------------------------- -----------------
ALTER ANY PROCEDURE                                BY ACCESS                                BY ACCESS
ALTER ANY TABLE                                    BY ACCESS                                BY ACCESS
ALTER DATABASE                                     BY ACCESS                                BY ACCESS
ALTER PROFILE                                      BY ACCESS                                BY ACCESS
ALTER SYSTEM                                       BY ACCESS                                BY ACCESS
ALTER TABLE                                        BY ACCESS                                BY ACCESS
ALTER USER                                         BY ACCESS                                BY ACCESS
AUDIT SYSTEM                                       BY ACCESS                                BY ACCESS
CREATE ANY JOB                                     BY ACCESS                                BY ACCESS
CREATE ANY LIBRARY                                 BY ACCESS                                BY ACCESS
CREATE ANY PROCEDURE                               BY ACCESS                                BY ACCESS
CREATE ANY TABLE                                   BY ACCESS                                BY ACCESS
CREATE EXTERNAL JOB                                BY ACCESS                                BY ACCESS
CREATE LIBRARY                                     BY ACCESS                                BY ACCESS
CREATE PROCEDURE                                   BY ACCESS                                BY ACCESS
CREATE PUBLIC DATABASE LINK                        BY ACCESS                                BY ACCESS
CREATE SESSION                                     BY ACCESS                                BY ACCESS
CREATE TABLE                                       BY ACCESS                                BY ACCESS
CREATE USER                                        BY ACCESS                                BY ACCESS
DATABASE LINK                                      BY ACCESS                                BY ACCESS
DIRECTORY                                          BY ACCESS                                BY ACCESS
DROP ANY PROCEDURE                                 BY ACCESS                                BY ACCESS
DROP ANY TABLE                                     BY ACCESS                                BY ACCESS
DROP PROFILE                                       BY ACCESS                                BY ACCESS
DROP USER                                          BY ACCESS                                BY ACCESS
EXEMPT ACCESS POLICY                               BY ACCESS                                BY ACCESS
GRANT ANY OBJECT PRIVILEGE                         BY ACCESS                                BY ACCESS
GRANT ANY PRIVILEGE                                BY ACCESS                                BY ACCESS
GRANT ANY ROLE                                     BY ACCESS                                BY ACCESS
PROFILE                                            BY ACCESS                                BY ACCESS
PUBLIC DATABASE LINK                               BY ACCESS                                BY ACCESS
PUBLIC SYNONYM                                     BY ACCESS                                BY ACCESS
ROLE                                               BY ACCESS                                BY ACCESS
SYSTEM AUDIT                                       BY ACCESS                                BY ACCESS
SYSTEM GRANT                                       BY ACCESS                                BY ACCESS
USER                                               BY ACCESS                                BY ACCESS
CREATE JOB                                         BY SESSION                               BY SESSION

And you’re probably writing it into a database table [AUDIT_TRAIL=’DB’]

So how does that work if you open a Dataguard standby database read-only? You are writing into sys.aud$ on the Primary, and that table is replicated to the Standby. So what happens?

From Oracle 11G, if you are running Dataguard and opening the standby up for read access, you may not notice the line in the ALERT log which reads:

AUDIT_TRAIL initialization parameter is changed to OS, as DB is NOT compatible for database opened with read-only access

So it starts writing O/S audit trail files for all of your auditing options (well, the session connect and DML audit options – you can’t run DDL in a read-only DB). You might want to go and see just how many files it has written to [audit_file_dest], as you may be surprised at just how many are in there.

You may, one day, either run out of space or (more worryingly) have so many millions of files that it causes a performance problem when Oracle accesses the O/S directory. You might want to think about some sort of periodic clean-up job.
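A minimal sketch of such a clean-up using DBMS_AUDIT_MGMT, run on a read-write database (on a read-only standby you would have to tidy the files up at the O/S level instead). The 30-day retention is just an example, so check the package behaviour on your version and your own retention requirements before using anything like this:

BEGIN
  -- One-off initialisation for O/S audit trail clean-up.
  DBMS_AUDIT_MGMT.INIT_CLEANUP(
      audit_trail_type         => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,
      default_cleanup_interval => 24);

  -- Only remove files older than the archive timestamp (here, 30 days ago).
  DBMS_AUDIT_MGMT.SET_LAST_ARCHIVE_TIMESTAMP(
      audit_trail_type  => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,
      last_archive_time => SYSTIMESTAMP - INTERVAL '30' DAY);

  -- Delete the old audit files.
  DBMS_AUDIT_MGMT.CLEAN_AUDIT_TRAIL(
      audit_trail_type        => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,
      use_last_arch_timestamp => TRUE);
END;
/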

SQL Plan Management – 12C dumb feature

In Oracle 11G, Oracle introduced SQL Plan Management (SPM). It is excellent (I love it to bits). It allows you to create Baselines against SQL which lock down the SQL execution plan. No more plan flips. More consistency. Perfect**.

Whenever some Baselined SQL is run, Oracle still parses it and compares the parsed output to the accepted (Evolved) baselines. If the newly parsed plan is better, a new baseline is added to DBA_SQL_PLAN_BASELINES but is NOT accepted. This means that you need to spend time manually evolving the baseline: running DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE and checking the new plan.

If you want it, and/or Oracle evaluates that it is a better plan for that particular set of bind variables, the plan is accepted and becomes a candidate to be used by future executions of your SQL. Complete control over your execution plans.
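That manual step looks something like this minimal sketch (the SQL handle is made up; EVOLVE_SQL_PLAN_BASELINE returns its verification report as a CLOB, and by default verifies and accepts the better plans it finds for that handle):

-- Review the unaccepted plans first, then evolve them deliberately, by hand.
SELECT sql_handle, plan_name, enabled, accepted
  FROM dba_sql_plan_baselines
 WHERE accepted = 'NO';

SET SERVEROUTPUT ON
DECLARE
  report CLOB;
BEGIN
  report := DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE(
                sql_handle => 'SQL_1234567890abcdef');  -- hypothetical handle
  DBMS_OUTPUT.PUT_LINE(report);
END;
/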

So, Oracle, what’s all this about in Oracle 12C, eh?

In Oracle 12C there’s a new SPM Evolve Advisor task. “By default, SYS_AUTO_SPM_EVOLVE_TASK runs daily in the scheduled maintenance window” – so it runs every night and, by default, it evolves every new baseline created that day and automatically accepts the new plans.

BY DEFAULT? NO! NO! NO!

That is precisely what I don’t want from baselines – Oracle making its own mind up about plans without any input from me. I’m using baselines to stop Oracle changing its mind; to explicitly limit the number of paths allowed by the Optimizer to ones I know about and with which I am comfortable. Don’t introduce functionality to do the opposite.

So, immediately following the installation of 12C, I would recommend running (you need to be SYS for this):

SELECT PARAMETER_NAME, PARAMETER_VALUE AS "VALUE"
FROM   DBA_ADVISOR_PARAMETERS
WHERE  TASK_NAME = 'SYS_AUTO_SPM_EVOLVE_TASK' AND
         PARAMETER_NAME in ('ACCEPT_PLANS','TIME_LIMIT')
ORDER BY 1;

PARAMETER_NAME            VALUE
------------------------- ----------
ACCEPT_PLANS              TRUE
TIME_LIMIT                3600

Then run:

BEGIN
  DBMS_SPM.SET_EVOLVE_TASK_PARAMETER('SYS_AUTO_SPM_EVOLVE_TASK',
    'ACCEPT_PLANS', 'false');
END;
/

OK, back where we were, with any baselines fixed in place and doing what I want them to do! Not change.

 

**Perfect? No. But Baselines are good and, as long as your DB structure does not change, they should keep working. If they don’t, raise an SR with Oracle as it’s probably a bug.

UPDATE 2015-11-25: This is still as true in 12.1.0.2.5 as it was in 12.0 Grrrr!!!

SAN Migration: When modern SANs “Fail”

… or at least, when they don’t behave like you were expecting them to.

I recently performed a major migration of a dozen mission-critical Oracle and SQL Server systems from a pair of old EMC CX700s to two brand new, shiny HP XP20000s. This blog post is intended to show, at a very high level, what steps were taken to ensure the new SAN could cope with the volume of traffic. It also shows that, despite the new SAN being much more modern and intelligent, you can still get caught out by unexpected behaviour.

The databases to be migrated varied from high throughput OLTP, low throughput – fast response time OLTP, a small Data Warehouse and Grid Control. The systems are 24 x 7 x 365, but a nightly maintenance window can be arranged on some days of the week between midnight and 3am. Not a large window, but invaluable.

What are we migrating?

For a successful SAN migration, we first need to understand what we are migrating. To do this we need to profile the I/O requirement of each Oracle database, and Oracle provides us with all of the tools we need to do it. If you have the Tuning and Diagnostics Packs licensed (which all Oracle sites should have – it should really be part of the base product) then you can pull the information you need directly out of the AWR view DBA_HIST_SYSTEM_EVENT. By default your AWR stats are gathered every hour and only kept for a week. Hourly averages of system event wait times will give a good starting point, but I would not want to use them to determine peak volumes. If you capacity plan to your averages, you will run out of resources before you hit your peaks. A lot can happen in an hour when you are looking at millisecond response times. You need to be much more granular than an hourly average, but you might not want to change AWR to gather all of its information as frequently as you need for the I/O events. You need to develop some code to gather the information from v$system_event directly, so I did [I will post this code up shortly]. You can be as granular as you want, but I balanced peaks against the volume of information and the impact of gathering it, and was running every 5 minutes. Mostly. I also took some much more frequent samples to ensure that I wasn’t wildly wrong with the 5-minute averages, and that I wasn’t experiencing very short, peaky loads. This method also has the advantage of not needing any additional pack licensing from Oracle. Sorry Larry.
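The sampling code isn’t reproduced here, but a minimal sketch of the idea (the table name and the event list are my own, for illustration) looks something like this. Schedule the insert every few minutes, then diff consecutive samples to get waits and average latency per interval:

-- A simple repository for point-in-time samples of I/O-related wait events.
CREATE TABLE io_event_samples AS
SELECT SYSTIMESTAMP AS sample_time, event, total_waits, time_waited_micro
  FROM v$system_event
 WHERE 1 = 0;

-- Take a sample (e.g. from a DBMS_SCHEDULER job running every 5 minutes).
INSERT INTO io_event_samples
SELECT SYSTIMESTAMP, event, total_waits, time_waited_micro
  FROM v$system_event
 WHERE event IN ('db file sequential read',
                 'db file scattered read',
                 'log file parallel write',
                 'direct path read',
                 'direct path write');
COMMIT;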

What are the resources available on the new SAN?

The second requirement for a SAN migration is to understand the capabilities of your new SAN. There were already some systems on the new SAN so I was unable to test to destruction. However, some dedicated resources were available on the shared SAN, namely dedicated disks, cache and front-end ports. I needed to throw substantial workloads at the SAN to stress the resources and find the tipping point. There are many workload tools available, such as Hammerora and RAT. Some are expensive, such as RAT. However, for my purposes I was only looking to stress the SAN and get a comfortable feeling about the IOPS and throughput performance. This was achieved using Oracle’s free SAN stress tool, Orion [I will hopefully post more details about this shortly]. By taking all of the profiled I/O rates, adding them up and comparing against the new SAN and its much greater cache capacity (16GB dedicated to my systems, up from 4GB on the EMCs), we were able to see that the footprint of the databases to be migrated fitted comfortably into the capabilities of the new SAN. And so the migrations began.

Thin Provisioning

To provide resilience for the databases, LUNs were presented from each SAN to each cluster node, and were mirrored at the Operating System level, either using native mirroring on the Unix platforms, or Veritas on the Windows platforms. This allows either SAN (or SAN location) to fail with no loss of service. The LUNs were either RAID-10 or RAID-5, and were carved out of the storage pools using Thin Provisioning.

Thin Provisioning (or using Meta-LUNs or similar techniques from your SAN provider) is a way to allow LUNs to be easily carved out of storage pools and extended at a later date. They are wonderful for SAN administrators, and for people who have not capacity planned their systems, due to their dynamic nature and minimization of space wastage. However, they work on the basis of providing a bit of storage to your LUN from every disk in the array. This gives a maximum-concurrency, maximum-contention model: every LUN is on the same disks, competing for the same resources. Lots of IOPS, but everyone wants them. Thin Provisioning also adds a small overhead to the I/O processing. Not much. Not even a whole millisecond.

Migration Problem

The migrations progressed well, from least critical to most critical system. As each system was migrated, we kept monitoring and measuring the I/O response times. Migrating the Data Warehouse showed that the new SAN was performing as expected. Processing times were lower. Not much, but in line with expectation and allocated resources. However, Grid Control was exhibiting some strange response times. Despite theoretically having more resources, some I/O response time metrics were worse on average, and much more variable, than before. With the EMC SANs, we had experienced a good, consistent level of performance for each LUN. The XPs were proving to be more erratic.

I did a lot of investigation, moving resources around (e.g. dedicated front-end ports for the Grid Control LUNs) and measuring the effect. The level of I/O associated with Grid Control, with only a couple of hundred servers under management, was substantially lower than that of the systems competing for shared resources (e.g. the Data Warehouse, which was vastly more demanding). It seemed that Grid Control’s LUNs were not able to survive in the SAN cache. The old EMC CX700s were relatively “dumb” compared to the new XP20000s, and had effectively dedicated an amount of cache to each LUN. The more intelligent XPs would not be so potentially wasteful of resources, so they deemed the (relatively) low level of activity from the Grid Control LUNs to be unimportant and aged the blocks out.

We could live with this for Grid Control. Blisteringly fast response time is not a critical factor in the success of Grid Control. The same could not be said for one of our mission-critical applications. It requires a very low end-to-end response time for each transaction. We carefully measure each step – client, calculation, render, network latency, database response time, etc. It needs to be fast and consistent. From the database, it really needs to hit cache or SSDs. Spinning rust (ordinary disk) is simply too slow. However, the level of activity for this system, in terms of MB/s and IOPS, is relatively low compared to all of the other systems competing for SAN resources. About 3% of the whole. Add to this the overhead of Thin Provisioning and we have a problem. When a 2ms average turns into a nearly 4ms average with much higher peaks, as it did for db_file_sequential_read on Grid Control, that would be a major problem for this application.

Solution

Talking to HP, they would only guarantee the response time we demanded by using SSDs, at a cost that made my eyes spin and think of buying a supercar instead. A more practical (i.e. cheaper) solution had to be found, and it was (with the help and support of HP, just no guarantees 🙂 ). We bought some more conventional disk and configured it into a traditional RAID-10 storage pool. No Thin Provisioning. We then partitioned the 16GB SAN cache into a 4GB cache dedicated to this system, and left the 12GB cache behind for everything else. We were migrating from 4GB, and understood that we had plenty of wriggle room in this configuration.

And the result? After migration, the performance of the mission-critical, low-throughput system improved substantially, with consistent log_file_parallel_write times of less than 2ms and db_file_sequential_read times of less than 1ms, better than we were experiencing on the old reliable EMCs. I mean, less than 1ms? That’s fast.

I/O Response Times – Graph

The 10046 trace. Largely useless, isn’t it?

The other night I was sat in the pub with some like-minded individuals discussing the relative merits of the 10046 trace (we Rock! in the pub, dudes!) and somebody asked me how often I had actually used it in anger. A well-respected DBA / Architect maintained it was a pretty useless and difficult option to use, given the topology of modern applications (e.g. how do you find the correct session with all that connection pooling going on from multiple web servers?).

My answer surprised me – I thought back to one client where I spent 90% of my time performance tuning a large (TiB-ish) OLTP/batch hybrid system, and concluded that I had run a 10046 against Production about once a year. Once. So if the 10046 is the holy grail of plan information, why wasn’t I using it that much? And why did I never use a 10053 against Production there?

The answer for me is a little more complex than that given in the pub:

1. as stated above, it’s hard to catch the in-flight session unless the application is instrumented to inject the trace statement when needed (and how many applications are instrumented to help you discover problems? Screen ST03 in SAP is very helpful. Any others in major ERPs? Thought not.)

2. In many places that I have worked, getting authorisation to make any change to a 24×7 mission-critical system is highly bureaucratic, involving cast-iron justification for the change and its positive benefits, a requirement that there will be no adverse effects because of the change, very senior sign-off, more red tape, etc. This causes a significant amount of work simply to put a trace on, even if you can catch the SQL. It can end up being more work than actually fixing the problem.

3. An awful lot of SQL tuning is a fairly blunt affair, as the developer (who is frequently database-blind) has usually missed something obvious. It is frequently to do with incorrectly using or not using an index (or using a poor index), or a lack of filtering at the right point to minimise the I/O.

4. Most importantly, if you have AWR and ASH, it’s not really needed. For each plan created by the optimizer, the database stores the bind variables along with it, so we can usually understand why the optimizer makes the decisions it makes. ASH contains the main event waits. Why bother trying to capture all of the detail in a trace when you really don’t need that much detail, and it’s all already there, ready to be extracted from the relevant views (e.g. dba_hist_active_sess_history, dba_hist_sql_plan and dba_hist_sqlbind)?
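A minimal sketch of the kind of dig I mean, pulling the waits, the historical plans and the captured binds for one statement out of AWR/ASH (the SQL_ID is made up, and these views need the Diagnostics Pack licence):

-- What was each plan waiting on? (hypothetical SQL_ID)
SELECT ash.sql_plan_hash_value, ash.event, COUNT(*) AS samples
  FROM dba_hist_active_sess_history ash
 WHERE ash.sql_id = 'abcd1234efgh5'
 GROUP BY ash.sql_plan_hash_value, ash.event
 ORDER BY samples DESC;

-- The historical plans themselves.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('abcd1234efgh5'));

-- Which bind values were captured with the statement?
SELECT sb.snap_id, sb.name, sb.value_string
  FROM dba_hist_sqlbind sb
 WHERE sb.sql_id = 'abcd1234efgh5'
 ORDER BY sb.snap_id, sb.position;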

I have never used a 10053 trace on a Production system. I have simply never needed to know the decisions taken by the optimizer in that much detail. Like most DBAs and Oracle consultants, I don’t go from site to site on a weekly basis resolving edge-case problems that the incumbent DBAs haven’t had the time, or possibly don’t have the skills, to resolve themselves. I usually don’t need that level of confirmation that I’m right about why the plan is wrong, and I don’t have the time to conclusively prove it over and over again – I just need to get the fix into place and move on to the next problem.

That said, perhaps I should get fully to the bottom of these problems to ensure that they never occur again – which is the fundamental problem with Adaptive Cursor Sharing.
