Exadata System Statistics

Following on from last weeks Oracle Optimizer System Statistics post, I though it worthwhile adding a note about gathering system statistics in Exadata mode.

exec dbms_stats.gather_system_statistics('EXADATA');

So what is this, and why would you chose to do it? First of all, these are not workload system statistics (which I believe you should never gather without extraordinary proof of their effectiveness – all Oracle official documenation and MOS notes about workload stats should point to this blog post now, explaining why: https://blogs.oracle.com/optimizer/should-you-gather-system-statistics). Workload stats measure a number of I/O and CPU metrics. Exadata stats are a little bit of measurement (IO Transfer Speed – IOTFRSPEED) and the hard-setting of the Multi-block read count (MBRC). How are these set?

https://unsplash.com/photos/HwNCyLWw7hw

The IOTFRSPEED is the rate at which an Oracle database can read data in a single read request. It is measured and set (subject to bug 17501565.8 not getting in the way – Pre 12.1- and leaving it to default). The default is 4096, but after measuring and setting it will increase significantly. For the worked example below, lets say the speed was measures at 200,000 (about 200MB, which is a good number for an exadata)

The MBRC is set by copying in the db_file_multiblock_read_count in from the initialization parameters. By default this is not set explicitly and relies upon 2 hidden parameters. If your block size is 8k, it will probably be 128, and if your block size is 16k it will probably be 64. If your block size is any other size, I hope you have a good tested and proven reason for your edge-case configuration.

Each of these setting will change the balance the optimizer will use between single block reads (to read index blocks, and to read table rows by rowid), and performing full table scans. Higher MBRC and IOTFRSPEED’s mean table scans become more attractive to the optimizer. This means they will occur more frequently. What does an Exadata do very well? Offload table scans to storage cells.

DEFAULT, 8K Block Size
MBRC=8 (db_file_multblock_read is not set, so use _db_file_optimizer_read_count)
IOTFRSPEED=4096
Calculated Ratio 0.271, or 3.69 multiblocks for every single block.
DEFAULT, 16K Block Size
MBRC=8 (db_file_multblock_read is not set, so use _db_file_optimizer_read_count)
IOTFRSPEED=4096
Calculated Ratio 0.375, or 2.67 multiblocks for every single block.

16k block sizes mean indexes are slightly more likely to be used as we get less blocks per scan.
DEFAULT, 8K Block Size, db_file_multblock_read is set explicitly
MBRC=128 (db_file_multblock_read is explicitly set, so use explicit 128)
IOTFRSPEED=4096
Calculated Ratio 0.173, or 5.77 multiblocks for every single block.

Tablescans are a little more attractive than default. You get 5.77 blocks for every multiblock read instead of 3.69.
EXADATA GATHERED, 8K Block Size, db_file_multblock_read NOT set explicitly
MBRC=128 (db_file_multblock_read is not set, so set it to the value of _db_file_exec_read_count)
IOTFRSPEED=200000 - this was measured and set by the 'exadata' gather.
Calculated Ratio 0.012, or 84.32 multiblocks for every single block.

For an index to be used it is going to have to be much more efficient than table scan.
Pulling a few rows out of unique indexes will continue to work well but
any kind of predicate which returns a set of data will be pushed towards 
a FTS (or IFFS), and therefore the exadata will offload to the storage cells.

So how does this affect optimizer costing? Looking at a 10053 trace to see how the cost is changed:

I/O COST WITH DEFAULTS: 270,835
[10053] SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for COST_CHECK[COST_CHECK]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: COST_CHECK  Alias: COST_CHECK
    Card: Original: 1000000.000000  Rounded: 1000000  Computed: 1000000.000000  Non Adjusted:1000000.000000
  Scan IO  Cost (Disk) =   270835.000000
  ...
I/O COST EXADATA GATHERED AS ABOVE: 11,860
[10053] SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for COST_CHECK[COST_CHECK]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: COST_CHECK  Alias: COST_CHECK
    Card: Original: 1000000.000000  Rounded: 1000000  Computed: 1000000.000000  Non Adjusted:1000000.000000
  Scan IO  Cost (Disk) =   11860.000000
  ...

So, the I/O cost with “Exadata” system stats us reduces from 270,835 to 11,860. That’s quite a reduction.
Please bear in mind that this is only the Optimizer Cost calculation of a table scan, and not the reality of accessing the data. This is used to decide the access path.

By gathering ‘Exadata’ stats, you need to ensure you have the capacity on your exadata to offload the workload increase on the storage cells. For a classic “overnight load, daytime query, dedicated Exadata”, this is probably a good thing. Where the storage cells are already being worked hard, this will make things worse.

If you followed the advice of the Exadata salesman and marketing people and already removed your indexes, you should already be employing a performance expert to put (some of) the indexes back on…

How did I calculate the ratio’s above? Check the Oracle Optimizer System Statistics blog post from last week….

Oracle Optimizer System Statistics

System Statistics Are a Little Complex


Oh System statistics! Should we gather them? Should we not?
What do they mean? What are they doing to the optimizer?

[tl;dr – in almost all cases, DON’T GATHER SYSTEM STATISTICS. DON’T SET MBRC. LET THEM DEFAULT!]

First you need to be armed with a piece of information. When Oracle optimizes your SQL, it produces a COST to compare each SQL execution plan. Exactly what is this COST? It’s not seconds. It’s not quite the amount of I/O required. It’s not an arbitrary figure either. Oracle Optimizer Cost is the cost of getting your data in terms of SINGLE BLOCK READS. Remember this for later.

Lets have a look at the system stats defaults :

SELECT sname,pname,pval1 FROM SYS.AUX_STATS$ ORDER BY 1,2;
SNAME                PNAME                               PVAL1
-------------------- ------------------------------ ----------
SYSSTATS_INFO        DSTART
SYSSTATS_INFO        DSTOP
SYSSTATS_INFO        FLAGS                                   0
SYSSTATS_INFO        STATUS
SYSSTATS_MAIN        CPUSPEED
SYSSTATS_MAIN        CPUSPEEDNW                           2911 (this will vary)
SYSSTATS_MAIN        IOSEEKTIM                              10
SYSSTATS_MAIN        IOTFRSPEED                           4096
SYSSTATS_MAIN        MAXTHR
SYSSTATS_MAIN        MBRC
SYSSTATS_MAIN        MREADTIM
SYSSTATS_MAIN        SLAVETHR
SYSSTATS_MAIN        SREADTIM

What we are looking for here is the the 3 metrics highlighted in colour. As we can see, the are not set. By default, we need to calculate those, or know where to find the defaults for those values to get to them. Once we have those 3 metrics, we can calculate the RATIO used by the optimizer to convert MULTIBLOCK READ into a cost metric of SINGLE BLOCK READS.

The CPU speed is used to calculate the COST of the query in terms of CPU required. This is small percentage of the query cost but it may help decide which plan is chosen. This is also converted from CPU cost into a cost metric of SINGLE BLOCK READS.

Lets get the values for MBRC, MREADTIM and SREADTIM.

MBRC is easy. If it is not explicitly set, it uses the init.ora parameter “db_file_multiblock_read_count” and uses that. However, by default (and also my recommendation) this should not be set, meaning Oracle will use the hidden parameter “_db_file_optimizer_read_count” to cost your queries. This defaults to 8. [note: this is not the value used in execution. Oracle attempts to do 1MB reads, and uses the value in “_db_file_exec_read_count” to control the multiblock reads at execution time. For an 8K block size, this is set to 128.

SREADTIM and MREADTIM are calculations based upon information we now have:
SREADTIM = IOSEEKTIM + db_block_size        / IOTFRSPEED = 10+(8192 /4096) = 12
MREADTIM = IOSEEKTIM + db_block_size * MBRC / IOTFRSPEED = 10+(8192*8/4096) = 26

Right! Now we have more information and can calculate a ratio. The multi block cost-per-block. This will allow us to take the number of blocks in (for example) a table scan [DBA_TAB_STATISTICS.BLOCKS statistic] and covert it to the cost metric os SINGLE BLOCK READS, meaning we can compare (for example) a FULL TABLE SCAN’s I/O directly with the I/O required for an Index Range Scan and Table Lookup.

multi-block cost-per-block = 1/MBRC * MREADTIM/SREADTIM = 1/8 * 26/12 = 0.270833

If we pull some data from a 10053 trace, where I have a table scan of a table containing 1,000,000 blocks, we should see the 1,000,000 blocks being converted to 270,833 “single block read blocks” for COST purposes.

[10053] SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for COST_CHECK[COST_CHECK]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: COST_CHECK  Alias: COST_CHECK
    Card: Original: 1000000.000000  Rounded: 1000000  Computed: 1000000.000000  Non Adjusted:1000000.000000
  Scan IO  Cost (Disk) =   270835.000000
  Scan CPU Cost (Disk) =   7411440000.000001
  Total Scan IO  Cost  =   270835.000000 (scan (Disk))
                       =   270835.000000
  Total Scan CPU  Cost =   7411440000.000001 (scan (Disk))
                       =   7411440000.000001
  Access Path: TableScan
    Cost:  271041.492812  Resp: 271041.492812  Degree: 0
      Cost_io: 270835.000000  Cost_cpu: 7411440000
      Resp_io: 270835.000000  Resp_cpu: 7411440000
  Best:: AccessPath: TableScan
         Cost: 271041.492812  Degree: 1  Resp: 271041.492812  Card: 1000000.000000  Bytes: 0.000000

Well we were close! There’s a fudge factor in play here, but 270,835 is pretty much 270,833 🙂

We can also work out the CPU element of the COST.
Scan IO  Cost (Disk) =   270835.000000
Cost: 271041.492812
so CPU cost must be 271041.492812 – 270835.000000 = 206.492812

Scan CPU Cost (Disk)=7411440000.000001 so doing the maths to convert to units of “single block read time”…

CPUSPEEDNW = 2,991 (Mhz, but we convert 2,991,000 Khz)

= 7,411,440,000 / 2,991,000 / SREADTIM
‭=  ‭2,477.913741223671013039 / 12
‭= 206.49281176863925108659311267135‬ cost of CPU in units of SREADTIM

So now you can see how your system statistics can fundamentally change the optimizer by changing the ratio of multiblock-reads to single block reads.

You want to play yourself and see what happens? Of course you do…

Here’s an excel spreadsheet where you can plug in your numbers and see what happens to the ratio between single and multi block reads to see how system stats influence the optimizer. By plugging in different numbers, you will see how complex the interactions are – so what’s the ideal numbers? Use the DEFAULTS in almost all cases** [WARNING – don’t go changing your system stats based on this statement. You might experience a significant amount of plan changes, some of which may be very very bad!]:


Also, this is my version of a script by Franck Pachot, and also based on work by Chris Antognini, which you can run against your system:

select pname,pval1,calculated,formula from sys.aux_stats$ where sname='SYSSTATS_MAIN'
model
reference sga on (
select 'Database Buffers' name,sum(bytes) value from v$sgastat where name in ('shared_io_pool','buffer_cache')
) dimension by (name) measures(value)
reference parameter on (
select name,decode(type,3,to_number(value)) value from v$parameter where name='db_file_multiblock_read_count' and ismodified!='FALSE'
union all
select name,decode(type,3,to_number(value)) value from v$parameter where name='sessions'
union all
select name,decode(type,3,to_number(value)) value from v$parameter where name='db_block_size'
union all
SELECT a.ksppinm name, to_number(b.ksppstvl) value FROM x$ksppi a, x$ksppsv b WHERE a.indx=b.indx AND ksppinm like '_db_file_optimizer_read_count'
)
dimension by (name) measures(value)
partition by (sname) dimension by (pname) measures (pval1,pval2,cast(null as number) as calculated,cast(null as varchar2(120)) as formula) rules(
calculated['MBRC']=coalesce(pval1['MBRC'],parameter.value['db_file_multiblock_read_count'],parameter.value['_db_file_optimizer_read_count'],8),
calculated['MREADTIM']=coalesce(pval1['MREADTIM'],pval1['IOSEEKTIM'] + (parameter.value['db_block_size'] * calculated['MBRC'] ) / pval1['IOTFRSPEED']),
calculated['SREADTIM']=coalesce(pval1['SREADTIM'],pval1['IOSEEKTIM'] + parameter.value['db_block_size'] / pval1['IOTFRSPEED']),
calculated['_multi block Cost per block']=round(1/calculated['MBRC']*calculated['MREADTIM']/calculated['SREADTIM'],4),
calculated['_single block Cost per block']=1,
formula['MBRC']=case when pval1['MBRC'] is not null then 'MBRC' when parameter.value['db_file_multiblock_read_count'] is not null then 'db_file_multiblock_read_count' when parameter.value['_db_file_optimizer_read_count'] is not null then '_db_file_optimizer_read_count (db_file_multiblock_read_count not set, which is good!)' else '= not sure so used 8' end,
formula['MREADTIM']=case when pval1['MREADTIM'] is null then '= IOSEEKTIM + db_block_size * MBRC / IOTFRSPEED' end||' = '||pval1['IOSEEKTIM']||'+('||parameter.value['db_block_size']||'*'||calculated['MBRC']||'/'||pval1['IOTFRSPEED']||')',
formula['SREADTIM']=case when pval1['SREADTIM'] is null then '= IOSEEKTIM + db_block_size / IOTFRSPEED' end||' = '||pval1['IOSEEKTIM']||'+('||parameter.value['db_block_size']||'/'||pval1['IOTFRSPEED']||')',
formula['_multi block Cost per block']='= 1/MBRC * MREADTIM/SREADTIM = 1/'||calculated['MBRC']||' * '||calculated['MREADTIM']||'/'||calculated['SREADTIM'],
calculated['_maximum mbrc']=sga.value['Database Buffers']/(parameter.value['db_block_size']*parameter.value['sessions']),
formula['_maximum mbrc']='= buffer cache blocks/sessions (small cache limiter) = ' || sga.value['Database Buffers']/parameter.value['db_block_size']||'/'||parameter.value['sessions'],
formula['_single block Cost per block']='relative to the multi blovk cost per block. Always 1!',
formula['CPUSPEED']='overrides CPUSPEEDNW when set',
formula['CPUSPEEDNW']='CPU speed Mhz - non workload',
formula['IOSEEKTIM']='IO seek time in ms',
formula['IOTFRSPEED']='IO transfer speed in KB/s',
formula['MAXTHR']='Maximum IO system throughput',
formula['SLAVETHR']='average parallel slave IO throughput'
) order by 1;

** there is a case for gathering ‘exadata’ system stats. Increasing the IOTRFSPEED to 200,000 and changing the MBRC to (probably) 128 or 64 will *really* change the ratios, forcing a lot of full table scans down onto the storage cells, instead of using mediocre indexes. This should be considered (and thoroughly tested) if you have a DW on a dedicated Exadata. From a little more on exadata stats, check this blog post

UKOUG Techfest19

I’ve just returned from Techfest19 – the UKOUG Flagship Oracle Tech conference – and it was simply excellent.

79374309_10158533454103840_8801012542173347840_o

Amazing Sunset from the Grand

It was held at The Grand Hotel in Brighton, and it was grand. If you missed it, you really missed out. 250 mostly amazing  presentations.

I did 2 main ones about Oracle Statistics, and also MC’d the Jonathan Lewis Optimizer Panel, alongside Martin Widlake. The highlight of my conference was asking a left-field question of an Oracle Architect about Full Table Scans, which has resulted in the potential for a change to the code to make them (possibly slightly) more efficient. Thank you for listening Roger!

78256648_10158533454068840_8935776057437454336_o

Oracle ACE Dinner

 

 

 

 


If you missed my 2 sessions on Database Statistics, you can find a webinar and a Blair-Witch style video of each of them here on my website:


78372397_10158533455058840_55580213999828992_o

UKOUG Guest Speaker, before the party

Oracle Tech Conferences

It’s mid-November 2019. The leaves have mostly fallen and the nights are drawing in. That can only mean… it’s time for Europes 2 biggest Independent Oracle User Groups to have their conferences.

The first, starting Tuesday 18th November, is DOAG – the German Oracle User Group – in Nuremberg.nuremberg I’m flying out there a couple of days beforehand as I have meetings, and it’s nice to travel and chill out too. This year I’m speaking about Oracle Database Statistics at both conferences, but I only have time for the “basics” talk at DOAG so I’ll try to get some Unconference time to talk about when it gets harder.

Then, from the 1st to the 4th December, it’s the UKOUG Techfest 2019. This is slightly different this year as we have split the main conference, comprising Tech, Business Apps and JDE into 3 separate components at different times of the year. The Tech-only conference has stayed in the December slot (with Business Apps moving to June).

We have also moved the conference to Brighton. New location, Great town.

I’ll be talking about  Oracle Database Statistics, but this time in 2 slots, Database Stats #1. Doing it Right, the Easy Way (Monday 2nd December 09:00), then in a bit more depth I’ll do Database Stats #2. Doing it Right, When it’s Harder (Wednesday 4th December 11:00)

For a UKOUG “conference survival guide”, I recommendbrighton-autumn UKOUG Techfest Survival Guide 2019. All you need to know about getting the most out of 3.5 days of Tech shenanigans. Turn up, learn, but also try to meet people and have fun! I hope to see you at one of both of these conferences, maybe have a coffee or somethings stronger – easy on the stalking though (read Martins Survival Guide first)

If you have not registered for Techfest 2019, you can do that here: UKOUG TechFest 2019!

There a full agenda of around 250 Oracle-related Tech talks. The problem isn’t trying to work out if there’s enough to see, but trying to work out how to prioritise all of them!

See you there!

techfest19_2000x400_techfest

Oracle 19C New Feature Availability

trump-exasperatedSince Oracle Open World 2018, Oracle have been trumpetting a few cool new features in the Oracle 19C database, the headline two for administrators being “Automatic Indexing” and “Real-Time Statistics“.

With the release of Oracle 19.2 on Exadata (on-premises – not yet on Cloud!) this week, we also got the documentation released which allow us to answer a very important question: on which platforms will we be able to use these 2 (and other) new awesome features. We may not like the answers – Oracle has decreed that we cannot have them on-premises for SE2, Enterprise Edition or on an ODA.

This is confined purely to Exadata on-prem and the cloud offerings. That’s it. There are no technical reasons why this should be the case – it’s just code – so the restrictions can only be marketing-based (like Hybrid Columnar Compression).

I’m disappointed, but it does mean that the over 80% of Oracle database clients still not living in the cloud should continue to need that aspect of my services!

Oracle Factoring

OracleLicensingFactoring

 

Pre-creating Interval Partitions

One of the major problems with interval-based partitioning is the statistics. Partitions appear dynamically as-needed based upon data being inserted or udpated, and any partition magically appearing isn’t going to have any statistics.

This is generally a stability issue as you will, at best, be using dynamic statistics for your optimizations. So how do we deal with it? My preferred method is to pre-create the partitions and copy statistics from a good partition into the pre-created partition. But how do we get the partitions to appear? Here’s 2 options:

  1. Insert data into the row with the partition key for the partition we wish to create, and rollback. This can be tricky, especially with tables containing many NOT NULL columns, and is subject to failure based upon table evolution.
  2. Lock the relevant partition in shared mode using the commandLOCK TABLE .x. PARTITION FOR <partition-key> IN SHARE MODE;

    This will place a shared lock on the non-existant partition, which Oracle will create so it can lock it. A much less messy solution, and not one I had thought of until shown the light by Dan Jankowski.

So does option 2 work? Here’s a quick example:

10:51:55 NEIL @ ORCL01 > CREATE TABLE interval_table (id NUMBER, created_date DATE)
10:51:55   2             PARTITION BY RANGE (created_date) INTERVAL (NUMTOYMINTERVAL(1,'MONTH'))
10:51:55   3           ( PARTITION part_01 values LESS THAN (TO_DATE('01-JAN-2015','DD-MON-YYYY')))
10:51:55   4  /
Table created.

10:51:55 NEIL @ ORCL01 > select table_name, partition_name,high_value from user_tab_partitions order by 1,2;
TABLE_NAME                     PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------ --------------------------------------------------------------------------------
INTERVAL_TABLE                 PART_01                        TO_DATE(' 2015-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA


use a shared lock to generate a new partition

10:51:55 NEIL @ ORCL01 > lock table interval_table partition for (to_date('01-JAN-2016','DD-MON-YYYY')) in share mode;
Table(s) Locked.

10:51:55 NEIL @ ORCL01 > select table_name, partition_name,high_value from user_tab_partitions order by 1,2;
TABLE_NAME                     PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------ --------------------------------------------------------------------------------
INTERVAL_TABLE                 PART_01                        TO_DATE(' 2015-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P647                       TO_DATE(' 2016-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA

10:51:55 NEIL @ ORCL01 > lock table interval_table partition for (to_date('01-FEB-2016','DD-MON-YYYY')) in share mode;
Table(s) Locked.
10:51:55 NEIL @ ORCL01 > lock table interval_table partition for (to_date('01-MAR-2016','DD-MON-YYYY')) in share mode;
Table(s) Locked.
10:51:55 NEIL @ ORCL01 > lock table interval_table partition for (to_date('01-APR-2016','DD-MON-YYYY')) in share mode;
Table(s) Locked.
10:51:55 NEIL @ ORCL01 > select table_name, partition_name,high_value from user_tab_partitions order by 1,2;
TABLE_NAME                     PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------ --------------------------------------------------------------------------------
INTERVAL_TABLE                 PART_01                        TO_DATE(' 2015-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P647                       TO_DATE(' 2016-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P648                       TO_DATE(' 2016-03-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P649                       TO_DATE(' 2016-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
INTERVAL_TABLE                 SYS_P650                       TO_DATE(' 2016-05-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA

10:51:55 NEIL @ ORCL01 > -- and release the locks... the rollback doesn't rollback the creation of the new partitions.
10:51:55 NEIL @ ORCL01 > rollback;
Rollback complete.

So now what? To get the stats right I follow the following rule set:

Firstly, lock the table stats when you create the table and have a dedicated, focused stats job. This stop the general stats job from getting in the way of this this important partitioned table.

  • Create a new partition
  • Copy stats using DBMS_STATS.COPY_TABLE_STATS from your fullest partition (with FORCE=TRUE to override the locked stats option). Always try to pretend you have more data than you really have if faking it – it’s safer as an optimized plan with larger expected data sets processing a smaller data set tends to work much better than the stats assuming a small data set and being surprised by lots of data. Consider using SCALE_FACTOR if you have a growing dataset. Don’t reply upon Optimizer magic, such as Adaptive Execution Plans to dig out of a hole.
  • Once a partition becomes “old” (like a no-longer used date-range partition), gather actual stats to get all your stats for that partition accurate. That may even become your new baseline stats copy partition. You will possibly never need to gather stats again for that partition.
  • Be careful if you are copying stats when you have an object-level change. For example, if you put a new index on, you need to re-copy stats post change to any pre-created partitions. Empty pre-created partitions will have accurate (i.e. empty) stats and that’s really not what you want.
  • Repeat as often as you pre-create a new partition