Huge Pages on Exadata

Quick note to myself: Oracle has deprecated the initialisation parameter use_large_pages=TRUE on Exadata as of 19.13 (OK – it could be 19.12, I've not checked)

If you started your database in 19.11 with this set, you got yourself a warning in the alert log (if you had a look):
“WARNING: use_large_pages=TRUE is not a recommended setting for database on Exadata”

From 19.13, this has changed to:

“ERROR: use_large_pages = TRUE is not a supported setting for database on Exadata”
” : suggested values are AUTO_ONLY or ONLY”

So what’s the difference?

use_large_pages = TRUE – allocate memory out of huge pages. If there aren't enough huge pages, allocate the rest out of standard pages. This can result in performance degradation and increased translation lookaside buffer (TLB) and page table overhead to manage the standard pages (taking even more memory).

use_large_pages = ONLY – allocate memory out of huge pages. If there aren't enough huge pages, fail to start the database.

use_large_pages = AUTO_ONLY – allocate memory out of huge pages. If there aren't enough huge pages configured, configure some more by increasing the kernel parameter vm.nr_hugepages. If that fails (e.g. if the kernel can't free enough contiguous memory from standard pages to build the new huge pages), fail to start the database. This does result in an "up-only" scenario for huge page allocation, as there is no mechanism to automatically deallocate the huge pages once they have been allocated. The change is not written permanently to the kernel config, though, so a reboot would take care of it.
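
Before AUTO_ONLY came along, you would size this yourself. A minimal sketch (the page count of 3500 is purely illustrative – derive yours from your SGA size):

$ sudo sysctl -w vm.nr_hugepages=3500                                 # one 2MB page per 2MB of SGA, plus a few for overhead
$ echo "vm.nr_hugepages = 3500" | sudo tee -a /etc/sysctl.conf        # persist the setting across reboots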

So what should you do?

Well, the idea behind AUTO_ONLY seems to be Oracle's ongoing "death of the DBA" push, whereby the DBA doesn't need to understand huge pages and automatically gets the memory and performance benefits associated with using them.

No huge pages allocated:
$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB

------------------------------------------------------------------------

SQL> ALTER system SET use_large_pages='AUTO_ONLY' sid='*' scope=spfile;
 
System altered.

SQL> shutdown immediate 
ORACLE instance shut down.

SQL> startup nomount
ORACLE instance started.

Total System Global Area 3154114080 bytes
Fixed Size                  9139744 bytes
Variable Size            1207959552 bytes
Database Buffers         1929379840 bytes
Redo Buffers                7634944 bytes

(note: "startup nomount" is enough to allocate memory, so the huge_pages are generated at this point)

------------------------------------------------------------------------
And they have magically appeared! (note: total allocation will vary by a page or two)

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    1506
HugePages_Free:        3
HugePages_Rsvd:        3
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         3082240 kB

$ sysctl vm.nr_hugepages
vm.nr_hugepages = 1506
Oracle must be changing the setting and performing a "sysctl" to allocate the pages. If anyone knows the algorithm Oracle uses to decide how many pages should be allocated to the SGA (and "stolen" from potential PGA and O/S utilisation), and at what percentage Oracle stops taking from the PGA and O/S, please let me know.

An 8G (7972M available) server is allowed an SGA of 6,906M (86% of available memory). Not a lot left for PGA there.
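
As a rule of thumb it's one 2MB huge page per 2MB of SGA, so a quick back-of-the-envelope check for the server above:

$ echo $(( 6906 / 2 ))    # 2MB pages needed to back a 6,906M SGA (plus a couple for the fixed SGA)
3453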

Set it too large and you could hit
"ORA-27104: system-defined limits for shared memory was misconfigured"
which might result in
"ORA-00700: soft internal error, arguments: [sms physmem limit], [6828326912]... "
or
"ORA-00700: soft internal error, arguments: [pga physmem limit], [1048576000]..."

due to being unable to allocate the PGA/SGA, accompanied by:

"WARNING: pga_aggregate_limit value is too high for the
amount of physical memory on the system
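
A quick sanity check before you get anywhere near these errors is to compare your memory parameters against what the box actually has:

SQL> show parameter sga_max_size
SQL> show parameter pga_aggregate_limit

$ grep MemTotal /proc/meminfo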


I'd upped the SGA_MAX_SIZE to 6516M, started the DB, lowered it to 3000M and restarted the DB without a server reboot and...

Total System Global Area 3154114080 bytes
Fixed Size                  9139744 bytes
Variable Size             637534208 bytes
Database Buffers         2499805184 bytes
Redo Buffers                7634944 bytes
Database mounted.
Database opened.

SQL> !cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    3258 <- that's a lot of 2M pages (6516M)
HugePages_Free:     1752 <- we've only allocated 3000M. No cleanup without a reboot
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         6672384 kB
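
As an aside, you don't necessarily need the reboot: lowering vm.nr_hugepages by hand should give the free pages back, as the kernel will release any huge pages that aren't in use (it silently keeps the in-use ones):

$ sudo sysctl -w vm.nr_hugepages=1506    # kernel frees unused huge pages; pages backing a running SGA are kept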


Happy huge pages everyone and keep checking that alert log.

Exadata System Statistics

Following on from last week's Oracle Optimizer System Statistics post, I thought it worthwhile adding a note about gathering system statistics in Exadata mode.

exec dbms_stats.gather_system_stats('EXADATA');
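
You can see what the gather actually recorded by querying the system statistics directly:

SQL> select pname, pval1 from sys.aux_stats$
  2  where sname = 'SYSSTATS_MAIN'
  3  and pname in ('MBRC','IOTFRSPEED','IOSEEKTIM');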

So what is this, and why would you choose to do it? First of all, these are not workload system statistics (which I believe you should never gather without extraordinary proof of their effectiveness – all Oracle official documentation and MOS notes about workload stats should point to this blog post now, explaining why: https://blogs.oracle.com/optimizer/should-you-gather-system-statistics). Workload stats measure a number of I/O and CPU metrics. Exadata stats are a little bit of measurement (I/O transfer speed – IOTFRSPEED) and the hard-setting of the multiblock read count (MBRC). How are these set?


The IOTFRSPEED is the rate at which an Oracle database can read data in a single read request, in bytes per millisecond. It is measured and set (subject to bug 17501565.8 – pre-12.1 – not getting in the way and leaving it at the default). The default is 4096, but after measuring and setting it will increase significantly. For the worked example below, let's say the speed was measured at 200,000 (about 200MB/s, which is a good number for an Exadata).

The MBRC is set by copying in the value of db_file_multiblock_read_count from the initialization parameters. By default this is not set explicitly, and the optimizer relies upon two hidden parameters instead. If your block size is 8k it will probably be 128, and if your block size is 16k it will probably be 64. If your block size is any other size, I hope you have a good, tested and proven reason for your edge-case configuration.
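
You can check whether it has been set explicitly on your system – ISDEFAULT tells you whether you are on the calculated value:

SQL> select value, isdefault from v$parameter
  2  where name = 'db_file_multiblock_read_count';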

Each of these settings changes the balance the optimizer strikes between single block reads (to read index blocks, and to read table rows by rowid) and full table scans. Higher MBRC and IOTFRSPEED values mean table scans become more attractive to the optimizer, so they will occur more frequently. And what does an Exadata do very well? Offload table scans to storage cells.

DEFAULT, 8K Block Size
MBRC=8 (db_file_multiblock_read_count is not set, so use _db_file_optimizer_read_count)
IOTFRSPEED=4096
Calculated Ratio 0.271, or 3.69 multiblocks for every single block.

DEFAULT, 16K Block Size
MBRC=8 (db_file_multiblock_read_count is not set, so use _db_file_optimizer_read_count)
IOTFRSPEED=4096
Calculated Ratio 0.375, or 2.67 multiblocks for every single block.

16k block sizes mean indexes are slightly more likely to be used, as we get fewer blocks for the cost of a single block read (2.67 vs 3.69).
DEFAULT, 8K Block Size, db_file_multiblock_read_count set explicitly
MBRC=128 (db_file_multiblock_read_count is explicitly set, so use the explicit 128)
IOTFRSPEED=4096
Calculated Ratio 0.173, or 5.77 multiblocks for every single block.

Table scans are a little more attractive than the default: you get 5.77 blocks for the cost of every single block read instead of 3.69.
EXADATA GATHERED, 8K Block Size, db_file_multiblock_read_count NOT set explicitly
MBRC=128 (db_file_multiblock_read_count is not set, so set it to the value of _db_file_exec_read_count)
IOTFRSPEED=200000 – this was measured and set by the 'EXADATA' gather.
Calculated Ratio 0.012, or 84.32 multiblocks for every single block.

For an index to be used, it is going to have to be much more efficient than a table scan. Pulling a few rows out of unique indexes will continue to work well, but any kind of predicate which returns a set of data will be pushed towards a FTS (or IFFS), and therefore the Exadata will offload to the storage cells.
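
For reference, the ratios above fall out of the standard system-statistics timing formulas (IOSEEKTIM defaults to 10ms). A worked example for the default 8K case:

sreadtim = ioseektim + db_block_size/iotfrspeed          = 10 + 8192/4096     = 12ms
mreadtim = ioseektim + (mbrc * db_block_size)/iotfrspeed = 10 + (8*8192)/4096 = 26ms
ratio    = mreadtim / (mbrc * sreadtim)                  = 26 / (8 * 12)      = 0.271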

So how does this affect optimizer costing? Looking at a 10053 trace shows how the cost changes:

I/O COST WITH DEFAULTS: 270,835
[10053] SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for COST_CHECK[COST_CHECK]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: COST_CHECK  Alias: COST_CHECK
    Card: Original: 1000000.000000  Rounded: 1000000  Computed: 1000000.000000  Non Adjusted:1000000.000000
  Scan IO  Cost (Disk) =   270835.000000
  ...
I/O COST EXADATA GATHERED AS ABOVE: 11,860
[10053] SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for COST_CHECK[COST_CHECK]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: COST_CHECK  Alias: COST_CHECK
    Card: Original: 1000000.000000  Rounded: 1000000  Computed: 1000000.000000  Non Adjusted:1000000.000000
  Scan IO  Cost (Disk) =   11860.000000
  ...

So, the I/O cost with "Exadata" system stats is reduced from 270,835 to 11,860. That's quite a reduction.
Please bear in mind that this is only the optimizer's cost calculation for a table scan, and not the reality of accessing the data; it is used to decide the access path.
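
As a sanity check, both costs line up with the per-block ratios above if COST_CHECK is around 1,000,000 blocks (my inference from the numbers, not something the trace states directly):

1,000,000 blocks * 0.2708  ≈ 270,835   (defaults)
1,000,000 blocks * 0.01186 ≈  11,860   (Exadata stats)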

Before gathering "Exadata" stats, you need to ensure you have the capacity on your Exadata to absorb the workload increase on the storage cells. For a classic "overnight load, daytime query, dedicated Exadata" setup, this is probably a good thing. Where the storage cells are already being worked hard, this will make things worse.

If you followed the advice of the Exadata salesman and marketing people and already removed your indexes, you should already be employing a performance expert to put (some of) the indexes back on…

How did I calculate the ratios above? Check the Oracle Optimizer System Statistics blog post from last week.

UKOUG Technology Day


We have organised a UKOUG Technology Summit in Leeds on 16th May.

It's really a small conference!

This is available to UKOUG members and will cost 1 SIG pass – even Bronze memberships come with 1 SIG pass, so all members can attend at no extra cost.
Non-members can pay to attend, but you should take out a Bronze membership instead as it's about the same price!

Come along, network with your peers, and listen to some amazing talks about all aspects of the Oracle technology product set: the Database, RAC, Engineered Systems, Development, APEX, DevOps, Solaris 11.4 (!), and more. Learn how Oracle interacts and works with Kubernetes, Docker, Apache Kafka, Apache Impala, Chatbots, and just about everything else! 30 amazing talks!

Plus a keynote address from Oracle Executive John Abel and a social afterwards too!

All that, just 2 hours from London on the train!

You can find the agenda here.
