Parallel Madness

usain-bolt-1I’ve noticed at a few clients with data warehouses recently that the Developers and, upon occasion, Business Users have a real fondness for hinting the SQL they are producing with one particular hint. PARALLEL.

As any fule kno, this IS the magic go-faster hint. PARALLEL(2) is obviously twice as fast as serial execution. PARALLEL(4) is amazing and PARALLEL(64) like Usain Bolt on Red Bull.

The problem is that, like all database features, parallel query comes with a cost.

When you specify /*+ PARALLEL(n) */ you are telling the optimizer that it has a lot more resources to use to complete this particular query. Not a single thread but many. PARALLEL(10) will use 21 processes to complete its execution – 20 Parallel Execution Server (10 producers, 10 consumers) and a coordinator (which is your connections shadow process) which will deal with any aspects of the parallel plan which cannot be parallelised.

Allowed free reign to use PARALLEL, devs and users will quickly consume all of the resources available on a given server, causing contentions which will inevitably slow down the overall execution of every piece of code executing on there. To illustrate this, I’d like to use an example I came across a while ago to show how excess PARALLEL of a single statement can be problematic itself.

Lets say I have a single server with 16 cores, lots of memory and a decent SSD array so the problem will centre around the CPU. Inevitably your 16 cores will be hyperthreaded. This then looks to Oracle like you have 32 cores. Whilst Oracle knows you have 16 hyperthreaded cores, you get CPU_COUNT=32

NOTE: 16 cores hyperthreaded DO NOT have the power of 32 cores, especially when dealing with databases. Some database workloads are actually WORSE with hyperthreading enabled (e.g. Data Warehouse systems on SQL Server). Inevitably the server admins will have enabled it unless you can provide cast-iron evidence to have it disabled.

I have a statement which the users are complaining about. It starts with the  following code: SELECT /*+ PARALLEL */ (complex multi-table join)

So what does this unrestricted (and therefore DEFAULT!) degree of parallelism (DOP) do in this case?
The default DOP is defined as PARALLEL_THREADS_PER_CPU x CPU_COUNT=2 x 32 = PARALLEL(64)
On RAC it is PARALLEL_THREADS_PER_CPU x CPU_COUNT x INSTANCE_COUNT!

Lets have a look at the ACTIVITY of this PARALLEL(64) query:

parallel_64_activity

You can see from the screenshot that Oracle knows there are 16 cores but it has gone PARALLEL(64), using 128 parallel exection slaves and fully expecting to have the available resources to run PARALLEL(64) efficiently. The execution plan is calculated around this assumption. There are 64 parallel execution slaves attempting to work on this at the same time. It’s worth looking at the metrics associated with this query.

Peaks of 2GB/s disk, 140GB of TEMP and 32 CPU’s.

parallel_64_metrics

The query took 36.9 minutes to complete.

I had the query changed to inject a modicum of realism into the available resources at the time of the run, and restricted the DOP to PARALLEL(8).

parallel_8_activity

Oracle is restricted to the limited amount of resource, which is availble. The execution plan is different, to reflect the lower amount of available resources. Looking at the metrics:

Peaks of 1GB/s, 3GB of TEMP and 12 CPU’s.

parallel_8_metrics

The query took 10.3 minutes to complete. 3 times quicker!

It is worth noting that testing the query in isolation with PARALLEL(16) took 7 minutes to complete, but that DOP would have resource-starved the server as a whole causing everything else currently executing to slow down, and was discounted as an option.

With PARALLEL, less can be better.
Using PARALLEL for everything can be counter-productive.
Co-ordinating PARALLEL across multiple RAC nodes can be disasterous without careful design to take advantage of this feature (using PARALLEL_LOCAL_FORCE=TRUE will restrict parallel processing to a single node). Oracle recommend you don’t do this. Other opinions are available and I generally recommend setting this to TRUE.

We have a limited amount of resources on our servers. Working within those resource limitations will provide substantial benefits.

Before using Parallel processing, I’d recommend thoroughly reading the VLDB and Partitioning Guide appropriate to your database release!

If you use PARALLEL:

  • Use Resource Manager to restrict DOP appropriately for different groups of users.
  • Consider setting PARALLEL_LOCAL_FORCE=TRUE
  • Consider setting PARALLEL_THREAD_PER_CPU=1, especially where you have hyperthreading enabled.
  • Consider your overall workload, not just the SQL you are working with right now.

 

Advertisement

The 10046 trace. Largely useless, isn’t it?

The other night I was sat in the pub with some like-minded individuals discussing the relative merits of the 10046 trace (we Rock! in the pub, dudes!) and somebody asked me how often I has actually used it in anger? A well-respected DBA / Architect maintained it was a pretty useless and difficult option to use, given the topology of modern applications (e.g. How do you find the correct session with all that connection pooling going on from multiple web servers.)

My answer surprised me – I thought back to one client where I spent 90% of my time performance tuning a large (TiB-ish) OLTP/Batch hybrid system and concluded that I had ran a 10046 against production about once a year. Once. So if the 10046 is the holy grail of plan information, why wasn’t I using it that much. And why did I never use a 10053 against Production there?

The answer for me is a little more complex than that given in the pub:

1. as stated above, it’s hard to catch the in-flight session unless the application is instrumented to inject the trace statement when needed (and how many applications are instrumented to help you discover problems? Screen ST03 in SAP is very helpful. Any others in major ERP’s? Thought not.)

2. In many places that I have worked, getting authorisation to make any a change to a 24×7 mission-critical system is highly bureaucratic, involving cast-iron justification for the change and it’s positive benefits, requirement that there will be no adverse effects because of the change, very senior sign-off, more red-tape, etc. This causes a significant amount of work simply to put a trace on, even if you can catch the SQL. This can end up being more work than actually fixing the problem.

3. An awful lot of SQL tuning is a fairly blunt affair, as the developer (who is frequently database-blind) has usually missed something obvious. It is frequently to do with incorrectly using or not using an index (or using a poor index), or lack of filtering data at the right point to minimise the I/O.

4. Most importantly, if you have AWR and ASH, it’s not really needed. For each plan created by the optimizer the database stores the bind variables along with it, so we can usually understand why the optimizer makes the decisions it makes. ASH contains the main event waits. Why bother trying to capture all of the detail in a trace when you really don’t need that much detail, and it’s all already there; ready to be extracted from the relevant tables (e.g. dba_hist_active_sess_history, dba_hist_sql_plan and dba_hist_sql_bind.)

I have never used a 10053 trace on a Production system. I have simply never needed to know the decisions taken by the optimizer in that much detail. Like most DBA’s and Oracle consultants, I don’t go from site-to-site on a weekly basis resolving edge-case problems that the incumbent DBA’s haven’t had the time, or possibly don’t have the skills, to resolve themselves. I usually don’t need that level of confirmation that I’m right about why the plan is wrong, and I don’t have the time to conclusively prove it over and over again – I just need to get the fix into place and move onto the next problem.

That said, perhaps I should get fully to the bottom of these problems to ensure that they never occur again – which is the fundamental problem with Adaptive Cursor Sharing.