The 10046 trace. Largely useless, isn’t it?

The other night I was sat in the pub with some like-minded individuals discussing the relative merits of the 10046 trace (we Rock! in the pub, dudes!) and somebody asked me how often I has actually used it in anger? A well-respected DBA / Architect maintained it was a pretty useless and difficult option to use, given the topology of modern applications (e.g. How do you find the correct session with all that connection pooling going on from multiple web servers.)

My answer surprised me – I thought back to one client where I spent 90% of my time performance tuning a large (TiB-ish) OLTP/Batch hybrid system and concluded that I had ran a 10046 against production about once a year. Once. So if the 10046 is the holy grail of plan information, why wasn’t I using it that much. And why did I never use a 10053 against Production there?

The answer for me is a little more complex than that given in the pub:

1. as stated above, it’s hard to catch the in-flight session unless the application is instrumented to inject the trace statement when needed (and how many applications are instrumented to help you discover problems? Screen ST03 in SAP is very helpful. Any others in major ERP’s? Thought not.)

2. In many places that I have worked, getting authorisation to make any a change to a 24×7 mission-critical system is highly bureaucratic, involving cast-iron justification for the change and it’s positive benefits, requirement that there will be no adverse effects because of the change, very senior sign-off, more red-tape, etc. This causes a significant amount of work simply to put a trace on, even if you can catch the SQL. This can end up being more work than actually fixing the problem.

3. An awful lot of SQL tuning is a fairly blunt affair, as the developer (who is frequently database-blind) has usually missed something obvious. It is frequently to do with incorrectly using or not using an index (or using a poor index), or lack of filtering data at the right point to minimise the I/O.

4. Most importantly, if you have AWR and ASH, it’s not really needed. For each plan created by the optimizer the database stores the bind variables along with it, so we can usually understand why the optimizer makes the decisions it makes. ASH contains the main event waits. Why bother trying to capture all of the detail in a trace when you really don’t need that much detail, and it’s all already there; ready to be extracted from the relevant tables (e.g. dba_hist_active_sess_history, dba_hist_sql_plan and dba_hist_sql_bind.)

I have never used a 10053 trace on a Production system. I have simply never needed to know the decisions taken by the optimizer in that much detail. Like most DBA’s and Oracle consultants, I don’t go from site-to-site on a weekly basis resolving edge-case problems that the incumbent DBA’s haven’t had the time, or possibly don’t have the skills, to resolve themselves. I usually don’t need that level of confirmation that I’m right about why the plan is wrong, and I don’t have the time to conclusively prove it over and over again – I just need to get the fix into place and move onto the next problem.

That said, perhaps I should get fully to the bottom of these problems to ensure that they never occur again – which is the fundamental problem with Adaptive Cursor Sharing.

13 Responses to The 10046 trace. Largely useless, isn’t it?

  1. Pingback: How deep to dig – Another Opinion and Another Good Blog « Martin Widlake's Yet Another Oracle Blog

  2. Tim Hall says:

    Hi.

    Getting the trace “in flight” is easy using DBMS_MONITOR if the developers have used DBMS_APPLICATION_INFO to write module and action information to the V$SESSION view. Trouble is they never do. 🙂

    In many cases, it’s ASH or nothing. If you don’t have the D&T pack, that’ll be nothing then… 🙂

    Agree with the differentiation between DBA and performance consultant to the stars. Getting things working again is the number one priority for most DBAs, even if that means never getting to the bottom of the problem.

    Cheers

    Tim…

    Like

    • Tim,

      You can simulate bits of ASH/ASM with a bunch of scripts querying v$ tables, but it’s painful compared to the built in stuff, and SQL Monitoring too.

      Note to license purchasers – get the D&T pack, it pays for itself.

      regs

      Neil.

      Like

    • Dom Brooks says:

      All good points.

      As you say, most sql performance issues are diagnosable via the execution plan (including the predicate section of course) and there are very few other serious performance issues which don’t leave big clues lying aroung AWR & ASH.
      And the excellent real-time sql monitoring is a mixture of both features above – and actually so much easier for parallel executions – meaning that there’s even less need for a 10046.

      The main usage that I’ve had for 10053s is for confirming the odd optimizer bug and even then it’s usually easier to rewrite the query to avoid than to go raising bugs and going through the internal red tape for getting patches approved/applied.

      But it all depends on license and version. And I have had to ask for a number of 10046 production traces on 9.2.0.8 recently after being asked the standard challenge of saying why a particular process in the batch overran using only statspack.

      Cheers,
      Dominic

      Like

    • even, if you know the SQL_ID, you can trace it. please see
      http://blog.tanelpoder.com/2010/06/23/the-full-power-of-oracles-diagnostic-events-part-2-oradebug-doc-and-11g-improvements/ – if you need to flush the statement from shared pool for any reasons, it’s not that complicated: http://blogs.oracle.com/mandalika/entry/oracle_rdbms_flushing_a_single
      this should solve 1) and reduce the problems with 2)
      for 4) it’s often funny to see how customers react if you ask them for valid licenses – where only 10046 is for free – and helps in negotiating 2)
      I just agree, 10053 is REALLY rare – probably that’s the reason why there are more tools to visualise 10046 than 10053 🙂

      Like

      • Martin,

        Thanks for highlighting that. I was unaware that you could enable trace for a single SQL-ID in 11G. I do know sites that wouldn’t allow you to do that as it’s undocumented and therefore technically “unsupported” but that’s so useful (thanks Tanel, brilliant as always).

        regards

        Neil

        Like

  3. coskan says:

    After SQL Monitoring and session snapper I can say I use tracing just for additional opinion. I maybe usually gather trace just because I think I like the output of OraSRP 🙂

    Like

  4. Kevin Fries says:

    Transaction codes ST03, ST03N or ST03G isn’t that useful in SAP. TCode STAD is a lot more useful as an aggregate to pinpoint a bad application execution in either foreground or background.

    STAD doesn’t quite account for DB locks, enqueues (SAP’s application-based row locking) accurately and misses some of the stuff at OS level. But as you tangentially point out ST03 does this based on the instrumentation of the SAP Kernel. There’s other separate pieces to the instrumentation to trace the buffering, gateways, roll-in/out of programs and the like. STAD basically points to the aggregate of the time spent which only tells me where I might want to start looking. Once I locate a likely culprit in SAP, the next step is either an SQL, system, RFC (communications) trace or most importantly the ABAP or Java traces. The latter two are important as they tell you how many times a routine has executed and how long it took as well as how much time was spent processing in the DB or in the internal code operations of the program(s). It also, in later versions allows a direct look at the offending source code without having to step through their (reasonably good and always improving) debugger.

    Your base point is correct, I don’t do very many 10046 traces on SAP and it’s instrumented quite well. There’s also an “end to end” trace facility which has taken forever to get working and is part of a bloated creature called “Solution Manager” but that’s another story.

    Most of the reason I don’t need a 10046 trace in SAP is that the custom code is so bad that it’s readily identifiable and easy to spot via the other tools. Of course getting it corrected is another matter. I have one particularly clueless mess that’s coming up on it’s second anniversary with no fix in sight from the folks who are supposed to correct it despite explicit instructions.

    Like

    • Kevin,

      I’m not a SAP person and was just using ST03 as an illustration (it only proves in which layer the performance problem lies), and have only a little experience of SAP. All big Enterprise systems (Apps, Peoplesoft, JDE, SAP, etc) have serious faults due to their generic nature and significant size, but SAP has implemented some very good instrumentation which really does help. ST03 was just one that I had used and could remember. Thank you for your input – some really good learning for me too 🙂

      I sympathise with the fix delay issue. Sometimes you just have to fix it yourself and send them the code with an “implement this please” post-it attached. 😐

      regards

      Neil.

      Like

  5. Pingback: How deep to dig – Another Opinion and Another Good Blog « Ukrainian Oracle User Group

  6. Kevin Fries says:

    Your comments are noted, I have to deal with folks who stop at ST03 and blame the system, network, SAP and never their code. So just in case one of them is reading this, it couldn’t hurt to clarify.

    On to the second part, I have noted corrections and suggested sample code as well as detailed explanations of why it is bad, wrong or just not needed[1]. It doesn’t seem to matter. One of the more frequent responses is that they didn’t understand point “X” and thus disregarded the entire set of points rather than ask for an explanation. As such, I simply maintain a little niche where all the instances I’ve pointed out things to correct can be viewed, complete with all communications and dates. This serves to defend not just myself but also my co-workers. We are constrained here by silos of job duties. In fact, I’m not allowed to change code.

    [1] Q. What’s the quickest way to do something? A. Don’t do it. <- Hats off to Cary for this!

    Like

  7. Graham says:

    All very good points. Personally I use 10046 a little more often, but still only a handful of times a year. The incorporation of capturing 10046 directly in sql developer using Cary’s plug-in (MR-Trace) should certainly see the number of times it’s used in development increase – I was blown away at how simple this tool makes it. Having said that I don’t believe it will help in production situations when you’re trying to capture ‘in flight’.

    Like

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.