Query Suddenly Slow - Troubleshooting

My Query Suddenly Slow - The Background

My query is suddenly slow, or the database is suddenly slow. For any DBA this is a common complaint from our beloved users, no matter what database server being used this is an inevitable.
So what should be the approach to troubleshoot this type of request from an angry user who has already passed the deadline to provide a report? (Normally they all start after the deadline and hope the database will do the magic ::D)
Below are set of questions, tasks that can be performed on Oracle, to correctly identify and fix the situation.

My Query Suddenly Slow - The Questioner

1. Slow? Is it for single user, subset of users or all the users (Application)
If it is a particular query executed by one user, or particular department accessing the system via particular floor, we can always start looking for external clues, such as network delays, a bad application query impacting a particular function etc.. If the all the backend users  and application end users connecting to a database is calling you, then the chances are it is most probably a db wide issues.
Here we will assume the single query slowness scenario. So it is a query executed by backend user and suddenly the query is slow compared to previous executions.

2. Were there any changes to the query sql, data set accessed by the query or the execution start time of the day?
This should help to identify any user triggered condition that resulted in the slowness. But normally the response is firm “No”. With this we can start out “technical investigation”.


3. What does the session info tell?
Quick look at the (g)v$session info or look at the cloud control screens and verify whether there is higher than normal overall load in the database level overall. Best option is the Cloud Control or similar monitoring tool screen, or you may query the session table using below query or similar. Given that you have a rough idea on number of sessions that can be active during normal operations any anomaly can be identified.
 select sid, state, status, event, wait_class, sql_id, program, machine, logon_time, username, blocking_session, last_call_et, seconds_in_wait, sql_exec_start, p1, p2, p3 from gv$session where username is not null and status!='INACTIVE';  

4. Finding more on the query is question.
If the overall database load is acceptable then it is required to concentrate on the query in question. In order to pin point to the session and the sql_id, attributes such as  username, osuser, logon_time, module or program can be used. For the given example we will assume that the osuser of the complainer is known,
 select sid, state, status, event, sql_id, program, machine, logon_time, username, blocking_session, last_call_et, seconds_in_wait, sql_exec_start, p1, p2, p3 from gv$session where osuser=’COMPLAINER’;  
In the output, critical points to consider are,
1.   Whether the complainers session is on blocking state?
a.    If the blocking_session is pointing to another session, we need to concentrate on
same and follow the same steps.
2.    What is the wait event, session is in and for how long (seconds_in_wait)?
a.    is the session is in normal wait event such as User I/O, or log switch related wait,
or cluster wait? Based on the wait event we need to dig further. If it is assumed that the
wait event is not usual, we can check the query history as per step 5. 
3.    Whether the session is on completely stuck stage. i.e no change in p1, p2, p3 and v$sesstat parameters are not changed. 

If the wait events not pointing to exact direction the best way forward is to check the sessions execution history.

5. The history of the query.

Checking the query execution history will confirm whether the query actually used to run faster. Below query can be used to check the same. Please note the usage of the query requires licenses. And I copied the query from internet a long time back, credit should go to original contributor.
 set lines 155  
 col execs for 999,999,999  
 col avg_etime for 999,999.999  
 col avg_lio for 999,999,999.9  
 col begin_interval_time for a30  
 col node for 99999  
 break on plan_hash_value on startup_time skip 1  
 select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,  
 nvl(executions_delta,0) execs,  
 (elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,  
 (buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta)) avg_lio  
 from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS  
 where sql_id = '<sql_id in question from v$session>'  
 and ss.snap_id = S.snap_id  
 and ss.instance_number = S.instance_number  
 and executions_delta > 0  
 order by 1, 2, 3  
 /  

Below is an example output from the above query.

query suddenly slow plan change


What we are looking here is the changes in plan_hash_values. Different plan hash values with different execution times confirms a plan change causing a sudden slowness.  In the sample output above, with plan 4233095222, average execution time was ~0.02 but with new changed plan 4086530757, the time was ~4.7. In this type of scenario, we can check for stale stats, availability of sql profiles or we can use SQL base line.

By reaching the 5th step it should be possible to get an idea on the mystery behind the sudden slowness of the query and hopefully the possible solution as well.


Comments

Popular posts from this blog

ORA-16433: The database or pluggable database must be opened in read/write

Wait for unread message on broadcast channel - Blocking Sessions

ORA-14126 - While splitting max value partition