Wednesday, May 31, 2006

Hang Analyze

I got an email from a user about incredibly slow response times and never ending queries. When I attempted to check for users holding locks, the query would lock up and not run. So, I decided to run a hang analyze in order to get some trace dumps for Oracle Support. Here is how:

Login as sysdba

oradebug hanganalyze 3;

wait 2 minutes and note the location of the trace file

oradebug hanganalyze 3;

Start another sqlplus session as sysdba

ALTER SESSION SET EVENTS '10998 trace name context forever, level 1';

alter session set events 'immediate trace name SYSTEMSTATE level 10';

After generating the dumps, used the UNIX top command to look for Oracle processes that were hung. Found an Oracle user process that was using 99% CPU and all of the other Oracle processes were "waiting". So, I decided to kill the process. First I did ps -ef pipe to grep 99999 (where 99999 is the PID) to see what this process was. Then I determined that the process could be safely killed. Next, I killed it and the PID (Process Id) that was associated with the rogue process. [su - oracle followed by kill -9 99999] Once the PID was killed, all other Oracle processes began "running" and the crisis was over. My next step is to open a SR with Oracle Support and upload the trace files.

Check out the comments for more instructions by me.

2 comments:

sedwardba said...

I heard back from Oracle Support.
A true hang condition was NOT found. The database has only STUCK sessions which are not hung in a cycle deadlock but are unable to continue working due to a long wait condition. (e.g. A transactional lock wait).

Wait chains may reflect a performance problem scenario, which is what appears to have
happened in your case. Wait chains may be resolved by killing the session that is on top of
the waiting list, or by forcing it to release the resources that other processes require to keep
working (Commit or Rollback the transaction).

sedwardba said...

Today, I had a hang so severe that it was taking hours to create the trace files using the procedures that I outlined in this post. Decided to go ahead and stop the trace dumps after an hour,so we could get the database back in business. Had to do a shutdown abort to get the instance to shutdown.

Here is a note to check out on MetaLink 121779.1 Taking Systemstate Dumps when You cannot Connect to Oracle. Use that note whenever the system will not respond at all during a hang up.