ORA-03106: fatal two-task communication protocol error

Error:

ORA-24330: internal OCI error
ORA-03106: fatal two-task communication protocol error
ORA-03114: not connected to ORACLE


Where: Error logged on oracle 11.2.0.2 client application server. Application based on c/c++ coding. Solaris 10 SPARC


Database 11.2.0.2 compatible, Solaris 10 SPARC

Background:
ORA-24330: internal OCI error & ORA-03114: not connected to ORACLE errors caused due to session disconnection by ORA-03106: fatal two-task communication protocol error

So, if you receive any of these three errors there could be other error strings relevant to it.

First time we observed ORA-24330: internal OCI error & started hunting for resolution, but did not find anything than contact oracle support , However debugging further in logs found other two errors i.e. ORA-03114 & ORA-03106

Showing complete error logs below, where you will find the sql statement & errors

Being DBA It may take many efforts to get all these errors out of thousands of application logs.

Initially we found just OCI Error which took us in wrong direction completely until we (application team) sighted ORA-03106: fatal two-task communication protocol error

[RSP] 139100 SIVTMasterProProfList:1001 _getList_  SAVES=0 < "CCS"  "PRS"  F 1 "18512"  "":0 17000 "Database error: SQL> FETCH SelectCursor USING SQL DESCRIPTOR electBindDesc_."  <  "CCS"  "PRS"  F 1 "18512"  ""  :0 17000 "Database error: SELECT siv_t_mp_p0.sy_createddate, siv_t_mp_p0.rfu9, siv_t_mp_p0.mp_profname, siv_t_mp_p0.rfu8, siv_t_mp_p0.rfu7, siv_t_mp_p0.rfu6, siv_t_mp_p0.rfu5, siv_t_mp_p0.rfu4, siv_t_mp_p0.av_values, siv_t_mp_p0.rfu3, siv_t_mp_p0.rfu2, siv_t_mp_p0.rfu1, siv_t_mp_p0.sy_pendoperation, siv_t_mp_p0.rfu10, siv_t_mp_p0.vpn, siv_t_mp_p0.sy_modifieddate, siv_t_mp_p0.sy_committeddate, siv_t_mp_p0.domain, siv_t_mp_p0.sy_objectid, siv_t_mp_p0.metadata, siv_t_mp_p0.sy_profname, siv_t_mp_p0.sy_changeorderid, siv_t_mp_p0.sy_version, siv_t_mp_p0.sy_objectid FROM siv_t_mp_p siv_t_mp_p0 WHERE siv_t_mp_p0.sy_objectid = 'SIVU0000Fe' AND siv_t_mp_p0.sy_pendoperation != 2 AND (siv_t_mp_p0.sy_changeorderid = 'SYxf004PZK' AND siv_t_mp_p0.sy_version = 1 OR siv_t_mp_p0.sy_version = 0 AND NOT (siv_t_mp_p0.sy_objectid IN (SELECT siv_t_mp_p0.sy_objectid FROM siv_t_mp_p siv_t_mp_p0 WHERE siv_t_mp_p0.sy_changeorderid = 'SYxf004PZK' AND siv_t_mp_p0.sy_version = 1)))."  <  "CCS"  "PRS"  F 1 "18512"  ""  :0 17000 "Database error: ORA-24330: internal OCI error. (Code=-24330) (Position=0)."  <  "CCS"  "ATR"  P 1 "18512"  ""  SIVTMasterProProfList:1001 85010 "Getlist failed. [VPN Access Path Master Profile P List #1001 

[SIVTMasterProProfList:1001]]"  =TOP [->REQ] 139100 SIVTMasterProProfList:1001 _clearList_SAVES=0 <  "CCS"  "PRS"  F 1 "18512"  ""  :0 17000 "Database error: SQL> OPEN GetInfoCursor."  <  "CCS"  "PRS"  F 1 "18512"  ""  :0 17000 "Database error: ORA-03114: not connected to ORACLE. (Code=-3114) (Position=0)."  <  "CCS"  "PRS"  I 1 "18512"  ""  :0 17097 "Database server is no longer accessible, forcing NPA shutdown and restart."  =TOP [EVT->] 138999 LogMessage:1 systemadmin2 CCS PRS 18512 ":0"  0 17000 F 1 "!nouser!"  "Database error: ORA-03114: not connected to ORACLE. (Code=-3114) (Position=0)." [EVT->] 138999 LogMessage:1 systemadmin2 CCS PRS 18512 ":0"  0 17000 F 1 "!nouser!"  "Database error: SQL> OPEN GetInfoCursor."

SIVTMasterProProfList:1002 _getList_  SAVES=0 <  "CCS"  "PRS"  F 1 "29644"  ""  :0 17000 "Database error: SQL> FETCH SelectCursor USING SQL DESCRIPTOR SelectBindDesc_."  < "CCS"  "PRS"  F 1 "29644"  ""  :0 17000 "Database error: SELECT siv_t_mp_p0.sy_createddate, siv_t_mp_p0.rfu9, siv_t_mp_p0.mp_profname, siv_t_mp_p0.rfu8, siv_t_mp_p0.rfu7, siv_t_mp_p0.rfu6, siv_t_mp_p0.rfu5, siv_t_mp_p0.rfu4, siv_t_mp_p0.av_values, siv_t_mp_p0.rfu3, siv_t_mp_p0.rfu2, siv_t_mp_p0.rfu1, siv_t_mp_p0.sy_pendoperation, siv_t_mp_p0.rfu10, siv_t_mp_p0.vpn, siv_t_mp_p0.sy_modifieddate, siv_t_mp_p0.sy_committeddate, siv_t_mp_p0.domain, siv_t_mp_p0.sy_objectid, siv_t_mp_p0.metadata, siv_t_mp_p0.sy_profname, siv_t_mp_p0.sy_changeorderid, siv_t_mp_p0.sy_version, siv_t_mp_p0.sy_objectid FROM siv_t_mp_p siv_t_mp_p0 WHERE siv_t_mp_p0.sy_objectid = 'SIVU0000Fe' AND siv_t_mp_p0.sy_pendoperation != 2 AND (siv_t_mp_p0.sy_changeorderid = 'SYxf004M5G' AND siv_t_mp_p0.sy_version = 1 OR siv_t_mp_p0.sy_version = 0 AND NOT (siv_t_mp_p0.sy_objectid IN (SELECT siv_t_mp_p0.sy_objectid FROM siv_t_mp_p siv_t_mp_p0 WHERE siv_t_mp_p0.sy_changeorderid = 'SYxf004M5G' AND siv_t_mp_p0.sy_version = 1)))."  <  "CCS"  "PRS"  F 1 "29644"  ""  :0 17000 "Database error: ORA-03106: fatal two-task communication protocol error. (Code=-3106) (Position=0)."  <  "CCS"  "ATR"  P 1 "29644"  "" SIVTMasterProProfList:1002 85010 "Getlist failed. [VPN Access Path Master Profile P List #1002 [SIVTMasterProProfList:1002]]"  =TOP
Cause & Analysis:

As I explained earlier getting exact bunch of errors & the actual error out of it was the trick to get into correct direction to solve this problem 

ORA-03106: fatal two-task communication protocol was the actual error & rest are side effects of it causing disconnection of DB session 

Here, I would like to show step by step analysis that we did rather than jumping to solution directly which is at the end. In case the solution is not relevant to your problem you may get some hint or direction to resolve your problem based on analysis shown here below …

Day 1:  Application Team reported error ORA-03114 first time appearing within specific time interval 2-4 PM IST

What we did?  Of course pointed finger to network 

[EVT->] 138999 LogMessage:1 systemadmin2 CCS PRS 18512 ":0"  0 17000 F 1 "!nouser!"  "Database error: ORA-03114: not connected to ORACLE. (Code=-3114) (Position=0)."[EVT->] 138999 LogMessage:1 systemadmin2 CCS PRS 18512 ":0"  0 17000 F 1 "!nouser!"  "Databaseerror: SQL> OPEN GetInfoCursor." 

Issue went in Network and system admin’s bucket as no error was reported in alert log of DB

Day 2: Network team were ready to capture the log to identify what’s going wrong between client & server communications, but Nothing Found suspicious at Network level!

System admin also hold the line of control, they enabled something like tcpdump, packet snooping and other traces, No luck!

Day 3: Yes, it was DBA’s turn on day 3

We did below to capture what’s going wrong between 2-4 PM IST

I. Ensure AWR snaps being taken
II. Enabled 10046 trace for every incoming session from application server host
III. Continuously Monitored TOP cpu/memory/locking session, load on DB & APPS server

What we observed?
    - Load found to be going high on both DB & Apps server between 2-4 PM IST when error occurs
    - Huge COMMIT operation from application as if some batch job being executed
    - RMAN backup was running causing huge IO
    - 5 Queries with poor execution plan including a DML executing almost 50 K time , two of the SELECT queries using /*+ FIRST_ROWS_1*/ which suggests a COMMIT after every row causing high COMMIT rate
    - Log file sync , db sequential read , DB CPU were amongst top 5 wait events

What we did?
   - Highlighted these DB observation to application team for their analysis
   - As log file sync observed , when checked the redo log file it was just 50 MB * 3 redo groups, so we replaced it by 1G * 6 Redo Log group with 2 member each
   - RMAN backup scheduled changed to Night

What we got from application in reply to DB observation is New Error i.e. ORA-24330: internal OCI error along with previous ORA-03114: not connected

Day 4: ORA-24330: internal OCI error made this issue interesting for others (application, network team) as it confirms to be complete oracle issue either at oracle client or database

What we did / found?
    - Logged Sev 1 SR in My Oracle Support (MOS)
    - Relevant to OCI error ORA-24330 found below oracle notes

Effect of client_result_cache_size On Client Applications Or OCI Based Applications (Doc ID 1300727.1)
Bug 10427522 - ORA-24330 "internal OCI error" during DML (Doc ID 10427522.8)

In above oracle notes, it was suggested to disable 11g feature of client result cache by setting client_result_cache_size=0 in parameter file

If value set for client_result_cache_size is greater than 0, then during heavy DML buffer size limited to this value causing OCI error

But client_result_cache_size was already set to zero on db, still we enforced it by adding parameter client_result_cache_size=0 in pfile as db was running with pfile than spfile with a db reboot

At the end of day 4 we were completely dependent & waiting for SR updates for recommendation to get breakthrough

At the end of the day 4, one major challenge was to re-produce the scenario which application team was unable to do it using a sample code and till now exact part of the application code was not identified

MOS expecting us to create a testcase by re-producing the problem using below, but we were unable to do so
make -f demo_rdbms.mk build EXE=testcase OBJS=testcase.o

Day 5: Fortunately last day of Hunt, followed by Happy Weekend

Again error appeared at 3:46 PM IST & this time application team introduced us to new ORA error i.e.
ORA-03106: fatal two-task communication protocol error along with previous ORA-24330 & ORA-03114

It is not like the new errors appearing every day but as I mentioned in beginning, huge application logs to be mined (log created for almost every server process id at application level) took time to identify the ORA- errors which was there right from the beginning of the issues but we got it on day 5, unfortunately (or say fortunately on day 5 at least)

SIVTMasterProProfList:1002 _getList_  SAVES=0 <  "CCS"  "PRS"  F 1 "29644"  ""  :0 17000 "Database error: SQL> FETCH SelectCursor USING SQL DESCRIPTOR SelectBindDesc_."  < "CCS" "PRS"  F 1 "29644"  ""  :0 17000 "Database error: SELECT siv_t_mp_p0.sy_createddate, siv_t_mp_p0.rfu9, siv_t_mp_p0.mp_profname, siv_t_mp_p0.rfu8, siv_t_mp_p0.rfu7, siv_t_mp_p0.rfu6, siv_t_mp_p0.rfu5, siv_t_mp_p0.rfu4, siv_t_mp_p0.av_values, siv_t_mp_p0.rfu3, siv_t_mp_p0.rfu2, siv_t_mp_p0.rfu1, siv_t_mp_p0.sy_pendoperation, siv_t_mp_p0.rfu10, siv_t_mp_p0.vpn, siv_t_mp_p0.sy_modifieddate, siv_t_mp_p0.sy_committeddate, siv_t_mp_p0.domain, siv_t_mp_p0.sy_objectid, siv_t_mp_p0.metadata, siv_t_mp_p0.sy_profname, siv_t_mp_p0.sy_changeorderid, siv_t_mp_p0.sy_version, siv_t_mp_p0.sy_objectid FROM siv_t_mp_p siv_t_mp_p0 WHERE siv_t_mp_p0.sy_objectid = 'SIVU0000Fe' AND siv_t_mp_p0.sy_pendoperation != 2 AND (siv_t_mp_p0.sy_changeorderid = 'SYxf004M5G' AND siv_t_mp_p0.sy_version = 1 OR siv_t_mp_p0.sy_version = 0 AND NOT (siv_t_mp_p0.sy_objectid IN (SELECT siv_t_mp_p0.sy_objectid FROM siv_t_mp_p siv_t_mp_p0 WHERE siv_t_mp_p0.sy_changeorderid = 'SYxf004M5G' AND siv_t_mp_p0.sy_version = 1)))."  <  "CCS"  "PRS"  F 1 "29644"  ""  :0 17000 "Database error: ORA-03106: fatal two-task communication protocol error. (Code=-3106) (Position=0)."  <  "CCS"  "ATR"  P 1 "29644"  "" SIVTMasterProProfList:1002 85010 "Getlist failed. [VPN Access Path Master Profile P List #1002 [SIVTMasterProProfList:1002]]"  =TOP 

Also a hint that error is appearing while selecting rows from blank LOB datatype column value of the table, which application team were able to replicate using a sample pro*C code found on oracle site


Ansidyn1.pc is an Oracle Pro*C program that implements a dynamic SQL interpreter. Using the ANSI embedded SQL syntax, Ansidyn1.pc accepts dynamic SQL at the input command prompt (SQL>

We tried to compile this proc*C code which expects DB Username, Password & db service name as input

proc mode=ansi ansidyn1
Pro*C/C++: Release 11.2.0.2.0 - Production on Sat Mar 23 15:34:31 2013
Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
System default option values taken from: /rdbms/11gR2/db_2/precomp/admin/pcscfg.cfg
Error at line 41, column 10 in file ansidyn1.pc 
#include "tt_version.h" 
.........1
PCC-S-02015, unable to open include file

But got so many errors, so we need to rely on application team who handed over complied code file a.out
There were solutions to this compilation error, but i did not tried as got readily available compiled file

./a.out CE ***1234 CEPRD
Connecting as CE@CEPRD
Connected
You are now in a SQL intrepreter shell using ANSI Dynamic SQL.
Please enter SQL DDL and DML statements, eg create table, select, commit etc
SQL> select sraddcmds from sr_cfglet_p;
SRADDCMDS
ANSI sqlstate: 63000:
ORA-03106: fatal two-task communication protocol error
SQL> exit

In the select query used above to re-produce the problem sraddcmds was the CLOB column of table sr_cfglet_p with few empty rows

So the root cause of the problem was error being thrown for LOB empty column values
  

Now, Next question was why is it so ?
With the help of Oracle support we did get the answer i.e. db_securefile=ALWAYS parameter

ALWAYS
All LOBs created in the system are created as SecureFile LOBs. If the LOB is not created in an Automatic Segment Space Managed tablespace, then an error will occur. Any BasicFile LOB storage options are ignored. The SecureFile defaults will be used for all storage options not specified.

PERMITTED (default)
LOBs are allowed to be created as SecureFiles.

If the COMPATIBLE initialization parameter is not set to 11.1 or higher, then LOBs are not treated as SecureFiles.

So action plan was to make set  db_securefile=PERMITTED ,  but huge setback was just by setting a db parameter it was not going to resolve the problem completely as LOB tables we created with ALWAYS  options will throw same error if only db parameter is changed . So post changing db_securefile to PERMITTED it was mandatory to re-create those tables
Changing DB parameter along with tables re-creation was very much effort consuming task, as it involves major structure changes at schema as well at code level

To save these efforts a workaround was suggested to put some data or string "NULL" in all blank columns along with modifying column to NOT NULL constraint & set the DB parameter.
So that new tables will be created under default option of PERMITTED wherein LOBS are allowed to be created as secure files without restriction of ASSM

Due to some issues oracle analyst was unable to re-produce the problem in his setup & he was suspecting one parameter sql92_security=TRUE which is when set to FALSE could resolve the issue, but that did not happen

Bug has been created by MOS as below for this issue which is in progress while I am writing this article and you may get benefited with the solution/patch once made available

Bug 16536665 - ORA-3106 WHEN CLOB NULL AND SQL92_SECURITY=TRUE AND DB_SECUREFILE=ALWAYS

Just before enlisting solution, here are few snaps showing how we came to conclusion that setting parameter PERMITTED will resolve this issue

Test_1.  Tests based on setting db_securefile PERMITTED/ALWAYS at session level to re-produce the problem

When sample table created with session level parameter PERMITTED no error observed, but it does when set to ALWAYS

 At DB:


At Apps: test_lob_2  is created when session level  ALWAYS value was set for db_securefile parameter
  

Test_2.  To test the workaround that blank LOB column row removal or replacing with some value could resolve the issue 

 At DB:


At App:  Yes, it worked. But need to move such columns to NOT NULL for future
  

Solution:

1. Set db_securefile to PERMITTED & re-create all LOB data type tables (Workaround)
2. Replace blank row column values of the table by some value or NULL flag, and enable NOT NULL constraint for those columns

3. Apply Patch 

<Patch 14727315> DATABASE PATCH SET UPDATE 11.2.0.2.9 (INCLUDES CPUJAN2013)
since it includes accoding to
11.2.0.2 Patch Set Updates - List of Fixes in each PSU (Doc ID 1340011.1)
the fix for
Bug 11723722ORA-3106 READNG SECUREFILES BY OCI SQLTYPE SQLT_LBI AND SQLT_CHR


Lesson Learned:

As the parameter db_securefile default value is PERMITTED , which got changed to ALWAYS as a part if database hardening as per CSI benchmark 11g


Need to have proper testing & review from application after DB hardening

2 comments:

  1. Excellent post and wonderful blog, I really like this type of interesting articles keep it up.

    ReplyDelete