A Bug Diagnosis Record

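A competitor maintains an Oracle 10.2.0.5 database with RAC and ASM, deployed on VMware. Only one of the two nodes can be started; the other crashes during startup. The competitor's engineer told the customer: this is a bug, and you have to upgrade to 11gR2 before the relevant patch can be applied. That is nonsense. What it really means is: it's not that I can't fix this for you, but the fix amounts to reinstalling your whole system, so it's your call.
Reinstalling the system solves 99.9% of problems. I learned that back in the Windows 95 days.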

So let us handle it. Helping the customer is helping ourselves!

Alert log:
Tue May 16 21:08:17 CST 2017
ALTER DATABASE OPEN
Picked broadcast on commit scheme to generate SCNs
Tue May 16 21:13:15 CST 2017
Errors in file /oracle/app/oracle/admin/yjjsdbs/bdump/yjjsdbs2_dbw0_6917.trc:
ORA-00240: control file enqueue held for more than 120 seconds
Tue May 16 21:47:14 CST 2017
alter database open
Tue May 16 21:47:14 CST 2017
ORA-1154 signalled during: alter database open...
Tue May 16 21:47:38 CST 2017
Shutting down instance (immediate)
Tue May 16 21:47:38 CST 2017
Shutting down instance: further logons disabled
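
When the alert log is long, the error lines can be pulled out together with the timestamp they fall under. The following is a minimal sketch, assuming the plain-text 10g alert log layout shown above (a timestamp line followed by its messages); the names `ora_events` and `ALERT_EXCERPT` are made up for this illustration and are not Oracle tooling.

```python
import re
from datetime import datetime

# Timestamp lines look like "Tue May 16 21:08:17 CST 2017". The time zone
# token is skipped because strptime's %Z handling is platform-dependent.
STAMP = re.compile(r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d) \w+ (\d{4})$")
# Matches both "ORA-00240" and the short form "ORA-1154".
ORA = re.compile(r"ORA-0*(\d+)")


def ora_events(alert_text):
    """Return (timestamp, error_number, line) for every ORA- error found."""
    current = None  # most recent timestamp line seen so far
    events = []
    for raw in alert_text.splitlines():
        line = raw.strip()
        m = STAMP.match(line)
        if m:
            current = datetime.strptime(
                f"{m.group(1)} {m.group(2)}", "%a %b %d %H:%M:%S %Y"
            )
            continue
        for code in ORA.findall(line):
            events.append((current, int(code), line))
    return events


# Hypothetical variable holding the alert log excerpt quoted in this post.
ALERT_EXCERPT = """\
Tue May 16 21:08:17 CST 2017
ALTER DATABASE OPEN
Picked broadcast on commit scheme to generate SCNs
Tue May 16 21:13:15 CST 2017
Errors in file /oracle/app/oracle/admin/yjjsdbs/bdump/yjjsdbs2_dbw0_6917.trc:
ORA-00240: control file enqueue held for more than 120 seconds
Tue May 16 21:47:14 CST 2017
alter database open
Tue May 16 21:47:14 CST 2017
ORA-1154 signalled during: alter database open...
Tue May 16 21:47:38 CST 2017
Shutting down instance (immediate)
Tue May 16 21:47:38 CST 2017
Shutting down instance: further logons disabled
"""

for when, code, text in ora_events(ALERT_EXCERPT):
    print(when, f"ORA-{code:05d}", text)
```

The sketch only groups messages under the nearest preceding timestamp; it does not try to interpret the errors.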

The trace file mentioned in the log:

----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
kcc_tac_callback()+ call ksedst1() 000000000 ? 000000000 ?
2531 7FFFB7333650 ? 7FFFB73336B0 ?
7FFFB73335F0 ? 000000000 ?
ksu_dispatch_tac()+ call kcc_tac_callback() 2AB2A32AB958 ? 7FFFB7337540 ?
402 7FFFB7333650 ? 7FFFB73336B0 ?
7FFFB73335F0 ? 000000000 ?
ksdxexeotherwait()+ call ksu_dispatch_tac() 2AB2A32AB958 ? 7FFFB7337540 ?
1102 7FFFB7333650 ? 7FFFB73336B0 ?
7FFFB73335F0 ? 000000000 ?
ksdxdocmdmult()+346 call ksdxexeotherwait() 0DFC2F148 ? 7FFFB7337540 ?
0 00554A7A0 ? 000000001 ?
7FFFB7335520 ? 7FFFB7335480 ?
ksudmp_proc()+2123 call ksdxdocmdmult() 7FFFB73360B0 ? 000000001 ?
000000000 ? 00553A2F4 ?
753000000009 ? 10E400007530 ?
ksvworkmsgdump()+44 call ksudmp_proc() 0DFC2F148 ?
2 FFFFFFFFFFFFFFFF ?
000000100 ? 00553A2F4 ?
006B05C20 ?
FFFFFFFF00000000 ?
ksvsubmit()+3572 call ksvworkmsgdump() 0DFC2F148 ? 000000002 ?
000000100 ? 00553A2F4 ?
006B05C20 ?
FFFFFFFF00000000 ?
kfncSlaveSubmit()+4 call ksvsubmit() FFFFFFFFDDB5E150 ?
9 0DEAD0528 ? 0DDB5FFC0 ?
07FFFFFFF ? 000000000 ?
0FFFFFFFF ?
kfncFileIdentify()+ call kfncSlaveSubmit() 7FFFB7336470 ? 0DDB5FFC0 ?
675 005647158 ? 07FFFFFFF ?
000000000 ? 0FFFFFFFF ?
kfioIdentify()+1067 call kfncFileIdentify() 0D290D730 ? 0D5726E4A ?
0DEAD0590 ? 000550B7F ?
000000009 ? 0D5726E18 ?
ksfd_osmopn()+1138 call kfioIdentify() 7FFFB7336D82 ? 0D5726E2C ?
000550B7F ? 000002000 ?
000000002 ? 0D5726E00 ?
ksfdopn()+1014 call ksfd_osmopn() 7FFFB7336D82 ? 00000002E ?
000002000 ? 000000002 ?
000060000 ? 0DFC281B8 ?
kcfbid()+492 call ksfdopn() 7FFFB7336D82 ? 00000002E ?
000002000 ? 000000002 ?
000060000 ? 0DFC281B8 ?
kcfbsy()+2363 call kcfbid() 000000007 ? 7FFFB7337540 ?
000002000 ? 000000038 ?
7FFF00000001 ? 00000000A ?
kcvvia()+77 call kcfbsy() 000000000 ? 000000000 ?
000000000 ? 000000000 ?
000000000 ? 000000000 ?
ksbabs()+562 call kcvvia() 7FFFB7337AD8 ? 000000000 ?
000000000 ? 000000000 ?
000000000 ? 000000000 ?
ksbrdp()+821 call ksbabs() 7FFFB7337AD8 ? 000000000 ?
000000000 ? 000000000 ?
000000000 ? 000000000 ?
opirip()+614 call ksbrdp() 7FFFB7337AD8 ? 000000000 ?
000000001 ? 0600130C8 ?
000000000 ? 000000000 ?
opidrv()+583 call opirip() 000000032 ? 000000004 ?
7FFFB7338C08 ? 0600130C8 ?
000000000 ? 000000000 ?
sou2o()+114 call opidrv() 000000032 ? 000000004 ?
7FFFB7338C08 ? 0600130C8 ?
000000000 ? 000000000 ?
opimai_real()+317 call sou2o() 7FFFB7338BE0 ? 000000032 ?
000000004 ? 7FFFB7338C08 ?
000000000 ? 000000000 ?
main()+116 call opimai_real() 000000003 ? 7FFFB7338C70 ?
000000004 ? 7FFFB7338C08 ?
000000000 ? 000000000 ?
__libc_start_main() call main() 000000003 ? 7FFFB7338C70 ?
+244 000000004 ? 7FFFB7338C08 ?
000000000 ? 000000000 ?
_start()+41 call __libc_start_main() 00072D848 ? 000000001 ?
7FFFB7338DC8 ? 000000000 ?
000000000 ? 000000003 ?

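The difficulty: Oracle has ended support for 10g, and for a deployment on VMware like this one there is even less hope of help from Oracle Support.
Based on the trace file, we can find bug 12900003: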

Bug Attributes

Type: B - Defect
Fixed in Product Version:
Severity: 2 - Severe Loss of Service
Product Version: 11.2.0.2
Status: 91 - Closed, Could Not Reproduce
Platform: 226 - Linux x86-64
Created: Aug 21, 2011
Platform Version: ORACLE LINUX 5
Updated: Oct 31, 2011
Base Bug: 10222719
Database Version: 10.2.0.4
Affects Platforms: Generic
Product Source: Oracle

Related Products

Line: Oracle Database Products
Family: Oracle Database Suite
Area: Oracle Database
Product: 5 - Oracle Database - Enterprise Edition

Hdr: 12900003 10.2.0.4 RDBMS 11.2.0.2 ASM PRODID-5 PORTID-226 10222719 
Abstract: ORA-240 CONTROLFILE ENQUEUE ON KFNCFILEIDENTIFY HANGING CLUSTER

*** 08/21/11 12:03 am ***

PROBLEM:
--------
controlfile enqueue causing hang on the cluster.

LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=28, OS id=18199
Sat Aug 20 22:11:03 2011
ARC0: Archival started
ARC1: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC1 started with pid=29, OS id=18201
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
Sat Aug 20 22:11:06 2011
ARC0: Becoming the heartbeat ARCH
Sat Aug 20 22:11:07 2011
SUCCESS: diskgroup MISC_FBA was mounted
SUCCESS: diskgroup MISC_FBA was dismounted
SUCCESS: diskgroup MISC_FBA was mounted
SUCCESS: diskgroup MISC_FBA was dismounted
SUCCESS: diskgroup MISC_FBA was mounted
Sat Aug 20 22:17:03 2011
System State dumped to trace file
/a07/app/oracle/product/10.2.0/db_1/admin/amxprd1/udump/amxprd12_ora_23020.trc

Sat Aug 20 22:18:08 2011
System State dumped to trace file
/a07/app/oracle/product/10.2.0/db_1/admin/amxprd1/udump/amxprd12_ora_23020.trc

Sat Aug 20 22:18:54 2011
System State dumped to trace file
/a07/app/oracle/product/10.2.0/db_1/admin/amxprd1/udump/amxprd12_ora_23020.trc

Sat Aug 20 22:19:13 2011
Errors in file
/a07/app/oracle/product/10.2.0/db_1/admin/amxprd1/bdump/amxprd12_arc0_18199.trc:
ORA-240: control file enqueue held for more than 120 seconds


DIAGNOSTIC ANALYSIS:
--------------------

kcc_tac_callback <- 2352 <- ksu_dispatch_tac <- 401 <- ksdxexeotherwait
<- 817 <- ksdxdocmdmult <- ksudmp_proc <- ksvworkmsgdump <- ksvsubmit
<- kfncSlaveSubmit <- kfncFileIdentify <- 572 <- kfioIdentify <-
ksfd_osmopn
<- ksfdopn <- kcropn <- kcroio <- kcrrgfi <- kcrr_find_work
<- kcrrwkx <- kcrrwk <- ksbabs <- ksbrdp <- opirip
<- opidrv <- sou2o <- opimai_real <- main <- libc_start_main
<- start

File_name :: amxprd12_arc0_18199.trc

WORKAROUND:
-----------
None

RELATED BUGS:
-------------
9555335

REPRODUCIBILITY:
----------------

TEST CASE:
----------

STACK TRACE:
------------
kcc_tac_callback <- 2352 <- ksu_dispatch_tac <- 401 <- ksdxexeotherwait
<- 817 <- ksdxdocmdmult <- ksudmp_proc <- ksvworkmsgdump <- ksvsubmit
<- kfncSlaveSubmit <- kfncFileIdentify <- 572 <- kfioIdentify <-
ksfd_osmopn
<- ksfdopn <- kcropn <- kcroio <- kcrrgfi <- kcrr_find_work
<- kcrrwkx <- kcrrwk <- ksbabs <- ksbrdp <- opirip
<- opidrv <- sou2o <- opimai_real <- main <- libc_start_main
<- start

SUPPORTING INFORMATION:
-----------------------
Uploading shortly

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------

DIAL-IN INFORMATION:
--------------------

IMPACT DATE:
------------

*** 08/21/11 12:05 am ***
*** 08/21/11 12:07 am ***
*** 08/21/11 12:07 am ***
*** 08/21/11 12:08 am ***
*** 08/21/11 12:13 am ***
*** 08/21/11 12:20 am ***
*** 08/21/11 12:38 am ***
*** 09/01/11 01:34 am ***
*** 09/01/11 01:34 am ***
*** 10/28/11 08:46 pm *** (CHG: Sta->16)
*** 10/28/11 08:46 pm ***
*** 10/30/11 11:49 pm ***
*** 10/30/11 11:50 pm *** (CHG: Base Bug-> NULL -> 10222719)
*** 10/30/11 11:50 pm ***
*** 10/30/11 11:52 pm *** (CHG: Sta->91)
*** 10/30/11 11:52 pm ***
*** 10/30/11 11:52 pm *** (ADD: Impact/Symptom->DATABASE HANG )

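The call stack in this bug's description matches our problem: the frames from kcc_tac_callback down through ksfdopn are the same as in our trace. The bug record lists 10222719 as its base bug.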
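
The comparison can also be done mechanically rather than by eye. The following is a minimal sketch, assuming the two text layouts quoted in this post (the `name()+offset call ...` form from the trace file and the `a <- b` form from the bug text); the helper names `frames_from_trace`, `frames_from_bug`, and `common_prefix` are made up for this illustration.

```python
import re


def frames_from_trace(text):
    """Function names from trace lines such as 'ksfdopn()+1014 call ...'."""
    # A frame line starts with 'name()'; continuation lines start with hex values.
    return re.findall(r"^(\w+)\(\)", text, re.MULTILINE)


def frames_from_bug(text):
    """Function names from the 'a <- 2352 <- b' form; bare numbers are offsets."""
    tokens = [t.strip() for t in text.replace("\n", " ").split("<-")]
    return [t for t in tokens if t and not t.isdigit()]


def common_prefix(a, b):
    """Leading frames that two stacks share, starting from the top."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


# First lines of the top frames from our trace file, copied from above.
TRACE_EXCERPT = """\
kcc_tac_callback()+ call ksedst1() 000000000 ? 000000000 ?
2531 7FFFB7333650 ? 7FFFB73336B0 ?
ksu_dispatch_tac()+ call kcc_tac_callback() 2AB2A32AB958 ? 7FFFB7337540 ?
ksdxexeotherwait()+ call ksu_dispatch_tac() 2AB2A32AB958 ? 7FFFB7337540 ?
ksdxdocmdmult()+346 call ksdxexeotherwait() 0DFC2F148 ? 7FFFB7337540 ?
ksudmp_proc()+2123 call ksdxdocmdmult() 7FFFB73360B0 ? 000000001 ?
ksvworkmsgdump()+44 call ksudmp_proc() 0DFC2F148 ?
ksvsubmit()+3572 call ksvworkmsgdump() 0DFC2F148 ? 000000002 ?
kfncSlaveSubmit()+4 call ksvsubmit() FFFFFFFFDDB5E150 ?
kfncFileIdentify()+ call kfncSlaveSubmit() 7FFFB7336470 ? 0DDB5FFC0 ?
kfioIdentify()+1067 call kfncFileIdentify() 0D290D730 ? 0D5726E4A ?
ksfd_osmopn()+1138 call kfioIdentify() 7FFFB7336D82 ? 0D5726E2C ?
ksfdopn()+1014 call ksfd_osmopn() 7FFFB7336D82 ? 00000002E ?
kcfbid()+492 call ksfdopn() 7FFFB7336D82 ? 00000002E ?
"""

# Stack as quoted in the text of bug 12900003.
BUG_STACK = """\
kcc_tac_callback <- 2352 <- ksu_dispatch_tac <- 401 <- ksdxexeotherwait
<- 817 <- ksdxdocmdmult <- ksudmp_proc <- ksvworkmsgdump <- ksvsubmit
<- kfncSlaveSubmit <- kfncFileIdentify <- 572 <- kfioIdentify <-
ksfd_osmopn
<- ksfdopn <- kcropn <- kcroio <- kcrrgfi <- kcrr_find_work
<- kcrrwkx <- kcrrwk <- ksbabs <- ksbrdp <- opirip
<- opidrv <- sou2o <- opimai_real <- main <- libc_start_main
<- start
"""

shared = common_prefix(frames_from_trace(TRACE_EXCERPT), frames_from_bug(BUG_STACK))
print(len(shared), "shared frames, last one:", shared[-1])
```

Only function names are compared; the numeric offsets differ between 10.2.0.4 and 10.2.0.5 builds and are ignored here.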

We were lucky: patch 10222719 for Oracle 10.2.0.5 on Linux x86-64 was still available for download, and with it we solved the customer's problem.
