Environment:
Database: Oracle 3-Node RAC 19.27
Operating System: Windows Server

Issue Description:
The Oracle database instance failed to start with the error messages shown below. The entire cluster had been shut down manually to troubleshoot frequent network issues. Although the cluster stack started successfully, the RDBMS instance failed to start and reported the following errors.

CRS-2672: Attempting to start 'ora.testdb.db' on 'testdb3'
CRS-5017: The resource action "ora.testdb.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "D:\oracle\app\grid\diag\crs\testdb3\crs\trace\crsd_oraagent_system.trc".
CRS-2674: Start of 'ora.testdb.db' on 'testdb3' failed
CRS-2672: Attempting to start 'ora.ons' on 'testdb2'
CRS-2674: Start of 'ora.ons' on 'testdb2' failed
CRS-2672: Attempting to start 'ora.testdb.db' on 'testdb3'

Changes in the environment:
The Network Team provided a different Ethernet card for the private interconnect IP on Node3. This was the only change made on Node3 to troubleshoot the private interconnect fluctuations.

Troubleshooting Steps:

Step 1: The private interconnect configuration was verified across all cluster nodes, including Ethernet card names, IP addresses, subnet masks, gateways, and DNS IPs. No issues were observed during this verification.

Step 2: The database was started manually to check the issue.

C:\Windows\system32>sqlplus / as sysdba
SQL> startup
ORA-03113: end-of-file on communication channel

The issue recurred even when the instance was started manually.

Step 3: The server was rebooted to investigate further and the cluster stack was started with crsctl start crs -wait. The command was eventually aborted with <Ctrl+C> because CRS kept retrying the database instance start.
C:\Windows\system32>crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'testdb03'
CRS-2672: Attempting to start 'ora.evmd' on 'testdb03'
CRS-2676: Start of 'ora.mdnsd' on 'testdb03' succeeded
CRS-2676: Start of 'ora.evmd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'testdb03'
CRS-2676: Start of 'ora.gpnpd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'testdb03'
CRS-2676: Start of 'ora.gipcd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'testdb03'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'testdb03'
CRS-2676: Start of 'ora.crf' on 'testdb03' succeeded
CRS-2676: Start of 'ora.cssdmonitor' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'testdb03'
CRS-2676: Start of 'ora.cssd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'testdb03'
CRS-2676: Start of 'ora.ctssd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'testdb03'
CRS-2676: Start of 'ora.asm' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'testdb03'
CRS-2676: Start of 'ora.storage' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'testdb03'
CRS-2676: Start of 'ora.crsd' on 'testdb03' succeeded
CRS-6017: Processing resource auto-start for servers: testdb03
CRS-2673: Attempting to stop 'ora.testdb03.vip' on 'testdb02'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'testdb01'
CRS-2672: Attempting to start 'ora.ons' on 'testdb03'
CRS-2677: Stop of 'ora.testdb03.vip' on 'testdb02' succeeded
CRS-2672: Attempting to start 'ora.testdb03.vip' on 'testdb03'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'testdb01' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'testdb01'
CRS-2677: Stop of 'ora.scan1.vip' on 'testdb01' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'testdb03'
CRS-5052: invalid host name or IP address '10.189.197.25'
CRS-2676: Start of 'ora.testdb03.vip' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.LISTENER.lsnr' on 'testdb03'
CRS-2676: Start of 'ora.scan1.vip' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'testdb03'
CRS-2676: Start of 'ora.LISTENER.lsnr' on 'testdb03' succeeded
CRS-33672: Attempting to start resource group 'ora.asmgroup' on server 'testdb03'
CRS-2672: Attempting to start 'ora.asmnet1.asmnetwork' on 'testdb03'
CRS-2676: Start of 'ora.asmnet1.asmnetwork' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.ASMNET1LSNR_ASM.lsnr' on 'testdb03'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'testdb03' succeeded
CRS-2676: Start of 'ora.ASMNET1LSNR_ASM.lsnr' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'testdb03'
CRS-2676: Start of 'ora.ons' on 'testdb03' succeeded
CRS-2676: Start of 'ora.asm' on 'testdb03' succeeded
CRS-33676: Start of resource group 'ora.asmgroup' on server 'testdb03' succeeded.
CRS-2672: Attempting to start 'ora.DG_ARCH.dg' on 'testdb03'
CRS-2672: Attempting to start 'ora.DG_DATA.dg' on 'testdb03'
CRS-2672: Attempting to start 'ora.DG_REDO2.dg' on 'testdb03'
CRS-2672: Attempting to start 'ora.DG_REDO1.dg' on 'testdb03'
CRS-2672: Attempting to start 'ora.ons' on 'testdb02'
CRS-2676: Start of 'ora.DG_ARCH.dg' on 'testdb03' succeeded
CRS-2676: Start of 'ora.DG_REDO2.dg' on 'testdb03' succeeded
CRS-2676: Start of 'ora.DG_DATA.dg' on 'testdb03' succeeded
CRS-2676: Start of 'ora.DG_REDO1.dg' on 'testdb03' succeeded
CRS-2674: Start of 'ora.ons' on 'testdb02' failed
CRS-2679: Attempting to clean 'ora.testdb.db' on 'testdb03'
CRS-2681: Clean of 'ora.testdb.db' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.testdb.db' on 'testdb03'
CRS-5017: The resource action "ora.testdb.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "D:\oracle\app\grid\diag\crs\testdb03\crs\trace\crsd_oraagent_system.trc".
CRS-2674: Start of 'ora.testdb.db' on 'testdb03' failed
CRS-2672: Attempting to start 'ora.ons' on 'testdb02'
CRS-2674: Start of 'ora.ons' on 'testdb02' failed
CRS-2672: Attempting to start 'ora.testdb.db' on 'testdb03'
^C
C:\Windows\system32>

The issue persisted even after the server reboot.

DB Alert Log:

2025-08-14T00:18:59.264174+05:30
My CSS node number is 3
My CSS hostname is iprblcdb03
lmon registered with NM - instance number 3 (internal mem no 2)
2025-08-14T00:19:20.702698+05:30
Detected partial connectivity during CGS reconfiguration. No connectivity to instances: 1 2
As it is starting up, terminate the instance.
LMON (ospid: 19464): terminating the instance due to ORA error 481
Cause - 'Instance is being terminated by LMON due to partial connectivity'
2025-08-14T00:19:20.766701+05:30
System state dump requested by (instance=3, osid=19464 (LMON)), summary=[abnormal instance termination].
error - 'Instance is terminating.'
System State dumped to trace file D:\ORACLE\APP\ORACLE\diag\rdbms\testdb\testdb3\trace\testdb3_diag_17636.trc
2025-08-14T00:19:26.025616+05:30
Instance terminated by LMON, pid = 19464
2025-08-14T00:41:39.128615+05:30

As seen in the database alert log, the instance was terminated by the LMON process during startup. Let's review the LMON trace file for any related error messages.
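The LMON trace sits in the same ADR trace directory that the alert log references above, and it follows the usual <instance>_lmon_<ospid>.trc naming convention. A minimal sketch of locating the most recent one, reusing the diagnostic path shown in the alert log:

#List LMON trace files for the testdb3 instance, newest first.
C:\Windows\system32>dir /O-D "D:\ORACLE\APP\ORACLE\diag\rdbms\testdb\testdb3\trace\testdb3_lmon_*.trc"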
testdb3_lmon_25556.trc

================================
== System Network Information ==
================================
==[ Network Interface Query not supported ]==
== [ Network Transport Usage (ksipc: avail[x28] sel[UDP]) (IPv4) ] ==
kjxgmConnCheckAllInst_CB: dump the CGS communication info.
my inst 3 (node 2) inc (376 4) send-window 500 flags 0x8c100002 (b.)
ats-threshold 250
reg msgs: sz 504 blks 1 alloc 100 free 100
big msgs: sz 4112 blks 1 alloc 100 free 100
rcv msgs: sz 4112 blks 8 alloc 2000 free 0
total rcv msgs 11 size 1920 avgsz 174 (in bytes) total rcv bmsgs 0 embedded msgs 0 avg 0
total prc msgs 17 pending sends 0
conn[inst 1] fg 0x0 snds (14:0) qmsgs (0:0) bths (0:0) unas 8 unar 2:0 nreq 0 nack (0:0)
conn[inst 2] fg 0x0 snds (8:0) qmsgs (0:0) bths (0:0) unas 6 unar 1:0 nreq 0 nack (0:0)
conn[inst 3] fg 0x0 snds (6:0) qmsgs (0:0) bths (0:0) unas 0 unar 0:0 nreq 0 nack (0:0)
==============================
Instance Interface Information
==============================
num of pending ping requests = 0
inst 1 nifs 0 overall state (0x20:succ)
inst 2 nifs 0 overall state (0x20:succ)
inst 3 nifs 1 (our node) real IP 10.191.137.133
Detected partial connectivity during CGS reconfig
No connectivity to instances: 1 2
Terminating the instance
2025-08-13 23:25:20.028 :kjzduptcctx(): Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+485<-kjzdssdmp()+270<-kjzduptcctx()+944<-kjzdicrshnfy()+1462<-ksuitm_opt()+2163<-kjxgmConnCheckAllInst_CB()+584<-kjxgm_ConnectivityReq_InvokeCB()+554<-kjxgm_ConnectivityReq_Process()+221<-kjxgmpoll()+1472<-kjxggpoll()+488<-kjfmact()+360<-kjfcln()+11511<-ksbrdp()+1548<-opirip()+631<-opidrv()+851<-sou2o()+110<-opimai_real()+283<-opimai()+268<-BackgroundThreadStart()+580<-0x00007FFEEB007AC4
----- End of Abridged Call Stack Trace -----
Partial short call stack signature: 0x577baf1a2d9e368f
*** 2025-08-13T23:25:20.048779+05:30
LMON (ospid: 25556): terminating the instance due to ORA error 481
Cause - 'Instance is being terminated by LMON due to partial connectivity'

The LMON trace confirms that the Node3 RDBMS instance could not communicate with instances 1 and 2 over the private interconnect. LMON detected this partial connectivity during the CGS reconfiguration and terminated the Node3 instance while it was starting up, which is why the database failed to start. Everything therefore pointed to a problem on the private interconnect network.

The newly configured Ethernet card assigned to the private interconnect on Node3 was reviewed for misconfigurations or hardware issues, and ping tests were executed against all private interconnect IPs from all nodes to verify end-to-end connectivity and check for packet loss.

Normal Ping:
C:\Windows\system32>ping Private_IP_Node1
C:\Windows\system32>ping Private_IP_Node2
C:\Windows\system32>ping Private_IP_Node3

MTU Ping:
C:\Windows\system32>ping -f -l 9000 Private_IP_Node1
C:\Windows\system32>ping -f -l 9000 Private_IP_Node2
C:\Windows\system32>ping -f -l 9000 Private_IP_Node3

C:\Windows\system32>ping -f -l 9000 Private_IP_Node3
Pinging <Private_IP_Node3> with 9000 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

The normal ping tests were successful, but the ping with the large packet size and the Don't Fragment flag (-f) failed only for Node3. As a result, the MTU setting of the private interconnect Ethernet card on Node3 was cross-checked for a potential misconfiguration.
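One caveat worth noting: on Windows, the -l switch sets only the ICMP payload size, and the IP and ICMP headers add another 28 bytes on top of it. A stricter variant of the same test, assuming a 9000-byte MTU target, subtracts that overhead:

#9000 (MTU) - 20 (IP header) - 8 (ICMP header) = 8972 bytes of payload.
C:\Windows\system32>ping -f -l 8972 Private_IP_Node1
C:\Windows\system32>ping -f -l 8972 Private_IP_Node2
C:\Windows\system32>ping -f -l 8972 Private_IP_Node3

Either way, the telling symptom here is the contrast between nodes: the same large-packet ping behaved differently only against Node3.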
#Command to check the MTU of all Ethernet cards on Windows Server.
C:\Windows\system32>netsh interface ipv4 show subinterface

       MTU  MediaSenseState   Bytes In   Bytes Out  Interface
----------  ---------------  ---------  ----------  -------------
4294967295                1          0       49704  Loopback Pseudo-Interface 1
      9000                1  170675828   217183734  pubnet
      9000                1    1034136      259307  Ethernet2
      1500                1  448185024   269497005  privnet

The issue was an MTU mismatch. The newly configured private interconnect Ethernet card on Node3 had its MTU set to 1500, whereas the expected value across all nodes was 9000. Node1 and Node2 were correctly configured with an MTU of 9000, while Node3 remained at 1500, and this mismatch caused the private interconnect communication failures.

Steps to resolve the MTU mismatch issue:

Stop the cluster again on Node3 and change the MTU value to 9000 for the privnet Ethernet card.

#Stop the cluster on Node3.
C:\Windows\system32>crsctl stop crs -f

#Set the MTU to 9000 for the private interconnect card. Coordinate with the System Admin team before changing the MTU value.
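The exact command depends on how the MTU is managed on the server (NIC driver properties vs. the IP stack). A minimal sketch, assuming the interface is named privnet as shown above and the change is made at the IP stack level:

#Hypothetical example: persistently set the MTU of the "privnet" interface to 9000.
C:\Windows\system32>netsh interface ipv4 set subinterface "privnet" mtu=9000 store=persistent

Depending on the NIC driver, jumbo frames may also need to be enabled in the adapter's advanced properties, which is why this step is coordinated with the System Admin team.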
#Verify the MTU value after the change.
C:\Windows\system32>netsh interface ipv4 show subinterface

       MTU  MediaSenseState   Bytes In   Bytes Out  Interface
----------  ---------------  ---------  ----------  -------------
4294967295                1          0       81428  Loopback Pseudo-Interface 1
      9000                1  177182139   226258472  pubnet
      9000                1       8233       14568  privnet
      9000                1    1091504      300327  Ethernet2
      1500                1  458750868   275851852  privnet_new

#Start the cluster.
C:\Windows\system32>crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'testdb03'
CRS-2672: Attempting to start 'ora.evmd' on 'testdb03'
CRS-2676: Start of 'ora.mdnsd' on 'testdb03' succeeded
CRS-2676: Start of 'ora.evmd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'testdb03'
CRS-2676: Start of 'ora.gpnpd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'testdb03'
CRS-2676: Start of 'ora.gipcd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'testdb03'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'testdb03'
CRS-2676: Start of 'ora.crf' on 'testdb03' succeeded
CRS-2676: Start of 'ora.cssdmonitor' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'testdb03'
CRS-2676: Start of 'ora.cssd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'testdb03'
CRS-2676: Start of 'ora.ctssd' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'testdb03'
CRS-2676: Start of 'ora.asm' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'testdb03'
CRS-2676: Start of 'ora.storage' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'testdb03'
CRS-2676: Start of 'ora.crsd' on 'testdb03' succeeded
CRS-6017: Processing resource auto-start for servers: testdb03
CRS-2673: Attempting to stop 'ora.testdb03.vip' on 'testdb02'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'testdb01'
CRS-2672: Attempting to start 'ora.ons' on 'testdb03'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'testdb01' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'testdb01'
CRS-2677: Stop of 'ora.testdb03.vip' on 'testdb02' succeeded
CRS-2672: Attempting to start 'ora.testdb03.vip' on 'testdb03'
CRS-2677: Stop of 'ora.scan1.vip' on 'testdb01' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'testdb03'
CRS-2676: Start of 'ora.testdb03.vip' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.LISTENER.lsnr' on 'testdb03'
CRS-2676: Start of 'ora.scan1.vip' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'testdb03'
CRS-2676: Start of 'ora.LISTENER.lsnr' on 'testdb03' succeeded
CRS-33672: Attempting to start resource group 'ora.asmgroup' on server 'testdb03'
CRS-2672: Attempting to start 'ora.asmnet1.asmnetwork' on 'testdb03'
CRS-2676: Start of 'ora.asmnet1.asmnetwork' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.ASMNET1LSNR_ASM.lsnr' on 'testdb03'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'testdb03' succeeded
CRS-2676: Start of 'ora.ASMNET1LSNR_ASM.lsnr' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'testdb03'
CRS-2676: Start of 'ora.ons' on 'testdb03' succeeded
CRS-2676: Start of 'ora.asm' on 'testdb03' succeeded
CRS-33676: Start of resource group 'ora.asmgroup' on server 'testdb03' succeeded.
CRS-2672: Attempting to start 'ora.DG_DATA.dg' on 'testdb03'
CRS-2676: Start of 'ora.DG_DATA.dg' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.DG_ARCH.dg' on 'testdb03'
CRS-2672: Attempting to start 'ora.DG_REDO1.dg' on 'testdb03'
CRS-2672: Attempting to start 'ora.DG_REDO2.dg' on 'testdb03'
CRS-2672: Attempting to start 'ora.ons' on 'testdb02'
CRS-2676: Start of 'ora.DG_ARCH.dg' on 'testdb03' succeeded
CRS-2676: Start of 'ora.DG_REDO1.dg' on 'testdb03' succeeded
CRS-2676: Start of 'ora.DG_REDO2.dg' on 'testdb03' succeeded
CRS-2674: Start of 'ora.ons' on 'testdb02' failed
CRS-2679: Attempting to clean 'ora.testdb.db' on 'testdb03'
CRS-2681: Clean of 'ora.testdb.db' on 'testdb03' succeeded
CRS-2672: Attempting to start 'ora.testdb.db' on 'testdb03'
CRS-2676: Start of 'ora.testdb.db' on 'testdb03' succeeded
CRS-6016: Resource auto-start has completed for server testdb03
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.
C:\Windows\system32>

After the correct MTU value was configured on the newly provided Ethernet card, the RDBMS instance on Node3 started successfully without any further problems.

✅ The database alert log was clean and showed no error messages.
✅ The issue was resolved by setting the correct MTU value and restarting the cluster.
✅ The database started successfully without any further errors.
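For future reference, a few commands help confirm the cluster and database state after a change like this. A minimal sketch, assuming the Grid Infrastructure tools are in PATH and the database name is testdb (as in the resource name ora.testdb.db):

#Check the status of all cluster resources across the nodes.
C:\Windows\system32>crsctl stat res -t

#Check the database instance status on all nodes.
C:\Windows\system32>srvctl status database -d testdb

#Confirm which interface and subnet are registered as the cluster interconnect.
C:\Windows\system32>oifcfg getif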
🎉 Enjoy the troubleshooting journey!!!
📝 Stay tuned for a detailed blog post on this case!!!

Thanks for reading this post! Please comment if you liked it, and click FOLLOW to get future blog updates!