Database : Oracle Database 19.28 EE
Operating System : Redhat Linux 8.10 64 Bit
Database Type : Oracle RAC DB
Incident Summary: One of the critical production applications started encountering intermittent failures with the error: Application Log Error Message: org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1590) org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) java.lang.Thread.run(Thread.java:750) Caused by exception: oracle.jdbc.OracleDatabaseException Message: OR***** cannot use system rollback segment for non-system tablespace 'APPTBS_A1' Stacktrace: oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:498) oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:446) oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1054) oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:623) oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:252) The application was working without any issues on the existing 2-node Oracle RAC database. The issue was observed only after the RAC cluster was expanded from 2 nodes to 4 nodes by adding two additional instances. This is a 4-node Oracle RAC database environment, and the issue started immediately after the new nodes were introduced into the cluster. Application team confirmed that there were no code or configuration changes at the application level. The application primarily performs INSERT operations. Impact : - Approximately 80–90% of transactions were successful. - Around 10% of transactions failed intermittently. - Failures resulted in transaction declines. - Although the impact was partial, this was a production system, making the issue critical. Let's troubleshoot the issue...Step 1: Check the Node Addition Activity logs for any error messages. Let's check the Node addition activity for any error messages since this was the only change in the environment. The runcluvfy for nodeadd was executed to ensure cluster configuration after node addition is correct. cluvfy stage -post nodeadd -n testdbnode1,testdbnode2 -verbose Both nodes were added correctly in the cluster. No issues observed. Step 2: Now check the UNDO configuration across all nodes. As per the error message, the application performs its DML operations on the application tablespace APPTBS_A1. However, during certain transactions, Oracle attempts to use the SYSTEM rollback segment for undo generation. Oracle does not allow rollback of DML operations on non-SYSTEM tablespaces using the SYSTEM rollback segment, and this is expected behavior, which results in the error "ORA-01552: cannot use system rollback segment for non-system tablespace". This indicates that Oracle was unable to allocate undo from the configured UNDO tablespace for the instance handling the session. As a fallback, Oracle attempted to use the default SYSTEM rollback segment, which is disallowed for application tablespaces. Therefore, the issue is not with the application tablespace [APPTBS_A1], but with the UNDO availability or configuration at the instance level. The primary focus of the investigation is: - Why Oracle internally attempted to use the SYSTEM rollback segment if UNDO tablespace of RAC instances are available ? - What is incorrect or unavailable with the existing UNDO tablespaces on the affected RAC instances ? It is important to note that UNDO tablespaces are configured for all four RAC instances; however, despite this configuration, one or more instances were unable to use their assigned UNDO tablespace during runtime, leading to the intermittent failures. Let us review the existing UNDO tablespace configuration in the database. The initial investigation focuses on verifying whether UNDO_MANAGEMENT is set to MANUAL. SQL> select inst_id,name,value from gv$parameter where name like '%undo%' order by 1; INST_ID NAME VALUE ------- ------------------- -------- 1 undo_management AUTO 2 undo_management AUTO 3 undo_management AUTO 4 undo_management AUTO 1 undo_tablespace UNDOTBS1 2 undo_tablespace UNDOTBS2 3 undo_tablespace UNDOTBS3 4 undo_tablespace UNDOTBS4 1 undo_retention 1800 2 undo_retention 1800 3 undo_retention 1800 4 undo_retention 1800 In this case, the UNDO_MANAGEMENT parameter is set to AUTO, confirming that automatic undo management is enabled. What should be the next course of action? The initial question that comes to mind is: although UNDO_MANAGEMENT is set to AUTO on all RAC instances, why did this error occur? Let's re-create the UNDO tablespaces for all RAC instances and drop the existing ones. Step 3: Create new UNDO tablespace for all RAC instances and drop existing UNDO tablespace. SQL> CREATE UNDO TABLESPACE UNDOTBS1_NEW DATAFILE '+DG_DATA' size 31g; SQL> CREATE UNDO TABLESPACE UNDOTBS2_NEW DATAFILE '+DG_DATA' size 31g; SQL> CREATE UNDO TABLESPACE UNDOTBS3_NEW DATAFILE '+DG_DATA' size 31g; SQL> CREATE UNDO TABLESPACE UNDOTBS4_NEW DATAFILE '+DG_DATA' size 31g; SQL> ALTER SYSTEM SET UNDO_TABLESPACE=UNDOTBS1_NEW sid='testdb1' scope=spfile; SQL> ALTER SYSTEM SET UNDO_TABLESPACE=UNDOTBS2_NEW sid='testdb2' scope=spfile; SQL> ALTER SYSTEM SET UNDO_TABLESPACE=UNDOTBS3_NEW sid='testdb3' scope=spfile; SQL> ALTER SYSTEM SET UNDO_TABLESPACE=UNDOTBS4_NEW sid='testdb4' scope=spfile; SQL> set lines 300 pages 3000 SQL> col name for a16 SQL> SELECT r.name,b.status FROM v$rollname r,v$rollstat s WHERE r.usn = s.usn AND r.name IN ( SELECT segment_name FROM dba_segments WHERE tablespace_name = '&TBS' ); SQL> alter system kill session '&sid,&serial,@&inst_id' immediate; SQL> DROP TABLESPACE UNDOTBS1 INCLUDING CONTENTS AND DATAFILES; SQL> DROP TABLESPACE UNDOTBS2 INCLUDING CONTENTS AND DATAFILES; SQL> DROP TABLESPACE UNDOTBS3 INCLUDING CONTENTS AND DATAFILES; SQL> DROP TABLESPACE UNDOTBS4 INCLUDING CONTENTS AND DATAFILES; SQL> SHUT IMMEDIATE; SQL> STARTUP; SQL> select inst_id,name,value from gv$parameter where name like '%undo%' order by 1; INST_ID NAME VALUE ------- ------------------- -------- 1 undo_management AUTO 2 undo_management AUTO 3 undo_management AUTO 4 undo_management AUTO 1 undo_tablespace UNDOTBS1_NEW 2 undo_tablespace UNDOTBS2_NEW 3 undo_tablespace UNDOTBS3_NEW 4 undo_tablespace UNDOTBS4_NEW 1 undo_retention 1800 2 undo_retention 1800 3 undo_retention 1800 4 undo_retention 1800 Even after re-creating the UNDO tablespace, the application team continued to encounter the same issue. This raises the question: what should be the next course of action, and what could be the underlying cause? Step 4: Check if this database is cloned database. One expected scenario for this type of error is when the database is a cloned database. In such cases, any data modification outside the SYSTEM tablespace results in this error because only the SYSTEM rollback segment can be online in a cloned database. In such cases, you may need to apply the following workaround to resolve the issue. sql> shutdown immediate; # set below parameters in pfile. undo_management=manual rollback_segments=(ROLLBACK_PRIV_01, ROLLBACK_PRIV_02) sql> startup mount; sql> create private rollback segment ROLLBACK_PRIV_01 tablespace undotbs1; sql> create private rollback segment ROLLBACK_PRIV_02 tablespace undotbs1; sql> alter database open; Here, as checked with concern DBA team, they confirmed that this database was not a cloned database. Therefore, the above workaround is not applicable and will not resolve the issue. This leads to the next question: what should be investigated next, and what could be the underlying cause of this behavior? Step 5: Let's enable system level tracing at DB level to capture this ORA-01552 error message. SQL> alter system set timed_statistics=true; SQL> alter system set statistics_level=all; SQL> alter system set max_dump_file_size=unlimited; SQL> alter system set events '1552 trace name errorstack level 3'; Request the application team to reproduce the issue, and once the reproduction is complete, disable tracing. SQL> alter system set events '1552 trace name errorstack off'; #Check all instances alert log file for any ORA_ messages. [oracle@testdbnode1 trace]$ grep -i ORA-01552 alert_testdb1.log [oracle@testdbnode1 trace]$ [oracle@testdbnode2 trace]$ grep -i ORA-01552 alert_testdb2.log [oracle@testdbnode2 trace]$ [oracle@testdbnode3 trace]$ grep -i ORA-01552 alert_testdb3.log [oracle@testdbnode3 trace]$ [oracle@testdbnode4 trace]$ grep -i ORA-01552 alert_testdb4.log [oracle@testdbnode4 trace]$ It was observed that no ORA- error messages were reported in the database alert log files across all RAC instances; however, the same error continued to appear in the application logs. In this scenario, enabling database tracing would not be effective. Since the ORA- error is not recorded in the database alert logs, no corresponding trace files are generated at the database level. As a result, tracing does not provide additional diagnostic information when the error is observed only at the application front end and not within the database logs. Step 6: Let us check whether any space-related jobs are enabled in the database that could be causing this issue. As observed, the Oracle default maintenance jobs are enabled in the database. Let us disable these jobs and monitor whether the issue reoccurs. SQL> set lines 300 pages 3000 SQL> col CLIENT_NAME for a40 SQL> col status for a15 SQL> select distinct CLIENT_NAME ,status from dba_autotask_client; CLIENT_NAME STATUS ------------------------------------ -------- sql tuning advisor ENABLED auto space advisor ENABLED auto optimizer stats collection ENABLED #To disable default maintenance tasks BEGIN dbms_auto_task_admin.disable( client_name => 'sql tuning advisor', operation => NULL, window_name => NULL); END; / BEGIN dbms_auto_task_admin.disable( client_name => 'auto space advisor', operation => NULL, window_name => NULL); END; / BEGIN dbms_auto_task_admin.disable( client_name => 'auto optimizer stats collection', operation => NULL, window_name => NULL); END; / SQL> set lines 300 pages 3000 SQL> col CLIENT_NAME for a40 SQL> col status for a15 SQL> select distinct CLIENT_NAME ,status from dba_autotask_client; CLIENT_NAME STATUS --------------------------------- --------- sql tuning advisor DISABLED auto space advisor DISABLED auto optimizer stats collection DISABLED Even after disabling the default maintenance jobs, the application continued to encounter the same error. At this stage: - All standard troubleshooting steps have been performed. - The application team and customer are under significant stress. - The issue is affecting production, and the customer is understandably unhappy. Observations: - The error persists intermittently despite UNDO tablespace recreation, database restart, and disabling maintenance jobs. - No ORA- messages are observed in the database alert logs. - Tracing at the database level does not help because the error is visible only at the application front end. What is the next course of action to troubleshoot and achieve a permanent resolution of this issue? Step 7: Let's check the UNDO tablespace space pressure. SQL> set lines 300 pages 3000 SQL> col file_name for a65 SQL> col TABLESPACE_NAME for a30 SQL> SELECT A.TABLESPACE_NAME,round(SUM(A.SUM_TOT)/1024/1024) "Total size MB", round(SUM(A.SUM_BYTES)/1024/1024) "Total Free MB", (SUM(A.SUM_BYTES)*100/SUM(A.SUM_TOT)) "%FREE", 100-(SUM(A.SUM_BYTES)*100/SUM(A.SUM_TOT)) "%USED", round(SUM(A.LARGEST)/1024/1024) MAX_FREE,SUM(A.CHUNKS) CHUNKS_FREE FROM ( SELECT TABLESPACE_NAME,0 SUM_TOT,SUM(BYTES) SUM_BYTES, MAX(BYTES) LARGEST,COUNT(*) CHUNKS FROM SYS.DBA_FREE_SPACE A GROUP BY TABLESPACE_NAME UNION SELECT TABLESPACE_NAME,SUM(BYTES) SUM_TOT,0,0,0 FROM SYS.DBA_DATA_FILES GROUP BY TABLESPACE_NAME) A, V$INSTANCE B GROUP BY UPPER(B.INSTANCE_NAME),A.TABLESPACE_NAME having (100-(SUM(A.SUM_BYTES)*100/SUM(A.SUM_TOT))) > 0 / TABLESPACE_NAME Total size MB Total Free MB %FREE %USED MAX_FREE CHUNKS_FREE --------------- ------------- ------------- ---------- ---------- -------- ----------- SYSTEM 11160 9922 88.90625 11.09375 3968 6 USERS 2053 2044 99.5433512 .456648807 2043 3 SYSAUX 43180 26484 61.3342404 38.6657596 3968 10 UNDOTBS1_NEW 383170 380971 99.4261196 .573880393 3968 864 UNDOTBS2_NEW 384550 382471 99.4593356 .540664413 3968 865 UNDOTBS3_NEW 319600 318050 99.5149797 .485020338 3968 274 UNDOTBS4_NEW 447860 445785 99.5366018 .46339816 3968 351 You can use the above script to monitor continuous free space in the UNDO tablespace. This confirms that, in this case, the UNDO tablespace was not under pressure during the issue window. The script can be modified as per the requirement. What should be the next course of action? Step 8: Let's check the rollback segment count for all RAC DB instances. SQL> select status,count(*),TABLESPACE_NAME from dba_rollback_segs group by STATUS,TABLESPACE_NAME order by 1; STATUS COUNT(*) TABLESPACE_NAME ---------- ---------- ---------------- ONLINE 1 SYSTEM ONLINE 375 UNDOTBS1_NEW ONLINE 360 UNDOTBS2_NEW OFFLINE 11727 UNDOTBS1_NEW OFFLINE 11000 UNDOTBS2_NEW OFFLINE 9297 UNDOTBS3_NEW Here, the total count (1+375+360+11727+11000+9297) is = 32760 Please note that there is the hard limit of the total undo segments(ONLINE+OFFLINE) across all undo tablespaces in the entire database. This is also the hard limit of all the Instances of RAC database. The limit is 32760 and is defined by the underscore parameter "_rollback_segment_count". This is instance specific parameter. The "_rollback_segment_count" parameter controls the number of undo segments that remain online over time, helping to balance workload across multiple segments. Here, we can observe that the number of offline rollback segments is higher on Node1, Node2, and Node3. This caused the "_rollback_segment_count" parameter to reach its hard limit, which may have led Oracle to use the SYSTEM tablespace for rollback operations. If this is the case, what would be the appropriate solution to resolve the issue? Step 9: Let's apply a workaround to fix the issue. Please perform the following steps during non-business hours, when system load is minimal, or during an approved downtime window. #Check the total number of rollback segment count for all instances. SQL> select status,count(*),TABLESPACE_NAME from dba_rollback_segs group by STATUS,TABLESPACE_NAME order by 1; STATUS COUNT(*) TABLESPACE_NAME ---------- ---------- ---------------- ONLINE 1 SYSTEM ONLINE 375 UNDOTBS1_NEW ONLINE 360 UNDOTBS2_NEW OFFLINE 11727 UNDOTBS1_NEW OFFLINE 11000 UNDOTBS2_NEW OFFLINE 9297 UNDOTBS3_NEW #Enable SMU Debug mode. SQL> alter system set "_smu_debug_mode"=44; #Execute below dynamic SQL to list too many offline rollback segments. SQL> select 'drop rollback segment ' ||segment_name|| ';' from dba_rollback_segs where status = 'OFFLINE' and tablespace_name in('UNDOTBS1_NEW','UNDOTBS2_NEW','UNDOTBS3_NEW','UNDOTBS4_NEW'); #Check the rollback segment count again. SQL> select status,count(*),TABLESPACE_NAME from dba_rollback_segs group by STATUS,TABLESPACE_NAME order by 1; STATUS COUNT(*) TABLESPACE_NAME ---------- ---------- --------------- ONLINE 1 SYSTEM ONLINE 868 UNDOTBS1_NEW ONLINE 191 UNDOTBS2_NEW ONLINE 278 UNDOTBS3_NEW ONLINE 283 UNDOTBS4_NEW OFFLINE 23 UNDOTBS3_NEW OFFLINE 7 UNDOTBS4_NEW #Disable SMU debug mode in memory. SQL> alter system set "_smu_debug_mode"=0; OR #Disable SMU debug mode in spfile. Needs a DB restart. SQL> alter system reset "_smu_debug_mode" scope=spfile; SQL> shut immediate; SQL> startup; Step 10: Now ask the Application team to monitor and report if ORA-01552 occurs again. ✅ Issue Resolution Confirmation: After applying the above workaround, the database was closely monitored for over one week, and no reoccurrence of the issue was observed. The Application team also confirmed that there were no further transaction declines, and they have not encountered the ORA-01552 error again. 📌 Final Conclusion: This clearly indicates that the issue was successfully resolved by dropping the excessive offline rollback segments from DB, thereby stabilizing undo management and preventing Oracle from using the SYSTEM tablespace for rollback operations. 🎉 Enjoy the troubleshooting journey!!! 📝 Stay tuned for a detailed blog post on this case !!! |
Thanks for reading this post ! Please comment if you like this post ! Click FOLLOW to get future blog updates !
Thank you for visiting my blog ! Thanks for your comment !