Instance crashed with ORA-01092 Error

While I was checking a customer environment, I saw that the Repository Database of an Enterprise Manager Cloud Control installation was down. Due to the alert log, there was a hung of processes in the night and the instance was killed by the PMON. Every subsequent startup of the database failed with an ORA-01092 error.

In this blog post, I will demonstrate how to analyze a memory issue which can lead to an ORA-01092 error.

HugePages

The database was configured to use HugePages – see my blog post Verwendung von HugePages (German) or My Oracle Support Note 361468.1 for more details about this topic.

If the system has not enough HugePages to hold the complete SGA, Oracle will use either a combination of HugePages and SmallPages or SmallPages only. To force Oracle to use HugePages only, you have to set use_large_pages to ONLY.

Information about the usage of HugePages/SmallPages are written to the alert log before each instance startup.

*********************************************************************
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

Per process system memlock (soft) limit = UNLIMITED
Expected per process system memlock (soft) limit to lock
SHARED GLOBAL AREA (SGA) into memory: 16G
Available system pagesizes:
4K, 2048K
Supported system pagesize(s):
PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s)
4K Configured 10 10 NONE
2048K 8194 8193 8193 NONE
**********************************************************************

You can use /proc/meminfo to see the amount of configured, free and reserved HugePages.

$> grep "HugePages_" /proc/meminfo
HugePages_Total:  8194
HugePages_Free:      2
HugePages_Rsvd:      1
HugePages_Surp:      0

Problem

When I tried to start up the database using SQL*Plus, I got an ORA-03113 error and the instance crashed.

ORA-03113: end-of-file on communication channel

In the alert log I could observe the following error information.

2018-07-27T08:38:07.788866+02:00
PMON (ospid: 16179): terminating the instance due to error 12752
2018-07-27T08:38:07.798599+02:00
System state dump requested by (instance=1, osid=16179 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u00/app/oracle/diag/rdbms/emrep/EMREP/trace/EMREP_diag_16204_20180727083807.trc
2018-07-27T08:38:07.950150+02:00
Dumping diagnostic data in directory=[cdmp_20180727083807], requested by (instance=1, osid=16179 (PMON)), summary=[abnormal instance termination].
2018-07-27T08:38:08.302669+02:00
Errors in file /u00/app/oracle/diag/rdbms/emrep/EMREP/trace/EMREP_lgwr_16228.trc (incident=31577):
ORA-1092 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u00/app/oracle/diag/rdbms/emrep/EMREP/incident/incdir_31577/EMREP_lgwr_16228_i31577.trc
2018-07-27T08:38:08.430036+02:00
Errors in file /u00/app/oracle/diag/rdbms/emrep/EMREP/trace/EMREP_lgwr_16228.trc:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-01092: ORACLE instance terminated. Disconnection forced
Errors in file /u00/app/oracle/diag/rdbms/emrep/EMREP/trace/EMREP_lgwr_16228.trc (incident=31578):
ORA-1092 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u00/app/oracle/diag/rdbms/emrep/EMREP/incident/incdir_31578/EMREP_lgwr_16228_i31578.trc
2018-07-27T08:38:09.205225+02:00
Instance terminated by PMON, pid = 16179

Analysis

First, I checked the memory allocation on the server, starting with the HugePages. There is only one database on the server, which has currently the status down. So there should be no HugePages allocation. But this was not the case.

$> grep "HugePages_" /proc/meminfo
HugePages_Total:  8194
HugePages_Free:      2
HugePages_Rsvd:      1
HugePages_Surp:      0

All HugePages are still in use. To get the processes, which are still allocating HugePages, you can grep through the numa_maps file of each process.

$> grep huge /proc/*/numa_maps
/proc/16812/numa_maps:60000000 default file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=2 N0=2 kernelpagesize_kB=2048
/proc/16812/numa_maps:62000000 default file=/SYSV00000000\040(deleted) huge dirty=2643 mapmax=2 N0=2643 kernelpagesize_kB=2048
/proc/16812/numa_maps:460000000 default file=/SYSV00000000\040(deleted) huge
/proc/16814/numa_maps:60000000 default file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=2 N0=2 kernelpagesize_kB=2048
/proc/16814/numa_maps:62000000 default file=/SYSV00000000\040(deleted) huge dirty=2644 mapmax=2 N0=2644 kernelpagesize_kB=2048
/proc/16814/numa_maps:460000000 default file=/SYSV00000000\040(deleted) huge

Processes 16812 and 16814 have HugePages in use. With ps you can get further information.

$> ps 16812
PID TTY STAT TIME COMMAND
16812 ? Ds 0:00 oracleEMREP (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

Both processes belong to the Repository Database. Buy why are they still running? Check the STAT column, status D means they are in an “uninterruptible sleep”. Even killing these processes would not work. For more details what happens behind the scene, check https://eklitzke.org/uninterruptible-sleep.

Solution

Normally the only solution to get rid of these processes is to restart the whole server.

References