While checking a customer environment, I noticed that the Repository Database of an Enterprise Manager Cloud Control installation was down. According to the alert log, some processes had hung during the night and the instance was terminated by PMON. Every subsequent startup of the database failed with an ORA-01092 error.
In this blog post, I will demonstrate how to analyze a memory issue which can lead to an ORA-01092 error.
HugePages
The database was configured to use HugePages – see my blog post Verwendung von HugePages (German) or My Oracle Support Note 361468.1 for more details about this topic.
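As a quick reference: on Linux, the HugePages pool is typically reserved via the vm.nr_hugepages kernel parameter. The sketch below is only an illustration (the value matches the pool size shown later in this post, not necessarily what you should configure); see the MOS note above for how to size the pool correctly.

# /etc/sysctl.conf -- illustrative value, size the pool for your SGA
vm.nr_hugepages = 8194

# apply the setting; a reboot may still be needed if memory is too fragmented
$> sysctl -p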
If the system does not have enough HugePages to hold the complete SGA, Oracle will use either a combination of HugePages and small pages or small pages only. To force Oracle to use HugePages exclusively, set the initialization parameter use_large_pages to ONLY.
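A minimal sketch of how this could be set, assuming the database uses an spfile (the parameter is not dynamic, so the instance has to be restarted for it to take effect):

$> sqlplus / as sysdba
SQL> ALTER SYSTEM SET use_large_pages=ONLY SCOPE=SPFILE;
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP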
Information about the usage of HugePages/small pages is written to the alert log at each instance startup.
*********************************************************************
 Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

 Per process system memlock (soft) limit = UNLIMITED

 Expected per process system memlock (soft) limit to lock
 SHARED GLOBAL AREA (SGA) into memory: 16G

 Available system pagesizes:
  4K, 2048K

 Supported system pagesize(s):
  PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)
        4K       Configured              10               10  NONE
     2048K             8194            8193             8193  NONE
**********************************************************************
You can use /proc/meminfo to see the amount of configured, free and reserved HugePages.
$> grep "HugePages_" /proc/meminfo
HugePages_Total:    8194
HugePages_Free:        2
HugePages_Rsvd:        1
HugePages_Surp:        0
Problem
When I tried to start up the database using SQL*Plus, I got an ORA-03113 error and the instance crashed.
ORA-03113: end-of-file on communication channel
In the alert log I could observe the following error information.
2018-07-27T08:38:07.788866+02:00
PMON (ospid: 16179): terminating the instance due to error 12752
2018-07-27T08:38:07.798599+02:00
System state dump requested by (instance=1, osid=16179 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u00/app/oracle/diag/rdbms/emrep/EMREP/trace/EMREP_diag_16204_20180727083807.trc
2018-07-27T08:38:07.950150+02:00
Dumping diagnostic data in directory=[cdmp_20180727083807], requested by (instance=1, osid=16179 (PMON)), summary=[abnormal instance termination].
2018-07-27T08:38:08.302669+02:00
Errors in file /u00/app/oracle/diag/rdbms/emrep/EMREP/trace/EMREP_lgwr_16228.trc (incident=31577):
ORA-1092 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u00/app/oracle/diag/rdbms/emrep/EMREP/incident/incdir_31577/EMREP_lgwr_16228_i31577.trc
2018-07-27T08:38:08.430036+02:00
Errors in file /u00/app/oracle/diag/rdbms/emrep/EMREP/trace/EMREP_lgwr_16228.trc:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-01092: ORACLE instance terminated. Disconnection forced
Errors in file /u00/app/oracle/diag/rdbms/emrep/EMREP/trace/EMREP_lgwr_16228.trc (incident=31578):
ORA-1092 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u00/app/oracle/diag/rdbms/emrep/EMREP/incident/incdir_31578/EMREP_lgwr_16228_i31578.trc
2018-07-27T08:38:09.205225+02:00
Instance terminated by PMON, pid = 16179
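If you want to look at the incident details referenced above, the ADR command interpreter is a convenient way to do so. A rough sketch, with the ADR home derived from the directory structure shown in the log:

$> adrci
adrci> set home diag/rdbms/emrep/EMREP
adrci> show incident -mode detail -p "incident_id=31577"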
Analysis
First, I checked the memory allocation on the server, starting with the HugePages. There is only one database on this server, and it was currently down, so no HugePages should have been allocated. But this was not the case.
$> grep "HugePages_" /proc/meminfo
HugePages_Total:    8194
HugePages_Free:        2
HugePages_Rsvd:        1
HugePages_Surp:        0
All HugePages were still in use. To find the processes that were still holding HugePages, you can grep through the numa_maps file of each process.
$> grep huge /proc/*/numa_maps
/proc/16812/numa_maps:60000000 default file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=2 N0=2 kernelpagesize_kB=2048
/proc/16812/numa_maps:62000000 default file=/SYSV00000000\040(deleted) huge dirty=2643 mapmax=2 N0=2643 kernelpagesize_kB=2048
/proc/16812/numa_maps:460000000 default file=/SYSV00000000\040(deleted) huge
/proc/16814/numa_maps:60000000 default file=/SYSV00000000\040(deleted) huge dirty=2 mapmax=2 N0=2 kernelpagesize_kB=2048
/proc/16814/numa_maps:62000000 default file=/SYSV00000000\040(deleted) huge dirty=2644 mapmax=2 N0=2644 kernelpagesize_kB=2048
/proc/16814/numa_maps:460000000 default file=/SYSV00000000\040(deleted) huge
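If many processes show up, a small variation of the grep makes it easier to extract just the PIDs. This is a convenience sketch, not part of the original analysis:

$> grep -l huge /proc/*/numa_maps | cut -d/ -f3 | sort -u
16812
16814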
Processes 16812 and 16814 still had HugePages in use. With ps you can get further information about them.
$> ps 16812
  PID TTY      STAT   TIME COMMAND
16812 ?        Ds     0:00 oracleEMREP (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
Both processes belong to the Repository Database. But why are they still running? Check the STAT column: status D means they are in an “uninterruptible sleep”. Even killing these processes would not work. For more details about what happens behind the scenes, see https://eklitzke.org/uninterruptible-sleep.
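To illustrate the point: even a SIGKILL does not remove a process stuck in state D, because the signal is only delivered once the process leaves the uninterruptible sleep. The output below is illustrative only, using the PIDs from above:

$> kill -9 16812 16814
$> ps -o pid,stat,cmd -p 16812,16814
  PID STAT CMD
16812 Ds   oracleEMREP (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
16814 Ds   oracleEMREP (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))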
Solution
Normally, the only way to get rid of such processes is to reboot the whole server.
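After the reboot, a quick look at /proc/meminfo should show the HugePages pool as completely free again before you start the database. The output below is illustrative, based on the pool size configured on this server:

$> grep "HugePages_" /proc/meminfo
HugePages_Total:    8194
HugePages_Free:     8194
HugePages_Rsvd:        0
HugePages_Surp:        0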