ExaCC: Continue failed upgrade of Grid Infrastructure to 23ai

In this blog post, I show how to identify and fix the root cause of a failed Grid Infrastructure upgrade to 23ai, and how to complete the upgrade so that the VM Cluster returns to the AVAILABLE state.

Problem

Before upgrading the Grid Infrastructure from version 19.22 to 23.4, I ran the precheck, which completed successfully.

[Screenshot: Precheck status of the 23ai upgrade]

However, the upgrade failed with errors after the first node had been switched to 23ai.

[Screenshot: Error messages of the failed work request]

The state of the VM Cluster switched to FAILED.

Analysis

The OCI Console itself provides only limited diagnostic information. If an error occurs, check the log files generated on the first node of the VM Cluster.

Navigate to /var/opt/oracle/log/grid/upgrade and check the latest log files.

$> ls -ltr
...
-rw-r----- 1 oracle oinstall 3035001 Sep 10 12:29 pilot_2024-09-10_11-47-33-AM_92849
-rw-r----- 1 oracle oinstall   79340 Sep 10 12:29 dbaastools_2024-09-10_11-47-12-AM_89426.log
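
If the directory already contains logs from several runs, picking the newest file by hand gets tedious. The following is a minimal sketch of a helper that prints the most recently modified file for a given name prefix. The function name latest_log is my own and not part of dbaascli; the default directory is the one used in this post.

```shell
# Print the newest file in a directory whose name starts with a given prefix.
# latest_log is an illustrative helper name, not a dbaascli feature.
latest_log() {
    local prefix="$1"
    local dir="${2:-/var/opt/oracle/log/grid/upgrade}"
    # ls -t sorts by modification time, newest first. Errors are silenced
    # so that a prefix without any match simply prints nothing.
    ls -1dt "$dir"/"$prefix"* 2>/dev/null | head -n 1
}
```

For example, latest_log dbaastools_ should print the dbaastools log of the last run, and latest_log pilot_ the corresponding pilot log.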

The dbaastools log contains the output of the dbaascli grid upgrade command, which is used internally. The following command extracts the relevant messages.

$> grep "DBaaSToolsMesgAdvisor.showMessage" dbaastools_2024-09-10_11-47-12-AM_89426.log | sed 's/^.*DBaaSToolsMesgAdvisor\.showMessage:[0-9]*\] //g'
Job id: 94c90fff-80f3-494a-b8ab-995e2ada0b00
Session log: /var/opt/oracle/log/grid/upgrade/dbaastools_2024-09-10_11-47-12-AM_89426.log
CMDLINE: grid upgrade --showOutputDelimiter --targetVersion 23.4.0.24.05 --version 23.4.0.24.05 --waitForCompletion false
Starting console input arguments prompt
Completed console input arguments prompt
Starting secret arguments prompt
Completed secret arguments prompt
Loading PILOT...
Session ID of the current execution is: 21627
Log file location: /var/opt/oracle/log/grid/upgrade/pilot_2024-09-10_11-47-33-AM_92849
-----------------
Running initialize job
Completed initialize job
-----------------
Running validate_target_version job
Completed validate_target_version job
-----------------
Running check_locations_existence job
Completed check_locations_existence job
-----------------
Running check_locations_free_space job
Completed check_locations_free_space job
-----------------
Running check_os_version job
Completed check_os_version job
-----------------
Running check_patch_level job
Completed check_patch_level job
-----------------
Running check_crs_running_all_nodes job
Completed check_crs_running_all_nodes job
-----------------
Running check_asm_rebalance_ops job
Completed check_asm_rebalance_ops job
-----------------
Running check_crs_state job
Completed check_crs_state job
-----------------
Running ocrcheck job
Completed ocrcheck job
-----------------
Running validate_databases job
[WARNING] [DBAAS-70643] Following pluggable databases '{CDB1_fra2rc=[PDB1_CLONE]}' do not have services configured.
   ACTION: Make sure to configure the services of pluggable databases so that pluggable databases are started after the database bounce.
Completed validate_databases job
-----------------
Running validate_image_download_configuration job
Completed validate_image_download_configuration job
-----------------
Running download_image job
Completed download_image job
-----------------
Running create_locations_first_node job
Completed create_locations_first_node job
-----------------
Running unpackage_image_first_node job
Completed unpackage_image_first_node job
-----------------
Running validate_image job
Completed validate_image job
-----------------
Running software_only_prereqs job
Completed software_only_prereqs job
-----------------
Running cluvfy_pre_upgrade job
Completed cluvfy_pre_upgrade job
Acquiring write lock: _u01_app_19.0.0.0_grid
Acquiring write lock: provisioning
-----------------
Running pre_upgrade_lock_manager job
Completed pre_upgrade_lock_manager job
-----------------
Running update_limits job
Completed update_limits job
-----------------
Running unset_css_miscount job
Completed unset_css_miscount job
-----------------
Running setup_software_first_node job
Completed setup_software_first_node job
-----------------
Running create_locations_on_remote_nodes job
Completed create_locations_on_remote_nodes job
-----------------
Running copy_software_to_shared_location job
Completed copy_software_to_shared_location job
-----------------
Running copy_software_to_remote_nodes job
Completed copy_software_to_remote_nodes job
-----------------
Running remove_software_from_shared_location job
Completed remove_software_from_shared_location job
-----------------
Running inventory_rootscript job
Completed inventory_rootscript job
-----------------
Running attach_home job
Completed attach_home job
-----------------
Running software_rootscript job
Grid software install completed with oracle home: /u02/app/23.0.0.0/gridhome_1
Completed software_rootscript job
-----------------
Running config_update job
Completed config_update job
-----------------
Running stop_databases-exanode01 job
Completed stop_databases-exanode01 job
-----------------
Running stop_tfa-exanode01 job
Completed stop_tfa-exanode01 job
-----------------
Running umount_acfs-exanode01 job
Completed umount_acfs-exanode01 job
-----------------
Running execute_rootscript-exanode01 job
Completed execute_rootscript-exanode01 job
-----------------
Running mount_acfs-exanode01 job
Completed mount_acfs-exanode01 job
-----------------
Running start_databases-exanode01 job
Execution of start_databases-exanode01 failed
[FATAL] [DBAAS-70484] Unable to start database instance/s for databases '[(Node:exanode01,dbUniqueName:CDB03_fra26z,instance:CDB031)]'.
Releasing lock: _u01_app_19.0.0.0_grid
Releasing lock: provisioning
*** Executing jobs which need to be run always... ***
-----------------
Running generate_system_details job
Acquiring native write lock: global_dbsystem_details_generation
Releasing native lock: global_dbsystem_details_generation
Completed generate_system_details job
******** PLUGIN EXECUTION FAILED ********

In my case, the start_databases-exanode01 job failed because the database instance CDB031 could not be started.
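
Since the full job list is long, it can help to reduce the log to the failure lines only. Here is a small sketch, assuming the raw log contains the same "[FATAL]" and "Execution of ... failed" markers that appear in the filtered output above. The function name show_failures is my own.

```shell
# Print only the failure-related lines of a dbaastools log.
# Assumption: the raw log uses the "[FATAL]" and "Execution of ... failed"
# markers seen in the filtered output; other severities are ignored.
show_failures() {
    grep -E '\[FATAL\]|Execution of .* failed' "$1"
}
```

Calling show_failures with the path of the dbaastools log should then print just the failed job and the DBAAS error code.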

Solution

Based on this information, I was able to fix the database startup issue. In my case, an instance for a terminated node was still configured, so I removed it using srvctl remove instance.
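
The cleanup could look roughly like the sketch below. The database unique name is taken from the error message above. The name of the stale instance is not shown in this post, so STALE_INSTANCE is only a placeholder that you have to replace. The run and cleanup_stale_instance helpers are my own; by default they only print the srvctl commands so they can be reviewed first.

```shell
# Sketch of the cleanup, intended for the oracle user with the environment
# of the database home set.
# DB_UNIQUE_NAME comes from the error message; STALE_INSTANCE is a
# placeholder, because the instance of the terminated node is not named here.
DB_UNIQUE_NAME="${DB_UNIQUE_NAME:-CDB03_fra26z}"
STALE_INSTANCE="${STALE_INSTANCE:-INSTANCE_OF_TERMINATED_NODE}"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

cleanup_stale_instance() {
    # Show which instances are registered for the database.
    run srvctl config database -db "$DB_UNIQUE_NAME"
    # Remove the instance that still points to the terminated node.
    run srvctl remove instance -db "$DB_UNIQUE_NAME" -instance "$STALE_INSTANCE"
    # Check the state of the remaining instances.
    run srvctl status database -db "$DB_UNIQUE_NAME"
}
```

Your root cause may be a different one, so treat this only as an example of the kind of fix that is needed before the upgrade can continue.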

Once all issues are resolved, the upgrade can be resumed. The required command can also be found in the dbaastools log file.

$> grep -A3 "Resume command" dbaastools_2024-09-10_11-47-12-AM_89426.log
******************************* Resume command *******************************
To resume this failed session, run the following command:
dbaascli grid upgrade --showOutputDelimiter --targetVersion 23.4.0.24.05 --waitForCompletion false --sessionID 21627 --resume
******************************************************************************
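
To avoid copy and paste errors, the command line itself can be isolated from the banner. This is a minimal sketch, assuming the command stands on its own line starting with "dbaascli " within three lines after the "Resume command" banner, as in the output above. The function name resume_command is my own.

```shell
# Print the resume command recorded in a dbaastools log.
# Assumption: the command is a line of its own that starts with "dbaascli "
# and follows the "Resume command" banner within three lines.
resume_command() {
    grep -A3 'Resume command' "$1" | grep '^dbaascli ' | tail -n 1
}
```

I would still review the printed command before running it instead of piping it directly into a shell.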

Now run the command as the root user.

$> dbaascli grid upgrade --showOutputDelimiter --targetVersion 23.4.0.24.05 --waitForCompletion false --sessionID 21627 --resume
DBAAS CLI version 24.3.1.0.0
Executing command grid upgrade --showOutputDelimiter --targetVersion 23.4.0.24.05 --waitForCompletion false --sessionID 21627 --resume
Job id: 4649cca8-dcf7-4238-81a5-f8d397ce3248
Session log: /var/opt/oracle/log/grid/upgrade/dbaastools_2024-09-11_11-58-54-AM_394705.log
Job accepted. Use "dbaascli job getStatus --jobID 4649cca8-dcf7-4238-81a5-f8d397ce3248" to check the job status.

Use dbaascli job getStatus to check the status of the job. After a while, the job should complete with the status “Success”.

$> dbaascli job getStatus --jobID 4649cca8-dcf7-4238-81a5-f8d397ce3248
DBAAS CLI version 24.3.1.0.0
Executing command job getStatus --jobID 4649cca8-dcf7-4238-81a5-f8d397ce3248
{
  "jobId" : "4649cca8-dcf7-4238-81a5-f8d397ce3248",
  "dbaascliParameters" : "grid upgrade --resume --showOutputDelimiter --targetVersion 23.4.0.24.05 --waitForCompletion false --sessionID 21678",
  "status" : "Success",
  "message" : "grid upgrade job: Success",
  "logFile" : "/var/opt/oracle/log/grid/upgrade/dbaastools_2024-09-11_01-36-56-PM_196887.log",
  "createTimestamp" : 1726054622127,
  "updatedTime" : 1726054850429,
  "description" : "Service job report for operation grid upgrade",
  "appMessages" : {
    "messages" : [ ],
    "localeMessages" : [ ],
    "errorAction" : "SUCCEED_AND_SHOW"
  },
  "resourceList" : [ ],
  "sessionID" : 21683,
  "jobSpecificDetailsJson" : null,
  "pct_complete" : "100"
}
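
Instead of repeating the status command manually, the check can be wrapped in a small polling loop. The sketch below is based only on the output shown above: it reads the "status" field with sed, because the output starts with two banner lines that are not JSON. Only the value "Success" is known from this post, so every other value is printed and treated as "not finished yet", which is an assumption. The function names job_status and wait_for_job are my own.

```shell
# Extract the value of the "status" field from dbaascli job getStatus output.
job_status() {
    sed -n 's/^[[:space:]]*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll a job until it reports "Success" or the attempts are used up.
# Arguments: job ID, seconds between polls (default 60), attempts (default 60).
# Returns 0 on "Success" and 1 if the job never reached that status.
wait_for_job() {
    local job_id="$1"
    local interval="${2:-60}"
    local attempts="${3:-60}"
    local i=0
    local status
    while [ "$i" -lt "$attempts" ]; do
        status=$(dbaascli job getStatus --jobID "$job_id" | job_status)
        echo "status: ${status:-unknown}"
        [ "$status" = "Success" ] && return 0
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

With the job ID from above, the call would be wait_for_job 4649cca8-dcf7-4238-81a5-f8d397ce3248. If the loop ends with a non-zero return code, check the session log of the resumed run, because a failed job is not detected separately in this sketch.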

Verification

The cluster upgrade state of the Grid Infrastructure should now be NORMAL.

$> /u02/app/23.0.0.0/gridhome_1/bin/crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [23.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [0].
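
For a scripted check, the upgrade state can be cut out of this line. A minimal sketch, assuming the wording of the crsctl message is exactly as shown above. The function name upgrade_state is my own.

```shell
# Extract the cluster upgrade state from the output of
# "crsctl query crs activeversion -f" read on standard input.
upgrade_state() {
    sed -n 's/.*cluster upgrade state is \[\([^]]*\)\].*/\1/p'
}
```

Piping the crsctl command from above into upgrade_state should print NORMAL after a completed upgrade.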

Also, the state of the VM Cluster is back to AVAILABLE.
