Environment description
Two-node RAC with Oracle 11.2.0.2.2
Oracle Linux 5.6 with the Unbreakable Enterprise Kernel (2.6.32-100.36.1.el5uek)
Conducted tests
test_srv is a service that has the instances running on node1 and node2 as preferred instances.
On node1 the service was manually stopped.
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
Issue a “shutdown abort” on the instance running on node2:
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1
Start the instance again:
[grid@node1 ~]$ srvctl start instance -d mydb -i mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1
The service is now running on both instances, although before the crash the service was set offline on node1.
The same test, but this time the service is stopped on all instances:
[grid@node1 ~]$ srvctl stop service -d mydb -s test_srv
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
[grid@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
This time the service stays offline on both instances.
But what happens if we start the instance again:
[grid@node1 ~]$ srvctl start instance -d mydb -i mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
Now the service has started again on the restarted instance.
The explanation is that the service has an AUTOMATIC management policy, so it is brought up together with the instance; this is why the service starts on the restarted node.
The failover itself seems like expected behaviour to me, as it matches what would happen with a preferred / available configuration.
For the third test, we will reconfigure the service to have a preferred and an available node:
[grid@node1 ~]$ srvctl stop service -d mydb -s test_srv
[grid@node1 ~]$ srvctl modify service -d mydb -s test_srv -n -i mydb2 -a mydb1
[grid@node1 ~]$ srvctl config service -d mydb -s test_srv
Service name: test_srv
Service is enabled
Server pool: mydb_test_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Preferred instances: mydb2
Available instances: mydb1
[grid@node1 ~]$ srvctl start service -d mydb -s test_srv -i mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
The service is running on its preferred instance, which we will now crash:
[grid@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=OFFLINE
Hmm, I actually expected a relocation here…
As I have other services with a preferred / available configuration, I know this service should fail over.
[grid@node1 ~]$ srvctl status service -d mydb -s test_srv
Service test_srv is not running.
[grid@node1 ~]$ srvctl config service -d mydb -s test_srv
Service name: test_srv
Service is enabled
Server pool: mydb_test_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Preferred instances: mydb2
Available instances: mydb1
[grid@node1 ~]$ srvctl status database -d mydb
Instance mydb1 is running on node node1
Instance mydb2 is not running on node node2
I could find no clues in the various cluster log files as to why the relocation did not occur.
More testing will be necessary.
Also note that the output of crsctl status resource -l does not show on which node or instance the service is expected to be online.
By using the -v flag, however, we can see the LAST_SERVER attribute:
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -v
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
LAST_SERVER=node2
STATE=OFFLINE
TARGET=ONLINE
CARDINALITY_ID=1
CREATION_SEED=137
RESTART_COUNT=0
FAILURE_COUNT=0
FAILURE_HISTORY=
ID=ora.mydb.test_srv.svc 1 1
INCARNATION=5
LAST_RESTART=08/10/2011 16:32:53
LAST_STATE_CHANGE=08/10/2011 16:34:03
STATE_DETAILS=
INTERNAL_STATE=STABLE
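This attribute listing is plain KEY=VALUE text, so it is easy to consume from a script. As a small illustration (this is just a sketch, not an Oracle-supplied tool), a few lines of Python can turn such output into a dictionary so that fields like LAST_SERVER can be checked programmatically:

```python
def parse_crsctl_attrs(output):
    """Parse 'crsctl status resource <name> -v' KEY=VALUE output into a dict.

    Note: in '-l' output for a multi-cardinality service, repeated keys such
    as CARDINALITY_ID would overwrite each other; this sketch only targets
    the single-resource '-v' listing shown above.
    """
    attrs = {}
    for line in output.splitlines():
        line = line.strip()
        if "=" in line:
            key, _, value = line.partition("=")  # split on the first '=' only
            attrs[key.strip()] = value.strip()
    return attrs


# Sample trimmed from the transcript above.
sample = """NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
LAST_SERVER=node2
STATE=OFFLINE
TARGET=ONLINE"""

info = parse_crsctl_attrs(sample)
print(info["LAST_SERVER"])  # node2
```

This makes it straightforward to script a check such as "is STATE equal to TARGET, and if not, where did the service last run?".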
After starting the instance again, the service was available again:
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
A second run of this test gave the same result.
Manually relocating the service did work though:
[grid@node1 ~]$ srvctl relocate service -d mydb -s test_srv -i mydb1 -t mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
What if I removed the service and recreated it directly as preferred / available:
[grid@node1 ~]$ srvctl stop service -d mydb -s test_srv
[grid@node1 ~]$ srvctl remove service -d mydb -s test_srv
[grid@node1 ~]$ srvctl add service -d mydb -s test_srv -r mydb2 -a mydb1 -y AUTOMATIC -P BASIC -e SELECT
PRCD-1026 : Failed to create service test_srv for database mydb
PRKH-1014 : Current user grid is not the same as oracle owner orauser of oracle home /opt/oracle/orauser/product/11.2.0.2/dbhome_1.
Ok, wrong user. But surely the user who creates the service would not influence the failover behaviour… would it?
Let us test it:
[grid@node1 ~]$ su - orauser
Password:
[orauser@node1 ~]$ srvctl add service -d mydb -s test_srv -r mydb1,mydb2 -y AUTOMATIC -P BASIC -e SELECT
[orauser@node1 ~]$ srvctl config service -d mydb -s test_srv
Service name: test_srv
Service is enabled
Server pool: mydb_test_srv
Cardinality: 2
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SELECT
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: mydb1,mydb2
Available instances:
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
Now modify it:
[orauser@node1 ~]$ srvctl modify service -d mydb -s test_srv -n -i mydb2 -a mydb1
[orauser@node1 ~]$ srvctl config service -d mydb -s test_srv
Service name: test_srv
Service is enabled
Server pool: mydb_test_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SELECT
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: mydb2
Available instances: mydb1
[orauser@node1 ~]$ srvctl start service -d mydb -s test_srv -i mydb2
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
[orauser@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=OFFLINE
Nope, the user modifying the service has nothing to do with it.
I also tested the scenario where I directly created a preferred / available service; in that case the failover did not work either.
But after some more testing I found the reason.
During the first test I had shut down the instance via SQL*Plus, not via srvctl. And the other services I mentioned had failed over during that test (I never did a failback).
After doing the shutdown abort via SQL*Plus again, the failover worked again:
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

[orauser@node2 ~]$ export ORACLE_SID=mydb2
[orauser@node2 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Wed Aug 10 18:28:29 2011
Copyright (c) 1982, 2010, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> shutdown abort
ORACLE instance shut down.

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

SQL> startup
ORACLE instance started.
Total System Global Area 3140026368 bytes
Fixed Size 2230600 bytes
Variable Size 1526728376 bytes
Database Buffers 1593835520 bytes
Redo Buffers 17231872 bytes
Database mounted.
Database opened.

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1
As expected, starting the instance again did not trigger a failback of the service.
The question now is whether the failover not happening when the shutdown is issued via srvctl is expected behaviour or not.
To find out, one would probably have to open a service request, answer a couple of questions not relevant to the issue, escalate, and still wait several months.
Do I sound bitter now?
Conclusion:
- When an instance is restarted, an offline service that has this instance listed as a preferred instance will be started along with it (management policy = AUTOMATIC).
- When an instance on which a service was running fails, the service is started on at least one other preferred instance.
- The service keeps running on that instance, even when the original instance is started again (in which case the service runs on both instances).
- When a service has a preferred / available configuration, the service fails over to the available instance, but does not fail back afterwards.
- Failover in a preferred / available configuration does not happen when the instance is stopped via srvctl (“srvctl stop instance -d <db_unique_name> -i <instance_name> -o abort”).
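The placement rules observed above can be codified in a small toy model. To be clear: this is only an illustration of these test results, not an Oracle API, and the function and parameter names are my own invention.

```python
def service_placement_after_failure(preferred, available, failed_instance,
                                    running_before, via_srvctl):
    """Toy model: return the set of instances where the service ends up ONLINE.

    preferred/available: instance names from the service configuration.
    failed_instance: the instance that went down.
    running_before: instances where the service was ONLINE before the failure.
    via_srvctl: True if stopped with 'srvctl stop instance ... -o abort'.
    """
    survivors = set(running_before) - {failed_instance}
    if failed_instance not in running_before:
        # The service was not running there; nothing changes.
        return survivors
    if via_srvctl:
        # Observed: no failover when the instance is stopped via srvctl.
        return survivors
    # Observed: on a crash, the service fails over to another preferred
    # instance first, otherwise to an available one.
    for candidate in preferred + available:
        if candidate != failed_instance and candidate not in survivors:
            survivors.add(candidate)
            break
    return survivors


# Test 1: preferred/preferred, crash via SQL*Plus -> failover to node1.
print(service_placement_after_failure(
    ["mydb1", "mydb2"], [], "mydb2", ["mydb2"], via_srvctl=False))  # {'mydb1'}
```

The model reproduces the three outcomes in the tests: failover on a crash in both the preferred/preferred and preferred/available configurations, and no failover when the stop goes through srvctl.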
Questions remaining:
- What if there were more than two nodes, with a service that has three or more nodes listed as preferred but is currently running on only one node?
If the instance on which the service is running fails, would the service be started on all preferred nodes or on only one of them?
- What if, in the above case, the service was running on two nodes?
Would it still be started on other nodes?
- And what if one of the nodes was configured as available instead of preferred? Would the service be started on the preferred node, on the available instance, or on both?
- And last but not least, is the srvctl shutdown behaviour a bug or not?
It would be neat if someone with access to a RAC of three or more nodes could run the above tests and send me the results :-)
Update 13/08/2011:
Amar Lettat, one of my colleagues at Uptime, has pointed me to MOS note 1324574.1 – “11gR2 RAC Service Not Failing Over To Other Node When Instance Is Shut Down”.
This note clearly points out that the service not failing over when shutting down with srvctl is expected behaviour in 11.2.
It also points to the Oracle documentation, where this behaviour is also documented.
So not a bug, only a well documented change in behaviour.