Irrelevant thoughts of an oracle DBA

11 August 2011

RAC investigations part I

Filed under: rac — Freek D'Hooge @ 0:06

Environment description

2-node RAC with Oracle 11.2.0.2.2
Oracle Linux 5.6 with the Unbreakable Enterprise Kernel (2.6.32-100.36.1.el5uek)

Conducted tests

test_srv is a service that has the instances on node1 and node2 both configured as preferred instances.
On node1 the service was stopped manually.

[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

Issue a “shutdown abort” on the instance running on node2

[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

Start the instance again:

[grid@node1 ~]$ srvctl start instance -d mydb -i mydb2

[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

The service is now running on both instances, although it had been stopped manually on node1 before the crash.
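Comparing these stanzas by eye gets tedious after a few runs. A minimal sketch of a filter that condenses the -l output to one line per cardinality follows; summarize_svc is a hypothetical helper name, and it assumes the attribute order shown above (CARDINALITY_ID, then TARGET, then STATE), so it is not meant for the -v output.

```shell
# Condense `crsctl status resource <name> -l` output (read from stdin)
# to one line per cardinality. Hypothetical helper, not part of the Oracle tools.
summarize_svc() {
  awk -F= '
    { sub(/^[ \t]+/, "") }                 # tolerate indented output lines
    $1 == "CARDINALITY_ID" { id = $2 }
    $1 == "TARGET"         { target = $2 }
    $1 == "STATE"          { printf "cardinality %s: target=%s state=%s\n", id, target, $2 }
  '
}
```

Usage would be something like `crsctl status resource ora.mydb.test_srv.svc -l | summarize_svc`, run before and after each step of a test.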

Same test, but this time the service is stopped on all instances

[grid@node1 ~]$ srvctl stop service -d mydb -s test_srv

[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

[grid@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort

[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

This time the service stays offline on both instances.
But what happens if we start the instance again?

[grid@node1 ~]$ srvctl start instance -d mydb -i mydb2

[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

Now the service has started again on the restarted instance.
The explanation is that the service is configured to come up automatically with the instance (management policy AUTOMATIC), so it is started on the restarted node.
The failover in the first test seems to me to be expected behaviour, as it is the same as what would happen with a preferred / available configuration.
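To confirm which management policy a service actually has before running such a test, the relevant line can be pulled out of the srvctl config service output. The sketch below assumes the "Label: value" layout that srvctl config service prints in this environment; cfg_field is a hypothetical helper name.

```shell
# Print the value of one "Label: value" line from `srvctl config service`
# output read on stdin. Hypothetical helper; $1 is the label before the colon.
cfg_field() {
  awk -v label="$1" '
    !found && index($0, label ":") == 1 {
      value = substr($0, length(label) + 2)   # text after "label:"
      sub(/^[ \t]+/, "", value)               # drop the space after the colon
      print value
      found = 1
    }
  '
}
```

For example, `srvctl config service -d mydb -s test_srv | cfg_field "Management policy"` should print the policy of the service used in these tests.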

For the third test, we will reconfigure the service to have one preferred and one available instance:

[grid@node1 ~]$ srvctl stop service -d mydb -s test_srv
[grid@node1 ~]$ srvctl modify service -d mydb -s test_srv -n -i mydb2 -a mydb1

[grid@node1 ~]$ srvctl config service -d mydb -s test_srv
Service name: test_srv
Service is enabled
Server pool: mydb_test_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Preferred instances: mydb2
Available instances: mydb1

[grid@node1 ~]$ srvctl start service -d mydb -s test_srv -i mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

The service is running on its preferred instance, which we will now crash

[grid@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort

[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=OFFLINE

eumm, I actually expected a relocation here…
As I have other services which have a preferred / available configuration, I know this service should fail over.

[grid@node1 ~]$ srvctl status service -d mydb -s test_srv
Service test_srv is not running.

[grid@node1 ~]$ srvctl config service -d mydb -s test_srv
Service name: test_srv
Service is enabled
Server pool: mydb_test_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Preferred instances: mydb2
Available instances: mydb1

[grid@node1 ~]$ srvctl status database -d mydb
Instance mydb1 is running on node node1
Instance mydb2 is not running on node node2

I could find no clues in the various cluster log files as to why the relocation did not occur.
More testing will be necessary.
Also note that the output of crsctl status resource does not show on which node or instance the service is expected to be online.
But with the -v flag we can see the LAST_SERVER attribute:

[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -v
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
LAST_SERVER=node2
STATE=OFFLINE
TARGET=ONLINE
CARDINALITY_ID=1
CREATION_SEED=137
RESTART_COUNT=0
FAILURE_COUNT=0
FAILURE_HISTORY=
ID=ora.mydb.test_srv.svc 1 1
INCARNATION=5
LAST_RESTART=08/10/2011 16:32:53
LAST_STATE_CHANGE=08/10/2011 16:34:03
STATE_DETAILS=
INTERNAL_STATE=STABLE
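When only one attribute such as LAST_SERVER is of interest, it can be picked out of the -v output with a small filter. This is a rough sketch that assumes the NAME=value layout shown above; crs_attr is a hypothetical helper name.

```shell
# Print the value of the first occurrence of one attribute from
# `crsctl status resource <name> -v` output read on stdin.
# Hypothetical helper; $1 is the attribute name, e.g. LAST_SERVER.
crs_attr() {
  awk -F= -v key="$1" '
    !found && $1 == key {
      print substr($0, length(key) + 2)   # everything after "KEY="
      found = 1
    }
  '
}
```

For example, `crsctl status resource ora.mydb.test_srv.svc -v | crs_attr LAST_SERVER` should print node2 for the output above.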

After starting the instance again, the service was available again:

[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

A second run of this test gave the same result.
Manually relocating the service did work though:

[grid@node1 ~]$ srvctl relocate service -d mydb -s test_srv -i mydb1 -t mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

What if I removed the service and recreated it directly as preferred / available:

[grid@node1 ~]$ srvctl stop service -d mydb -s test_srv

[grid@node1 ~]$ srvctl remove service -d mydb -s test_srv

[grid@node1 ~]$ srvctl add service -d mydb -s test_srv -r mydb2 -a mydb1 -y AUTOMATIC -P BASIC -e SELECT
PRCD-1026 : Failed to create service test_srv for database mydb
PRKH-1014 : Current user grid is not the same as oracle owner orauser of oracle home /opt/oracle/orauser/product/11.2.0.2/dbhome_1.

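Right, the earlier modify was run as the grid user and not as the owner of the oracle home. That would not make a difference, would it?
Let us test it: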

[grid@node1 ~]$ su - orauser
Password:

[orauser@node1 ~]$ srvctl add service -d mydb -s test_srv -r mydb1,mydb2 -y AUTOMATIC -P BASIC -e SELECT

[orauser@node1 ~]$ srvctl config service -d mydb -s test_srv
Service name: test_srv
Service is enabled
Server pool: mydb_test_srv
Cardinality: 2
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SELECT
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: mydb1,mydb2
Available instances:

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

Now modify it:

[orauser@node1 ~]$ srvctl modify service -d mydb -s test_srv -n -i mydb2 -a mydb1

[orauser@node1 ~]$ srvctl config service -d mydb -s test_srv
Service name: test_srv
Service is enabled
Server pool: mydb_test_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SELECT
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: mydb2
Available instances: mydb1

[orauser@node1 ~]$ srvctl start service -d mydb -s test_srv -i mydb2

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

[orauser@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=OFFLINE

Nope, the user modifying the service has nothing to do with it.
I also tested the scenario where I directly created a preferred / available service, but in this case the failover did not work either.
But after some more testing I found the reason.
During the first test I had shut down the instance via sqlplus, not via srvctl. And the other services I talked about had failed over during that test (I never did a failback).
After doing the shutdown abort via sqlplus again, the failover worked:

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

[orauser@node2 ~]$ export ORACLE_SID=mydb2
[orauser@node2 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Wed Aug 10 18:28:29 2011

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> shutdown abort
ORACLE instance shut down.

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

SQL> startup
ORACLE instance started.

Total System Global Area 3140026368 bytes
Fixed Size                  2230600 bytes
Variable Size            1526728376 bytes
Database Buffers         1593835520 bytes
Redo Buffers               17231872 bytes
Database mounted.
Database opened.

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.test_srv.svc -l
NAME=ora.mydb.test_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

As expected, starting the instance again did not trigger a failback of the service.

The question now is whether the failover not happening when the instance is stopped via srvctl is expected behaviour or not.
For this, one would probably have to open a service case, answer a couple of questions that are not relevant to the issue, escalate, and still wait several months.
Do I sound bitter now?

Conclusion:

  • When an instance is restarted, an offline service that lists this instance as preferred is started on it (management policy = AUTOMATIC).
  • When an instance on which a service was running fails, the service is started on at least one other preferred instance.
  • The service will remain running on this instance, even when the original instance is started again (in which case the service will run on both instances).
  • When a service has a preferred / available configuration, the service will fail over to the available instance, but will not fail back afterwards.
  • Failover in a preferred / available configuration does not happen when the instance is stopped via “srvctl stop instance -d <db_unique_name> -i <instance_name> -o abort”.
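Given the last point, one option for a planned stop is to move the service by hand first, with the same srvctl relocate service syntax used earlier in the tests. The sketch below only prints the two commands so they can be reviewed before anything is run; planned_stop_cmds is a hypothetical helper name, and whether a relocate before an abort suits a given workload is an assumption to verify.

```shell
# Print (not run) the commands for stopping an instance after first moving
# a preferred / available service away from it. Hypothetical helper.
# Arguments: database, service, instance to stop, instance to move the service to.
planned_stop_cmds() {
  db=$1 svc=$2 from=$3 to=$4
  printf 'srvctl relocate service -d %s -s %s -i %s -t %s\n' "$db" "$svc" "$from" "$to"
  printf 'srvctl stop instance -d %s -i %s -o abort\n' "$db" "$from"
}
```

Calling `planned_stop_cmds mydb test_srv mydb2 mydb1` prints the relocate followed by the stop. The output can be piped to `sh` once it looks right.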

Questions remaining:

  • What if there were more than 2 nodes, with a service that has all three or more instances listed as preferred but is currently running on only one of them?
    If the instance on which that service is running fails, would the service then be started on all preferred instances or on only one of them?
  • What if, in the above case, the service was running on 2 nodes?
    Would it still be started on other nodes?
  • And what if one of the nodes was configured as available and not as preferred? Would the service be started on the remaining preferred instance, on the available instance, or on both?
  • And last but not least, is the srvctl stop behaviour a bug or not?

It would be neat if someone with access to a RAC of 3 or more nodes could run the above tests and send me the results  :-)

Update 13/08/2011:
Amar Lettat, one of my colleagues at Uptime, has pointed me to MOS note 1324574.1 – “11gR2 RAC Service Not Failing Over To Other Node When Instance Is Shut Down”.
This note clearly points out that the service not failing over when shutting down with srvctl is expected behaviour in 11.2.
It also points to the Oracle documentation, where this behaviour is also documented.
So not a bug, just a well-documented change in behaviour.
