Ok, not as good as beer and they can give you a nasty headache, so you have been warned ;)
Reason for this post are 2 bugs I discovered with Oracle RAC, both resulting in a single point of failure.
The platform on which I’m working is Oracle 10gR2 (10.2.0.4) on OEL 4.7.
The first one is when you are using NFS volumes to host the ocr and ocrmirror volumes.
Normally, when the ocr volume gets corrupted or unavailable , oracle should failover to the ocrmirror volume. The exact response is documented in the RAC FAQ on metalink (note 220970.1) and is currently discussed in a series of blog posts by Geert de Paep.
With NFS, however, you must use both the nointr and hard mount options (OS is OEL 4.7) and as a result the process that is trying to read or write an unavailable ocr volume will wait undefinitly on a response. This is not only happening when using commands such as crs_stat or srvctl, but also when an instance or service failover is initiated.
Oracle support however, does not exactly see it this way and has first blamed the os, then the storage and finally stated that there is no failover foreseen between the ocr and ocrmirror volumes…
It took some escalating and a change in support engineer to get some progress in that SR (mind you that after more then 4 months, they still have not acknowledged it as a bug).
The second problem is that, when you made the public interface redundant with os bonding, the racgvip script does not detect when all interfaces in the bond are disconnected.
This is caused because the script, unlike older version, is using mii-tool to check the availability of the public interface. Only when mii-tool states that the link is down, a ping test is done to the public gateway. If that test fails as well, then the vip fails over and the rac instances on that node are placed in a blocked state.
The problem however with mii-tool is that it plays not very well with bonds, and always reports the bond status as being up (in fact, regardless of the link state, mii-tool is always reporting a network bond as “bond0: 10 Mbit, half duplex, link ok”). So, the racgvip script always thinks that the public interface is up.
As mii-tool is an os utility, I first opened a case on the Oracle Enterprise Linux support, to check with them if its behavior was normal (I already confirmed that by googeling, but Oracle support does not seem to accept results from google :) ). And after running multiple tests with different bond options, they finally stated that mii-tool was indeed obsolete and should not be used to verify a bond status (yes, I know. Its own man page already states that mii-tool is obsolete).
So next, I opened a SR on part of the clusterware and oracle development promptly stated that it was not a clusterware bug but an os issue, pointing the finger to mii-tool and asking where it was written that mii-tool is obsolete… . After making them aware of the statement made by their OEL colleagues and the mii-tool man page, they have seemed to have accepted it as a bug.
I have checked the 11gR2 version of the racgvip script, and it seems to suffer the same problem.
ps) Note 365605.1 – “Oracle Bug Status Codes, Descriptions and Usage” is, although it seems incomplete, very usefull to understand the different status codes