Irrelevant thoughts of an oracle DBA

22 September 2011

Rise of the appliances?

Filed under: infrastructure,opinion,unbreakable db appliance — Freek D'Hooge @ 9:52

Some quick thoughts.

Yesterday Oracle announced its first database appliance for the SMB market.
Before this, it already had its Exadata and Exalogic appliances for the big environments.
During the presentation Oracle also indicated that it wants to continue delivering new appliance products and apparently is no longer interested in selling “commodity” x86 servers.

Symantec has also been busy with appliances for Netbackup.

For some time now, we have seen that the big players in the IT market are leaving their historical background and are trying to offer the complete stack from software over switches to storage. Is this offering of appliances the next step?
Will we see more and more applications offered as appliances?

If so, what will this mean for the independent system integrators?

Also, as these appliances seem to use their own dedicated storage, what does this mean for the SAN?
(I know of some people who will not mourn its decline).

21 September 2011

Oracle announces the Unbreakable DB Appliance

Filed under: infrastructure,opinion,unbreakable db appliance — Freek D'Hooge @ 19:33

More than 10 years after Oracle’s first appliance attempt with Raw Iron and 3 years after the release of Exadata, Oracle has now announced the Unbreakable DB Appliance.

This “cluster in a box” consists of a 4 RU chassis containing 2 server nodes, 96 GB memory per node, 12 TB raw shared disk storage (24 disks) and 292 GB of flash disks.
The two server nodes have a total of 24 cpu cores, but cores can be disabled.
This allows for sub-capacity licensing of the software (with a minimum of 4 cores).

On the software side, the appliance runs Oracle Linux, 11gR2 grid infrastructure and 11gR2 db software. Databases on this appliance can run as single node, RAC or RAC One Node.
Oracle enterprise manager is also part of the software stack.

Claims are made towards one button installation of software and patching.
The appliance also has “phone home” functionality, which automatically creates a service request when a problem is detected.

List price for the hardware is $ 50,000 (regardless of how many cores you activate) and for the software the standard DB licensing applies.
Which means that existing CPU licences can be transferred to this appliance.

Oracle positions this system below the Exadata quarter rack, and it is also worth mentioning that this appliance is not expandable.

So far the product launch information.

Some questions / remarks I have:

  • According to the presentation the hardware price remains the same, regardless of how many cores you activate (namely $ 50,000).
    In my opinion, this means that no one will buy this appliance to just activate 4 cores.
    There are much cheaper solutions when you only need a low number of cores (certainly when you consider that most companies already have a SAN which can be used for the Oracle databases).
  • There are 24 disks in the appliance, which seems low (certainly compared to the 24 cpu cores).
    However, keep in mind that this storage is dedicated and probably (I don’t have confirmation on this) capable of asm intelligent data placement and command queuing.
    Normally SAN vendors use an estimate of 180 IOPS per SAN disk. Oracle, however, uses an estimate of 300 IOPS per cell disk for Exadata, and tests done by Glenn Fawcett show that they can actually perform even better (around 400 IOPS).
    Using the number of 300 IOPS, this would mean that the 24 disks translate to 40 SAN disks (that may not be used by any other application, so in reality to even more SAN disks), which already looks very different.
    Now, I’m still unsure how it will perform with write intensive databases (oltp or dwh), certainly when several databases are consolidated on this appliance.
    As this appliance is not expandable, the number of disks may be a weak point compared to the number of cpu cores.
    I’m hoping that someone like Kevin Closson (poke poke) will be able to shed some light on this, as my knowledge in this area is rather limited :-)
  • In the presentation it was mentioned that the flash storage is used for the redo logs, but it is unclear if it could also be used to store datafiles or as cache (as with the Exadata smart flash cache)
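
The disk arithmetic in the second remark can be redone with a few lines of Python. This is only a back-of-the-envelope sketch: the per-disk IOPS figures are the estimates quoted above (180 for a SAN disk, 300 and 400 for the appliance disks), and the function name is made up for the illustration.

```python
# Convert dedicated appliance disks into "SAN disk equivalents".
# All IOPS figures are the rough estimates quoted in the post, not measurements.

def san_disk_equivalent(disks, iops_per_disk, iops_per_san_disk=180.0):
    """Number of SAN disks needed to match the aggregate IOPS of `disks` drives."""
    return disks * iops_per_disk / iops_per_san_disk


for label, iops in (("Oracle estimate", 300), ("Glenn Fawcett's tests", 400)):
    total = 24 * iops
    equivalent = san_disk_equivalent(24, iops)
    print(f"{label}: {total} IOPS in total, roughly {equivalent:.1f} SAN disks")
# Oracle estimate: 7200 IOPS in total, roughly 40.0 SAN disks
# Glenn Fawcett's tests: 9600 IOPS in total, roughly 53.3 SAN disks
```

The calculation only compares raw random IOPS; it says nothing about write behaviour or about several databases competing for the same spindles.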

As with many things the proof of the pudding is in the eating, so I’m looking forward to some benchmarks and presentations by real world customers.
And if anyone from Oracle is reading this, you may always send me a demo machine so I can do some testing on my own  ;-))

update 20:12, fixed wrong memory specification

3 November 2009

Two Oracle RAC bugs on the wall, two Oracle bugs. Take one down …

Filed under: infrastructure,linux,rac — Freek D'Hooge @ 2:50

Ok, not as good as beer and they can give you a nasty headache, so you have been warned   ;)

Reason for this post are 2 bugs I discovered with Oracle RAC, both resulting in a single point of failure.
The platform on which I’m working is Oracle 10gR2 on OEL 4.7.

The first one is when you are using NFS volumes to host the ocr and ocrmirror volumes.
Normally, when the ocr volume gets corrupted or becomes unavailable, Oracle should fail over to the ocrmirror volume. The exact response is documented in the RAC FAQ on metalink (note 220970.1) and is currently discussed in a series of blog posts by Geert de Paep.
With NFS, however, you must use both the nointr and hard mount options (OS is OEL 4.7), and as a result the process that is trying to read or write an unavailable ocr volume will wait indefinitely for a response. This happens not only when using commands such as crs_stat or srvctl, but also when an instance or service failover is initiated.
Oracle support, however, does not exactly see it this way and first blamed the os, then the storage, and finally stated that no failover is foreseen between the ocr and ocrmirror volumes…
It took some escalating and a change of support engineer to get some progress in that SR (mind you that after more than 4 months, they still have not acknowledged it as a bug).

The second problem is that, when you have made the public interface redundant with os bonding, the racgvip script does not detect when all interfaces in the bond are disconnected.
The cause is that the script, unlike older versions, uses mii-tool to check the availability of the public interface. Only when mii-tool states that the link is down is a ping test done to the public gateway. If that test fails as well, then the vip fails over and the rac instances on that node are placed in a blocked state.
The problem with mii-tool, however, is that it does not play well with bonds and always reports the bond status as being up (in fact, regardless of the link state, mii-tool always reports a network bond as “bond0: 10 Mbit, half duplex, link ok”). So, the racgvip script always thinks that the public interface is up.
As mii-tool is an os utility, I first opened a case with Oracle Enterprise Linux support, to check with them whether its behavior was normal (I had already confirmed that by googling, but Oracle support does not seem to accept results from google :)   ). After running multiple tests with different bond options, they finally stated that mii-tool was indeed obsolete and should not be used to verify a bond status (yes, I know. Its own man page already states that mii-tool is obsolete).
So next, I opened an SR for the clusterware, and Oracle development promptly stated that it was not a clusterware bug but an os issue, pointing the finger at mii-tool and asking where it was written that mii-tool is obsolete… . After making them aware of the statement made by their OEL colleagues and the mii-tool man page, they seem to have accepted it as a bug.
I have checked the 11gR2 version of the racgvip script, and it seems to suffer the same problem.
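
For comparison, the Linux bonding driver publishes the link state of each slave interface in /proc/net/bonding/<bond name>. The sketch below parses that text to decide whether at least one slave still has link. It is not what racgvip does and not an Oracle-supplied fix; the function names and the sample text are made up for the illustration, and the "MII Status" lines are only kept up to date when link monitoring (miimon) is configured on the bond.

```python
# Minimal sketch: derive the bond link state from the bonding driver's status
# text instead of from mii-tool. SAMPLE imitates /proc/net/bonding/bond0.

SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up
Link Failure Count: 0

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
"""


def slave_link_states(bonding_text):
    """Return {slave name: True if its MII status is 'up'}."""
    states = {}
    current_slave = None
    for line in bonding_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value" layout
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current_slave = value
        elif key == "MII Status" and current_slave is not None:
            # The bond-level "MII Status" line comes before any slave and is skipped.
            states[current_slave] = (value == "up")
    return states


def bond_has_link(bonding_text):
    """True when at least one slave interface reports link."""
    return any(slave_link_states(bonding_text).values())


print(slave_link_states(SAMPLE))  # {'eth0': True, 'eth1': False}
print(bond_has_link(SAMPLE))      # True
```

On a real system the text would come from reading the file for the bond in question, for example open("/proc/net/bonding/bond0").read().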

ps) Note 365605.1 – “Oracle Bug Status Codes, Descriptions and Usage” is, although it seems incomplete, very useful for understanding the different status codes

25 October 2009

Wintertime (again)

Filed under: infrastructure,Uncategorized — Freek D'Hooge @ 15:28

In my prior post on the effect of daylight saving settings on the Oracle scheduler, I already pointed out that it is best to set your session timezone information to a named timezone and not to an absolute offset. In this post I would like to investigate how the session timezone settings affect the sysdate, current_date, systimestamp and current_timestamp variables during the switchover to or from daylight saving time. Current_date and current_timestamp use the date/time information of the server on which the database runs and adjust that time using the timezone settings of the session.
As with the last post, the tests were done in response to the switch from wintertime to summertime, and I’m too lazy to redo them.

In the first test, I do not explicitly set timezone information in my session.
Both the server time and the client time have been set to a couple of minutes before the switchover from wintertime to summertime:

sys@GUNNAR> alter session set nls_date_format = 'DD/MM/YYYY HH24:MI:SS';

Session altered.

sys@GUNNAR> alter session set nls_timestamp_tz_format='DD/MM/YYYY HH24:MI:SS "TZ:" TZR "DS:" TZD ';

Session altered.

sys@GUNNAR> column systimestamp format a35
sys@GUNNAR> column current_timestamp format a35
sys@GUNNAR> select sysdate, current_date, systimestamp, current_timestamp from dual;

SYSDATE             CURRENT_DATE        SYSTIMESTAMP                        CURRENT_TIMESTAMP
------------------- ------------------- ----------------------------------- -----------------------------------
29/03/2009 01:58:32 29/03/2009 01:58:32 29/03/2009 01:58:32 TZ: +01:00 DS:  29/03/2009 01:58:32 TZ: +01:00 DS:

As you can see the timezone information uses the absolute offset notation and is set to GMT +1 (which corresponds with wintertime in Belgium).
After some minutes (once summertime had come into effect), I execute the same query again:

sys@GUNNAR> select sysdate, current_date, systimestamp, current_timestamp from dual;

SYSDATE             CURRENT_DATE        SYSTIMESTAMP                        CURRENT_TIMESTAMP
------------------- ------------------- ----------------------------------- -----------------------------------
29/03/2009 03:00:18 29/03/2009 02:00:18 29/03/2009 03:00:18 TZ: +02:00 DS:  29/03/2009 02:00:18 TZ: +01:00 DS:

Both sysdate and systimestamp have jumped 1 hour into the future and systimestamp now shows the timezone as “GMT + 2” (summertime in Belgium).
Current_date and current_timestamp both show the time without summertime corrections, but with current_timestamp the timezone information places the time in the right context.

Next, I disconnect and reconnect the session:

sys@GUNNAR> select sysdate, current_date, systimestamp, current_timestamp from dual;

SYSDATE             CURRENT_DATE        SYSTIMESTAMP                        CURRENT_TIMESTAMP
------------------- ------------------- ----------------------------------- -----------------------------------
29/03/2009 03:01:15 29/03/2009 03:01:15 29/03/2009 03:01:15 TZ: +02:00 DS:  29/03/2009 03:01:15 TZ: +02:00 DS:

This time, all 4 show the same time and timezone information (all using summertime).
The explanation for this is that the timezone information for a session is determined when the session is created, and Oracle only applies daylight saving settings when using a named timezone. So as long as the session is connected, it uses the “old” timezone of GMT +1. With sysdate and systimestamp the timezone information comes from the server, not from the client.
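
The effect can be imitated outside the database. The Python sketch below is only a model, not a description of how Oracle implements it: the server follows the EU rule (summertime runs from the last Sunday of March to the last Sunday of October, switching at 01:00 UTC), while the session keeps the fixed +01:00 offset it received at connect time. All function names are made up for the illustration.

```python
from datetime import datetime, timedelta, timezone


def last_sunday_0100_utc(year, month):
    """01:00 UTC on the last Sunday of March or October (both have 31 days)."""
    last_day = datetime(year, month, 31, 1, 0, tzinfo=timezone.utc)
    # weekday(): Monday = 0 ... Sunday = 6, so step back to the previous Sunday
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)


def brussels_offset(utc_instant):
    """UTC offset in Belgium at the given instant, following the EU rule."""
    start = last_sunday_0100_utc(utc_instant.year, 3)
    end = last_sunday_0100_utc(utc_instant.year, 10)
    return timedelta(hours=2 if start <= utc_instant < end else 1)


def server_and_session_time(utc_instant, session_offset):
    """Wall clock as the server sees it and as a session pinned to a fixed offset sees it."""
    server = utc_instant.astimezone(timezone(brussels_offset(utc_instant)))
    session = utc_instant.astimezone(timezone(session_offset))
    return server, session


fmt = "%d/%m/%Y %H:%M:%S %z"
pinned = timedelta(hours=1)  # offset the session received before the switch

for instant in (datetime(2009, 3, 29, 0, 58, 32, tzinfo=timezone.utc),
                datetime(2009, 3, 29, 1, 0, 18, tzinfo=timezone.utc)):
    server, session = server_and_session_time(instant, pinned)
    print(server.strftime(fmt), "|", session.strftime(fmt))
# 29/03/2009 01:58:32 +0100 | 29/03/2009 01:58:32 +0100
# 29/03/2009 03:00:18 +0200 | 29/03/2009 02:00:18 +0100
```

The second line reproduces the one hour gap between systimestamp and current_timestamp seen in the first test after the switch.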

In the second test, I have set the ORA_SDTZ variable in the client environment to “Europe/Brussels”

sys@GUNNAR> select sysdate, current_date, systimestamp, current_timestamp from dual;

SYSDATE             CURRENT_DATE        SYSTIMESTAMP                        CURRENT_TIMESTAMP
------------------- ------------------- ----------------------------------- -----------------------------------------------
29/03/2009 01:57:37 29/03/2009 01:57:38 29/03/2009 01:57:37 TZ: +01:00 DS:  29/03/2009 01:57:37 TZ: EUROPE/BRUSSELS DS: CET

### a couple of minutes later

sys@GUNNAR> select sysdate, current_date, systimestamp, current_timestamp from dual;

SYSDATE             CURRENT_DATE        SYSTIMESTAMP                        CURRENT_TIMESTAMP
------------------- ------------------- ----------------------------------- ------------------------------------------------
29/03/2009 03:00:04 29/03/2009 03:00:04 29/03/2009 03:00:04 TZ: +02:00 DS:  29/03/2009 03:00:04 TZ: EUROPE/BRUSSELS DS: CEST

Both current_date and current_timestamp have now also jumped 1 hour into the “future” and the daylight saving setting in current_timestamp has changed from CET (Central European Time) to CEST (Central European Summer Time).
To me this shows that it is important to set the timezone of your clients correctly, even if the database is not used from different timezones.
A long running session is sufficient to pollute your data, certainly if you are using current_date as it has no timezone information.

22 October 2009

switching to wintertime

Filed under: infrastructure — Freek D'Hooge @ 18:17

In Belgium we are switching to wintertime this Sunday, which is a good opportunity for me to write this post.
I originally intended to write it when we switched to summer time, so everything will be from the point of view of changing from winter time to summer time (confused yet?).

The reason I wanted to write about it was a set of alerts we got back then from our monitoring, concerning scheduler jobs which were no longer running on time.
It quickly became clear that these jobs did not follow the change to summer time, but instead ran an hour later.
The key is to look at the dba_scheduler_jobs table in the correct format. You see, the *_run_date columns are of the datatype “timestamp(6) with timezone”, so to get all the information you need to use the right format model. Using the TZR and TZD models you can see the timezone and the daylight saving information respectively:

sys@WPS50> select job_name, to_char(last_start_date, 'DD/MM/YYYY HH24:MI:SS "TZ:" TZR "DS:" TZD ') last_start_date, to_char(next_run_date, 'DD/MM/YYYY HH24:MI:SS "TS:" TZR "DS:" TZD ') next_run_date from dba_scheduler_jobs;

JOB_NAME                       LAST_START_DATE                                    NEXT_RUN_DATE
------------------------------ -------------------------------------------------- --------------------------------------------------
AUTO_SPACE_ADVISOR_JOB         28/03/2009 06:00:04 TZ: +01:00 DS:
GATHER_STATS_JOB               02/02/2009 22:00:00 TZ: +01:00 DS:
PURGE_LOG                      29/03/2009 03:00:00 TZ: MET DS: MEST               30/03/2009 03:00:00 TS: MET DS: MEST
ANALYZETHIS_PURGEHISTORY       29/03/2009 17:00:00 TZ: +01:00 DS:                 30/03/2009 17:00:00 TS: +01:00 DS:
GATHER_WK_TEST_STATS           29/03/2009 18:00:00 TZ: +01:00 DS:                 30/03/2009 18:00:00 TS: +01:00 DS:
GATHER_SESSIONUSR_STATS        29/03/2009 18:00:00 TZ: +01:00 DS:                 30/03/2009 18:00:00 TS: +01:00 DS:
GATHER_RELEASEUSR_STATS        29/03/2009 18:00:00 TZ: +01:00 DS:                 30/03/2009 18:00:00 TS: +01:00 DS:
GATHER_LMDBUSR_STATS           29/03/2009 18:00:00 TZ: +01:00 DS:                 30/03/2009 18:00:00 TS: +01:00 DS:
GATHER_ICMADMIN_STATS          29/03/2009 18:00:00 TZ: +01:00 DS:                 30/03/2009 18:00:00 TS: +01:00 DS:
GATHER_COMMUNITYUSR_STATS      29/03/2009 18:00:00 TZ: +01:00 DS:                 30/03/2009 18:00:00 TS: +01:00 DS:
GATHER_CUSTOMIZATIONUSR_STATS  29/03/2009 18:00:00 TZ: +01:00 DS:                 30/03/2009 18:00:00 TS: +01:00 DS:
MGMT_STATS_CONFIG_JOB          01/03/2009 01:01:01 TZ: +01:00 DS:                 01/04/2009 01:01:01 TS: +01:00 DS:
MGMT_CONFIG_JOB                28/03/2009 06:00:04 TZ: +01:00 DS:

14 rows selected.

(note the additional space after the TZD format, I needed to add this to actually show the information if I used the "DS:" literal in front (probably this is a bug) )

As you can see, each job has its own timezone offset and some have also daylight saving information.
So, what happened with our jobs? Well, when a job gets created, Oracle stores the timezone information of the start_date parameters. If this timezone is specified in an absolute offset then no daylight saving changes are applied.
When the server switches to summer time (GMT +2 in Belgium), the scheduler job stays in its own little world and remains in the timezone GMT +1.
So, when for the rest of the database the time is 07:00, the job thinks it is still 06:00 and does not start. As the monitoring check did not take the timezone of the job into account, it reported the job as being late.
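
The one hour shift is plain offset arithmetic. As a small Python sketch (fixed offsets only; the dates and names are picked for the illustration and are not taken from the job list above):

```python
from datetime import datetime, timedelta, timezone

JOB_TZ = timezone(timedelta(hours=1))            # absolute offset stored with the job
SERVER_SUMMER_TZ = timezone(timedelta(hours=2))  # Belgian summertime


def on_server_clock(job_time, server_tz):
    """The same instant, expressed on the server's wall clock."""
    return job_time.astimezone(server_tz)


# A job that is due at 06:00 according to its own +01:00 offset ...
due = datetime(2009, 3, 30, 6, 0, tzinfo=JOB_TZ)
# ... is only due at 07:00 on a server that has moved to +02:00.
print(on_server_clock(due, SERVER_SUMMER_TZ).strftime("%d/%m/%Y %H:%M %z"))
# 30/03/2009 07:00 +0200
```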

To avoid this situation, you need to use a named timezone, in which case Oracle will automatically apply the correct daylight saving settings.
How do you do this? Well, either you use to_timestamp_tz to convert a text string to a timestamp with timezone information, or you let Oracle retrieve the timezone from your session.
The timezone information in your session can be set with alter session, or by using the ORA_SDTZ variable in your client environment.
But there is a catch. In the following example I have set my timezone to Europe/Brussels, and then verified the timezone information in systimestamp:

sys@WPS50> select sessiontimezone from dual;


sys@WPS50> select to_char(systimestamp, 'DD/MM/YYYY HH24:MI:SS "TZ:" TZR "DS:" TZD ') from dual;

30/03/2009 01:57:15 TZ: +02:00 DS:

As you can see, the timezone part uses an absolute notation, not a named timezone.
Systimestamp will never use the named timezone notation, so whenever you use systimestamp as value for the next_date parameter in dbms_scheduler, you will use an absolute offset and thus not follow daylight saving switches.
The current_timestamp variable will, however, use the correct notation:

sys@WPS50> select to_char(current_timestamp, 'DD/MM/YYYY HH24:MI:SS "TZ:" TZR "DS:" TZD ') from dual;

30/03/2009 01:57:18 TZ: EUROPE/BRUSSELS DS: CEST

So, when you want to specify the current date as value for the next_date parameter, use current_timestamp and not systimestamp.

This timezone stuff is only applicable when you have an interval that is at least 1 day. With smaller intervals, Oracle will make sure that the period between 2 runs remains the same.
If a job runs every 3 hours and last ran at midnight and the clock is then moved forward from 02:00 to 03:00, then the next run date of the job becomes 04:00, so that the 3 hour period between two job runs is retained.
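
The 04:00 in this example follows from counting elapsed time rather than wall clock time. A small Python sketch of that arithmetic, using fixed offsets for wintertime and summertime (the names are made up for the illustration, and this models the outcome described above, not Oracle's internals):

```python
from datetime import datetime, timedelta, timezone

WINTER = timezone(timedelta(hours=1))  # CET
SUMMER = timezone(timedelta(hours=2))  # CEST


def next_interval_run(last_run, interval, tz_after):
    """Add `interval` of real elapsed time, then show the result in `tz_after`."""
    elapsed = last_run.astimezone(timezone.utc) + interval
    return elapsed.astimezone(tz_after)


last_run = datetime(2009, 3, 29, 0, 0, tzinfo=WINTER)  # midnight, before the switch
nxt = next_interval_run(last_run, timedelta(hours=3), SUMMER)
print(nxt.strftime("%d/%m/%Y %H:%M %z"))
# 29/03/2009 04:00 +0200
```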

More information on this, including how Oracle behaves when no start_date parameter is given can be found here:

The database version on which the tests were done is

20 October 2009

Multiple standby databases and supplemental logging

Filed under: dataguard,infrastructure — Freek D'Hooge @ 18:10

A quick warning:

When you setup a logical standby database, you need to activate supplemental logging on the primary database.
This is done automatically when you build the data dictionary (by running the procedure).
Activating supplemental logging is however (I know now) a control file change and is thus not replicated to the other physical standby databases.
As a result, the logical standby will become (logically) corrupt when you perform a role switch between your primary and another physical standby database.

I learned this the hard way  :(
Luckily it was during a proof of concept and not in a real production environment … .

Of course, AFTERWARDS, I found the following MAA document which points out that you have to enable supplemental logging yourself on the other physical standby databases.
It still makes a good read though.

19 January 2009

Silent upgrade troubles

Filed under: bugs,infrastructure,upgrades / migrations — Freek D'Hooge @ 1:08

Last week I was asked to write a little script to automate an upgrade of oracle client to
The reason for this was that we needed to update around 1.300 clients to enable them to connect to a 10g database (we couldn’t install 10g clients because other applications restricted the client version to 9i).

Ok, easy enough. Oracle allows you to automate installations and upgrades via response files and the response file for a client upgrade from to is very simple.
When I started testing the upgrade, I immediately spotted a first problem. The setup.exe (it was on windows xp) started a new console and then returned directly to the prompt in the original console. This would make it impossible to check the return codes to know whether an upgrade was successful or not.

The upgrade itself finished without a problem, but at the end the following message appeared in the newly started console: “Press enter to exit”.
Huh!? This was supposed to be a “silent” install, meaning no interaction needed. But here it was, asking to press enter to exit.
And the documentation did not say anything about it.
After some searching, I found that you can specify the “-noconsole” flag when starting the setup, which suppresses the new console and avoids the question to press enter.
You would still see the question in the logfiles, but the installation presumes you responded to it and finishes the upgrade.

This left me with the first problem: the prompt would still return directly while the upgrade was running in the background.
After some searching in the documentation I found a note stating that you need to modify the oraparam.ini file and change the BOOTSTRAP parameter from TRUE to FALSE.
Unfortunately this did not help. Yelling at it didn’t either.

Then I found that in 10g, you had a “-waitforcompletion” flag you could set, that would do exactly what I needed. So I tested whether it would work with the oui shipped in the patchset.
At first, it didn’t, but then I found metalink note 293044.1 that said that the setup.exe in Disk1 and Disk1/install were not the same and that the one in Disk1/install should be used for the “-waitforcompletion” flag.
At last it worked.

For those interested, here is the full command I used to start the silent upgrade:

start /wait C:\oracle\patches\\Disk1\install\setup.exe -silent -noconsole -waitforcompletion -responsefile c:\oracle\patches\\patchset.rsp -paramfile c:\oracle\patches\\oraparam.ini
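
Once the installer blocks until it is done, its exit code can be picked up by whatever wrapper starts it. A minimal Python sketch of such a wrapper, using only the flags shown in the command above. The function names are made up, all paths are supplied by the caller, and what a particular non-zero exit code means is not covered here, so that part has to be checked against the installer documentation.

```python
import subprocess


def build_setup_command(setup_exe, response_file, param_file):
    """Assemble the silent-upgrade command line with the flags used in the post."""
    return [
        setup_exe,
        "-silent",
        "-noconsole",
        "-waitforcompletion",
        "-responsefile", response_file,
        "-paramfile", param_file,
    ]


def run_silent_upgrade(setup_exe, response_file, param_file):
    """Run the installer, wait for it to finish and return its exit code."""
    completed = subprocess.run(build_setup_command(setup_exe, response_file, param_file))
    return completed.returncode
```

A rollout script could call run_silent_upgrade() per client and log every exit code that is not 0 for follow-up.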


Thanks to Geert for the yelling link :)

10 January 2009

How to use the plan_table table to sabotage your oracle upgrade

Filed under: bugs,infrastructure,upgrades / migrations — Freek D'Hooge @ 12:55

Let’s say you need to upgrade your 9i database to 10g ( to be exact), but you actually want to sabotage the upgrade (don’t know why, just assume you do).
Granted, there are many ways to do this, but you want to do it subtly. What are your options then?
Well, one option is to create the plan_table table in your sys schema (or a synonym plan_table to a plan table in another schema if you want to make it really subtle) before the upgrade.
If you do this, you will see the following message in your upgrade log:

Warning: Package Body created with compilation errors.

SQL> show errors;

-------- -----------------------------------------------------------------
113/5    PL/SQL: SQL Statement ignored
118/44   PL/SQL: ORA-00904: "OTHER_XML": invalid identifier

And the “oracle database server” component in the dba_registry will be marked as invalid.
Mission accomplished I would say.

What is that?
You regret your actions and you want to fix the problem?


Ok then, to fix it you can use the following steps:

  • drop the sys.plan_table table
  • drop the sys.plan_table$ table
  • drop all sys synonyms and public synonyms to the plan_table or the plan_table$
  • @?/rdbms/admin/catplan.sql — recreate the plan table
  • @?/rdbms/admin/dbmsxpln.sql — reload dbms_xplan spec
  • @?/rdbms/admin/prvtxpln.plb — reload dbms_xplan implementation
  • @?/rdbms/admin/prvtspao.plb — reload dbms_sqlpa

For those seeking more information:

Metalink note 565600.1 – ERROR IN CATUPGRD: ORA-00904 IN DBMS_SQLPA
Metalink note 605317.1 – DBMS_SQLPA ORA-00904 OTHER_XML invalid identifier

According to the notes, the problem only exists with upgrades to or to

ps. Don’t ask me why I had a synonym called plan_table in my sys schema. I didn’t do it.
pps. This is why you should test your migration (I’m glad I did)

20 August 2008

Just because it’s printed, doesn’t mean it’s true

Filed under: infrastructure,rant — Freek D'Hooge @ 0:55

That statement is often written by Jonathan Lewis and today I was reminded of how true it is.
I had a discussion today with two of my colleagues who wanted to increase the number of arch processes on a dataguard system. As reason they pointed to metalink note 468817.1 – “RFS: possible network disconnect while taking rman backup on primary site”, which makes the following statement:

“In a Data Guard Configuration, during Scheduled RMAN Backup no Redo is transported to the Standby Server as the ARCH Process is blocked (as expected ie. RMAN would utilize 1 ARCn Process and the other ARCn for local Archiving ) which means the Standby stays out of sync (assuming max_arch_processes=2) until the ArchiveLog is manually copied across and registered to the Standby Database after which the Standby Database resumes applying the ArchiveLogs.”

The first thing that drew my attention was the part about rman utilizing an ARCn process.
While the ARCn processes are indeed responsible for archiving the online redo log files, it is the session’s own server process that does this when issuing an “alter system archive log current” command. When rman forces a log file to be archived, the same thing happens. This can easily be verified by looking at the “creator” column in the v$archived_log view. A strace of the rman server process would also prove that it is this process which reads the archived redo logs and streams them to the rman client to be written in a backup piece.

1 – 0 for me

The second thing I noticed was that, according to the note, the standby would remain out of sync until the archivelog was manually copied across and registered to the standby db.
Even if the rman process were using an ARCn process, leaving no processes to copy the archivelog over (with max_arch_processes=2), the standby db would normally be able to pick up the synchronization again once rman had released the ARCn process.

2 – 0 for me
and end of discussion.

So, “just because it’s printed, doesn’t mean it’s true” also applies to metalink notes.
Luckily there is a feedback link at the end of the note, so I hope it will soon be removed or modified.

And yes, I can get a little bit competitive in discussions.
How did you guess?

8 March 2008

bugs introduced via patches

Filed under: infrastructure,rac — Freek D'Hooge @ 13:58

“Sometimes” you have to apply patches to your oracle system to fix bugs. This is however not without risk, as patches are just code and thus can contain bugs themselves. The same is true for the scripts used to apply the patches, as I found out first hand.
Recently I needed to perform a drop node / add node procedure on an oracle 10gR2 rac (on solaris 10).
During this procedure I ran into the following problems:

When using ocrconfig to check the backups of the cluster registry, I got the following error: ocrconfig: fatal: open failed: No such file or directory

After some searching on metalink I found the following bug: 6342492 – ocrcheck: fatal: open failed after patch 6000740.
Patch 6000740 is bundle patch MLR7, and the problem with this patch is that the path to make has been hardcoded to a value which is not correct on solaris 10. The solution is to create a symbolic link from /usr/ccs/bin/make to /usr/bin/make, and perform a relink all.

Yep, a hardcoded path. The same problem which requires you to create symbolic links for scp and ssh into the /usr/local/bin directory on solaris 10 when installing a rac system. You would think they learn from their mistakes… .

The second problem I ran into, appeared during the addnode procedure. During the copy of the cluster home to the new node, oracle complained that not all files could be copied.
When I checked the logfile I found the following messages:

WARNING: Error while copying directory /opt/oracle/crs with exclude file list '/tmp/OraInstall2008-02-09_04-20-28PM/installExcludeFile.lst' to nodes 'eocpc-rc01'. [PRKC-1073 : Failed to transfer directory "/opt/oracle/crs" to any of the given nodes "eocpc-rc01 ".
Error on node eocpc-rc01:tar: ./lib/prod/lib: Permission denied
tar: ./lib/prod/lib: Permission denied
tar: cannot open ./lib/prod/lib No such file or directory
tar: ./lib/prod/lib: Permission denied
tar: cannot open ./lib/prod/lib/v9 No such file or directory
tar: ./lib/prod/lib: Permission denied
tar: cannot open ./lib/prod/lib/v9 No such file or directory
tar: ./lib/prod/lib: Permission denied
tar: cannot open ./lib/prod/lib/v9 No such file or directory
tar: ./lib/prod/lib: Permission denied
tar: cannot open ./lib/prod/lib/v9 No such file or directory
tar: ./lib/prod/lib: Permission denied
tar: cannot open ./lib/prod/lib/v9 No such file or directory
tar: ./lib/prod/lib: Permission denied
tar: cannot open ./lib/prod/lib/v9 No such file or directory
tar: ./lib/prod/lib: Permission denied
tar: cannot open ./lib/prod/lib/v9 No such file or directory]

All problem files were located in the same directory, and this directory was lacking write privileges for the oracle user.

[oracle@eocpc-rc02:~]$ find /opt/oracle/crs/lib/prod -exec ls -ald {} \;
dr-xrwx---   3 oracle   oinstall     512 May  5  2007 /opt/oracle/crs/lib/prod
drwxrwx---   3 oracle   oinstall     512 May  5  2007 /opt/oracle/crs/lib/prod/lib
drwxrwx---   2 oracle   oinstall     512 May  5  2007 /opt/oracle/crs/lib/prod/lib/v9
-rw-rw----   1 oracle   oinstall    2824 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/CCrti.o
-rw-rw----   1 oracle   oinstall    1232 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/CCrtn.o
-rw-rw----   1 oracle   oinstall    3064 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/crt1.o
-rw-rw----   1 oracle   oinstall     776 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/crti.o
-rw-rw----   1 oracle   oinstall     712 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/crtn.o
-rw-rw----   1 oracle   oinstall   14736 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/
-rw-rw----   1 oracle   oinstall   19056 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/

A quick check on a new, not yet patched installation confirmed that this was not a normal situation:

[oracle@eocsp-rc51:~]$ find /opt/oracle/crs/lib/prod -exec ls -ald {} \;
drwxrwx---   3 oracle   oinstall     512 Feb  7 23:47 /opt/oracle/crs/lib/prod
drwxrwx---   3 oracle   oinstall     512 Feb  7 23:47 /opt/oracle/crs/lib/prod/lib
drwxrwx---   2 oracle   oinstall     512 Feb  7 23:48 /opt/oracle/crs/lib/prod/lib/v9
-rw-rw----   1 oracle   oinstall    2824 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/CCrti.o
-rw-rw----   1 oracle   oinstall    1232 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/CCrtn.o
-rw-rw----   1 oracle   oinstall    3064 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/crt1.o
-rw-rw----   1 oracle   oinstall     776 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/crti.o
-rw-rw----   1 oracle   oinstall     712 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/crtn.o
-rw-rw----   1 oracle   oinstall   14736 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/
-rw-rw----   1 oracle   oinstall   19056 Mar 13  2003 /opt/oracle/crs/lib/prod/lib/v9/

Making the lib/prod directory in the crs home writable again for the oinstall group did indeed solve the problem.
So far I have found 2 different patches which introduce this problem:

  • 6000740 MLR7
  • 5749953 ons sigbus error after install patchset for crs
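
A check like the find / ls comparison above can be scripted, so that it can be run after every patch. The Python sketch below walks a directory tree and reports every directory whose owner lacks the write bit, which is the symptom shown for lib/prod (dr-xrwx---). The function name is made up, and the sketch only looks at the owner write bit, not at who the owner actually is.

```python
import os
import stat


def dirs_missing_owner_write(root):
    """Return (path, mode string) for every directory under root without owner write."""
    missing = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.stat(dirpath).st_mode
        if not mode & stat.S_IWUSR:
            # stat.filemode() renders the mode the way ls -l does, e.g. dr-xrwx---
            missing.append((dirpath, stat.filemode(mode)))
    return missing
```

Pointing it at the crs home before and after applying a patch would show whether the patch changed any directory permissions in this way.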

A quick search on metalink did not find this problem listed as a known bug. I will file one next week. (If I find the time)

Interesting to note is that these 2 problems are also not likely to be discovered by applying the patch on a test system.
I don’t think there are many DBAs who perform an addnode procedure when testing a new patchset…
