Irrelevant thoughts of an oracle DBA

3 December 2007

disaster recovery troubles

Filed under: backup — Freek D'Hooge @ 0:26
Tags: , ,

First of all an excuse: these days I mainly use rman for backup and recovery tasks. I have used user managed backup and recovery in the past but it all has fade away….

Last Thursday and Friday I was at a client site to setup and test snapmanager for oracle (smo). With this tool you can use Netapp snapshots to backup Oracle databases and create “virtual” clone databases (a db clone without needing the disk space of the original db). Giving a full explanation of what netapp snapshots are and how smo works will lead too far for this post (I am working on a separate post about it), so just think about smo as a user managed backup.
Now this client wanted to extend the functionality of smo and wanted the snapshots (backups) to be replicated to a remote site. These replicated snapshots would then be used to create test/dev databases and for disaster recovery purposes.
One of my tasks was to write a disaster recovery procedure.

In short smo is performing the following steps when taking a full db backup:

  1. Place the entire database in backup mode.
  2. Take snapshots of all the netapp volumes containing database files. (think: take an os copy of all the data located on volumes containing data files).
  3. Take the database out of online backup mode.
  4. Issue an archive log current.
  5. Generate backup control files.
  6. Take a snapshot of the volumes containing the archive log files and the backup controlfile (think: take an os copy of all the data located on volumes containing archived redologs).

This backup data is then copied to a remote site.

As you can suspect by now, something went wrong during the recovery. So lets recreate the situation on my vmware test db.

[oracle@ragna ~]$ export ORACLE_SID=LOKI
[oracle@ragna ~]$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on Sat Dec 1 14:14:27 2007

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning and Data Mining options

SQL> alter database begin backup;

Database altered.

SQL> select file_name from dba_data_files;

FILE_NAME
--------------------------------------------------------------------------------
/opt/oracle/oradata/LOKI/system01.dbf
/opt/oracle/oradata/LOKI/undotbs01.dbf
/opt/oracle/oradata/LOKI/sysaux01.dbf
/opt/oracle/oradata/LOKI/users01.dbf

SQL> show parameter control_files

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
control_files                        string      /opt/oracle/oradata/LOKI/contr
                                                 ol01.ctl, /opt/oracle/oradata/
                                                 LOKI/control02.ctl, /opt/oracl
                                                 e/oradata/LOKI/control03.ctl
SQL> select member from v$logfile;

MEMBER
--------------------------------------------------------------------------------
/opt/oracle/oradata/LOKI/redo01.log
/opt/oracle/oradata/LOKI/redo02.log
/opt/oracle/oradata/LOKI/redo03.log

SQL> !
[oracle@ragna ~]$ cp -r /opt/oracle/oradata/LOKI /opt/oracle/oradata/LOKI_BAK
[oracle@ragna ~]$ exit
exit

SQL> alter database end backup;

Database altered.

SQL> alter system archive log current;

System altered.

SQL> alter database backup controlfile to '/opt/oracle/oradata/LOKI_BAK/ctrlfilebackup.ctl';

Database altered.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning and Data Mining options

[oracle@ragna ~]$ mv /opt/oracle/oradata/LOKI /opt/oracle/oradata/LOKI_OLD
[oracle@ragna ~]$ mv /opt/oracle/oradata/LOKI_BAK /opt/oracle/oradata/LOKI

At this point I have created a simular situation as I had with the smo backup.
Next are the steps I took to recover the database at the remote site.
First I started the database in nomount status and restored the controlfiles using the controlfile backup from step 5. After that I mounted the database.

[oracle@ragna ~]$ rman target / nocatalog

Recovery Manager: Release 10.2.0.3.0 - Production on Sat Dec 1 14:24:40 2007

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database (not started)

RMAN> startup nomount;

Oracle instance started

Total System Global Area     209715200 bytes

Fixed Size                     1260912 bytes
Variable Size                184550032 bytes
Database Buffers              16777216 bytes
Redo Buffers                   7127040 bytes

RMAN> restore controlfile from '/opt/oracle/oradata/LOKI/ctrlfilebackup.ctl';

Starting restore at 01-DEC-07
using channel ORA_DISK_1

channel ORA_DISK_1: copied control file copy
output filename=/opt/oracle/oradata/LOKI/control01.ctl
output filename=/opt/oracle/oradata/LOKI/control02.ctl
output filename=/opt/oracle/oradata/LOKI/control03.ctl
Finished restore at 01-DEC-07

RMAN> alter database mount;

database mounted
released channel: ORA_DISK_1

RMAN> exit

Recovery Manager complete.

Next I checked the backup status of the datafiles and put the database out of backup mode:

[oracle@ragna ~]$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on Sat Dec 1 14:27:25 2007

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning and Data Mining options

SQL> select distinct status from v$backup;

STATUS
------------------
ACTIVE

SQL> alter database end backup;

Database altered.

SQL> select distinct status from v$backup;

STATUS
------------------
NOT ACTIVE

Now here it went wrong on the customer site. After issuing the “alter database end backup” statement, I got an ORA-600 error and the database would remain in backup mode. But I could not reproduce this at my test db (at the client it was a 5 node linux (64bit) rac system that had to be restored on a single node ).
So lets continue with the test. After ending the backup mode I checked the latest archivelog and started the recovery

SQL> select max(checkpoint_change#) from v$datafile;

MAX(CHECKPOINT_CHANGE#)
-----------------------
                1946934

SQL> set linesize 150
SQL> set pages 999
column name format a50

SQL> SQL> SQL> select thread#, sequence#, to_char(next_time, 'DD/MM/YYYY HH24:MI:SS') ntime, next_change#, status, name
  2  from v$archived_log
  3  where (thread#, next_change#) in
  4          ( select thread#, max(next_change#)
          from v$archived_log
  5    6            where archived = 'YES'
                and status = 'A'
                and resetlogs_id =
  7    8    9                      ( select resetlogs_id
                      from v$database_incarnation
 10   11                        where status = 'CURRENT'
 12                      )
          group by thread#
 13   14          )
      and status = 'A'
 15   16        and resetlogs_id =
 17          ( select resetlogs_id
          from v$database_incarnation
 18   19            where status = 'CURRENT'
        )
 20   21  order by next_change# asc;

   THREAD#  SEQUENCE# NTIME               NEXT_CHANGE# S NAME
---------- ---------- ------------------- ------------ - --------------------------------------------------
         1          4 01/12/2007 14:22:46      1947116 A /opt/oracle/oradata/archives/LOKI/LOKI_640047464_1
                                                         _4.arc

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 1946934 generated at 12/01/2007 14:14:34 needed for thread 1
ORA-00289: suggestion : /opt/oracle/oradata/archives/LOKI/LOKI_640047464_1_4.arc
ORA-00280: change 1946934 for thread 1 is in sequence #4

Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 1947116 generated at 12/01/2007 14:22:46 needed for thread 1
ORA-00289: suggestion : /opt/oracle/oradata/archives/LOKI/LOKI_640047464_1_5.arc
ORA-00280: change 1947116 for thread 1 is in sequence #5
ORA-00278: log file '/opt/oracle/oradata/archives/LOKI/LOKI_640047464_1_4.arc' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
cancel
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/opt/oracle/oradata/LOKI/system01.dbf'

ORA-01112: media recovery not started

WTF!!! What does this mean: needs more recovery to be consistent. I already applied all the archivelogs I have.

SQL> !ls -l /opt/oracle/oradata/archives/LOKI/LOKI_640047464_1_5.arcls: /opt/oracle/oradata/archives/LOKI/LOKI_640047464_1_5.arc: No such file or directory

The needed archivelog did indeed not exists on disk. Luckily I still have the online redo logs from the original database (I had renamed the directory), so I could use those to feed the recovery process

SQL> !
[oracle@ragna ~]$ ls -lhtr /opt/oracle/oradata/LOKI_OLD/*.log
-rw-r-----  1 oracle oinstall 16M Dec  1 14:13 /opt/oracle/oradata/LOKI_OLD/redo03.log
-rw-r-----  1 oracle oinstall 16M Dec  1 14:22 /opt/oracle/oradata/LOKI_OLD/redo01.log
-rw-r-----  1 oracle oinstall 16M Dec  1 14:23 /opt/oracle/oradata/LOKI_OLD/redo02.log
[oracle@ragna ~]$ cp /opt/oracle/oradata/LOKI_OLD/redo02.log /opt/oracle/oradata/archives/LOKI/LOKI_640047464_1_5.arc
[oracle@ragna ~]$ exit
exit

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 1947116 generated at 12/01/2007 14:22:46 needed for thread 1
ORA-00289: suggestion : /opt/oracle/oradata/archives/LOKI/LOKI_640047464_1_5.arc
ORA-00280: change 1947116 for thread 1 is in sequence #5

Specify log: {=suggested | filename | AUTO | CANCEL}

Log applied.
Media recovery complete.
SQL> alter database open resetlogs;

Database altered.

Ok, so this worked. Apparently the needed change was in the online redo logs of the original database.
But how could this be, I did an alter database archive log current after ending the backup!?

As you recall I said earlier that at the customer site I received an ora-600 error when issuing the end backup statement, meaning that I never reached to the point above. Instead (after a tip from Geert) I just started the recovery without ending the backup, which worked without any problem.
So lets do the test again, this time without ending the backup mode.

[oracle@ragna oradata]$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on Sat Dec 1 17:30:37 2007

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning and Data Mining options

SQL> alter database begin backup;

Database altered.

SQL> !
[oracle@ragna oradata]$ cp -r /opt/oracle/oradata/LOKI /opt/oracle/oradata/LOKI_BAK
[oracle@ragna oradata]$ exit
exit

SQL> alter database end backup;

Database altered.

SQL> alter system archive log current;

System altered.

SQL> alter database backup controlfile to '/opt/oracle/oradata/LOKI_BAK/ctrlfilebackup.ctl';

Database altered.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning and Data Mining options
[oracle@ragna oradata]$ mv /opt/oracle/oradata/LOKI /opt/oracle/oradata/LOKI_OLD
[oracle@ragna oradata]$ mv /opt/oracle/oradata/LOKI_
LOKI_BAK/ LOKI_OLD/
[oracle@ragna oradata]$ mv /opt/oracle/oradata/LOKI_
LOKI_BAK/ LOKI_OLD/
[oracle@ragna oradata]$ mv /opt/oracle/oradata/LOKI_BAK /opt/oracle/oradata/LOKI
[oracle@ragna oradata]$ rman target / nocatalog

Recovery Manager: Release 10.2.0.3.0 - Production on Sat Dec 1 17:40:52 2007

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database (not started)

RMAN> startup nomount;

Oracle instance started

Total System Global Area     209715200 bytes

Fixed Size                     1260912 bytes
Variable Size                184550032 bytes
Database Buffers              16777216 bytes
Redo Buffers                   7127040 bytes

RMAN> restore controlfile from '/opt/oracle/oradata/LOKI/ctrlfilebackup.ctl';

Starting restore at 01-DEC-07
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=156 devtype=DISK

channel ORA_DISK_1: copied control file copy
output filename=/opt/oracle/oradata/LOKI/control01.ctl
output filename=/opt/oracle/oradata/LOKI/control02.ctl
output filename=/opt/oracle/oradata/LOKI/control03.ctl
Finished restore at 01-DEC-07

RMAN> alter database mount;

database mounted
released channel: ORA_DISK_1

RMAN> exit

Recovery Manager complete.
[oracle@ragna oradata]$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on Sat Dec 1 17:42:11 2007

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Production
With the Partitioning and Data Mining options

SQL> select distinct status from v$backup;

STATUS
------------------
ACTIVE

SQL> select max(checkpoint_change#) from v$datafile;

MAX(CHECKPOINT_CHANGE#)
-----------------------
                1948117

SQL> set linesize 150
SQL> set pages 999
SQL> column name format a50
SQL>
SQL> select thread#, sequence#, to_char(next_time, 'DD/MM/YYYY HH24:MI:SS') ntime, next_change#, status, name
from v$archived_log
  2    3  where (thread#, next_change#) in
        ( select thread#, max(next_change#)
  4    5            from v$archived_log
          where archived = 'YES'
  6    7                  and status = 'A'
                and resetlogs_id =
  8    9                      ( select resetlogs_id
 10                        from v$database_incarnation
                      where status = 'CURRENT'
 11   12                      )
 13            group by thread#
 14          )
 15        and status = 'A'
      and resetlogs_id =
        ( select resetlogs_id
 16   17   18            from v$database_incarnation
          where status = 'CURRENT'
 19   20          )
order by next_change# asc;
 21
   THREAD#  SEQUENCE# NTIME               NEXT_CHANGE# S NAME
---------- ---------- ------------------- ------------ - --------------------------------------------------
         1          1 01/12/2007 17:37:51      1948622 A /opt/oracle/oradata/archives/LOKI/LOKI_640199554_1
                                                         _1.arc

SQL> !ls -lhtr /opt/oracle/oradata/archives/LOKI/
total 2.0M
-rw-r-----  1 oracle oinstall 2.0M Dec  1 17:37 LOKI_640199554_1_1.arc

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 1948117 generated at 12/01/2007 17:30:46 needed for thread 1
ORA-00289: suggestion : /opt/oracle/oradata/archives/LOKI/LOKI_640199554_1_1.arc
ORA-00280: change 1948117 for thread 1 is in sequence #1

Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 1948622 generated at 12/01/2007 17:37:51 needed for thread 1
ORA-00289: suggestion : /opt/oracle/oradata/archives/LOKI/LOKI_640199554_1_2.arc
ORA-00280: change 1948622 for thread 1 is in sequence #2
ORA-00278: log file '/opt/oracle/oradata/archives/LOKI/LOKI_640199554_1_1.arc' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
cancel
Media recovery cancelled.
SQL> select distinct status from v$backup;

STATUS
------------------
NOT ACTIVE

SQL> alter database open resetlogs;

Database altered.

Worked like a charm. So clearly I should not take the database out of backup mode for this type of restore (but in case of an instance crash while tablespaces are in backup mode, you have to).
Following quote from the oracle documentation seems to confirm this: “Do not use ALTER DATABASE END BACKUP if you have restored any of the affected files from a backup.”.

I am still puzzled as why this is, and why the recovery succeeded after feeding it the logfile from the original database.

Advertisements

11 Comments »

  1. You need to issue

    SQL> recover database until cancel ;
    SQL> CANCEL
    SQL> alter database open resetlogs;

    Comment by Vishal Gupta — 3 December 2007 @ 1:08 | Reply

  2. Sorry had not read your entire post. You do need to use “until cancel” to be able to open the databse to a point in time. Though you can very well using “until SCN” or “until TIME” clauses as well.

    Comment by Vishal Gupta — 3 December 2007 @ 1:11 | Reply

  3. Vishal,

    I know this, and in the code you can see that I actually used the “until cancel” clause.
    The problem was that after I had feeded all the archivelogs I had available, oracle still said that it needed more redo to make the database consistent.
    Something that I did not expected, as I issued an “alter system archive log current” after the backup.
    The mistake I made was to end the backup mode of the database during the recovery steps, while I just had to start the “recover database …”.
    Question remains as why it is this way.

    Comment by dhoogfr — 3 December 2007 @ 8:35 | Reply

  4. You know, sometimes it pays not to think things through too much :-)

    We are using Netapp snapshots as backups, and I’ve done recoveries as you described couple of times in the past, and it never crossed my mind that since I took the snapshot when the database is in backup mode, I should take it out of backup mode before I start recovery. I’ve never thought of checking the backup status. So, I never ran into the problem you described. Less thinking – less problems :-)

    BTW. Any particular reason you are not using smo to manage the recovery as well? If I recall correctly, it can do that – including applying the logs and such.

    Comment by prodlife — 3 December 2007 @ 23:30 | Reply

  5. Chen,

    Yes smo is capable of doing the restore adn recovery, but in this case the snapshots where transferred with volume snapmirror to a separate netapp, so smo did not know about these snapshots and thus can not do the recovery.
    The now soon to be released (I hope) 2.2 version would have replication buildin (but I don’t know if they will use qtree snapmirror or snapvaulting).

    A second reason to why I don’t use smo for recovery purposes is that it has a problem (at least the 2.0 version has) with determining on rac systems which archived redo logs it can use.
    From the 2.0 admin guide:

    Problem:
    Recovering control files takes a long time

    Symptom:
    You attempted to perform a control file recovery; however, it is taking an exceptionally long time.

    Explanation:
    When SnapManager uses control files to perform a recovery, it follows the Oracle requirement of having the last redo log from each redo log thread be applied in the recovery process. SnapManager attempts all permutations of redo logs from the different threads until one succeeds. As the number of threads increases, the amount of time the recovery takes also increases.

    What to do:
    If you know which redo logs are needed, it might be quicker to manually recover the database.

    On a 4 node rac system it took smo over 2 hours to do all permutations….

    Comment by dhoogfr — 3 December 2007 @ 23:49 | Reply

  6. Hello,

    We are trying to use the same but without SnapManager as SMO costs money. We want to use snapshot with RMAN’s by cataloging the datafile backups and taking controlfile/archive files separately and cataloging them in RMAN.

    Do you foresee any issues?

    Thanks,
    Paru

    Comment by Paru — 4 February 2008 @ 21:53 | Reply

  7. Paru,

    You indeed don’t need smo for backing up oracle databases via snapshots.
    What you would need to do is to

    * put the tablespaces in hot backup mode
    * snapshot the volume(s) containing the datafiles
    * put the tablespaces out of hot backup mode
    * archive the current logfile
    * catalog the datafiles into the controlfile
    * backup the controlfile as a controlfile copy.

    Whether or not to snapshot the archived redo logs, depends on your own likings and on the question if you want to replicate the snapshots to a different netapp. If so, then I would suggest to save the controlfile copy in the archivelog volume and then snapshot the archivelog volume as the latest step.

    Mind you, this is only one possible scenario. You can program all different kind of backup schemes, as long as you remember that a netapp snapshot has to follow the same rules as a filesystem copy of the database.

    Comment by dhoogfr — 4 February 2008 @ 23:27 | Reply

  8. Thanks for the info. We are going to try to put the whole DB in backup mode and snapshot all the files that way snaps are easier as all of them belong to one volume. Then backup the controlfile as controlfile copy. But, when I am cataloging in RMAN and list the list shows as “list copy” and not list backup as I did not create a backupset. I should still be able to restore it right?

    I am going to try the restore.

    Yes, we will be sending the archive/ctl to standby database in remote site so will save the controlfile copy and archives.

    Thanks for your input.

    Paru

    Comment by Paru — 5 February 2008 @ 15:07 | Reply

  9. Can you please elaborate on “catalog the datafiles into controlfile”? I am using recovery catalog to catalog the whole thing, do you suggest not needing one and using controlfile instead?

    Paru

    Comment by Paru — 5 February 2008 @ 15:08 | Reply

  10. Paru,

    All backup (or copies) information are always stored into the controlfiles. If you use a catalog db then this information will also be stored into this catalog db.
    So you don’t realy need a catalog db, but it could be easier to use one.

    Rman is able to restore both copies and backupsets, no problems there.

    Comment by dhoogfr — 5 February 2008 @ 20:49 | Reply

  11. I’m really loving the theme/design of your site.

    Do you ever run into any browser compatibility issues?
    A small number of my blog audience have complained about my website not working correctly in Explorer but looks great in Safari.
    Do you have any solutions to help fix this issue?

    Comment by Garden Awnings — 30 August 2014 @ 12:44 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: