I am taking a break from my regular Beginner's Guide series today because I ran into an issue that was genuinely confusing. So confusing that, once I finally got past it, I decided to blog about it so that if you face something similar you can find an instant solution instead of pulling your hair out.
Coming to the issue, let me brief the scenario in short.
We have a Production database with heavy OLTP activity, and to provide disaster recovery for this server we have a Data Guard solution set up. The Oracle version here is 10.2.0.4 Enterprise Edition on Sun SPARC.
Yesterday the client reported that the DG was not in sync with Production, and that is where my job came into the picture. Logging into the database, I found that there was indeed an archive log gap and that, because of it, the DG had stopped syncing with Production.
I ran the following commands on the DG to verify the archive log gap.
SQL> SELECT THREAD#, LOW_SEQUENCE#, HIGH_SEQUENCE# FROM V$ARCHIVE_GAP;

   THREAD# LOW_SEQUENCE# HIGH_SEQUENCE#
---------- ------------- --------------
         1         65524          65533

SQL> SELECT MAX(sequence#) AS "STANDBY", applied FROM v$archived_log GROUP BY applied;

   STANDBY APPLIED
---------- -------
     65783 NO
     65523 YES
From the above output it is quite clear that there is an archive gap, because of which the recovery services had stopped. The first thing to do in such a scenario is to check whether the missing archives are still present at Production. If they are... phew! You are saved from a hectic recovery exercise, because you only need to transfer those archives from Production to the DR and apply them there. But if you are not so lucky and the archives are no longer available at Production, you have to go the other way around: take an incremental RMAN backup at Production and restore it at the DR end. But that is a topic for another blog.
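By the way, checking whether those archives are still on disk at Production is a one-liner. This is just a generic sketch of what I mean (run it on the Production database; the sequence range is the one reported by V$ARCHIVE_GAP):
SQL> SELECT sequence#, name, deleted
     FROM v$archived_log
     WHERE thread# = 1 AND sequence# BETWEEN 65524 AND 65533;
If DELETED shows NO and NAME points to a file that still exists, you can simply copy those files across.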
In my case I was lucky, as the retention policy for archives at Production is 3 days, so the archives were still there. Lucky me... So what I did was simply transfer the missing archives from Production to the DR, stop the media recovery process at the DR, recover the DR with RMAN and then start the media recovery process again. That should have solved my problem, right? Well, I was wrong... So what went wrong? These are the steps I followed.
- Used simple scp to transfer the missing archives (sequences 65524 to 65533) from Production to the DR.
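The transfer looked roughly like this; the hostnames, paths and file-name pattern below are placeholders, not the actual ones from this environment:
bash$ scp /archive/1_65524_*.arc oracle@drhost:/archive/
bash$ scp /archive/1_65525_*.arc oracle@drhost:/archive/
... (and so on, up to sequence 65533)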
- Stopped the media recovery process at the DR.
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
- Recovered the DR database using a simple RMAN recovery.
bash$ rman target /
RMAN> RECOVER DATABASE UNTIL SEQUENCE 65533 THREAD 1;
- Started the media recovery process again.
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;
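A quick way to confirm that the MRP has actually come back and is applying logs is to check V$MANAGED_STANDBY on the DR; this is a generic sketch, not the exact output from my system:
SQL> SELECT process, status, thread#, sequence#
     FROM v$managed_standby
     WHERE process LIKE 'MRP%';
You want to see the MRP0 process with a status like APPLYING_LOG (or WAIT_FOR_LOG once it has caught up).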
At first everything looked fine. The recovery process started well, the gap was filled and the sync completed. Then the real problem arose.
I was checking the sync at the DR end using the following commands.
SQL> ARCHIVE LOG LIST

Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            C:\app\mudassar\archive
Oldest online log sequence     65783
Next log sequence to archive
Current log sequence           65794

SQL> SELECT THREAD#, LOW_SEQUENCE#, HIGH_SEQUENCE# FROM V$ARCHIVE_GAP;

no rows selected

SQL> SELECT MAX(sequence#) AS "STANDBY", applied FROM v$archived_log GROUP BY applied;

   STANDBY APPLIED
---------- -------
     65794 YES
Seems fine... but no, the client was not convinced. He ran the following query at Production, and the output was always NO.
SQL> SELECT a.sequence#, a.applied, a.archived, b.sequence#, b.applied, b.archived
     FROM (SELECT * FROM v$archived_log
           WHERE sequence# IN (SELECT MAX(sequence#) - 1 FROM v$archived_log)
             AND name = 'stndby') a,
          (SELECT * FROM v$archived_log
           WHERE sequence# IN (SELECT MAX(sequence#) - 1 FROM v$archived_log)
             AND name LIKE '/archive/archive-log%') b
     WHERE a.sequence# != b.sequence# OR a.applied != 'YES' OR b.applied != 'NO';

 SEQUENCE# APP ARC  SEQUENCE# APP ARC
---------- --- --- ---------- --- ---
     65794 NO  YES      65794 NO  YES
So what is wrong here? Why is the output different at Production and at the DR when it should be the same? After a lot of searching, I found that this is a bug in Oracle 10g whereby the media recovery process hangs for some reason and causes exactly this behaviour. I won't go into more detail about the bug, but if you have a Metalink (My Oracle Support) account you can log in and go through Note 1369630.1; if not, you can download the documentation here.
Note that this error does not mean that Production and the DR are out of sync. Because the media recovery process is hung at the Production site, it is not able to update the internal views there. The solution is to upgrade to Oracle 11g or higher (Oracle 12c has already been introduced), or, if possible, restart the Production database to get past the issue temporarily.
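Before bouncing anything, it is worth cross-checking from the Production side that the standby really is applying and that only the view is stale. A generic sanity check, assuming DEST_ID 2 is your standby destination (adjust it to match your log_archive_dest_n setup):
SQL> SELECT dest_id, status, archived_seq#, applied_seq#
     FROM v$archive_dest_status
     WHERE dest_id = 2;
If APPLIED_SEQ# keeps pace with ARCHIVED_SEQ# while v$archived_log still shows APPLIED = NO, you are almost certainly looking at this stale-view behaviour rather than a real gap.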
Hope this has been of help to all you folks!!!! Comments are WELCOME!!!!
Comments

really awesome... that's why I like your blog Mudassar. Thanks

Thanks Aman.. your comments mean a lot :)

Hi Mudassar, thanks for sharing this. It has occurred in my environment whenever there is a network hiccup for a period, even though my timeout setting is longer than that. Instead of restarting the production database, killing all the archiver background processes is enough to fix the primary's v$archived_log.applied.
Bundit

excellent explanation... helps a lot. Thank you

I am very thankful to you for providing such great information. It is simple but very accurate.