Dealing with a corrupt transaction log in SQL Server

I just went through the experience of a corrupted transaction log for a large SQL Server database. It was actually not as bad as I thought it would be. I ended up restoring from a backup and then putting the corrupt database into emergency mode to generate scripts for updated objects and dumped out data added since the last backup. Fortunately, it wasn’t too hard to figure that out.

As it turns out, I really don’t think anything was lost on the database since it had no live activity for several hours prior to the failure. Unless I’m misunderstanding the SQL checkpoint feature for the transaction log, a database that is at rest and has no activity is likely to have very little reliance on the transaction log in order to be current. Based on comparison of the data sources used to load the database and what was in the tables and inspection of the code changes made from scripts, there appears virtually no loss of data or metadata.

What was the most distressing of this was the fact that the transaction log was actually on a Raid-1 device. The Raid-1 was based on Windows raid because it utilized 2 Fusion-IO (HP IO Accelerator version) drives. I had even coupled together drives from two different HP IO Accelerators Duos to minimize the impact of a failure at the card level rather than at the module level. Only one of the drives failed. However, instead of simply going to a failed redundancy state, both halves of the Raid device ended up corrupted. This is Windows Server 2008 R2. The problem happened while running a FIO-STATUS –a with the HP IO Accelerator GUI open. There was an issue at one point with caution recommended for running fio-status while a drive is being accessed, but I thought that was resolved with later versions of the driver and I have done it before without issues. The only explanation I can think of is that either there was a more broad failure of the driver itself or there is a problem with Windows software raid correctly supporting FIO drives.

In any case, the below cleared up my database, but it took 17 hours to finish (database is over 500GB with about 3 billion rows and was using page/row compression to keep it under 1 TB). By then, I had utilized the emergency mode and got the database restored from 3 day old backup up and running with current data and objects.

EXEC sp_resetstatus ‘tp_v5’;
ALTER DATABASE tp_v5 SET EMERGENCY
DBCC checkdb(‘tp_v5’)
ALTER DATABASE tp_v5 SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DBCC CheckDB (‘tp_v5’, REPAIR_ALLOW_DATA_LOSS)
ALTER DATABASE tp_v5 SET MULTI_USER
alter database tp_v5 set online

This gave me the below type of messages:

There are 3521253 rows in 19385 pages for object “Load.EquityHistory”.
DBCC results for ‘Web.WebDownload’.
There are 5856171 rows in 25344 pages for object “Web.WebDownload”.
DBCC results for ‘Load.EodError’.
There are 0 rows in 0 pages for object “Load.EodError”.
DBCC results for ‘Trader.TestInterval’.
There are 0 rows in 0 pages for object “Trader.TestInterval”.
DBCC results for ‘Simulator.IntervalSetup’.
There are 0 rows in 0 pages for object “Simulator.IntervalSetup”.
DBCC results for ‘Trader.Portfolio_WhatIf’.
There are 0 rows in 0 pages for object “Trader.Portfolio_WhatIf”.
DBCC results for ‘StrategyInterval’.
There are 56 rows in 1 pages for object “StrategyInterval”.
DBCC results for ‘Load.EtfList’.
There are 776 rows in 5 pages for object “Load.EtfList”.
DBCC results for ‘YahooMapping’.
There are 56 rows in 1 pages for object “YahooMapping”.
DBCC results for ‘EquityAdjust’.
There are 189116 rows in 2583 pages for object “EquityAdjust”.
DBCC results for ‘Trader.SimulationIntervalSetup’.
There are 0 rows in 0 pages for object “Trader.SimulationIntervalSetup”.
DBCC results for ‘Load.EquityIntraday’.
There are 1557783239 rows in 8006190 pages for object “Load.EquityIntraday”.

CHECKDB found 0 allocation errors and 0 consistency errors in database ‘tp_v5’.

Note that I had to actually run with the repair_allow_data_loss to get back online even though the initial CHECKDB without the allow data loss passed OK. Even after the first CheckDb, emergency mode remains set until doing one with check for data loss when you have a critical error such as corruption of the log.

One option I tried without success was to detach the database after putting it into emergency mode and then try to attach it without specifying the log file using the rebuild_log option on the attach. However, that is a fatal mistake because SQL Server will not allow this for a database that was not healthy at the time of the attached. Fortunately, I had the wherewith all to shut down the SQL Server and make a copy of all of the files before trying that experiment.

The first thing to do in any crisis like this is to make sure you have a copy of the database files – even the bad ones. If you do make the mistake of detaching an unhealthy database and you have a complete backup that includes all of the file groups, there is the option of moving the files from the bad database to another location, restoring the backup, stopping SQL Server, replacing the original files with the corrupt files, and starting SQL up to get back to the baseline situation.

A better approach is to simply make sure your database is backed up as often as needed for your business requirements..

Advertisements
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s