Monday, December 20, 2010

Exchange 2010 Two Nodes Enter – ONE LIVES.

Starting Active: MBX1

Starting Passive: MBX2

Starting Primary Active Manager: MBX1

The scenario: a two-node DAG (yes, I know more nodes is ideal, but that’s life) has to be brought down, and the active node then hits an issue from which it cannot recover. The same steps can apply after a site-wide power outage or other disaster.

In my testing, the issues that arise in this scenario varied from run to run.

Test lab steps:

1. Shut down both nodes.

2. Power on MBX2. Leave MBX1 offline.

3. The database copy status comes up as Service Down (as expected) for MBX1, and Disconnected/Resyncing for MBX2.
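The state in step 3 can be checked from the Exchange Management Shell. This is a minimal sketch, assuming the server names used in this post. The cmdlet may itself return errors while the cluster service on MBX2 is still down.

```powershell
# List each database copy hosted on MBX2 with its copy status
# and content index state.
Get-MailboxDatabaseCopyStatus -Server MBX2 |
    Format-Table Name, Status, ContentIndexState -AutoSize
```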

Issues and solutions that may arise:

1. Cluster won’t start on MBX2.

a. On MBX2, open a CMD prompt and run: ‘net start clussvc /fq’

b. The /fq (force quorum) switch brings the cluster service up on MBX2 even though the cluster does not have quorum.

2. EventID: 3154 MSExchangeRepl

Active Manager failed to mount database DB1 on server EXCHANGE. Error: An Active Manager operation failed with a transient error. Please retry the operation. Error: A transient error occurred during discovery of the database availability group topology. Error: Database action failed with transient error. Error: A transient error occurred during a database operation. Error: MapiExceptionNetworkError: Unable to make admin interface connection to server

a. Verify that MBX2 is the Primary Active Manager (PAM).

i. Generally, after a failure of this type the former PAM (MBX1 in this case) will show as offline, or the role will be unassigned.

b. The PAM is whichever node owns the cluster core resource group (“Cluster Group”). If MBX2 does not own that group, move it there manually.

c. Run: cluster . group "Cluster Group" /move:MBX2

3. Check PAM via ‘Get-DatabaseAvailabilityGroup -Status | fl name,primaryactivemanager’

a. Confirm that the output lists MBX2 as the PAM.

4. EventID: 3170 MSExchangeRepl

Attempt to move active database ‘DB1’ from MBX2 to MBX1 failed. Error: An Active Manager operation failed. Error: The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy ‘DB1’ on server ‘MBX1’ has content index catalog files in the following state: 'Failed'.

a. Validate: Get-MailboxDatabaseCopyStatus | fl name, contentindexstate

b. Content index displays as ‘Failed.’

c. Move-ActiveMailboxDatabase -Identity DB1 -ActivateOnServer MBX2 -SkipLagChecks -MountDialOverride BestEffort -SkipClientExperienceChecks

d. Repeat the command for every other database. The -SkipClientExperienceChecks switch makes the move ignore the failed content index.

5. All databases are mounted on MBX2, and MBX1 is still down.

6. Boot MBX1 once the disaster is resolved and let replication resume.

a. Replication comes up as ‘Healthy’ after a few minutes.

b. A reseed may be necessary depending on database size and circumstances, but I never needed one.
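With more than a handful of databases, the repeated activation in issue 4 can be scripted. The following is only a sketch, assuming it runs in the Exchange Management Shell on MBX2 after the cluster service is up and MBX2 is the PAM. The variable names are made up for illustration, and the switches are the same ones used in the manual command above. Note that BestEffort mounts a database regardless of missing log files, so it accepts possible data loss.

```powershell
# Assumption: MBX2 is the surviving node, as in this post.
$target = 'MBX2'

# Collect the copy list first instead of piping straight into the move;
# nesting Exchange cmdlets inside a live pipeline can fail in remote PowerShell.
$copies = Get-MailboxDatabaseCopyStatus -Server $target

foreach ($copy in $copies) {
    # Skip databases that are already mounted on the target.
    if ($copy.Status -eq 'Mounted') { continue }

    Move-ActiveMailboxDatabase -Identity $copy.DatabaseName `
        -ActivateOnServer $target `
        -SkipLagChecks `
        -SkipClientExperienceChecks `
        -MountDialOverride BestEffort `
        -Confirm:$false
}
```

Collecting the copies into a variable before the loop keeps each move as a separate call, which makes it easier to spot the one database that refuses to mount.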
