Lync and DCOM -1007781356 RollbackMoveAway Failures

Problem Overview

Like a few others out there, I recently encountered the dreaded -1007781356 DCOM error.  It started when a client notified me that after migrating 8,000+ users from one pool to another, a small handful, about 150, wouldn’t move.  Most of these users showed the following error: “Distributed Component Object Model (DCOM) operation begin move away failed.” with “RollbackMoveAway failed -1007781356”.

[Screenshot: DComNoMove]

We found that about 20 of those users had a slightly different error, DCOM -1007200250.  These errors gave a bit of additional detail in the message: “Distributed Component Object Model (DCOM) operation begin move away failed because user was not found in database.”, which can be seen below.

[Screenshot: UserNotFound]

After some additional investigation, other issues became apparent.  The Backup Service was showing an error state.

[Screenshot: BackupStateError]
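One way to check the Backup Service state from the Lync Server Management Shell is something along these lines; this is just a sketch, and the pool FQDNs are placeholders for illustration:

# Report the Backup Service export/import state for each pool in the paired set.
# Pool FQDNs below are placeholders; substitute your own paired pools.
$pools = "pool01.lyncfix.com", "pool02.lyncfix.com"
foreach ($pool in $pools) {
    Get-CsBackupServiceStatus -PoolFqdn $pool | Format-List
}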

Further, there were LS Backup Service 4073 errors showing “Microsoft Lync Server 2013, Backup Service user store backup module detected items having pool ownership conflict during import.”  The list of users in those events included many of those who wouldn’t move.  This confirmed that we were dealing with a pool ownership conflict, where the user partially exists in more than one pool’s SQL databases.

[Screenshot: Pool_Ownership_Conflict]
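To pull the affected users out of those 4073 events without clicking through Event Viewer, a quick event log query is enough.  This is only a sketch; it assumes you run it on a Front End server and that the events land in the default “Lync Server” event log:

# Grab recent LS Backup Service 4073 events and dump their messages,
# which contain the list of users with the pool ownership conflict.
Get-WinEvent -FilterHashtable @{ LogName = 'Lync Server'; Id = 4073 } -MaxEvents 10 |
    Select-Object TimeCreated, Message |
    Format-List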

Export-CsUserData was also failing with the error: “Export-CsUserData : Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.  This failure occured while attempting to connect to the Principle server.”
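For reference, the export we were attempting was roughly the following; the pool FQDN and output path are placeholders standing in for our real values:

# Export user data (contacts and conference data) for the pool to a zip file.
# This is the call that was timing out against the backend.
Export-CsUserData -PoolFqdn "pool01.lyncfix.com" -FileName "C:\Backup\PoolUserData.zip"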

Initial Solution Attempt

While I wasn’t entirely sure that all of these issues were related, the timeline in which they started manifesting lined up closely.

The first thing I did?  I’m not ashamed to admit that I went straight to the search engines; there’s no use reinventing the wheel if this is a common problem.  Unfortunately, I didn’t find much.  There were similar errors with resolutions that didn’t seem to line up, since this was only affecting a handful of users.  Further, any approach I took had to be taken with extreme care, as this environment has tens of thousands of users.  The blog I encountered with the most helpful information was John Cook’s.

http://johnacook.wordpress.com/2014/05/08/pool-ownership-conflict-moving-users-between-lync-pools/

With an identical error and symptoms, he was able to contact Microsoft PSS, who had a tool that could resolve the issue.  Before we headed down that path, I wanted to see if there were any additional workarounds I could try in our environment.

Taking a cue from John’s blog, the VerifyUserDataReplication.exe tool from the Lync Resource Kit gave me output listing the identical user set found in the LS Backup Service 4073 errors, which also lined up nicely with the users who refused to move.

We had a reasonably good backup of user data despite the Export-CsUserData timeouts, and we located a user who hadn’t logged in for quite some time to use as a guinea pig.  Using that account, we were able to move the user with the following command:

Move-CsUser -Identity <Identity> -Target <TargetPool> -Force

The -Force is what got us there: it ignores the user data, which in our case was what was preventing us from moving the account.  After that, we were able to run Update-CsUserData to merge the contacts back in for this user from our backup.  The remaining users were scheduled for a forced move and restore that night.
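The restore for our guinea pig looked roughly like this; the zip file and SIP address are placeholders standing in for our real backup file and test account:

# Merge the user's contacts back in from the earlier Export-CsUserData backup.
Update-CsUserData -Filename "C:\Backup\PoolUserData.zip" -UserFilter "guineapig@lyncfix.com"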

As a side note, it was comforting to see sharp guys out there such as Flinchböt fighting the same issues at the same time and coming up with the same approach.  http://flinchbot.wordpress.com/2014/09/17/moving-immovable-users/

Almost, but Not Quite

The rest of the users moved successfully, accomplishing the initial goal.  However, the LS Backup Service 4073 errors and the VerifyUserDataReplication.exe output were still reporting the issue.  A sample of the VerifyUserDataReplication.exe output can be seen below.

Info: reading batches served by jpprdl3sql1.adms.lyncfix.com\lync13Tokyo from backup pool.
Info: 6247 batches are returned from deprdl3sql1.adms.lyncfix.com\lync13Berlin.
Info: 116161 items are returned from deprdl3sql1.adms.lyncfix.com\lync13Berlin.
Info: reading batches served by jpprdl3sql1.adms.lyncfix.com\lync13Tokyo from source pool.
Info: 6248 batches are returned from jpprdl3sql1.adms.lyncfix.com\lync13Tokyo.
Info: 116366 items are returned from jpprdl3sql1.adms.lyncfix.com\lync13Tokyo.
Info: comparing batches served by jpprdl3sql1.adms.lyncfix.com\lync13Tokyo in source pool and backup pool.
Error: batch bf36c405-0396-429e-bac3-001dd81d17b6 has item 6abec8cb-5857-4cc4-8c44-dd99d9e47206-urn:hcd:theresa.jerd@lyncfix.com whose partial version 3 in source pool is less than or equal to batch’s partial version 4 in backup pool. It cannot find same item in backup pool.
Error: batch bf36c405-0396-429e-bac3-001dd81d17b6 has item 6abec8cb-5857-4cc4-8c44-dd99d9e47206-urn:lcd:theresa.jerd@lyncfix.com whose partial version 3 in source pool is less than or equal to batch’s partial version 4 in backup pool. It cannot find same item in backup pool.

Clearly, the move got us past the initial DCOM error by ignoring the database issue, but it didn’t clean up the offending database records.  The next approach was to completely disable a user and try again.  If we couldn’t resolve the issue with a forced move, surely removing the user from Lync would do the trick, and we figured we could always recreate the user later if this approach worked.  We went back to our guinea pig account and ran Disable-CsUser, which deletes all of the Lync Server attribute information from an Active Directory user account.
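For completeness, the disable itself is a one-liner; the SIP address is a placeholder for our test account:

# Remove the Lync attributes from the AD account. Note this does not clean up
# the backend user store databases.
Disable-CsUser -Identity "sip:guineapig@lyncfix.com"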

Even with the account “removed” from Lync, the issue persisted.  The user still showed up in the LS Backup Service 4073 errors and the VerifyUserDataReplication.exe output.  We re-enabled the user and reimported the user data from our backup.  In hindsight it makes sense that this approach didn’t work: Disable-CsUser only removes the Active Directory attributes; there is no backend database cleanup.  But if a forced move and a delete don’t do it, what will?
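Re-enabling the account followed the usual pattern, with the same Update-CsUserData restore shown earlier run afterwards to bring the contacts back; the account name and pool FQDN below are placeholders:

# Re-home the test account on its pool, then restore contacts with Update-CsUserData as above.
Enable-CsUser -Identity "guineapig@lyncfix.com" -RegistrarPool "pool02.lyncfix.com" -SipAddress "sip:guineapig@lyncfix.com"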

Final Answer

The final answer was simple.  Now that the DCOM error was resolved thanks to the Move-CsUser -Force command, we could freely move the users back to the old pool, and then move them again to the final destination.  Success!  Using this method, the pool ownership conflict was resolved, the user names disappeared from the event log, and the VerifyUserDataReplication.exe output no longer reported the accounts as an issue.
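In practice, the cleanup for each affected user was a simple round trip, roughly like the following; the pool FQDNs are placeholders, and theresa.jerd is one of the users from the VerifyUserDataReplication.exe output above:

# Round-trip move: back to the original pool, then on to the final destination.
# No -Force needed this time, and the conflicting backend records get cleaned up.
Move-CsUser -Identity "theresa.jerd@lyncfix.com" -Target "oldpool.lyncfix.com" -Confirm:$false
Move-CsUser -Identity "theresa.jerd@lyncfix.com" -Target "newpool.lyncfix.com" -Confirm:$false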

Moving the previously force-moved users back to their old pool, then back again to their new home, cleaned up the database enough that not only did our events disappear, but Export-CsUserData stopped experiencing its timeouts and the Backup Service returned to a normal state.  We’re now back to a healthy environment.

A Special Thanks

I’d like to extend a special thanks to John A Cook and Flinchböt for sharing their experiences and letting me talk through mine.  Please feel free to reach out to me here or on twitter @CAnthonyCaragol if you’re experiencing issues of your own.

Comments

  1. Hendrik du Plessis

    Thanks Anthony for your well-written post. I had the same error, and your advice worked for me too, but I first had to clear a different hurdle.

    My scenario was a bit different. We had a user who could not sign into Lync, and when I tried to move the user to another pool as a way of testing the integrity of its database entry, it kicked out this same error. Force-moving the user succeeded in getting the user across to the other pool, but this still did not resolve the login problem, and I noticed that the database entry did not get overwritten as I had intended. I still could not move the user “normally” between pools, as it kept kicking out the same error described in your article.

    I then looked at the AD object of the user account in question using ADUC and added the following Security ACL entries with write access to the object:

    CsAdministrator
    RTCUniversalServerAdmins

    Thereafter I once again executed your force-move method and this time it worked for me. The user could log in again and I could move the user between pools “normally” without the error.

    Thanks again!

  2. Jake H.

    I would like to share some knowledge on this issue as well, since it is not always the same root cause. Our issue showed exactly the same DCOM error when we tried to move a user to SfB Online or back on-premises.

    It turned out we had multiple Edge pools enabled for the federation route. We had two Edge pools, a single-server Edge pool and a two-server Edge pool. We cut over from the single-server Edge pool to the two-server Edge pool but forgot to disable the federation route on the single-server Edge pool.

    We disabled the single-server Edge pool, published the topology, and everything worked great again!
