Okay, so I had a weird issue. A user would sign in to Office Communicator 2007 R2. Their login would fail. Not terrificly unusual, but when digging in to the trace log on the client, the trace went like this:
Client: REGISTER
Server: 401 UNAUTHORIZED (Normal since anonymous login isn’t allowed and that kicks in the Kerberos authentication)
Client: REGISTER (with the Kerberos blob)
Server: 480 TEMPORARILY UNAVAILABLE
What? What the heck is temporarily unavailable mean in this context? Scratching my head, I had to go about getting in to some really detailed logging. First, I needed a known working client that I could base my findings against. Here’s the normal SIP conversation:
Client: REGISTER
Server: 401 UNAUTHORIZED
Client: REGISTER (with the Kerberos blob)
Server: 200 OK
Client: SERVICE …
I’m interested in seeing what is going on between that second REGISTER message and the resulting OK response. I used the diagnostic logging on the pool and enabled logging for the SIPStack (all trace flags turned on) and the UserServices (all trace flags turned on) components. I enabled a filter for just the URI of the account I was interested in tracing (which filters the SIPStack log file but doesn’t seem to affect the UserService log).
I start out with the Communicator client signed out. IMPORTANT! Clear out all of your old log files before starting logging since the analysis will include those old files. I start logging, sign in on Communicator (wait for it to actually start to display my contact list), and then stop logging on the server.
I analyze the log files (specifically the SIPStack trace). This lets me find the SIP messages and, more importantly, the timestamps associated with them. Those timestamps are used to locate the area of interest in the raw log files.
First, I check the SIPStack raw log files and locate the timestamp of the second REGISTER message. Lots of data there, but the key part are the log entries for the CAUTH* components. This is where you see that the user was successfully authenticated or not.
Once found and proven that authorization against AD is fine, search down from there for the phrase UserServices. You should see that the SIPStack is setting the authorizor to www.microsoft.com/LCS/UserServices. Continue the search and you’ll see that the SIPStack is going to call over to UserServices to find the user in the database.
There are some back and forth messages between the SIPStack and UserServices components as they populate data from the SIP message into a query that can be performed against the database. Keep searching for UserServices to follow along.
The first query is to see if the endpoint already exists in the database (I suppose to see if this is a re-login type event). Normally, no endpoint is found.
Next, UserServices proceed with the REGISTER portion and work to register the endpoint into the database. Shortly after the registration is completed successfully, there should be a CISIPResponse::SetStatusCode:ISIPResponse.cpp message where the status code is set to 200. This is the start of the return message to the client.
All of that is “normal” processing for a login. For the failed user who was responding with a 480 TEMPORARILY UNAVAILABLE, here’s what was found.
Essentially, I searched for the text [480]. That got me to the first part of the conversation where it was decided that there was something temporarily unavailable. Scrolling up from there, I found the problem.
Turns out that there was a database error when the system was calling the RtcUpdateEndpoint stored procedure. The database error text is:
RtcUpdateEndpoint sproc execution failed : ExecHr = [hr=S_OK], NativeError = [52051], NativeErrorSeverity = [11], NativeErrorLineNumber = [21], NativeErrorSqlState = [1], OdbcSqlState = [42000], ErrorText = [[# [Microsoft][SQL Native Client][SQL Server]###52051:RtcUpdateEndpoint:Timed out waiting for applock for a publisher. #
This aligns with the failure reason given in the SIP/2.0 480 Temporarily Unavailable message ultimately delivered to the client:
Publisher lock request timedout
So, what could the real problem be? I dug into the stored procedure to find out what it does and where the problem may lie. In the procedure, the only reference to the term applock was a call to the procedure rtcdyn.dbo.sp_getapplock. This looks to be the correct procedure to further investigate since the error code resulting translates into error 52051 (also the only reference to that number in the RtcUpdateEndpoint procedure). So, it appears that we cannot set up an applock in the rtcdyn database in order to update the endpoint.
The lock resource name is publisher_ResourceID where ResourceID is the primary key of the entry in the rtc.dbo.Resource table that matches the SIP address of the item asking for the lock. Since the ResourceID is easily discoverable, it is a simple matter to view the application locks in the database (the rtcdyn database) using the sp_lock stored procedure.
Lo and behold, I found the lock there. Unfotunately, there’s doesn’t seem to be a method to remove that application lock. The fix, then is to restart the SQL Server service. By shutting down, all pending transactions are rolled back and all locks are removed. That seemed to fix the problem for these users who were getting the 480 error.