
Kerberos, SQL Server 2008, Windows Server 2008

Updated Jan. 4th, 2011 to include:

Service account information for SQL Browser  |  Loopback registry setting  |  Additional SPNs for Analysis Services

So over the past couple of weeks I have been working on a multi-node Windows Server 2008 cluster (5 active, 1 passive).  Part of the prep work before the databases were migrated from their old Windows Server 2003 and SQL Server 2005 homes was to set up linked servers.  Simple task, right?

It always had been in the past.

Ever seen this error? 

Message
Error: 18456, Severity: 14, State: 11.

Login failed for user ‘NT AUTHORITY\ANONYMOUS LOGON’. Reason: Token-based server access validation failed with an infrastructure error. Check for previous errors.

Now, I’ve seen that error plenty in my life, and typically it’s for a handful of well-known reasons such as these:  Resource #1 | Resource #2 | Resource #3 | Resource #4 … Each of these provided a great deal of insight, but none of them resolved the issue.

One of the most frustrating parts of this issue was getting it to occur on demand (either failure or success).  It appeared to be intermittent.  Well, at the end of the day it wasn’t intermittent at all… it was simply the way I was testing it.

To begin, I ran “klist tickets” from a command prompt (klist.exe is part of the Windows Server 2003 Resource Kit Tools); this showed me which tickets had been generated.  … screech… back up.  Prior to all of this, it’s important to validate that your SPNs are already set and accurate.  If you are unsure, visit this site and spend some time understanding Kerberos and SPNs.

Once I had a list of the tickets, I could see that some had been generated as I expected, yet others that I expected to see were missing… So, I would run “klist purge” from the command prompt, successfully TEST a linked server … and not have a ticket generated.  c.o.n.f.u.s.i.n.g.

Well, that’s because my SQL Server session had already been established, and the ticket was created when I originally connected to the server via Management Studio.  When I ran “klist purge”, it did indeed remove the Kerberos tickets, but it did NOT sever my already established connection to SQL Server, so the connection continued to work even without a new ticket being generated.
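If you would rather script that check than eyeball ticket lists, here is a minimal Python sketch of the idea: open a brand-new connection every time and ask SQL Server how that connection authenticated.  The auth_scheme column of sys.dm_exec_connections is real; everything else (the function name, and the assumption that you pass in a callable wrapping something like pyodbc.connect with your own connection string) is mine for illustration.  It only covers the first hop, from client to SQL Server, and driver-level connection pooling can hand you back an old session, so you may need to turn that off.

```python
from contextlib import closing

# Reports how the *current* session authenticated (e.g. KERBEROS or NTLM).
AUTH_SCHEME_QUERY = (
    "SELECT auth_scheme FROM sys.dm_exec_connections "
    "WHERE session_id = @@SPID"
)


def fresh_auth_scheme(connect):
    """Open a new connection, read its auth scheme, and close it again.

    `connect` is a hypothetical zero-argument callable that returns a
    DB-API style connection, for example:
        lambda: pyodbc.connect(my_connection_string)
    Closing the connection each time avoids testing against a session
    that was authenticated before the tickets were purged.
    """
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(AUTH_SCHEME_QUERY)
            row = cur.fetchone()
    return row[0] if row else None
```

Run it once after purging tickets and once after reconnecting; if the first hop does not come back as KERBEROS, there is little point in testing the linked server yet.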

Now, the fix for the testing is pretty easy.  Don’t stay connected in Management Studio.  Close it and re-open it, or un-register and re-register the server.  Voilà, this solved the “intermittent” issues with linked servers… kind of.  Throughout the process I was personally convinced that it was a physical host issue on some level or another.  So, I was moving resources from one node to another, testing, moving, testing, and getting all sorts of differing results.  It got to the point where I created 4 linked servers on each server in an attempt to build out a matrix of what worked and what didn’t.  Even after all of that work, it was still inconsistent.

Enter Microsoft.  Tier one and two spent a lot of time on this (two weeks) and we did a slew of items ranging from registry entries to new service accounts to trying differing ways of connecting to the cluster using Kerberos etc… none of it was successful.  I do owe an apology to tier two support for how frustrated I became after a week of working on this with them.  They were trying their hardest and I did not request an escalation point as soon as I should have.  Once I finally did, tier 3 called me and we got down to business.  Tait Neville with the Enterprise Platforms Support – Directory Services team is one heck of a guy.  We spent about 8 hours on the phone the first day and ran traces, klist commands galore, setspn commands six ways from Sunday, read logs, re-read logs, failed over the environment, blew away stuff, re-built stuff… and on and on… That day ended with us realizing that the inconsistency we were seeing was perhaps a larger issue than the lack of connectivity, because sometimes it did indeed work correctly.

Late in the call (7+ hours) Tait came across a newer KB article and hotfix that showed promise, but didn’t fully explain what we were seeing.  Nonetheless, we installed the hotfix.  Tried it out, inconsistency.  Went home, frustrated and tired.

The next day brought about a new attitude (namely mine).  It started with the realization that I wrote about above (Management Studio being connected)… Once we could reliably reproduce tickets being generated, progress was very swift.  At that point, we decided to clear out a huge swath of SPNs that had been created over the period of trying and testing everything.

On each node in the cluster (6 in our case) we ensured that the SQL Server Browser was running under the service account. 

On each node in the cluster we also ensured that the following registry entry was set to 1:  HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\DisableLoopbackCheck
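With six nodes to check, a small script saves some clicking.  Below is a rough Python sketch, assuming the value lives in the standard location under SYSTEM\CurrentControlSet\Control\Lsa.  The helper names are made up for this post; winreg is Python's standard registry module and reg.exe is the stock Windows command-line tool.  The read helper simply returns None when it is not run on Windows or when the value has never been created.

```python
import sys

LSA_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa"
VALUE_NAME = "DisableLoopbackCheck"


def read_loopback_check():
    """Return the current DWORD value, or None if unset or not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # only available on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except FileNotFoundError:
        return None


def reg_add_command(value=1):
    """Build the reg.exe argument list that would set the value.

    This only builds the command; run it yourself from an elevated
    prompt once you are happy with it.
    """
    return [
        "reg", "add", "HKLM\\" + LSA_KEY,
        "/v", VALUE_NAME,
        "/t", "REG_DWORD",
        "/d", str(value),
        "/f",
    ]


if __name__ == "__main__":
    print("current value:", read_loopback_check())
    print("to set it:", " ".join(reg_add_command(1)))
```

I kept the write side as a printed command rather than a direct registry write on purpose; changing LSA settings on a production cluster node is something I would want a human to do deliberately.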

Here’s what we ended up with for SPNs on the SQL Server(s).  *NOTE:  These are all registered against the service account that SQL Server is running under.

MSSQLSvc/Servername.domainname:Port  (in all cases we are running named instances in SQL Server).  A total of five entries, with the server name and port differing as applicable.

MSSQLSvc/Servername.domainname:SQLServerNamedInstanceName  *Note:  .DomainName is the fully qualified domain name (FQDN)

MSSQLSvc/Servername:Port

Here’s what we ended up with for Analysis Services:

MSOLAPDisco.3/ServerName:SQLServerNamedInstanceName

MSOLAPSvc.3/ServerName:SQLServerNamedInstanceName

MSOLAPDisco.3/ServerName

MSOLAPSvc.3/ServerName

MSOLAPDisco.3/ServerName.DomainName  *Note:  .DomainName is the fully qualified domain name (FQDN)

MSOLAPSvc.3/ServerName.DomainName *Note:  .DomainName is the fully qualified domain name (FQDN)

MSOLAPDisco.3/ServerName.DomainName:SQLServerNamedInstanceName  *Note:  .DomainName is the fully qualified domain name (FQDN)

MSOLAPSvc.3/ServerName.DomainName:SQLServerNamedInstanceName *Note:  .DomainName is the fully qualified domain name (FQDN)
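Since attention to detail matters so much here, it can help to generate the SPN strings instead of typing them.  The following Python sketch builds the same patterns listed above and wraps them in setspn -A commands.  The host, domain, port, instance, and account values in the example are placeholders I invented, not our real names, and the function names are mine as well.

```python
def sql_spns(host, domain, port, instance):
    """SPNs for one named SQL Server instance, following the patterns above."""
    fqdn = f"{host}.{domain}"
    return [
        f"MSSQLSvc/{fqdn}:{port}",
        f"MSSQLSvc/{fqdn}:{instance}",
        f"MSSQLSvc/{host}:{port}",
    ]


def olap_spns(host, domain, instance):
    """Analysis Services SPNs: short name and FQDN, with and without instance."""
    fqdn = f"{host}.{domain}"
    spns = []
    for name in (host, fqdn):
        for target in (name, f"{name}:{instance}"):
            spns.append(f"MSOLAPDisco.3/{target}")
            spns.append(f"MSOLAPSvc.3/{target}")
    return spns


def setspn_commands(spns, account):
    """Turn SPNs into 'setspn -A <spn> <account>' command lines."""
    return [f"setspn -A {spn} {account}" for spn in spns]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    host, domain = "SQLNODE1", "corp.example.com"
    everything = sql_spns(host, domain, 50001, "INST1") + olap_spns(host, domain, "INST1")
    for line in setspn_commands(everything, r"CORP\svc_sql"):
        print(line)
```

Printing the commands first, and reviewing them before running anything, also makes it easy to compare the list against what is already registered on the service account and spot the extras.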

Wait for AD to replicate…. ho hum ho hum….  (in our experience, this can take up to 20 minutes) …  voilà.  Works like a charm.

Summary:

It turned out that we had a couple of issues in our environment.

1.)  We had too many SPNs.  None of them were malformed, but not all of them were needed, and the extras were causing some odd issues.

2.)  Every time we failed over the SQL Server resource to another node, it would work sometimes and not other times.  The hotfix resolved this.

3.)  SQL Server Browser Service was not set to the Service Account.

4.)  DisableLoopbackCheck was set to 0.

To Paul (the SE here) and Tait (MSFT engineer), thank you.

Fun times… well, not really, but I learned a lot and hopefully this post will also help you if you find yourself in a similar scenario.

The main takeaways for me are:

1.)  Attention to detail is critical, specifically with SPN creation and with doing everything in the same order on every machine in a cluster.

2.)  It’s absolutely critical to have a set of test steps that you can reproduce at will.  A lot of time was spent on this case because of what appeared to be inconsistent behavior, when in reality it was the test itself that was causing a lot of the confusion.

3.)  When you just can’t figure it out, a call to Microsoft, while long, time-consuming and at times frustrating, can yield tremendous results.