Windows Fabric certificate error – Front End service fails to start


My team and I were stuck on an error none of us had ever seen before, and its root cause was proving very difficult to determine! This post is to help you avoid the pain and suffering we went through late into the night.

Issue

After rebooting the two Skype for Business Standard Edition servers, conferencing no longer worked on either pool. This error appeared in the ‘Lync’ event log:

“Skype for Business Server MCUFactory exception was handled Exception: System.AggregateException: One or more errors occurred. ---> System.Fabric.FabricException: A communication error caused the operation to fail. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071BBC”

We were also seeing lots of Fabric errors, recurring every second:

  • 0x1C55350FD40: authorization failure
  • 1c55350fd40: cert chain trust status is in error: 0x40
  • Disconnected: <FE-SERVER>:5092 ([ nodeInstance=0:0 ]) error=E_ABORT
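In hindsight, the 0x40 in the second error was a useful clue. Assuming Fabric is logging the Windows CryptoAPI chain error status here (an assumption, the log does not say so explicitly), the value can be decoded against the CERT_TRUST_* bit flags from wincrypt.h. The sketch below carries only a subset of those flags, and the function name is made up for illustration:

```python
from typing import List

# Subset of the CERT_TRUST_* error-status bits from the Windows SDK header
# wincrypt.h. Treating the Fabric log value as this bitmask is an assumption.
CERT_TRUST_ERROR_BITS = {
    0x00000001: "CERT_TRUST_IS_NOT_TIME_VALID",
    0x00000004: "CERT_TRUST_IS_REVOKED",
    0x00000008: "CERT_TRUST_IS_NOT_SIGNATURE_VALID",
    0x00000010: "CERT_TRUST_IS_NOT_VALID_FOR_USAGE",
    0x00000020: "CERT_TRUST_IS_UNTRUSTED_ROOT",
    0x00000040: "CERT_TRUST_REVOCATION_STATUS_UNKNOWN",
    0x00010000: "CERT_TRUST_IS_PARTIAL_CHAIN",
    0x01000000: "CERT_TRUST_IS_OFFLINE_REVOCATION",
}


def decode_trust_status(status: int) -> List[str]:
    """Return the names of the known error bits set in a trust status value."""
    names = [name for bit, name in sorted(CERT_TRUST_ERROR_BITS.items())
             if status & bit]
    # Report any bits that are set but not in the table above.
    known_mask = 0
    for bit in CERT_TRUST_ERROR_BITS:
        known_mask |= bit
    leftover = status & ~known_mask
    if leftover:
        names.append(f"unrecognised bits: {leftover:#x}")
    return names


print(decode_trust_status(0x40))
# -> ['CERT_TRUST_REVOCATION_STATUS_UNKNOWN']
```

Read that way, the error is about revocation checking (CRLs), not about the certificates themselves.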

With the Fabric errors in mind, we tried a quorum reset (Reset-CsPoolFabricState -ResetType QuorumLossRecovery) on one of the Front End servers, which caused the Front End service to stop running altogether!

“The Skype for Business Server Front-End service terminated with the following service-specific error: General access denied error”

Now we were in real trouble 🙂 So we tried reinstalling the Windows Fabric Host service. No luck.

It seemed very suspicious to us that both servers would be affected at the same time, and our feeling was that this was environment related, namely certificates. The certificates all looked OK, but we re-issued them just in case. Nope, that didn’t help!

We kept looking all over but kept circling back to certificates. On deeper inspection, we noticed that the Intermediate CA that issued the certificates had a valid and published CRL; however, the Root CA that issued the Intermediate CA certificate didn’t. The Root CA’s last CRL had expired and, because the Root CA was offline, it had never been re-generated. We verified this using certutil (C:\Windows\System32\certutil.exe), entering the DN for the CRL:

You can double-click the entries in the tool to see more detail. In our case, you can see that the CRL is expired:

In summary:

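  • Root CA – Offline, so no current CRL is published
  • Intermediate CA certificate – No valid CRL to check it against, as the Root is not publishing one
  • Certs issued by the Intermediate CA – Have a valid CRL
  • Root CA’s last CRL – Expired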

Interestingly, the Root CA’s CRL had been expired for 20 months, so Fabric must have a damn long grace period!? It’s a little weird that this issue didn’t surface a long time ago. In fact, this Skype for Business deployment was not even 20 months old, yet it installed and worked happily for 6 months. Go figure…

Resolution

It was the middle of the night and we couldn’t resolve the CRL issue right there and then, so we looked for a workaround. The temporary fix was to edit ClusterManifest.Xml.Template (in C:\Program Files\Skype for Business Server 2015\Server\Core) and set the CRL checking flag to <Parameter Name="CrlCheckingFlag" Value="0" />. Right after doing this, the services started and conferencing came up on both servers.
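If you would rather script that change than hand-edit the template, here is a minimal sketch of the text substitution. It assumes the parameter is written with Name before Value, as in the line quoted above. The sample fragment, its Section wrapper and the PLACEHOLDER value are made up for illustration and are not copied from a real template. Reading and writing the actual file, and taking a backup first, are left out.

```python
import re


def set_parameter(manifest_text: str, name: str, value: str) -> str:
    """Return manifest_text with the Value of the named <Parameter> replaced.

    Assumes the attribute order Name="..." Value="..." and leaves every other
    character of the text untouched.
    """
    pattern = re.compile(
        r'(<Parameter\s+Name="' + re.escape(name) + r'"\s+Value=")[^"]*(")'
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + value + m.group(2), manifest_text
    )
    if count == 0:
        raise KeyError(f"Parameter {name!r} not found")
    return new_text


# Hypothetical fragment, for illustration only.
SAMPLE = '''<Section Name="Security">
  <Parameter Name="CrlCheckingFlag" Value="PLACEHOLDER" />
  <Parameter Name="IgnoreCrlOfflineError" Value="true" />
</Section>'''

patched = set_parameter(SAMPLE, "CrlCheckingFlag", "0")
print(patched)
```

Keep a copy of the original file so the flag can be put back once the CRL problem is properly fixed; this was only ever meant as a temporary workaround.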

A big thanks to Luis Ramos for pointing us to this fix!!!

Interestingly, IgnoreCrlOfflineError was already set to true by default (<Parameter Name="IgnoreCrlOfflineError" Value="true" />), so one would have thought that an inaccessible CRL wouldn’t matter. Maybe, because there was no published CRL distribution point at all, the check failed before it ever got to the offline test.

The long-term fix was to renew the CRL and put a process in place to make sure it is manually renewed before it expires in future.
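As a rough idea of what such a process could check, the sketch below classifies a CRL by its Next Update time. How you obtain that timestamp (reading it off certutil, or parsing the CRL with a library) is deliberately left out. The function name, the 30-day warning window and the dates in the example are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def crl_status(next_update: datetime, now: datetime,
               warn_within: timedelta = timedelta(days=30)) -> str:
    """Classify a CRL as 'expired', 'renew soon' or 'ok' by its Next Update."""
    if now >= next_update:
        return "expired"
    if next_update - now <= warn_within:
        return "renew soon"
    return "ok"


# Hypothetical dates, purely to show the three outcomes.
now = datetime(2024, 1, 15, tzinfo=timezone.utc)
print(crl_status(datetime(2024, 1, 1, tzinfo=timezone.utc), now))   # expired
print(crl_status(datetime(2024, 2, 1, tzinfo=timezone.utc), now))   # renew soon
print(crl_status(datetime(2024, 6, 1, tzinfo=timezone.utc), now))   # ok
```

Run on a schedule against the offline Root CA’s published CRL, something like this would have flagged the problem long before the 20-month mark.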

 
