Tuesday, October 16, 2012

ADFS 2.0 Event ID 248 and 364: An unsecured or incorrectly secured fault was received

 

We had our first significant outage with ADFS this weekend.  During a Sunday morning change control we updated the communication certificates on all our STS and proxy servers and promoted a newer signing certificate from secondary to primary, following the directions at AD FS 2.0: How to Replace the SSL, Service Communications, Token-Signing, and Token-Decrypting Certificates.  Because our PKI was recently changed, the new signing certificate chained up to a new root, but all of our Dev and QA tests were successful on the new chain.

All changes tested out successfully; our relying parties that only trust one certificate had switched to trusting the new signing certificate and users could still access the relying parties.  So the change control was closed.

Monday morning we received notification that users connecting externally were receiving an error message rather than getting to the Forms-Based Logon page.  What was odd about this outage was that all our internal access to ADFS was fine; only external access through the proxy servers was having issues.

The proxy servers' ADFS logs were filling with Event ID 364 errors:

Encountered error during federation passive request.

Additional Data

Exception details:
System.ServiceModel.Security.MessageSecurityException: An unsecured or incorrectly secured fault was received from the other party. See the inner FaultException for the fault code and detail. ---> System.ServiceModel.FaultException: An error occurred when verifying security for the message.
   --- End of inner exception stack trace ---

Server stack trace:
   at System.ServiceModel.Channels.SecurityChannelFactory`1.SecurityRequestChannel.ProcessReply(Message reply, SecurityProtocolCorrelationState correlationState, TimeSpan timeout)
   at System.ServiceModel.Channels.SecurityChannelFactory`1.SecurityRequestChannel.Request(Message message, TimeSpan timeout)
   at System.ServiceModel.Dispatcher.RequestChannelBinder.Request(Message message, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
   at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

Exception rethrown at [0]:
   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
   at Microsoft.IdentityServer.Protocols.PolicyStore.IPolicyStoreReadOnlyTransfer.GetState(String serviceObjectType, String mask, FilterData filter, Int32 clientVersionNumber)
   at Microsoft.IdentityServer.PolicyModel.Client.PolicyStoreReadOnlyTransferClient.GetState(String serviceObjectType, String mask, FilterData filter, Int32 clientVersionNumber)
   at Microsoft.IdentityServer.ProxyConfiguration.ProxyConfigurationReader.FetchServiceSettingsData()
   at Microsoft.IdentityServer.ProxyConfiguration.ProxyConfigurationReader.GetServiceSettingsData()
   at Microsoft.IdentityServer.ProxyConfiguration.ProxyConfigurationReader.GetFederationPassiveConfiguration()
   at Microsoft.IdentityServer.Web.PassivePolicyManager.GetPassiveEndpointAbsolutePath()
   at Microsoft.IdentityServer.Web.FederationPassiveAuthentication.GetPassiveEndpointAbsolutePath()

System.ServiceModel.FaultException: An error occurred when verifying security for the message.

Our first troubleshooting activity was to restart the ADFS service on the proxy server.  When we did that it logged an Event ID 248 error:

The federation server proxy was not able to retrieve the list of endpoints from the Federation Service at corp.sts.WIDGETS.com. The error message is 'An unsecured or incorrectly secured fault was received from the other party. See the inner FaultException for the fault code and detail.'.

User Action
Make sure that the Federation Service is running. Troubleshoot network connectivity. If the trust between the federation server proxy and the Federation Service is lost, run the Federation Server Proxy Configuration Wizard again.

Frustratingly, no inner FaultException was present.

We re-ran the Federation Server Proxy Configuration Wizard and it completed successfully but the same 248 error occurred at service start.  We also verified the new signing cert did chain up to a root that the proxy server trusted.  Turning up full debug on the proxy server did not provide any additional useful data.

On a functional proxy server one expects service start to result in Event ID 245:

The federation server proxy retrieved the following list of endpoints from the Federation Service at 'https://corp.sts.WIDGETS.com:443/adfs/services/proxytrustpolicystoretransfer':
/FEDERATIONMETADATA/2007-06/FEDERATIONMETADATA.XML
/ADFS/SERVICES/TRUST/MEX
/ADFS/SERVICES/TRUST/2005/WINDOWSTRANSPORT
/ADFS/SERVICES/TRUST/2005/CERTIFICATEMIXED
/ADFS/SERVICES/TRUST/2005/CERTIFICATETRANSPORT
/ADFS/SERVICES/TRUST/2005/USERNAMEMIXED
/ADFS/SERVICES/TRUST/2005/ISSUEDTOKENMIXEDASYMMETRICBASIC256
/ADFS/SERVICES/TRUST/2005/ISSUEDTOKENMIXEDSYMMETRICBASIC256
/ADFS/SERVICES/TRUST/13/CERTIFICATEMIXED
/ADFS/SERVICES/TRUST/13/USERNAMEMIXED
/ADFS/SERVICES/TRUST/13/ISSUEDTOKENMIXEDASYMMETRICBASIC256
/ADFS/SERVICES/TRUST/13/ISSUEDTOKENMIXEDSYMMETRICBASIC256
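
The three event IDs seen so far lend themselves to a quick triage check. The following is a minimal Python sketch, assuming the proxy's AD FS events have already been exported into (timestamp, event ID) pairs. The record format and the `proxy_state` function are hypothetical illustrations; only the meaning of event IDs 245, 248, and 364 comes from this post.

```python
from datetime import datetime

# Event IDs as described in this post (AD FS 2.0 proxy).
ENDPOINTS_RETRIEVED = 245    # proxy fetched the endpoint list at service start
ENDPOINTS_FAILED = 248       # proxy could not fetch the endpoint list
PASSIVE_REQUEST_ERROR = 364  # error during a federation passive request


def proxy_state(events):
    """Classify a proxy from (timestamp, event_id) pairs (hypothetical format).

    The most recent 245 or 248 decides the state. 364 errors are counted
    since the last successful endpoint retrieval.
    """
    state = "unknown"
    errors_364 = 0
    for _, event_id in sorted(events):
        if event_id == ENDPOINTS_RETRIEVED:
            state = "healthy"
            errors_364 = 0
        elif event_id == ENDPOINTS_FAILED:
            state = "cannot reach Federation Service"
        elif event_id == PASSIVE_REQUEST_ERROR:
            errors_364 += 1
    return state, errors_364


# Made-up sample data, for illustration only.
sample = [
    (datetime(2020, 1, 1, 8, 0), 245),
    (datetime(2020, 1, 2, 7, 0), 364),
    (datetime(2020, 1, 2, 7, 5), 364),
    (datetime(2020, 1, 2, 7, 30), 248),
]
print(proxy_state(sample))
```

A proxy whose latest start logged 248 after a run of 364 errors matches the pattern we saw, though the same pattern could have other causes.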

To help isolate the problem we configured a local hosts entry on the proxy server to bypass the load balancers and hit a single STS.  A NetMon trace of the service start showed the SSL handshake and traffic going only to the expected internal STS.  Because we saw no traffic trying to go anywhere except the STS, we were fairly certain there wasn’t an issue with validating the new chain.  Given the error message about an incorrectly secured response, and that we had just changed all the certificates, we were also fairly certain the switchover was the problem, but we had yet to figure out the solution.

Finally we decided to try restarting the ADFS service on the STS the proxy server was using, even though that STS was not exhibiting any errors.  So we restarted the STS and restarted the proxy, and the proxy service started without error.  SUCCESS!!

We restarted the service on the other STSs in the pool and restarted our other proxy server and it started working as well.

My guess for what happened is that the proxy servers reached their 4-hour trust renewal cycle after the change control verification had completed.  At that time, I am guessing the SOAP responses to the proxytrustpolicystoretransfer endpoint requests were still being signed with the old signing certificate when the proxy was expecting them to be signed with the new one, hence the “incorrectly secured” error.  I’m guessing the service restart forced the STS to pick the new certificate to use to sign its SOAP responses for the proxytrustpolicystoretransfer endpoint.  I’m also guessing we missed this in Dev and QA because the proxy usage is a secondary use case and was likely tested after a service restart or server reboot on the STS.
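
The timing part of that guess can be worked through with a little arithmetic. The sketch below is only an illustration: it assumes the 4-hour renewal interval guessed above, and the function names and sample times are hypothetical. It shows how post-change verification can finish cleanly before the first trust renewal that follows the certificate promotion.

```python
from datetime import datetime, timedelta

# Assumption taken from the guess above: proxies renew trust every 4 hours.
RENEWAL_INTERVAL = timedelta(hours=4)


def first_renewal_after(last_renewal, promoted_at, interval=RENEWAL_INTERVAL):
    """Return the first scheduled renewal that falls after the promotion."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    next_renewal = last_renewal
    while next_renewal <= promoted_at:
        next_renewal += interval
    return next_renewal


def verification_covers_renewal(last_renewal, promoted_at, verified_until):
    """True if post-change testing ran past the first renewal after promotion."""
    return verified_until >= first_renewal_after(last_renewal, promoted_at)


# Made-up times: last renewal 06:00, promotion 08:00, testing ends 09:00.
last = datetime(2020, 1, 5, 6, 0)
promoted = datetime(2020, 1, 5, 8, 0)
print(first_renewal_after(last, promoted))
print(verification_covers_renewal(last, promoted, datetime(2020, 1, 5, 9, 0)))
```

With these sample times the next renewal falls at 10:00, an hour after testing ended, so the verification never exercised a renewal against the new certificate.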

We’re still waiting on our Microsoft PFE to return with root cause analysis to see if Microsoft acknowledges a bug in the certificate handling.  But for now, the short story is to cycle the STS services when rolling to new certificates.
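
For a change-control runbook, the order that worked for us (every STS first, then the proxies) can be written down as a simple checklist generator. This is a hypothetical sketch: the host names are placeholders and the function only prints the order; it does not restart anything.

```python
def restart_plan(sts_hosts, proxy_hosts):
    """Return ordered (step, role, host) tuples: every STS first, then every proxy."""
    steps = [("STS", host) for host in sts_hosts]
    steps += [("proxy", host) for host in proxy_hosts]
    return [(n, role, host) for n, (role, host) in enumerate(steps, start=1)]


# Placeholder host names, for illustration only.
plan = restart_plan(["sts1", "sts2"], ["proxy1", "proxy2"])
for step, role, host in plan:
    print(f"{step}. restart the ADFS service on {role} server {host}")
```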

UPDATE (2012.11.07)

I have updated the instructions in the AD FS 2.0: How to Replace the SSL, Service Communications, Token-Signing, and Token-Decrypting Certificates wiki article to include the STS service restart step.

18 comments:

  1. Hello,

    Our organization just had the EXACT same thing happen to us. I was working with an MS escalation engineer on this, and at no time did they direct us to restart the services on the non-proxy farm members (Primary/Secondary ADFS servers). This looks to have had the same impact as it had for you at the time of your posting. Were there any further notes from MS regarding your case, or instructions to be performed, that made this permanent?

    Nick M.

  2. Me too, and the same fix helped me. Had googled for hours on other solutions. Thanks for sharing.

  3. Hi David,

    We noticed that you updated the 'how-to' article on the TechNet Wiki regarding proxy trust issues. This behavior is actually by design, and a service restart is not required if the steps are followed correctly. Here's why:

    1. Proxy trust is simply a SAML assertion that is signed and encrypted using the AD FS signing and decryption certs
    2. If you replace either the signing, the decryption, or both certificates, you must leave the OLD certificates in place as Secondary certificates until you are sure of two things:
    a. All users' SSO sessions signed and/or encrypted using the OLD certs have ended
    b. All FS proxy servers have renewed their trust and received a new trust token which is signed and encrypted based on the NEW certs.
    3. If you remove the OLD certs completely from the AD FS MMC, then the proxy cannot service proxy requests since you've taken away its means of authenticating against the internal FS.

    The article states:
    Leave the old certificate as secondary for rollover purposes. You should plan to remove the old certificate once you are confident it is no longer needed for rollover, or when the certificate has expired.

    I will remove the additions you made to the article, and update the above sentences to show that this affects SSO users as well as proxy trust.

    Thank you,
    Adam Conkle - MSFT

    Replies
    1. Thanks for the reply Adam. Unfortunately our outage shows that documented behavior is not correct as we did not remove a certificate. We only promoted a new signing certificate to primary, and yet the fault occurred. We are actually still waiting to hear back on root cause analysis from our PFE.

    2. We also had a customer with this issue... updated the communication and signing certificates, and left the original certs as secondaries. To fix I revoked all proxies, re-ran the proxy configuration wizard, and restarted the ADFS windows service on both ADFS internal and proxy servers. It wouldn't work without the ADFS service restarts... Event ID 284.

      David, thanks so much for posting the solution.

  4. I had the same fault occur in my environment (two STS, two proxies). Same resolution. Frustrating, because we were using auto-rollover. This was not a manual certificate change. I had expected automatic rollover to obviate the need for a service restart at the time of certificate promotion. Why have "automatic" rollover if you need to intervene manually to restart the STS services?

  5. Same thing happened to me after we had to create a new token signing certificate for Office 365. The sad thing is I did restart the ADFS primary server after creating the certificate, but I think the problem happened when it auto-rolled to using the newly created certificate as primary a week later. Guess I should have set it to primary right away.

    Much thanks from me as well for posting this solution.

  6. Same thing happened here as well; we had to reboot the services on the application servers.

  7. Hi, exactly the same issue today, after a signing certificate renewal. Renewing the proxy trusts fixed the issue for a couple of hours, but after a while the problem came back. Restarting the ADFS services on the back-end farm servers also fixed our issue. Thank you for sharing. I didn't think about restarting the ADFS services on the farm members!!

  8. 1 more 'thank you' from another user with the same problem, same fix.

  9. Great post mate, helped me ultimately resolve an issue relating to the proxy server not communicating after a token decrypting cert change. Thanks again!!

  10. Me too! I sent myself a calendar invite for 3 years from now to reboot those ADFS Services.

  11. We had the same errors on one of our proxy servers. We tried everything suggested here, but it didn't solve our issue. What DID solve it was discovering a clock mismatch on this proxy server compared to the rest. It was 4 minutes out of sync. We corrected that, tested, and everything is working smoothly.

  12. Great post! This was killing me for an entire day!

  13. Same here! Spent over 8 hours troubleshooting! This article made my day!

  14. Ditto here - great post! It saved our bacon. Thanks to your article our resolution time for this mysterious issue was reduced to under an hour.

    Thanks a bunch, David!

  15. This solution was so, so helpful to me as well. It took me an entire day of researching until I landed on this post. Glad the issue occurred on our Development ADFS farm before we renewed the certs on the ADFS Production farm and proxies.

  16. Thanks a lot! Took me about 6 hours before I ran into this solution which temporarily took down our internal users (which is why I didn't try it previously).
