During some recent DR testing I came across an interesting situation that caused calls to fail when 1 of my 2 SIP services went down. I wanted to prove that in the event of a SIP provider/network failure, my 2 Front End pools would continue to route calls via the secondary gateway.
To simulate a SIP provider failure I pulled the network cable from the back of the Sonus gateway. I noticed fairly quickly that the gateway reported that the SIP signalling group was down. I then made a test call from a Lync user whose primary gateway was the one with the simulated failure. The call did not go through! After running some traces I found that the gateway was sending a “408 Request Timeout” message back to Lync. This is a problem because Lync treats response codes in the 400 range as final, and will not attempt to route the call via any other configured gateway. This actually makes sense, as 400 range response codes are client failure responses.
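To make the behaviour described above concrete, here is a minimal Python sketch of that failover decision. It is not Lync's actual routing code. The function `route_call`, the `send_invite` callback and the gateway names are hypothetical, and the sketch simply assumes that only a 500 range response causes the next gateway to be tried.

```python
from typing import Callable, Optional, Sequence


def route_call(gateways: Sequence[str],
               send_invite: Callable[[str], int]) -> Optional[str]:
    """Try each gateway in order and return the one that accepted the call.

    send_invite is a hypothetical callback that sends an INVITE to the
    given gateway and returns the final SIP status code it received.
    """
    for gateway in gateways:
        status = send_invite(gateway)
        if 200 <= status < 300:
            return gateway   # call accepted by this gateway
        if 500 <= status < 600:
            continue         # server failure: move on to the next gateway
        return None          # anything else (e.g. 408) is treated as final
    return None              # every gateway reported a server failure


# Hypothetical gateway names and canned responses for illustration only.
responses_408 = {"primary-gw": 408, "secondary-gw": 200}
responses_504 = {"primary-gw": 504, "secondary-gw": 200}

print(route_call(["primary-gw", "secondary-gw"], responses_408.__getitem__))
print(route_call(["primary-gw", "secondary-gw"], responses_504.__getitem__))
```

With the 408 response the call is abandoned after the first gateway. With a 504 the same call goes on to the secondary gateway.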
So how do we get around this? If we were to send a server failure response code from the 500 range instead, Lync would recognise that the server/service is down and attempt to re-route the call. To achieve this we need to use an outbound translation rule on the Lync signalling group to change a 408 (Request Timeout) to a 504 (Server Timeout).
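As an illustration of what such a translation rule does to the message on the wire, here is a small Python sketch that rewrites the status line of a SIP response. It is not the Sonus configuration, which is done in the gateway's own interface. The function name `translate_status` and the sample message are hypothetical, and the “Server Time-out” reason phrase is the one RFC 3261 gives for 504.

```python
import re

# Matches a SIP response status line such as "SIP/2.0 408 Request Timeout".
STATUS_LINE = re.compile(r"^SIP/2\.0 (\d{3}) (.*)$")


def translate_status(message: str,
                     from_code: int = 408,
                     to_code: int = 504,
                     to_reason: str = "Server Time-out") -> str:
    """Return the message with its status line rewritten if the code matches.

    Requests, and responses carrying any other status code, pass through
    unchanged. Headers and body are never touched.
    """
    first_line, separator, rest = message.partition("\r\n")
    match = STATUS_LINE.match(first_line)
    if match is None or int(match.group(1)) != from_code:
        return message
    return f"SIP/2.0 {to_code} {to_reason}{separator}{rest}"


# Hypothetical, heavily trimmed response for illustration only.
original = "SIP/2.0 408 Request Timeout\r\nCSeq: 1 INVITE\r\n\r\n"
print(translate_status(original).splitlines()[0])
```

Only the first line changes, so the rest of the response still correlates with the original INVITE.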
First create a “Message Rule Table” to add the new rule to:
To verify the change has had the desired effect I used the Sonus LX tool. You should now see 2 invites – the first is an attempt to route the call via the primary gateway, the second is the call trying the secondary gateway: