How to fix WebSockets/HAProxy timeout
If your WebSocket connections handled by HAProxy keep getting dropped, increasing the tunnel timeout to 60s or disabling it with a value of 0 will probably fix the problem. If you'd like to know why this happens, read along!
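For the impatient, here is the gist of the fix as an HAProxy setting (whether it belongs in your defaults, frontend or backend section depends on your configuration; 60s is just the value suggested above):
timeout tunnel 60s
# or, to disable the tunnel timeout entirely:
timeout tunnel 0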
Context
Let's start with how our single-page app communicates with the backend, since that context is necessary to understand what happened.
Instead of the standard XHR approach, we use Phoenix channels as a sort of RPC for our backend. When a user logs in to Surfer, we establish a WebSocket connection using the Phoenix JS library:
import { Socket } from 'phoenix'

// ...

this.socket = new Socket(host(), {
  params: { token: this.props.currentUser.token },
  reconnectAfterMs: () => 1000,
})
this.socket.connect()
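The reconnectAfterMs override makes the client retry a dropped connection after a fixed 1 second, which matters later: whenever the socket is killed, it comes back almost immediately.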
After that, we establish a single channel for the given user to handle all the communication:
this.channel = this.socket.channel(`room:${this.props.currentUser.id}`)
this.channel.join()
When we need to make a request to the backend, we channel.push the data and receive the response:
this.channel.push('billing:status', {}).receive('ok', billing => {
  this.setState({ billing })
})
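As a side note, a push can also register handlers for error and timeout replies, which would at least have turned the endless spinner described below into an explicit failure. A minimal sketch based on the snippet above (the explicit 10000 ms push timeout is an illustrative value, not something from our code):
this.channel
  .push('billing:status', {}, 10000) // third argument: how long to wait for a reply
  .receive('ok', billing => this.setState({ billing }))
  .receive('error', reason => console.error('billing:status failed', reason))
  .receive('timeout', () => console.warn('billing:status timed out'))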
Problem
One day we noticed that some of the requests seemed to fail, ending up with an endless loading spinner. After refreshing the page it was clear that the responses were successful, so the problem was with receiving the response, not with the request itself. We also noticed that our WebSocket was reconnecting every few seconds. Both problems were visible only on the staging environment, with production and local environments unaffected.
After some time spent scratching our heads and trying different things, we recalled that we had recently made some changes to the HAProxy configuration to handle the downtime for a migration planned for the next few days. The change was simple: we reduced the server and client timeouts to make sure users wouldn't wait too long for our custom 503 page to appear:
- timeout client 50000ms
- timeout server 50000ms
- retries 50
+ timeout client 3000ms
+ timeout server 3000ms
+ retries 1
After rolling it back, everything started to work perfectly again, but we still weren't 100% sure why. With the shorter timeout, the channel was reconnecting every 3 seconds and some requests never received their response, but why wasn't that the case with the 50-second timeout?
Cause
In order to keep the WebSocket connection alive, the Phoenix Socket library sends a heartbeat message, every 30 seconds by default. With the 50-second client/server timeouts everything worked fine, because the heartbeat always arrived before HAProxy considered the connection idle. Reducing the timeout to merely 3 seconds caused the WebSocket connection to time out long before the next heartbeat, so whenever a request took any longer than that, the frontend was unable to receive the response in time.
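To make the timing concrete (the 3-second value comes from the migration config, the 30-second one is the Phoenix default; the exact moments are only illustrative):
t = 0s    channel.push sends a request, HAProxy's idle timer resets
t = 3s    no traffic in either direction, so HAProxy closes the connection (timeout client/server 3000ms)
t > 3s    the backend reply arrives on a connection that no longer exists; the client reconnects, but the receive('ok', ...) callback never fires
t = 30s   the next heartbeat would have been sent, far too late to keep the connection alive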
Solution
The obvious one was to reduce the heartbeat interval to something below 3 seconds to keep the connection alive even with the migration config, but spamming the server that often wasn't really appealing.
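For the record, the heartbeat interval is configurable on the client when the socket is created. Here is what that rejected option would have looked like (a sketch, we never actually shipped this):
this.socket = new Socket(host(), {
  params: { token: this.props.currentUser.token },
  reconnectAfterMs: () => 1000,
  heartbeatIntervalMs: 2000, // below the 3000ms HAProxy timeout, instead of the 30000ms default
})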
After some googling, we figured out there is another HAProxy timeout setting, responsible for tunnel connections:
The tunnel timeout applies when a bidirectional connection is established
between a client and a server, and the connection remains inactive in both
directions
https://cbonte.github.io/haproxy-dconv/1.7/configuration.html#4-timeout%20tunnel
When a connection becomes a tunnel (as happens for WebSockets), this timeout setting supersedes both the client and server timeouts.
Here’s how it looks in the config:
timeout client 3000ms
timeout server 3000ms
timeout tunnel 50000ms
The proper solution was to keep the client/server timeouts low for the duration of the migration, but set the tunnel timeout separately, so that when the app came back up it would immediately work fine, giving us time to redeploy the old HAProxy settings (which are needed to properly handle regular deployments with shorter downtimes, without serving a 503 when they happen).