Wednesday, February 13, 2013

A Linux and Vertica Opera: EADDRNOTAVAIL returned from connect

This short post sets up a more in-depth follow-up.

In Fall 2012, I worked on a problem here at Vertica that we saw at several customer sites. Specifically, the pathology was that when a lot of TCP connections get opened and closed in a short period of time (the Vertica Analytic Database happens to do this to run some queries), certain Linux kernels (unfortunately the stock RHEL6 ones included) will, occasionally, return EADDRNOTAVAIL when the program tries to connect a socket to a remote node.

This caused some outbound connections to fail intermittently, which caused queries to fail, which made for an unhappy situation for all involved.

I am fairly sure this is a kernel bug, at least from our point of view. I could use the more polite phrase 'bad interaction between kernel and Vertica', but it doesn't really matter: at the end of the day the queries were failing and our customers were in pain, and upgrading or downgrading the kernel made the problem go away.

Amusingly, the workaround I came up with was this: if the kernel refuses to open a connection, simply reissue the connect a few times (that is, retry in a case where an error was going to happen anyway). This approach was far more effective than I would have imagined -- the symptoms just go away.
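To make the retry idea concrete, here is a minimal sketch in Python. It is not the actual Vertica code. The function name connect_with_retry, the attempts and delay parameters, and the make_socket hook are all hypothetical names chosen for illustration. It also assumes a fresh socket is created for each attempt, which is a conservative choice of mine rather than something the real fix is known to do.

```python
import errno
import socket
import time


def connect_with_retry(address, attempts=3, delay=0.0, make_socket=None):
    """Connect to address, retrying only when connect() fails with EADDRNOTAVAIL.

    attempts, delay and make_socket are illustrative knobs, not real
    Vertica settings. Any other error is raised immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if make_socket is None:
        # Default: a plain IPv4 TCP socket.
        make_socket = lambda: socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    for attempt in range(1, attempts + 1):
        # Assumption: use a new socket per attempt, since the state of a
        # socket after a failed connect() is not something to rely on.
        sock = make_socket()
        try:
            sock.connect(address)
            return sock
        except OSError as exc:
            sock.close()
            # Give up on any other error, or once the attempts are used up.
            if exc.errno != errno.EADDRNOTAVAIL or attempt == attempts:
                raise
            time.sleep(delay)
```

A caller might use it as connect_with_retry(("10.0.0.1", 5433), attempts=5), where the address is a placeholder. Only EADDRNOTAVAIL is retried, so genuine failures such as a refused connection still surface on the first try.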

I like to think of the workaround as the following (abbreviated) opera. You need to sing it in your head with a deep operatic voice:

 
Vertica: please open the connection
Kernel: No! (address is not available)!
Vertica: please open the connection
Kernel: No! (address is not available)!
Vertica: please open the connection
Kernel: Ok, fine

1 comment:

Wade Mealing said...

Do you have a reproducer handy? Including details of which version and release?