I have been debugging Linux systems for many years at Vertica, and very often I have been helped by a lucid description of a problem that someone else has written and posted. In this post I hope to pay back some of that help that I have received over the years.
Geek Alert: the rest of the post will delve into socket geekery. If that doesn't get you excited, there are plenty of other ways to spend your time on the internet.
Problem:
In some cases on more recent Red Hat-based kernels (introduced at the end of 2012 and early 2013; I don't know about other distributions), when you call connect on a file descriptor that was previously bound to a specific port with bind, the kernel will intermittently return EADDRNOTAVAIL. Specifically, I have observed this behavior change going from
Linux 2.6.18-308.13.1.el5 #1 SMP Tue Aug 21 17:10:18 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux to:
Linux 2.6.18-348.1.1.el5 #1 SMP Tue Jan 22 16:19:19 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
In my (admittedly abusive) test program, about 150 out of 3000 rapidly opened and then closed sockets (with at most 10 outstanding at any time) fail with the call to connect returning -1 and setting errno to EADDRNOTAVAIL.
Solution:
Don't bind the socket first. Call connect directly.
There it is -- in a nutshell -- my contribution to global knowledge. Hopefully it makes more than zero positive karma.
Discussion:
The problem doesn't seem to manifest itself if bind is not called on the socket before calling connect. I have no idea why this particular implementation quirk exists, nor why the change in behavior got backported by Red Hat into RHEL5. A comment on the apache.org trafficserver-dev mailing list suggests the problem might be due to the fact that the Linux kernel has two different code paths for assigning ephemeral ports, depending on whether bind has been called or not.
Background:
For those of you reading along who are not familiar with BSD/POSIX-style socket calls, I will try to provide some background while I have ambition and this information is still paged into my head.
The canonical pattern to receive incoming connections is:
// create a socket
int fd = socket(....);
// Bind the socket to where we want it to listen
bind(fd, <protocol, interface, port>);
// Tell the network stack to queue incoming connection requests
listen(fd);
// get a file descriptor for a particular client connection:
int cfd = accept(fd);
Note that in the accepting sequence above, the call to bind associates a socket with a particular port address so that the network stack knows on which interface / port it should accept connections from clients and which file descriptor to route such connections to.
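For readers who want to see the accepting sequence with real arguments, here is a minimal sketch in C. The helper names make_listener and accept_one are made up for this example, and the choices of the loopback address and a backlog of 16 are assumptions, not anything from Vertica's code.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket listening on 127.0.0.1:port.
 * Passing port 0 asks the kernel to pick a free port.
 * Returns the listening descriptor, or -1 with errno set. */
int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* The <protocol, interface, port> triple from the pseudocode. */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0) {
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Hypothetical helper: block until one client connects and return the
 * descriptor for that particular connection, or -1 with errno set. */
int accept_one(int listen_fd)
{
    return accept(listen_fd, NULL, NULL);
}
```

A server would call make_listener once and then call accept_one in a loop, closing each returned descriptor when it is done with that client.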
The canonical pattern to establish an outgoing client connection to a server is:
// create a socket
int fd = socket(....);
// connect the socket to the server at remote_address
connect(fd, remote_address);
Note that in this pattern, the client does not explicitly supply a local address and port for its end of the connection. Rather, the TCP stack assigns it an 'ephemeral' port for the duration of the connection.
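To make the ephemeral port visible, here is a small sketch in C that connects without calling bind and then asks the kernel which local port it picked. The helper names connect_to and local_port_of are invented for this example, and it handles IPv4 only.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: connect to ip:port without calling bind first.
 * Returns the connected descriptor, or -1 with errno set. */
int connect_to(const char *ip, uint16_t port)
{
    struct sockaddr_in remote;
    memset(&remote, 0, sizeof remote);
    remote.sin_family = AF_INET;
    remote.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &remote.sin_addr) != 1) {
        errno = EINVAL;         /* not a valid dotted-quad address */
        return -1;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (connect(fd, (struct sockaddr *)&remote, sizeof remote) < 0) {
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Return the local port the kernel assigned to the socket, or -1 on error. */
int local_port_of(int fd)
{
    struct sockaddr_in local;
    socklen_t len = sizeof local;
    if (getsockname(fd, (struct sockaddr *)&local, &len) < 0)
        return -1;
    return ntohs(local.sin_port);
}
```

Calling local_port_of on a descriptor returned by connect_to should report a port inside the range listed in /proc/sys/net/ipv4/ip_local_port_range, unless that range has been overridden some other way.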
For reasons of code symmetry, Vertica's network code happened to use the following pattern:
// create a socket
int fd = socket(....);
// Bind the socket to a specific local address
bind(fd, <protocol, interface, port>);
// connect the socket to the server at remote_address
connect(fd, remote_address);
Of course, the call to bind above is not really required, though it hadn't caused any problems in the last 5 years of production deployments of the Vertica Analytic Database. In this case it doesn't even seem to do anything (to my knowledge), because the TCP stack was assigning the newly created connection an ephemeral port anyway (perhaps due to some options we had set on the socket via setsockopt).
Anyhow, when I changed the networking layer to avoid calling bind on the outgoing socket in this case, the intermittent EADDRNOTAVAIL failures went away.
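My test program isn't reproduced here, so what follows is only a rough sketch in C of how the two patterns could be compared. The helper names connect_out and count_eaddrnotavail are invented for this example. Binding to the wildcard address with port 0 is an assumption about what the original bind call looked like, and this loop opens one connection at a time rather than keeping up to 10 outstanding.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Close fd without losing the errno of the call that failed before it. */
static void close_keep_errno(int fd)
{
    int saved = errno;
    close(fd);
    errno = saved;
}

/* Hypothetical helper: open one outgoing TCP connection to ip:port.
 * If bind_first is nonzero, bind to the wildcard address with port 0 before
 * connecting (an assumption about the original pattern).
 * Returns the connected descriptor, or -1 with errno set. */
int connect_out(const char *ip, uint16_t port, int bind_first)
{
    struct sockaddr_in remote;
    memset(&remote, 0, sizeof remote);
    remote.sin_family = AF_INET;
    remote.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &remote.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (bind_first) {
        struct sockaddr_in local;
        memset(&local, 0, sizeof local);
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons(0);      /* let the kernel choose the port */
        if (bind(fd, (struct sockaddr *)&local, sizeof local) < 0) {
            close_keep_errno(fd);
            return -1;
        }
    }

    if (connect(fd, (struct sockaddr *)&remote, sizeof remote) < 0) {
        close_keep_errno(fd);
        return -1;
    }
    return fd;
}

/* Open and immediately close `attempts` connections, one at a time, and
 * count how many attempts fail with EADDRNOTAVAIL. */
int count_eaddrnotavail(const char *ip, uint16_t port, int attempts,
                        int bind_first)
{
    int failures = 0;
    for (int i = 0; i < attempts; i++) {
        int fd = connect_out(ip, port, bind_first);
        if (fd >= 0)
            close(fd);
        else if (errno == EADDRNOTAVAIL)
            failures++;
    }
    return failures;
}
```

Running count_eaddrnotavail against the same server once with bind_first set to 1 and once with it set to 0 gives a side-by-side count. Whether the first number is nonzero will depend on the kernel, and this sequential loop may not trigger the problem the way a concurrent test does.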
If anyone can explain the above behavior better, or why Red Hat backported something that caused it to start failing / behaving differently after more than 5 years of happiness, I would love to hear from you.
p.s.
Before you say "of course you are running out of ephemeral ports" (which is the most common reason to get the EADDRNOTAVAIL error), it is not true -- I have about 30K ports available and I can reliably get the problem to occur ~150 times out of 3000 with only 10 sockets concurrently open at a time:
[06:33:58][alamb@tldr:~]$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000
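If you would rather check the size of that range from code than by eye, here is a small sketch in C. The function names port_range_size and ephemeral_port_count are invented for this example. For the range shown above it works out to 61000 - 32768 + 1 = 28233 ports, which is the "about 30K" I mentioned.

```c
#include <stdio.h>

/* Parse the "low high" text format of ip_local_port_range and return how
 * many ports the inclusive range contains, or -1 if the text is malformed. */
long port_range_size(const char *text)
{
    long low, high;
    if (sscanf(text, "%ld %ld", &low, &high) != 2 || low > high)
        return -1;
    return high - low + 1;
}

/* Read the current range from /proc and return its size, or -1 if the file
 * cannot be read (for example on a non-Linux system). */
long ephemeral_port_count(void)
{
    char buf[64];
    FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok != NULL ? port_range_size(buf) : -1;
}
```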