AAL

Wednesday, February 13, 2013

Intermittently seeing EADDRNOTAVAIL when calling connect

Follow on to my previous post.

I have been debugging Linux systems for many years at Vertica, and very often I have been helped by a lucid description of a problem that someone else has written and posted. In this post I hope to pay back some of that help that I have received over the years.

Geek Alert: the rest of the post will delve into socket geekery. If that doesn't get you excited, there are plenty of other ways to spend you time on the internet.

Problem:

In some cases on more recent (introduced at the end of 2012 and early 2013) Redhat based kernels (I don't know about other distributions) when you connect a file descriptor that was previously bind-ed to a specific port, the kernel will intermittently return EADDRNOTAVAIL.

Specifically, I have observed this behavior change going from

Linux 2.6.18-308.13.1.el5 #1 SMP Tue Aug 21 17:10:18 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux to:
Linux 2.6.18-348.1.1.el5 #1 SMP Tue Jan 22 16:19:19 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

In my (admittedly abusive) test program, about 150 out of 3000 rapidly opened and then closed sockets (with at most 10 outstanding at any time) fail with the call to connect returning -1 and setting errno to EADDRNOTAVAIL.

Solution:

Don't bind the socket first. Call connect directly.

There it is -- in a nutshell -- my contribution to global knowledge. Hopefully it makes more than zero positive karma.

Discussion:

The problem doesn't seem to maifest itself if bind is not called on the socket before calling connect. I have no idea why this particular implementation quirk exists, nor why the change in behavior got backported by RedHat into RHEL5. A comment on the apache.org trafficserver-dev mailing list suggests the problem might be due to the fact that the Linux kernel has two different code paths for assigning ephemeral ports depending on if bind has been called or not.

Background:

For those of you reading along not familiar with BSD/POSIX style sockets calls I will try and provide some background while I have ambition and this information is still paged into my head.

The canonical pattern to receive incoming connections is:

// create a socket
int fd = socket(....)

// Bind the socket to where we want it to listen
bind(fd, <protocol, interface, port>)

// Tell the network stack to queue incoming connection requests
listen(fd)

// get a file descriptor for a particular client connection:
int cfd = accept(fd);
Note that in the accepting sequence above, the call to bind associates a socket with a particular port address so that the network stack knows on which interface / port it should accept connections from clients and which file descriptor to route such connections to.

The canonical pattern to establish an outgoing client connection to a server is

// create a socket
int fd = socket(....)

// get a file descriptor for a particular client connection:
connect(fd, remote_address);
Note that in this pattern, the connection from the client to the server does not explicitly supply an address and port. Rather, the tcp stack assigns it an 'ephemeral' port for the duration of the connection.

For reasons of code symmetry, Vertica's network code happened to use the following pattern

// create a socket
int fd = socket(....)

// Bind the socket to a specific local address
bind(fd, <protocol, interface, port>)

// get a file descriptor for a particular client connection:
connect(fd, remote_address);
Of course a call to bind above is not really required, and hasn't caused any problems in the last 5 years of production deployments of the Vertica Analytic Database. Actually in this case it doesn't even really seem do anything (to my knowledge) because the tcp stack was assigning the newly created connection an ephemeral port anyways (perhaps due to some options we had set on the socket via setsockopt).

Anyhow, when I changed the networking layer to avoid calling bind on the outgoing socket in this case, the intermittent EADDRNOTAVAIL failures went away.

If anyone can explain the above behavior better, or why RedHat backported something that caused it to start failing / behaving differently after more than 5 years of happiness, I would loev to hear from you.

p.s.

Before you say "of course you are running out of ephemeral ports" (which is the most common reason to get the EADDRNOTAVAIL error, it is not true -- I have 30K available and I can reliably get the problem to occur ~150 times out of 3000 with only 10 concurrently open at a time:

[06:33:58][alamb@tldr:~]$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000

A Linux and Vertica Opera: EADDRNOTAVAIL returned fron connect

This short post is required to set up the more in depth follow up.

In Fall 2012, I worked on a problem here at Vertica that we saw at several customer sites. Specifically, the pathology was that when a lot of TCP connections get opened and closed in a short period of time (the Vertica Analytic Database happens to do this to run some queries), certain Linux kernels (unfortunately the stock RHEL6 ones included) will, occasionally, return EADDRNOTAVAIL when the program tries to connect a socket to a remote node.

This was causing some outbound connections to intermittently fail which was causing queries to fail which was causing an unhappy situation for all involved.

I am fairly sure this is a kernel bug, at least from our point of view. I can use the more polite phrase of 'bad interaction between kernel and vertica', but it doesn't really matter because at the end of the day the queries were failing, upgrading or downgrading the kernel made the problem go away, but our customers were in pain.

Amusingly the workaround I came up with was if the kernel refuses to open a connection, simply reissue the connect a few times (aka retry when an error was going to happen anyways). This approach was actually far more effective than I would have imagined -- the symptoms just go away.

I like to think of the workaround as the following (abbreviated) opera. You need to sing it in your head with a deep operatic voice:

Vertica: please open the connection
Kernel: No! (address is not available)!
Vertica: please open the connection
Kernel: No! (address is not available)!
Vertica: please open the connection
Kernel: Ok, fine

Saturday, April 2, 2011

Vertica @ NEDB Summit - Crescando, SharedDB, SwissBox

[This post is waaay late -- I wrote it back in January but I am just putting it up now]

Last Friday a crew from Vertica attended the New England Database Summit at MIT (Many thanks to Sam Madden for organizing such a good event). Despite Mingsheng walking off with my backpack (by accident) and almost giving me a heart attack it was a fun day.

I was especially impressed with the keynotes.

The first was Donald Kossmann, a professor at ETH Zurich entitled Predictable Performance for Unpredictable Workloads about a system they had co-developed with Amadeus IT Sophia-Antipolis for their airline reservation tracking system – think the computer that the boarding attendant uses at the gate to find out who is eligible for first class upgrades, who is a vegetarian, etc.

Overall I thought it was an excellent example of one-size-doesn't-fit-all: by designing a system for the specific workload, their result was both an order of magnitude less expensive and could actually meet the SLA promised by Amadeus to its customers.

The SLA promised: no query will exceed 2 seconds. Ever. Regardless of load on the system. It is totally fine if the system returns results in 2 seconds when it is running a single query, but it also must return results in 2 seconds if it is running thousands queries simultaneously. Stewardesses can smile at you for 2 seconds and you won't even notice, but waiting minutes under heavy load, was unacceptable. Think about the recent Icelandic volcano eruption which hosed air travel for months.

Their system was built around a storage manager called Crescando which is a word play on how it is structured: a sharded, distributed, in-memory table which is scanned continuously. They choose an amount of memory (~1GB) that can be scanned in ~1 second by a single modern CPU core and shard the table across cores and nodes into enough chunks to store the entire database (~1000 nodes). They put a query processor on top called SharedDB to plans queries and instruct each Crescando scan node what rows to return (what predicates). Since the table is completely scanned each second, voila you have your 2 second guarantee. Since it is an in memory database, it also supports arbitrary updates by applying them in batch during the sweep.

Professor Kossmann concluded with a short bit about the SwissBox project which would package the software into an appliance form factor and add some custom hardware (maybe the networking layer?) I have to admit that I find it amusing that even though a parade of companies have proven time and time again that custom FPGAs and ASICs can't keep pace with well written software for x86 processors people keep trying to add custom hardware anyways. Of course as a software guy I am biased, but I can't imagine being able to make hardware in small batches to compete with the economies of scale and R&D budget that Intel can throw at x86 development.

In another post, I will hopefully write up the other keynote: Renee Miller who I thought presented clearly and concisely the theoretical underpinnings to automatically determine functional dependencies from data (it is fascinating, I swear!), which I have discovered is a classic Data Mining technique. That one got me thinking of one of the directions we are headed at Vertica.

Sunday, March 27, 2011

Emacs + Vertica = vsql-mode

The other day it occurred to me that Vertica is grown up and therefore deserve its own emacs mode. Thus, I went and hacked together a bit of elisp (thank you 6.001) and voila! the vsql mode mode was born.

The emacs SQL mode lets a developer/vertica user connect and interact with a database from within emacs. Emacs ships with specializations for all the big databases (Oracle, SQLServer, Postgres, etc.) -- the vsql mode is a specialization for the Vertica client tool.

As a little back story: I am an avid (addicted?) emacs user, and as I get more and more entrenched in the editor I find it does more and more. I am often embarrassed, in fact, when someone asks me “how did you do x”, and I don't know. Typically I need to watch my hands as they recall the pattern from muscle memory to tell them. One mode I particularly like and use daily is the SQL mode and I have used the postgres mode for the last three years at Vertica (killer feature: C-c C-r) while developing the Vertica Optimizer. I can get away by symlinking vsql to psql because they behave similarly enough, but it was time for native vsql support.

To use:

Download vsql.el from github (https://github.com/alamb/Emacs-VSQL-Mode)

Add the following to your .emacs file (pointing at wherever you downloaded vsql.el):

;; vsql specific emacs SQL mode:
;; run via M-x sql-vertica
(load “vsql")

Invoke the mode with M-x sql-vertica

Tip: you can send sql text from other buffers to the *SQL* buffer via C-c C-r.

Things left to do: update the keywords list for the vertica specific SQL dialect, and maybe other stuff. Feel free to help out or tell me what would make it more useful

Andrew

AAL