Ephemeral Port Exhaustion in Kubernetes
I recently ran into an interesting situation with a Kubernetes cluster running on Google Cloud Platform (GCP), sitting behind a Network Address Translation (NAT) gateway and accessing a resource across the public internet.
A unique connection between two hosts is identified by a 5-tuple of the following fields, which are stored in the NAT's translation table:
- Protocol (TCP or UDP)
- Source IP (NAT)
- Source Port (NAT)
- Destination IP
- Destination Port
When data is sent through a NAT from a private network, the NAT rewrites the packet’s source IP address and port to its own public IP and a random port, and stores a mapping in its translation table. When the remote server responds, the packet is returned to the NAT’s public IP address and port, and the translation table lets the NAT look up the original connection and rewrite the packet again, restoring the original private IP and port.
So, for instance, the green network traffic on the diagram above might create an entry in the translation table as follows:
| Source IP (Node) | Source Port (Pod→Node) | NAT IP (Public) | NAT Port | Destination IP | Destination Port | Protocol | State | Timeout |
|---|---|---|---|---|---|---|---|---|
| 10.2.0.3 | 43215 | 34.120.45.67 | 32456 | 142.250.80.78 | 443 | TCP | ESTABLISHED | 300s |
| 10.2.0.3 | 52134 | 34.120.45.67 | 32457 | 52.84.251.90 | 80 | TCP | ESTABLISHED | 120s |
| 10.2.0.4 | 60123 | 34.120.45.68 | 32456 | 142.250.80.78 | 443 | TCP | ESTABLISHED | 300s |
| 10.2.0.4 | 43215 | 34.120.45.68 | 32457 | 35.190.27.5 | 8080 | TCP | TIME_WAIT | 60s |
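To make the lookup concrete, here is a minimal, purely illustrative sketch in Node.js of how a NAT might key its translation table on that 5-tuple. This is not GCP's implementation; the function name, port values and IPs are made up for the example:

```js
// Minimal illustrative model of a NAT translation table (not GCP's implementation).
// Each outbound connection claims one (NAT IP, NAT port) pair; the stored mapping
// lets return traffic be rewritten back to the original private source.

const translationTable = new Map(); // key: 5-tuple string -> { natIp, natPort }

function outbound(srcIp, srcPort, dstIp, dstPort, proto, natIp, freePorts) {
  const key = [proto, srcIp, srcPort, dstIp, dstPort].join('|');
  if (!translationTable.has(key)) {
    const natPort = freePorts.pop(); // undefined once the allocated pool is exhausted
    if (natPort === undefined) throw new Error('NAT port allocation DROPPED');
    translationTable.set(key, { natIp, natPort });
  }
  return translationTable.get(key); // packet leaves with this source IP:port
}

// Example: a pod on node 10.2.0.3 opens a TCP connection to 142.250.80.78:443
const mapping = outbound('10.2.0.3', 43215, '142.250.80.78', 443, 'TCP',
                         '34.120.45.67', [32456, 32457, 32458]);
console.log(mapping); // { natIp: '34.120.45.67', natPort: 32458 }
```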
We can’t easily change the source IP or source port, since these are dictated by the OS and the K8s scheduler¹. Similarly, the protocol, destination IP and destination port are dictated by the application making the request. That only leaves changing the NAT IP and port to make a unique entry in the NAT table. At OSI layer 4 (TCP/UDP) there are plenty of ports to choose from: 65,536 of them, to be precise (2¹⁶). The first 1,024 are “well-known” ports reserved for things like DNS (53), HTTP (80) and sending the Quote of the Day (17) (no joke…). This leaves 64,512 “ephemeral” ports to play with, so a single NAT IP address can support that many concurrent connections to a single destination.
That’s shed-loads, right? Not quite: the GCP Cloud NAT implementation operates at the node level (not per pod IP) and pre-allocates a number of ports to each Kubernetes node that’s connected to it. By default this is 64 ports per node/VM.
That means all containers running on a K8s node can make a combined total of 64 concurrent connections to the same internet destination.
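As a back-of-the-envelope illustration of how quickly that default budget disappears, here is a small sketch; the pod count and pool size are assumptions for the example, not measurements:

```js
// Rough capacity arithmetic for the default Cloud NAT settings (illustrative only).
const TOTAL_PORTS = 2 ** 16;                         // 65,536
const WELL_KNOWN = 1024;                             // ports 0-1023
const USABLE_PER_NAT_IP = TOTAL_PORTS - WELL_KNOWN;  // 64,512

const DEFAULT_PORTS_PER_NODE = 64;                   // Cloud NAT default allocation per VM

// Assumed workload: 10 pods on the node, each holding a pool of 5 connections
// to the same external database.
const pods = 10;
const poolSize = 5;
const needed = pods * poolSize;                      // 50 concurrent connections

console.log(`usable ports per NAT IP: ${USABLE_PER_NAT_IP}`);
console.log(`node budget used: ${needed}/${DEFAULT_PORTS_PER_NODE}`);
// -> 50/64: only 14 ports of headroom before new connections start getting dropped.
```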
As an example, let’s take a look at a pod running a Node.js container connecting to a MongoDB instance over the internet (caveat here that it is very much not recommended to leave a database open on the public internet in a production scenario).
If we connect to the pod and look at what open connections there are:
netstat -ptn | grep -i established

tcp 0 0 192.168.150.10:54388 x.x.x.x:27017 ESTABLISHED 289998/node
tcp 0 0 192.168.150.10:32786 x.x.x.x:27017 ESTABLISHED 289998/node
tcp 0 0 192.168.150.10:48844 x.x.x.x:27017 ESTABLISHED 289998/node
tcp 0 0 192.168.150.10:58048 x.x.x.x:27017 ESTABLISHED 289998/node
tcp 0 0 192.168.150.10:49696 x.x.x.x:27017 ESTABLISHED 289998/node

We can see a total of 5 connections to the same destination. Our Node.js Mongo library, Mongoose, maintains a connection pool for us and defaults to a maximum of 5 connections.
What Can We Do?
We’ve got a few levers to pull on the networking side to give ourselves a decent buffer of ports:
- set a higher allocation of ports per Kubernetes node
- add more static IPs to the Cloud NAT (this also improves redundancy)
For instance:
- use 3 static public IP addresses on the NAT
- set the number of ports per VM to 256
This translates to:
- 256 ports per node = up to 256 concurrent connections from all containers on a single K8s node to the same remote IP and port (4× the default), and
- (64,512 usable ports per IP × 3 static IP addresses) / 256 ports per node = 756 nodes that the NAT can serve before its port pool is exhausted.
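The same arithmetic as a quick sketch (the figures are just the example configuration above, not a recommendation):

```js
// Capacity arithmetic for the example Cloud NAT configuration above (illustrative).
const USABLE_PORTS_PER_IP = 64512;  // 65,536 minus the 1,024 well-known ports
const NAT_IPS = 3;                  // static public IPs attached to the Cloud NAT
const PORTS_PER_NODE = 256;         // min ports allocated to each K8s node/VM

const totalPortPool = USABLE_PORTS_PER_IP * NAT_IPS;         // 193,536
const maxNodes = Math.floor(totalPortPool / PORTS_PER_NODE); // 756

console.log(`per-node connections to one destination: ${PORTS_PER_NODE}`);
console.log(`nodes the NAT can serve: ${maxNodes}`);
```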
In an ideal world, we’d be more proactive: monitor the Cloud NAT resource itself for port exhaustion and alert in advance. At the time of writing, GCP doesn’t support this, but there’s an open issue here: #127496796 Cloud NAT - Stackdriver monitoring integration.
In the meantime, the Cloud NAT does emit a log entry with allocation_status: "DROPPED" in this scenario, and it’s possible to alert on that.
From an application perspective, we could also do things like:
- adjust connection pooling settings to ensure a manageable number of connections are kept open. For instance, with the Mongoose library:

  mongoose.createConnection(uri, { poolSize: 3 });

- ensure that, when connection errors happen (either due to application problems or port exhaustion), the application’s retry logic uses non-aggressive retries with exponential backoff (see the sketch below); otherwise the reconnection attempts themselves may quickly saturate the NAT table.
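As an illustration of that second point, here is a minimal sketch of wrapping a connection attempt in exponential backoff. The attempt cap and delay values are arbitrary examples, and the Mongoose usage in the comment assumes the 5.x-style options used above:

```js
// Minimal sketch: wrap an async connect call with exponential backoff so a burst
// of failed reconnects doesn't itself saturate the NAT's port allocation.
// The attempt cap and delay values are arbitrary examples, not recommendations.

async function withBackoff(connectFn, maxAttempts = 6) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connectFn();
    } catch (err) {
      // 1s, 2s, 4s, ... capped at 30s, plus a little jitter.
      const delayMs = Math.min(30000, 1000 * 2 ** attempt) + Math.random() * 250;
      console.warn(`connect attempt ${attempt + 1} failed (${err.message}); retrying in ${Math.round(delayMs)}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error('giving up after repeated connection failures');
}

// Usage with the Mongoose example above (assuming Mongoose 5.x, where the
// connection returned by createConnection() can be awaited):
// const conn = await withBackoff(() => mongoose.createConnection(uri, { poolSize: 3 }));
```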
¹ Technically, if fewer pods are run on a node (i.e. smaller but more nodes in the cluster), then this would also help avoid port exhaustion but, in practice, you generally want to right-size your cluster based on application performance and scaling needs rather than as a work-around for NAT networking limitations.