Networking for Data

VPC and IP Considerations

Plan for a large amount of IP address usage in your EKS clusters.

The AWS VPC CNI maintains a "warm pool" of IP addresses on the EKS worker nodes to assign to Pods. When more IP addresses are needed for your Pods, the CNI must call EC2 APIs to assign additional addresses to your nodes. During periods of high churn or large scale-out these EC2 API calls can be rate throttled, which delays the provisioning of Pods and thus the execution of workloads. When designing the VPC for your environment, plan for more IP addresses than your Pods alone will use, to accommodate this warm pool.

With the default VPC CNI configuration, larger nodes consume more IP addresses. For example, an m5.8xlarge node running 10 pods will hold 60 IPs in total (to satisfy WARM_ENI_TARGET=1), while an m5.16xlarge node would hold 100 IPs. Configuring the VPC CNI to minimize this warm pool increases the EC2 API calls from your nodes, and with them the risk of rate throttling. Planning for this extra IP address usage up front avoids rate throttling problems without having to tightly manage the warm pool.
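To see how the warm pool is currently configured, you can inspect the environment variables on the aws-node DaemonSet (the VPC CNI); a minimal check, assuming kubectl access to the cluster:

  # Show any warm pool tuning set on the VPC CNI; no output means the defaults apply
  kubectl describe daemonset aws-node -n kube-system | grep -E 'WARM|MINIMUM'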

Consider using a secondary CIDR if your IP space is constrained.

If you are working with a network that spans multiple connected VPCs or sites, the routable address space may be limited. For example, your VPC may be limited to small subnets like the ones below. In this VPC we wouldn't be able to run more than one m5.16xlarge node without adjusting the CNI configuration.

[Diagram: initial VPC with small, constrained subnets]

You can add additional VPC CIDRs from a range that is not routable across VPCs (such as the RFC 6598 range, 100.64.0.0/10). In this case we added 100.64.0.0/16, 100.65.0.0/16, and 100.66.0.0/16 to the VPC (/16 being the largest CIDR block that can be associated), then created new subnets from those CIDRs. Finally, we recreated the node groups in the new subnets, leaving the existing EKS cluster control plane in place.

[Diagram: VPC expanded with secondary 100.64.0.0/16 CIDRs and new node subnets]

With this configuration you can still communicate with the EKS cluster control plane from connected VPCs, while your nodes and pods have plenty of IP addresses to accommodate your workloads and the warm pool.
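If you prefer to script the expansion, the CIDR association and subnet creation can also be done with the AWS CLI; a minimal sketch, using a hypothetical VPC ID and a single AZ (repeat create-subnet for each CIDR and AZ):

  # Associate a non-routable secondary CIDR with the VPC
  aws ec2 associate-vpc-cidr-block --vpc-id vpc-0123456789abcdef0 --cidr-block 100.64.0.0/16

  # Create a subnet for worker nodes inside the new range
  aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 100.64.0.0/17 --availability-zone us-east-1a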

Tuning the VPC CNI

VPC CNI and EC2 Rate Throttling

When an EKS worker node launches, it initially has a single ENI with a single IP address attached so the EC2 instance can communicate. As the VPC CNI starts, it tries to provision a warm pool of IP addresses that can be assigned to Kubernetes Pods (more details in the EKS Best Practices Guide).

The VPC CNI must make EC2 API calls (like AssignPrivateIpAddresses and DescribeNetworkInterfaces) to attach those additional IPs and ENIs to the worker node. When the EKS cluster scales out the number of nodes or Pods, there can be a spike in these EC2 API calls. That surge can encounter rate throttling from the EC2 API, which protects the performance of the service and ensures fair usage for all Amazon EC2 customers. Rate throttling can cause the pool of IP addresses to be exhausted while the CNI tries to allocate more.

These failures cause errors like the one below, indicating that provisioning of the container network namespace failed because the VPC CNI could not provide an IP address.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "xxxxxxxxxxxxxxxxxxxxxx" network for pod "test-pod": networkPlugin cni failed to set up pod "test-pod_default" network: add cmd: failed to assign an IP address to container

This failure delays the launch of the Pod and adds pressure to the kubelet and worker node, as the action is retried until an IP address is assigned. To avoid this delay you can configure the CNI to reduce the number of EC2 API calls needed.
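If you suspect throttling, CloudTrail records the EC2 API calls made from your account; a quick spot check (assuming CloudTrail is enabled in the region) could look for recent AssignPrivateIpAddresses events and their error codes:

  # List recent AssignPrivateIpAddresses calls; throttled calls appear with
  # an errorCode of "RequestLimitExceeded" in the event details
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=AssignPrivateIpAddresses \
    --max-results 20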

Avoid using WARM_IP_TARGET in large clusters, or clusters with a lot of churn

WARM_IP_TARGET can help limit "wasted" IPs in small clusters, or clusters that have very low pod churn. However, this VPC CNI environment variable needs to be configured carefully in large clusters, as it may increase the number of EC2 API calls, raising the risk and impact of rate throttling.

For clusters that have a lot of Pod churn, it is recommended to set MINIMUM_IP_TARGET to a value slightly higher than the number of pods you expect to run on each node. This allows the CNI to provision all of those IP addresses in a single call (or a few calls).

  [...]

  # EKS Addons
  cluster_addons = {
    vpc-cni = {
      configuration_values = jsonencode({
        env = {
          MINIMUM_IP_TARGET = "30"
        }
      })
    }
  }

  [...]
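If you manage the CNI outside of your Terraform configuration, the same environment variable can be set directly on the DaemonSet; a quick sketch using kubectl (the aws-node pods restart to pick up the change):

  # Set the minimum warm pool size on the VPC CNI DaemonSet
  kubectl set env daemonset aws-node -n kube-system MINIMUM_IP_TARGET=30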

Limit the number of IPs per node on large instance types with MAX_ENI and max-pods

When using larger instance types such as 16xlarge or 24xlarge, the number of IP addresses that can be assigned per ENI can be fairly large. For example, a c5.18xlarge instance with the default CNI configuration of WARM_ENI_TARGET=1 ends up holding 100 IP addresses (50 IPs per ENI * 2 ENIs) while running only a handful of pods.

For some workloads, CPU, memory, or another resource will limit the number of Pods on that c5.18xlarge long before more than 50 IPs are needed. In that case you may only want to run 30-40 pods maximum on the instance.

  [...]

  # EKS Addons
  cluster_addons = {
    vpc-cni = {
      configuration_values = jsonencode({
        env = {
          MAX_ENI = "1"
        }
      })
    }
  }

  [...]

Setting the MAX_ENI=1 option on the CNI limits the number of IP addresses each node is able to provision, but it does not limit the number of pods that Kubernetes will try to schedule onto the node. This can lead to a situation where pods are scheduled to nodes that are unable to provision more IP addresses.

To limit the IPs and stop Kubernetes from scheduling too many pods, you will need to:

  1. Update the CNI configuration environment variables to set MAX_ENI=1
  2. Update the --max-pods option for the kubelet on the worker nodes.

To configure the --max-pods option you can update the user data for your worker nodes to set it via --kubelet-extra-args in the bootstrap.sh script. By default this script configures the max-pods value for the kubelet; the --use-max-pods false option disables this behavior when providing your own value:

  eks_managed_node_groups = {
    system = {
      instance_types = ["m5.xlarge"]

      min_size     = 0
      max_size     = 5
      desired_size = 3

      bootstrap_extra_args = "--use-max-pods false --kubelet-extra-args '--max-pods=<your_value>'"
    }
  }

One problem is that the number of IPs per ENI differs by instance type (for example, an m5d.2xlarge can hold 15 IPs per ENI, while an m5d.4xlarge can hold 30 IPs per ENI). This means hard-coding a value for max-pods may cause problems if you change instance types or run mixed-instance environments.

The EKS Optimized AMI releases include a script (max-pods-calculator.sh) that can be used to calculate the AWS recommended max-pods value. To automate this calculation for mixed instance types, update the user data for your instances to call the script with the --instance-type-from-imds flag, which autodiscovers the instance type from instance metadata.

  eks_managed_node_groups = {
    system = {
      instance_types = ["m5.xlarge"]

      min_size     = 0
      max_size     = 5
      desired_size = 3

      # Calculate the recommended max-pods for this node's instance type
      pre_bootstrap_user_data = <<-EOT
        MAX_PODS=$(/etc/eks/max-pods-calculator.sh --instance-type-from-imds --cni-version 1.13.4 --cni-max-eni 1)
      EOT

      # Double quotes so $MAX_PODS is expanded when bootstrap.sh runs
      bootstrap_extra_args = "--use-max-pods false --kubelet-extra-args \"--max-pods=$MAX_PODS\""
    }
  }
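You can also run the calculator by hand on a node (or from the amazon-eks-ami repository) to preview the recommended value for a given instance type before baking it into user data:

  # Print the recommended max-pods for an m5.xlarge when the CNI is limited to one ENI
  /etc/eks/max-pods-calculator.sh --instance-type m5.xlarge --cni-version 1.13.4 --cni-max-eni 1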

Maxpods with Karpenter

By default, nodes provisioned by Karpenter have their max pods set based on the node's instance type. To configure the --max-pods option described above at the Provisioner level, specify maxPods within .spec.kubeletConfiguration. This value is used during Karpenter pod scheduling and passed through to --max-pods on kubelet startup.

Below is an example Provisioner spec:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default
  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["c", "m", "r"]
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot", "on-demand"]

  # Karpenter provides the ability to specify a few additional Kubelet args.
  # These are all optional and provide support for additional customization and use cases.
  kubeletConfiguration:
    maxPods: 30
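Once Karpenter provisions a node from this Provisioner, you can confirm the setting took effect by checking the node's allocatable pod count; for example:

  # The PODS column should reflect the configured maxPods value
  kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods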

Application

DNS Lookups and ndots

In Kubernetes, Pods with the default DNS configuration have a resolv.conf file like this:

nameserver 10.100.0.10
search namespace.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

The domain names listed in the search line are appended to DNS names that are not fully qualified domain names (FQDNs). For example, if a pod tries to connect to a Kubernetes service using servicename.namespace, the domains are appended in order until the DNS name matches the full Kubernetes service name:

servicename.namespace.namespace.svc.cluster.local   <-- Fails with NXDOMAIN
servicename.namespace.svc.cluster.local             <-- Succeeds

Whether or not a domain is treated as fully qualified is determined by the ndots option in resolv.conf. This option defines the number of dots a domain name must contain before the search domains are skipped. These additional searches can add latency to connections to external resources like S3 and RDS endpoints.

The default ndots setting in Kubernetes is five. If your application isn't talking to other pods in the cluster, you can set ndots to a low value like "2". This is a good starting point because it still allows your application to do service discovery within the same namespace and in other namespaces within the cluster, but allows a domain like s3.us-east-2.amazonaws.com to be recognized as an FQDN (skipping the search domains).

Here's an example pod manifest from the Kubernetes documentation with ndots set to "2":

apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: dns-example
spec:
  containers:
    - name: test
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"
info

While setting ndots to "2" in your pod deployment is a reasonable place to start, it will not work universally in all situations and shouldn't be applied across the entire cluster. The ndots option needs to be configured at the Pod or Deployment level; reducing it cluster-wide in the CoreDNS configuration is not recommended.
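To confirm the dnsConfig was applied, you can read the pod's resolv.conf directly; for example, with the dns-example pod above:

  # The options line should now show ndots:2
  kubectl exec dns-example -- cat /etc/resolv.conf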

Inter-AZ Network Optimization

Some workloads may need to exchange data between Pods in the cluster, like Spark executors during the shuffle stage. If the Pods are spread across multiple Availability Zones (AZs), this shuffle operation can become very expensive, especially on the network I/O front. For these workloads it is recommended to colocate executor or worker pods in the same AZ. Colocating workloads in the same AZ serves two main purposes:

  • Reduce inter-AZ traffic costs
  • Reduce network latency between executors/Pods

To have pods co-located in the same AZ, we can use podAffinity-based scheduling constraints. The preferredDuringSchedulingIgnoredDuringExecution constraint can be set in the Pod spec. For example, in Spark we can use a custom template for our driver and executor pods:

spec:
  executor:
    affinity:
      podAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: sparkoperator.k8s.io/app-name
                    operator: In
                    values:
                      - <<spark-app-name>>
              topologyKey: topology.kubernetes.io/zone
  ...

You can also leverage Kubernetes Topology Aware Routing to have Kubernetes services route traffic more efficiently once pods have been created: https://aws.amazon.com/blogs/containers/exploring-the-effect-of-topology-aware-hints-on-network-traffic-in-amazon-elastic-kubernetes-service/
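As a sketch, Topology Aware Routing is enabled per Service with an annotation, shown here on a hypothetical Service named shuffle-service (on Kubernetes 1.27+ the annotation was renamed to service.kubernetes.io/topology-mode):

  # Ask Kubernetes to prefer same-zone endpoints for this Service
  kubectl annotate service shuffle-service service.kubernetes.io/topology-aware-hints=auto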

info

Having all executors located in a single AZ means that AZ becomes a single point of failure. This is a trade-off you should weigh between lowering network cost and latency, and the risk of an AZ failure interrupting workloads. If your workload is running on instances with constrained capacity, you may consider using multiple AZs to avoid Insufficient Capacity errors.