I've noticed a few people running into DNS issues when attempting to install Kubernetes from scratch on VirtualBox with Vagrant.

As you may already know, VirtualBox's default NAT network assigns an IP address from the 10.0.2.0/24 CIDR to any virtual machine whose network adapter is set to NAT. Through a DHCP option, the VM's /etc/resolv.conf ends up pointing at VirtualBox's built-in nameserver, typically 10.0.2.3.

From Ubuntu 16.04:

ubuntu@worker1:~$ cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.0.2.3

For node-to-node communication, we can use a VirtualBox host-only network (for example, 192.168.56.0/24) and statically assign an IP address to the second network interface (eth1, or enp0s8 with Ubuntu's predictable interface names) on each virtual machine.
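
In a Vagrantfile this is typically just a private_network entry per machine; a minimal sketch (the box name, machine name, and IP here are purely illustrative):

# Vagrantfile excerpt (illustrative)
Vagrant.configure("2") do |config|
  config.vm.define "worker1" do |node|
    node.vm.box = "ubuntu/xenial64"
    # second adapter: VirtualBox host-only network, static IP on 192.168.56.0/24
    node.vm.network "private_network", ip: "192.168.56.56"
  end
end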

This will result in the interfaces on our virtual machines looking something like this:

ubuntu@worker1:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  inet 127.0.0.1/8 scope host lo
     valid_lft forever preferred_lft forever
  inet6 ::1/128 scope host
     valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
  link/ether 08:00:27:82:73:25 brd ff:ff:ff:ff:ff:ff
  inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
     valid_lft 86041sec preferred_lft 86041sec
  inet6 fe80::a00:27ff:fe82:7325/64 scope link
     valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
  link/ether 08:00:27:fa:05:a3 brd ff:ff:ff:ff:ff:ff
  inet 192.168.56.56/24 brd 192.168.56.255 scope global enp0s8
     valid_lft forever preferred_lft forever
  inet6 fe80::a00:27ff:fefa:5a3/64 scope link
     valid_lft forever preferred_lft forever

To keep things simple in a typical Kubernetes environment, you may want each worker node's pod network on its own /24. Each worker node's network bridge then gets assigned something like 10.200.0.1/24, 10.200.1.1/24, 10.200.2.1/24, and so on. If you are using kubenet, the out-of-the-box Kubernetes networking plugin, this can be achieved by setting --allocate-node-cidrs=true and --cluster-cidr=10.200.0.0/16 on the kube-controller-manager daemon.

Here's an example of a commonly used kube-controller-manager systemd unit from Kelsey Hightower's Kubernetes The Hard Way:

cat > kube-controller-manager.service <<EOF
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
ExecStart=/usr/bin/kube-controller-manager \\
  --address=0.0.0.0 \\
  --allocate-node-cidrs=true \\
  --cluster-cidr=10.200.0.0/16 \\
  --cluster-name=kubernetes \\
  --cluster-signing-cert-file="/var/lib/kubernetes/ca.pem" \\
  --cluster-signing-key-file="/var/lib/kubernetes/ca-key.pem" \\
  --leader-elect=true \\
  --master=http://${INTERNAL_IP}:8080 \\
  --root-ca-file=/var/lib/kubernetes/ca.pem \\
  --service-account-private-key-file=/var/lib/kubernetes/ca-key.pem \\
  --service-cluster-ip-range=10.32.0.0/16 \\
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
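
With --allocate-node-cidrs=true, the controller manager carves a /24 out of the cluster CIDR for each node. You can check what each node was assigned with something like:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'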

After creating your DNS deployment in the kube-system namespace, resolution of service names inside the cluster works fine.
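
The mypod used below is just a throwaway busybox pod for testing resolution; something like this creates it (the image and name are arbitrary):

kubectl run mypod --image=busybox --restart=Never -- sleep 3600

With that pod running, exec in and check that a cluster-internal name resolves: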

ubuntu@worker1:~$ kubectl exec -it mypod /bin/sh
/ # nslookup kubernetes
Server:    10.32.0.10
Address 1: 10.32.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes
Address 1: 10.32.0.1 kubernetes.default.svc.cluster.local

But resolution of anything outside your cluster doesn't work:

/ # nslookup kubernetes.io
Server:    10.32.0.10
Address 1: 10.32.0.10 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'kubernetes.io'

If we check out the upstream kube-dns deployment manifest, the dnsmasq container does not have the --resolv-file flag set. That flag would let us point dnsmasq at a customized resolv.conf. By default, the dnsmasq container uses the /etc/resolv.conf of the node the pod runs on to resolve all non-cluster queries; in our case, with the VirtualBox NAT network, that means the 10.0.2.3 nameserver.
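
For illustration, wiring that flag in would look roughly like this in the dnsmasq container's args; the surrounding args and the file path here are hypothetical, not the exact upstream manifest:

# Hypothetical excerpt of the dnsmasq container in the kube-dns Deployment
- name: dnsmasq
  args:
    - --cache-size=1000
    - --server=/cluster.local/127.0.0.1#10053         # cluster suffix still goes to kube-dns
    - --resolv-file=/etc/dnsmasq-resolv/resolv.conf   # custom upstream resolv.conf instead of the node's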

Rather than play around with setting a new DNS server like 8.8.8.8, let's run tcpdump on the worker node to see what happens to the packets trying to reach the VirtualBox DNS:

ubuntu@worker1:~$ sudo tcpdump -n "dst host 10.0.2.3"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on cbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
02:29:54.364528 IP 10.200.0.3.15612 > 10.0.2.3.53: 28501+ AAAA? kubernetes.io. (31)
02:29:54.364629 IP 10.200.0.3.15612 > 10.0.2.3.53: 28501+ AAAA? kubernetes.io. (31)
02:29:59.370219 IP 10.200.0.3.22452 > 10.0.2.3.53: 53929+ AAAA? kubernetes.io. (31)
02:29:59.370413 IP 10.200.0.3.22452 > 10.0.2.3.53: 53929+ AAAA? kubernetes.io. (31)

It looks like the source IP address is still the original pod IP, which the VirtualBox NAT engine has no way to route a reply back to.

Checking iptables NAT table, we can see the following in the POSTROUTING chain:

ubuntu@worker1:~$ sudo iptables -t nat -vxnL POSTROUTING

Chain POSTROUTING (policy ACCEPT 8 packets, 480 bytes)
    pkts      bytes target     prot opt in     out     source               destination
      861    57670 KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
      49     2940 MASQUERADE  all  --  *      *       0.0.0.0/0           !10.0.0.0/8           /* kubenet: SNAT for outbound traffic from cluster */ ADDRTYPE match dst-type !LOCAL

It looks like all traffic destined for anything within 10.0.0.0/8 will not be source NATed!

Where did it get this 10.0.0.0/8 network? As you can see here, the kubelet automatically defaults to this CIDR when its --non-masquerade-cidr flag is not specified. The intention is to preserve the pod's source IP for pod-to-pod communication within the cluster. In my dev environment I have manually generated host routes, but you may be using flannel or another pod networking solution. Either way, this is a problem, since our VirtualBox DNS at 10.0.2.3 falls inside the 10.0.0.0/8 RFC 1918 range.

We could certainly work around this by moving our cluster-cidr (and the kubelet's --non-masquerade-cidr) to something like 192.168.0.0/16. We could also manually modify iptables rules on all of our worker nodes. Let's try something easier.
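
For the record, that manual fix would be an explicit SNAT rule for DNS traffic leaving the pod CIDR on every worker, along these lines (a sketch only, using my environment's addresses):

sudo iptables -t nat -A POSTROUTING -s 10.200.0.0/16 -d 10.0.2.3/32 -p udp --dport 53 -j MASQUERADE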

The ip-masq-agent

Enter the ip-masq-agent!

In Kubernetes 1.7 there is new code that lets you effectively disable the kubelet's default masquerade rule by setting --non-masquerade-cidr=0.0.0.0/0 (exempting 0.0.0.0/0 means the kubelet's MASQUERADE rule matches nothing).
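
On a kubenet node that looks something like this in the kubelet's systemd unit (an illustrative excerpt; all other flags are omitted and paths depend on your setup):

# /etc/systemd/system/kubelet.service (excerpt, illustrative)
ExecStart=/usr/bin/kubelet \
  --network-plugin=kubenet \
  --non-masquerade-cidr=0.0.0.0/0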

I will create a ConfigMap that tells the agent not to SNAT any traffic destined for the cluster: in my case, the cluster-cidr and service-cluster-ip-range values, 10.200.0.0/16 and 10.32.0.0/16. This is pretty cool because the kubelet's --non-masquerade-cidr flag is limited to a single CIDR, while the ip-masq-agent accepts a list. The template for the ConfigMap is available here. Here is mine:

nonMasqueradeCIDRs:
  - 10.200.0.0/16
  - 10.32.0.0/16
masqLinkLocal: false
resyncInterval: 60s
  • masqLinkLocal determines whether to masquerade traffic to 169.254.0.0/16. False by default.
  • resyncInterval sets the interval at which the agent attempts to reload config from disk.
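
Assuming the YAML above is saved to a file named config, the ConfigMap the agent expects (named ip-masq-agent, in kube-system) can be created with:

kubectl create configmap ip-masq-agent --from-file=config --namespace=kube-system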

Deploy the ip-masq-agent DaemonSet

kubectl create -f https://raw.githubusercontent.com/kubernetes-incubator/ip-masq-agent/v2.0.1/ip-masq-agent.yaml

We should see a brand new POSTROUTING rule that points to a new chain called IP-MASQ-AGENT:

ubuntu@worker1:~$ sudo iptables -t nat -vxnL POSTROUTING
Chain POSTROUTING (policy ACCEPT 4 packets, 240 bytes)
  pkts      bytes target     prot opt in     out     source               destination
 11235   674191 KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
 10067   604182 IP-MASQ-AGENT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* ip-masq-agent: ensure nat POSTROUTING directs all non-LOCAL destination traffic to our custom IP-MASQ-AGENT chain */ ADDRTYPE match dst-type !LOCAL

Let's look at the new chain:

ubuntu@worker1:~$ sudo iptables -t nat -vxnL IP-MASQ-AGENT
Chain IP-MASQ-AGENT (1 references)
    pkts      bytes target     prot opt in     out     source               destination
       0        0 RETURN     all  --  *      *       0.0.0.0/0            169.254.0.0/16       /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
      12      720 RETURN     all  --  *      *       0.0.0.0/0            10.200.0.0/16        /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
       0        0 RETURN     all  --  *      *       0.0.0.0/0            10.32.0.0/16         /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
       0        0 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* ip-masq-agent: outbound traffic should be subject to MASQUERADE (this match must come after cluster-local CIDR matches) */ ADDRTYPE match dst-type !LOCAL

Our non-masquerade rules have been applied!

Let's check DNS resolution:

ubuntu@worker1:~$ kubectl exec -it mypod /bin/sh
/ # nslookup kubernetes.io
Server:    10.32.0.10
Address 1: 10.32.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.io
Address 1: 23.236.58.218 218.58.236.23.bc.googleusercontent.com
/ #

It works!

Let's check out tcpdump again:

ubuntu@worker1:~$ sudo tcpdump -i enp0s3 dst host 10.0.2.3
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp0s3, link-type EN10MB (Ethernet), capture size 262144 bytes
12:30:35.522668 IP 10.0.2.15.65039 > 10.0.2.3.domain: 36319+ AAAA? google.com. (28)
12:30:35.522781 IP 10.0.2.15.65039 > 10.0.2.3.domain: 36319+ AAAA? google.com. (28)
12:30:35.522928 IP 10.0.2.15.63090 > 10.0.2.3.domain: 3284+ AAAA? google.com. (28)
12:30:35.523002 IP 10.0.2.15 > 10.0.2.3: ICMP 10.0.2.15 udp port 65039 unreachable, length 64
12:30:35.523099 IP 10.0.2.15.12876 > 10.0.2.3.domain: 44566+ AAAA? google.com. (28)
12:30:35.523476 IP 10.0.2.15.57223 > 10.0.2.3.domain: 37209+ PTR? 3.2.0.10.in-addr.arpa. (39)
12:30:35.523639 IP 10.0.2.15.40473 > 10.0.2.3.domain: 10359+ AAAA? google.com. (28)

Looks like SNAT is now working for our requests to the VirtualBox DNS: the source address is the node's NAT interface address (10.0.2.15) rather than a pod IP.

If you are interested in diving deeper into customizing upstream DNS servers and/or private DNS zones, check out this recent Kubernetes blog article.
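
That mechanism is also ConfigMap-driven; for example, sending all non-cluster queries to specific upstream resolvers looks roughly like this (the nameservers shown are only an example):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]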


