Kubernetes networking with OpenContrail

OpenContrail can be used to provide network micro-segmentation to kubernetes, providing both network isolation and the ability to attach a pod to a network that may have endpoints using different technologies (e.g. bare-metal servers on VLANs or OpenStack VMs).

Watch Lachlan Evenson from Lithium talk about using OpenContrail with Kubernetes at the Kubernetes 1.0 Launch.

https://www.youtube.com/watch?v=pZjNFcyC6Uo

This post describes how the current prototype works and how packets flow between pods. For illustration purposes we will focus on 2 tiers of the k8petstore example on kubernetes: the web frontend and the redis-master tier that the frontend uses as a data store.

The OpenContrail integration works without modifications to the kubernetes code base (as of v1.0.0 RC2). An additional daemon, named kube-network-manager, is started on the master. The kubelets are executed with the option "--network_plugin=opencontrail", which instructs the kubelet to execute the command /usr/libexec/kubernetes/kubelet-plugins/net/exec/opencontrail/opencontrail. The source code for both the network-manager and the kubelet plugin is publicly available.

When OpenContrail is used as the network implementation, the kube-proxy process is disabled and all pod connectivity is provided by the OpenContrail vrouter kernel module, which implements an overlay network using MPLS over UDP as the encapsulation. OpenContrail uses a standards-based control plane to distribute the mapping between endpoint (i.e. pod) and location (k8s node). Because the implementation is standards compliant, it can interoperate with existing network devices from multiple vendors.
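
To make the encapsulation concrete, here is a small illustrative model (not taken from the OpenContrail source) of the layering of an MPLS-over-UDP packet between two nodes; all addresses, ports and label values are invented placeholders:

from dataclasses import dataclass

# Illustrative model of the MPLS-over-UDP overlay packet described above.
# All values below are placeholders, not real defaults.
@dataclass
class MplsOverUdpPacket:
    outer_src_ip: str   # sending kubernetes node
    outer_dst_ip: str   # receiving kubernetes node
    udp_dst_port: int   # MPLS-in-UDP port used by the deployment
    mpls_label: int     # identifies the destination pod interface on that node
    inner_src_ip: str   # private IP of the sending pod
    inner_dst_ip: str   # private IP (or ClusterIP) of the destination pod
    payload: bytes = b""

pkt = MplsOverUdpPacket(
    outer_src_ip="192.168.0.11", outer_dst_ip="192.168.0.12",
    udp_dst_port=6635,  # IANA MPLS-in-UDP port; a deployment may use a different value
    mpls_label=42,
    inner_src_ip="10.0.1.5", inner_dst_ip="10.0.2.7",
)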

The kube-network-manager process uses the kubernetes controller framework to listen for changes in objects that are defined in the API and adds annotations to some of these objects. It then creates a network solution for the application, using the OpenContrail API to define objects such as virtual-networks, network interfaces and access control policies.

The kubernetes deployment configuration for this example application consists of a replication controller (RC) and a service object for the web frontend, and a pod and a service object for the redis-master.

The web frontend RC contains the following metadata:

"labels": {
  "name": "frontend",
  "uses": "redis-master"
}

This metadata is copied to each pod replica created by the kube-controller-manager. When the network-manager sees these pods it will:

  • Create a virtual-network with the name <namespace:frontend>
  • Connect this network with the network for the service <namespace:redis-master>
  • Create an interface per pod replica with a unique private IP address from a cluster-wide address block (e.g. 10.0/16).
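
A rough sketch of this logic, using hypothetical helper names (the real kube-network-manager code and OpenContrail API calls differ), could look like:

# Hypothetical sketch of the network-manager reaction to a new pod replica.
# The contrail_api helpers below are invented names standing in for calls
# to the OpenContrail API.
def on_pod_added(pod, contrail_api):
    namespace = pod["metadata"]["namespace"]
    labels = pod["metadata"]["labels"]

    # Create (or find) the virtual-network <namespace:labels.name>.
    network = contrail_api.ensure_virtual_network(namespace + ":" + labels["name"])

    # The "uses" label authorizes this network to reach another one.
    if "uses" in labels:
        target = contrail_api.ensure_virtual_network(namespace + ":" + labels["uses"])
        contrail_api.ensure_policy(allow_from=network, allow_to=target)

    # Allocate an interface and a private IP from the cluster-wide block.
    iface = contrail_api.create_interface(network, pod_uid=pod["metadata"]["uid"])
    ip_address = contrail_api.allocate_instance_ip(iface, subnet="10.0.0.0/16")
    return iface, ip_address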

The kube-network-manager also annotates the pods with the uuid of the interface created by OpenContrail, as well as the allocated private IP address and MAC address. These annotations are then read by the kubelet.
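
The annotations might look roughly as follows; the key names below are illustrative, not the exact keys written by kube-network-manager:

# Illustrative pod annotations only; actual key names may differ.
pod_annotations = {
    "opencontrail.org/nic-uuid": "1f2e3d4c-aaaa-bbbb-cccc-1234567890ab",
    "opencontrail.org/ip-address": "10.0.1.5",
    "opencontrail.org/mac-address": "02:1f:2e:3d:4c:01",
}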

When the pods are started, the respective kubelet invokes the plugin script. This script removes the veth-pair associated with the pod from the docker0 bridge and assigns it to the OpenContrail vrouter kernel module, which runs on each node. The same script notifies the contrail-vrouter-agent of the interface uuid associated with the veth interface and configures the IP address inside the pod's network namespace.
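
A minimal sketch of the "setup" step of such a plugin, assuming standard iproute2 commands and leaving the contrail-vrouter-agent registration as a placeholder (the real script and the agent API differ), might look like:

import subprocess

def run(*cmd):
    # Helper to execute an external command and fail loudly on error.
    subprocess.check_call(list(cmd))

def register_port_with_vrouter_agent(ifname, iface_uuid):
    # Placeholder for the contrail-vrouter-agent port-add call; the actual
    # mechanism is not shown here.
    raise NotImplementedError

def setup(netns_name, veth_host_side, iface_uuid, ip_cidr, gateway):
    # Detach the host side of the pod's veth pair from the docker0 bridge...
    run("ip", "link", "set", "dev", veth_host_side, "nomaster")

    # ...and register it with the vrouter via the agent (placeholder above).
    register_port_with_vrouter_agent(veth_host_side, iface_uuid)

    # Configure the OpenContrail-allocated address and default route inside
    # the pod's network namespace (resolving the namespace from the docker
    # container id is elided here).
    run("ip", "netns", "exec", netns_name, "ip", "addr", "add", ip_cidr, "dev", "eth0")
    run("ip", "netns", "exec", netns_name, "ip", "route", "add", "default", "via", gateway)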

At this stage each pod has a unique IP address in the cluster but can only communicate with other pods within the same virtual-network. Subnet broadcast and IP link-local multicast packets will be forwarded to the group of pods that are present in the same virtual-network (defined by the "labels.name" tag).

OpenContrail assigns a private forwarding table to each pod interface. The veth-pair associated with the network namespace used by docker is mapped into a table which has routing entries for each of the other pod instances defined within the same network, or within networks this pod has authorized access to. The routing tables are computed centrally by the OpenContrail control-node(s) and distributed to each of the compute nodes where the vrouter is running.
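
Conceptually, the private forwarding table (VRF) attached to a frontend pod's interface can be pictured as a map from prefixes to overlay next-hops; the entries below are invented for illustration:

# Invented example of the routing state in a frontend pod's VRF.
frontend_vrf = {
    "10.0.1.6/32": {"node": "192.168.0.12", "mpls_label": 17},  # another frontend replica
    "10.0.2.7/32": {"node": "192.168.0.13", "mpls_label": 23},  # pod in a network this pod may access
    # ClusterIPs of authorized services also show up here (see below).
}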

The deployment defines a service associated with web frontend pods:

"kind": "Service",
"metadata": {
  "name": "frontend",
  "labels": {
    "name": "frontend"
  }
},
"spec": {
  "ports": [{
    "port": 3000
  }],
  "deprecatedPublicIPs":["10.1.4.89"],
  "selector": {
    "name": "frontend"
  }
}

The "selector" tag specifies the pods that belong to the service. The service is then assigned a "ClusterIP" address by the kube-controller-manager. The ClusterIP is a unique IP address that can be used by other pods to consume the service. This particular service also allocates a PublicIP address that is accessible outside the cluster.

When the service is defined, the kube-network-manager creates a virtual-network for the service (with the name <namespace:service-frontend>) and allocates a floating-ip address using the ClusterIP assigned by kubernetes. The floating-ip address is then associated with each of the replicas.
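
Sketched with the same hypothetical helpers as before (not the actual kube-network-manager code), the service handling could look like:

# Hypothetical sketch of service handling; helper names are invented.
def on_service_added(service, contrail_api):
    namespace = service["metadata"]["namespace"]
    name = service["metadata"]["name"]

    # Virtual-network that represents the service.
    svc_net = contrail_api.ensure_virtual_network(namespace + ":service-" + name)

    # Floating-ip carrying the kubernetes-assigned ClusterIP.
    fip = contrail_api.create_floating_ip(svc_net, address=service["spec"]["clusterIP"])

    # Associate the floating-ip with the interface of every pod selected by
    # the service, so that each replica answers on the ClusterIP.
    for iface in contrail_api.interfaces_matching(namespace, service["spec"]["selector"]):
        contrail_api.associate_floating_ip(fip, iface)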

In the k8petstore example, there is a load-generator tier defined by an RC with the following metadata:

"labels": {
  "name": "bps",
  "uses": "frontend"
}

The network-manager process interprets the “uses” tag as an implicit authorization for the “bps” network to access the “service-frontend” network which contains the ClusterIP. That is the mechanism that causes the ClusterIP address to be visible in the private routing tables that are associated with the load-generator pods.

When traffic is sent to this ClusterIP address, the sender has multiple feasible paths available (one per replica). It chooses one of these based on a hash of the 5-tuple of the packet (source IP, destination IP, protocol, source port, destination port). Traffic is sent encapsulated to the destination node such that the destination IP address of the inner packet is the ClusterIP. The vrouter kernel module on the destination node then performs a destination NAT operation on the ClusterIP, translating this address to the private IP of the specific pod.
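
The path selection can be sketched as follows; the actual hash function used by the vrouter is an implementation detail, so this is only illustrative:

import hashlib

def select_path(paths, src_ip, dst_ip, proto, src_port, dst_port):
    # Pick one of the equal-cost paths based on a hash of the 5-tuple, so a
    # given flow always maps to the same replica.
    key = "|".join(map(str, (src_ip, dst_ip, proto, src_port, dst_port))).encode()
    index = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(paths)
    return paths[index]

# Each path identifies the node and MPLS label of one frontend replica
# (values invented for the example).
paths = [
    {"node": "192.168.0.12", "mpls_label": 17},
    {"node": "192.168.0.13", "mpls_label": 23},
]
print(select_path(paths, "10.0.3.4", "10.254.0.9", 6, 40312, 3000))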

A packet sent by a load-generator pod to the ClusterIP of the web frontend goes through the following steps:

  1. Packet is sent by the IP stack in the container with SourceIP="load-gen private IP", DestinationIP=ClusterIP. This packet is sent to eth0 inside the container network namespace, which is one end of a Linux veth-pair.
  2. The packet is delivered to the vrouter kernel module; a route lookup is performed for the destination IP address (ClusterIP) in the private forwarding table “bps”.
  3. This route lookup returns an equal cost load balancing next-hop (i.e. a list of available paths). The ECMP algorithm selects one of the available paths and encapsulates the traffic such that an additional IP header is added to the packet with SourceIP="sender node address", DestinationIP="destination node address"; additionally an MPLS label is added to the packet corresponding to the destination pod.
  4. Packet travels in the underlay to the destination node.
  5. The destination node strips the outer headers, performs a lookup on the MPLS label and determines that the destination IP address is a "floating-ip" address that requires NAT translation.
  6. The destination node creates a flow-pair with the NAT mapping of the ClusterIP to the private IP of the destination pod and modifies the destination IP of the payload.
  7. Packet is delivered to the pod such that the source IP is the unique private IP of the source pod and the destination IP is the private IP of the local pod.
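
The NAT performed in steps 5 to 7 can be pictured as a flow-pair installed by the destination vrouter; the representation below is purely illustrative:

# Purely illustrative flow-pair for the destination NAT of the ClusterIP.
cluster_ip, pod_private_ip = "10.254.0.9", "10.0.1.5"

forward_flow = {
    "match": {"dst_ip": cluster_ip, "proto": 6, "dst_port": 3000},
    "action": {"dnat_to": pod_private_ip},   # rewrite ClusterIP -> pod private IP
}
reverse_flow = {
    "match": {"src_ip": pod_private_ip, "proto": 6, "src_port": 3000},
    "action": {"snat_to": cluster_ip},       # rewrite replies back to the ClusterIP
}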

The service definition for the web frontend also specifies a PublicIP. This address is implemented as a floating-ip address, just like the ClusterIP, except that the floating-ip is associated with a network that spans the cluster and the outside world. Typically, OpenContrail deployments configure one or more "external" networks that map to a virtual network on external network devices such as a data-center router.

Traffic from the external network is also equal cost load balanced to the pod replicas of the web frontend. The mechanism is the same as described above except that the ingress device is a router rather than a kubernetes node.

To finalize the walk-through of the k8petstore example, the redis-master service defines:

"kind": "Service",
 
"metadata": {
  "name": "redismaster",
  "labels": {
    "name": "redis-master"
  }
},
"spec": {
  "ports": [{
    "port": 6379
  }],
  "selector": {
    "name": "redis-master"
  }
}

Since the web frontend pods carry the label "uses": "redis-master", the network-manager creates a policy that connects the clients (frontend pods) to the service ClusterIP. This policy can also limit traffic so that only the ports specified in the service definition are accessible.
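
A hypothetical, simplified representation of such a policy (the real OpenContrail policy objects have a richer schema) could be:

# Simplified, invented representation of the generated access policy.
frontend_to_redis_policy = {
    "source_network": "default:frontend",
    "destination_network": "default:service-redismaster",
    "rules": [
        {"protocol": "tcp", "destination_port": 6379, "action": "pass"},
    ],
}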

There remains additional work to be done in this integration, but I believe that the existing prototype shows how OpenContrail can be used to provide an elegant solution for micro-segmentation, one that can both provide connectivity outside the cluster and pass a security audit.

From an OpenContrail perspective, the delta between a kubernetes and an OpenStack deployment is that in OpenStack the Neutron plugin provides the mapping between Neutron and OpenContrail API objects, while in kubernetes the network-manager translates the pod and service definitions into the same objects. The core functionality of the networking solution remains unchanged.