Saturday, October 28, 2017

So Amazon Develops Its Own NIC - AWS Enhanced Networking!

Ethernet adapter types available for EC2
AWS EC2 instances offer 3 types of network adapters (depending upon the instance type) - the default VIF, which offers low (~100 Mbps) to moderate (~300 Mbps) network throughput, and two adapters that support enhanced networking (~10 Gbps or greater) - the Intel 82599 Virtual Function adapter and the next-generation Elastic Network Adapter (ENA). Enhanced networking requires specialized H/W support (the above-mentioned physical network adapters installed on the EC2 instance host), so this feature is available only on specific EC2 instance types.
Enhanced Networking
The VIF adapter present in an EC2 instance is provided by the underlying virtualization layer, i.e. the Xen hypervisor in AWS. This adapter employs the usual network virtualization technique, which involves significant overheads because it relies on the traditional interrupt-based (IRQ) approach inherent in the PCIe NIC design.
In an attempt to support higher network throughput, Amazon Web Services introduced enhanced networking, for which it relied on PCIe NICs equipped with SR-IOV technology that allows a VM (EC2 instance) to bypass the underlying hypervisor (Xen) and use direct memory access (DMA) instead of interrupting the CPU (more about SR-IOV in the last section of this article).
Initially, Amazon offered 10 Gbps network throughput for a few EC2 instance types using Intel's 82599 VF PCIe NIC. In January 2015, Amazon bought the Israel-based chip maker Annapurna Labs; based upon the newly acquired company's flagship product, Alpine, Amazon launched its own new-generation PCIe NIC that supports up to 25 Gbps of network throughput. Amazon christened this NIC the Elastic Network Adapter (ENA). ENA is available only for specific EC2 instance types.
The following table, based on a similar table from the Amazon Knowledge Center page on Enhanced Networking, summarizes the network adapter types and the EC2 instance types they are available for as of today -

How do I enable enhanced networking on EC2 instances
Refer to the Amazon documentation on enabling Enhanced Networking for EC2 instances.
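As a quick sanity check (a sketch only - the instance ID below is a placeholder and a configured AWS CLI is assumed), you can verify which driver backs an instance's interface and whether the ENA attribute is set:

# On the instance - vif means plain Xen networking, ixgbevf or ena means enhanced networking
$ ethtool -i eth0 | grep driver

# From a machine with the AWS CLI - check and, if needed, set the ENA support attribute
# (the instance must be stopped before changing the attribute)
$ aws ec2 describe-instances --instance-ids i-0123456789abcdef0 --query "Reservations[].Instances[].EnaSupport"
$ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --ena-support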

The technology behind Enhanced Networking - SR-IOV

Assigning a dedicated PCI network H/W adapter (or port) to each VM on a host can give line-rate throughput, but it is not feasible at scale; software-based sharing of an I/O device (I/O virtualization), on the other hand, imposes significant overheads and thus cannot fully exploit the capabilities of the physical device.
The Single Root - Input Output Virtualization (SR-IOV) specification, released by PCI-SIG in 2007, is one of the technologies used to achieve Network Function Virtualization (NFV). It gets its name "Single Root" from the PCIe Root Complex. SR-IOV enables the physical network adapter to be shared directly with the VMs, bypassing the hypervisor, as shown below.
SR-IOV architecture offers two function types -
  • Physical Functions (PFs) - A NIC feature for configuring and managing SR-IOV functionality on the device. It is exploited by the PF device driver that is part of the hypervisor.
  • Virtual Functions (VFs) - A PCIe function used by a VM to communicate directly with the physical NIC. The hypervisor, using the PF device driver, assigns VFs to VMs; each VM then uses a native virtual function device driver to talk to the NIC directly.
So when a data packet is received by the NIC, the classifier on the SR-IOV capable NIC - as configured by the PF device driver - places it in the appropriate queue mapped to the corresponding virtual function and its target VM.
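Outside of AWS, you can see the same PF/VF split on any SR-IOV capable NIC under Linux. A minimal sketch (assuming the adapter is enumerated as eth0 and its driver exposes the standard sysfs knob):

# Ask the PF driver to create 4 virtual functions (needs root and SR-IOV support in the NIC and BIOS)
$ echo 4 | sudo tee /sys/class/net/eth0/device/sriov_numvfs

# Each VF now appears as its own PCIe function that can be passed through to a VM
$ lspci | grep -i "virtual function"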
Question to ponder upon ...
Since SR-IOV lets a VM map directly to a PCI port, bypassing the hypervisor, how does Amazon achieve inter-EC2 network switching or implement Security Groups and ACLs?
---------------------------------------------------------------------------------------------
Resources:
Amazon's docs, blogs & videos:
  1. Read #9 "The importance of the network" @ All Things Distributed
  2. How do I enable and configure enhanced networking on my EC2 instances?
  3. Enhanced Networking - FAQs
  4. AWS re:Invent 2016: Optimizing Network Performance for Amazon EC2 Instances
  5. AWS re:Invent 2016: Optimizing Network Performance @ YouTube
  6. AWS re:Invent 2016: James Hamilton @ YouTube (between 23:00-36:00 minutes)
  7. Elastic Network Adapter – High Performance Network Interface for Amazon EC2
  8. Amazon EC2 Instance Types
Independent blogs & videos:
  1. How did they build that — EC2 Enhanced Networking?
  2. Single Root I/O Virtualization (SR-IOV) Primer @ RedHat
  3. An Introduction to SR-IOV Technology @ Intel
  4. Single Root I/O Virtualization (SR-IOV) @ VMWare
  5. Accelerating the NFV Data Plane: SR-IOV and DPDK… in my own words
  6. Red Hat Enterprise Linux OpenStack Platform 6: SR-IOV Networking – Part I: Understanding the Basics
  7. Kernel bypass
  8. Network function virtualization (NFV)
  9. PCI Express Architecture In a Nutshell
  10. Amazon buys secretive chip maker Annapurna Labs for $350 million
  11. The chip company Amazon bought for $350 million has a new product that could terrify Intel
  12. Annapurna Labs
  13. Intel VMDq Explanation by Patrick Kutch @ YouTube
  14. Intel SR-IOV Explanation by Patrick Kutch @ YouTube
  15. What is SR-IOV?

Saturday, October 7, 2017

Drooling Over Docker #2 - Understanding Union File Systems

To understand the composition of Container images better, let us take a detour and learn about Union File Systems first.
File Systems and Mounting in Unix/GNU Linux
In Unix (and GNU Linux), everything is a file - apart from regular data files, even system devices are exposed through the file system namespace. For example, a hard disk can be seen as a file named sda in the directory for device files, /dev, and can be accessed through its absolute path /dev/sda.
Even disk partitions are seen as device files within the rootfs, e.g. /dev/sda1 is the first primary partition of the disk represented by /dev/sda. Such a partition is further formatted with a file system (ext3, ext4, etc.) so that it can store files and directories within it.
Now, if we have a formatted partition /dev/sda1 that holds data in files organized within a few sub-directories, and we want to read from or write to those files, we have to attach this file system (on /dev/sda1) to some directory in the logical file system tree. This process is known as mounting a file system (onto an existing directory).
#mount -t ext4 /dev/sda1 /mnt
Union File System - what is that!?
The keyword here is UNION as in SET THEORY.
If you experiment and mount two separate file systems at the same mount point, one after the other, you'll only get to see the files from the file system that was mounted last.
Try out these commands in your Linux terminal and see for yourself -
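Here is a minimal sketch of that experiment using two small tmpfs file systems (the mount point and file names are just placeholders):

$ sudo mkdir -p /mnt/test
$ sudo mount -t tmpfs first /mnt/test && sudo touch /mnt/test/from-first
$ sudo mount -t tmpfs second /mnt/test && sudo touch /mnt/test/from-second
$ ls /mnt/test      # only from-second is visible; the first file system is hidden underneath
$ sudo umount /mnt/test
$ ls /mnt/test      # the first file system, with from-first, is visible again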
Unlike a plain mount, a union mount provides a UNION of both file systems mounted at the same mount point -
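As a sketch of the same idea with an actual union mount (shown here with OverlayFS, which ships in the mainline kernel; aufs syntax is similar but its module is not available everywhere, and the directory names are placeholders):

$ sudo mkdir -p /ro /rw /work /union
$ sudo mount -t overlay overlay -o lowerdir=/ro,upperdir=/rw,workdir=/work /union
$ ls /union         # shows the UNION of the files in /ro and /rw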
After these examples, consider that you have a read-only file system and you need to modify a certain file in it. Along the lines of the example above, a Union File System can help here. We can create another, read-write file system - either on disk or in RAM, as the case may be - and mount both file systems at another mount point using a Union File System. This mount point now gives access to all the files in both the ro and rw file systems. If you want to modify a file residing on the ro file system, the Union File System driver finds that file and performs a CoW (Copy on Write): it makes a copy of the file in the rw file system that overrides the copy on the ro file system, and this new copy is then updated with the new contents. Any new files, e.g. from a software installation, also go into the rw file system.
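Continuing the same sketch (still assuming the /ro, /rw and /union names): if a file, say file1, existed in the read-only branch before the union was mounted, writing to it through the union triggers the copy-up -

$ echo "new contents" | sudo tee /union/file1
$ ls /rw            # file1 has been copied up; the modified copy now lives in the rw branch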
Please check Link1 and Link2 for some very good examples on the points discussed so far. I have also given these links in the resources section at the end of the article.
A use case for Union File System - Knoppix
Again, what if the file system we are attempting to mount is read-only (e.g. from a CD-ROM) and we intend to change its contents by editing/removing existing files or adding new files to the file system? Is it possible?
Let us look at the example of Knoppix - a Live CD version of Linux that could boot your machine and allow you to work on your system. You could change system settings or download additional software while the Knoppix OS was running in memory, and even save these changes for subsequent runs.
The possibility of making Knoppix settings persistent made Knoppix a good portable desktop OS - all it required was a Live Knoppix CD and a USB drive. You could boot a system through Knoppix, load your saved settings from the USB drive, and get the same desktop environment on any machine.
With support from Union File Systems (Knoppix 3.9 brought in UnionFS, and later versions used aufs), Knoppix mounted multiple file systems on top of each other (in the logical space, as mounting happens in the logical space) - here, mounting a rw RAM disk on top of the ro CD file system, forming a UNION of the two.
If you write to a file that is in the ro area, the aufs driver copies it into the rw area and performs the write operation there. The next time you access the file, you get the modified version from the rw area (which hides the same-named file in the ro area). aufs does this work for you transparently - you keep working as if it were a writable system.
How Union File Systems help Docker Containers
Docker Containers bring you immutable (unchanging over time) software in the form of layers. At build time, you stack multiple such immutable layers of software to get the desired applications with their dependencies. Many a time a software layer overrides the functionality provided by the lower layers in the stack - this is only possible by implementing a Union File System. Also, at run time, you may download additional software within the container, say a newer version of some package; such a case triggers a CoW, and the updated copy is written to the rw layer of the stack. Any newly installed software also settles in this rw top layer. I'll discuss more about Docker layers in the next chapter...
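You can see this layering, and the union-type storage driver in use, on any Docker host; for instance (the image name is just an example):

$ docker history ubuntu:16.04            # one line per immutable layer in the image
$ docker info | grep "Storage Driver"    # aufs, overlay2, devicemapper, etc.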

Contd...
---------------------------------------------------------------------------------------------------------
  1. DON'T MISS THIS VIDEO FROM VMware about Containers
  2. Knoppix 3.8 and UnionFS. Wow. Just Wow. by Kyle Rankin
  3. Knoppix Hacks: Tips and Tools for Hacking, Repairing, and Enjoying Your PC - Hack#25
  4. Docker Storage: An Introduction
  5. Union file systems: Implementations, part I
  6. Digging into Docker layers
  7. Docker Container’s Filesystem Demystified
  8. Lightweight Virtualization LXC containers & AUFS
  9. Why to use AuFS instead of UnionFS
  10. Union Filesystem - FreeBSD
  11. Linux AuFS Examples: Another Union File System Tutorial
  12. Why does Docker need a Union File System
  13. Manage data in Docker
  14. Select a storage driver
  15. Use the AUFS storage driver
  16. Use the BTRFS storage driver
  17. Use the Device Mapper storage driver
  18. Use the OverlayFS storage driver
  19. Filesystems in LiveCD by Junjiro R. Okajima 
  20. AuFS2 - ReadMe
  21. AuFS4 - ReadMe
  22. AuFS - Ubuntu Man Page
  23. AUFS: How to create a read/write branch of only part of a directory tree?
  24. Unionfs: User- and Community-Oriented Development of a Unification File System
  25. UnionMount and Union-type Filesystem (Google Translated from Japanese)

Thursday, October 5, 2017

Drooling Over Docker  #1 - The Genesis of Containers

“I once heard that hypervisors are the living proof of operating system’s incompetence.” — Glauber Costa, LinuxCon Europe 2012
Effectively handling resource contention has been a primary challenge for Operating Systems.
Consider the following resource-allocation challenges:
  • One process demands more memory from the Operating System, which in turn pages out (read: punishes) memory used by some other process.
  • It is possible that one application, running a group of related processes, gets a better share of CPU cycles than another application running with a smaller number of processes.
  • A buggy process (or a fork bomb) mounts a denial-of-service attack that leaves kernel resources exhausted, thus impacting other, unrelated processes.
Hypervisors and Containers attempt to overcome these (and more) challenges by isolating process execution space and controlling system resources in their own ways, which we'll explore further in this Docker series; our focus here, however, will remain on Containers.

Genesis of Containers

A Container, as we know it today (Docker, rkt, etc.), is a group of processes running on the underlying Operating System in its own isolated process space. It can be configured with its own share of system resources, and it has its own file system that, though mounted on some directory of the host Operating System's file system, is seen by the processes running inside the container as a separate file-system tree with its own root directory. This lightweight execution unit carries its own copy of the software, shared libraries, and any other files required to run the software it is prepared to run.
Running containerized software only became possible because of the development of some underlying technologies in recent years.
Three such major technologies are -
#1 — CGroups — Controlling What You Can USE
  • In 2006, a team of engineers at Google developed this Linux Kernel feature.
  • Cgroups (Control Groups) feature was eventually introduced in Linux Kernel version 2.6.24 — released in January 2008.
  • It allows one to limit (apply quotas on), account for, and isolate the use of computing resources (CPU, memory, disk I/O, network, etc.).
  • Container management software makes use of cgroups for effective resource control over Containers (see the sketch after the diagram below).
The following diagram shows limited system resources (CPU, Memory, N/W, Storage, etc.) being distributed amongst separate Control Groups of processes.

The above picture is taken from Máirín Duffy's blog.
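As a minimal sketch of what container runtimes do under the hood (assuming a cgroup v1 hierarchy mounted at /sys/fs/cgroup, the default at the time of writing):

# Create a memory cgroup, cap it at 100 MB, and move the current shell into it
$ sudo mkdir /sys/fs/cgroup/memory/demo
$ echo 104857600 | sudo tee /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
$ echo $$ | sudo tee /sys/fs/cgroup/memory/demo/cgroup.procs
# Every process started from this shell is now accounted against, and limited by, the 100 MB quota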
#2 — Namespaces — Controlling What You Can SEE
  • Initial version released in 2002 with Linux Kernel version 2.4.19.
  • Functionality usable by Containers was added with Linux Kernel version 3.8.
  • Using Namespaces, it became possible to isolate and virtualize resources for processes, e.g. separate process IDs, hostnames, user IDs, network access, IPC, and file systems.
  • Namespaces are a fundamental aspect of containers on Linux.
  • Namespaces provide the much-needed process virtualization space for Containers (a quick demonstration follows this list).
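A quick way to feel this isolation is the unshare utility from util-linux (a sketch; requires root):

$ sudo unshare --pid --fork --mount-proc bash   # start a shell in new PID and mount namespaces
# ps aux    (run inside the new shell: only bash and ps are visible, with bash as PID 1)
# exit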
#3 — Union File Systems - Managing Software Image for Containers
  • A file system that allows a collection of different file systems and directories (called branches) to be transparently overlaid (a UNION of them) into a single logical file system.
  • Union File Systems give Containers a way to manage their software as a layered image, mounting the layers by doing a UNION of the underlying file system branches.
Note — This concept deserves a separate detailed description and I would cover it in a chapter of its own.
CGroups and Namespaces, when used together, provide an isolated environment within a Linux system where the running processes can only see their own root directory, related processes, their own user IDs, and their own network interfaces (a shared namespace), and get a controlled share of CPU, memory, network, and I/O usage (a shared cgroup). These functionalities were combined with an easy-to-manage command line interface under the name LXC (Linux Containers).
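LXC exposes this combination through a small set of commands; a sketch (the container name and the download-template arguments are placeholders, and the template needs network access):

$ sudo lxc-create -t download -n demo -- -d ubuntu -r xenial -a amd64
$ sudo lxc-start -n demo
$ sudo lxc-attach -n demo      # a shell inside the container's namespaces and cgroups
$ sudo lxc-stop -n demo && sudo lxc-destroy -n demo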

Arrival of Docker

In 2013, Docker Inc. introduced Docker — a technology to create and run Docker Containers.
Docker makes use of CGroups, Namespaces, & Union File Systems to package an application and its dependencies to make a Docker image which can be run as a Container on a Linux Operating System — a much more portable and efficient method of running an application than running it on a Virtual Machine.
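As a small, self-contained illustration (the base image, file names, and tag are just examples):

# A tiny image: the Alpine base layer plus one extra layer carrying our script
$ echo 'echo hello from a container' > hello.sh
$ printf 'FROM alpine:3.6\nCOPY hello.sh /hello.sh\nCMD ["/bin/sh", "/hello.sh"]\n' > Dockerfile
$ docker build -t hello-demo .
$ docker run --rm hello-demo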
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
  1. Resource Isolation: The Failure of Operating Systems & How We Can Fix It — Glauber Costa, Parallels
  2. Evolution of Linux Containers and Future — Imesh Gunaratne
  3. The failure of operating systems and how we can fix it — Michael Kerrisk
  4. How Linux Kernel Cgroups And Namespaces Made Modern Containers Possible — Duncan Macrae
  5. Cgroups @ Wikipedia
  6. Introduction to Control Groups (CGROUPS) @ RedHat
  7. How I Used CGroups to Manage System Resources In Oracle Linux 6
  8. Linux Kernel Documentation on CGroups
  9. Namespaces in operation
  10. Resource management: Linux kernel Namespaces and cgroups by Rami Rosen
  11. Union File System
  12. Docker — Wikipedia
  13. What is Docker?

Drooling Over Docker #4 — Installing Docker CE on Linux

Choosing the right product
Docker engine comes in 2 avatars — Docker Community Edition (CE) and Docker Enterprise Edition (EE). While the...