Blog:Containers for HPC

Containers for HPC?

Docker containers are now ubiquitous and widely adopted in the IT industry. Why are they not being used in the High Performance Computing (HPC) community? Let's take a look at the pros and cons of using Docker containers for HPC workloads.

What are Containers?

Containers are a way to shrink-wrap an application in its own isolated package. This package is called a "container image". Starting an application container (i.e., running it) is the same thing as starting the application with its shrink-wrap. You can start multiple simultaneous copies of this shrink-wrapped application on the same machine or on different machines. Running copies are called "containers". Each container is said to be an instance of its container image. We can have containers of the same application or containers of different applications running on the same or different machines.

Containers are popular because they make life really easy for IT developers and IT operations engineers ("DevOps" for short). The benefits are numerous, but they can all be traced to two key properties of containers:

  1. Process isolation: When an application is running in its shrink-wrap (i.e., it is running inside a container), it is not affected by other applications or processes that may exist or be running outside the container on the same machine.
  2. Dependency bundling: Everything that the application depends on to run successfully as a process is already inside the container image. You can copy the container image to a different machine and start a new container there. Wherever the container image moves, the application's direct dependencies will always be met, because the image bundles everything the application needs to run (library dependencies, runtimes, and so on).

For example, your application may need Python 3.0 while the host machine has Python 2.7 installed. When you run your containerized application on the host machine, neither the application nor the host suffers any consequences; they stay isolated, as if completely ignorant of each other. Similarly, you can build and run two different application containers on the same machine, one built using Python 2.7 and the other using Python 3.0.
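
The Python example above can be sketched with the Docker command line. The snippet below only builds the `docker run` commands and does not execute them. The image tags `python:2.7` and `python:3` are assumptions (tags of the official Python image on Docker Hub), and `version_check_command` is a hypothetical helper introduced for illustration.

```python
import shlex


def version_check_command(image_tag):
    """Build a `docker run` command that prints the Python version inside a container.

    `image_tag` is an assumed image name such as "python:3"; nothing is executed here.
    """
    # --rm removes the container once the process inside it exits.
    return ["docker", "run", "--rm", image_tag, "python", "--version"]


if __name__ == "__main__":
    # Two containers on the same host, each with its own interpreter,
    # regardless of which Python (if any) the host itself has installed.
    for tag in ("python:2.7", "python:3"):
        print(shlex.join(version_check_command(tag)))
```

Running the two printed commands on a machine with Docker installed would start two short-lived containers, each reporting its own interpreter version.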

How are Containers Different from Virtual Machines?

Although virtual machines (VMs) can also be used to isolate applications, containers should not be confused with VMs. They are completely different beasts. The computational cost of containers is much lower than that of VMs. Container images are also much more lightweight (in terms of storage size) than VM images.

Some of the major reasons container technology is preferred over virtualization technology are:

  1. The basic Linux kernel architecture inherently supports the concept of containerization. Containerized applications can run on any distribution of Linux (e.g., Ubuntu, Red Hat, CentOS, Fedora). Moreover, as long as the host machine does not have a really ancient version of the Linux kernel, chances are that it can run any containerized application. With LCOW, you can also run Linux containers on Microsoft Windows machines (note: the reverse is not possible as of now).
  2. Running an application with this shrink-wrap (i.e., inside a container) adds very little computational cost compared to running it without the shrink-wrap.
  3. Container images are much more lightweight (in terms of storage size) than VM images. VM images contain an entire OS, whereas container images contain only your application and its dependencies. They are also "layered". Many container images may share several layers, and if some layers already exist on your machine then you don't have to download them again. Container images are also easier to manage and auto-update, and they can be version controlled.
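
The layer sharing mentioned in item 3 can be illustrated with a small sketch. It is a simplified model, not Docker's actual pull logic: the layer names are hypothetical (real layers are identified by content digests), and `layers_to_download` is a helper introduced only for this example.

```python
def layers_to_download(image_layers, local_layers):
    """Return the layers of an image that are not yet in the local cache, in order."""
    cached = set(local_layers)
    return [layer for layer in image_layers if layer not in cached]


if __name__ == "__main__":
    # Hypothetical layer identifiers shared by two application images.
    base = ["ubuntu-base", "python-runtime"]
    solver_a = base + ["solver-a"]
    solver_b = base + ["solver-b"]

    cache = []
    first = layers_to_download(solver_a, cache)
    cache.extend(first)
    second = layers_to_download(solver_b, cache)
    print(first)   # every layer of the first image has to be fetched
    print(second)  # only the layer unique to the second image is fetched
```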

What are Dockers?

There are many containerization technologies and platforms out there. By far the most popular is the Docker platform, so much so that the words "containers" and "dockers" have almost become synonymous. Docker is also the name of the company that maintains the Docker platform. Container images are called Docker images. Docker images can be maintained in a remote, version-controlled repository for ease of sharing and distributing applications. Such a remote repository is called a Docker registry. Docker Hub, hosted by Docker, is the most popular registry, but it is also possible to host your own Docker registries.

On your host machine, you need to install and start a Docker daemon (i.e., a server process). The Docker daemon manages all Docker-related operations, such as starting and stopping containers, managing the local version-controlled store of Docker images, and pushing and pulling Docker images to and from remote registries.

Kogence uses the Docker platform. Docker is very well adopted in the industry and has a really large development community. You can rest assured that almost any question you might have has already been answered. It has been well vetted, and all the large IT companies (Microsoft, Google, Amazon, you name it) have adopted it and have their own efforts to extend it.

Virtual Machines and HPC

High performance computing (HPC) on the cloud (aka cloud HPC) is one of the fastest growing segments of the IT software industry. Many apps and startups are trying to solve the old problem of the cost and operational inefficiency of on-prem HPC.

In the cloud computing world, software is often distributed as virtual machine (VM) images. Amazon calls them AMIs (Amazon Machine Images). On the AWS Marketplace you may be able to find AMIs for many HPC software packages and vendors, such as Matlab, Comsol, Dassault, and Siemens. You can select an AMI and then launch a VM from that AMI.

For some time, this was quite popular. But you soon realize that AMIs bring the same IT management headaches as traditional workstations in enterprises. If you have two different software packages deployed on two different AMIs, you cannot use them together. For example, if you want to use Comsol with Matlab Livelink, you have to deploy both Matlab and Comsol on the same AMI, and now you start running into dependency issues again. As soon as you have more than a handful of users and more than a handful of AMIs, version control also becomes a huge headache. User A wants to keep using AMI version 1, but user B wants to use AMI version 2. Some days later, both user A and user B want a new version of Matlab but want to keep everything else the same in their respective AMIs. You soon realize that managing AMIs is no different from managing workstations in traditional clusters or enterprise environments.

There is also a performance-related issue with AMIs that very few of us appreciate until it is too late. You can end up spending days or weeks debugging the poor performance of your HPC applications on AWS machines, only to realize that it is fundamental to how AMIs work on the cloud. AMIs are actually stored in slow, cheap storage that AWS calls S3. You might assume that when you launch a machine, AWS pulls all the data from S3 into your VM. In reality, cloud providers attach what is called block storage (EBS in AWS parlance), and these EBS volumes are deceptive: they make you believe that all the data in your AMI is already present when you launch a VM. But that is just "virtual"! A block of data is pulled from S3 only when you first try to access it. Imagine you have a 1 TB disk snapshot in your AMI. It can take tens of hours to get all of that data from S3, but you will only observe this latency as and when your application starts asking for the data. You get the feeling that everything was there when you launched the VM, and then you wonder why your HPC application is running so slowly!
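
The "tens of hours" figure can be checked with back-of-the-envelope arithmetic. The sketch below simply divides volume size by a sustained read rate. The throughput values are illustrative assumptions, not measured AWS figures, and `hours_to_read_all` is a hypothetical helper.

```python
def hours_to_read_all(volume_gib, throughput_mib_per_s):
    """Hours needed to read an entire volume once at a sustained throughput."""
    seconds = volume_gib * 1024 / throughput_mib_per_s
    return seconds / 3600


if __name__ == "__main__":
    # A 1 TiB volume read at a few assumed first-access rates.
    for mib_per_s in (10, 50, 250):
        hours = hours_to_read_all(1024, mib_per_s)
        print(f"{mib_per_s:>4} MiB/s -> {hours:.1f} h")
```

At an assumed first-access rate of about 10 MiB/s, reading 1 TiB takes roughly 29 hours, which is in line with the estimate above. Higher sustained rates shorten this proportionally.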

Docker Containers and HPC

Kogence is a container-based cloud HPC platform. We maintain and launch your applications in Docker containers.

We talked about the benefits of Docker containers: process isolation, dependency bundling, portability, and lightweight, version-controlled images. Together these bring an amazing amount of efficiency, portability, and agility in moving applications from one machine to another. Then why is Docker not popular in the HPC community? Let's discuss the technical challenges in using Docker containers for HPC.

  1. Security: One of the most serious challenges is the security model of all container platforms, including Docker. Docker's security model is "all or nothing". If you allow a user to do anything with Docker (either create or run Docker containers), then you are effectively making that user a "root user" or administrator on the host system. All container platforms operate on the premise that trusted users are running trusted containers. This works for the IT industry, and there is no reason for them to try to solve this. Imagine the Netflix service being deployed as a cluster of containers running on the cloud. Netflix controls the development of the containers ("trusted containers"), and Netflix controls who deploys them and how ("trusted users"). Netflix customers only interact with the services exposed by these containers. Customers cannot start containers, they cannot stop containers, and they cannot bring their own containers onto Netflix's cloud infrastructure. The HPC use case is entirely different. HPC cluster users will be starting HPC application containers, they will be stopping them, and they will want to bring their own containers to the cluster. As soon as you give a user this ability, they can pull any container image onto the platform and modify it, they can get into anybody's running container, they can stop any running container, and so on. We will see how Kogence solves all these issues in a moment.
  2. Orchestration: Let's now consider the second major headache of using Docker for HPC. Let's say we have isolated Matlab in its own Docker image and Comsol in its own Docker image, and you want to use the two software packages together. For example, you want to use Comsol Livelink, which allows you to write Matlab scripts that invoke Comsol functionality. In the cloud computing world we refer to this as "orchestrating multi-container services". There are many container orchestration engines. Swarm is the engine that comes with the Docker platform. Google built its own engine called Kubernetes. Mesos is another orchestration engine out there. Making various containers aware of each other and able to talk to each other is involved and requires quite a bit of know-how and effort. Again, imagine that Netflix orchestrates and runs its services in multiple containers. The Netflix service will be running for years, so it is not unreasonable for them to hire a few IT engineers to write orchestration scripts and get all the containers talking to each other. On the other hand, in the HPC world, a simulation code may run for a few days, while it takes months to build the simulation model and get all the science right. It is absolutely not reasonable to expect scientists and engineers to spend a substantial amount of effort making isolated containers talk to each other. It is just way too IT-ish for the HPC industry to even be interested in it. Scientists expect these HPC applications to just work. If I am inside the Comsol container and I type matlab, then I expect Matlab to come up.
  3. Scheduling: Orchestration and scheduling actually go hand in hand. Swarm, Kubernetes, and Mesos all do orchestration as well as scheduling. The HPC ecosystem has its own schedulers: SGE, PBS, Torque, Slurm, etc. Moreover, how a cluster should scale up and down, and what triggers scheduling events, is very different in HPC from the needs of the IT industry. When Netflix faces a surge of customers, it launches additional containers that run exactly the same services. In the HPC world, as a simulation deck progresses, we start completely different application containers, from completely different states, doing completely different things.

[Figure: Kogence Container Cloud Big Compute Platform]

Enter Kogence

On Kogence HPC Grand Central, all scientific simulation software is maintained in isolated Docker images. Users can build their own Docker images and bring them to Kogence Grand Central. Users only deploy the application; they don't have to worry about scheduling, orchestration, graphics, data sharing, file systems, etc. Containers run as non-privileged users. Users can start any container they have access to, and they start it as a non-privileged user. They can only start a container using invocation options that are allowed by the container admin who deployed the container on the platform. One user cannot see, stop, or modify another user's container in any way.
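
The idea of restricting invocation options can be sketched as a simple allowlist check. This is an illustrative model only, not Kogence's actual implementation: the policy contents are assumptions and `validate_run_options` is a hypothetical helper. The option names themselves are standard `docker run` flags.

```python
def validate_run_options(requested, allowed):
    """Return the requested options that the container admin has not allowed.

    `requested` and `allowed` are lists of option names such as "--cpus".
    An empty result means the invocation is acceptable under this policy.
    """
    permitted = set(allowed)
    return [opt for opt in requested if opt not in permitted]


if __name__ == "__main__":
    # Hypothetical policy: resource limits and environment variables are allowed,
    # while privileged mode and host volume mounts are not.
    policy = ["--cpus", "--memory", "--env"]
    print(validate_run_options(["--cpus", "--memory"], policy))
    print(validate_run_options(["--privileged", "--volume"], policy))
```

A platform could refuse to start a container whenever the returned list is non-empty, so that users never pass options that would escalate their privileges on the host.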

Kogence provides automatic "docker compose" functionality, making containers talk to each other automatically. You can choose Matlab and Comsol before you start your simulation. If you type comsol in the Matlab container, Comsol will come up, and vice versa, without you needing to do anything.
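
For readers unfamiliar with Docker Compose, the sketch below builds the kind of specification one would otherwise write by hand: two services attached to one shared network, where services can reach each other by service name. It is not Kogence's implementation; the image names, the network name `simulation`, and the helper `compose_spec` are all hypothetical, and network reachability is only one part of what the text describes.

```python
import json


def compose_spec(images):
    """Build a Compose-style specification that puts every service on one shared network.

    `images` maps a service name to an image name; both are hypothetical here.
    """
    services = {
        name: {"image": image, "networks": ["simulation"]}
        for name, image in images.items()
    }
    return {"services": services, "networks": {"simulation": {}}}


if __name__ == "__main__":
    spec = compose_spec({"matlab": "example/matlab", "comsol": "example/comsol"})
    # Printed as JSON for inspection.
    print(json.dumps(spec, indent=2))
```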

On Kogence we have integrated traditional HPC schedulers such as Slurm, SGE, and PBS with our Docker container platform, so users can write traditional job submission scripts and submit them to the scheduler of their choice.
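
As a rough illustration of a traditional job submission script that runs a container, the sketch below renders a Slurm batch script as text. The `#SBATCH` directives shown are standard Slurm options. The job name, image name, and solver command are hypothetical, and whether a plain `docker run` is what executes on a compute node is an assumption, since the details of the Kogence integration are not described here.

```python
def render_slurm_script(job_name, nodes, time_limit, image, command):
    """Render a Slurm batch script that runs `command` inside a Docker container."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
        "",
        # Assumed invocation; --rm removes the container when the job step ends.
        f"docker run --rm {image} {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: one node, a two-hour limit, and a made-up solver image.
    print(render_slurm_script("demo-sim", 1, "02:00:00", "example/solver", "run-solver input.dat"))
```

The rendered text could be saved to a file and submitted with `sbatch` on a cluster where Slurm is installed.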

This brings the best of both worlds: ease of use and simulation development using multiple simulators, plus efficiency, portability, and agility in maintaining these applications and moving them from one machine to another.

Contact Us

Please contact us to discuss further.