Kogence HPC Grand Central Terminology

Miscellaneous Kogence Terminology

What is a Model?

Every cloud HPC project users create on Kogence is called a Model. Concept of Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software or a simulator; we launch a Model. Model is connected to a cluster and a stack of software and simulators.

Each model consists of:

1. An independent project documentation wiki. Wiki comes with a full featured WYSIWYG editor. Any graphics generated on execution of your Model is automatically pulled into the project wiki. Permission controls you choose for your Model apply to the Model's wiki as well. A private wiki would show up in the Model Library page only if one of the collaborator with correct permission logs in.
2. An independent discussion board. Permission controls you choose for your Model apply to the Model's discussion boards as well.
3. An independent control of permissions at the Model level. If you give your collaborators a edit permission then your collaborators would have edit permission on all assets under that Model including project files, wiki, discussion boards, cluster settings and stack settings. You cannot control permissions on individual file level, for example.
4. A connected cluster. You can chose to run your Model on a single cloud HPC server or or you can choose to run on an autoscaling cloud HPC cluster. You can choose the maximum number of nodes that you want your cluster to scale. Nodes are created and deleted automatically based on the progression of your Model execution. Any settings you choose for setting up your cluster remain stored with the Model.
5. A connected software stack. Model can be connected to multiple software. You can create a Workflow using these multiple connected software. Some of these can run in interactive mode and other can run in unattended batch mode. Some of these may start on the master node or the interactive node and others may be scheduled on the compute node of the autoscaling cloud HPC cluster. Some of these may be blocking in the sense that next commands run when the current one exits while others can run concurrently with the current command. Any settings you choose for setting up your software stack and invocation commands remain stored with the Model.
6. All assets of each Model are independently version controlled.
7. When you Copy a Model you are creating a new Model with all its assets being duplicated. From there on, new Model maintains its own new version control history.

What is a Simulator and a Container?

On Kogence, any software application, solver or a simulation tool, each being referred to as a Simulator, is deployed as an independent entity, referred to as a Container, that can be connected to Models using the Stack tab of the Model and invoked to do some tasks/computations on input files/data provided under the Files tab of the Model. Containers cannot contain user specific or Model specific data or setup. It has to be an independent entity that can be linked and invoked from any Model on Kogence.

We use the term Simulator in much wider sense than what is commonly understood. In the context of definition above, Matlab is a simulator and so is Python. Both are deployed in independent Container. On Kogence we use Docker container technology.

Concept of Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software or a simulator; we do not launch a Container; we launch a Model. Model is connected to a cluster and a stack of software and simulators

What is a Simulation?

Execution of the Model on a cloud HPC Cluster is called a Simulation. A Simulation is NOT a single job. Billing on Kogence works on per Simulation basis (i.e. per execution of a Model) not on per job basis. Simulation can consists of a complex Workflow of multiple jobs using multiple software as defined by the user on the Stack tab of the Model. Some of these jobs can run in interactive mode and other can run in unattended batch mode. Some of these may start on the master node or the interactive node and others may be scheduled on the compute node of the autoscaling cloud HPC cluster. Some of these may be blocking in the sense that next commands run when the current one exits while others can run concurrently with the current command.

Concept of Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software or a simulator; we launch a Model. Model is connected to a cluster and a stack of software and simulators

How Long Does it Take for My Simulation to Start?

If your Model is connected to a single cloud HPC sever that is not already persisting (i.e. when you start your Simulation for the first time) then it can take about 2 minutes for the server to boot up and start executing jobs defined in the Stack tab of your Model. Once server has started executing your jobs, you will see the Visualizer tab becoming active on the top NavBar and a Stop button next to it. If you press the Stop button to stop the simulation and then press the Run button again then server persists through use of kPersistent technology. That means that server would start executing your jobs immediately and you can connect using Visualizer tab immediately. This allows you to work interactively, stop the execution, edit and debug code using code editor accessible under the Files tab of your model, and the restart your Model again just like your would do when you are on the onprem workstation.

If your Model is connected to an autoscaling cloud HPC cluster then it can take about 5 minutes for the cluster to boot up and configured to start executing jobs defined in the Stack tab of your Model. Once server has started executing your jobs, you will see the Visualizer tab becoming active on the top NavBar and a Stop button next to it. kPersistent technology is not yet available with clusters.

What are CPU-Hrs?

CPU-Hr is the unit we use on Kogence to measure amount of computing power that has been reserved or consumed independent of the type of hardware being used. CPU-Hrs = # of CPUs X # of Hours.

We measure time in the steps of hours. This does not mean that every time you start your Simulation you will be billed for full one hour. Only the first time you start a Simulation we charge your for one full hour. If, before the full hour is completed, you stop and restart same or different Model on same hardware then we do not charge you anything until the completion of the full hour since the time you first started your first Model. After the completion of a full hour, you would be charged for another full hour. Same logic continues for all subsequent hours.

What is 1 CPU Credit or 1 HPC Token?

HPC Token and CPU Credit are one and the same thing depending upon the release version of Kogence HPC Grand Central App deployed under your subscription. On Kogence, HPC Tokens or the CPU Credits is the currency that you purchase and then spend it while you consume the HPC compute resources on the Kogence Container Cloud HPC platform.

The cost of compute resources is specified in terms of CPU Credits or HPC Tokens. Please check the pricing page for the currently available pricing. Typically, 1 CPU Credit = 1 CPU for 1 hour. Accelerated hardware such as a GPU compute node is also priced in terms of CPU Credits. Typically, 10 CPU Credit = 1 GPU-accelerated-CPU for 1 hour.

So for example if you connect your Model to a 4 CPU machine and select the wall time limit of 10 hours, then your accounts needs to have at least 40 CPU Credits remaining otherwise your Simulation would not start. If your Simulation gets started and either ends automatically or you stopped it by pressing the stop button after 1 hour and 20 minutes, for example, then we will refund your account with 32 CPU Credits. If you restarted the Simulation on same hardware and with same wall time limit within next 40minutes, say after 20minutes to be exact, then we would again block 40 CPU Credits at the start. But if you stoped the Simulation after 10 minutes only then we would refund your account with full 40 CPU Credits back. You are not charged anything for this second Simulation because you already paid for 2 hours during the first Simulation.

As of this writing, everybody starts with free 20 CPU Credits. We top up your accounts every month for free. You can earn more free Credits. You can also purchase more Credits. Credits do not have any expiration date. Currently you can purchase CPU Credits for as low as as $0.02. This means you can purchase 1 CPU-Hr of computing for$0.02. Kogence reserves the right to change these pricing without notice. Please check the pricing page for the currently available pricing.

Cloud HPC Hardware Terminology

What is 1 CPU on Kogence?

1 CPU on Kogence is same as what is called as 1 CPU in the Resource Monitor of Microsoft Windows 10 PC, for example. Similarly, # of CPUs shown on Kogence is same as what is shown as "CPU(s)" by the lscpu utility of linux platforms. Different platforms and utilities may call this same logical computing unit by different names. On same Microsoft Windows 10 PC, for example, System Information and Task Manager utilities calls it as "number of logical processors" and Amazon AWS (see here) and Microsoft Azure (see here) call this "number of vCPU". In general,

${\displaystyle no\_of\_CPUs=no\_of\_hardware\_thread\_per\_core\times no\ of\ cores\ per\ socket\times no\ of\ sockets\_per\_server}$.

This is the most relevant logical unit of computing power in the context of cloud High Performance Computing (HPC). Each logical processor in fact has an independent architectural state (i.e. instruction pointer, instruction register sets etc.) and each can independently and simultaneously execute instruction sets from one worker thread or worker process each without needing to do context switching. Each of these register sets are clocked at full CPU clock speed.

In the context of MPI (Message Passing Interface, multi-processing framework) and OpenMP (multi-threading library), for example, each of these CPUs can run an individual MPI process or an individual OpenMP thread without requiring frequent context switches. So if you are scheduling an MPI job on a 4 CPU machine on Kogence then you can run mpirun -np 4 and there will be no clock cycle sharing or forced context switching among processes.

Between CPUs, Cores, Sockets, Processors, vCPUs, logical processors, hardware threads etc. terminology can get confusing if you are not a computer scientist! Lets take a look at common Microsoft Windows 10 PC that we are all so familiar with. All other computer system will also show similar info. Lets first look at the task manager. This machine has an Intel Core i5-7200 CPU. But don't jump to call it a 1 CPU machine, that is is wrong. You will see below. Here we also notice that this machine has 1 socket, 2 cores and 4 logical processors. Now if you open the System Information, you will see something similar. Now if you open the Resource Monitor, you will see something like this. Notice that this is a 4 CPU machine -- from CPU 0 to CPU 3. If you are on a linux machine you run the lscpu command, you will see 4 CPU(s) as well.

Hardware threads within a core do share one common floating point processor and cache memory, though. Please check FAQ section to learn more about CPUs and hardware threads. In FLOPS heavy HPC applications it is the utilization of floating point processor that is more important than the utilization of CPU.If a worker thread or a worker processes running on one of the hardware thread can keep the floating point processor of that core fully occupied then even though the system/OS resource management and monitoring utilities (such as top or gnome system-monitor on linux platforms) may show overall CPU utilization being capped at most at 50%, you are already getting most out of your HPC server. 50% CPU utilization may often be misleading in the context of HPC. Remember that the CPU utilization percentage is averaged across total number of CPUs (i.e. total number of hardware threads) in your HPC server and if you are using only one hardware thread per core then maximum possible CPU utilization is 50%. This just means that CPU can in theory execute twice as many instruction per unit time provided they are not all asking for access to floating point processor. If they are then 50% utilization is as best you can get.

For these reasons, some FLOPS heavy HPC applications available on Kogence (e.g. Matlab) will not run one independent worker process or worker thread on each hardware thread. Instead they run one worker process or worker thread per core. But Kogence allows you to start multiple instances of these applications. You can experiment with single instance and multiple instance and see if you get better net throughput.

Most High Performance Computing (HPC) servers that Kogence offers are built using Intel microprocessor chips. HPC server motherboards consists of multiple-sockets with each socket plugged with one Intel microprocessor chip. Each microprocessor chip has multiple cores. Each core is built using Hyper-Threading technology (HTT). HTT creates 2 logical processors out of each core. Different operating system (OS) utilities and program identify these logical processors by different names: some identify them as 2 CPUs and others identify them as 2 hardware threads. Hardware threads is a very confusing terminology. It has nothing to do with user (worker) threads that your program might start. Each hardware thread is capable of independently executing an independent task -- ether an independent worker thread or an independent process.

Hyper-Threading Technology is a form of simultaneous multithreading technology introduced by Intel. Architecturally, a processor with Hyper-Threading Technology consists of two logical processors per core. Just like a dual-core or dual-socket configuration that uses two separate physical processor, each of these logical processor has its own processor architectural state. Each logical processor can be individually halted, interrupted or directed to execute a specified process/thread, independently from the other logical processor sharing the same physical core. On the other hand, unlike a traditional a dual-core or dual-socket configuration that uses two separate physical processors, the logical processors in a hyper-threaded core share the execution resources. These resources include the execution engine, caches, and system bus interface. The sharing of resources allows two logical processors to work with each other more efficiently, and allows a logical processor to borrow resources from a stalled logical core. A processor stalls when it is waiting for data it has sent for so it can finish processing the present thread. The processor may stall due to a cache miss, branch misprediction, or data dependency. The degree of benefit seen when using a hyper-threaded processor depends on the needs of the software, and how well it and the operating system are written to manage the processor efficiently.

In vast majority of modern HPC use cases, we find that HTT helps speeding up the application. It is a very effective approach to get most performance for a given cost. At high level any HPC program when loaded into CPU registers as a set of machine instructions can be thought of as directive to CPU to repeatedly looping over following: 1/ Fetch instruction, 2/ Decode instruction and fetch register operands; 3/ Execute arithmetic computation; 4/ Possible memory access (read or write); 5/ Write back results to register. It is the step #4 that most people ignore logically when thinking about speed of executions. In this context, even cached memory is slow, much less main memory. L1 cache typically has a latency of ~2 CPU cycles, L2 cache typically has a latency of ~8 CPU cycles while L3 cache typically has a latency of ~100 CPU cycles. Main memory has about 2X more latency than the L3 cache (~200 CPU-cycles away). If your code opens some I/O pipes to read/write on files, for example, than I/O devices on main memory bus has 100X-1000X more latency than main memory (~20K to ~200K CPU-cycles away). If I/O devices are on network or on PCIe bus than you are looking at miliseconds (~2million CPU-cycles away) level latency at the least.

Now imagine your HPC code is running on a CPU. CPUs are really good in doing step #1 to #3 and step #5. For example, modern CPUs can do 4 floating points (4 FLOPS) per CPU-cycle. If your program needs to access some data from main memory than it will be waiting for ~200 CPU-cycles and the floating point arithmetic compute engine of CPU will be just sitting idle. If it is waiting for data from an I/O then compute engine may be waiting idle for millions of CPU-cycles. If you could use that time, you could have completed 4 million floating point operations utilizing that idle time. That is exactly what the HTT technology accomplishes. Even the most heavily optimized real world HPC code would need to access cache memory frequently at the least. This means there are several 10s to several 100s of CPU-cycles, at the least, that could be utilized by other thread. Whether hardware threading (HTT) will enhance the performance or not basically boils down to the ratio of floating point instructions to instructions that need to fetch or write data from cache, main memory or I/O devices. Even if that ratio is 100 or 1000, you would still expect HTT to boost performance.

In case the HTT is turned off, it is not that you would be completely wasting those idle cpu cycles. The OS can still do the context switching and load other ready to be executed process or thread (user worker thread or OS thread). The benefit of HTT is that the time it takes to do the context switching between hardware threads is much less than that needed by the OS to do the context switch as the Thread Control Block, TCB, is already loaded in the hardware.

However, in cases where both threads are primarily operating on very close (e.g., registers) or relatively close (first-level cache) instructions or data, the overall throughput occasionally decreases compared to non-interleaved, serial execution of the two lines of execution. In the case of LINPACK benchmark, a popular benchmark used to measure supercomputers on the TOP500 list, many studies have shown that you get better performance by disabling HT Technology. We think these are largely artificial constructs and do not represent how real life HPC codes work. These LINPACK benchmarks are specifically designed to probe the speed of floating point operations and deliberately avoid accessing memory or I/O. Real world HPC workloads cannot avoid memory access or I/O as efficiently as these benchmarks do. These benchmarks are hand crafted specifically to probe the speed of floating point operations.

One of the drawbacks of traditional onprem HPC cluster is that it hinders experimentation with HTT for your specific use case at hand. Either entire cluster needs to switch ON HTT or entire cluster needs to switch OFF HTT. Your onprem cluster admin typically makes that decision based on all types of workloads that all other users typically run on that cluster. On kogence cloud HPC platform you create your own personal autoscaling HPC cluster for each model you are executing. You can run same model multiple times on brand new clusters and disable HTT Technology in some cases while leaving it enabled in others. You can easily test both configurations and decide which is best based on empirical evidence. With that said, for the overwhelming majority of workloads, you should leave HTT Technology enabled.

OS configured on Kogence are HTT-aware, meaning they know the distinction between hardware threads and physical cores and would properly schedule user threads to reduce stalled CPU states and enhance performance unless you specifically instruct Kogence platform to pin your processes and worker threads to specific cores or to specific hardware threads.

What are the SISD, SIMD, MISD and MIMD Architectures?

Oldest microprocessor architectures were SISD, which is short for single instruction single data. This means that a single piece of data (say two operands on which add instruction needs to be executed) is loaded on CPU registers and a single instruction (such as the add instruction) is executed on them.

SIMD, which is short for single instruction multiple data, enables us to process multiple data with a single instruction. This allows data to be "vectorized". For example, we may be able add two columns of a matrix in a single CPU clock cycle. The width of SIMD is the number of data that can be processed simultaneously and it is determined by the bit length of registers. Intel hardware has offered vectorized SIMD capability beginning with SSE (Streaming SIMD extensions) which supports 128-bit registers. AVX (advanced vector extensions) offered 256-bit registers and AVX-512 offers 512-bit registers. On Kogence, all hardware offered is AVX-512 capable. Since the theoretical peak performance of a CPU is the value when the application uses the vector width to the full, the utilization of SIMD is crucial for the performance of applications on modern CPU architecture. However, it is not trivial to transform a scalar kernel to a SIMD-vectorized one. Additionally, the optimal method of vectorization is different for each instruction set architecture (ISA). Most software, simulators and solvers that we offer on Kogence, such as GROMACS, LAMMPS, NAMD etc. have been effectively vectorized and are available to use out of the box to our users.

Multiple instruction single data (MISD) is a theoretical processor architecture concept that, to our best knowledge, has never been implemented in any commercially available computing hardware.

All hardware that we offer on Kogence are Multiple instruction multiple data (MIMD) capable. Each CPU in itself is SIMD but all computing hardware we offer consists of multiple CPUs that can independently execute multiple instructions, each instruction working on vectorized data, simultaneously. So the computing the computing hardware in aggregate is known as an MIMD system.

MIMD computing systems can be further divided into two categories -- the shared memory architecture (known as symmetric multiprocessors or SMP) and distributed memory architecture (also known as cluster computing). We offer both on Kogence. Please refer to other sections of this document for more details.

What is the Memory Latency and Bandwidth on Kogence?

Before we answer that question, lets cover some basics.

Virtual Memory and Page Tables

In HPC, we often deal with shared memory parallelism that means processes can share memory. That means multiple virtual memory pages (as addressed by different processes) may be mapped to same physical memory frame. If NUMA is enabled then reverse if also possible, i.e. same virtual memory page may be mapped to multiple physical memory frames (for read only frames) with individual frames being on local physical memory of each individual CPU.

Cache Memory, Translation Look Aside Buffer (TLB) and Main Physical Memory

Cache and TLB are part of the Memory Management Unity (MMU) which is integrated on same silicon chip together with the microprocessor. Both cache and TLB are used to reduce the time it takes for a running process to access a physical memory location which is located off the microprocessor chip and on a different silicon chip on the motherboard connected on the memory bus on all hardware that Kogence offers. Cache is basically a quick access copy of small sections of physical memory in a static RAM on the microprocessor's silicon chip while TLB is a quick access copy of small sections of page table (which also resides in physical memory) in the static RAM on the microprocessor's silicon chip.

Latency and Throughputs

Servers offered on Kogence have 3 levels of cache: L1, L2 and L3. L1 cache is further broken down into instruction (L1i) and data (L1d) cache. Each core gets its own L1 and L2 cache while L3 cache is shared among all cores in a socket. Hyperthreads do not get their own cache. Typically our hardware will have 32KB each of L1i and L1d cache, 256KB of L2 cashe and >10MB of L3 cache. Amount of each level of cache differs from hardware to hardware. Check the specification page for more details. On Kogence hardware, TLB may reside between the CPU and the L1 cache. On servers Kogence offer,

• L1 cache typically has a latency of ~2 CPU cycles and a bandwidth of about ~500 bytes/CPU-cycle.
• L2 cache typically has a latency of ~8 CPU cycles and a bandwidth of about ~500 bytes/CPU-cycle.
• L3 cache typically has a latency of ~100 CPU cycles and a bandwidth of about ~200 bytes/CPU-cycle.
• Main memory has a latency of ~200 CPU cycles and a bandwidth of about ~100 bytes/CPU-cycle.

What are Shared and Distributed Memory Computing, SMP, NUMA and ccNUMA?

Oldest microprocessor chips had one CPU connected to an off chip RAM through a memory bus. Then came the multi-core microprocessors. Single chip had multiple CPUs in it connected to the same shared off chip RAM through a common memory bus. This shared memory computing architecture is referred to as the symmetric multiprocessor (SMP). What this means is that each CPU can access the common shared RAM in a completely autonomous fashion independent of the other CPUs. The architecture is "symmetric" when see from each core's perspective. There is no master/slave type of concept, there is a single OS across all CPUs and neither of the CPUs has to go through the network hardware (in case of RDMA) and/or OS communication stack (in case of TCP/IP). Processes running on each CPU can address the entire common memory space and access that memory space directly though the memory controller. An SMP system is a tightly-coupled system in which multiple processors working under a single operating system and a single virtual memory space can access entire system memory over a common bus or interconnect path.

Applications using open MP for parallelism rely on this shared memory architecture. You can not two threads of same process on CPUs that don't share memory. Older versions of MPI didn't leverage shared memory capability. Meaning if you were running two MPI ranks (i.e. two processes in same process group) on an SMP machine then those two process would have used inter process communication methods to access each others data. Starting from MPI-3, MPI now offers shared memory capability as well.

Of course, in the SMP architecture, the memory access latency will depend on the memory bus contention or memory access conflicts. As the number of cores connected to single memory bus started to grow this contention quickly became an issue and the performance boost expected from multiple CPUs quickly started to disappear. Cache memory architectures, integrated memory controllers (IMC) and multiple channel memory architectures were developed to reduce this memory contention in SMP architectures. In modern hardware that we offer on Kogence, memory architecture is quite complex to deal with possible contentions. Please see Intel Cascade Lake and Intel Skylake architecture documentation pages as well as other sections of this document.

SMP architectures are further broken down into two conceptual frameworks -- Uniform Memory Access (UMA) and Non-uniform Memory Access (NUMA). UMA architectures are basically the earliest SMP architectures where, in the absence of contentions, the memory access latency for each CPU is exactly the same. These UMA SMP architectures are clearly not scalable as more cores and CPUs are added. SMP architectures that we offer on Kogence are NUMA architectures. In NUMA SMP architectures, each CPU does not necessarily have completely same access latency to entire system RAM but, logically, there is still a single common virtual memory address space for all processes running on the single node. Each CPU can address the same entire virtual memory address space. Therefore, architecture is still known as SMP architecture. The older SMP applications will run on newer NUMA memory architecture unchanged -- hence the architecture is still SMP at a logical level.

Within each socket of the architectures we offer, there are two sub-NUMA clusters created by the two IMCs. Each socket gets a mesh interconnect between cores and IMCs, two IMCs and 6 memory channels (3 memory channels per IMC ) with the sockets on same node being connected by the UPI links. Absence any contentions, all the cores in a localization zone (i.e. in the sub-NUMA cluster) have same memory access latency. One can use OS provided utilities to specify process or thread affinities to within sub-NUMA clusters.

Cache management with these modern NUMA SMP architectures is significantly more complex as we need to make sure that the potentially multiple duplicate copies of data loaded into L1, L2 cache of each core as well as the L3 cache of the socket is in sync as multiple CPUs perform read write operations. Hardware architectures we offer are cache coherent NUMA (ccNUMA) which means that the cache coherence is ensured by the hardware itself.

Distributed memory computing (also known as cluster computing or loosely-coupled computing) should be contrasted with the tightly coupled shared memory (SMP) computing. Clusters consists of "nodes". A node is the largest unit of computing where cache coherency can be ensured. Different nodes are running independent copies of OS and have their own virtual memory address space. Processes running on different nodes need to access data through the network hardware (in case of RDMA) and/or OS communication stack (in case of TCP/IP).

What are "Number of Processes" and "Number of Threads"?

A process is a logical representation of an instance of a program that has been submitted to CPU and OS to manage its execution. Process is distinct from a thread -- both user or OS (worker) threads as well as hardware threads (HTT technology). Hardware thread is a confusing terminology. Each hardware thread can execute an independent process or an independent worker thread. On Kogence platform we refer to hardware thread as a CPU.

As opposed to threads, processes provides stronger logical isolation as you are stating independent instances of program to work independently. Intuitively, one can understand starting multiple processes of same program to be similar to starting same program multiple times. For example, if you start multiple instances of Matlab by tying matlab -desktop & multiple times on the CloudShell terminal then you are basically starting multiple Matlab processes. You should not be afraid of multiple instances of Matlab corrupting data or files because if one instance of Matlab opens a file then other instance cannot access it. Similarly both instances are creating there own temporary data in the memory to work on so you do not have to think about controlling access to variable in the memory.

Using multi-processing libraries such as MPI to start multiple instances of Matlab is much more preferred method as compared to above mentioned method of starting multiple instances on a shell terminal. As an example, if your code running under each instance of Matlab is trying to access same file then all instances except will crash complaining they cannot access the file. If you start processes through standard libraries such as MPI then you can be assured that MPI will manage those things. It will take care of keeping a process on wait until other processes close the file.

At a lower level, a process is defined by a Process Control Block (PCB). PCB stores state of execution of a process. It has all the information that OS and CPU need to freeze or unfreeze a process. Any set of instructions submitted to CPU that comes with its own PCB is a process. Each process gets a unique process ID (PID) that you can check using OS utilities. So any task that has a unique PID is a process. As an example, if you execute a bash shell script then the bash-shell started to execute the script instructions will get a PID and would be a process. If from that shell script you provide instructions to call another bash shell script then that script will be executed in another child bash shell and that bash shell will gets its own PCB and PID. Modern OS are smart enough to not duplicate the machine code for each instance in the main memory and in cache. For example, they will only load one copy of the text or the code segment in the memory. But each process will get its own program counter, a CPU register, that keeps track of which instruction is being executed currently. Program counter is part of PCB and gets saved and restored when processes is taken out by OS from running state and then brought back to running state. Typically, virtually memory allocated to a process and files and other I/Os opened by a process are restricted to that process. Other processes cannot access those resources. But processes can specifically issue instructions to share these resources with other processes. In HPC we do this by using standard libraries such as MPI.

Processes can start children processes (and those can start grand-children processes etc). Each of those will get their own PCB (and therefore a PID). Processes can also start multiple threads. Threads will operate under same PCB. Threads do not get their own PCB (instead they get a Thread Control Block, TCB, which has a link to the PCB of the parent process). Threads can access resources of the process.

Most HPC solvers can also do multi-processing as opposed to multi-threading. Meaning multiple CPUs across multiple nodes can be instructed to use independent instances of same solver code (children processes) and operate independently on a specific set of data and instruct children processes to work on them in parallel. As children may be running on different nodes, they don't have access to memory of parent process in the parent compute node. Children processes exchange data with each other and with parent using message passing interface (MPI) library subroutines. If MPI processes are running on same node then they do have ability to access shared memory as well. But all request to access data/memory still goes through MPI subroutines. MPI subroutines manage mode of access to the data automatically by keeping track of which children process is running on which node.

On Kogence, we restrict product of processes and threads to be equal or less than the number of CPUs in the cluster. This eliminates serious overhead from context switching.

What is the Overhead of Process Context Switching?

Context switch refers to which instruction set is a given core executing at a given time. Context switch can refer to switching from one hardware thread to another hardware thread (HTT technology), switching from one user (worker) thread to another user (worker) thread or from one process to another process. Context switching from one hardware thread to another hardware thread has negligible overhead. This FAQ discusses the overhead of process context switch. There is another FAQ that discusses the overhead of worker thread context switch.

When a request is made to execute a different process on the same CPU (either triggered by user code instruction, i.e. voluntary context switch, or by the OS scheduler because running process went into idle state waiting for some I/O or because OS scheduler wants to give processor time to other processes, i.e. involutary context switch) then first a switch from user mode to OS mode is triggered. This simply means that a subroutine from OS code base needs to be called that will perform the task of context switch. OS code is just like usual user code that is loaded into the physical memory. This OS code will save the process control block (PCB, which consists of current values in a set of registers in the CPU such as the program counter, pointer to page table etc.) of the current process to the physical memory and load the PCB of new process from physical memory into the CPU registers. OS will then make another switch from OS mode to user mode to let CPU start executing instruction from new process. In older hardware and OS architectures, a switch from user mode to OS mode itself was implemented like a full process context switch. That meant saving existing process PCB, loading OS PCB, executing OS code, saving OS PCB and then loading the new process PCB. But in modern hardware and OS, cost of switching out of user mode to OS mdoe and then back is much less expensive and in not a significant portion of the cost of context switch anymore. Also, on virtualized cloud HPC hardware, some hypervisors need to give control to the host machine OS and guest OS cannot do this switch. This increases the computational cost of context switch by an order of magnitude. On Kogence platform, we have carefully configured that system to eliminate this latency.

Therefore the fixed cost of doing a context switch consists of: switching from user mode to OS mode; storing and loading of PCB; and switching back from OS mode to user mode. On Kogence cloud HPC platform, switching in and out OS mode back to user mode typically costs about 100 CPU clock cycles. With CPU clock cycles of about ~2 to 3GHz, the time it takes to switch back and forth between OS mode and user mode would take ~50ns. Among the fixed cost components, the cost of storing and loading process PCB is much larger. This is easily 2 orders of magnitude bigger and takes about 2,000 CPU cycles or ~1µs.

Much bigger portion of cost is the variable cost. The variable CPU cycle cost of doing context switch consists of: flushing of the cache; and flushing of the TLB. Note that each process uses the same virtual memory address space but they are each mapped to different physical memory address space through page table. A small portion of page table is cached on the TLB. Now depending upon the number of pages of virtual memory that the old and new processes use (called the working set), CPU might encounter lots of TLB misses on context switching. So it may have to flush the TLB and load new sections of page table. In addition, the old section of physical memory frames that were cached on various levels of cache may also become useless and CPU might encounter lots of cache misses and it may have to flush the cache and lot new physical memory frames into the cache.

The variable CPU cycle cost of switching from old process to new process and back to old process will change dramatically depending the process pinning you instruct the OS scheduler to use. Lets say CPU1 is executing process1. Then OS asked the same CPU1 to start executing process2. After some time, now OS wants to restart the process1. The number of CPUS cycles we have to waste before we can successfully start executing process1 changes dramatically if OS schedules process1 back to original CPU1 or if OS schedules it to a new CPU2 (and instead starts a process3 on original CPU1). Note that each core gets its own L1 and L2 cache while L3 cache is shared between cores. Since address translations from virtual memory pages that preocess1 needs to access the physical memory address may already be present in the TLB of CPU1 and those required frames of physical memory may also be already present in L1 or L2 cache of CPU1 (since process1 was running on CPU1 sometime back), the process1 may restart on CPU1 much quicker than on CPU2. If new CPU2 sits on a different socket then cost will be even higher as both TLB and all levels of cashes will need to be repopulated. By the way the process pinning affects fixed cost of context switching as well because now PCB is also not available in cache and needs to read from main physical memory.

Variable cost also depends strongly on the size of virtual memory pages that a process is using. If processes use very large page sizes than chances of need to flush TLB may be lower (as number of entries needed in the TLB may become smaller) but chances of need to flush cache may increase.

All of the above depend strongly on the size of the caches in the CPU you're using. Typically, if pages are available in cache then time it takes to write pages can easily take 100,000 cycles or few microseconds. If we miss cache and TLB then this can increase by an order of magnitude or to few 10's of microseconds. As we discussed above this also depends on the process pinning instructions your code might give to OS. Past a certain working set size, the fixed cost of context switching is negligible compared to the variable cost due to the cost of accessing memory. Because of all these reasons, it is very hard to put any representative/average number on the cost of context switching.

In summary,

• Switching from user mode to OS mode and back takes about 100 CPU cycle or about 50ns.
• A simple process context switch (i.e. without cache and TLB flushes) costs about 2K CPU-cycles or about 1µs.
• If your application uses any non-trivial amount of data that would require flushing of TLB and cache, assume that each context switch costs about 100K CPU-cycles or about 50µs.
• As a rule of thumb, just copying 30KB of data from one memory location to another memory location takes about 10K CPU-cycles or about 5µs.
• Launching a new process takes about 100K CPU cycles or about 50µs.
• In the HPC world, creating more active threads or processes than there are hardware threads available is extremely detrimental (e.g. in 100K CPU-cycles, CPU could have execute 400K FLOPS). If number of worker threads is same as the number of hardware threads then it is easier for the OS scheduler to keep re-scheduling the same threads on the CPU they last used (weak affinity). The recurrent cost of context switches when an application has many more active threads than hardware threads is very high. This is why on Kogence, we do not allow product of processes and threads to be higher than the number of CPU.
• On interactive nodes or the master nodes, we do not restrict the number of jobs, processes and threads you can start. A interactive node has lots of background processes taking small amounts of CPU time. This means number of threads can be more than number of CPUs. Threads tend to bounce around a lot between CPUs. In this case the costs of context switching and thread switching don't significantly differ in practice, unfortunately. For HPC workloads started on interactive or master nodes, you should pin processes and threads to specific CPUs to avoid this overhead.

• If you are doing thread switch within same CPU (proper CPU pinning) with no need to do TLB and cache flushes then context switch takes about 100 CPU-cycles or about 50ns.
• A thread switch going to different CPU will cost about the same as a process context switch does i.e. its costs about 2K CPU-cycles or about 1µs. Here we are assuming that virtual memory that thread needs is already in cache and we dont need TLB and cache flushing.
• If your application uses any non-trivial amount of data that would require flushing of TLB and cache, assume that each thread context switch costs about 100K CPU-cycles or about 50µs.
• As a rule of thumb, just copying 30KB of data from one memory location to another memory location takes about 10K CPU-cycles or about 5µs.
• Creating more active threads than there are hardware threads available is extremely detrimental. If number of worker threads is same as the number f hardware threads then it is easier for the OS scheduler to keep re-scheduling the same threads on the CPU they last used (weak affinity). The recurrent cost of context switches when an application has many more active threads than hardware threads is very high. This is why on Kogence, we do not allow product of processes and threads to be higher than the number of CPU.
• On interactive nodes or the master nodes, we do not restrict the number of jobs, processes and threads you can start. A interactive node has lots of background processes taking small amounts of CPU time. This means number of threads can be more than number of CPUs. Threads tend to bounce around a lot between CPUs. In this case the costs of context switching and thread switching don't significantly differ in practice, unfortunately. For HPC workloads started on interactive or master nodes, you should pin processes and threads to specific CPUs to avoid this overhead.

How Can I Monitor CPU Utilization on Kogence Cloud HPC Servers?

When using a tool like top or gnome-monitor on Kogence HPC servers, you will see CPU usage being reported as being divided into 8 different CPU states. For example.
%Cpu(s): 13.2 us,  1.3 sy,  0.0 ni, 85.3 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st

These eight CPU states are: “user” (us), “system”, (sy), “nice” (ni), “idle” (id), “iowait” (wa), “hardware interrupt” (hi), “software interrupt” (si), and “steal” (st). top is showing percentage of time server is spending in each of the eight possible states. Of these 8 states, “system”, “user” and “idle” are the 3 main CPU states. The ni state is a subset of us state and represents a fraction of CPU time that is being spent on low priority tasks. The wa state is the subset of the id state and represents a fraction of CPU time that is being spent while waiting for an I/O operation to complete. These 3 main CPU states and the si, hi and st states add up to 100%.

Please note that these are averaged over all CPUs of your HPC server. So if you started an 8 CPU HPC server then top will show utilization averaged across all 8 CPUs. You can press 1 to get per-CPU statistics.

• system (sy)

The “system” CPU state shows the amount of CPU time used by the kernel. The kernel is responsible for low-level tasks, like interacting with the hardware, memory allocation, communicating between OS processes, running device drivers and managing the file system. Even the CPU scheduler, which determines which process gets access to the CPU, is run by the kernel. While usually low, the system state utilization can spike when a lot of data is being read from or written to disk, for example. If it stays high for longer periods of time, you might have a problem. So, for example, if CPU is doing a lot context switching then you will see that the CPU may be spending a lot more time in the system state.

• user (us)

The “user” CPU state shows CPU time used by user space processes. These are processes, like your application, or some management daemons and applications started automatically by Kogence that would be running in the background. In short, every CPU time used by anything other than the kernel is marked “user” (including root user), even if it wasn’t started from any user account. If a user-space process needs access to the hardware, it needs to ask the kernel, meaning that would count towards “system” state. Usually, the “user” state uses most of your CPU time. In properly coded HPC applications, it can stay close to the maximum of 100%

• nice (ni)

The “nice” CPU state is a subset of the “user” state and shows the CPU time used by processes that have a positive niceness, meaning a lower priority than other tasks. The nice utility is used to start a program with a particular priority. The default niceness is 0, but can be set anywhere between -20 for the highest priority to 19 for the lowest. CPU time in the “nice” category marks lower-priority tasks that are run when the CPU has some time left to do extra tasks.

• idle (id)

The “idle” CPU state shows the CPU time that’s not actively being used. Internally, idle time is usually calculated by a task with the lowest possible priority (using a positive nice value).

• iowait (wa)

“iowait” is a sub category of the “idle” state. It marks time spent waiting for input or output operations, like reading or writing to disk. When the processor waits for a file to be opened, for example, the time spend will be marked as “iowait”. Instead, if a task running on a given CPU blocks on a synchronous I/O operation, the kernel will suspend that task and allow other tasks to be scheduled on that CPU. In that case, CPU is not idle and this will not be shown in id or wa states.

• hardware interrupt (hi)

The CPU time spent servicing hardware interrupts.

• software interrupt (si)

The CPU time spent servicing software interrupts.

• steal (st)

The “steal” (st) state marks time taken over by the hypervisior.

How Can I Monitor CPU Utilization on Kogence Cloud HPC Servers?

One can use top or the gnome-monitor utilities to monitor the memory utilization.

Cloud HPC Network Architecture

Is the Kogence Cloud HPC Cluster Connected to Internet?

If you are starting your simulation on a single cloud HPC server, your HPC server is connected to internet. Once your Model is in the running state, you can click on the Visualaizer button on top right corner to connect to your HPC server over the internet under a secure and encrypted channel of SSL/TLS. If you connect a CloudShell to the software stack of your model through the Stack tab of your Model then you can use the CloudShell terminal to pull repositories over the internet using pip, git pull, curl, wget etc.

If you are starting an autoscaling cloud HPC cluster the the master node of the cluster is connected to the internet in same way.

Please refer to Network Architecture document for more details.

How is Network the Architected Among the Compute Nodes?

Cloud HPC clusters you start on Kogence are equipped with standard TCP/IP network as well as infiniband like OS Bypass networks. Please refer to Network Architecture document for more details.

Do the Kogence Cloud HPC Cluster Come with Infiniband Network?

Yes. If you select Network Limited Work Load node then that node is equipped with an OS Bypass network with network bandwidth up to 100Gbps. Please refer to OS Bypass Network and Remote Direct Memory Access documents for more details.

How is the Network Configured Among the Containers?

Please refer to the Container Network document for details. The OS Bypass Remote RDMA Network is also available for workloads inside containers if same is accessible to the HPC server itself.

Cloud HPC Parallel Computing Libraries and Tools

What is BLAS?

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software.

On linux systems, if you don't configure your system properly, you would be using the default GNU BLAS (such as /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1) that is a generic library and is not optimized for the hardware. There are several highly optimized BLAS libraries (such as OpenBLAS, AtlasBLAS, GotoBLAS and Intel MKL ) that can be used instead of the default base libraries. These libraries are optimized to take advantage of the hardware they are run on, and can be significantly faster than the base implementation (operations such as Matrix multiplications may be over 40 times faster). Kogence does this automatically for you based on the hardware you selected on the Cluster tab of your model.

What is BLACS?

The BLACS (Basic Linear Algebra Communication Subprograms) is a linear algebra oriented message passing interface for distributed memory cluster computing. It provide basic communication subroutines that are used in PBLAS and ScaLAPACK libraries. It implements a portable message passing library to higher level libraries such as PBLAS and ScaLAPACK libraries. To provide this functionality BLACS can be built using any of the communication layer such as MPI, PVM, NX, MPL, etc available on the hardware and provide a hardware agnostic interface to higher level libraries. It is important to use hardware optimized communication layer to build BLACS to get most optimized performance for your HPC applications.

What is PBLAS?

PBLAS (Parallel BLAS) is the distributed memory versions of the Level 1, 2 and 3 BLAS library. BLAS is used for shared memory parallelism while PBLAS is used for distributed memory parallelism appropriate for clusters of parallel computers (heterogeneous computing). PBLAS uses BLACS for distributed memory communication. Just like BLAS, on Kogence we automatically configure software/solver/simulators with appropriate hardware optimized versions of PBLAS based on the hardware you selected on the Cluster tab of your model. No action is needed from users.

What is LAPACK?

LAPACK is a large, multi-author, Fortran library for numerical linear algebra. It provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision. LAPACK is the modern replacement for LINPACK and EISPACK libraries. LAPACK uses block algorithms, which operate on several columns of a matrix at a time. On machines with high-speed cache memory, these block operations can provide a significant speed advantage.

LAPACK uses BLAS. The speed of subroutines of LAPACK depends on the speed of BLAS. EISPACK did not use any BLAS. LINPACK used only the Level 1 BLAS, which operate on only one or two vectors, or columns of a matrix, at a time. LAPACK's block algorithms also make use of Level 2 and Level 3 BLAS, which operate on larger portions of entire matrices. LAPACK is portable in the sense that LAPACK will run on any machine where the BLAS are available but performance will not be optimized if hardware optimized BLAS and LAPACK are not used.

On linux systems, if you don't configure your system properly, you would be using the default GNU LAPCK (such as /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1) that is a generic library and is not optimized for the hardware. There are several highly optimized BLAS libraries (such as OpenBLAS/LAPACK, Atlas/LAPACK, GotoBLAS/LAPACK and Intel MKL ) that can be used instead of the default base libraries. If you configure your system properly with hardware optimized LAPACK libraries then your HPC applications will acquire serious boost in performance. Kogence does this automatically for you based on the hardware you selected on the Cluster tab of your model.

What is ScaLAPACK?

ScaLAPACK (Scalable LAPACK) library includes a subset of LAPACK routines redesigned for distributed memory clusters of parallel computers (heterogeneous computing). ScaLAPACK uses explicit message passing for inter-process communication using MPI or PVM (this functionality is provided by the BLACS that is used in ScaLAPACK). Just like LAPACK, ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy that includes the off-processor memory of other processors (or processors of other computers), in addition to the hierarchy of registers, cache, and local memory on each processor. ScaLAPACK uses PBLAS and BLACS. ScaLAPACK is portable in the sense that ScaLAPACK will run on any machine where PBLAS, LAPACK and the BLACS are available but performance will not be optimized if hardware optimized PBLAS and LAPACK are not used.

Kogence use ScaLAPACK automatically whenever appropriate for you based on the hardware you selected on the Cluster tab of your model. No action is needed from users.

What is ELPA

ELPA (Eigenvalues SoLvers for Petaflop-Applications) is a matrix diagonalization library for distributed memory clusters of parallel computers (heterogeneous computing). For some operations, ELPA methods can be used instead of ScaLAPACK methods to achieve better performance. ELPA API is very similar to ScaLAPACK API and therefore it is very convenient for HPC application developers to provide support for both libraries and offer ability to link ELPA in additions to ScaLAPACK during the application build. ELPA does not provide full functionality of ScaLAPACK and is not a complete substitute. Whenever HPC applications deployed on Kogence offer support for ELPA, on Kogence, we provision those applications (e.g. Quantum Espresso) with hardware optimized ELPA.

What is FFTW?

FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data as well as of even/odd data, i.e. the discrete cosine(DCT)/sine(DST) transforms. At Kogence we automatically use the hardware optimized versions of FFTW based on the hardware you selected on the Cluster tab of your model.

What is Intel MKL?

Intel Math Kernel Library (Intel MKL) is a library that includes Intel's hardware optimized versions of BLASLAPACKScaLAPACK, FFTW as well as some miscellaneous sparse solvers and vector math subroutines. The routines in MKL are hand-optimized specifically for Intel processors. On Kogence, Intel MKL is automatically configured and linked whenever appropriate based on the hardware you selected on the Cluster tab of your model. No action is needed from users.

Background

Message Passing Interface (MPI) is a portable message-passing standard designed to function on a wide variety of parallel computing architectures. The standard defines the syntax and semantics of a library of routines useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. HPC software applications, solver and simulators written using MPI library routines and properly compiled and linked against MPI libraries enable multiprocessing. Multiple processes of MPI applications can be run in parallel either on single multi-CPU cloud HPC server or on an autoscaling cluster of multiple multi-CPU cloud HPC servers.

Traditionally, inter-process communication used by MPI is slower. This is expected when two MPI processes of an application are running on different nodes as the communication has to go through the network hardware. Older versions of MPI used the same inter-process communication framework even when these two MPI processes of the application are running on the same multi-CPU server. As opposed to MPI, OpenMP, on the other hand, is a library for multithreading. OpenMP applications enable you to start multiple threads of a single process of the application. A processes, and therefore all its threads, necessarily run on single cloud HPC server. Inter-thread communication is much faster than the inter-process communication as multiple threads of same process have access to same memory and such communication can avoid routing of data exchange through network hardware. This discrepancy in the communication speed when shared memory is available had lead to the adoption of multiprocessing-multithreading hybrid parallel programing constructs. If an application/solver provides support for both MPI and OpenMP then one can use both MPI (multiprocessing) and OpenMP (multithreading) at the same time and is some times known as hybrid parallelism. User is recommended to start as many MPI processes as there are nodes in the cluster and within each process user is recemented to start as many openMP threads as there are CPUs (with hardware hyperthreading) or cores (without hardware hyperthreading) in each node. This MPI/OpenMP approach uses an MPI model for communicating between nodes while utilizing groups of threads running on each computing node in order to take advantage of faster communication enabled by the shared memory multicore/multi-CPU architectures. On Kogence, user can configure such choices very conveniently on the Stack tab of the Model by selecting number of processes and threads independently from dropdown menus. Alternatively, user can use command line options of qsub and mpirun to configure such choices (see below for details).

The MPI-3 standard introduces another approach to hybrid parallel programming that uses the new MPI Shared Memory (SHM) model. The MPI SHM model enables changes to existing MPI codes incrementally in order to accelerate communication between processes on the shared-memory nodes. Any HPC application/solver that supports MPI SHM model, is preconfigured on Kogence to use shared memory communication when two processes of an application are running on same node.

For shared memory architectures across the cluster of nodes, please check OpenSHMEM. Newer version of MPI libraries have started offering capabilities to run OpenSHMEM over MPI. Of course, the HPC application/solver itself must be programmed to support OpenSHMEM.

Furthermore, traditionally, MPI inter-process communication used the standard TCP/IP stack for communication. Newer versions of MPI support various hardware network interfaces either through Open Fabrics Interface (OFI) or through natively implemented support which provides OS Bypass Remote Direct Memory Access capabilities and avoid the inter process communication to go through traditional TCP/IP stack providing significant performance boost. On Kogence, users would not have to worry about the details of hardware and network infrastructure as we will configure appropriate transport layer for best performance for their HPC applications.

Kogence mpirun Wrapper

There are several well-tested and efficient implementations of MPI. On Kogence, currently we support Open MPI (not to be confused with OpenMP), MPICH and Intel MPI (an MPICH derivative) libraries. MPI is, at the least, as old as the world-wide-web itself. This has lead to a lot of legacy features and extremely varied ways in which MPI has been integrated with different tools in the HPC stack. As cloud computing and container based dependency isolation and portability is becoming more popular, the struggles of integrating MPI with more modern Docker based computing frameworks has been widely reported by the industry and academic researchers. In fact, these struggles had lead to the development of an alternate containerization technology called Singularity. Although, singularity solves many of the problems of integrating MPI with containerization, at Kogence we have taken a different approach. We have natively integrated dockers with MPI. On Kogence, users do not need to worry about these integration details. Kogence users can bring their own custom MPI HPC application as docker containers and users will invoke this MPI application just like they would do on any other traditional onprem HPC cluster. All of the Kogence HPC tools such as data management, version management, remote interactive display orchestration, autoscaling of cloud HPC clusters, automatic composition of users container with all other containers on Kogence --- all of these will be automatically be available for any newly added customer docker container.

MPI applications are traditionally invoked either using orterun (in case of openMPI), mpirun, mpiexec, mpiexec.hydra (in case of MPICH and Intel MPI) or through PMI interfaces offered by resource managers (for example, under Slurm, one may be able invoke MPI applications using srun if PMI is properly configured on the system). On Kogence, users will invoke MPI applications using Kogence mpirun wrapper. This should be available in $PATH when you are logged in. You can check by running which mpirun command on a CloudShell. Make sure you are using the Kogence mpirun wrapper (/tmp/software/mpirun) and not the mpirun wrapper that comes packaged with various distributions of the MPI libraries or with the OS. Kogence mpirun wrapper is a powerful integration that allows you to use same invocation constructs to invoke MPI application inside docker container, inside singularity containers or installed on host machine. Lets say you add 2 containers in the Stack tab of your Kogence Model. Lets say container1 has an MPI application built using MPICH and container2 has an MPI application built using openMPI. Once you run this Model on Kogence, both containers will get binaries automatically composed to invoke each other's entrypoints. If you use Kogence mpirun wrapper inside container1 to invoke application binary inside container2, wrapper will properly take care of wrapping the commands with correct MPI command line options, starting correct daemons and executables depending on the version on MPI packaged in the target container. It also automatically detects and provides integration with various resource managers such as Slurm, PBS, SGE and LSF. It also automatically scales the cluster of cloud HPC servers as well as provides integration with container cluster orchestration frameworks docker swarm and kubernettes. This wrapper will automatically detect the version of MPI that is packaged with containerized application and invoke that MPI executable with provided command line options after orchestrating a docker swarm or kubernettes cluster of containers. If you are invoking the MPI application using Stack tab graphical user interface of the Model, then Kogence will automatically choose appropriate MPI command line options for you. Advanced users should use CloudShell or other scripting language to manually invoke mpirun wrapper with command line options to suite their needs. With Kogence mpirun wrapper, you can provide any of the following command line options depending upon the version of MPI packaged in the container that you are invoking. You can check the version of MPI packaged in the container by looking at the Settings tab of the concerned container. OpenMPI OpenMPI Slots • Number of slots is one of the basic concepts in MPI. You can define any number of slots per host in the host file and then provide that host file to MPI using -hostfile option. By defining number of slots, you are letting MPI to schedule that many number of processes on that host. If you do not define it then, by default MPI assumes number of slots to be same as number of core. By using the switch --use-hwthread-cpus, you can override this and tell MPI to assume number of slots to be same as number of CPUs (i.e. the number of hardware threads). Defining any more slots than the number of CPUs will cause a lot of context switching and may degrade the performance of your applications. • With -np option you can request either fewer processes to be launched than there are slots or you can also oversubscribe the slots. Oversubscribing the slots will cause lots of context switching and may degrade performance of your application. One can prevent oversubscription by using the -nooversubscribe option. Oversubscription can also be prevented on per host basis by specifying the max_slots=N in the hostfile (resource managers and job schedulers do not share this with MPI, this only works when you are explicitly providing host file to the MPI). There are alternative ways to specify number of processes to launch. If you don't specify anything but provide a host file then MPI launches as many processes on each host as there are slots. The number of processes launched can be specified as a multiple of the number of nodes or sockets available using -npernode N and -npersocket N options respectively. Most Common OpenMPI Command Line Options • -c, -n, --n, -np <N>: Run N copies of the program across all nodes in a round-robin fashion by slots. If no value is provided for the number of copies to execute (i.e., neither the -np nor its synonyms are provided on the command line), Open MPI will automatically execute a copy of the program on each process slot. • Both -n and -np are also supported by Intel MPI and MPICH. • If you wrap your command with mpirun -np N, for example, then MPI will launch N copies of your command as N processes in a process group in a round-robin fashion by slots. These processes are called rank0, rank1 ... and so on. It is expected that the command you are launching is properly compiled and linked with MPI so as soon as these N copies launch, they can talk to each other and decide which tasks each would be working on. This logic is coded by the application coder. Typically all ranks wait for message from rank0. • If you wrap a non-MPI application command with mpirun --np N then each of the N copies would be doing exactly same thing. If you scheduled another mpirun wrapped command then MPI will launch another set of processes in another process group. These new processes will also be called rank0, rank1 ... and so on within that new process-group. The processes from process-group1 and the processes from process-group2 can only communicate through inter-process communication and not using intra-process communication. They also can not share memory. OpenMPI Command Line Options for Process Mapping to Hardware Resources • Note that none of the following options imply a particular binding policy - e.g., requesting N processes for each socket does not imply that the processes will remain bound to the socket as OS does rescheduling after context switching between processes. One can map the processes to specific objects in the cluster. This is the initial launch of the processes. One can then further bind the processes to objects in the cluster so that when OS does rescheduling after context switching, these processes remain bound to specific cluster objects. • Also note that, depending on value of --np, it is possible that a socket or node, for example, gets more than processes specified by following command options as the round robin scheduling continues. Following command options are round robin mapping parameters and not the options to control total number of processes in each hardware resource. • Following are all deprecated but still supported on Kogence. • -npersocket, --npersocket <N> (deprecated in favor of --map-by ppr:N:socket): On each node, launch N times the number of processor sockets on the node number of processes. N consecutive process will go in one socket, then scheduler moves to next socket and so on in round robin. • The -npersocket option also turns on the -bind-to-socket option. • -N, -npernode, --npernode <N> (deprecated in favor of --map-by ppr:N:node): On each node, launch N consecutive processes and then move to next node and so on in round robin. • Intel MPI and MPICH uses -perhost or -ppn. • -pernode, --pernode (deprecated in favor of --map-by ppr:1:node): On each node, launch one process. Equivalent to -npernode 1. • Following are preferred on Kogence cloud HPC platform. • -map-by ppr:N:socket: Launch N consecutive process per socket then move to next socket in round robin. ppr="process per resource". • -map-by socket: Launch 1 process per socket in round robin. For example, rank0 to go to first socket, rank1 to the next socket and so on until all sockets have one process on all nodes and then it will restart from first socket on first node as needed. • -map-by core: Launch 1 process per core in round robin. • Similarly you can do map-by for slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, node, sequential, distance. • --map-by socket is the default. OpenMPI Command Line Options for Process Binding to Hardware Resources • Binding processes to specific CPUs is also possible. This tells OS that a given process should always stick to a given CPU as OS is doing context switching between multiples processes and threads. This can improve performance if the operating system is placing processes suboptimally. For example, when we are launching less number of processes than the number of CPUs in a node, OS might oversubscribe some multi-core sockets while leaving other sockets idle; this can lead processes to contend unnecessarily for common resources. Or, OS might spread processes out too widely; this can be suboptimal if application performance is sensitive to inter-process communication costs. Binding can also keep the operating system from migrating processes excessively, regardless of how optimally those processes were placed to begin with. • The processors to be used for mapping and binding can be identified in terms of topological groupings - e.g., binding to an l3cache will bind each process to all processors within the scope of a single L3 cache within their assigned location. Thus, if a process is assigned by the mapper to a certain socket, then a --bind-to l3cache directive will cause the process to be bound to the processors that share a single L3 cache within that socket. To help balance loads, the binding directive uses a round-robin method when binding to levels lower than used in the mapper. For example, consider the case where a job is mapped to the socket level, and then bound to core. Each socket will have multiple cores, so if multiple processes are mapped to a given socket, the binding algorithm will assign each process located to a socket to a unique core in a round-robin manner. Alternatively, processes mapped by l2cache and then bound to socket will simply be bound to all the processors in the socket where they are located. • -bind-to socket • Similarly you can do bind-to for slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, and none. • By default, open MPI uses --bind-to core when the number of processes is <= 2, --bind-to socket when the number of processes is >2 and --bind-to none when nodes are being oversubscribed. If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process. OpenMPI Command Line Options for Porting Environment Variables • -x <env>=<value>: Export the specified environment variables to the remote nodes before executing the program. Only one environment variable can be specified per -x option. Existing environment variables can be specified or new variable names specified with corresponding values. For example: mpirun -x DISPLAY -x OFILE=/tmp/out .... The parser for the -x option is not very sophisticated; it does not even understand quoted values. Users are advised to set variables in the environment, and then use -x to export (not define) them. Miscellaneous OpenMPI Command Line Options • -v, --verbose: Be verbose • -V, --version: Print version number. If no other arguments are given, this will also cause orterun to exit. • -use-hwthread-cpus, --use-hwthread-cpus: Use hardware threads as independent CPUs. Note that if a number of slots in a node is not explicitly provided to Open MPI, then the use of this option changes the default calculation of number of slots on a node. • -nooversubscribe: Throw an error if user is trying to oversubscribe the node (i.e. requesting to start more processes in node than the number of slots). • -display-map, --display-map: Display a table showing the mapped location of each process prior to launch. • -display-allocation, --display-allocation: Display the detected resource allocation. OpenMPI Command Line Options That Should NOT Be Used On Kogence • -machinefile, --machinefile, -hostfile, --hostfile <hostfile>: Provide a hostfile to use. • Hostnames in the files are those shown by the hostname command or the IP address using which launcher can access the host. • For openMPI, -hostfile and -machinefile is synonymous. For Intel MPI and MPICH, the format of hostfile and machinefile is different. # This is an example hostfile. Comments begin with # foo.example.com bar.example.com slots=2 yow.example.com slots=4 max-slots=4  • -H, -host, --host <host1,host2,...,hostN>: Launch processes on the listed hosts. • Intel MPI and MPICH uses -hosts. Intel MPI Most Common Intel MPI Command Line Options • -n, -np <N>: Launch N processes in total across all nodes. If the option is not specified, then use the number of cores on each machine as default. • -perhost, -ppn <N>: Launch N consecutive processes on one host before moving to next host. • <N> can also be replaced by all or allcores. all is a synonym for number of CPUs in the host while allcores is a synonym for number of cores in the host. • Unless specified explicitly, the -perhost option is implied with the value set in $I_MPI_PERHOST which is set as allcores by default. This means that Intel MPI, by default, launches first N consecutive processes on first node with N being same as the number of cores in that node before scheduling processes to the next node.
• Notice the syntax difference with open MPI. Open MPI uses -N, -npernode, --npernode <N>.
• If you are working under a job scheduler or resource manager (as determined by looking at certian environment variables) then these options are ignored. See https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/running-applications/job-schedulers-support.html for more details.
• -rr: Same as -perhost 1. rr="round robin". Consecutive processes go to different host and once all hosts are exhausted only then next process goes to host1.
• -nolocal will not run process on localhost.
• Remember that Intel MPI will always go by -n or -np and oversubscribe the nodes if needed. -perhost, -ppn, -rr etc. only control the "number of consecutive" processes on same node.

Intel MPI Command Line Options for Process Mapping and Binding to Hardware Resources

• Intel MPI allows user to specify computing resources for process mapping and binding either using the logical enumeration of CPUs (as provides by the cpuinfo utility) or using heirarchial levels (such as socket, core and CPU) or by topological enumeration. For example, for the case of a 2 socket HPC server with each socket being dual core and each core having 2 CPUs (being hyperthreaded), the logical CPU #, the heirachial CPU # and the topological CPU # would generally be reported like this: Logical Enumeration: 0 4 1 5 2 6 3 7 Hierarchical Levels: Socket 0 0 0 0 1 1 1 1 Core 0 0 1 1 0 0 1 1 Thread 0 1 0 1 0 1 0 1 Topological Enumeration: 0 1 2 3 4 5 6 7
• Intel MPI uses following 7 main environment variables to control process mapping and binding. Default values for the first 6 variables are shown below.
I_MPI_PIN=on
I_MPI_PIN_RESPECT_CPUSET=on
I_MPI_PIN_RESPECT_HCA=on
I_MPI_PIN_CELL=unit
I_MPI_PIN_DOMAIN=auto:compact
I_MPI_PIN_ORDER=bunch
I_MPI_PIN_PROCESSOR_LIST

• The environment variable I_MPI_PIN_PROCESSOR_LIST controls the one-to-one pinning while the I_MPI_PIN_DOMAIN environment variable performs the one-to-many pinning. If I_MPI_PIN_DOMAIN is defined then I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.
• If hyperthreading is on then:
• the number or processes on the node is greater than the number of cores
• no one process pinning environment variable is set
• the "spread" order will automatically be used instead of the default "compact" order.
• I_MPI_PIN_DOMAIN:
• Defines a separate set of CPUs (domain) on a single node. One process will be started per domain in round robin.
• If this is defined then I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.
• 3 different syntax:
• -env I_MPI_PIN_DOMAIN <arg>: Here <arg> can be either core, socket, node, cache1, cache2, cache3, or cache (the largest domain among cache 1/2/3). If, for example, -env I_MPI_PIN_DOMAIN socket is set then the domain size is one socket and number of of domains is same as number of sockets. One process per socket will be started in round robin fashion.
• -env I_MPI_PIN_DOMAIN=<size>[:<layout>]:
• Here <size> can be auto, omp or a number N. auto means domain size is number of CPUs divided by number of requested processes. This is the default. Meaning number of domains is same as number of processes requested. omp specifies the domain size to be same as the setting of OMP_NUM_THREADS. N specifies the domain size to be equal to N CPUs.
• And <layout> can be platform, compact or scatter. compact layout means that the domain members are located as close to each other as possible in terms of common resources (cores, caches, sockets, and so on). This is the default. scatter layout is the opposite of compact layout and implies that the domain members are located as far away from each other as possible in terms of common resources. platform layout implies that the domain members are ordered according to their BIOS numbering (platform-depended numbering).
• As an example, for the case of a 2 socket HPC server with each socket being dual core and each core having 2 CPUs (being hyperthreaded), mpirun -n 4 -env I_MPI_PIN_DOMAIN auto:scatter ./a.out implies domain size=2 (defined by the number of CPUs=8 / number of processes=4) and domain layout=scatter. This means that there are four domains {0,2}, {1,3}, {4,6}, {5,7} defined. Domain members do not share any common resources (refer to the heirarchieal enumeration shown above, cpu #0 and cpu#2 belong to two different sockets).
• As an another example, export OMP_NUM_THREADS=2; mpirun -n 4 -env I_MPI_PIN_DOMAIN omp:platform ./a.out implies domain size=2 (defined by OMP_NUM_THREADS=2) and layout=platform. In this case, four domains {0,1}, {2,3}, {4,5}, {6,7} are defined with domain members (CPUs) have consecutive numbering.
• I_MPI_PIN_ORDER <arg>:
• Here <arg> can be range, scatter, spread, compact, or bunch. bunch implies that the processes are mapped proportionally to sockets and the domains are ordered as close as possible on the sockets. Meaning node that has more sockets will get proportionately more processes and the consecutive processes go to separate domains that share as much as possible with previous domain. This is the default. compact means that the domains are ordered such that adjacent domains have maximum sharing of common resources. This is similar to bunch but here processes are not proportionate to number of sockets. scatter means that the domains are ordered such that adjacent domains have minimal sharing of common resources. spread means that the domains are ordered consecutively with the possibility not to share common resources. range means that the domains are ordered according to the processor's BIOS numbering. This is a platform-dependent numbering.
• This variable defines the order of mapping of MPI processes to the MPI_PIN_DOMAINS. Contrast this variable to I_MPI_PIN_DOMAIN=<size>[:<layout>]. There <layout> defines the organization of CPUs within a domain.
• The optimal setting for this environment variable is application-specific. If adjacent MPI processes prefer to share common resources, such as cores, caches, sockets, FSB, use the compact or bunch values. Otherwise, use the scatter or spread values. Use the range value as needed.
• I_MPI_PIN_CELL <arg>: Here <arg> can be unit (which means a logical CPU) or core (which means a physical core).
• This environment variable has effect on both pinning types: one-to-one pinning through the I_MPI_PIN_PROCESSOR_LIST environment variable and the one-to-many pinning through the I_MPI_PIN_DOMAIN environment variable. The default value rules are: If you use I_MPI_PIN_DOMAIN, then the cell granularity is unit (that is is one member of domain is one logical CPU). If you use I_MPI_PIN_PROCESSOR_LIST, then the following rules apply: When the number of processes is greater than the number of cores, the cell granularity is unit. When the number of processes is equal to or less than the number of cores, the cell granularity is core.
• I_MPI_PIN_PROCESSOR_LIST:
• Does one-to-one-pinning as opposed to I_MPI_PIN_DOMAIN
• One-to-one pinning is not recommended for multi-threaded applications (i.e hybrid parallelism).
• 3 different syntax:
• Syntax1: To pin the processes to CPU0 and CPU3 on each node globally, use the following command: mpirun -genv I_MPI_PIN_PROCESSOR_LIST=0,3 -n <number-of-processes> <executable>. Format like this is acceptable as well: 0,1,2,4-9,10,12,17-19,22.
• Syntax2: I_MPI_PIN_PROCESSOR_LIST=all[:grain=cache3][,shift=<shift>][,preoffset=<preoffset>][,postoffset=<postoffset>. The all can be replaced by allcores or allsocks. The allcores is default. The grain is the granularity. The grain value must be a multiple of the procset (ie all, allcores or allsocks) value. Otherwise, the minimal grain value is assumed. The default value is the minimal grain value. The grain can be <n> (meaning n cpus), fine, core, cache, cache1, cache2, cache3, socket, half (i.e. half of a socket), third, quarter or octavo (i.e. 1/8th of a socket).
• Syntax 3: I_MPI_PIN_PROCESSOR_LIST=all:map=scatter. You can use map=bunch, map=spread as well. Instead of all, you can also use allcores or allsocks. bunch means that the processes are mapped proportionally to sockets and the processes are ordered as close as possible on the sockets. scatter means that the processes are mapped as remotely as possible so as not to share common resources: FSB, caches, and core. spread means that the processes are mapped consecutively with the possibility not to share common resources.

Intel MPI Command Line Options for Porting Environment Variables

• -env <envar> <value>
• -envall
• -envnonne

Running Intel MPI and OpenMP Hybrid Parallel

• source <install-dir>/env/vars.sh release: Make sure "thread safe" version of MPI is being used.
• export I_MPI_PIN_DOMAIN=omp
• export OMP_NUM_THREADS=<N>

or alternatively:

• mpirun -n 4 -env OMP_NUM_THREADS=4 -env I_MPI_PIN_DOMAIN=omp ./myprog

Miscellaneous Intel MPI Command Line Options

• -info: all other options ignored and info printed
• -V, -version: Prints version.
• -v, -verbose: Print debug info.
• -print-rank-map: Print MPI rank map.
• -iface ib0: Use ib0 network interface.

Intel MPI Command Line Options That Should NOT Be Used On Kogence

• -hostfile, -f <hostfile>. Hostfile format is just one hostname or IP address per line:
host1
host2
host3

-machinefile or -machine <machinefile>. Machinefile format on the other hand can specify number of processes (i.e slots in open MPI terminology) per host. Process count is assumed 1 if not specified.
host1:4
host2:8
host3
host4

• -hosts <host1,host2,host3>: Launch processes on the specified list of hosts.
• -host <host1>: Can only specify one host. Notice the difference compared to open MPI syntax.

Intel MPI with external PMI  (instead of Hydra)

• Default process manager is Hydra. Hydra supports many different process launcher (with default being ssh) which can be changed using the -bootstrap command line option.
• Although not documented by Intel, Intel MPI can be invoked using PMI, PMI2 or PMIx process managers instead of Hydra if system is properly configured. Please refer to https://slurm.schedmd.com/mpi_guide.html#intel_mpi.

MPICH

Most Common MPICH Command Line Options

• MPICH uses hydra process manager just like Intel MPI. As such all command line options that Intel MPI uses can also be used with MPICH except that the -bootstrap option used in Intel MPI to use non-default process launcher is called -launcher in MPICH. Please refer t https://www.mpich.org/static/downloads/3.4/mpich-3.4-README.txt for more details.
• Machinefile format of MPICH is same as that of Imtel MPI.
• For process mapping and process binding, instead of environment variables used by Intel MPI, MPICH uses -bind-to and -map-by just like openMPI.

What is OpenMP?

OpenMP is an application programming interface (API) supports shared-memory multithreading programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. If an application/solver provides support for both MPI and open MP then one can use both MPI (multiprocessing) and openMP (multithreading) at the same time and is some times known as hybrid parallelism.

What is Charm++

Similar to MPI, Cham++ is a machine independent parallel programming system. Programs written using this system will run unchanged on MIMD computing systems with or without a shared memory. It provides high-level mechanisms and strategies to facilitate the task of developing even highly complex parallel applications. Charm++ can use MPI as the transport layer (aka Charm over MPI), or you can run MPI programs using Charm++ Adaptive-MPI (AMPI) library (aka MPI over Charm). Charm++ can also use TCP/IP stack or other OS bypass network protocols for messaging needs. The communication protocols and infrastructures supported by Charm++ are UDP, MPI, OFI, UCX, Infiniband, uGNI, and PAMI. Charm++ programs can run without changing the source on all these platforms.

MPI is much more widely adopted standard and as of today only few HPC applications, such as NAMD, uses Charm++. Charm++ does offer some benefits over MPI such relative ease of programing to developers, fault tolerance and separating out machine layer so that application developer does not distinguish whether user executes the application leveraging multi-threading parallelism on shared memory SMP architectures or multi-processing parallelism on distributed memory clusters.

The Charm++ software stack has three components - Charm++, Converse and a machine layer. In general, machine layer handles the exchange of messages among nodes, and interacts with the next layer in the stack - Converse. Converse is responsible for scheduling of tasks (including user code) and is used by Charm++ to execute the user application. Charm++ is the top-most level in which the applications are written.

Building Charm++ and Charm++ Applications

Charm++ can be build to suite different OS, CPU and Networking architectures (i.e. different networking protocols as detailed above). SMP capabilities can be requested as an option during build. SMP capabilities allow hybrid parallelism (multi-threading + multi-processing). "Multicore" builds do not carry any transport layer and therefore can only be used on single SMP machines. Benefit of multicore build over SMP build (of course this comparison only applicable when user wishes to run applications only on single machine) is that in SMP mode, Charm++ spawns one communication thread per process and therefore there is only less thread per process available for application parallelism. Multicore builds can use all available CPUs for worker threads. Also, multicore builds only support multi-threading. Similarly, any SMP capable version of Charm++ can be built with integrated open MP runtime. Charm++ creates its own thread pool and if you use compiler provided open MP library while compiling your own application, such as NAMD, then you can end up oversubscribing. It is also better to have a single runtime control all threads/process (unlike MPI+Open MP). Charm++ built with integrated OMP runtime that can control OMP threads in the application and you don't need to provide compile time OMP options while compile your own applications. To use this library on your applications, you have to add -module OmpCharm in compile ﬂags to link this library instead of the compiler-provided library in compilers. Without -module OmpCharm, your application will use the compiler-provided Open MP library which running on its own separate runtime. Also you don’t need to add -fopenmp or -openmp with gcc and icc compilers respectively. These ﬂags are already included in the predeﬁned compile options when you build SMP capable Charm++ with OMP option.

Charm++ Terminology

In terms of physical resources, Cham++ assume the parallel machine consists of one or more physical nodes, where a physical node is a largest unit over which cache coherent shared memory is feasible (and therefore, the maximal set of cores per which a single process can run). Each physical node may include one or more processor chips (sockets), with shared or private caches between them. Each chip may contain multiple cores, and each core may support multiple hardware threads (SMT/HTT for example).

A chare is a charm++ object such as an array that you want to distribute and do some computation.

Charm++ recognizes two logical entities:

• PE (processing element) = can be a thread (smp mode) or a process (non-smp mode).
• Logical node = one OS process.

SMP Mode: is like MPI + Open MP. You can have multiple processes on each physical node and each process can have multiple threads. In Charm++ terminology, you will say physical node is partitioned into logical nodes (one process = one logical node) and each local node has multiple PEs (in SMP mode, PE = thread).

Non-SMP Mode: is like just MPI. Each PE is a process. Physical node is partitioned into many logical nodes. Each logical node is running one PE (that is one process).

Just like in MPI case, depending on the runtime command-line parameters, a PE (i.e. thread or process) may be associated with a "core" or with a Intel "hardware thread".

Charm++ program, a PE is a unit of mapping and scheduling: each PE has a scheduler with an associated pool of messages. Each chare is assumed to reside on one PE at a time. Physical nodes may be partitioned into one or more logical nodes. Since PEs within a logical node share the same memory address space, the Charm++ runtime system optimizes communication between them by using shared memory.

A Charm++ program can be launched with one or more (logical) nodes per physical node. For example, on a machine with a four-core processor, where each core has two hardware threads, common conﬁgurations in non-SMP mode would be one node per core (four nodes/PEs total) or one node per hardware thread (eight nodes/PEs total). In SMP mode, the most common choice to fully subscribe the physical node would be one logical node containing seven PEs-one OS thread is set aside per process for network communications. When built in the “multicore” mode that lacks network support, a comm thread is unnecessary, and eight PEs can be used in this case. A communication thread is also omitted when using some high-performance network layers such as PAMI.

Alternatively, one can choose to partition the physical node into multiple logical nodes, each containing multiple PEs. One example would be three PEs per logical node and two logical nodes per physical node, again reserving a communication thread per logical node.

It is not a general practice in Charm++ to oversubscribe the underlying physical cores or hardware threads on each node. In other words, a Charm++ program is usually not launched with more PEs than there are physical cores or hardware threads allocated to it. To run applications in SMP mode, we generally recommend using one logical node (i.e one OS process) per socket or NUMA domain.

Non-SMP Mode: PE = Logical Node (= OS Process)

Example: 1 phsyical node, 1 scoket, 4 core, 8 hardware threads:

8 logical nodes = 8 OS processes. Each running on an hardware thread.

Logical Node = one OS Process. Contain more than one PE (i.e. more than one OS thread)

Example: 1 phsyical node, 1 scoket, 4 core, 8 hardware threads:

7 PE (each running as a OS thread).

There will be only one process.

Reserving 1 hardware thread for network communication.

If you had multiple physical nodes then there will be one process per node with 7 PEs on each physical node and there will be one communication thread per process.

There are various benefits associated with SMP mode. For instance, when using SMP mode there is no waiting to receive messages due to long running entry methods. There is also no time spent in sending messages by the worker threads and memory is limited by the node instead of per core. In SMP mode, intra-node messages use simple pointer passing, which bypasses the overhead associated with the network and extraneous copies. Another beneﬁt is that the runtime will not pollute the caches of worker threads with communication-related data in SMP mode.

However, there are also some drawbacks associated with using SMP mode. First and foremost, you sacriﬁce one core to the communication thread. This is not ideal for compute bound applications. Additionally, this communication thread may become a serialization bottleneck in applications with large amounts of communication. Finally, any library code the application may call needs to be thread-safe.

In Kogence, your workflow launch is already optimized considering these tradeoffs when evaluating whether to use SMP mode for your application or deciding how many processes to launch per physical node when using SMP mode.

Charm++ Program Options

charmrun is the standard launcher for Charm++ applications. When compiling Charm++ programs, the charmc linker produces both an executable file and an utility called charmrun, which is used to load the executable onto the parallel machine.

Unlike mpirun, you can pass options either to charmrun or to the program itself. For example, charmrun ++ppn N program program-arguments or charmrun program ++ppn N program-arguments.

Execution on platforms which use platform specific launchers, (i.e., mpirun, aprun, ibrun), can proceed without charmrun, or charmrun can be used in coordination with those launchers via the ++mpiexec (see below). Programs built using the network version (i.e. non multi-core) of Charm++ can be run alone, without charmrun. This restricts you to using the processors on the local machine. If the program needs some environment variables to be set for its execution on compute nodes (such as library paths), they can be set in .charmrunrc under home directory. charmrun will run that shell script before running the executable. Or you can use, ++runscript option (see below).

In all of the below,

• "p" stands for PE (can be thread or process depending on smp of non-smp modes)
• "n" stands for "node" (i.e. logical node = process)
• "ppn" stands for PE per logical node (i.e. PE per Process)
Single Machine Options to Charm++ Program

+autoProvision: Automatically determine the number of worker threads to launch in order to fully subscribe the machine running the program.

+oneWthPerHost: Launch one worker thread per compute host. By the definition of standalone mode, this always results in exactly one worker thread.

+oneWthPerSocket: Launch one worker thread per CPU socket.

+oneWthPerCore: Launch one worker thread per CPU core.

+oneWthPerPU: Launch one worker thread per CPU processing unit, i.e. hardware thread.

+pN: Explicitly request exactly N worker threads. The default is 1.

Cluster Options to Charm++ Program

Note: Notice 2 ++ sign below.

++autoProvision: Automatically determine the number of processes and threads to launch in order to fully subscribe the available resources.

++processPerHost N: Launch N processes per compute host.

++processPerSocket N : Launch N processes per CPU socket.

++processPerCore N : Launch N processes per CPU core.

++processPerPU N : Launch N processes per CPU processing unit, i.e. hardware thread.

++nN: Run the program with N processes. Functionally identical to ++pN in non-SMP mode (see below). The default is 1.

++pN: Total number of PE (threads or processes) to create. In SMP mode, this refers to worker threads (where n∗ppn=p), otherwise in non-smp mode it refers to processes (n=p). The default is 1. Use of ++p is discouraged in favor of ++processPer* (and ++oneWthPer* in SMP mode) where desirable, or ++n (and ++ppn) otherwise.

++nodelist: File containing list of nodes. Dile is of the format:

host <hostname> <qualifiers>

such as

host node512.kogence.com ++cpus 2 ++shell ssh

++runscript : Script to run node-program with. For example: $./charmrun +p4 ./pgm 100 2 3 ++runscript ./set_env_script. In this case, the set_env_script is invoked on each node before launching the program pgm. ++local: Run charm program only on local machines. ++mpiexec: Use the cluster’s mpiexec job launcher instead of the built in ssh method. This will pass -n$P to indicate how many processes to launch. If -n $P is not required because the number of processes to launch is determined via queueing system environment variables then use ++mpiexec-no-n rather than ++mpiexec. An executable named something other than mpiexec can be used with the additional argument ++remote-shell runmpi, with ‘runmpi’ replaced by the necessary name. To pass additional arguments to mpiexec, specify ++remote-shell and list them as part of the value after the executable name as follows: $ ./charmrun ++mpiexec ++remote-shell "mpiexec --YourArgumentsHere" ./pgm. Use of this option can potentially provide a few benefits: Faster startup compared to the SSH approach charmrun would otherwise use; No need to generate a nodelist file; Multi-node job startup on clusters that do not allow connections from the head/login nodes to the compute nodes. At present, this option depends on the environment variables for some common MPI implementations. It supports OpenMPI (OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE), M(VA)PICH (MPIRUN_RANK and MPIRUN_NPROCS or PMI_RANK and PMI_SIZE), and IBM POE (MP_CHILD and MP_PROCS).

CPU Affinity Options

+setcpuaffinity: Set cpu affinity automatically for processes (when Charm++ is based on non-smp versions) or threads (when smp). This option is recommended, as it prevents the OS from unnecessarily moving processes/threads around the processors of a physical node.

+excludecore<core#>: Do not set cpu affinity for the given core number. One can use this option multiple times to provide a list of core numbers to avoid.

++oneWthPerHost: Launch one worker thread per compute host.

++oneWthPerSocket: Launch one worker thread per CPU socket.

++oneWthPerCore: Launch one worker thread per CPU core.

++oneWthPerPU: Launch one worker thread per CPU processing unit, i.e. hardware thread.

++ppn N: Number of PEs (or worker threads) per logical node (OS process).

+pemap L[-U[:S[.R]+O]][,...] : Bind the execution threads to the sequence of cores described by the arguments using the operating system’s CPU afﬁnity functions. Can be used outside SMP mode. PU = hardware thread. By default, this option accepts PU indices assigned by the OS. The user might want to instead provide logical PU indices used by the hwloc (see here for details). To do this, prepend the sequence with an alphabet L (case-insensitive). For instance, +pemap L0-3 will instruct the runtime to bind threads to PUs with logical indices 0-3.

+commap p[,q,...] : Bind communication threads to the listed cores, one per process.

To run applications in SMP mode, we generally recommend using one logical node (i.e one OS process) per socket or NUMA domain. ++ppn N will spawn N threads in addition to 1 thread spawned by the runtime for the communication threads, so the total number of threads will be N+1 per node. Consequently, you should map both the worker and communication threads to separate cores. Depending on your system and application, it may be necessary to spawn one thread less than the number of cores in order to leave one free for the OS to run on. An example run command might look like: ./charmrun ++ppn 3 +p6 +pemap 1-3,5-7 +commap 0,4 ./app <args>. This will create two logical nodes/OS processes (2 = 6 PEs/3 PEs per node), each with three worker threads/PEs (++ppn 3). The worker threads/PEs will be mapped thusly: PE 0 to core 1, PE 1 to core 2, PE 2 to core 3 and PE 4 to core 5, PE 5 to core 6, and PE 6 to core 7. PEs/worker threads 0-2 compromise the ﬁrst logical node and 3-5 are the second logical node. Additionally, the communication threads will be mapped to core 0, for the communication thread of the ﬁrst logical node, and to core 4, for the communication thread of the second logical node. Please keep in mind that +p always speciﬁes the total number of PEs created by Charm++, regardless of mode. The +p option does not include the communication thread, there will always be exactly one of those per logical node. Recommendation is to not do oversubscription. Dont run more threads/processes than the number of hardware threads available.

What are Resource Managers and Job Schedulers?

Resource Manager and Job Scheduler phrases are used interchangeably. Kogence autoscaling cloud HPC clusters are configured with the Grid Engine resource manager. Any command that you invoke on the shell terminal would be executed on the master node. On the other hand, if you want to send your jobs to the compute nodes of your cluster using the shell terminal then please use the qsub command of the Grid Engine resource manager like below:

qsub -b y -pe mpi num_of_CPU -cwd YourCommand


Be careful with the num_of_CPU. It has to be either same or less than the number of CPU's in the compute node type that you selected in the Cluster tab of your model. Grid Engine offers a lot of flexibility with many command line switches. Please check the qsub man page. Specifically you might find following switches to be useful:

• -pe mpi: Name of the "parallel environment" on Kogence clusters.
• -b y: Command you are invoking is treated as a binary and not as a job submission script.
• -cwd: Makes the current folder as the working directory. Output and error files would be generated in this folder.
• -wd working_dir_path: Makes the working_dir_path as the working directory. Output and error files would be generated in this folder.
• -o stdout_file_name: Job output would go in this file.
• -e stderr_file_name: Job error would go in this file.
• -j y: Sends both the output and error to the output file.
• -N job_name: Gives job a name. Output and error files would be generated with this name if not explicitly specified using -o and/or -e switches. You can also monitor and manage your job using the job name.
• -sync y: By default qsub command returns control to the user immediately after submitting the job to cluster. So you can continue to do other things on the CloudShell terminal including submitting more jobs to the scheduler. This option tells qsub to not return the control until the job is complete.
• -V: Exports the current environment variables (such as \$PATH) to the compute nodes.

You can use qready to check if cluster is ready before submitting the jobs to cluster. You will submit simulations using the qsub command. You can use qwait job_name to wait for job to finish before doing some post prepossessing for submitting a dependent job. We recommend adding a CloudShell to your software stack in the Stack tab GUI. This CloudShell terminal can be used to monitor and manage cluster jobs. You monitor cluster jobs by typing qstat on the terminal. You can delete jobs using qdel jobID command. You can monitor compute nodes in the cluster by typing qhost on the terminal. qhost lets you monitor the performance and utilization of the compute nodes in your cluster.

Cloud HPC Hardware Available on Kogence

Kogence Cloud HPC Servers for CPU Limited Workloads (CLWL nodes)

For the CPU limited workloads (CLWL) we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Skylake (SKU 8124M) and Intel Cascade Lake (SKU 8275CL) microarchitectures that are optimized for best compute performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. These nodes are most appropriate for workloads that are limited by the CPU and do not perform excessive hard-disk read/write operations, do not generate excessive internode network communication and do not use excessive amount of RAM. These nodes can clocked at up to 3.6GHz speeds (non-AVX instructions, with all cores active) and can provide more than 4 teraflops of performance per node (Dual Precision, FP64).

Kogence Cloud HPC Servers for Memory Limited Workloads (MLWL nodes)

For the memory limited workloads (MLWL) we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Skylake microarchitectures (SKU 8175M) that are optimized for best memory performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. Compared to the CPU limited workload (CLWL) nodes, these MLWL nodes provide larger amount of RAM per unit CPU and provide larger L3 cache per socket but, for non-AVX instructions, these nodes are clocked at lower speed compared to the CLWL nodes. These nodes are most appropriate for workloads that are limited by the RAM, are moderately CPU intensive, do not perform excessive hard-disk read/write operations and do not generate excessive internode network communication. These nodes can clocked at up to 3.1GHz speeds (non-AVX instructions, with all cores active) and can provide more than 4 teraflops of performance per node (Dual Precision, FP64).

Kogence Cloud HPC Servers for Network Limited Workloads (NLWL nodes)

For the network limited workloads (NLWL) we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Skylake microarchitectures (SKU 8124M) that are optimized for both the best network performance as well as the best CPU performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. Compared to the CPU limited workload (CLWL) nodes, these NLWL nodes provide up to 100Gbps of network bandwidth. These nodes are most appropriate for heavily communicating jobs such as the distributed memory MPI jobs that generate large internode network traffic but do not perform excessive hard-disk read/write operations and do not need excessive amount of memory. These nodes can clocked at up to 3.4GHz speeds (non-AVX instructions, with all cores active) and can provide more than 3 teraflops of performance per node (Dual Precision, FP64).

Kogence Cloud HPC Servers for Storage Limited Workloads (SLWL nodes)

For the storage limited workloads (SLWL) we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Skylake (SKU 8124M) and Intel Cascade Lake (SKU 8275CL) microarchitectures that are optimized for both the best compute as well as best storage performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. These nodes are most appropriate for workloads that are limited by CPU performance and perform a lot of hard-disk read/write operations but otherwise do not generate excessive internode network communication and do not use excessive amount of RAM. Compared to the CPU limited workload (CLWL) nodes, these SLWL nodes come with NVMe solid state storage and can provide up to 1400K IOPS and up to 5500 MiBps of strorage throughput. These nodes can clocked at up to 3.6GHz speeds (non-AVX instructions, with all cores active) and can provide more than 4 teraflops of performance per node (Dual Precision, FP64).

Kogence Cloud HPC Servers for Hybrid Workloads (HWL nodes)

For hybrid workloads (HWL) that are constraint from multiple performance bottlenecks we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Cascade Lake (SKU 8259CL) microarchitectures that are optimized for best over all performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. These nodes are most appropriate for hybrid workloads that are limited by the CPU, require large memory, perform large amount of hard-disk read/write operations and generate large internode network communication such as through large distributed memory MPI jobs. These nodes can clocked at up to 3.1GHz speeds (non-AVX instructions, with all cores active) and can provide more than 4 teraflops of performance per node (Dual Precision, FP64).

Kogence Cloud HPC Servers for GPU Workloads (GWL nodes)

For GPU workloads (GWL) we offer NVIDIA Volta V100 Tensor Core based enterprise HPC class compute nodes. These nodes come either the Intel Xeon Broadwell or the Intel Xeon Scalable (Skylake) CPUs. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. While Broadwell based nodes are designed for primarily the GPU workloads, the Skylake based nodes are designed for more demanding hybrid workloads that need good storage IOPS performance, good network performance and good CPU performance in addition to Volta GPUs.

HPC Software, Simulator and Solver Applications Available on Kogence

On Kogence, any software application, solver or a simulation tool, each being referred to as a Simulator, is deployed as an independent entity, referred to as a Container, that can be connected to Models using the Stack tab of the Model and invoked to do some tasks/computations on input files/data provided under the Files tab of the Model. Containers cannot contain user specific or Model specific data or setup. It has to be an independent entity that can be linked and invoked from any Model on Kogence.

We use the term Simulator in much wider sense than what is commonly understood. In the context of definition above, Matlab is a simulator and so is Python. Both are deployed in independent Container. On Kogence we use Docker container technology.

To review the list of publicly available software application, solver or a simulation tool on Kogence, please refer to the Software_Library page.

Concept of Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software or a simulator; we do not launch a Container; we launch a Model. Model is connected to a cluster and a stack of software and simulators

Cloud HPC Network Architecture on Kogence

On Kogence container cloud high performance computing (HPC) platform your requested compute infrastructure is preconfigured with various types of data networks depending upon the requirements of the model you are running. For example, all software containers requested by your workflow are automatically composed so that all containers in your workflow can talk to each other. Kogence cloud HPC platform allows you to run a multi-software workflows on an autoscaling multi-node cluster of HPC severs orchestrated on the cloud infrastructure as and when needed by your model. See container network to understand architecture of networking among individual containers in your workflow. Refer to the microprocessor architecture to understand the OmniPath (a type of Remote Direct Memory Access network) network among the sockets of an cloud HPC server and the network among the cores within a socket. Your cloud HPC clusters will also be pre-configured with OS bypass network (e.g. OpenFabircs Infiniband) for your multi-node MPI workloads, if and as needed by your model. Please refer to the OS bypass network document for the details.

There is an Internet Protocol (IP) network configured among all the nodes of your HPC cloud cluster. Resource managers, job schedulers as well as MPI libraries that we automatically preconfigure your clusters with, are dependent on an existing IP network among the nodes. This document describes the IP network among different components of Kogence platform architecture.

Kogence Cloud HPC Platform - IP Network Architecture

Geographical Regions, Data Center Zones, Proximity Placement Groups and Resource Groups

Kogence cloud HPC platform is, by design, a cloud agnostic multi-region, multi-data-center app. Your cloud HPC infrastructure is orchestrated on hardware from many different data centers provided by many different cloud infrastructure providers across the globe. Scope and choices are configurable and is done on per customer basis at the time of deployment depending on the geographical location and needs of each enterprise customer. We take data security very seriously. We only use the data centers whose operations are regularly audited by independent firms against an ISAE 3000/AT 101 Type 2 Examination standard. Please refer to the Kogence security policy document for detailed criteria of our data center selection.

Our data centers are distributed in some 20+ geographical regions and organized in some 50+ data center zones. Each geographical region has 2 or 3 data center zones and each zone is mapped to 2 or 3 physical data centers. Data center zones can be treated as logical data centers that are fail safe. Even if a physical data center has power failure, for example, the logical data center (i.e. the zone) is still up and running. Your HPC resources are always orchestrated in a single data center zone. Exact physical data center that is used for your HPC resources changes from each request to next request.

Geographical level isolation could be important for workloads with compliance and data sovereignty requirements where guarantees must be made that user data does not leave a particular geographic region. Geographical level orchestration configuration tuning is also important for workloads that are latency-sensitive and need to be located near users in a particular geographic area.

All nodes within a Kogence autoscaling cloud HPC cluster are orchestrated within a proximity placement group that provides low latency and high throughput IP network among the nodes in addition to OS Bypass network for multi node MPI simulations.

Enterprise clients with redundancy needs or those with users distributed across the globe can choose to configure deployment of Kogence Grand Central App to span across data center zones and geographical regions. In each data center zone, participating data centers are connected to each other over redundant low-latency private network links. Likewise, all data center zones in a region communicate with each other over redundant private network links. These intra- and inter-data-center zone links are needed for fail safe redundant data replication and automatic backups for storage and databases.

Each enterprise client's data, backups as well as all other infrastructure resource reside in a separate resource group. On request from customer's authorized representative (such as at the termination of Kogence license/contract) the entire resource group can be deleted with no copies of customer data left anywhere.

Virtual Private IP Network - Host and Network Based Intrusion Prevention

Please refer to the Fundamentals of IP Network if you need to familiarize yourself with the IP network terminology used below.

Your organization's Kogence autoscaling cloud HPC clusters are orchestrated within a private IP network with a size /16 IPv4 CIDR block (example: 10.0.0.0/16). Each enterprise customer's simulations are launched in a separate private network. Simultaneously running simulations from either the same user or different users of same organization are launched in separate subnets with /24 IPv4 CIDR block size (example: 10.0.0.0/24) of the private network. Network and subnet sizes as well as geographical regions and logical data center zones are configurable for each enterprise customer's deployment.

Each node also provides host based intrusion prevention. The network interface controllers on all hosts define well specified and required inbound and outbound traffic and all other traffic is blocked. Each node host is also firewalled at OS level with corresponding iptable entries. At an application level, all Kogence cloud based docker containers run in secure non-privileged mode. For under standing the network architecture among then containers please see separate document on Container Networking.

In typical non-HPC optimized cloud service, network interface controllers on all hosts would be virtual network interfaces. The software defined virtual network interfaces seriously impact the network bandwidth and increase the latency to a level that they become performance limiting factor in, for example, a distributed memory MPI applications. On Kogence cloud HPC platform, if offer nodes specifically designed for the Network Limited Work Loads. These nodes use hardware based network virtualization. The network card on the host hardware is specifically designed to separate the network traffic on all hosted virtual machines at the hardware level without impacting the network bandwidth and increasing latency. Network Limited Work Loads nodes on Kogence cloud HPC platform provide upto 100Gbps of network bandwidth. Please also check separate document on OS Bypass network for multi node distributed memory MPI simulations.

The Fundamentals of IP Network

The Internet Protocol suite (IP suite, sometimes also know as the TCP/IP suite) is the set of communications protocols widely used for the internet (as well as other networks). IP Protocol suite should not be confused with IPv4 or IPv6 which is just one set of protocols of the Internet Layer of the entire IP suite. The suite was originally part of the UNIX operating system and was later integrated with all common OS and is now maintained by the Internet Engineering Task Force.

The suite contains many different protocols --- two of the most important ones are TCP (Transmission Control Protocol) and IP (Internet Protocol). The IP suite, like many other protocol suites, may be viewed as a set of 4 main layers of logic and services. From lowest to highest, these are the Link Layer, the Internet Layer, the Transport Layer, and the Application Layer. This 4 layer IP model should not be confused with the popular OSI reference model that describes the same protocol suite in more granular 7 layers. Each Layer can use any of the multiple possible protocols. The OS kernel comes with many utilities to coordinate data packet exchange between different Layers. These kernel utilities are able to coordinate between different Layers with each using any of the acceptable protocol.

Each layer solves a set of problems involving the transmission of data, and provides a well-defined service to the upper layer protocols based on using services from some lower layers. Upper layers are logically closer to the user and deal with more abstract data, relying on lower layer protocols to translate data into forms that can eventually be physically transmitted. A very helpful abstraction resulting from layered protocol mental model is to be able think that application client (e.g. ssh client) is directly communicating with the application server (e.g. ssh daemon) on remote host and the communication is entirely focusing on ssh functionality such as authorization, authentication and encryption. In reality, the ssh data packets at sender host go through a stack of software/logic (part of OS, device drivers, firmaware or hardware) and are wrapped by TCP headers, IP headers and then Ethernet headers. On the recipient host side, these data packets again go through same stack of software/logic in reverse order and these headers are removed one by one. But before data packet is successfully handed over to an upper layer, the lower layers on the sender and recipient host may exchange many data packets back and forth several times to make sure that the data is correct and reliable. Similarly, before the data packets actually reach the desired destination it may actually hop through multiple network devices that upper Layers are completely agnostic of. This way upper layer can focus on higher level logic (e.g. ssh authorization, authentication and encryption) and higher level source and destination addresses and not worry about lower level details such data integrity, data reliability, data routing etc.

Link Layer Standards: IEEE 802 LAN

The lowest layer is the Link Layer that consists of software and hardware needed to transmit data in actual physical media. Here we will review one of the most important Link Layer standard -- the LAN standards.

A local area network (LAN) is a collection of devices that are physically connected to the same hub, switch or group of interconnected switches, while all being configured to use the same low level (i.e. the Link Layer) communication protocol (i.e. the IEEE 802 protocols). IEEE 802 is a family of IEEE standards dealing with local area networks (LAN) and metropolitan area networks (MAN) and is maintained by the IEEE 802 LAN/MAN Standards Committee (LMSC) consisting of several working groups with individual working group focusing on a particular LAN/MAN standard/technology. The groups are numbered from 802.1 to 802.22. Among the 802 standards, the most widely used standards are for the Ethernet (802.3),  Wireless LAN (Wi-Fi, 908.11), Bridging and Virtual Bridged LANs (802.1). Other historical LAN protocols include ARCNET and AppleTalk. In modern days, LAN is synonymous with networks built to communicate using IEEE 802 standards while the wired LAN is synonymous with Ethernet.

The services and protocols specified in IEEE 802 map to the Link Layer of the IP protocol suite. The Link Layer is typically subdivided into two sub-layers (Data Link and Physical) in the seven-layer Open Systems Interconnection (OSI) networking reference model. IEEE 802 further splits the OSI Data Link Layer into two sub-layers named logical link control (LLC) and media access control (MAC). For applications running on one host to be able to communicate with other hosts within the network, one also need to define other communication standards and protocols. For example, Link Layer protocols such as LAN protocols will not ensure data packets arrive in order, they will not ensure reliability and they will not do any error corrections. Moreover, higher level concepts of ports and sockets needs to be defined. For example, if you open to browser tabs as clients that are requesting different pages from a webserver, then your client host needs to be able to track and figure out how to deliver the content to correct tab. This is done using concepts of ports and sockets as will see below.

There can be non-LAN IP-networks. IP protocol suite is designed to work with any Link Layer technology which can be different from LAN technologies defined by the IEEE 802 standards. Conversely, you can also potentially access your LAN devices from within the LAN by building a non-IP protocol stack on top of the low level Link Layer LAN protocols defined by the IEEE 802 standards. But if you need to access these devices over the internet then almost certainly you will need to connect your LAN (or your non-LAN network) using network hardware that support IP protocol stack on top of LAN or other Link Layer standards (of course, then hosts devices on your network will themselves need to have the IP software stack support).

Within the IP protocol suite terminology, LAN is a network (or a subnet as we will see below) as seen by the IP Layer. Therefore, LANs that use IP protocol suite are sometimes also referred more specifically as the IP-networks. From the Link Layer perspective, this network may be divided into network segments that are connected together by switches/bridges and hubs/repeaters to form a LAN. The IP Layer does not know anything about the concept of network segments. The network segments are single collision domains, meaning only one sender can send message on the segment at a time and can lead to congestion if a segment consists of large number of hosts. Switches and bridges help reduce the congestion by breaking LANs into several smaller collision domains (i.e. smaller network segments). To function properly, LANs are configured so that any device can send a broadcast message that can be seen by all devices on the LAN.  For this reason, LANs are often referred to as broadcast domains.

Link Layer Network Hardware: Network Interface Controller, Repeater, Hub, Bridge and Switch

Network Interface Controller (NIC)

An NIC goes by several common names such as the a network interface card, a network adapter, a LAN adapter or simply a network interface. Any device that needs to be connected to a network needs to have at least one NIC. If device is attached to multiple networks then it can have more than one NIC. The NIC is a Link Layer device and provides physical access to a networking medium and, for LAN and other similar networks, provides a low-level addressing system through the use of MAC addresses that are uniquely assigned to each NIC in a network.

A device must have one NIC for each network (not network type) to which it connects. For instance, if a host attaches to two LAN networks, it must have two network cards. When you install a new NIC, you are creating a new network interface. However some network interfaces may be logical and are not necessarily associated with an NIC. For instance, the loopback interface has no NIC associated with it. Loopback interface is used by the host to send the message back to itself. Similarly, virtual network interfaces are also not associated with any specific NIC. One can create virtual network interfaces using kernel utilities as described in a separate section below.

IP protocol suite supports many types of network interfaces including IEEE 802.3 LAN (i.e. the Ethernet LAN), Token-ring (tr) (another common LAN protocol), Serial Line Internet Protocol (SLIP, which is used for serial connections), Loopback (lo, the loopback interface is used by a host to send messages back to itself), FDDI, Serial Optical (so, the Serial Optical interface is for use with optical point-to-point networks using the Serial Optical Link device handler), Point-to-Point Protocol (PPP, the Point to Point protocol is most often used when connecting to another computer or network via a modem) and Virtual IP Address (VIPA).

One should carefully notice the difference between network, network interface and network driver. Defining a network interface can be seen as defining a specific device driver that needs to be used for traffic going through that network interface and defining some other basic properties so that the IP layer knows which traffic should be routed to that network interface. Even virtual network interfaces have network drivers defined for them. We use the term NIC for both the actual network interface hardware as well as the virtual network interfaces that are defined in software and don't have hardware associated with them.

Each NIC typically will know its own IP, its subnet mask and IP of the default gateway. The subnet mask defines the set of IP addresses that belong to the subnet to which the NIC is connected to while the default gateway is the default route for all those IP addresses that do not belong to the subnet. One can also define additional static routes for the NIC. A static route is nothing but a way of specifying a route for the traffic that does not belong to the network to which NIC is connected but at the same time it must also not go through the default gateway. One can add a static route to a different network that cannot be accessed through your default gateway. As discussed below, these routes for all NICs are maintained by the kernel in a routing table.

An NIC connects the device to a network (or a subnet). Each NIC has a unique private Link Layer address (which is the MAC address in case of a LAN Link Layer) as well as a unique private IP Layer address (which is the private IP address in case of IPv4 or IPv6 IP Layer). Both of these addresses are unique within the network (or the subnet) that the NIC is connecting to.

There is a prevalent misconception that the MAC addresses are unique across all network devices on a network as well as across all networks across the entire planet. This is not correct. First of all MAC addresses are associated to network interfaces (e.g. network cards), i.e. the hardware that allows hosts to connect to a network. These hardware can be swapped. Network interfaces can also be implemented in software, meaning they can be virtual. MAC address can easily be spoofed as well by writing a piece of code that publishes different MAC address to any request for identification. From the protocol perspective, MAC addresses are only supposed to be unique within a network (e.g. within a LAN). As we will see below, if you connect multiple routers together then in that network of routers, it is the MAC address of routers that needs to be unique. The MAC addresses of each devices behind each router should be unique as well but 2 devices behind 2 different routers are allowed to have same MAC address. By the way, same is true for the private IP addresses as well.

IEEE manages the MAC addresses of hardware sold by manufacturers, so they tended to be globally unique. This is done because sellers do not know how you will connect the device to the network. The hardware they are selling might itself act as a router and then its MAC address needs to be unique or if it is used behind a router then it does not need to be unique. With network hardware being virtualized, things are starting to be much more complex.

IP Aliasing, Virtual Network Interfaces and Sub-Interface

Most OS kernel allow two private IP addresses to be assigned to a single network interface. Irrespective, the concepts of direct and indirect routing discussed below in a separate section work the same way.

Assigning single and multiple IP addresses manually to a device is a potential security risk as it may expose traffic from a different subnet if a different subnet IP address is added by mistake. Basically, the NIC on which 2 different subnet IP addresses are added would become a gateway for those 2 subnets and can route traffic between the two subnets. Kogence cloud HPC platform uses DHCP server for automatic assignment of IP addresses to NIC and we do not assign multiple IP addresses to same NIC.

One can also define virtual interfaces or sub-interfaces of an existing hardware network interface using the ip utility, for example, as discussed later. Sub-interfaces are virtual interfaces configured such that the traffic from each sub-interface goes through the primary hardware NIC. On the other hand, virtual interfaces do not have to be connected to an actual hardware NIC. As an example, the bridge networks defined by the docker daemon by default only route traffic between the containers on same host and are not connected to hardware NIC. In that case, host machines routing table will have a route to the bridge network so even though host can communicate to the containers, the other devices on the host network cannot reach the containers. For more details please see Container Networking document.

Processes that use the IP address of the virtual interface as their source address can send packets through any real hardware network interface (if connected) that provides the best route for that destination. Incoming packets destined for a virtual IP address are delivered to the process regardless of the interface through which they arrive. The routing algorithm is agnostic of the fact that the source IP address is an IP address of a virtual interface and works just like as discussed below in a separate section.

Repeaters and Hubs

A repeater is one of the simplest 2-port network device. A hub is simply a multi-port repeater. Hubs and repeaters are basically signal amplifiers and help extend the range and the physical separation between the hosts. There is always a limit on the physical separation between hosts that is defined in the IEEE 802 standards. That said, a basic repeater is not actually used in a network hooked up with 10/100BaseT cables, it used to be found with thinnet LANs.

Hub broadcasts packets on all ports (except from the port it received the signal) that it receives on any one port. Every hosts then checks the MAC address to see if packet was meant for it or not. Hubs make collision domains in network. Meaning that collisions are possible if two device try to send data simultaneously. Each host should use CSMA/CD to check if nobody is writing before attempting send a packet. In hubs, bandwitdh between all ports is shared. Meaning that if one port is sending a lot of data then others may not be able to send anything. Hubs can be daisy chained. Hub will create a bus topology (not star) network. This is true (logically) even if they are daisy chained. The entire network (hooked by one or more hubs) is considered as only one network segment. Entire network segment is a single collision domain.

Network segment is a layer 2 concept (Data Link sublayer of the Link Layer). As far as IP Layer is concerned, all network segments belong to same "network" or same "sub network". On the other hand, sub network or super network are concepts in IP layer. Sub-networks have different IP addresses (or different IP address ranges). IP range of a sub-network is a subset of the IP range of its super-network.

Hubs used to be a simple way of making local area networks (LAN) but nowadays hubs are almost completely replaced by switches.

Switches and Bridges

A bridge is a 2-port network device and a switch is simply a multi-port bridge.

Since a network segment is a single collision domain, if you have a large number of hosts on a single network segment, we may have a lot of congestion and traffic may be choked. To reduce the network congestion we may organize hosts in smaller network segments and then connect all network segments to a switch. As discussed above, each segment may have several nodes hooked to an independent hub. And all hubs can then be connected to a switch.

A bridge/switch is a little smarter version of hub/repeater. A switch/bridge does look inside the destination MAC address field of a data packets. It keeps a table of mapping between the MAC addresses of connected devices and the switch/bridge port to which each network segment is connected. Once the MAC address table is populated, a bridge/switch will also know which devices (MAC address) are in which segment (a switch/bridge port). Until the entire MAC table is populated, switch will broadcast packets to all its ports. If switch recognizes the MAC address in the address field of the data packet then it only passes data to its desired port and hence desired network segment and not to all of its ports/network-segments/devices. But if does not recognize the address then it forwards it to all its ports. This way bridge/switch is able to reduce congestion.

Switches are the Link Layer devices and they do not know anything about the IP addresses. Switched do not look for IP addresses. As we will see later, the IP address to MAC address mapping is maintained by network interface software that is part of the IP Layer software suite and is typically part of the OS.

Bridge/switch forms a star topology of network. A bunch of network segments connected together through bridges/switches represents a homogeneous single network as far as IP Layer is concerned. They are different network segments only at the Data Link layer level. Network-segments and MAC addresses are also Link Layer concepts.

Switch generally provides a dedicated bandwidth to each of its ports. Meaning that one of the ports can not hog all the bandwidth and can not stop other devices from communicating with each other.

Internet Layer Standards: IPv4

Internet Layer is responsible for routing data packets between the internet gateways. Link layer takes care of transporting data from host to the internet gateway and the transport layer takes care of trasporting data from host to the application.

IPv4, IPv6, ICMP and IGMP are some common examples of Internet Layer protocols. Internet Layer protocol defines the concepts of IP addresses and routing algorithms. Two versions of the Internet Protocol (IP) are very common: IP Version 4 (IPv4) and IP Version 6 (IPv6). Each version defines an IP address differently. Because of its prevalence, the generic term IP address typically still refers to the addresses defined by IPv4.

Subnet, IPv4 CIDR, Private Address Space, Network Address Translation and Port Forwarding

Early network design, when global end-to-end connectivity was envisioned for communications with all Internet hosts, intended that IP addresses be uniquely assigned to every computer or device connected to the internet (directly or indirectly). IPv4 addresses were 32 bit divided into 4 segments of eight bits (known as "octets") and typically written as A.B.C.D where A, B, C and D are each numbers between 0-255. A was treated as the address of LAN (for example, ARPANET was network number 10) and B, C and D identified a unique host on the LAN. Total available address space under this scheme was about 4B.

Of course this could not work with the current scales of the internet. A combination of private address space, subnetting and network address translation (NAT) is now common in all modern networks to allow the scale of current internet.

Subnets

IP networks are hierarchical. Meaning a network can be broken down into subnets (a subnet is in itself a proper IP network). Everything behind a router is a subnet. A group of interconnected routers can form a network of subnets. A host that wants to connect to two or more subnets will need as many number of NIC with each NIC connecting to one subnet. Each NIC will receive a separate IP address from each subnet.

One should note that a subnet is in itself a proper IP network. Therefore the terms subnet and network is often used interchangeably and meaning should be clear from the context.

Subnets have a range of IP addresses. We usually don't write 1.2.0.0 - 1.2.255.255, instead, we shorten it to 1.2.0.0/16. Hosts in the subnet can be assigned one of these as private IP address (private within that subnet). Each number between the dots in an IP address is actually 8 binary digits (00000000 to 11111111) which we write in decimal form (between 0 and 255) to make it more readable. The /16 means that the first 16 binary digits is the network address of the subnet, in other words, the 1.2.*.* part is the the network address and last 16 can vary among the devices on the subnet. This means that any IP address beginning with 1.2.*.* is part of the subnet: 1.2.3.4 and 1.2.5.50 are in the subnet, and 1.3.1.1 is not. In this example, the subnet mask of the subnet is 255.255.0.0.

We usually use subnets ending in /8, /16 and /24 that makes it easier to understand even though any length is allowed. For example, 10.0.0.0/8 is a big subnet containing any address from 10.0.0.0 to 10.255.255.255 (over 16 million addresses). while 10.0.0.0/16 is smaller, containing only IP addresses from 10.0.0.0 to 10.0.255.255. 10.0.0.0/24 is smaller still, containing addresses 10.0.0.0 to 10.0.0.255. And 10.0.0.0/32 subnet contains only one IP address i.e. 10.0.0.0.

The Internet Engineering Task Force (IETF) has directed the Internet Assigned Numbers Authority (IANA) to reserve the following IPv4 address ranges for private networks 10.0.0.0 - 10.255.255.255 (10/8); 172.16.0.0 - 172.31.255.255 (172.16/12) and 192.168.0.0 - 192.168.255.255 (192.168/16). Hosts in all private subnets can use these IPv4 addresses simultaneously. We like to call these private IP addresses. These address spaces are not routed on global internet routers. These are only routed by the private routers attached to these private subnets.

If network is designed as consisting multiple subnets, then it is recommended to use small and separate private IP address space for each of these subnets. That way routers and gateways will be able to route traffic between subnets as needed without any Network Address Translation (NAT). In route to route traffic between subnets that use same private address space one would need to configure routers and gateways to use NAT and PAT as discussed below. Traffic going into the internet and coming back would always need NAT (and potentially PAT).

An address for a destination host is specified simply by an IP address (and not by mask). As we discussed later in a separate section on routing, the network device's kernel routing table contains the subnet masks of all subnets that this network device can access through all the NICs available on this network device. If network of multiple subnets is properly configured with non overlapping private address spaces then the destination IP address will only fall in one of the subnets or none of them. If it is in one of the subnets then packet is sent to appropriate NIC that connects to that subnet. NIC uses the Layer 2 (LAN) protocols to wrap it with LAN packet and directly send it to proper MAC address. If it is not in any subnets then it is passed to the MAC address of the router (i.e. the default gateway) and then router will either send it to another subnet (if router knows that that address is in another subnet) or to the default gateway. But it will go to the gateway only if it is not one of the private addresses as defined by the IANA.

Network Address Translation and Port Forwarding

One should note that the hosts inside the subnet cannot be accessed from the internet even with NAT in place. For example, you cannot do ssh either to the public IP address of router or the private IP address of a subnet from any host outside of the subnet. One can enable port forwarding in the router to handle such use cases. One can define routing rules in the router to specify that all traffic coming to the port 22 and on the public IP address of the router should be forwarded to specific port (could be port 22 itself) of specific private IP address in the subnet that has the ssh daemon running on that specific port.

In addition, hosts in the subnet can, of course, also be directly connected to internet (meaning it will have a public IP address in addition to a private subnet IP address). Each such host will have at least 2 NIC, one connecting it to the subnet and other connecting it to the internet gateway.

In all these network architecture examples, internet gateway can be replaced by the virtual private gateway if the subnet has to be behind a corporate VPN.

Classless Inter-Domain Routing

With private address spaces and private subnets as discussed above, the Classless Inter-Domain Routing (CIDR) replaced the classful routing in IPv4 starting in about 1993. As discussed above, in CIDR, the private address spaces are described by combination of IP address and a subnet mask such as a.b.c.d/16. All network interfaces in a network should have an IP address from the subnet that it connects to. DHCP servers make sure that all hosts in a subnet get proper private IP addresses within that subnet. If network admin are assigning IP address manually to each NIC then they need to ensure consistency otherwise packets will not route properly.

With CIDR, subnets can be any arbitrary size as determined by the subnet mask (i.e. they do not have any class). In past, IP addresses were classful with first few digits of IP address determining the class of the IP address. Some were part of very large class while others were part of smaller class. Depending upon the class of the IP address, global routers will route the packets to a subordinate router. This is not used any more with CIDR.

Routers

Routers are known as layer 3 devices as they operate on third layer (the network layer) of OSI model. Most popular data protocol used in layer 3 is IP. Sometimes these may also be known as IP routers. Router connects two (or more) networks/subnets. Both networks can have multiple devices in it. Both networks independently assign their own private IP addresses to each of their devices. Router gets an IP addresses from each of these networks that it is connecting. Routers typically maintain routing tables so that they can decide fast routes for sending packets.

Routers have significant amount of logic/functionality built in. Modern routers serve several functions. Simplest routers act like a switch, a network gateway, a DHCP server and a NAT gateway at the same time. There can be more features and functionality.

IP networks are hierarchical. Meaning a network can be broken down into subnets. Everything behind a router is a subnet. A group of interconnected routers can form a network that is network of subnets. Note that subnet itself is a proper IP network, so these terms are often used interchangeably.

As basic router functionality, it connects two networks. It has two NIC, one for each network. Each of these NIC receives an IP address from the network they are connected to. If router is connected to internet then, one of these IP address is know as its public IP address and other as its private IP address. Router looks into both the source and destination IP address fields of packets it receives for routing. If IP address is in one of the 2 networks then it forwards it to that network otherwise it forwards it to the default gateway.

When switched ON, the DHCP server functionality in the router assigns IP addresses to devices on the subnet.

As a switch (for the private subnet) to it keeps a table of MAC address, connected port and assigned private IP addresses of each connected device. It also keeps information about the next closest router connected to it. This closest router is typically called a "default gateway". And this table is known as routing table.

Routers also act as NAT gateway allowing, for example, one internet connection (one public IP address) to be shared between multiple devices in the subnet. Switch/hub creates network. It does not connect two networks. It can not help sharing internet. A switch also does not have DHCP capability. DHCP is usually performed by a router.

Gateways

Gateway implies two main functionality. Firstly, as the name suggests, it is an entry door for a network/subnet. Secondly, it may be used for protocol translation. For example, two networks that are being connected run on two different protocols then a gateway may be needed to do the translation. With large majority of networks nowadays being TCP/IP networks, gateways typically just imply being an entry door for a network/subnet.

In this sense every router is a gateway for the subnets that are connected to the router. Router is the default gateway for those subnets. Meaning any packet that is destined for an IP address not in the subnet is sent to the router. The kernel routing table of the router itself would define another default gateway. Any address that is not in all the subnets that router connects are forwarded to that default gateway (may also be known as the internet gateway depending on the context). This next closest router is the gateway of the other network.

In this sense, a router is basically a combined function of a gateway and a switch. Every router is a gateway but every gateway is not necessarily a router (it may just be a switch or just have a route to the switch of target subnet).

Transport Layer Standards: TCP and UDP

The TCP (for the point-to-point communication) and the UDP (for the multicasting or broadcasting) are two most common examples of Transport Layer protocols. Transport Layer encapsulates the application data blocks and passes it to the lower Internet Layer. In abstract sense, just like a browser appears to be directly communicating to webserver at Application Layer level, transport layer of host talks to the transport layer of the server directly. Transport layer protocols make sure that they can deliver error free/reliable data to the application layer.

TCP (one example of the Transport Layer protocols) deals with data integrity and correct sequencing of data. The Internet Layer (lower layer) deals with routing the packets. Both the Transport Layer and Internet layer protocols are typically implemented within the OS kernel. They are not essential components of OS, though. For example, you can compile linux without networking even though that is very rare. On the other hand, Application Layer programs are in the user space and not part of OS kernel. Link Layer protocols are typically implemented in hardware (special chipsets, firmware) or as device drivers. OS still includes components of Link Layer (e.g. network interface software) that allows IP suite stack to be able to work with many different Link Layer protocols such as Ethernet or Wi-Fi LAN etc.

TCP packets are called segments while UDP packets are called datagrams. TCP is fairly elaborate protocol. Such protocols are known as connection-based protocols (point-to-point as opposed to multicasting, read below). TCP first sends some byte stream in order to establish a connection. It then sends data as numbered segments, receives and reorders them. It also provides error detection using error detection codes and prevents lost packets through automatic repeat requests, flow control and congestion control.

On the other hand, UDP is a very simple protocol, and does not provide virtual circuits, nor reliable communication, delegating these functions to the application program that is built to utilize UDP transport instead of TCP transport. TCP is used for many protocols, including HTTP web browsing and email transfer. UDP may be used for multicasting and broadcasting, since retransmissions are not possible to a large amount of hosts. TFTP (trivial file transfer protocol, a little brother to FTP), DHCPCD (a DHCP client), multiplayer games, streaming audio, video conferencing, etc. typically use UDP transport. For unreliable applications like games, audio, or video, you just ignore the dropped packets, or perhaps try to cleverly compensate for them in the Application Layer. Why would you use an unreliable underlying protocol? For speed. It's way faster to fire-and-forget than it is to keep track of what has arrived safely and make sure it's in order and all that. If you're sending chat messages, TCP is great; if you're sending 40 positional updates per second of the players in the world, maybe it doesn't matter so much if one or two get dropped, and UDP is a good choice.

Just like Application Layer protocol HTTP introduced the concept of a URL, the Transport Layer introduces the concept of ports and sockets, the Internet Layer introduces the concept of IP address and the Link Layer introduces the concept of MAC address.

Transport layer can receive data from several different application layers such as FTP, HTTP etc. Hence, TCP encapsulation headers will contain the port number of the application together with some error control bits etc.. Port number is used to decide which data belongs to which application. Different applications talk and listen to different ports. One should note that the IP address is NOT part of the TCP header. IP address are part of Internet Layer headers not the transport layer headers.

When applications communicate via IP they must specify not only the target's IP address but also the "port address" of the application. A port address uniquely identifies an application. Standard network applications, on the server side, use standard port addresses. For example, HTTP (the web) is port 80, HHTPS is port 443, SSH is port 22 telnet is port 23, SMTP is port 25 and so on. Ports under 1024 are often considered special, and usually require special OS privileges to use. For example, a non-privileged user can not start a wenserver to listen to port 80. Non privileged user can start the webserver on a non-privileged port. These registered port addresses can be seen in /etc/services.

Just like IPv4 and IPv6, the TCP and UDP protocols are part of OS kernel. Protocols are implemented using a concept known as sockets. When Unix programs do any sort of I/O, they do it by reading or writing to a file descriptor. A file descriptor is simply an integer associated with an open file. A file can be a network connection, a FIFO, a pipe, a terminal, a real on-the-disk file, or just about anything else. Everything in Unix is a file! So when you want to communicate with another program over the Internet you're gonna do it through a file descriptor --- this is Socket. You make a call to the socket() system routine to open a socket. It returns the socket descriptor, and you communicate through it using the specialized send() and recv() socket calls. Since, it's a file descriptor, you might as well use the normal read() and write() calls to communicate through the socket but send() and recv() offer much greater control over your data transmission because send(), rec() will pass it to TCP/UDP sublayer. TCP uses stream sockets while UDP uses datagram sockets.

Sockets are not limited to Unix systems. All modern operating systems implement a version of the POSIX socket interface, for example, even the Winsock implementation for MS Windows, developed by unaffiliated developers, closely follows the POSIX standard. Most popular POSIX socket API is written in the C programming language and most other programming languages provide similar interfaces that are typically written as a wrapper library based on the C API. Concept and context of sockets is actually wider than the IP protocol suite in the sense that IP protocol suite communication channels are simply implemented using different socket domain called AF_INET (for IPv4) and AFNET6 (for IPv6) domain sockets. There are other socket domain that are all implemented based on the same POSIX socket API. For example, Unix domain socket (AF_UNIX) is a data communications endpoint for exchanging data between processes executing on the same host operating system (also known as inter-process communication or IPC) and these are also interfaced with the same sockets API that is used for IP protocol suite as well as all other supported network protocols. In addition to socket domain, one also needs to specify the socket type. For example, Stream Sockets (SOCK_STREAM for TCP reliable stream-oriented service) and Datagram Sockets (SOCK_DGRAM for UDP datagram service).

Relationship between network end point address (i.e. the IP address + port combination) and the socket is analogous to the relationship between a file and a file handle/descriptor. When an applications/processes opens read/write pipe to a file, it receives a new file descriptor. Similarly when a new connection is created to same network address a new socket is created. For example, lets consider an example of a HTTP webserver running on the server host (say IP a.b.c.d and port 80) and several remote client browsers running on the different client hosts (say 1.2.3.4 port 30050 and 5.6.7.8 port 60070) establishing separate connections to the server. On the server host, when HTTP daemon is being deployed, socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) method will be used to create an endpoint for communication specifying the domain, type and communication protocol. Method will return a file descriptor, sockfd, for the socket on success. One would then use the method bind(sockfd, a.b.c.d:80, addrlen) bind a socket to an HTTP server address. After a socket has been bound with an address, listen(spckfd,backlog) prepares it for incoming connections (backlog specifies the maximum number of pending client connections). Only one socket, among all sockets associated to a given address, can be in the "listen" state ensuring that only one server/daemon services the requests. When an application is listening for stream-oriented connections from other hosts, it is notified of such events and must initialize the connection using the accept(sockfd,1.2.3.4:30050 or 5.6.7.8:60070,addrlen) function to accept connection request events from 1.2.3.4:30050 or 5.6.7.8:60070. On the server host machine, the accept() creates a new socket for each accepted connection and return returns the new socket descriptor for the accepted connection, or -1 if an error occurs. Once a connection is accepted, it is dequeued (does not count against backlog). All further communication with the client now occurs via this new socket descriptor.

Clients will themselves use socket() to first create a socket on its side and get the clientsockfd. Then it will use connect(clientsockfd,a.b.c.d:80,addrlen) to connect to a remote server that is already listening for incoming connections. Once connection is successful, send() and recv(), or write() and read(), or sendto() and recvfrom(), are used for sending and receiving data to/from a remote socket. To use many of socket features, you need to use the send() / recv() family of system calls rather than write() / read(). close() causes the system to release resources allocated to a socket. gethostbyname() and gethostbyaddr() are used to resolve host names and addresses. select() is used to pend, waiting for one or more of a provided list of sockets to be ready to read, ready to write, or that have errors. poll() is used to check on the state of a socket in a set of sockets. The set can be tested to see if any socket can be written to, read from or if an error occurred. getsockopt() is used to retrieve the current value of a particular socket option for the specified socket. setsockopt() is used to set a particular socket option for the specified socket.

Application Layer Standards

FTP, HTTP, POP3, IMAP, SMTP, SSH, Telnet, rlogin, DNS (domain name service) etc. are some common application layer protocols. Application layer programs are usually configured as client and server, with each running on one of the two hosts that want to communicate to exchange data or information.

Lets consider a simple example. The HTTP (hyper-text transport) protocol is one of the simplest simplest examples of a protocol used between application-to-application. In this scenario a web browser running on one of the internet host machine is considered as the HTTP client and a webserver program (such as Apache, Microsoft IIS or Nginx) running on another internet host machine hosting the website that is listening to the HTTP requests coming from the client browser is considered as the server. Client will send the HTTP address (i.e. the URL) of the document it wants in certain predefined format. And server will send that document back in certain predefined format. This exchange protocol is HTTP protocol.

Both applications (i.e. the client and the server) are not worried about loss of data or error in data. Both applications are assuming that the Transport Layer has taken care of data reliability. So the HTTP protocol only deals with the format of the request message that client will send and the format of the response message that server will send back.

Network Routing Details

Kernel Routing Table

When an application on a network device wants to send a message to a destination device, and if the device has multiple NIC, the IP Layer utilities should be able to find out which NIC to use. The OS kernel maintains a routing table (known as the kernel forwarding table or sometimes simply as the kernel routing table) for this purpose. NICs are Link Layer devices and use Link Layer addressing system. Therefore, in addition, once the NIC has been identified, the OS kernel needs to be able to find the Link Layer address of the destination. As discussed above, the ARP table is used for the that purpose.

The kernel routing table consists of a series of entries. Each entry consists of multiple fields. For example, the table may look like this:

10.0.246.0 255.255.255.0 GW le0

10.0.118.0 255.255.255.0 local le0

default N/A GW le0

In this example, there is only one network interface le0. There are 3 routes defined for this network interface le0 in the kernel routing table. As described later in a separate section, kernel routing table can be printed using kernel utilities such as ip r list. Kernel prepares this list by combining routes defined for each of the NICs defined on the host. Each NIC will at the lest specify the subnet that that NIC connects to. Alternatively, NIC might receive this information from a DHCP server when NIC connects to a network as described above. In addition, at least one of the NICs may also define a default route for the IP addresses that are not in the subnet that NIC connects to. This typically called the default gateway. NIC may also define additional static routes, typically those are also gateways to other network or subnets.

Direct Routing Using ARP and Routing Table

Lets consider an example of the direct routing to understand how packets are routed to a destination device that is connected directly to the same network as the source device.

• The IP layer of the machine SO receives a packet addressed to the machine DEST at the IP address 10.0.118.4. Notice that the IP layer headers of the packet always contain the true ultimate destination address 10.0.118.4 of DEST. The issue that we are currently discussing is the procedure the OS kernel uses to find out how to prepare the Link Layer headers.
• In order to prepare the Link Layer headers, SO consults its kernel routing table. Lets take the example of above described routing table with only one network interface le0 and 3 routes defined for this network interface le0 in the kernel routing table.
• SO applies each netmask to the destination IP address 10.0.118.4 until it finds a match with the destination address. For example, applying the mask 255.255.255.0 to the destination IP address 10.0.118.4 results in 10.0.118.0 which matches the second destination in the routing table. If no match is found, SO uses the third default entry (known as the default gateway).
• Having found a destination match, SO uses the gateway and interface fields of the second entry for preparing the Link Layer packet (i.e. for preparing the Ethernet headers in this case). SO addresses the Link Layer packet for the gateway of that entry and transmits that Link Layer packet through the network interface of that entry. Please notice that the IP Layer headers will always have the IP address 10.0.118.4 of the true destination DEST.
• In this case, the gateway is local -- meaning the local network. Therefore, in this direct routing example, SO prepares the Link Layer packet also with the true ultimate destination address DEST. But Link Layer routing needs Link Layer address of DEST. In this case, the second entry tells the kernel that the interface is to an Ethernet, therefore SO does a lookup in the ARP table to translate the IP address for DEST to a MAC address for DEST
• When SO transmits the Link Layer packet on the Ethernet with the MAC address of DEST in the header, it is directly received by DEST.
• The IP Layer of DEST looks at the IP address in IP Layer headers and decides that the packet has already reached its destination and no further routing is necessary.

Indirect Routing Using ARP and Routing Table

Next, lets consider an example of the indirect routing to understand how packets are routed to a device not directly connected to the local network as the source device.

• The IP layer of the machine SO receives a packet addressed to the machine DEST2 at the IP address 10.0.246.3. Again, notice that the IP layer headers of the packet always contain the true ultimate destination address 10.0.246.3 of DEST2. The issue that we are currently discussing is the procedure the OS kernel uses to find out how to prepare the Link Layer headers.
• SO consults its kernel routing table, which is same as the one shown above.
• As described in the previous example, SO applies each netmask to the IP address 10.0.246.3 until it finds a match with the destination address. In this example, it will match the first entry. Therefore, SO uses the gateway and interface fields of the first entry.
• Having found a destination match, SO uses the gateway and interface fields of the first entry for preparing the Link Layer packet. SO addresses the Link Layer packet for the gateway of that entry and transmits that Link Layer packet through the network interface of that entry. Routing table tells the kernel that the gateway for that entry is GW and the interface is le0 (which kernel knows that as an Ethernet interface). Please notice that the IP Layer headers will always have the IP address 10.0.246.3 of the true destination DEST2.
• When IP Layer of GW receives the packet, it reads the ultimate destination IP address as DEST2. Finding that the address is not its own, and because GW is configured as a router, it consults its kernel routing table as SO did above. GW finds that the ultimate destination address can be reached via the local gateway and sends the packet out the local Ethernet interface addressed to the ultimate destination MAC address.

Kernel Network Management Utilities ip and arp

On Kogence Cloud HPC platform, users do not have privileges to manage/change the network settings but they can use these utilities to check the network configuration.

Two basic kernel utilities are popular for exploring and managing networks -- ifconfig and ip. The ifconfig is a legacy utility that is part of the net-tools package and is not recommended anymore. The ip is a newer recommended uility that is paty of the iproute2util package.

The basic syntax of ip utility is:

ip [ OPTIONS ] OBJECT { COMMAND | help }

where OBJECT is the object you want to manage/modify which can be

• link (l) - for displaying and modifying network interfaces
• address (a or addr) - for displaying and modifying IP addresses of network interfaces
• route (r) - for displaying and modifying the routing table.
• neigh (n) - for displaying and modifying neighbor objects (ARP table).

For example, when operating with the link object the commands take the following form:

ip l [ COMMAND ] dev IFNAME [COMAMND]

The most commonly used COMMANDS used when working with the link objects are: showsetadd, and del. For example, ip link show will show all available network interfaces while ip link show dev eth0 will show the details of the eth0 interface. The command ip link set eth0 down will disable the eth0 network interface while ip link set eth0 up will enable the same interface.

When operating with the address object the commands take the following form:

ip a [ COMMAND ] ADDRESS dev IFNAME

The most frequently used COMMANDS of the address object are: show, add, and del. For example, ip a show will show IP addresses of all network interfaces while ip addr show dev eth0 will IP address(es) of the eth0 network interface. The command ip address add 192.168.121.45/24 dev eth0 will assign the IP address 192.168.121.45 in a /24 subnet to the interface eth0. Similarly, ip address del 192.168.121.45/24 dev eth0 will remove the assigned IP address 192.168.121.45 from eth0 interface.

Similarly, working with the route object, ip r list will list all the routes in the kernel's routing table while ip r list 172.17.0.0/16 will display the route for the subnet 172.17.0.0/16.

If we want to add a route that specifies the IP address 192.168.121.1 as the gateway for the subnet 192.168.121.0/24 , we will use the command ip r add 192.168.121.0/24 via 192.168.121.1 while ip r add 192.168.121.0/24 dev eth0 will add a route to subnet 192.168.121.0/24 that can be reached on device eth0. The command ip r add default via 192.168.121.1 dev eth0 will add a default route via the local gateway 192.168.121.1 for the device eth0.

For permanently adding route to the network interface eth0, one would need to edit /etc/network/interfaces/route-eth0 to append something like 172.10.1.0/24 via 10.0.0.100 dev eth0 and then restarting the network service using systemctl restart network.service.

The kernel utility arp is useful for exploring the ARP table. The syntax takes the form:

arp [-v] [-i if] [-t type] -a [hostname]

By default arp will print full ARP table. -v is for verbose printing. -i, -t and -a can be used to filter results specific to an interface, specific to a type of hardware or specific host. List of possible hardware types (which support ARP) are ash(Ash), ether(Ethernet), ax25(AMPR AX.25), netrom (AMPR NET/ROM), rose (AMPR ROSE), arcnet (ARCnet), dlci (Frame Relay DLCI), fddi (Fiber Distributed Data Interface), hippi (HIPPI), irda (IrLAP), x25 (generic X.25), eui64 (Generic EUI-64).

By the way, ip utility allows one to enable or disable ARP on an NIC through ip link set dev eth0 arp on or ip link set dev eth0 arp off.

Container Networking on Kogence

Docker Networking Fundamentals

One of the reasons why Kogence uses Dockers as the HPC containerization technology is that the Docker provide rich network isolation features that are not available in other HPC containerization technologies such as Singularity. Docker container running on same host or on different hosts of a cluster can be connected as well as to the host on which they are running. This allows Kogence to orchestrate HPC workloads on containers such that these workloads are not even aware that they are running inside Docker. Similarly, in the case of distributed memory multi-node workloads such as MPI workloads are also not aware whether their connected peer workloads are also Docker workloads or not. Whether your Docker hosts run Linux, Windows, or a mix of the two, we can use Docker to manage them in a platform-agnostic way.

Docker’s networking subsystem is pluggable, using drivers. Docker offers several drivers such as bridge, host, macvlan, and overlay by default, and provide core networking functionality. On Kogence Container Cloud HPC platform we use each of these depending upon the workload you launch. User remains agnostic to the details and her workload automatically runs on a container cloud cluster orchestrated with the most optimal choice for her specific workload.

Docker Bridge Network

If Docker container are started with default bridge network then the containers will use a separate network namespace and the host network interfaces, routing tables, ARP tables etc will not be visible inside the container.

Bridge network works with a single host only. It provides network connectivity between the host and the containers running on that single host.

Docker daemon creates a virtual switch on the host. This virtual switch typically shows up as the docker0 network interface in the list of interfaces on the host. The docker0 interface is connected to a private subnet and receives the first IP address of that private address space. As with any other LAN hardware, this switch (i.e. the docker0 interface) also gets a MAC address. The host is able to receive and send packets using this network interface on the connected subnet either using Ethernet broadcasting mechanism (e.g. when destination MAC address is not known) or using point to point mechanism through the destination MAC address of any other network device connected to the same subnet. As the host communicates with other network devices using this interface, the host kernel will populate the MAC addresses of other network devices connected to this subnet in its ARP table. The host kernel routing table will also be populated with the subnet IP address range (i.e. the subnet mask) of this subnet so when host receives a packet, on any of host's network interface and not just docker0, destined for an IP address in this subnet range then host will use this docker0 network interface to further route the packet. Note that the host acts as the gateway for this subnet and therefore, hosts kernel routing table does not get a default gateway entry for this interface.

When a container, say cont#1, gets created (either using docker run or docker create) on this host, the Docker daemon adds a new virtual port (technically they are known as the veth-pair type network interface) on this virtual switch. This virtual port shows up as vethXYZ@ifAB interface on the host machine. Lets say this interface has an interface ID of CD. The Docker daemon also creates a new eth0@ifCD network interface inside the container cont#1. This interface will have an interface ID of AB. The @ifAB at the end of vethXYZ@ifAB indicates that this port on the switch (identified by the interface ID CD) is connected to the interface ID AB (which is the eth0@ifCD interface on the container cont#1). Correspondingly, @ifCD at the end of eth0@ifCD indicates that this Ethernet interface is connected to the interface ID CD (which is the vethXYZ@ifAB port on the virtual switch on host). Please note that technically all interfaces are created on the host machine -- they are just in different network namespaces. Kernel assigns the numerical network interface IDs to all network interfaces sequentially across all namespaces.

Please note that these ports (i.e. the veth-pair type network interface) on the virtual switch do not receive an IP address even though they show up as proper network interface with a MAC address in the list of network interfaces. But these MAC addresses will never get populated in an ARP table and these interfaces will never get used directly. The host will use the docker0 interface to send packets to the containers and the containers will use their own interfaces like eth0@ifCD to send packets to the host as as well as to the other containers. These interfaces like vethXYZ@ifAB should be mentally modelled as ports on the switch with an Ethernet wire connected. The interface on the other end (such as the eth0@ifCD interface) that this Ethernet wire connects to is the one that receives an IP address not the port itself. Technically, vethXYZ@ifAB is a slave interface while the docker0 is the master interface. Any outbound packet that goes through the vethXYZ@ifAB interface will get the docker0 IP address as source IP address.

The eth0@ifCD interface on the container cont#1 is also connected to same private subnet as the docker0 and receives the a unique private IP address of that private address space. As with any other LAN hardware, this interface also gets a MAC address. The processes on this container are able to receive and send packets using this network interface on the connected subnet either using Ethernet broadcasting mechanism (e.g. when destination MAC address is not known) or using point to point mechanism through the destination MAC address of any other network device connected to the same subnet. As the container communicates with other network devices using this interface, the container will populate the MAC addresses of other network devices connected to this subnet in its own ARP table. The kernel routing table of the container will also be populated with the subnet IP address range (i.e. the subnet mask) of this subnet so when container receives a packet, on any of its network interface and not just eth0@ifCD, destined for an IP address in this subnet range then the container will use this eth0@ifCD network interface to further route the packet. Note that the host acts as the gateway for this subnet and therefore the container's routing table table will also list the docker0 IP address as the default gateway entry for this interface. So any packet that is not destined for this subnet will be forwarded to the host machine on the docker0 interface. Host machine can then do NAT/PAT (if configured) and use its other network interfaces (if configured) to forward the packet to correct destination out side of host and containers private network. Please not the by default, NAT/PAT is not configured. So by default container can access the internet provided host has other network interfaces that are connected to the internet but the reverse is not possible (unless NAT/PAT in configured on host by publishing ports from container), i.e. the internet cannot access the containers even though host may be made accessible on the internet.

On Kogence container cloud HPC platform, by default, we do not publish any of the containers port. If your workloads needs to expose some ports then those specific ports are published specifically for your workload only. Corresponding changes on the host's network router and host's iptables are also made automatically as needed.

Docker Host Network

If Docker's host network driver is used then Docker containers do not run in a separate network namespace. There will be not network isolation and the workloads running in the container will use the host network stack. This means that the host's network interfaces, routing tables, ARP tables etc will be visible inside the container.

By default, Kogence container cloud HPC platform uses the host network and therefore all network and host level security implemented applies even to the workloads running in the containers as well. Other network choices are made only for specific types of workloads for which host network does not suffice. Please contact us for further details.

OS Bypass Networking on Kogence

Remote Direct Memory Access Networks on Kogence

Kogence supports multiple parallel processing frameworks including MPI, open MP and Cham++.

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to function on a wide variety of parallel computing architectures. There are several well-tested and efficient implementations of MPI. On Kogence, currently we support Open MPI (not to be confused with OpenMP), MPICH, MVAPICH (an MPICH derivative) and Intel MPI (an MPICH derivative) libraries. All of these support various hardware network interfaces either through Open Fabrics Interface (OFI) or through natively implemented support which provides OS Bypass Remote Direct Memory Access capabilities and avoid the inter process communication to go through traditional TCP/IP stack providing significant performance boost. On Kogence, users would not have to worry about the details of hardware and network infrastructure as we will configure appropriate transport layer for best performance for their HPC applications.

For example, the Intel MPI Library package contains the libfabric library, which is used when the mpivars.sh script executes. The libfabric implements OFI interface. What OFI does is that it provides a common set of commands that application developers can call and libfabric translates them to various different protocols that different hardware providers support. The hardware providers protocol can be selected by export I_MPI_OFI_PROVIDER=<name> or export FI_PROVIDER=<name> with first option being preferred. <name> could be mlx, verbs, efa, tcp, psm2. For example OFI can translate commands to standard tcp/ip protocol if <name>=tcp. This can work on standard ethernet hardware as well as on IPoIB (IP over infiniband), IPoOPA (IP over Intel Omni-Path network). If <name>=mlx then OFI can translate commands to UCX protocol that is needed for OS Bypass Mellanox InfiniBand. Similarly <name>=mlx can be used Intel Omni Path. The verbs is needed for other OS bypass network hardware (non-Melanox InfiniBand, iWarp/RoCE etc).

Similar to MPI, Cham++ is a machine independent parallel programming system. Programs written using this system will run unchanged on MIMD computing systems with or without a shared memory. It provides high-level mechanisms and strategies to facilitate the task of developing even highly complex parallel applications. Charm++ does offer some benefits over MPI such relative ease of programing to developers, fault tolerance and separating out machine layer so that application developer does not distinguish whether user executes the application leveraging multi-threading parallelism on shared memory SMP architectures or multi-processing parallelism on distributed memory clusters. Charm++ can use MPI as the transport layer (aka Charm over MPI), or you can run MPI programs using Charm++ Adaptive-MPI (AMPI) library (aka MPI over Cham). Charm++ can also use TCP/IP stack or other OS bypass network protocols for messaging needs. The communication protocols and infrastructures supported by Charm++ are UDP, MPI, OFI, UCX, Infiniband, uGNI, and PAMI. Charm++ programs can run without changing the source on all these platforms.

Exploring, Searching and Navigating Scientific Content on Kogence

Models

Every cloud HPC project users create on Kogence is called a model. Concept of model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software or a simulator; we launch a model. Model is connected to a cluster and a stack of software and simulators. For example, if you want to launch Matlab software on Kogence cloud HPC platform then you would first create a Model. Connect that model to a cloud HPC hardware or cluster of hardware of your choice and then connect the model with the Matlab software. Then you can launch this model.

Easiest way to get started with Kogence cloud HPC platform is to search and then copy an existing model. Kogence Model Library consists of many publicly shared models. There are many ways to explore, search and navigate models available on Kogence:

1. Model Library page allows you to search, filter and sort lists of models. Model Library page only lists models for which you have requisite permissions to access. Models on Kogence are full text searchable. You can use the search bar on top of the Model Library page to search most relevant models. You can also filter list of models either by the software that are connected to the model (e.g you can filter all Matlab models) or by numerical computational methodologies used in the model (e.g. you call filter all Density Functional Theory models). You can also sort the list of models to find most popular models. For example, you can sort the model by the number of times model have been executed in the past by other users. Models that have been executed many times in past is expected to be fully functional and bugs free. Number of forks mean how many times an original model has been copied as personal copies by other users. You can also sort models by number of forks. Models that have been copied by many other users are expected to be well documented, functional, easy to learn and popular among other users.
2. Model Library page as well as each individual model page (e.g. Matlab surface fitting example model page) show the number of forks. If you click on the icon you can see the full family tree consisting of parent and grand-parents of current model as well as children and grand-children of current model. Some time you may find it to more helpful to copy an original parent model instead of a child model.
3. If you open any individual model page (e.g. Matlab surface fitting example model page) and if you scroll to the bottom, you will see a list of related models. This is an auto populated list of models that are similar to the current model you are viewing. This is another way to navigate and explore related scientific content on Kogence cloud HPC platform.
4. Software Library page lists all software for which you have requisite permissions to use on Kogence cloud HPC platform. Next to each listed software, toward the right hand side, it shows the number of models that use that software. It only shows the models for which you have requisite permissions to access. If you click on the icon showing the number of models, you will see a list. You can click on any of those models to explore and make a personal copy if you like.
5. From the Software Library page, if you click and open any software page (e.g. Matlab software page) and then scroll towards the bottom of that page you will see a list of models that use that software.
6. All the scientific content on Kogence is automatically categorized into appropriate categories. You can see category tags on the right hand panel of software (e.g. Matlab software page) and model pages (e.g. Matlab surface fitting example model page). You can also see category tags on Model Library and on Software Library. If you click on any category tag (e.g. Density Functional Theory or Quantum Chemistry) then you can see listing of all models, software and discussion boards relevant to that category.

Creating New Models

Permission Settings

1. Public Models: Any registered user can make copies of any Public model or edit wiki pages of Public models. After making a copy they can also download files from models. Authors of models get scholarly credits and citations for linked publications as copies are made.
2. Private Models: Private models are accessible only to you and your collaborators. Private models can be copied only by a model "admin". Model "user" cannot copy Private models.
3. All collaborators get view permission by default.
4. Execute permission requires edit permission as well.
5. Execute permission can only be assigned to a user who has access to software license, if needed.
6. Execute permission allows collaborators to execute models using either shared compute plan purchased under a team subscription or using individual compute plans (free or purchased) of each collaborators.
7. Collaborators with execute permission can share remote interactive display screen of model that is currently being run by pressing "Visualize" button on their individual browsers.

1. Model admin can add other collaborators to models and change any collaborator's role from "admin" to "user" and vice-versa.
2. Model admin can also change a private model to public and vice-versa.
3. Model admin can also create new copies of a private model. Non-admin model collaborators cannot make copies of private models. Any registered user can make copies of a public model.

Running Models On Autoscaling Cloud HPC Clusters

Kogence Autoscaling Cloud HPC Cluster Architecture

Kogence High Performance Computing (HPC) clusters are NOT shared clusters unlike HPC traditional onprem clusters. Each model execution on Kogence kick starts a new private and personal autoscaling HPC cluster in the cloud. When your model terminates, your cluster terminates too. If you add collaborators to your model then those collaborators can also access the cluster that your model has started.

Each of Kogence HPC cluster consists of two types of nodes - 1/ One master node and 2/ Multiple compute nodes. Master node is interactive node and support latency free remote user interaction. You can open graphical applications or shell terminals (TTY). Master node is also used to monitor cluster and job status. On the other hand, compute nodes are batch processing nodes. They do not provide graphics or TTY support. Jobs you send to the compute nodes must be able to execute without TTY support, should not require user interaction and should not be trying to open any graphical windows.

Kogence clusters allow you to select different type of HPC server hardware independently for the master node and for the compute nodes. All compute nodes in a cluster are of same server hardware type. Typically, users select a small inexpensive server hardware for the master node and use this node for pre-processing, post-processing and cluster and job monitoring purpose. Typically these uses do not require high computing power. Type of server hardware for the compute nodes can be selected based on the scalability and memory needs of your model. If your model scales linearly as you increase the number of CPUs then you are better of choosing compute nodes with large CPU counts. If your model requires large memory then you can select compute nodes that have large RAM/CPU ratio.

Number of compute nodes on Kogence clusters scale up and down automatically depending upon the number of jobs you send to the cluster and depending upon the resource requirements of each of these jobs. You use the Kogence Cluster Builder Graphical User Interface (GUI) to select cluster hardware resources you want in your cluster.

Configuring the Cloud HPC Cluster Hardware

You use the Kogence Cluster Builder GUI to select cluster hardware resources you want in your cluster. Kogence Cluster Builder GUI is accessible through the Cluster tab of your model. Follow the steps below to setup you cluster.

• Step1: Select a Compute Plan

On the left hand panel, first you select one of your Compute Plans (i.e. one of your CPU-Credit accounts) that you want to use for the simulations. All logged-in Kogence users will at least see an Individual Plan in the dropdown menu. If you are one of the team member of a Team Subscription then you might see more than one Compute Plans in the dropdown menu to choose from. Typically users use different Compute Plans for work under different projects or different grants.

• Step2: Select a Time Limit for the Cluster

On the left hand panel, you should enter a time limit after which you want your cluster to be automatically terminated irrespective of whether your simulation has completed or not. When you start your simulation, some CPU Credits will be blocked from your Compute Plan based on the time limit you select here (this also depends upon the number of nodes and type of nodes in the cluster). After your cluster terminates (either automatically because your simulation ended or because you terminate your cluster manually using the Stop button on top right corner of the NavBar), this CPU Credits block would be removed and some CPU Credits would be credited back to your Compute Plan. Net number of CPU Credits that are debited would depend upon the number of hours each of the compute node in your cluster was actually up.

Typically, user do not want to end their computations prematurely after hours of computing on several nodes. We recommend maintaining a large pool of CPU Credits (these credits do not expire) and selecting large time limit to ensure that jobs do not end prematurely. Unused CPU Credits are automatically refunded back to your Compute Plan.

If we have a credit card on file with Auto Refill box check marked then you can also select the Unlimited option for the time limit. In that case, once your Compute Plan is about to be depleted, your credit card would be charged and, if charge attempt is successful, your Compute Plan would be replenished. In this case your simulation can continue to run.

• Step3: Select "Run on autoscaling cluster" Option

Once you select this option you will be able to send selected jobs and selected commands within your model to the compute node while others will execute on the master node. If this option is not selected then only one cloud HPC server will be booted up and all jobs would run the same server. You would execute jobs through the Software Stack Builder GUI or through CloudShell terminal or through scripts as described below.

• Step4: Select Maximum Number of Compute Nodes

Autoscaling clusters start with a minimum number of compute nodes (currently set at 2 nodes). It then scales up and down automatically depending upon the jobs submitted to compute nodes through the Software Stack Builder GUI or through CloudShell terminal or through scripts. You can select the maximum number of compute nodes that you want to have in your cluster. You do not need to change your job submission scripts or your settings on the Software Stack Builder GUI. If the workload submitted is higher than the cluster size then jobs would simply wait in the queue for the compute nodes to become available. If the workload submitted is less than the cluster size then some of the compute nodes would be automatically terminated soon. Cluster can scale down to zero compute nodes if no workload has been waiting in the queue for sometime.

See the screen shots below. You can click on the images to see a larger version.

• Step5: Select Server Hardware for the Master and the Compute Nodes

Kogence clusters allow you to select different type of HPC server hardware independently for the master node and for the compute nodes. All compute nodes in a cluster are of same server hardware type. Typically, users select a small inexpensive server hardware for the master node and use this node for pre-processing, post-processing and cluster and job monitoring purpose. Typically these uses do not require high computing power. Type of server hardware for the compute nodes can be selected based on the scalability and memory needs of your model. If your model scales linearly as you increase the number of CPUs then you are better of choosing compute nodes with large CPU counts. If your model requires large memory then you can select compute nodes that have large RAM/CPU ratio.

First select "Master Node" from the dropdown menu and then click on one of the machine icons of the different HPC server hardware available under your subscription to select the HPC hardware for the master node of the cluster. Next select the "Compute Node" from the same dropdown menu and then again click on one of the machine icons to select the HPC server hardware for the compute node.

• Step6: Save the Cluster Settings

On the right hand panel you will see a summary of your final selection. Please press Save button to confirm the selections.

Running Jobs on Autoscaling Cluster Using CloudShell Terminal

For most users, running jobs on cloud HPC clusters using the Kogence Software Stack Builder GUI is the the most convenient, easiest and quickest method. On the other hand, more advanced user may prefer to invoke executable binaries of available software, solvers and simulator applications using a shell terminal. It offers greater flexibility. Use cases can vary widely and include ability to redirect outputs and errors to specific files, ability to stop and restart an application with stopping entire simulation, invoking executable with large number of different input files (e.g. parameter scans) etc.

• Step1: Connect Specific Software to Your Model

First, go to the Stack tab of your model and connect one or more software, solver or simulator applications that you want to use from the shell terminal. Please do not select any entrypoint binaries from the dropdown menu for any of these software. See an example screen shot below. This would download the software docker containers on your HPC server after it boots up. This may take a few seconds. Since you did not select any entrypoint binaries, no command would be invoked. After downloading the containers, commands would be automatically configured and added to the PATH so you can invoke them from the shell terminal.

Staying at the Stack tab of your model, connect the CloudShell to your model. From the entrypoint binary dropdown menu, select the shell-terminal. Currently, we offer two shell terminal emulators: xterm and gnome-terminal. In the command textbox please type either xterm or gnome-terminal depending on your preference.

• Step3: Save the Software Stack Settings

Save the settings of the Stack tab.

• Step4: Run the Model and Connect to Your Cloud HPC Server

From the top NavBar click on the Run button. It may take up to 5 minutes for your autoscaling cloud HPC cluster to boot up. Once the cluster is booted up, the Run button would turn into Stop button and the Visualizer button would become active. Connect to the master node of your cluster by clicking on the Visualizer button. Once all the software containers have been downloaded, you will see your shell terminal.

You are ready to invoke any of the binaries that are documented on the simulator wiki page. Any command that you invoke on the shell terminal would be executed on the master node. On the other hand, if you want to send your jobs to the compute nodes of your cluster using the shell terminal then please use the qsub command of the Grid Engine resource manager like below:
qsub -b y -pe mpi num_of_CPU -cwd YourCommand

Be careful with the num_of_CPU. It has to be either same or less than the number of CPU's in the compute node type that you selected in the Cluster tab of your model. Grid Engine offers a lot of flexibility with many command line switches. Please check the qsub man page. Specifically you might find following switches to be useful:
• -b y: Command you are invoking is treated as a binary and not as a job submission script.
• -cwd: Makes the current folder as the working directory. Output and error files would be generated in this folder.
• -wd working_dir_path: Makes the working_dir_path as the working directory. Output and error files would be generated in this folder.
• -o stdout_file_name: Job output would go in this file.
• -e stderr_file_name: Job error would go in this file.
• -j y: Sends both the output and error to the output file.
• -N job_name: Gives job a name. Output and error files would be generated with this name if not explicitly specified using -o and/or -e switches. You can also monitor and manage your job using the job name.
• -sync y: By default qsub command returns control to the user immediately after submitting the job to cluster. So you can continue to do other things on the CloudShell terminal including submitting more jobs to the scheduler. This option tells qsub to not return the control until the job is complete.
• -V: Specifies that all environment variables active within the qsub utility be exported to the context of the job.

Running Jobs on Autoscaling Cluster Using Scripts

For most users, running jobs on cloud HPC clusters using the Kogence Software Stack Builder GUI is the the most convenient, easiest and quickest method. On the other hand, more advanced user may prefer to invoke executable binaries of available software, solvers and simulator applications using scripts. It offers greater flexibility. Use cases can vary widely and include ability to redirect outputs and errors to specific files, ability to stop and restart an application with stopping entire simulation, invoking executable with large number of different input files (e.g. parameter scans) etc.

• Step1: Connect Specific Software to Your Model

First, go to the Stack tab of your model and connect one or more software, solver or simulator applications that you want to use from the shell terminal. Please do not select any entrypoint binaries from the dropdown menu for any of these software. See an example screen shot below. This would download the software docker containers on your HPC server after it boots up. This may take a few seconds. Since you did not select any entrypoint binaries, no command would be invoked. After downloading the containers, commands would be automatically configured and added to the PATH so you can invoke them from the shell terminal.

Staying at the Stack tab of your model, connect the CloudShell to your model. From the entrypoint binary dropdown menu, select the bash-shell. In the command textbox please type the name of your script. Save the settings of the Stack tab. Make sure you have the script available under the Files tab of your model.

• Step3: Save the Software Stack Settings

Save the settings of the Stack tab.

• Step4: Run the Model and Connect to Your Cloud HPC Server

From the top NavBar click on the Run button. It may take up to 5 minutes for your autoscaling cloud HPC cluster to boot up. Once the cluster is booted up, the Run button would turn into Stop button and the Visualizer button would become active. Connect to the master node of your cluster by clicking on the Visualizer button. Once all the software containers have been downloaded, your shell script will execute automatically.

In your shell-script you can invoke any of the binaries that are documented on the simulator wiki page. Any command that you invoke in your shell script would be executed on the master node. On the other hand, if you want to send your jobs to the compute nodes of your cluster through the script then please use the qsub command of the Grid Engine resource manager like below:
qsub -b y -pe mpi num_of_CPU -cwd YourCommand

Be careful with the num_of_CPU. It has to be either same or less than the number of CPU's in the compute node type that you selected in the Cluster tab of your model. Grid Engine offers a lot of flexibility with many command line switches. Please check the qsub man page. Specifically you might find following switches to be useful:
• -b y: Command you are invoking is treated as a binary and not as a job submission script.
• -cwd: Makes the current folder as the working directory. Output and error files would be generated in this folder.
• -wd working_dir_path: Makes the working_dir_path as the working directory. Output and error files would be generated in this folder.
• -o stdout_file_name: Job output would go in this file.
• -e stderr_file_name: Job error would go in this file.
• -j y: Sends both the output and error to the output file.
• -N job_name: Gives job a name. Output and error files would be generated with this name if not explicitly specified using -o and/or -e switches. You can also monitor and manage your job using the job name.
• -sync y: By default qsub command returns control to the user immediately after submitting the job to cluster. So your script would immediately move on to do other things including submitting more jobs to the scheduler. This option tells qsub to not return the control until the job is complete.
• -V: Specifies that all environment variables active within the qsub utility be exported to the context of the job.

Managing and Monitoring Cloud HPC Cluster Jobs

For all your Cloud HPC Cluster jobs, we strongly recommend that you add a CloudShell software to your workflow using the Stack Builder GUI accesible through the Stack tab of your model. Make sure that the CloudShell is marked as "Run With Previous" so that the CloudShell becomes available for use immediately and is not blocked until jobs before it have been completed.

On the CloudShell terminal, you can use the qhost, qstat and qdel utilities to manage and monitor your Cloud HPC Cluster jobs.

• qhost shows the number of compute nodes available in your cluster. NCPU, NSOC, NCOR and NTHR shows the number of CPUs, number of sockets, number of cores and number of hardware threads available on each of those compute nodes. LOAD shows the time averaged (over past 5 minutes) demand for number of CPUs for all runnable processes. If you have 36 CPU nodes then LOAD should be very close to 36 if you utilizing your nodes properly.
• qstat shows the state of the job.
• "r" represents a properly running job.
• "qw" represents a job that is waiting-in-queue. This usually happens when your autoscaling cluster is waiting for nodes to be added into your cluster. Kogence platform uses predictive algorithms and there can sometime be an up to 5 minutes of delay before nodes are added to your cluster. Your job will automatically switch to "r" state once nodes are added.
• "Eqw" represents that job is waiting in queue due to some error in the submission command. You can run the command qstat -explain c -j jobIDXYZ to see more details. You can use qdel jobIDXYZ to delete the job, fix the submission command and resubmit the job using qsub.