# Help:FAQ


## Miscellaneous Kogence Terminology

### What is a Model?

Every cloud HPC project a user creates on Kogence is called a Model. The concept of a Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software application or a simulator; we launch a Model. A Model is connected to a cluster and to a stack of software and simulators.

Each Model consists of:

1. An independent project documentation wiki. The wiki comes with a full-featured WYSIWYG editor. Any graphics generated during execution of your Model are automatically pulled into the project wiki. The permission controls you choose for your Model apply to the Model's wiki as well. A private wiki shows up in the Model Library page only when a collaborator with the correct permissions logs in.
2. An independent discussion board. The permission controls you choose for your Model apply to the Model's discussion boards as well.
3. An independent control of permissions at the Model level. If you give your collaborators edit permission, then your collaborators have edit permission on all assets under that Model, including project files, wiki, discussion boards, cluster settings and stack settings. You cannot, for example, control permissions at the individual-file level.
4. A connected cluster. You can choose to run your Model on a single cloud HPC server or on an autoscaling cloud HPC cluster. You can choose the maximum number of nodes that you want your cluster to scale to. Nodes are created and deleted automatically based on the progression of your Model execution. Any settings you choose for setting up your cluster remain stored with the Model.
5. A connected software stack. A Model can be connected to multiple software packages, and you can create a Workflow using them. Some of these can run in interactive mode and others in unattended batch mode. Some may start on the master node or the interactive node, and others may be scheduled on the compute nodes of the autoscaling cloud HPC cluster. Some may be blocking, in the sense that the next command runs only after the current one exits, while others can run concurrently with the current command. Any settings you choose for setting up your software stack and invocation commands remain stored with the Model.
6. All assets of each Model are independently version controlled.
7. When you Copy a Model, you create a new Model with all of its assets duplicated. From then on, the new Model maintains its own version control history.

### What is a Simulator and a Container?

On Kogence, any software application, solver or simulation tool (each referred to as a Simulator) is deployed as an independent entity, referred to as a Container, that can be connected to Models using the Stack tab of the Model and invoked to perform tasks/computations on the input files/data provided under the Files tab of the Model. Containers cannot contain user-specific or Model-specific data or setup; a Container must be an independent entity that can be linked and invoked from any Model on Kogence.

We use the term Simulator in a much wider sense than is commonly understood. In the context of the definition above, Matlab is a Simulator and so is Python. Each is deployed in an independent Container. On Kogence we use Docker container technology.

The concept of a Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software application or a simulator; we do not launch a Container; we launch a Model. A Model is connected to a cluster and to a stack of software and simulators.

### What is a Simulation?

Execution of a Model on a cloud HPC cluster is called a Simulation. A Simulation is NOT a single job. Billing on Kogence works on a per-Simulation basis (i.e., per execution of a Model), not on a per-job basis. A Simulation can consist of a complex Workflow of multiple jobs using multiple software packages, as defined by the user on the Stack tab of the Model. Some of these jobs can run in interactive mode and others in unattended batch mode. Some may start on the master node or the interactive node, and others may be scheduled on the compute nodes of the autoscaling cloud HPC cluster. Some may be blocking, in the sense that the next command runs only after the current one exits, while others can run concurrently with the current command.

### How Long Does it Take for My Simulation to Start?

If your Model is connected to a single cloud HPC server that is not already persisting (i.e., when you start your Simulation for the first time), it can take about 2 minutes for the server to boot up and start executing the jobs defined in the Stack tab of your Model. Once the server has started executing your jobs, you will see the Visualizer tab become active on the top NavBar, with a Stop button next to it. If you press the Stop button to stop the Simulation and then press the Run button again, the server persists through use of kPersistent technology. That means the server starts executing your jobs immediately and you can connect using the Visualizer tab immediately. This allows you to work interactively, stop the execution, edit and debug code using the code editor accessible under the Files tab of your Model, and then restart your Model, just as you would on an on-premises workstation.

If your Model is connected to an autoscaling cloud HPC cluster, it can take about 5 minutes for the cluster to boot up and be configured to start executing the jobs defined in the Stack tab of your Model. Once the cluster has started executing your jobs, you will see the Visualizer tab become active on the top NavBar, with a Stop button next to it. kPersistent technology is not yet available with clusters.

### What are CPU-Hrs?

CPU-Hr is the unit we use on Kogence to measure the amount of computing power that has been reserved or consumed, independent of the type of hardware being used. CPU-Hrs = # of CPUs × # of Hours.

We measure time in steps of one hour. This does not mean that every time you start your Simulation you will be billed for a full hour. Only the first time you start a Simulation do we charge you for one full hour. If, before that hour is complete, you stop and restart the same or a different Model on the same hardware, we do not charge you anything until a full hour has elapsed since you first started your first Model. After the completion of a full hour, you are charged for another full hour, and the same logic continues for all subsequent hours.

### What is 1 CPU Credit or 1 HPC Token?

HPC Token and CPU Credit are one and the same thing; which term you see depends on the release version of the Kogence HPC Grand Central App deployed under your subscription. On Kogence, HPC Tokens (or CPU Credits) are the currency that you purchase and then spend as you consume HPC compute resources on the Kogence Container Cloud HPC platform.

The cost of compute resources is specified in terms of CPU Credits or HPC Tokens. Please check the pricing page for the currently available pricing. Typically, 1 CPU Credit = 1 CPU for 1 hour. Accelerated hardware such as a GPU compute node is also priced in terms of CPU Credits; typically, 10 CPU Credits = 1 GPU-accelerated CPU for 1 hour.

So, for example, if you connect your Model to a 4-CPU machine and select a wall time limit of 10 hours, then your account needs to have at least 40 CPU Credits remaining, otherwise your Simulation will not start. If your Simulation starts and either ends automatically or you stop it by pressing the Stop button after, say, 1 hour and 20 minutes, then we refund 32 CPU Credits to your account (you are charged for 2 full hours). If you restart the Simulation on the same hardware with the same wall time limit within the next 40 minutes (say, after 20 minutes), then we again block 40 CPU Credits at the start. But if you stop that Simulation after only 10 minutes, we refund the full 40 CPU Credits to your account. You are not charged anything for this second Simulation because you already paid for 2 hours during the first Simulation.

As of this writing, everybody starts with 20 free CPU Credits. We top up your account every month for free. You can earn more free Credits, and you can also purchase more Credits. Credits do not have any expiration date. Currently you can purchase CPU Credits for as low as $0.02 each. This means you can purchase 1 CPU-Hr of computing for $0.02. Kogence reserves the right to change this pricing without notice. Please check the pricing page for the currently available pricing.

## Cloud HPC Hardware Terminology

### What is 1 CPU on Kogence?

1 CPU on Kogence is the same as what is called 1 CPU in the Resource Monitor of a Microsoft Windows 10 PC, for example. Similarly, the # of CPUs shown on Kogence is the same as what is shown as "CPU(s)" by the lscpu utility on Linux platforms. Different platforms and utilities may call this same logical computing unit by different names. On the same Microsoft Windows 10 PC, for example, the System Information and Task Manager utilities call it the "number of logical processors", and Amazon AWS (see here) and Microsoft Azure (see here) call it the "number of vCPUs". In general,

${\displaystyle no\_of\_CPUs = no\_of\_hardware\_threads\_per\_core \times no\_of\_cores\_per\_socket \times no\_of\_sockets\_per\_server}$.

This is the most relevant logical unit of computing power in the context of cloud High Performance Computing (HPC). Each logical processor in fact has an independent architectural state (i.e., instruction pointer, instruction register sets, etc.), and each can independently and simultaneously execute instructions from one worker thread or worker process without needing context switching. Each of these register sets is clocked at the full CPU clock speed.

In the context of MPI (Message Passing Interface, a multi-processing framework) and OpenMP (a multi-threading library), for example, each of these CPUs can run an individual MPI process or an individual OpenMP thread without requiring frequent context switches. So if you are scheduling an MPI job on a 4-CPU machine on Kogence, you can run mpirun -np 4 and there will be no clock-cycle sharing or forced context switching among processes.
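As a minimal sketch (assuming an MPI implementation such as Open MPI or MPICH provides mpicc and mpirun; the file name is hypothetical), each of the 4 ranks below lands on its own CPU:

```c
/* mpi_hello.c -- minimal MPI sketch.
   Compile: mpicc mpi_hello.c -o mpi_hello
   Run:     mpirun -np 4 ./mpi_hello   (one rank per CPU on a 4-CPU machine) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */

    printf("Rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```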

Between CPUs, cores, sockets, processors, vCPUs, logical processors, hardware threads, etc., the terminology can get confusing if you are not a computer scientist! Let's take a look at a common Microsoft Windows 10 PC that we are all so familiar with; all other computer systems show similar information. Let's first look at the Task Manager. This machine has an Intel Core i5-7200 CPU, but don't jump to calling it a 1-CPU machine; that is wrong, as you will see below. We also notice that this machine has 1 socket, 2 cores and 4 logical processors. The System Information utility shows something similar. The Resource Monitor, however, shows that this is a 4-CPU machine, from CPU 0 to CPU 3 (1 socket × 2 cores per socket × 2 hardware threads per core = 4 CPUs). If you run the lscpu command on a Linux machine, you will likewise see 4 CPU(s).

Hardware threads within a core do, however, share one common floating-point processor and cache memory. Please check the FAQ section below to learn more about CPUs and hardware threads. In FLOPS-heavy HPC applications it is the utilization of the floating-point processor that matters more than the utilization of the CPU. If a worker thread or worker process running on one of the hardware threads can keep the floating-point processor of that core fully occupied, then even though the system/OS resource management and monitoring utilities (such as top or the GNOME system monitor on Linux platforms) may show overall CPU utilization capped at 50%, you are already getting the most out of your HPC server. A 50% CPU utilization figure may therefore be misleading in the context of HPC. Remember that the CPU utilization percentage is averaged across the total number of CPUs (i.e., the total number of hardware threads) in your HPC server, so if you are using only one hardware thread per core then the maximum possible CPU utilization is 50%. This just means that the CPU could in theory execute twice as many instructions per unit time provided they are not all asking for access to the floating-point processor; if they are, then 50% utilization is the best you can get.

For these reasons, some FLOPS-heavy HPC applications available on Kogence (e.g., Matlab) will not run one independent worker process or worker thread on each hardware thread; instead they run one worker process or worker thread per core. But Kogence allows you to start multiple instances of these applications. You can experiment with a single instance and with multiple instances and see which gives you better net throughput.

### What is a Hardware Thread?

Most High Performance Computing (HPC) servers that Kogence offers are built using Intel microprocessor chips. HPC server motherboards consist of multiple sockets, with each socket plugged with one Intel microprocessor chip. Each microprocessor chip has multiple cores. Each core is built using Hyper-Threading Technology (HTT). HTT creates 2 logical processors out of each core. Different operating system (OS) utilities and programs identify these logical processors by different names: some identify them as 2 CPUs and others as 2 hardware threads. "Hardware thread" is a very confusing term; it has nothing to do with the user (worker) threads that your program might start. Each hardware thread is capable of independently executing an independent task, either an independent worker thread or an independent process.

Hyper-Threading Technology is a form of simultaneous multithreading introduced by Intel. Architecturally, a processor with Hyper-Threading Technology consists of two logical processors per core. Just like a dual-core or dual-socket configuration that uses two separate physical processors, each of these logical processors has its own architectural state. Each logical processor can be individually halted, interrupted or directed to execute a specified process/thread, independently from the other logical processor sharing the same physical core. On the other hand, unlike a traditional dual-core or dual-socket configuration that uses two separate physical processors, the logical processors in a hyper-threaded core share the execution resources: the execution engine, the caches, and the system bus interface. This sharing allows the two logical processors to work with each other more efficiently, and allows one logical processor to borrow resources from the other when it stalls. A processor stalls when it is waiting for data it has requested so it can finish processing the present thread. The processor may stall due to a cache miss, branch misprediction, or data dependency. The degree of benefit seen when using a hyper-threaded processor depends on the needs of the software, and on how well it and the operating system are written to manage the processor efficiently.

In the vast majority of modern HPC use cases, we find that HTT helps speed up the application. It is a very effective approach for getting the most performance at a given cost. At a high level, any HPC program, once loaded into CPU registers as a set of machine instructions, can be thought of as a directive to the CPU to repeatedly loop over the following: 1/ fetch an instruction; 2/ decode the instruction and fetch register operands; 3/ execute the arithmetic computation; 4/ possibly access memory (read or write); 5/ write results back to registers. It is step #4 that most people overlook when thinking about execution speed. In this context, even cached memory is slow, much less main memory. L1 cache typically has a latency of ~2 CPU cycles, L2 cache ~8 CPU cycles, and L3 cache ~100 CPU cycles. Main memory has about 2X the latency of L3 cache (~200 CPU cycles away). If your code opens some I/O pipes to read/write files, for example, then I/O devices on the main memory bus have 100X-1000X more latency than main memory (~20K to ~200K CPU cycles away). If the I/O devices are on the network or on a PCIe bus, then you are looking at millisecond-level latency at the least (~2 million CPU cycles away).

Now imagine your HPC code is running on a CPU. CPUs are really good at steps #1 to #3 and step #5. For example, modern CPUs can do 4 floating-point operations (4 FLOPs) per CPU cycle. If your program needs to access some data from main memory, it will wait for ~200 CPU cycles while the floating-point arithmetic engine of the CPU sits idle. If it is waiting for data from an I/O device, the compute engine may sit idle for millions of CPU cycles. If you could use that time, you could have completed 4 million floating-point operations. That is exactly what HTT accomplishes. Even the most heavily optimized real-world HPC code needs to access cache memory frequently at the least, which means there are several tens to several hundreds of CPU cycles, at the least, that could be utilized by another thread. Whether hardware threading (HTT) enhances performance basically boils down to the ratio of floating-point instructions to instructions that fetch or write data from cache, main memory or I/O devices. Even if that ratio is 100 or 1000, you would still expect HTT to boost performance.

If HTT is turned off, it is not that those idle CPU cycles are completely wasted: the OS can still do a context switch and load another ready-to-execute process or thread (user worker thread or OS thread). The benefit of HTT is that the time it takes to switch between hardware threads is much less than the time the OS needs to do a context switch, as the Thread Control Block (TCB) is already loaded in the hardware.

However, in cases where both threads primarily operate on instructions or data that are very close (e.g., in registers) or relatively close (first-level cache), overall throughput occasionally decreases compared to non-interleaved, serial execution of the two lines of execution. In the case of the LINPACK benchmark, a popular benchmark used to rank supercomputers on the TOP500 list, many studies have shown that you get better performance by disabling HT Technology. We think these are largely artificial constructs that do not represent how real-life HPC codes work: the LINPACK benchmarks are hand-crafted specifically to probe the speed of floating-point operations and deliberately avoid accessing memory or I/O. Real-world HPC workloads cannot avoid memory access or I/O as efficiently as these benchmarks do.

One of the drawbacks of a traditional on-prem HPC cluster is that it hinders experimentation with HTT for the specific use case at hand: either the entire cluster has HTT switched ON or the entire cluster has it switched OFF. Your on-prem cluster admin typically makes that decision based on all the types of workloads that all the other users typically run on that cluster. On the Kogence cloud HPC platform you create your own personal autoscaling HPC cluster for each Model you execute. You can run the same Model multiple times on brand-new clusters, disabling HTT in some runs while leaving it enabled in others. You can easily test both configurations and decide which is best based on empirical evidence. That said, for the overwhelming majority of workloads, you should leave HTT enabled.

The OSes configured on Kogence are HTT-aware, meaning they know the distinction between hardware threads and physical cores and will properly schedule user threads to reduce stalled CPU states and enhance performance, unless you specifically instruct the Kogence platform to pin your processes and worker threads to specific cores or to specific hardware threads.
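If you do want explicit pinning, here is an illustrative Linux-only sketch (not a Kogence-specific API; the file name and the choice of logical processor 0 are assumptions for the example) in which a process pins itself to one CPU with sched_setaffinity:

```c
/* pin_self.c -- pin the calling process to logical processor (CPU) 0.
   Compile: gcc pin_self.c -o pin_self */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);        /* start with an empty CPU mask */
    CPU_SET(0, &set);      /* add logical processor 0 to the mask */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("Pinned to CPU 0\n");
    return 0;
}
```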

### What are the SISD, SIMD, MISD and MIMD Architectures?

The oldest microprocessor architectures were SISD, short for single instruction, single data. This means that a single piece of data (say, two operands on which an add instruction needs to be executed) is loaded into CPU registers and a single instruction (such as the add instruction) is executed on it.

SIMD, short for single instruction, multiple data, enables us to process multiple data elements with a single instruction. This allows data to be "vectorized". For example, we may be able to add two columns of a matrix in a single CPU clock cycle. The width of SIMD, i.e., the number of data elements that can be processed simultaneously, is determined by the bit length of the registers. Intel hardware has offered vectorized SIMD capability beginning with SSE (Streaming SIMD Extensions), which supports 128-bit registers; AVX (Advanced Vector Extensions) offers 256-bit registers and AVX-512 offers 512-bit registers. On Kogence, all hardware offered is AVX-512 capable. Since the theoretical peak performance of a CPU assumes the application uses the full vector width, utilization of SIMD is crucial for application performance on modern CPU architectures. However, it is not trivial to transform a scalar kernel into a SIMD-vectorized one, and the optimal method of vectorization differs for each instruction set architecture (ISA). Most software, simulators and solvers that we offer on Kogence, such as GROMACS, LAMMPS, NAMD etc., have been effectively vectorized and are available to our users out of the box.
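As a small sketch of what vectorization looks like in practice (the function and file names are hypothetical), here is a classic kernel that compilers can auto-vectorize into SIMD instructions when built with, for example, gcc -O3 -mavx512f:

```c
/* saxpy.c -- a classic vectorizable kernel. With -O3 (and a target flag
   such as -mavx512f) the compiler can turn this scalar loop into SIMD
   instructions that process many elements of x and y per iteration. */
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* independent iterations -> vectorizable */
}
```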

Multiple instruction, single data (MISD) is a theoretical processor architecture concept that, to the best of our knowledge, has never been implemented in any commercially available computing hardware.

All hardware that we offer on Kogence is multiple instruction, multiple data (MIMD) capable. Each CPU in itself is SIMD, but all computing hardware we offer consists of multiple CPUs that can independently and simultaneously execute multiple instructions, each instruction working on vectorized data. So the computing hardware in aggregate is known as an MIMD system.

MIMD computing systems can be further divided into two categories -- the shared memory architecture (known as symmetric multiprocessors or SMP) and distributed memory architecture (also known as cluster computing). We offer both on Kogence. Please refer to other sections of this document for more details.

### What is the Memory Latency and Bandwidth on Kogence?

Before we answer that question, let's cover some basics.

#### Virtual Memory and Page Tables

In HPC, we often deal with shared-memory parallelism, which means processes can share memory. That means multiple virtual memory pages (as addressed by different processes) may be mapped to the same physical memory frame. If NUMA is enabled then the reverse is also possible, i.e., the same virtual memory page may be mapped to multiple physical memory frames (for read-only frames), with the individual frames residing in the local physical memory of each individual CPU.

#### Cache Memory, Translation Look Aside Buffer (TLB) and Main Physical Memory

The cache and the TLB are part of the Memory Management Unit (MMU), which is integrated on the same silicon chip as the microprocessor. Both the cache and the TLB are used to reduce the time it takes a running process to access a physical memory location which, on all hardware that Kogence offers, is located off the microprocessor chip, on a different silicon chip on the motherboard connected via the memory bus. The cache is basically a quick-access copy of small sections of physical memory held in static RAM on the microprocessor's silicon chip, while the TLB is a quick-access copy of small sections of the page table (which also resides in physical memory) held in the same static RAM.

#### Latency and Throughputs

Servers offered on Kogence have 3 levels of cache: L1, L2 and L3. The L1 cache is further broken down into instruction (L1i) and data (L1d) caches. Each core gets its own L1 and L2 cache, while the L3 cache is shared among all cores in a socket. Hardware threads do not get their own cache. Typically our hardware will have 32KB each of L1i and L1d cache, 256KB of L2 cache and >10MB of L3 cache. The amount of each level of cache differs from hardware to hardware; check the specification page for more details. On Kogence hardware, the TLB may reside between the CPU and the L1 cache. On the servers Kogence offers:

• L1 cache typically has a latency of ~2 CPU cycles and a bandwidth of ~500 bytes/CPU-cycle.
• L2 cache typically has a latency of ~8 CPU cycles and a bandwidth of ~500 bytes/CPU-cycle.
• L3 cache typically has a latency of ~100 CPU cycles and a bandwidth of ~200 bytes/CPU-cycle.
• Main memory has a latency of ~200 CPU cycles and a bandwidth of ~100 bytes/CPU-cycle.
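For intuition, assume a representative 2.5GHz clock (an illustrative figure, not a Kogence-specific specification). A ~200-cycle main-memory access then costs

${\displaystyle 200\ cycles \div 2.5\times 10^{9}\ cycles/s = 80\ ns,}$

during which a core capable of 4 floating-point operations per cycle could in principle have executed ~800 FLOPs.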

### What are Shared and Distributed Memory Computing, SMP, NUMA and ccNUMA?

The oldest microprocessor chips had one CPU connected to an off-chip RAM through a memory bus. Then came the multi-core microprocessors: a single chip had multiple CPUs in it, connected to the same shared off-chip RAM through a common memory bus. This shared-memory computing architecture is referred to as symmetric multiprocessing (SMP). What this means is that each CPU can access the common shared RAM in a completely autonomous fashion, independent of the other CPUs. The architecture is "symmetric" when seen from each core's perspective: there is no master/slave concept, there is a single OS across all CPUs, and no CPU has to go through network hardware (in the case of RDMA) and/or the OS communication stack (in the case of TCP/IP). Processes running on each CPU can address the entire common memory space and access it directly through the memory controller. An SMP system is a tightly-coupled system in which multiple processors, working under a single operating system and a single virtual memory space, can access the entire system memory over a common bus or interconnect path.

Applications using OpenMP for parallelism rely on this shared-memory architecture: you cannot run two threads of the same process on CPUs that don't share memory. Older versions of MPI didn't leverage shared-memory capability, meaning that if you ran two MPI ranks (i.e., two processes in the same process group) on an SMP machine, those two processes would use inter-process communication methods to access each other's data. Starting from MPI-3, MPI offers shared-memory capability as well.
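As a minimal OpenMP sketch (the file name is hypothetical; compile with an OpenMP flag such as gcc -fopenmp), all threads of one process read and write the same shared array:

```c
/* omp_sum.c -- shared-memory parallelism with OpenMP.
   Compile: gcc -fopenmp omp_sum.c -o omp_sum */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];   /* one array, shared by all threads */
    double sum = 0.0;

    /* each thread fills and sums a slice of the same shared array */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```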

Of course, in the SMP architecture, memory access latency depends on memory bus contention, i.e., memory access conflicts. As the number of cores connected to a single memory bus grew, this contention quickly became an issue and the performance boost expected from multiple CPUs started to disappear. Cache memory architectures, integrated memory controllers (IMCs) and multi-channel memory architectures were developed to reduce this contention in SMP architectures. In the modern hardware that we offer on Kogence, the memory architecture is quite complex in order to deal with possible contention. Please see the Intel Cascade Lake and Intel Skylake architecture documentation pages as well as other sections of this document.

SMP architectures are further broken down into two conceptual frameworks -- Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA). UMA architectures are basically the earliest SMP architectures where, in the absence of contention, the memory access latency for each CPU is exactly the same. These UMA SMP architectures are clearly not scalable as more cores and CPUs are added. The SMP architectures that we offer on Kogence are NUMA architectures. In NUMA SMP architectures, each CPU does not necessarily have the same access latency to the entire system RAM but, logically, there is still a single common virtual memory address space for all processes running on the single node. Each CPU can address the same entire virtual memory address space; therefore the architecture is still known as SMP. Older SMP applications run on the newer NUMA memory architecture unchanged -- hence the architecture is still SMP at a logical level.

Within each socket of the architectures we offer, there are two sub-NUMA clusters created by the two IMCs. Each socket gets a mesh interconnect between cores and IMCs, two IMCs and 6 memory channels (3 memory channels per IMC), with the sockets on the same node connected by UPI links. Absent any contention, all the cores in a localization zone (i.e., in a sub-NUMA cluster) have the same memory access latency. One can use OS-provided utilities (such as numactl on Linux) to confine process or thread affinities to within sub-NUMA clusters.

Cache management with these modern NUMA SMP architectures is significantly more complex, as we need to make sure that the potentially multiple duplicate copies of data loaded into the L1 and L2 caches of each core, as well as the L3 cache of each socket, remain in sync as multiple CPUs perform read/write operations. The hardware architectures we offer are cache-coherent NUMA (ccNUMA), which means that cache coherence is ensured by the hardware itself.

Distributed-memory computing (also known as cluster computing or loosely-coupled computing) should be contrasted with tightly-coupled shared-memory (SMP) computing. Clusters consist of "nodes". A node is the largest unit of computing within which cache coherency can be ensured. Different nodes run independent copies of the OS and have their own virtual memory address spaces. Processes running on different nodes need to access data through network hardware (in the case of RDMA) and/or the OS communication stack (in the case of TCP/IP).

### What are "Number of Processes" and "Number of Threads"?

A process is a logical representation of an instance of a program that has been submitted to the CPU and OS, which manage its execution. A process is distinct from a thread -- both user/OS (worker) threads and hardware threads (HTT technology). "Hardware thread" is a confusing term: each hardware thread can execute an independent process or an independent worker thread. On the Kogence platform we refer to a hardware thread as a CPU.

As opposed to threads, processes provide stronger logical isolation, because you are starting independent instances of a program to work independently. Intuitively, starting multiple processes of the same program is similar to starting the same program multiple times. For example, if you start multiple instances of Matlab by typing matlab -desktop & multiple times on the CloudShell terminal, you are starting multiple Matlab processes. You need not be afraid of multiple instances of Matlab corrupting data or files, because if one instance of Matlab opens a file then the other instance cannot access it. Similarly, each instance creates its own temporary data in memory to work on, so you do not have to think about controlling access to variables in memory.

Using multi-processing libraries such as MPI to start multiple instances of Matlab is a much preferred method compared to the above-mentioned method of starting multiple instances on a shell terminal. As an example, if the code running under each instance of Matlab tries to access the same file, then all instances except one will crash, complaining that they cannot access the file. If you start processes through standard libraries such as MPI, you can be assured that MPI will manage those things; it will keep a process waiting until the other processes close the file.

At a lower level, a process is defined by a Process Control Block (PCB). The PCB stores the execution state of a process: all the information that the OS and CPU need to freeze or unfreeze it. Any set of instructions submitted to the CPU that comes with its own PCB is a process. Each process gets a unique process ID (PID) that you can check using OS utilities, so any task that has a unique PID is a process. As an example, if you execute a bash shell script, the bash shell started to execute the script's instructions gets a PID and is a process. If from that shell script you call another bash shell script, that script is executed in another, child bash shell, which gets its own PCB and PID.

Modern OSes are smart enough not to duplicate the machine code for each instance in main memory and in cache; for example, they load only one copy of the text (code) segment into memory. But each process gets its own program counter, a CPU register that keeps track of which instruction is currently being executed. The program counter is part of the PCB and gets saved and restored when a process is taken out of the running state by the OS and later brought back into it. Typically, the virtual memory allocated to a process, and the files and other I/O opened by a process, are restricted to that process; other processes cannot access those resources. But processes can specifically issue instructions to share these resources with other processes. In HPC we do this by using standard libraries such as MPI.

Processes can start child processes (and those can start grandchild processes, etc.). Each of those gets its own PCB (and therefore a PID). Processes can also start multiple threads. Threads operate under the same PCB; they do not get their own PCB (instead they get a Thread Control Block, TCB, which has a link to the PCB of the parent process). Threads can access the resources of the process.
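A small POSIX sketch of this distinction (the file name is hypothetical): a forked child gets its own PID (and PCB), while a thread runs under its parent's PID (a TCB linked to the parent's PCB):

```c
/* pid_demo.c -- processes get their own PID; threads share their parent's.
   Compile: gcc pid_demo.c -pthread -o pid_demo */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

void *worker(void *arg) {
    printf("thread: getpid() = %d\n", (int)getpid());  /* same PID as parent */
    return NULL;
}

int main(void) {
    printf("parent: getpid() = %d\n", (int)getpid());

    pid_t child = fork();            /* new process: new PCB, new PID */
    if (child == 0) {
        printf("child:  getpid() = %d\n", (int)getpid());
        return 0;
    }
    waitpid(child, NULL, 0);

    pthread_t t;                     /* new thread: same PID as parent */
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```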

Most HPC solvers can also do multi-processing as opposed to multi-threading, meaning multiple CPUs across multiple nodes can be instructed to run independent instances of the same solver code (child processes), each operating on a specific set of data in parallel. As the children may be running on different nodes, they don't have access to the memory of the parent process on the parent compute node. Child processes exchange data with each other and with the parent using Message Passing Interface (MPI) library subroutines. If MPI processes are running on the same node, they do have the ability to access shared memory as well, but all requests to access data/memory still go through MPI subroutines, which manage the mode of access automatically by keeping track of which child process is running on which node.

On Kogence, we restrict the product of the number of processes and the number of threads to be at most the number of CPUs in the cluster. This eliminates serious overhead from context switching.

### What is the Overhead of Process Context Switching?

A context switch changes which instruction stream a given core is executing at a given time. It can refer to switching from one hardware thread to another (HTT technology), from one user (worker) thread to another, or from one process to another. Switching from one hardware thread to another has negligible overhead. This FAQ discusses the overhead of a process context switch; another FAQ discusses the overhead of a worker-thread context switch.

When a request is made to execute a different process on the same CPU (either triggered by a user code instruction, i.e., a voluntary context switch, or by the OS scheduler, because the running process went idle waiting for some I/O or because the scheduler wants to give processor time to other processes, i.e., an involuntary context switch), first a switch from user mode to OS mode is triggered. This simply means that a subroutine from the OS code base is called to perform the context switch; OS code is just like usual user code loaded into physical memory. This OS code saves the process control block (PCB, which consists of the current values of a set of CPU registers such as the program counter, the pointer to the page table, etc.) of the current process to physical memory and loads the PCB of the new process from physical memory into the CPU registers. The OS then switches from OS mode back to user mode to let the CPU start executing instructions from the new process. In older hardware and OS architectures, a switch from user mode to OS mode was itself implemented like a full process context switch: saving the existing process PCB, loading the OS PCB, executing OS code, saving the OS PCB and then loading the new process PCB. In modern hardware and OSes, the cost of switching out of user mode to OS mode and back is much less expensive and is no longer a significant portion of the cost of a context switch. Also, on virtualized cloud HPC hardware, some hypervisors need to give control to the host machine OS because the guest OS cannot perform this switch; this increases the computational cost of a context switch by an order of magnitude. On the Kogence platform, we have carefully configured the system to eliminate this latency.

Therefore the fixed cost of a context switch consists of: switching from user mode to OS mode; storing and loading the PCB; and switching back from OS mode to user mode. On the Kogence cloud HPC platform, switching into OS mode and back to user mode typically costs about 100 CPU clock cycles; with CPU clocks of ~2 to 3GHz, this round trip takes ~50ns. Among the fixed cost components, the cost of storing and loading the process PCB is much larger: easily 2 orders of magnitude bigger, at about 2,000 CPU cycles or ~1µs.

A much bigger portion of the cost is variable. The variable CPU-cycle cost of a context switch consists of flushing the cache and flushing the TLB. Note that each process uses the same virtual memory address space, but each is mapped to a different physical memory address space through its page table. A small portion of the page table is cached in the TLB. Depending on the number of virtual memory pages that the old and new processes use (the working set), the CPU might encounter many TLB misses on a context switch, so it may have to flush the TLB and load new sections of the page table. In addition, the physical memory frames of the old process that were cached at various levels may become useless; the CPU might encounter many cache misses and may have to flush the cache and load new physical memory frames into it.

The variable CPU-cycle cost of switching from an old process to a new process and back changes dramatically depending on the process pinning you instruct the OS scheduler to use. Let's say CPU1 is executing process1, then the OS asks the same CPU1 to start executing process2, and after some time the OS wants to restart process1. The number of CPU cycles wasted before process1 can successfully resume changes dramatically depending on whether the OS schedules process1 back onto the original CPU1 or onto a new CPU2 (starting, say, a process3 on the original CPU1 instead). Note that each core gets its own L1 and L2 cache while the L3 cache is shared between cores. Since the address translations for the virtual memory pages that process1 needs may already be present in the TLB of CPU1, and the required physical memory frames may already be present in the L1 or L2 cache of CPU1 (since process1 was running on CPU1 a short time back), process1 may restart on CPU1 much more quickly than on CPU2. If the new CPU2 sits on a different socket, the cost is even higher, as both the TLB and all levels of cache need to be repopulated. Incidentally, process pinning affects the fixed cost of context switching as well, because the PCB is then also not available in cache and needs to be read from main physical memory.

Variable cost also depends strongly on the size of the virtual memory pages a process uses. If processes use very large page sizes, the chance of needing to flush the TLB may be lower (as fewer TLB entries are needed) but the chance of needing to flush the cache may increase.

All of the above depends strongly on the size of the caches in the CPU you're using. Typically, if pages are available in cache, writing them can easily take 100,000 cycles or a few microseconds. If we miss the cache and the TLB then this can increase by an order of magnitude, to a few tens of microseconds. As discussed above, this also depends on the process pinning instructions your code gives to the OS. Past a certain working-set size, the fixed cost of context switching is negligible compared to the variable cost of accessing memory. For all these reasons, it is very hard to put a representative/average number on the cost of context switching.

In summary,

• Switching from user mode to OS mode and back takes about 100 CPU cycles or about 50ns.
• A simple process context switch (i.e., without cache and TLB flushes) costs about 2K CPU-cycles or about 1µs.
• If your application uses any non-trivial amount of data that would require flushing of the TLB and cache, assume that each context switch costs about 100K CPU-cycles or about 50µs.
• As a rule of thumb, just copying 30KB of data from one memory location to another takes about 10K CPU-cycles or about 5µs.
• Launching a new process takes about 100K CPU cycles or about 50µs.
• In the HPC world, creating more active threads or processes than there are hardware threads available is extremely detrimental (e.g., in 100K CPU-cycles, a CPU could have executed 400K floating-point operations). If the number of worker threads is the same as the number of hardware threads, it is easier for the OS scheduler to keep re-scheduling the same threads on the CPUs they last used (weak affinity). The recurrent cost of context switches when an application has many more active threads than hardware threads is very high. This is why, on Kogence, we do not allow the product of processes and threads to be higher than the number of CPUs.
• On interactive nodes or master nodes, we do not restrict the number of jobs, processes and threads you can start. An interactive node has lots of background processes taking small amounts of CPU time, which means the number of threads can exceed the number of CPUs. Threads tend to bounce around a lot between CPUs, and in this case the costs of context switching and thread switching don't significantly differ in practice. For HPC workloads started on interactive or master nodes, you should pin processes and threads to specific CPUs to avoid this overhead.

For worker-thread context switches:

• A thread switch within the same CPU (with proper CPU pinning) and with no need for TLB and cache flushes takes about 100 CPU-cycles or about 50ns.
• A thread switch going to a different CPU costs about the same as a process context switch, i.e., about 2K CPU-cycles or about 1µs. Here we assume that the virtual memory the thread needs is already in cache and we don't need TLB and cache flushing.
• If your application uses any non-trivial amount of data that would require flushing of the TLB and cache, assume that each thread context switch costs about 100K CPU-cycles or about 50µs.

### How Can I Monitor CPU Utilization on Kogence Cloud HPC Servers?

When using a tool like top or gnome-monitor on Kogence HPC servers, you will see CPU usage reported as being divided into 8 different CPU states. For example:

    %Cpu(s): 13.2 us,  1.3 sy,  0.0 ni, 85.3 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st

These eight CPU states are: “user” (us), “system” (sy), “nice” (ni), “idle” (id), “iowait” (wa), “hardware interrupt” (hi), “software interrupt” (si), and “steal” (st). top shows the percentage of time the server spends in each of the eight possible states. Of these 8 states, “system”, “user” and “idle” are the 3 main CPU states. The ni state is a subset of the us state and represents the fraction of CPU time being spent on low-priority tasks. The wa state is a subset of the id state and represents the fraction of CPU time being spent waiting for an I/O operation to complete. These 3 main CPU states plus the si, hi and st states add up to 100%.

Please note that these figures are averaged over all CPUs of your HPC server. So if you started an 8-CPU HPC server, top shows utilization averaged across all 8 CPUs. You can press 1 to get per-CPU statistics.

• system (sy)

The “system” CPU state shows the amount of CPU time used by the kernel. The kernel is responsible for low-level tasks like interacting with the hardware, memory allocation, communicating between OS processes, running device drivers and managing the file system. Even the CPU scheduler, which determines which process gets access to the CPU, is run by the kernel. While usually low, system-state utilization can spike when a lot of data is being read from or written to disk, for example. If it stays high for longer periods of time, you might have a problem. For example, if the CPU is doing a lot of context switching, you will see it spending much more time in the system state.

• user (us)

The “user” CPU state shows CPU time used by user-space processes. These are processes like your application, or management daemons and applications started automatically by Kogence that run in the background. In short, all CPU time used by anything other than the kernel is marked “user” (including the root user), even if it wasn’t started from any user account. If a user-space process needs access to the hardware, it needs to ask the kernel, and that time counts towards the “system” state. Usually, the “user” state uses most of your CPU time. In properly coded HPC applications, it can stay close to the maximum of 100%.

• nice (ni)

The “nice” CPU state is a subset of the “user” state and shows the CPU time used by processes that have a positive niceness, meaning a lower priority than other tasks. The nice utility is used to start a program with a particular priority. The default niceness is 0, but can be set anywhere between -20 for the highest priority to 19 for the lowest. CPU time in the “nice” category marks lower-priority tasks that are run when the CPU has some time left to do extra tasks.

• idle (id)

The “idle” CPU state shows the CPU time that’s not actively being used. Internally, idle time is usually calculated by a task with the lowest possible priority (using a positive nice value).

• iowait (wa)

“iowait” is a subcategory of the “idle” state. It marks time spent waiting for input or output operations, like reading from or writing to disk. When the processor waits for a file to be opened, for example, the time spent will be marked as “iowait”. If, instead, a task running on a given CPU blocks on a synchronous I/O operation, the kernel will suspend that task and allow other tasks to be scheduled on that CPU; in that case the CPU is not idle and this time will not show up in the id or wa states.

• hardware interrupt (hi)

The CPU time spent servicing hardware interrupts.

• software interrupt (si)

The CPU time spent servicing software interrupts.

• steal (st)

The “steal” (st) state marks time taken over by the hypervisor.

### How Can I Monitor Memory Utilization on Kogence Cloud HPC Servers?

One can use the top or gnome-monitor utilities to monitor memory utilization.

## Cloud HPC Network Architecture

### Is the Kogence Cloud HPC Cluster Connected to the Internet?

If you are starting your Simulation on a single cloud HPC server, your HPC server is connected to the internet. Once your Model is in the running state, you can click on the Visualizer button in the top right corner to connect to your HPC server over the internet through a secure, encrypted SSL/TLS channel. If you connect a CloudShell to the software stack of your Model through the Stack tab, you can use the CloudShell terminal to pull repositories over the internet using pip, git pull, curl, wget, etc.

If you are starting an autoscaling cloud HPC cluster, then the master node of the cluster is connected to the internet in the same way.

Please refer to the Network Architecture document for more details.

### How is the Network Architected Among the Compute Nodes?

Cloud HPC clusters you start on Kogence are equipped with a standard TCP/IP network as well as InfiniBand-like OS-bypass networks. Please refer to the Network Architecture document for more details.

### Do Kogence Cloud HPC Clusters Come with an InfiniBand Network?

Yes. If you select a Network Limited Work Load node, then that node is equipped with an OS-bypass network with network bandwidth up to 100Gbps. Please refer to the OS Bypass Network and Remote Direct Memory Access documents for more details.

### How is the Network Configured Among the Containers?

Please refer to the Container Network document for details. The OS-bypass RDMA network is also available to workloads inside containers if it is accessible to the HPC server itself.

## Cloud HPC Parallel Computing Libraries and Tools

### What is BLAS?

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software.

On Linux systems, if you don't configure your system properly, you would be using the default GNU BLAS (such as /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1), which is a generic library not optimized for the hardware. There are several highly optimized BLAS libraries (such as OpenBLAS, ATLAS BLAS, GotoBLAS and Intel MKL) that can be used instead of the default base libraries. These libraries are optimized to take advantage of the hardware they run on, and can be significantly faster than the base implementation (operations such as matrix multiplication may be over 40 times faster). Kogence does this automatically for you based on the hardware you selected on the Cluster tab of your model.
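For illustration (assuming the CBLAS C interface that implementations such as OpenBLAS ship; the file name is hypothetical), a Level 3 BLAS matrix-matrix multiply looks like this:

```c
/* dgemm_demo.c -- C = alpha*A*B + beta*C with Level 3 BLAS.
   Link against an optimized implementation, e.g.: gcc dgemm_demo.c -lopenblas */
#include <cblas.h>
#include <stdio.h>

int main(void) {
    /* 2x2 matrices stored in row-major order */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* M, N, K */
                1.0, A, 2,        /* alpha, A, leading dimension of A */
                B, 2,             /* B, leading dimension of B */
                0.0, C, 2);       /* beta, C, leading dimension of C */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```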

### What is BLACS?

The BLACS (Basic Linear Algebra Communication Subprograms) is a linear-algebra-oriented message-passing interface for distributed-memory cluster computing. It provides the basic communication subroutines used in the PBLAS and ScaLAPACK libraries, presenting a portable message-passing layer to those higher-level libraries. To provide this functionality, BLACS can be built on any of the communication layers available on the hardware, such as MPI, PVM, NX, MPL, etc., and provides a hardware-agnostic interface to the higher-level libraries. It is important to build BLACS on a hardware-optimized communication layer to get the most optimized performance for your HPC applications.

### What is PBLAS?

PBLAS (Parallel BLAS) is the distributed-memory version of the Level 1, 2 and 3 BLAS library. BLAS is used for shared-memory parallelism, while PBLAS is used for the distributed-memory parallelism appropriate for clusters of parallel computers (heterogeneous computing). PBLAS uses BLACS for distributed-memory communication. Just like BLAS, on Kogence we automatically configure software/solvers/simulators with the appropriate hardware-optimized version of PBLAS based on the hardware you selected on the Cluster tab of your model. No action is needed from users.

### What is LAPACK?

LAPACK is a large, multi-author, Fortran library for numerical linear algebra. It provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision. LAPACK is the modern replacement for LINPACK and EISPACK libraries. LAPACK uses block algorithms, which operate on several columns of a matrix at a time. On machines with high-speed cache memory, these block operations can provide a significant speed advantage.

LAPACK uses BLAS. The speed of subroutines of LAPACK depends on the speed of BLAS. EISPACK did not use any BLAS. LINPACK used only the Level 1 BLAS, which operate on only one or two vectors, or columns of a matrix, at a time. LAPACK's block algorithms also make use of Level 2 and Level 3 BLAS, which operate on larger portions of entire matrices. LAPACK is portable in the sense that LAPACK will run on any machine where the BLAS are available but performance will not be optimized if hardware optimized BLAS and LAPACK are not used.

On Linux systems, if you don't configure your system properly, you would be using the default GNU LAPACK (such as /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1), which is a generic library not optimized for the hardware. There are several highly optimized BLAS/LAPACK libraries (such as OpenBLAS/LAPACK, ATLAS/LAPACK, GotoBLAS/LAPACK and Intel MKL) that can be used instead of the default base libraries. If you configure your system properly with hardware-optimized LAPACK libraries, your HPC applications get a serious boost in performance. Kogence does this automatically for you based on the hardware you selected on the Cluster tab of your model.
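As a sketch (assuming the LAPACKE C interface that these libraries ship; the file name is hypothetical), solving a small linear system A·x = b with LAPACK's dgesv looks like this:

```c
/* solve_demo.c -- solve A*x = b via LU factorization (dgesv).
   Link e.g.: gcc solve_demo.c -llapacke -llapack -lblas */
#include <lapacke.h>
#include <stdio.h>

int main(void) {
    double A[4] = {2, 1,       /* 2x2 system, row-major */
                   1, 3};
    double b[2] = {3, 5};      /* right-hand side; overwritten with x */
    lapack_int ipiv[2];        /* pivot indices from the LU factorization */

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
    if (info != 0) {
        printf("dgesv failed: info = %d\n", (int)info);
        return 1;
    }
    printf("x = [%g, %g]\n", b[0], b[1]);
    return 0;
}
```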

### What is ScaLAPACK?

ScaLAPACK (Scalable LAPACK) library includes a subset of LAPACK routines redesigned for distributed memory clusters of parallel computers (heterogeneous computing). ScaLAPACK uses explicit message passing for inter-process communication using MPI or PVM (this functionality is provided by the BLACS that is used in ScaLAPACK). Just like LAPACK, ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy that includes the off-processor memory of other processors (or processors of other computers), in addition to the hierarchy of registers, cache, and local memory on each processor. ScaLAPACK uses PBLAS and BLACS. ScaLAPACK is portable in the sense that ScaLAPACK will run on any machine where PBLAS, LAPACK and the BLACS are available but performance will not be optimized if hardware optimized PBLAS and LAPACK are not used.

Kogence uses ScaLAPACK automatically whenever appropriate, based on the hardware you selected on the Cluster tab of your model. No action is needed from users.

### What is ELPA?

ELPA (Eigenvalue SoLvers for Petaflop-Applications) is a matrix diagonalization library for distributed-memory clusters of parallel computers (heterogeneous computing). For some operations, ELPA methods can be used instead of ScaLAPACK methods to achieve better performance. The ELPA API is very similar to the ScaLAPACK API, so it is convenient for HPC application developers to support both libraries and offer the ability to link ELPA in addition to ScaLAPACK during the application build. ELPA does not provide the full functionality of ScaLAPACK and is not a complete substitute. Whenever an HPC application deployed on Kogence offers support for ELPA (e.g., Quantum Espresso), we provision it with a hardware-optimized ELPA.

### What is FFTW?

FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data, as well as of even/odd data, i.e. the discrete cosine (DCT) and sine (DST) transforms. At Kogence we automatically use the hardware optimized versions of FFTW based on the hardware you selected on the Cluster tab of your model.

### What is Intel MKL?

Intel Math Kernel Library (Intel MKL) is a library that includes Intel's hardware optimized versions of BLAS, LAPACK, ScaLAPACK and FFTW, as well as some miscellaneous sparse solvers and vector math subroutines. The routines in MKL are hand-optimized specifically for Intel processors. On Kogence, Intel MKL is automatically configured and linked whenever appropriate based on the hardware you selected on the Cluster tab of your model. No action is needed from users.
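
For users compiling their own code, the simplest way to link against MKL is usually its single dynamic library, mkl_rt. A minimal sketch, assuming the Intel environment scripts have already set MKLROOT and that app.c is your own source file:

gcc app.c -o app -m64 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -lmkl_rt -lpthread -lm -ldl   # mkl_rt selects the interface/threading layer at run time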

### What is MPI?

Message Passing Interface (MPI) is a portable message-passing standard designed to function on a wide variety of parallel computing architectures. The standard defines the syntax and semantics of a library of routines useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. HPC applications, solvers and simulators written using MPI library routines, and properly compiled and linked against MPI libraries, enable multiprocessing. Multiple processes of an MPI application can run in parallel either on a single multi-CPU cloud HPC server or on an autoscaling cluster of multiple multi-CPU cloud HPC servers.

Traditionally, the inter-process communication used by MPI is relatively slow. This is expected when two MPI processes of an application are running on different nodes, as the communication has to go through the network hardware. Older versions of MPI used the same inter-process communication framework even when the two MPI processes of the application were running on the same multi-CPU server. OpenMP, on the other hand, is a library for multithreading. OpenMP applications let you start multiple threads of a single process of the application. A process, and therefore all its threads, necessarily runs on a single cloud HPC server. Inter-thread communication is much faster than inter-process communication because multiple threads of the same process have access to the same memory, so such communication can avoid routing data exchange through network hardware. This discrepancy in communication speed when shared memory is available has led to the adoption of hybrid multiprocessing-multithreading parallel programming constructs. If an application/solver provides support for both MPI and OpenMP then one can use both MPI (multiprocessing) and OpenMP (multithreading) at the same time; this is sometimes known as hybrid parallelism. The user is recommended to start as many MPI processes as there are nodes in the cluster, and within each process to start as many OpenMP threads as there are CPUs (with hardware hyperthreading) or cores (without hardware hyperthreading) in each node. This MPI/OpenMP approach uses an MPI model for communicating between nodes while utilizing groups of threads running on each computing node in order to take advantage of the faster communication enabled by shared memory multicore/multi-CPU architectures. On Kogence, users can configure such choices very conveniently on the Stack tab of the Model by selecting the number of processes and threads independently from dropdown menus. Alternatively, users can use command line options of qsub and mpirun to configure such choices (see below for details, and the example that follows).
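
As an illustration of the recommendation above, a hybrid launch on a hypothetical 4-node cluster with 16 cores per node could look like the sketch below (the ./solver binary is a placeholder, and the options shown are Open MPI style options discussed later on this page):

export OMP_NUM_THREADS=16   # one OpenMP thread per core within each MPI process
mpirun -np 4 --map-by ppr:1:node --bind-to none ./solver   # one MPI process per node; leave thread placement to the OpenMP runtime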

The MPI-3 standard introduces another approach to hybrid parallel programming that uses the new MPI Shared Memory (SHM) model. The MPI SHM model enables incremental changes to existing MPI codes in order to accelerate communication between processes on shared-memory nodes. Any HPC application/solver that supports the MPI SHM model is preconfigured on Kogence to use shared-memory communication when two processes of an application are running on the same node.

For shared memory semantics across a cluster of nodes, please check OpenSHMEM. Newer versions of MPI libraries have started offering capabilities to run OpenSHMEM over MPI. Of course, the HPC application/solver itself must be programmed to support OpenSHMEM.

Furthermore, MPI inter-process communication traditionally used the standard TCP/IP stack. Newer versions of MPI support various hardware network interfaces, either through the Open Fabrics Interface (OFI) or through natively implemented support, which provide OS-bypass Remote Direct Memory Access capabilities and avoid routing inter-process communication through the traditional TCP/IP stack, providing a significant performance boost. On Kogence, users do not have to worry about the details of the hardware and network infrastructure; we configure the appropriate transport layer for the best performance of their HPC applications.

### Kogence mpirun Wrapper

There are several well-tested and efficient implementations of MPI. On Kogence, we currently support the Open MPI (not to be confused with OpenMP), MPICH and Intel MPI (an MPICH derivative) libraries. MPI is at least as old as the world-wide-web itself. This has led to a lot of legacy features and extremely varied ways in which MPI has been integrated with different tools in the HPC stack. As cloud computing and container based dependency isolation and portability become more popular, the struggles of integrating MPI with more modern Docker based computing frameworks have been widely reported by industry and academic researchers. In fact, these struggles led to the development of an alternate containerization technology called Singularity. Although Singularity solves many of the problems of integrating MPI with containerization, at Kogence we have taken a different approach: we have natively integrated Docker with MPI. On Kogence, users do not need to worry about these integration details. Kogence users can bring their own custom MPI HPC application as a Docker container and invoke that MPI application just like they would on any traditional on-prem HPC cluster. All of the Kogence HPC tools, such as data management, version management, remote interactive display orchestration, autoscaling of cloud HPC clusters, and automatic composition of a user's container with all other containers on Kogence, will automatically be available for any newly added custom Docker container.

MPI applications are traditionally invoked using orterun (in the case of Open MPI), mpirun, mpiexec or mpiexec.hydra (in the case of MPICH and Intel MPI), or through the PMI interfaces offered by resource managers (for example, under Slurm one may be able to invoke MPI applications using srun if PMI is properly configured on the system). On Kogence, users invoke MPI applications using the Kogence mpirun wrapper. This should be available in $PATH when you are logged in. You can check by running the which mpirun command on a CloudShell. Make sure you are using the Kogence mpirun wrapper (/tmp/software/mpirun) and not the mpirun wrapper that comes packaged with the various distributions of the MPI libraries or with the OS. The Kogence mpirun wrapper is a powerful integration that allows you to use the same invocation constructs to invoke an MPI application inside a Docker container, inside a Singularity container, or installed on the host machine. Let's say you add 2 containers in the Stack tab of your Kogence Model: container1 has an MPI application built using MPICH and container2 has an MPI application built using Open MPI. Once you run this Model on Kogence, both containers will get binaries automatically composed to invoke each other's entrypoints. If you use the Kogence mpirun wrapper inside container1 to invoke an application binary inside container2, the wrapper will take care of wrapping the commands with the correct MPI command line options and starting the correct daemons and executables, depending on the version of MPI packaged in the target container. It also automatically detects and provides integration with various resource managers such as Slurm, PBS, SGE and LSF. It also automatically scales the cluster of cloud HPC servers and provides integration with the container cluster orchestration frameworks Docker Swarm and Kubernetes. The wrapper automatically detects the version of MPI that is packaged with the containerized application and invokes that MPI executable with the provided command line options after orchestrating a Docker Swarm or Kubernetes cluster of containers. If you are invoking the MPI application using the Stack tab graphical user interface of the Model, then Kogence will automatically choose appropriate MPI command line options for you. Advanced users can use a CloudShell or other scripting to invoke the mpirun wrapper manually with command line options to suit their needs. With the Kogence mpirun wrapper, you can provide any of the following command line options, depending upon the version of MPI packaged in the container that you are invoking. You can check the version of MPI packaged in a container by looking at the Settings tab of the concerned container.

### OpenMPI

#### OpenMPI Slots

• Number of slots is one of the basic concepts in MPI. You can define any number of slots per host in the host file and then provide that host file to MPI using the -hostfile option. By defining the number of slots, you are letting MPI schedule that many processes on that host. If you do not define it, then by default MPI assumes the number of slots to be the same as the number of cores. By using the switch --use-hwthread-cpus, you can override this and tell MPI to assume the number of slots to be the same as the number of CPUs (i.e. the number of hardware threads). Defining more slots than the number of CPUs will cause a lot of context switching and may degrade the performance of your applications.
• With the -np option you can request fewer processes to be launched than there are slots, or you can oversubscribe the slots. Oversubscribing the slots will cause a lot of context switching and may degrade the performance of your application. One can prevent oversubscription by using the -nooversubscribe option. Oversubscription can also be prevented on a per host basis by specifying max_slots=N in the hostfile (resource managers and job schedulers do not share this with MPI; this only works when you are explicitly providing a host file to MPI).
• There are alternative ways to specify the number of processes to launch. If you don't specify anything but provide a host file, then MPI launches as many processes on each host as there are slots. The number of processes launched can also be specified as a multiple of the number of nodes or sockets available, using the -npernode N and -npersocket N options respectively.

#### Most Common OpenMPI Command Line Options

• -c, -n, --n, -np <N>: Run N copies of the program across all nodes in a round-robin fashion by slots. If no value is provided for the number of copies to execute (i.e., neither -np nor its synonyms are provided on the command line), Open MPI will automatically execute a copy of the program on each process slot.
• Both -n and -np are also supported by Intel MPI and MPICH.
• If you wrap your command with mpirun -np N, for example, then MPI will launch N copies of your command as N processes in a process group, in a round-robin fashion by slots. These processes are called rank0, rank1 ... and so on. It is expected that the command you are launching is properly compiled and linked with MPI, so that as soon as these N copies launch they can talk to each other and decide which tasks each will work on. This logic is coded by the application developer. Typically all ranks wait for a message from rank0.
• If you wrap a non-MPI application command with mpirun -np N then each of the N copies will do exactly the same thing (see the example after this list). If you schedule another mpirun wrapped command then MPI will launch another set of processes in another process group. These new processes will also be called rank0, rank1 ... and so on within that new process group. The processes from process-group1 and the processes from process-group2 can only communicate through inter-process communication and not using intra-process communication. They also cannot share memory.
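
For instance, the following sketch launches four copies of the non-MPI command hostname as one process group, which is a quick way to see which hosts received the ranks:

mpirun -np 4 hostname   # four identical copies, placed in round-robin fashion by slots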

#### OpenMPI Command Line Options for Process Mapping to Hardware Resources

• Note that none of the following options imply a particular binding policy - e.g., requesting N processes for each socket does not imply that the processes will remain bound to the socket, as the OS reschedules processes after context switching. One can map the processes to specific objects in the cluster; this affects the initial launch of the processes. One can then further bind the processes to objects in the cluster so that they remain bound to specific cluster objects when the OS reschedules them.
• Also note that, depending on the value of -np, it is possible that a socket or node, for example, gets more processes than specified by the following command options, as the round-robin scheduling continues. The following command options are round-robin mapping parameters, not options to control the total number of processes on each hardware resource.
• The following are all deprecated but still supported on Kogence.
• -npersocket, --npersocket <N> (deprecated in favor of --map-by ppr:N:socket): On each node, launch N processes for each processor socket (i.e. N times the number of sockets on the node). N consecutive processes go to one socket, then the scheduler moves to the next socket, and so on in round robin.
• The -npersocket option also turns on the -bind-to-socket option.
• -N, -npernode, --npernode <N> (deprecated in favor of --map-by ppr:N:node): On each node, launch N consecutive processes, then move to the next node, and so on in round robin.
• Intel MPI and MPICH use -perhost or -ppn.
• -pernode, --pernode (deprecated in favor of --map-by ppr:1:node): On each node, launch one process. Equivalent to -npernode 1.
• The following are preferred on the Kogence cloud HPC platform.
• -map-by ppr:N:socket: Launch N consecutive processes per socket, then move to the next socket in round robin. ppr = "processes per resource".
• -map-by socket: Launch 1 process per socket in round robin. For example, rank0 goes to the first socket, rank1 to the next socket and so on, until all sockets on all nodes have one process; then it restarts from the first socket on the first node as needed.
• -map-by core: Launch 1 process per core in round robin.
• Similarly, you can map-by slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, node, sequential, or distance.
• --map-by socket is the default.

#### OpenMPI Command Line Options for Process Binding to Hardware Resources

• Binding processes to specific CPUs is also possible. This tells the OS that a given process should always stick to a given CPU as the OS does context switching between multiple processes and threads. This can improve performance if the operating system is placing processes suboptimally. For example, when launching fewer processes than the number of CPUs in a node, the OS might oversubscribe some multi-core sockets while leaving other sockets idle; this can lead processes to contend unnecessarily for common resources. Or, the OS might spread processes out too widely; this can be suboptimal if application performance is sensitive to inter-process communication costs. Binding can also keep the operating system from migrating processes excessively, regardless of how optimally those processes were placed to begin with.
• The processors to be used for mapping and binding can be identified in terms of topological groupings - e.g., binding to an l3cache will bind each process to all processors within the scope of a single L3 cache within their assigned location. Thus, if a process is assigned by the mapper to a certain socket, then a --bind-to l3cache directive will cause the process to be bound to the processors that share a single L3 cache within that socket. To help balance loads, the binding directive uses a round-robin method when binding to levels lower than used in the mapper. For example, consider the case where a job is mapped to the socket level and then bound to core. Each socket will have multiple cores, so if multiple processes are mapped to a given socket, the binding algorithm will assign each process located on a socket to a unique core in a round-robin manner. Alternatively, processes mapped by l2cache and then bound to socket will simply be bound to all the processors in the socket where they are located.
• -bind-to socket
• Similarly, you can bind-to slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, and none.
• By default, Open MPI uses --bind-to core when the number of processes is <= 2, --bind-to socket when the number of processes is > 2, and --bind-to none when nodes are being oversubscribed.
• If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or a specific number of processing elements per application process. A combined mapping and binding example follows this list.
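
Putting mapping and binding together, the following sketch (with a hypothetical ./solver binary) launches eight processes, maps two consecutive ranks per socket, pins each process to a core, and prints the resulting layout before launch:

mpirun -np 8 --map-by ppr:2:socket --bind-to core --display-map ./solver   # --display-map shows where each rank lands prior to launch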

#### OpenMPI Command Line Options for Porting Environment Variables

• -x <env>=<value>: Export the specified environment variables to the remote nodes before executing the program. Only one environment variable can be specified per -x option. Existing environment variables can be specified, or new variable names specified with corresponding values. For example: mpirun -x DISPLAY -x OFILE=/tmp/out .... The parser for the -x option is not very sophisticated; it does not even understand quoted values. Users are advised to set variables in the environment and then use -x to export (not define) them.

#### Miscellaneous OpenMPI Command Line Options

• -v, --verbose: Be verbose.
• -V, --version: Print version number. If no other arguments are given, this will also cause orterun to exit.
• -use-hwthread-cpus, --use-hwthread-cpus: Use hardware threads as independent CPUs. Note that if the number of slots on a node is not explicitly provided to Open MPI, then the use of this option changes the default calculation of the number of slots on a node.
• -nooversubscribe: Throw an error if the user is trying to oversubscribe a node (i.e. requesting to start more processes on a node than the number of slots).
• -display-map, --display-map: Display a table showing the mapped location of each process prior to launch.
• -display-allocation, --display-allocation: Display the detected resource allocation.

#### OpenMPI Command Line Options That Should NOT Be Used On Kogence

• -machinefile, --machinefile, -hostfile, --hostfile <hostfile>: Provide a hostfile to use.
• Hostnames in the file are those shown by the hostname command, or the IP addresses through which the launcher can access the hosts.
• For Open MPI, -hostfile and -machinefile are synonymous. For Intel MPI and MPICH, the formats of hostfile and machinefile are different. An example hostfile (comments in a hostfile begin with #):

foo.example.com
bar.example.com slots=2
yow.example.com slots=4 max-slots=4

• -H, -host, --host <host1,host2,...,hostN>: Launch processes on the listed hosts.
• Intel MPI and MPICH use -hosts.

### Intel MPI

#### Most Common Intel MPI Command Line Options

• -n, -np <N>: Launch N processes in total across all nodes. If the option is not specified, the number of cores on each machine is used as the default.
• -perhost, -ppn <N>: Launch N consecutive processes on one host before moving to the next host.
• <N> can also be replaced by all or allcores. all is a synonym for the number of CPUs in the host, while allcores is a synonym for the number of cores in the host.
• Unless specified explicitly, the -perhost option is implied with the value set in $I_MPI_PERHOST, which is set to allcores by default. This means that Intel MPI, by default, launches the first N consecutive processes on the first node, with N being the number of cores in that node, before scheduling processes to the next node.
• Notice the syntax difference with Open MPI, which uses -N, -npernode, --npernode <N>.
• If you are working under a job scheduler or resource manager (as determined by looking at certain environment variables), then these options are ignored. See https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/running-applications/job-schedulers-support.html for more details.
• -rr: Same as -perhost 1. rr = "round robin". Consecutive processes go to different hosts, and only once all hosts are exhausted does the next process go back to host1.
• -nolocal: Do not run any processes on the localhost.
• Remember that Intel MPI will always go by -n or -np and oversubscribe the nodes if needed; -perhost, -ppn, -rr etc. only control the number of consecutive processes on the same node. See the example below.
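
As a quick illustration of the options above, the following sketch (hypothetical ./solver binary) starts eight processes in total, placing four consecutive ranks on each host before moving to the next:

mpirun -n 8 -ppn 4 ./solver   # ranks 0-3 on the first host, ranks 4-7 on the second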

#### Intel MPI Command Line Options for Process Mapping and Binding to Hardware Resources

• Intel MPI allows the user to specify computing resources for process mapping and binding using the logical enumeration of CPUs (as provided by the cpuinfo utility), using hierarchical levels (such as socket, core and CPU), or by topological enumeration. For example, for the case of a 2 socket HPC server with each socket being dual core and each core having 2 CPUs (hyperthreaded), the logical CPU #, the hierarchical CPU # and the topological CPU # would generally be reported like this:

Logical Enumeration:     0 4 1 5 2 6 3 7
Hierarchical Levels:
  Socket:                0 0 0 0 1 1 1 1
  Core:                  0 0 1 1 0 0 1 1
  Thread:                0 1 0 1 0 1 0 1
Topological Enumeration: 0 1 2 3 4 5 6 7
• Intel MPI uses the following 7 main environment variables to control process mapping and binding. Default values for the first 6 variables are shown below.
I_MPI_PIN=on
I_MPI_PIN_RESPECT_CPUSET=on
I_MPI_PIN_RESPECT_HCA=on
I_MPI_PIN_CELL=unit
I_MPI_PIN_DOMAIN=auto:compact
I_MPI_PIN_ORDER=bunch
I_MPI_PIN_PROCESSOR_LIST

• The environment variable I_MPI_PIN_PROCESSOR_LIST controls one-to-one pinning, while the I_MPI_PIN_DOMAIN environment variable performs one-to-many pinning. If I_MPI_PIN_DOMAIN is defined then the I_MPI_PIN_PROCESSOR_LIST setting is ignored.
• If hyperthreading is on, the number of processes on the node is greater than the number of cores, and no process pinning environment variable is set, then the "spread" order will automatically be used instead of the default "compact" order.
• I_MPI_PIN_DOMAIN:
• Defines a separate set of CPUs (domain) on a single node. One process will be started per domain in round robin.
• If this is defined then I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.
• There are 3 different syntaxes; the third (an explicit bit-mask list of domains) is rarely needed, and the two common ones are:
• -env I_MPI_PIN_DOMAIN <arg>: Here <arg> can be either core, socket, node, cache1, cache2, cache3, or cache (the largest domain among cache 1/2/3). If, for example, -env I_MPI_PIN_DOMAIN socket is set then the domain size is one socket and the number of domains is the same as the number of sockets. One process per socket will be started in round-robin fashion.
• -env I_MPI_PIN_DOMAIN=<size>[:<layout>]:
• Here <size> can be auto, omp or a number N. auto means the domain size is the number of CPUs divided by the number of requested processes, meaning the number of domains is the same as the number of processes requested. This is the default. omp specifies the domain size to be the same as the setting of OMP_NUM_THREADS. N specifies the domain size to be equal to N CPUs.
• And <layout> can be platform, compact or scatter. compact means that the domain members are located as close to each other as possible in terms of common resources (cores, caches, sockets, and so on). This is the default. scatter is the opposite of compact and implies that the domain members are located as far away from each other as possible in terms of common resources. platform implies that the domain members are ordered according to their BIOS numbering (platform-dependent numbering).
• As an example, for the case of a 2 socket HPC server with each socket being dual core and each core having 2 CPUs (hyperthreaded), mpirun -n 4 -env I_MPI_PIN_DOMAIN auto:scatter ./a.out implies domain size = 2 (the number of CPUs, 8, divided by the number of processes, 4) and domain layout = scatter. This means that four domains {0,2}, {1,3}, {4,6}, {5,7} are defined. Domain members do not share any common resources (referring to the hierarchical enumeration shown above, CPU #0 and CPU #2 belong to two different sockets).
• As another example, export OMP_NUM_THREADS=2; mpirun -n 4 -env I_MPI_PIN_DOMAIN omp:platform ./a.out implies domain size = 2 (defined by OMP_NUM_THREADS=2) and layout = platform. In this case, four domains {0,1}, {2,3}, {4,5}, {6,7} are defined, with the domain members (CPUs) having consecutive numbering.
• I_MPI_PIN_ORDER <arg>:
• Here <arg> can be range, scatter, spread, compact, or bunch. bunch implies that the processes are mapped proportionally to sockets and the domains are ordered as close as possible on the sockets, meaning a node that has more sockets gets proportionately more processes and consecutive processes go to separate domains that share as much as possible with the previous domain. This is the default. compact means that the domains are ordered such that adjacent domains have maximal sharing of common resources. This is similar to bunch except that processes are not mapped proportionally to the number of sockets. scatter means that the domains are ordered such that adjacent domains have minimal sharing of common resources. spread means that the domains are ordered consecutively with the possibility of not sharing common resources. range means that the domains are ordered according to the processor's BIOS numbering; this is a platform-dependent numbering.
• This variable defines the order of mapping of MPI processes to the MPI_PIN_DOMAINs. Contrast this variable with I_MPI_PIN_DOMAIN=<size>[:<layout>], where <layout> defines the organization of CPUs within a domain.
• The optimal setting for this environment variable is application-specific. If adjacent MPI processes prefer to share common resources, such as cores, caches, sockets, FSB, use the compact or bunch values. Otherwise, use the scatter or spread values. Use the range value as needed.
• I_MPI_PIN_CELL <arg>: Here <arg> can be unit (which means a logical CPU) or core (which means a physical core).
• This environment variable affects both pinning types: one-to-one pinning through the I_MPI_PIN_PROCESSOR_LIST environment variable and one-to-many pinning through the I_MPI_PIN_DOMAIN environment variable. The default value rules are: if you use I_MPI_PIN_DOMAIN, then the cell granularity is unit (that is, one member of a domain is one logical CPU). If you use I_MPI_PIN_PROCESSOR_LIST, then the following rules apply: when the number of processes is greater than the number of cores, the cell granularity is unit; when the number of processes is equal to or less than the number of cores, the cell granularity is core.
• I_MPI_PIN_PROCESSOR_LIST:
• Does one-to-one pinning, as opposed to I_MPI_PIN_DOMAIN.
• One-to-one pinning is not recommended for multi-threaded applications (i.e., hybrid parallelism).
• 3 different syntaxes:
• Syntax 1: To pin the processes to CPU0 and CPU3 on each node globally, use the following command: mpirun -genv I_MPI_PIN_PROCESSOR_LIST=0,3 -n <number-of-processes> <executable>. A format like this is acceptable as well: 0,1,2,4-9,10,12,17-19,22.
• Syntax 2: I_MPI_PIN_PROCESSOR_LIST=all[:grain=cache3][,shift=<shift>][,preoffset=<preoffset>][,postoffset=<postoffset>]. The all can be replaced by allcores or allsocks; allcores is the default. The grain is the granularity. The grain value must be a multiple of the procset (i.e. all, allcores or allsocks) value; otherwise, the minimal grain value is assumed. The default value is the minimal grain value. The grain can be <n> (meaning n CPUs), fine, core, cache, cache1, cache2, cache3, socket, half (i.e. half of a socket), third, quarter or octavo (i.e. 1/8th of a socket).
• Syntax 3: I_MPI_PIN_PROCESSOR_LIST=all:map=scatter. You can use map=bunch or map=spread as well. Instead of all, you can also use allcores or allsocks. bunch means that the processes are mapped proportionally to sockets and ordered as close as possible on the sockets. scatter means that the processes are mapped as remotely as possible so as not to share common resources: FSB, caches, and cores. spread means that the processes are mapped consecutively with the possibility of not sharing common resources. See the example after this list.
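
Tying these variables together, a sketch of one-to-one pinning on the 8-CPU node from the enumeration example above (hypothetical ./solver binary):

mpirun -n 4 -genv I_MPI_PIN_PROCESSOR_LIST=0,2,4,6 ./solver   # pin each of the 4 ranks one-to-one to the listed logical CPUs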

#### Intel MPI Command Line Options for Porting Environment Variables

• -env <envar> <value>: Set the environment variable <envar> to <value> for all MPI processes.
• -envall: Propagate all environment variables to all MPI processes.
• -envnone: Do not propagate any environment variables.

#### Running Intel MPI and OpenMP Hybrid Parallel

• source <install-dir>/env/vars.sh release: Make sure the "thread safe" version of MPI is being used.
• export I_MPI_PIN_DOMAIN=omp
• export OMP_NUM_THREADS=<N>

or alternatively:

• mpirun -n 4 -env OMP_NUM_THREADS=4 -env I_MPI_PIN_DOMAIN=omp ./myprog

#### Miscellaneous Intel MPI Command Line Options

• -info: Print build information; all other options are ignored.
• -V, -version: Prints version.
• -v, -verbose: Print debug info.
• -print-rank-map: Print MPI rank map.
• -iface ib0: Use ib0 network interface.

#### Intel MPI Command Line Options That Should NOT Be Used On Kogence

• -hostfile, -f <hostfile>. Hostfile format is just one hostname or IP address per line:
host1
host2
host3

• -machinefile or -machine <machinefile>: The machinefile format, on the other hand, can specify the number of processes (i.e. slots in Open MPI terminology) per host. The process count is assumed to be 1 if not specified:
host1:4
host2:8
host3
host4

• -hosts <host1,host2,host3>: Launch processes on the specified list of hosts.
• -host <host1>: Can only specify one host. Notice the difference compared to the Open MPI syntax.

#### Intel MPI with external PMI (instead of Hydra)

• The default process manager is Hydra. Hydra supports many different process launchers (with the default being ssh), which can be changed using the -bootstrap command line option.
• Although not documented by Intel, Intel MPI can be invoked through the PMI, PMI2 or PMIx interfaces instead of Hydra if the system is properly configured. Please refer to https://slurm.schedmd.com/mpi_guide.html#intel_mpi.

### MPICH

#### Most Common MPICH Command Line Options

• MPICH uses the Hydra process manager just like Intel MPI. As such, all command line options that Intel MPI uses can also be used with MPICH, except that the -bootstrap option used in Intel MPI to select a non-default process launcher is called -launcher in MPICH. Please refer to https://www.mpich.org/static/downloads/3.4/mpich-3.4-README.txt for more details.
• The machinefile format of MPICH is the same as that of Intel MPI.
• For process mapping and process binding, instead of the environment variables used by Intel MPI, MPICH uses -bind-to and -map-by just like Open MPI.

### What is OpenMP?

OpenMP is an application programming interface (API) that supports shared-memory multithreaded programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. If an application/solver provides support for both MPI and OpenMP then one can use both MPI (multiprocessing) and OpenMP (multithreading) at the same time; this is sometimes known as hybrid parallelism. A small example follows.
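
The thread count and thread placement of an OpenMP application are typically controlled through the standard OpenMP environment variables, without touching the code. A minimal sketch (hypothetical ./solver binary built with OpenMP support):

export OMP_NUM_THREADS=8    # start 8 threads per process
export OMP_PLACES=cores     # one place per physical core
export OMP_PROC_BIND=close  # pack threads onto places near the master thread
./solver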

### What is Charm++?

Similar to MPI, Charm++ is a machine independent parallel programming system. Programs written using this system will run unchanged on MIMD computing systems with or without shared memory. It provides high-level mechanisms and strategies to facilitate the task of developing even highly complex parallel applications. Charm++ can use MPI as the transport layer (aka Charm over MPI), or you can run MPI programs using the Charm++ Adaptive-MPI (AMPI) library (aka MPI over Charm). Charm++ can also use the TCP/IP stack or other OS-bypass network protocols for its messaging needs. The communication protocols and infrastructures supported by Charm++ are UDP, MPI, OFI, UCX, Infiniband, uGNI, and PAMI. Charm++ programs can run without changing the source on all these platforms.

MPI is a much more widely adopted standard, and as of today only a few HPC applications, such as NAMD, use Charm++. Charm++ does offer some benefits over MPI, such as relative ease of programming for developers, fault tolerance, and a separate machine layer, so that the application developer does not have to distinguish whether the user executes the application leveraging multi-threading parallelism on shared memory SMP architectures or multi-processing parallelism on distributed memory clusters.

The Charm++ software stack has three components: Charm++, Converse and a machine layer. In general, the machine layer handles the exchange of messages among nodes and interacts with the next layer in the stack, Converse. Converse is responsible for the scheduling of tasks (including user code) and is used by Charm++ to execute the user application. Charm++ is the top-most level, in which the applications are written.

#### Building Charm++ and Charm++ Applications

Charm++ can be built to suit different OS, CPU and networking architectures (i.e. the different networking protocols detailed above). SMP capabilities can be requested as an option during the build. SMP capabilities allow hybrid parallelism (multi-threading + multi-processing). "Multicore" builds do not carry any transport layer and therefore can only be used on a single SMP machine. The benefit of a multicore build over an SMP build (this comparison is of course only applicable when the user wishes to run applications on a single machine) is that in SMP mode Charm++ spawns one communication thread per process, so there is one less thread per process available for application parallelism. Multicore builds can use all available CPUs for worker threads. Also, multicore builds only support multi-threading. Similarly, any SMP capable version of Charm++ can be built with an integrated OpenMP runtime. Charm++ creates its own thread pool, and if you use the compiler provided OpenMP library while compiling your own application, such as NAMD, then you can end up oversubscribing. It is also better to have a single runtime control all threads/processes (unlike MPI + OpenMP). Charm++ built with the integrated OpenMP runtime can control the OpenMP threads in the application, and you don't need to provide compile-time OpenMP options while compiling your own applications. To use this library in your applications, you have to add -module OmpCharm to the compile flags to link this library instead of the compiler-provided one. Without -module OmpCharm, your application will use the compiler-provided OpenMP library running on its own separate runtime. Also, you don't need to add -fopenmp or -openmp with the gcc and icc compilers respectively; these flags are already included in the predefined compile options when you build SMP capable Charm++ with the OMP option. An example build invocation follows.
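
For reference, Charm++ is normally built with its ./build script. The following is only a sketch, assuming a standard x86_64 Linux machine and the netlrts machine layer; consult the Charm++ manual for the machine layer and options appropriate to your network:

./build charm++ netlrts-linux-x86_64 smp --with-production -j8   # SMP-capable production build
./build charm++ netlrts-linux-x86_64 smp omp --with-production   # same, plus the integrated OpenMP (OmpCharm) runtime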

#### Charm++ Terminology

In terms of physical resources, Charm++ assumes the parallel machine consists of one or more physical nodes, where a physical node is the largest unit over which cache coherent shared memory is feasible (and therefore, the maximal set of cores over which a single process can run). Each physical node may include one or more processor chips (sockets), with shared or private caches between them. Each chip may contain multiple cores, and each core may support multiple hardware threads (SMT/HTT, for example).

A chare is a Charm++ object, such as an array, that you want to distribute and perform computation on.

Charm++ recognizes two logical entities:

• PE (processing element) = can be a thread (SMP mode) or a process (non-SMP mode).
• Logical node = one OS process.

SMP Mode: is like MPI + OpenMP. You can have multiple processes on each physical node and each process can have multiple threads. In Charm++ terminology, you would say the physical node is partitioned into logical nodes (one process = one logical node) and each logical node has multiple PEs (in SMP mode, PE = thread).

Non-SMP Mode: is like just MPI. Each PE is a process. The physical node is partitioned into many logical nodes. Each logical node runs one PE (that is, one process).

Just like in the MPI case, depending on the runtime command-line parameters, a PE (i.e. a thread or a process) may be associated with a "core" or with a hardware thread (e.g. an Intel hyper-thread).

In a Charm++ program, a PE is a unit of mapping and scheduling: each PE has a scheduler with an associated pool of messages. Each chare is assumed to reside on one PE at a time. Physical nodes may be partitioned into one or more logical nodes. Since PEs within a logical node share the same memory address space, the Charm++ runtime system optimizes communication between them by using shared memory.

A Charm++ program can be launched with one or more (logical) nodes per physical node. For example, on a machine with a four-core processor, where each core has two hardware threads, common configurations in non-SMP mode would be one node per core (four nodes/PEs total) or one node per hardware thread (eight nodes/PEs total). In SMP mode, the most common choice to fully subscribe the physical node would be one logical node containing seven PEs; one OS thread is set aside per process for network communications. When built in the "multicore" mode that lacks network support, a communication thread is unnecessary, and eight PEs can be used in this case. A communication thread is also omitted when using some high-performance network layers such as PAMI.

Alternatively, one can choose to partition the physical node into multiple logical nodes, each containing multiple PEs. One example would be three PEs per logical node and two logical nodes per physical node, again reserving a communication thread per logical node.

It is not a general practice in Charm++ to oversubscribe the underlying physical cores or hardware threads on each node. In other words, a Charm++ program is usually not launched with more PEs than there are physical cores or hardware threads allocated to it. To run applications in SMP mode, we generally recommend using one logical node (i.e. one OS process) per socket or NUMA domain.

Non-SMP Mode: PE = logical node (= OS process).

Example: 1 physical node, 1 socket, 4 cores, 8 hardware threads: 8 logical nodes = 8 OS processes, each running on a hardware thread.

SMP Mode: logical node = one OS process, containing more than one PE (i.e. more than one OS thread).

Example: 1 physical node, 1 socket, 4 cores, 8 hardware threads: 7 PEs (each running as an OS thread), all within a single process, reserving 1 hardware thread for network communication. If you had multiple physical nodes, there would be one process per physical node with 7 PEs each, and one communication thread per process.

There are various benefits associated with SMP mode. For instance, when using SMP mode there is no waiting to receive messages due to long running entry methods. There is also no time spent in sending messages by the worker threads, and memory is limited by the node instead of per core. In SMP mode, intra-node messages use simple pointer passing, which bypasses the overhead associated with the network and extraneous copies. Another benefit is that the runtime will not pollute the caches of worker threads with communication-related data in SMP mode.

However, there are also some drawbacks associated with using SMP mode. First and foremost, you sacrifice one core to the communication thread. This is not ideal for compute bound applications. Additionally, this communication thread may become a serialization bottleneck in applications with large amounts of communication. Finally, any library code the application may call needs to be thread-safe.

On Kogence, your workflow launch is already optimized with these tradeoffs in mind, both when evaluating whether to use SMP mode for your application and when deciding how many processes to launch per physical node in SMP mode.

#### Charm++ Program Options

charmrun is the standard launcher for Charm++ applications. When compiling Charm++ programs, the charmc linker produces both an executable file and a utility called charmrun, which is used to load the executable onto the parallel machine.

Unlike mpirun, you can pass options either to charmrun or to the program itself. For example: charmrun ++ppn N program program-arguments or charmrun program ++ppn N program-arguments.

Execution on platforms which use platform specific launchers (i.e., mpirun, aprun, ibrun) can proceed without charmrun, or charmrun can be used in coordination with those launchers via ++mpiexec (see below). Programs built using the network version (i.e. non-multicore) of Charm++ can also be run alone, without charmrun; this restricts you to using the processors on the local machine. If the program needs some environment variables to be set for its execution on compute nodes (such as library paths), they can be set in .charmrunrc in your home directory; charmrun will run that shell script before running the executable. Or you can use the ++runscript option (see below).

In all of the below,

• "p" stands for PE (can be thread or process depending on smp of non-smp modes)
• "n" stands for "node" (i.e. logical node = process)
• "ppn" stands for PE per logical node (i.e. PE per Process)
##### Single Machine Options to Charm++ Program

+autoProvision: Automatically determine the number of worker threads to launch in order to fully subscribe the machine running the program.

+oneWthPerHost: Launch one worker thread per compute host. By the definition of standalone mode, this always results in exactly one worker thread.

+oneWthPerSocket: Launch one worker thread per CPU socket.

+oneWthPerCore: Launch one worker thread per CPU core.

+oneWthPerPU: Launch one worker thread per CPU processing unit, i.e. hardware thread.

+pN: Explicitly request exactly N worker threads. The default is 1.

##### Cluster Options to Charm++ Program

Note: Notice the two plus signs (++) in the options below.

++autoProvision: Automatically determine the number of processes and threads to launch in order to fully subscribe the available resources.

++processPerHost N: Launch N processes per compute host.

++processPerSocket N : Launch N processes per CPU socket.

++processPerCore N : Launch N processes per CPU core.

++processPerPU N : Launch N processes per CPU processing unit, i.e. hardware thread.

++nN: Run the program with N processes. Functionally identical to ++pN in non-SMP mode (see below). The default is 1.

++pN: Total number of PEs (threads or processes) to create. In SMP mode this refers to worker threads (where n*ppn = p); otherwise, in non-SMP mode, it refers to processes (n = p). The default is 1. Use of ++p is discouraged in favor of ++processPer* (and ++oneWthPer* in SMP mode) where desirable, or ++n (and ++ppn) otherwise.

++nodelist: File containing the list of nodes. The file is of the format:

host <hostname> <qualifiers>

such as

host node512.kogence.com ++cpus 2 ++shell ssh

++runscript: Script to run the node-program with. For example: ./charmrun +p4 ./pgm 100 2 3 ++runscript ./set_env_script. In this case, set_env_script is invoked on each node before launching the program pgm.

++local: Run the charm program only on the local machine.

++mpiexec: Use the cluster's mpiexec job launcher instead of the built-in ssh method. This will pass -n $P to indicate how many processes to launch. If -n $P is not required because the number of processes to launch is determined via queueing system environment variables, then use ++mpiexec-no-n rather than ++mpiexec. An executable named something other than mpiexec can be used with the additional argument ++remote-shell runmpi, with 'runmpi' replaced by the necessary name. To pass additional arguments to mpiexec, specify ++remote-shell and list them as part of the value after the executable name as follows: ./charmrun ++mpiexec ++remote-shell "mpiexec --YourArgumentsHere" ./pgm. Use of this option can potentially provide a few benefits: faster startup compared to the SSH approach charmrun would otherwise use; no need to generate a nodelist file; and multi-node job startup on clusters that do not allow connections from the head/login nodes to the compute nodes. At present, this option depends on the environment variables of some common MPI implementations. It supports OpenMPI (OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE), M(VA)PICH (MPIRUN_RANK and MPIRUN_NPROCS or PMI_RANK and PMI_SIZE), and IBM POE (MP_CHILD and MP_PROCS).

##### CPU Affinity Options

+setcpuaffinity: Set CPU affinity automatically for processes (when Charm++ is based on non-SMP versions) or threads (when SMP). This option is recommended, as it prevents the OS from unnecessarily moving processes/threads around the processors of a physical node.

+excludecore <core#>: Do not set CPU affinity for the given core number. One can use this option multiple times to provide a list of core numbers to avoid.

##### Additional options in SMP Mode

++oneWthPerHost: Launch one worker thread per compute host.

++oneWthPerSocket: Launch one worker thread per CPU socket.

++oneWthPerCore: Launch one worker thread per CPU core.

++oneWthPerPU: Launch one worker thread per CPU processing unit, i.e. hardware thread.

++ppn N: Number of PEs (or worker threads) per logical node (OS process).

+pemap L[-U[:S[.R]+O]][,...]: Bind the execution threads to the sequence of cores described by the arguments using the operating system's CPU affinity functions. Can be used outside SMP mode. PU = hardware thread. By default, this option accepts PU indices assigned by the OS. The user might want to instead provide logical PU indices used by hwloc (see here for details). To do this, prepend the sequence with the letter L (case-insensitive). For instance, +pemap L0-3 will instruct the runtime to bind threads to PUs with logical indices 0-3.

+commap p[,q,...]: Bind communication threads to the listed cores, one per process.

To run applications in SMP mode, we generally recommend using one logical node (i.e. one OS process) per socket or NUMA domain. ++ppn N will spawn N threads in addition to the 1 thread spawned by the runtime for communication, so the total number of threads will be N+1 per logical node. Consequently, you should map the worker and communication threads to separate cores. Depending on your system and application, it may be necessary to spawn one thread less than the number of cores in order to leave one free for the OS to run on. An example run command might look like: ./charmrun ++ppn 3 +p6 +pemap 1-3,5-7 +commap 0,4 ./app <args>. This will create two logical nodes/OS processes (2 = 6 PEs / 3 PEs per node), each with three worker threads/PEs (++ppn 3). The worker threads/PEs will be mapped thus: PE 0 to core 1, PE 1 to core 2, PE 2 to core 3, PE 3 to core 5, PE 4 to core 6, and PE 5 to core 7. PEs/worker threads 0-2 comprise the first logical node and 3-5 the second logical node. Additionally, the communication threads will be mapped to core 0 for the first logical node and to core 4 for the second logical node. Please keep in mind that +p always specifies the total number of PEs created by Charm++, regardless of mode; the +p option does not include the communication threads, of which there is always exactly one per logical node. The recommendation is not to oversubscribe: don't run more threads/processes than the number of hardware threads available.

### What are Resource Managers and Job Schedulers?

The phrases Resource Manager and Job Scheduler are used interchangeably. Kogence autoscaling cloud HPC clusters are configured with the Grid Engine resource manager. Any command that you invoke on the shell terminal is executed on the master node. If, on the other hand, you want to send your jobs to the compute nodes of your cluster from the shell terminal, then please use the qsub command of the Grid Engine resource manager, like below:

qsub -b y -pe mpi num_of_CPU -cwd YourCommand


Be careful with num_of_CPU. It has to be the same as or less than the number of CPUs in the compute node type that you selected in the Cluster tab of your model. Grid Engine offers a lot of flexibility with many command line switches; please check the qsub man page. Specifically, you might find the following switches useful:

• -pe mpi: Name of the "parallel environment" on Kogence clusters.
• -b y: Command you are invoking is treated as a binary and not as a job submission script.
• -cwd: Makes the current folder the working directory. Output and error files will be generated in this folder.
• -wd working_dir_path: Makes working_dir_path the working directory. Output and error files will be generated in this folder.
• -o stdout_file_name: Job output would go in this file.
• -e stderr_file_name: Job error would go in this file.
• -j y: Sends both the output and error to the output file.
• -N job_name: Gives job a name. Output and error files would be generated with this name if not explicitly specified using -o and/or -e switches. You can also monitor and manage your job using the job name.
• -sync y: By default, the qsub command returns control to the user immediately after submitting the job to the cluster, so you can continue to do other things on the CloudShell terminal, including submitting more jobs to the scheduler. This option tells qsub not to return control until the job is complete.
• -V: Exports the current environment variables (such as $PATH) to the compute nodes.

You can use qready to check if the cluster is ready before submitting jobs. You will submit simulations using the qsub command. You can use qwait job_name to wait for a job to finish before doing some post-processing or submitting a dependent job. We recommend adding a CloudShell to your software stack in the Stack tab GUI. This CloudShell terminal can be used to monitor and manage cluster jobs. You can monitor cluster jobs by typing qstat on the terminal. You can delete jobs using the qdel jobID command. You can monitor the compute nodes in the cluster by typing qhost on the terminal; qhost lets you monitor the performance and utilization of the compute nodes in your cluster. A small end-to-end example follows.
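
Putting the Grid Engine commands above together, a minimal end-to-end session on a CloudShell might look like the sketch below (the ./solver binary and input file are placeholders):

qready                                           # wait until the cluster is ready
qsub -b y -pe mpi 16 -cwd -N job1 ./solver in1   # submit ./solver as a 16-CPU binary job named job1
qwait job1                                       # block until job1 finishes
qstat                                            # verify the queue is empty once job1 is done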