# Help:FAQ

## Miscellaneous Kogence Terminology

### What is a Model?

Every cloud HPC project that users create on Kogence is called a Model. The concept of a Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software or a simulator; we launch a Model. A Model is connected to a cluster and a stack of software and simulators.

Each model consists of:

1. An independent project documentation wiki. The wiki comes with a full-featured WYSIWYG editor. Any graphics generated on execution of your Model are automatically pulled into the project wiki. Permission controls you choose for your Model apply to the Model's wiki as well. A private wiki shows up in the Model Library page only if one of the collaborators with the correct permission logs in.
2. An independent discussion board. Permission controls you choose for your Model apply to the Model's discussion boards as well.
3. An independent control of permissions at the Model level. If you give your collaborators edit permission, they have edit permission on all assets under that Model, including project files, wiki, discussion boards, cluster settings and stack settings. You cannot control permissions at the individual file level, for example.
4. A connected cluster. You can choose to run your Model on a single cloud HPC server, or you can choose to run it on an autoscaling cloud HPC cluster. You can choose the maximum number of nodes up to which your cluster may scale. Nodes are created and deleted automatically based on the progression of your Model execution. Any settings you choose for setting up your cluster remain stored with the Model.
5. A connected software stack. A Model can be connected to multiple software applications. You can create a Workflow using these connected applications. Some of these can run in interactive mode and others can run in unattended batch mode. Some may start on the master node or the interactive node and others may be scheduled on the compute nodes of the autoscaling cloud HPC cluster. Some may be blocking, in the sense that the next command runs only when the current one exits, while others can run concurrently with the current command. Any settings you choose for setting up your software stack and invocation commands remain stored with the Model.
6. All assets of each Model are independently version controlled.
7. When you Copy a Model you are creating a new Model with all its assets duplicated. From there on, the new Model maintains its own version control history.

### What is a Simulator and a Container?

On Kogence, any software application, solver or simulation tool is referred to as a Simulator. Each Simulator is deployed as an independent entity, referred to as a Container, that can be connected to Models using the Stack tab of the Model and invoked to perform tasks or computations on the input files and data provided under the Files tab of the Model. A Container cannot contain user-specific or Model-specific data or setup. It has to be an independent entity that can be linked and invoked from any Model on Kogence.

We use the term Simulator in a much wider sense than is commonly understood. Under the definition above, Matlab is a Simulator and so is Python. Each is deployed in its own independent Container. On Kogence we use Docker container technology.

The concept of a Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software or a simulator; we do NOT launch a Container; we launch a Model. A Model is connected to a cluster and a stack of software and simulators.

### What is a Simulation?

Execution of the Model on a cloud HPC cluster is called a Simulation. A Simulation is NOT a single job. Billing on Kogence works on a per-Simulation basis (i.e. per execution of a Model), not on a per-job basis. A Simulation can consist of a complex Workflow of multiple jobs using multiple software applications, as defined by the user on the Stack tab of the Model. Some of these jobs can run in interactive mode and others can run in unattended batch mode. Some may start on the master node or the interactive node and others may be scheduled on the compute nodes of the autoscaling cloud HPC cluster. Some may be blocking, in the sense that the next command runs only when the current one exits, while others can run concurrently with the current command.

The concept of a Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software or a simulator; we launch a Model. A Model is connected to a cluster and a stack of software and simulators.

### How Long Does it Take for My Simulation to Start?

If your Model is connected to a single cloud HPC server that is not already persisting (i.e. when you start your Simulation for the first time), then it can take about 2 minutes for the server to boot up and start executing the jobs defined in the Stack tab of your Model. Once the server has started executing your jobs, you will see the Visualizer tab become active on the top NavBar, with a Stop button next to it. If you press the Stop button to stop the Simulation and then press the Run button again, the server persists through the use of kPersistent technology. That means the server starts executing your jobs immediately and you can connect using the Visualizer tab immediately. This allows you to work interactively: stop the execution, edit and debug code using the code editor accessible under the Files tab of your Model, and then restart your Model, just like you would on an on-prem workstation.

If your Model is connected to an autoscaling cloud HPC cluster, then it can take about 5 minutes for the cluster to boot up and be configured to start executing the jobs defined in the Stack tab of your Model. Once the cluster has started executing your jobs, you will see the Visualizer tab become active on the top NavBar, with a Stop button next to it. kPersistent technology is not yet available with clusters.

### What are CPU-Hrs?

CPU-Hr is the unit we use on Kogence to measure the amount of computing power that has been reserved or consumed, independent of the type of hardware being used. CPU-Hrs = # of CPUs × # of Hours.

We measure time in steps of one hour. This does not mean that you are billed for a full hour every time you start your Simulation. Only the first time you start a Simulation do we charge you for one full hour. If, before that hour is completed, you stop and restart the same or a different Model on the same hardware, we do not charge you anything more until a full hour has passed since you first started your first Model. After the completion of that hour, you are charged for another full hour. The same logic continues for all subsequent hours.

### What is 1 CPU Credit or 1 HPC Token?

HPC Token and CPU Credit are one and the same thing; which name you see depends on the release version of the Kogence HPC Grand Central App deployed under your subscription. On Kogence, HPC Tokens or CPU Credits are the currency that you purchase and then spend as you consume HPC compute resources on the Kogence Container Cloud HPC platform.

The cost of compute resources is specified in terms of CPU Credits or HPC Tokens. Please check the pricing page for the currently available pricing. Typically, 1 CPU Credit = 1 CPU for 1 hour. Accelerated hardware such as a GPU compute node is also priced in terms of CPU Credits. Typically, 10 CPU Credits = 1 GPU-accelerated CPU for 1 hour.

So, for example, if you connect your Model to a 4 CPU machine and select a wall time limit of 10 hours, then your account needs to have at least 40 CPU Credits remaining, otherwise your Simulation will not start. If your Simulation starts and either ends automatically or is stopped by you pressing the Stop button after 1 hour and 20 minutes, for example, then we refund 32 CPU Credits to your account. If you restart the Simulation on the same hardware and with the same wall time limit within the next 40 minutes, say after exactly 20 minutes, then we again block 40 CPU Credits at the start. But if you stop the Simulation after only 10 minutes, then we refund the full 40 CPU Credits to your account. You are not charged anything for this second Simulation because you already paid for 2 hours during the first Simulation.
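The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the hourly rounding described in this FAQ, not Kogence's actual billing code. The function names `credits_blocked` and `billed_hours` are hypothetical, and the rate of 1 CPU Credit per CPU per hour is the typical rate quoted above.

```python
import math


def credits_blocked(cpus, wall_time_hours, credits_per_cpu_hour=1):
    """Credits reserved when a Simulation starts (assumed: full wall time)."""
    return cpus * wall_time_hours * credits_per_cpu_hour


def billed_hours(minutes_since_first_start):
    """Whole hours charged so far; any started hour counts in full."""
    return math.ceil(minutes_since_first_start / 60)


cpus = 4
blocked = credits_blocked(cpus, wall_time_hours=10)

# First run: stopped 80 minutes after the first start, so 2 hours are billed.
paid_hours = billed_hours(80)
first_refund = blocked - paid_hours * cpus

# Second run: starts at minute 100 and stops at minute 110,
# which is still inside the second hour that was already paid for.
new_hours = billed_hours(110) - paid_hours
second_refund = blocked - new_hours * cpus

print(blocked, first_refund, second_refund)
```

The three printed values correspond to the 40 Credits blocked, the 32 Credits refunded after the first run, and the full 40 Credits refunded after the second run.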

As of this writing, everybody starts with 20 free CPU Credits. We top up your account every month for free. You can earn more free Credits. You can also purchase more Credits. Credits do not have any expiration date. Currently you can purchase CPU Credits for as low as $0.02. This means you can purchase 1 CPU-Hr of computing for $0.02. Kogence reserves the right to change this pricing without notice. Please check the pricing page for the currently available pricing.

## Cloud HPC Hardware Terminology

### What is 1 CPU on Kogence?

1 CPU on Kogence is the same as what is called 1 CPU in the Resource Monitor of a Microsoft Windows 10 PC, for example. Similarly, the # of CPUs shown on Kogence is the same as what is shown as "CPU(s)" by the lscpu utility on Linux platforms. Different platforms and utilities may call this same logical computing unit by different names. On the same Microsoft Windows 10 PC, for example, the System Information and Task Manager utilities call it "number of logical processors", while Amazon AWS and Microsoft Azure call it "number of vCPU". In general,

${\displaystyle \text{no. of CPUs} = \text{no. of hardware threads per core} \times \text{no. of cores per socket} \times \text{no. of sockets per server}}$.

This is the most relevant logical unit of computing power in the context of cloud High Performance Computing (HPC). Each logical processor has its own independent architectural state (i.e. instruction pointer, instruction register sets etc.), and each can independently and simultaneously execute instructions from one worker thread or worker process without needing to do context switching. Each of these register sets is clocked at full CPU clock speed.

In the context of MPI (Message Passing Interface, multi-processing framework) and OpenMP (multi-threading library), for example, each of these CPUs can run an individual MPI process or an individual OpenMP thread without requiring frequent context switches. So if you are scheduling an MPI job on a 4 CPU machine on Kogence then you can run mpirun -np 4 and there will be no clock cycle sharing or forced context switching among processes.

Between CPUs, cores, sockets, processors, vCPUs, logical processors, hardware threads and so on, the terminology can get confusing if you are not a computer scientist! Let's take a look at a common Microsoft Windows 10 PC that we are all familiar with. All other computer systems show similar information. First look at the Task Manager. This machine has an Intel Core i5-7200 CPU. But don't jump to call it a 1 CPU machine; that is wrong. The Task Manager also reports that this machine has 1 socket, 2 cores and 4 logical processors. If you open System Information, you will see something similar. If you open the Resource Monitor, you will see that this is a 4 CPU machine, from CPU 0 to CPU 3. If you are on a Linux machine and run the lscpu command, you will see 4 CPU(s) as well.
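The formula can be checked against the Windows 10 PC described above. The following is a minimal Python sketch, assuming 2 hardware threads per core (inferred from 2 cores showing 4 logical processors). `logical_cpus` is a hypothetical helper name, while `os.cpu_count()` is the standard-library call that reports the logical CPU count of whatever machine runs the snippet.

```python
import os


def logical_cpus(threads_per_core, cores_per_socket, sockets):
    """Number of CPUs in the sense used on Kogence (logical processors)."""
    return threads_per_core * cores_per_socket * sockets


# The example PC: 1 socket, 2 cores, 2 hardware threads per core (assumed).
example_pc = logical_cpus(threads_per_core=2, cores_per_socket=2, sockets=1)
print("Example PC CPUs:", example_pc)

# Logical CPUs visible to Python on the current machine (may be None
# on platforms where the count cannot be determined).
print("This machine:", os.cpu_count())
```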

Hardware threads within a core do share one common floating point processor and cache memory, though. Please check the FAQ section to learn more about CPUs and hardware threads. In FLOPS-heavy HPC applications, it is the utilization of the floating point processor that is more important than the utilization of the CPU. If a worker thread or a worker process running on one of the hardware threads can keep the floating point processor of that core fully occupied, then you are already getting the most out of your HPC server, even though the system/OS resource management and monitoring utilities (such as top or gnome-system-monitor on Linux platforms) may show overall CPU utilization capped at 50%. 50% CPU utilization may often be misleading in the context of HPC. Remember that the CPU utilization percentage is averaged across the total number of CPUs (i.e. the total number of hardware threads) in your HPC server, and if you are using only one hardware thread per core then the maximum possible CPU utilization is 50%. This just means that the CPU can in theory execute twice as many instructions per unit time, provided they are not all asking for access to the floating point processor. If they are, then 50% utilization is the best you can get.
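The 50% ceiling follows directly from how utilization is averaged over hardware threads. A small Python sketch of that averaging, assuming 2 hardware threads per core as described in this FAQ; `max_reported_utilization` is a hypothetical helper, not a Kogence or OS API.

```python
def max_reported_utilization(busy_threads_per_core, threads_per_core=2):
    """Upper bound (in percent) on the averaged CPU utilization figure."""
    return 100.0 * busy_threads_per_core / threads_per_core


# One worker per core on a hyper-threaded server: the monitor tops out at 50%.
one_per_core = max_reported_utilization(busy_threads_per_core=1)

# One worker per hardware thread: the monitor can reach 100%.
one_per_thread = max_reported_utilization(busy_threads_per_core=2)

print(one_per_core, one_per_thread)
```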

For these reasons, some FLOPS-heavy HPC applications available on Kogence (e.g. Matlab) will not run one independent worker process or worker thread on each hardware thread. Instead they run one worker process or worker thread per core. But Kogence allows you to start multiple instances of these applications. You can experiment with a single instance and with multiple instances and see which gives better net throughput.

### What is a Hardware Thread?

Most High Performance Computing (HPC) servers that Kogence offers are built using Intel microprocessor chips. HPC server motherboards consist of multiple sockets, with each socket holding one Intel microprocessor chip. Each microprocessor chip has multiple cores. Each core is built using Hyper-Threading Technology (HTT). HTT creates 2 logical processors out of each core. Different operating system (OS) utilities and programs identify these logical processors by different names: some identify them as 2 CPUs and others identify them as 2 hardware threads. Hardware thread is a very confusing term. It has nothing to do with the user (worker) threads that your program might start. Each hardware thread is capable of independently executing an independent task, either an independent worker thread or an independent process.

Hyper-Threading Technology is a form of simultaneous multithreading technology introduced by Intel. Architecturally, a processor with Hyper-Threading Technology consists of two logical processors per core. Just like a dual-core or dual-socket configuration that uses two separate physical processors, each of these logical processors has its own processor architectural state. Each logical processor can be individually halted, interrupted or directed to execute a specified process or thread, independently of the other logical processor sharing the same physical core. On the other hand, unlike a traditional dual-core or dual-socket configuration that uses two separate physical processors, the logical processors in a hyper-threaded core share the execution resources. These resources include the execution engine, caches, and system bus interface. The sharing of resources allows two logical processors to work with each other more efficiently, and allows a logical processor to borrow resources from a stalled logical processor. A processor stalls when it is waiting for data it has requested so that it can finish processing the present thread. The processor may stall due to a cache miss, branch misprediction, or data dependency. The degree of benefit seen when using a hyper-threaded or multi-core processor depends on the needs of the software, and on how well it and the operating system are written to manage the processor efficiently.

In the vast majority of modern HPC use cases, we find that HTT helps speed up the application. It is a very effective approach to getting the most performance for a given cost. At a high level, any HPC program, when loaded into the CPU as a set of machine instructions, can be thought of as a directive to the CPU to repeatedly loop over the following: 1/ Fetch instruction; 2/ Decode instruction and fetch register operands; 3/ Execute arithmetic computation; 4/ Possible memory access (read or write); 5/ Write back results to register. It is step #4 that most people overlook when thinking about speed of execution. In this context, even cache memory is slow, let alone main memory. L1 cache typically has a latency of ~2 CPU cycles, L2 cache typically has a latency of ~8 CPU cycles, while L3 cache typically has a latency of ~100 CPU cycles. Main memory has about 2X the latency of the L3 cache (~200 CPU cycles away). If your code opens some I/O pipes to read or write files, for example, then I/O devices on the main memory bus have 100X-1000X more latency than main memory (~20K to ~200K CPU cycles away). If the I/O devices are on the network or on the PCIe bus, then you are looking at millisecond-level latency (~2 million CPU cycles away) at the least.

Now imagine your HPC code is running on a CPU. CPUs are really good at doing steps #1 to #3 and step #5. For example, modern CPUs can do 4 floating point operations per CPU cycle. If your program needs to access some data from main memory, then it will be waiting for ~200 CPU cycles and the floating point arithmetic compute engine of the CPU will just be sitting idle. If it is waiting for data from an I/O device, then the compute engine may be idle for millions of CPU cycles. If you could use that idle time, you could complete about 4 million floating point operations for every million idle cycles. That is exactly what HTT accomplishes. Even the most heavily optimized real-world HPC code needs to access cache memory frequently, at the least. This means there are several tens to several hundreds of CPU cycles, at the least, that could be utilized by another thread. Whether hardware threading (HTT) will enhance performance or not basically boils down to the ratio of floating point instructions to instructions that need to fetch or write data from cache, main memory or I/O devices. Even if that ratio is 100 or 1000, you would still expect HTT to boost performance.
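To make the idle-time argument concrete, the sketch below multiplies the stall lengths quoted in this FAQ by the 4 floating point operations per cycle mentioned above. It is a back-of-the-envelope illustration only; `lost_flops` is a hypothetical helper and the one-million-cycle I/O stall is an assumed round figure.

```python
FLOPS_PER_CYCLE = 4  # figure quoted in the text for modern CPUs

# Approximate stall lengths in CPU cycles (I/O value is an assumed round number).
STALL_CYCLES = {
    "main memory access": 200,
    "I/O over network or PCIe": 1_000_000,
}


def lost_flops(stall_cycles, flops_per_cycle=FLOPS_PER_CYCLE):
    """Floating point operations that could have run during a stall."""
    return stall_cycles * flops_per_cycle


for cause, cycles in STALL_CYCLES.items():
    print(f"{cause}: {cycles} idle cycles = {lost_flops(cycles)} operations not executed")
```

A second hardware thread on the same core is what gives the floating point unit something to do during those idle cycles.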

However, in cases where both threads are operating primarily on very close (e.g., registers) or relatively close (first-level cache) instructions or data, the overall throughput occasionally decreases compared to non-interleaved, serial execution of the two lines of execution. In the case of LINPACK, the benchmark used to rank supercomputers on the TOP500 list, many studies have shown that you get better performance by disabling Hyper-Threading Technology. We think these are largely artificial constructs and do not represent how real-life HPC codes work. These LINPACK benchmarks are specifically designed to probe the speed of floating point operations and deliberately avoid accessing memory or I/O.

One drawback of a traditional on-prem HPC cluster is that it hinders experimentation with HTT for your specific use case. Either the entire cluster switches HTT on or the entire cluster switches HTT off. Your cluster admin decides that based on the types of workloads that all other users typically run on that cluster. On the Kogence cloud HPC platform you create your own personal autoscaling HPC cluster for each Model you are executing. You can run the same Model multiple times on brand new clusters and disable HTT in some cases while leaving it enabled in others. You can easily test both configurations and decide which is best based on empirical evidence. With that said, for the overwhelming majority of workloads, you should leave HTT enabled.

The operating systems configured on Kogence are HTT-aware, meaning they know the distinction between hardware threads and physical cores and properly schedule user threads to reduce stalled CPU states and enhance performance, unless you specifically instruct the Kogence platform to pin your processes and worker threads to specific cores or to specific hardware threads.

### What is the Memory Latency and Bandwidth on Kogence?

Before we answer that question, let's cover some basics.

#### Virtual Memory and Page Tables

In HPC, we often deal with shared memory parallelism. Processes can use shared memory. That means multiple virtual memory pages (as addressed by different processes) may be mapped to the same physical memory frame. If NUMA is enabled then the reverse is also possible, i.e. the same virtual memory page may be mapped to multiple physical memory frames (for read-only frames), with the individual frames residing in the local physical memory of each individual CPU.

#### Cache Memory, Translation Lookaside Buffer (TLB) and Main Physical Memory

Cache and TLB are part of the Memory Management Unit (MMU), which is integrated on the same silicon chip as the microprocessor. Both cache and TLB are used to reduce the time it takes for a running process to access a physical memory location. On all hardware that Kogence offers, physical memory is located off the microprocessor chip, on a different silicon chip on the motherboard connected via the memory bus. Cache is basically a quick-access copy of small sections of physical memory held in static RAM on the microprocessor's silicon chip, while the TLB is a quick-access copy of small sections of the page table (which also resides in physical memory) held in static RAM on the microprocessor's silicon chip.

#### Latency and Throughputs

Servers offered on Kogence have 3 levels of cache: L1, L2 and L3. L1 cache is further broken down into instruction (L1i) and data (L1d) cache. Each core gets its own L1 and L2 cache while L3 cache is shared among all cores. Hyperthreads do not get their own cache. Typically our hardware has 32KB each of L1i and L1d cache, 256KB of L2 cache and >10MB of L3 cache. The amount of each level of cache differs from hardware to hardware. Check the specification page for more details. On Kogence hardware, the TLB may reside between the CPU and the L1 cache. On the servers Kogence offers,

• L1 cache typically has a latency of ~2 CPU cycles and a bandwidth of about ~500 bytes/CPU-cycle.
• L2 cache typically has a latency of ~8 CPU cycles and a bandwidth of about ~500 bytes/CPU-cycle.
• L3 cache typically has a latency of ~100 CPU cycles and a bandwidth of about ~200 bytes/CPU-cycle.
• Main memory has a latency of ~200 CPU cycles and a bandwidth of about ~100 bytes/CPU-cycle.
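As a rough way to read these numbers together, the cost of fetching a block of data can be approximated as latency plus size divided by bandwidth. The sketch below uses that simple model, which is an assumption made for illustration and ignores prefetching, overlapping requests and contention. `fetch_cycles` is a hypothetical helper, and the table values are the typical figures listed above.

```python
# (latency in CPU cycles, bandwidth in bytes per CPU cycle), from the list above.
MEMORY_LEVELS = {
    "L1": (2, 500),
    "L2": (8, 500),
    "L3": (100, 200),
    "main memory": (200, 100),
}


def fetch_cycles(level, n_bytes):
    """Estimated cycles to fetch n_bytes from a level: latency + size / bandwidth."""
    latency, bandwidth = MEMORY_LEVELS[level]
    return latency + n_bytes / bandwidth


# Compare fetching the same 32 KB block from each level.
for level in MEMORY_LEVELS:
    print(f"{level:12s} ~{fetch_cycles(level, 32 * 1024):8.1f} cycles for 32 KB")
```

For small transfers the latency term dominates, while for large blocks the bandwidth term does.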

### What are "Number of Processes" and "Number of Threads"?

A process is a logical representation of an instance of a program that has been submitted to CPU and OS to manage its execution. Process is distinct from a thread -- both user (worker) threads as well as hardware threads (HTT technology). Hardware thread is a confusing terminology. Each hardware thread can execute an independent process or an independent worker thread. On Kogence platform we refer to hardware thread as a CPU.

As opposed to threads, processes provide stronger logical isolation, as you are starting independent instances of a program to work independently. Intuitively, starting multiple processes of the same program is similar to starting the same program multiple times. For example, if you start multiple instances of Matlab by typing matlab -desktop & multiple times in the CloudShell terminal, then you are basically starting multiple Matlab processes. You should not be afraid of multiple instances of Matlab corrupting data or files, because if one instance of Matlab opens a file then the other instances cannot access it. Similarly, each instance creates its own temporary data in memory to work on, so you do not have to think about controlling access to variables in memory.

Using multi-processing libraries such as MPI to start multiple instances of Matlab is much preferred over the above-mentioned method of starting multiple instances from a shell terminal. As an example, if the code running under each instance of Matlab tries to access the same file, then all instances except one will crash, complaining that they cannot access the file. If you start processes through standard libraries such as MPI then you can be assured that MPI will manage those things. It will take care of keeping a process waiting until the other processes close the file.

At a lower level, a process is defined by a Process Control Block (PCB). The PCB stores the state of execution of a process. It has all the information that the OS and CPU need to freeze or unfreeze a process. Any set of instructions submitted to the CPU that comes with its own PCB is a process. Each process gets a unique process ID (PID) that you can check using OS utilities. So any task that has a unique PID is a process. As an example, if you execute a bash shell script then the bash shell started to execute the script instructions will get a PID and will be a process. If from that shell script you call another bash shell script, then that script will be executed in another child bash shell, and that bash shell will get its own PCB and PID. Modern OSes are smart enough not to duplicate the machine code for each instance in main memory and in cache. For example, they will load only one copy of the text (code) segment into memory. But each process gets its own program counter, a CPU register that keeps track of which instruction is currently being executed. The program counter is part of the PCB and gets saved and restored when a process is taken out of the running state by the OS and then brought back to the running state. Typically, the virtual memory allocated to a process and the files and other I/O resources opened by a process are restricted to that process. Other processes cannot access those resources. But processes can specifically issue instructions to share these resources with other processes. In HPC we do this by using standard libraries such as MPI.

Processes can start children processes (and those can start grand-children processes etc). Each of those will get their own PCB (and therefore a PID). Processes can also start multiple threads. Threads will operate under same PCB. Threads do not get their own PCB (instead they get a Thread Control Block, TCB, which has a link to the PCB of the parent process). Threads can access resources of the process.
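The difference between a child process and a thread can be observed from the PIDs they report. The following Python sketch is only an illustration using the standard `subprocess` and `threading` modules; it starts one child Python interpreter and one thread and records the PID each one sees.

```python
import os
import subprocess
import sys
import threading

parent_pid = os.getpid()

# A child process is a separate process with its own PID (and its own PCB).
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.getpid())"],
    capture_output=True,
    text=True,
    check=True,
)
child_pid = int(child.stdout.strip())

# A thread runs inside the parent process, so it reports the parent's PID.
thread_pids = []
worker = threading.Thread(target=lambda: thread_pids.append(os.getpid()))
worker.start()
worker.join()

print("parent PID:", parent_pid)
print("child process PID:", child_pid)
print("PID seen by thread:", thread_pids[0])
```

The child reports a PID different from the parent's, while the thread reports the same PID as the parent because it operates under the parent's PCB.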

Most HPC solvers can also do multi-processing as opposed to multi-threading. This means that multiple CPUs across multiple nodes can be instructed to run independent instances of the same solver code (child processes), each operating independently and in parallel on a specific set of data. As the children may be running on different nodes, they do not have access to the memory of the parent process on the parent compute node. Child processes exchange data with each other and with the parent using Message Passing Interface (MPI) library subroutines. If MPI processes are running on the same node then they do have the ability to access shared memory as well. But all requests to access data or memory still go through MPI subroutines. MPI subroutines manage the mode of access to the data automatically by keeping track of which child process is running on which node.

On Kogence, we restrict the product of processes and threads to be less than or equal to the number of CPUs in the cluster. This eliminates serious overhead from context switching.
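The rule can be expressed as a one-line check. The sketch below is a hypothetical illustration of the constraint as stated here, not the validation code Kogence actually runs; `fits_cpu_budget` and its parameter names are assumptions.

```python
def fits_cpu_budget(processes, threads_per_process, cpus):
    """True if processes x threads does not exceed the number of CPUs."""
    return processes * threads_per_process <= cpus


# On a 4 CPU machine:
print(fits_cpu_budget(processes=4, threads_per_process=1, cpus=4))  # 4 MPI processes
print(fits_cpu_budget(processes=2, threads_per_process=2, cpus=4))  # hybrid layout
print(fits_cpu_budget(processes=4, threads_per_process=2, cpus=4))  # oversubscribed
```

The first two layouts give every process or thread its own CPU, while the third would force 8 workers to share 4 CPUs and therefore to context switch.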

### What is the Overhead of Process Context Switching?

A context switch is a change in which instruction stream a given core is executing at a given time. A context switch can refer to switching from one hardware thread to another hardware thread (HTT), switching from one user (worker) thread to another user (worker) thread, or switching from one process to another process. Context switching from one hardware thread to another hardware thread has negligible overhead. This FAQ discusses the overhead of a process context switch. There is another FAQ that discusses the overhead of a worker thread context switch.

When a request is made to execute a different process on the same CPU (either triggered by a user code instruction, i.e. a voluntary context switch, or by the OS scheduler because the running process went into an idle state waiting for some I/O or because the OS scheduler wants to give processor time to other processes, i.e. an involuntary context switch), first a switch from user mode to OS mode is triggered. This simply means that a subroutine from the OS code base needs to be called to perform the context switch. OS code is just like usual user code that is loaded into physical memory. This OS code saves the process control block (PCB, which consists of the current values of a set of CPU registers such as the program counter, the pointer to the page table etc.) of the current process to physical memory and loads the PCB of the new process from physical memory into the CPU registers. The OS then makes another switch from OS mode to user mode to let the CPU start executing instructions from the new process. In older hardware and OS architectures, a switch from user mode to OS mode was itself implemented like a full process context switch. That meant saving the existing process PCB, loading the OS PCB, executing OS code, saving the OS PCB and then loading the new process PCB. But in modern hardware and OSes, switching out of user mode to OS mode and back is much less expensive and is no longer a significant portion of the cost of a context switch. Also, on virtualized cloud HPC hardware, some hypervisors need to give control to the host machine OS because the guest OS cannot do this switch. This increases the computational cost of a context switch by an order of magnitude. On the Kogence platform, we have carefully configured the system to eliminate this latency.

Therefore the fixed cost of doing a context switch consists of: switching from user mode to OS mode; storing and loading of the PCB; and switching back from OS mode to user mode. On the Kogence cloud HPC platform, switching into OS mode and back to user mode typically costs about 100 CPU clock cycles. With CPU clock speeds of about 2 to 3 GHz, switching back and forth between OS mode and user mode takes ~50ns. Among the fixed cost components, the cost of storing and loading the process PCB is much larger. It is more than an order of magnitude bigger and takes about 2,000 CPU cycles or ~1µs.

A much bigger portion of the cost is the variable cost. The variable CPU cycle cost of a context switch consists of flushing the cache and flushing the TLB. Note that each process uses the same virtual memory address space, but each is mapped to a different physical memory address space through its page table. A small portion of the page table is cached in the TLB. Depending on the number of pages of virtual memory that the old and new processes use (called the working set), the CPU might encounter lots of TLB misses on context switching. So it may have to flush the TLB and load new sections of the page table. In addition, the physical memory frames of the old process that were cached in the various levels of cache may also become useless, so the CPU might encounter lots of cache misses and may have to flush the cache and load new physical memory frames into it.

The variable CPU cycle cost of switching from an old process to a new process and back to the old process changes dramatically depending on the process pinning you instruct the OS scheduler to use. Let's say CPU1 is executing process1. Then the OS asks the same CPU1 to start executing process2. After some time, the OS wants to restart process1. The number of CPU cycles wasted before process1 can resume changes dramatically depending on whether the OS schedules process1 back onto the original CPU1 or onto a new CPU2 (and instead starts a process3 on the original CPU1). Note that each core gets its own L1 and L2 cache while the L3 cache is shared between cores. Since the address translations from the virtual memory pages that process1 needs to physical memory addresses may already be present in the TLB of CPU1, and the required frames of physical memory may already be present in the L1 or L2 cache of CPU1 (since process1 was running on CPU1 some time back), process1 may restart on CPU1 much quicker than on CPU2. If the new CPU2 sits on a different socket then the cost will be even higher, as both the TLB and all levels of cache will need to be repopulated. Incidentally, process pinning affects the fixed cost of context switching as well, because in that case the PCB is also not available in cache and needs to be read from main physical memory.

The variable cost also depends strongly on the size of the virtual memory pages that a process is using. If processes use very large page sizes, then the chance of needing to flush the TLB may be lower (as the number of entries needed in the TLB may become smaller), but the chance of needing to flush the cache may increase.

All of the above depends strongly on the size of the caches in the CPU you're using. Typically, if pages are available in cache, then the time it takes to write pages can easily reach 100,000 cycles or a few microseconds. If we miss the cache and the TLB, then this can increase by an order of magnitude, to a few tens of microseconds. As discussed above, this also depends on the process pinning instructions your code might give to the OS. Past a certain working set size, the fixed cost of context switching is negligible compared to the variable cost, which is dominated by the cost of accessing memory. For all these reasons, it is very hard to put any representative or average number on the cost of context switching.

In summary,

• Switching from user mode to OS mode and back takes about 100 CPU cycles or about 50ns.
• A simple process context switch (i.e. without cache and TLB flushes) costs about 2K CPU-cycles or about 1µs.
• If your application uses any non-trivial amount of data that would require flushing of TLB and cache, assume that each context switch costs about 100K CPU-cycles or about 50µs.
• As a rule of thumb, just copying 30KB of data from one memory location to another memory location takes about 10K CPU-cycles or about 5µs.
• Launching a new process takes about 100K CPU cycles or about 50µs.
• In the HPC world, creating more active threads or processes than there are hardware threads available is extremely detrimental (e.g. in 100K CPU-cycles, the CPU could have executed 400K floating-point operations). If the number of worker threads is the same as the number of hardware threads, then it is easier for the OS scheduler to keep re-scheduling the same threads on the CPU they last used (weak affinity). The recurrent cost of context switches when an application has many more active threads than hardware threads is very high. This is why, on Kogence, we do not allow the product of processes and threads to be higher than the number of CPUs.
• On interactive nodes or master nodes, we do not restrict the number of jobs, processes and threads you can start. An interactive node has lots of background processes taking small amounts of CPU time. This means the number of threads can be more than the number of CPUs, and threads tend to bounce around a lot between CPUs. In this case, unfortunately, the costs of process context switching and thread switching don't differ significantly in practice. For HPC workloads started on interactive or master nodes, you should pin processes and threads to specific CPUs to avoid this overhead.
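
The cycle counts and times in the summary above are consistent with a clock rate of roughly 2 GHz (100 cycles in about 50ns). The short Python sketch below makes that conversion explicit and estimates what fraction of a CPU is lost to context switching. The 2 GHz clock and the helper names are assumptions made for illustration; they are not values reported by Kogence.

```python
# Assumption: a 2 GHz clock, as implied by "100 CPU cycles or about 50ns".
ASSUMED_CLOCK_HZ = 2.0e9


def cycles_to_seconds(cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a CPU-cycle count to seconds at the given clock rate."""
    return cycles / clock_hz


def switch_overhead_fraction(switches_per_second, cycles_per_switch,
                             clock_hz=ASSUMED_CLOCK_HZ):
    """Fraction of one CPU's cycles consumed by context switches."""
    return switches_per_second * cycles_per_switch / clock_hz


if __name__ == "__main__":
    for label, cycles in [("mode switch", 100),
                          ("simple context switch", 2_000),
                          ("context switch with TLB/cache flush", 100_000)]:
        print(f"{label}: {cycles_to_seconds(cycles) * 1e6:.2f} microseconds")
    # 1000 expensive switches per second on one CPU
    print(f"overhead: {switch_overhead_fraction(1000, 100_000):.1%}")
```

Under these assumptions, an oversubscribed CPU that performs 1000 cache-flushing context switches per second spends about 5% of its cycles on switching alone.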

### What is the Overhead of Thread Context Switching?

• If you are doing thread switch within same CPU (proper CPU pinning) with no need to do TLB and cache flushes then context switch takes about 100 CPU-cycles or about 50ns.
• A thread switch that moves to a different CPU costs about the same as a process context switch does, i.e. about 2K CPU-cycles or about 1µs. Here we are assuming that the virtual memory the thread needs is already in cache and that no TLB or cache flushing is needed.
• If your application uses any non-trivial amount of data that would require flushing of TLB and cache, assume that each thread context switch costs about 100K CPU-cycles or about 50µs.
• As a rule of thumb, just copying 30KB of data from one memory location to another memory location takes about 10K CPU-cycles or about 5µs.
• Launching a new thread takes about 10K CPU cycles or about 5µs.
• Creating more active threads than there are hardware threads available is extremely detrimental. If the number of worker threads is the same as the number of hardware threads, then it is easier for the OS scheduler to keep re-scheduling the same threads on the CPU they last used (weak affinity). The recurrent cost of context switches when an application has many more active threads than hardware threads is very high. This is why, on Kogence, we do not allow the product of processes and threads to be higher than the number of CPUs.
• On interactive nodes or master nodes, we do not restrict the number of jobs, processes and threads you can start. An interactive node has lots of background processes taking small amounts of CPU time. This means the number of threads can be more than the number of CPUs, and threads tend to bounce around a lot between CPUs. In this case, unfortunately, the costs of process context switching and thread switching don't differ significantly in practice. For HPC workloads started on interactive or master nodes, you should pin processes and threads to specific CPUs to avoid this overhead.
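
As a rough illustration of the pinning advice above, the following Python sketch restricts the calling process to a chosen set of CPUs using the standard library's `os.sched_setaffinity`, which is available on Linux only. The helper name `pin_current_process` is hypothetical, and which CPU ids are sensible depends on your node. MPI launchers and OpenMP runtimes provide their own binding options, so treat this only as a sketch of the underlying mechanism.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to the given CPU ids (Linux only).

    Returns the affinity set in effect after the call.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity control is not available here")
    os.sched_setaffinity(0, set(cpus))  # pid 0 means "the caller"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)      # CPUs we may currently run on
    first = min(allowed)
    print("pinned to:", pin_current_process({first}))
    pin_current_process(allowed)           # restore the original affinity
```

Once pinned, the scheduler keeps the process on the selected CPUs, so its TLB entries and L1/L2 cache contents are more likely to still be valid when it is rescheduled.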

### How Can I Monitor CPU Utilization on Kogence Cloud HPC Servers?

When using a tool like top or gnome-monitor on Kogence HPC servers, you will see CPU usage reported as divided into 8 different CPU states. For example:
%Cpu(s): 13.2 us,  1.3 sy,  0.0 ni, 85.3 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st

These eight CPU states are: “user” (us), “system” (sy), “nice” (ni), “idle” (id), “iowait” (wa), “hardware interrupt” (hi), “software interrupt” (si), and “steal” (st). top shows the percentage of time the server spends in each of the eight states. Of these, “system”, “user” and “idle” are the 3 main CPU states. The ni state is closely related to the us state: it represents the fraction of CPU time spent on low-priority user tasks. The wa state is closely related to the id state: it represents the fraction of CPU time spent idle while waiting for an I/O operation to complete. top reports ni and wa as separate columns, so all eight values together add up to 100%.

Please note that these are averaged over all CPUs of your HPC server. So if you started an 8 CPU HPC server then top will show utilization averaged across all 8 CPUs. You can press 1 to get per-CPU statistics.

• system (sy)

The “system” CPU state shows the amount of CPU time used by the kernel. The kernel is responsible for low-level tasks, like interacting with the hardware, memory allocation, communicating between OS processes, running device drivers and managing the file system. Even the CPU scheduler, which determines which process gets access to the CPU, is run by the kernel. While usually low, the system state utilization can spike when a lot of data is being read from or written to disk, for example. If it stays high for longer periods of time, you might have a problem. For example, if the CPU is doing a lot of context switching, you will see it spending a lot more time in the system state.

• user (us)

The “user” CPU state shows CPU time used by user space processes. These are processes like your application, or management daemons and applications started automatically by Kogence that run in the background. In short, all CPU time used by anything other than the kernel is marked “user” (including the root user), even if it wasn’t started from any user account. If a user-space process needs access to the hardware, it needs to ask the kernel, and that time counts towards the “system” state. Usually, the “user” state uses most of your CPU time. In properly coded HPC applications, it can stay close to the maximum of 100%.

• nice (ni)

The “nice” CPU state is a subset of the “user” state and shows the CPU time used by processes that have a positive niceness, meaning a lower priority than other tasks. The nice utility is used to start a program with a particular priority. The default niceness is 0, but can be set anywhere between -20 for the highest priority to 19 for the lowest. CPU time in the “nice” category marks lower-priority tasks that are run when the CPU has some time left to do extra tasks.

• idle (id)

The “idle” CPU state shows the CPU time that’s not actively being used. Internally, idle time is usually calculated by a task with the lowest possible priority (using a positive nice value).

• iowait (wa)

“iowait” is a subcategory of the “idle” state. It marks time spent waiting for input or output operations, like reading or writing to disk. When the processor waits for a file to be opened, for example, the time spent is marked as “iowait”. If, instead, a task running on a given CPU blocks on a synchronous I/O operation, the kernel suspends that task and allows other tasks to be scheduled on that CPU. In that case, the CPU is not idle and this time is not shown in the id or wa states.

• hardware interrupt (hi)

The CPU time spent servicing hardware interrupts.

• software interrupt (si)

The CPU time spent servicing software interrupts.

• steal (st)

The “steal” (st) state marks time taken over by the hypervisor.
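
To work with these numbers programmatically, the summary line can be split into its eight states. The sketch below parses the sample `%Cpu(s)` line shown earlier in this answer and checks that the reported percentages add up to 100%. The regular expression and the helper name `parse_cpu_line` are illustrative assumptions, and the exact layout of the line can differ between versions of top.

```python
import re

# Sample line copied from the example earlier in this answer.
SAMPLE = ("%Cpu(s): 13.2 us,  1.3 sy,  0.0 ni, 85.3 id,"
          "  0.0 wa,  0.0 hi,  0.2 si,  0.0 st")


def parse_cpu_line(line):
    """Return {state code: percentage} for a top '%Cpu(s)' summary line."""
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})\b", line)
    return {state: float(value) for value, state in pairs}


if __name__ == "__main__":
    states = parse_cpu_line(SAMPLE)
    for code, pct in states.items():
        print(f"{code}: {pct:5.1f}%")
    print("total:", round(sum(states.values()), 1))
```

In this sample the server is mostly idle (85.3% id), which for an HPC job would suggest that the application is not using the CPUs it was given.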

### How Can I Monitor Memory Utilization on Kogence Cloud HPC Servers?

One can use the top or gnome-monitor utilities to monitor memory utilization.

## Cloud HPC Network Architecture

### Is the Kogence Cloud HPC Cluster Connected to Internet?

If you are starting your simulation on a single cloud HPC server, your HPC server is connected to the internet. Once your Model is in the running state, you can click on the Visualizer button in the top right corner to connect to your HPC server over the internet through a secure, SSL/TLS-encrypted channel. If you connect a CloudShell to the software stack of your model through the Stack tab of your Model, then you can use the CloudShell terminal to pull repositories over the internet using pip, git pull, curl, wget, etc.

If you are starting an autoscaling cloud HPC cluster, then the master node of the cluster is connected to the internet in the same way.

Workloads running inside the container provisioned on these nodes will also have access to the internet.

Please refer to Network Architecture document for more details.

### How is the Network Architected Among the Compute Nodes?

Cloud HPC clusters you start on Kogence are equipped with a standard TCP/IP network as well as InfiniBand-like OS Bypass networks. Please refer to the Network Architecture document for more details.

### Do the Kogence Cloud HPC Clusters Come with an InfiniBand Network?

Yes. If you select Network Limited Work Load node then that node is equipped with an OS Bypass network with network bandwidth up to 100Gbps. Please refer to OS Bypass Network and Remote Direct Memory Access documents for more details.

### How is the Network Configured Among the Containers?

Please refer to the Container Network document for details. The OS Bypass RDMA network is also available for workloads inside containers if it is accessible to the HPC server itself.

## Cloud HPC Parallel Computing Libraries and Tools

### What is BLAS?

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software.
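
To make the three levels concrete, here is a deliberately naive pure-Python sketch of one representative operation per level. The function names echo the BLAS routine families (axpy, gemv, gemm), but the signatures are simplified for illustration and are not the real BLAS interfaces; actual BLAS routines take scaling factors, strides and storage-layout arguments and are implemented in heavily optimized compiled code.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a, x):
    """Level 2 (matrix-vector): return A * x for a row-major matrix A."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a, b):
    """Level 3 (matrix-matrix): return A * B for row-major matrices."""
    cols_b = list(zip(*b))  # columns of B
    return [[sum(aik * bkj for aik, bkj in zip(row, col)) for col in cols_b]
            for row in a]


if __name__ == "__main__":
    print(axpy(2, [1, 2], [3, 4]))                   # vector-vector
    print(gemv([[1, 2], [3, 4]], [1, 1]))            # matrix-vector
    print(gemm([[1, 2], [3, 4]], [[0, 1], [1, 0]]))  # matrix-matrix
```

For n-by-n inputs the amount of arithmetic grows from O(n) at Level 1 to O(n^2) at Level 2 and O(n^3) at Level 3, which is why Level 3 routines benefit the most from cache-aware, hardware-optimized implementations.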

On Linux systems, if you don't configure your system properly, you would be using the default GNU BLAS (such as /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1), which is a generic library that is not optimized for the hardware. There are several highly optimized BLAS libraries (such as OpenBLAS, ATLAS, GotoBLAS and Intel MKL) that can be used instead of the default base libraries. These libraries are optimized to take advantage of the hardware they run on, and can be significantly faster than the base implementation (operations such as matrix multiplication may be over 40 times faster). Kogence configures an optimized library automatically for you based on the hardware you selected on the Cluster tab of your model.

### What is BLACS?

The BLACS (Basic Linear Algebra Communication Subprograms) is a linear algebra oriented message passing interface for distributed memory cluster computing. It provides the basic communication subroutines that are used in the PBLAS and ScaLAPACK libraries.

### What is PBLAS?

PBLAS (Parallel BLAS) is the distributed memory version of the Level 1, 2 and 3 BLAS library. BLAS is used for shared memory parallelism, while PBLAS is used for distributed memory parallelism appropriate for clusters of parallel computers (heterogeneous computing). Just as with BLAS, on Kogence we automatically configure software, solvers and simulators with appropriate hardware-optimized versions of PBLAS based on the hardware you selected on the Cluster tab of your model. No action is needed from users.

### What is LAPACK?

LAPACK is a large, multi-author, Fortran library for numerical linear algebra. It provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision. LAPACK is the modern replacement for LINPACK and EISPACK libraries. LAPACK uses block algorithms, which operate on several columns of a matrix at a time. On machines with high-speed cache memory, these block operations can provide a significant speed advantage.

LAPACK uses BLAS. The speed of subroutines of LAPACK depends on the speed of BLAS. EISPACK did not use any BLAS. LINPACK used only the Level 1 BLAS, which operate on only one or two vectors, or columns of a matrix, at a time. LAPACK's block algorithms also make use of Level 2 and Level 3 BLAS, which operate on larger portions of entire matrices. LAPACK is portable in the sense that LAPACK will run on any machine where the BLAS are available but performance will not be optimized if hardware optimized BLAS and LAPACK are not used.

On Linux systems, if you don't configure your system properly, you would be using the default GNU LAPACK (such as /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1), which is a generic library that is not optimized for the hardware. There are several highly optimized BLAS/LAPACK libraries (such as OpenBLAS/LAPACK, ATLAS/LAPACK, GotoBLAS/LAPACK and Intel MKL) that can be used instead of the default base libraries. If you configure your system properly with hardware-optimized LAPACK libraries, then your HPC applications will see a serious performance boost. Kogence does this automatically for you based on the hardware you selected on the Cluster tab of your model.

### What is ScaLAPACK?

ScaLAPACK (Scalable LAPACK) library includes a subset of LAPACK routines redesigned for distributed memory clusters of parallel computers (heterogeneous computing). ScaLAPACK uses explicit message passing for interprocessor communication using MPI or PVM. Just like LAPACK, ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy that includes the off-processor memory of other processors (or processors of other computers), in addition to the hierarchy of registers, cache, and local memory on each processor. ScaLAPACK uses PBLAS and BLACS. ScaLAPACK is portable in the sense that ScaLAPACK will run on any machine where PBLAS, LAPACK and the BLACS are available but performance will not be optimized if hardware optimized PBLAS and LAPACK are not used.

Kogence uses ScaLAPACK automatically whenever appropriate, based on the hardware you selected on the Cluster tab of your model. No action is needed from users.

### What is FFTW?

FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data as well as of even/odd data, i.e. the discrete cosine(DCT)/sine(DST) transforms. At Kogence we automatically use the hardware optimized versions of FFTW based on the hardware you selected on the Cluster tab of your model.
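
For readers unfamiliar with the transform itself, the sketch below computes a one-dimensional DFT directly from its definition, X[k] = sum over n of x[n] * exp(-2*pi*i*k*n/N), using only the Python standard library. This is not the FFTW API and is not how FFTW works internally: it is a naive O(N^2) evaluation, whereas FFTW uses fast O(N log N) algorithms. It is shown only to clarify what quantity is being computed.

```python
import cmath


def dft(x):
    """Naive O(N^2) forward DFT of a sequence of real or complex samples."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


if __name__ == "__main__":
    # A constant signal has all of its energy in the k = 0 (DC) bin.
    for k, value in enumerate(dft([1, 1, 1, 1])):
        print(k, round(abs(value), 6))
```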

### What is Intel MKL?

Intel Math Kernel Library (Intel MKL) is a library that includes Intel's hardware-optimized versions of BLAS, LAPACK, ScaLAPACK and FFTW, as well as some miscellaneous sparse solvers and vector math subroutines. The routines in MKL are hand-optimized specifically for Intel processors. On Kogence, Intel MKL is automatically configured and linked whenever appropriate based on the hardware you selected on the Cluster tab of your model. No action is needed from users.

### What is MPI?

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to function on a wide variety of parallel computing architectures. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. HPC software applications, solvers and simulators that are written using MPI library routines and properly compiled and linked against MPI libraries enable multiprocessing. They can be run in parallel either on a single multi-CPU server using shared memory parallelism, or on multiple multi-CPU servers using shared memory parallelism within each server and distributed memory parallelism across the servers. If an application/solver provides support for both MPI and OpenMP, then one can use both MPI (multiprocessing) and OpenMP (multithreading) at the same time; this is sometimes known as hybrid parallelism. There are several well-tested and efficient implementations of MPI. On Kogence, we currently support the Open MPI (not to be confused with OpenMP), MPICH and Intel MPI libraries.

The number of slots is one of the basic concepts in MPI. You can define any number of slots per host in the host file and then provide that host file to MPI using the -hostfile option. By defining the number of slots, you are letting MPI schedule that many processes on that host. If you do not define it, then by default MPI assumes the number of slots to be the same as the number of cores. By using the switch --use-hwthread-cpus, you can override this and tell MPI to assume the number of slots to be the same as the number of CPUs (i.e. the number of hardware threads). Defining more slots than the number of CPUs will cause a lot of context switching and may degrade the performance of your applications.

If you wrap your command with mpirun -np N, for example, then MPI will launch N copies of your command as N processes in a process group, in a round-robin fashion by slots. These processes are called rank0, rank1 ... and so on. They can share memory and do intra-process communication. It is expected that the command you are launching is properly compiled and linked with MPI, so that as soon as these N copies launch, they can talk to each other and decide which tasks each will work on. If you wrap a non-MPI application command with mpirun -np N, then each of the N copies will do exactly the same thing. If you schedule another mpirun-wrapped command, then MPI will launch another set of processes in another process group. These new processes will also be called rank0, rank1 ... and so on within that new process group. The processes from process-group1 and the processes from process-group2 can only communicate through inter-process communication and not through intra-process communication. They also cannot share memory.

With the -np option you can specify fewer processes than there are slots, or you can oversubscribe the slots. Oversubscribing the slots will cause lots of context switching and may degrade the performance of your application. One can prevent oversubscription by using the -nooversubscribe option. Oversubscription can also be prevented on a per-host basis by specifying max_slots=N in the hostfile (resource managers and job schedulers do not share this with MPI; it only works when you explicitly provide a host file to MPI). There are alternative ways to specify the number of processes to launch. If you don't specify anything but provide a host file, then MPI launches as many processes on each host as there are slots. The number of processes launched can be specified as a multiple of the number of nodes or sockets available using the -npernode N and -npersocket N options respectively. The -npersocket option also turns on the -bind-to-socket option, which is discussed in a later section.
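
The relationship between slots, max_slots and oversubscription described above can be sketched in a few lines of Python. The code below parses a small hostfile in the `host slots=N max_slots=M` style and checks whether a given -np value would oversubscribe it. The host names, the `default_slots` parameter and the helper names are hypothetical; a real MPI installation derives the default slot count from the hardware and applies its own, more detailed rules.

```python
# Hypothetical hostfile content for a two-node cluster.
HOSTFILE = """
# node name   slot settings
node1 slots=4 max_slots=4
node2 slots=2
"""


def parse_hostfile(text, default_slots=1):
    """Parse hostfile lines into {host: {"slots": N, "max_slots": M or None}}."""
    hosts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, *fields = line.split()
        entry = {"slots": default_slots, "max_slots": None}
        for field in fields:
            key, _, value = field.partition("=")
            if key in entry:
                entry[key] = int(value)
        hosts[name] = entry
    return hosts


def total_slots(hosts):
    """Total number of slots across all hosts."""
    return sum(entry["slots"] for entry in hosts.values())


def is_oversubscribed(num_procs, hosts):
    """True if launching num_procs processes exceeds the available slots."""
    return num_procs > total_slots(hosts)


if __name__ == "__main__":
    hosts = parse_hostfile(HOSTFILE)
    print("total slots:", total_slots(hosts))
    print("-np 8 oversubscribes:", is_oversubscribed(8, hosts))
```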

One can map the processes to specific objects in the cluster. This is the initial launch of the processes. One can then further bind the processes to objects in the cluster so that when OS does rescheduling after context switching, these processes remain bound to specific cluster objects.

One can use the option --map-by foo where foo can be slot, hwthread, core, L1cache, L2cache, L3cache, socket, numa, board, node, sequential, distance, and ppr. For example, --map-by node will cause rank0 to go to first node, rank1 to the next node and so on until all nodes have one process and then it will restart from first node as needed. --map-by socket is the default.
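
The difference between mapping policies is easiest to see by simulating where each rank lands. The Python sketch below models only two cases, placement by node (round-robin across nodes, as described above) and placement by slot (fill one node's slots before moving on). The node names and function names are hypothetical, and the model ignores details that a real MPI launcher takes into account, such as per-node slot limits when mapping by node.

```python
def map_by_node(num_ranks, nodes):
    """Round-robin placement: rank i goes to node i modulo the node count."""
    return [nodes[rank % len(nodes)] for rank in range(num_ranks)]


def map_by_slot(num_ranks, slots_per_node):
    """Fill every slot of one node before moving to the next node.

    slots_per_node is an ordered list of (node_name, slot_count) pairs.
    Wraps around to the first node if ranks outnumber slots.
    """
    order = [name for name, count in slots_per_node for _ in range(count)]
    return [order[rank % len(order)] for rank in range(num_ranks)]


if __name__ == "__main__":
    print(map_by_node(5, ["node1", "node2"]))
    print(map_by_slot(3, [("node1", 2), ("node2", 2)]))
```

With two nodes, mapping five ranks by node alternates node1, node2, node1, node2, node1, while mapping three ranks by slot with two slots per node places the first two ranks on node1 and the third on node2.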

Binding processes to specific CPUs is also possible. This tells the OS that a given process should always stick to a given CPU while the OS is context switching between multiple processes and threads. This can improve performance if the operating system is placing processes suboptimally. For example, when we launch fewer processes than the number of CPUs in a node, the OS might oversubscribe some multi-core sockets while leaving other sockets idle; this can lead processes to contend unnecessarily for common resources. Or, the OS might spread processes out too widely; this can be suboptimal if application performance is sensitive to interprocess communication costs. Binding can also keep the operating system from migrating processes excessively, regardless of how optimally those processes were placed to begin with.

One can use the option --bind-to <foo> where foo can be slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, and none. By default, MPI uses --bind-to core when the number of processes is <= 2, --bind-to socket when the number of processes is >2 and --bind-to none when nodes are being oversubscribed. If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process.

The processors to be used for binding can be identified in terms of topological groupings - e.g., binding to an l3cache will bind each process to all processors within the scope of a single L3 cache within their assigned location. Thus, if a process is assigned by the mapper to a certain socket, then a --bind-to l3cache directive will cause the process to be bound to the processors that share a single L3 cache within that socket. To help balance loads, the binding directive uses a round-robin method when binding to levels lower than used in the mapper. For example, consider the case where a job is mapped to the socket level, and then bound to core. Each socket will have multiple cores, so if multiple processes are mapped to a given socket, the binding algorithm will assign each process located to a socket to a unique core in a round-robin manner. Alternatively, processes mapped by l2cache and then bound to socket will simply be bound to all the processors in the socket where they are located.
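
The "map to socket, then bind to core" case described above can be sketched as a small simulation. Given the socket that the mapper chose for each rank, the function below hands out that socket's cores in round-robin order. The socket and core numbering and the function name are hypothetical, and this is only a model of the behavior described in the text, not the launcher's actual algorithm.

```python
from itertools import cycle


def bind_to_core(rank_to_socket, cores_per_socket):
    """Pick a core for each rank, given the socket each rank was mapped to.

    rank_to_socket: list indexed by rank, holding a socket id per rank.
    cores_per_socket: dict mapping socket id to the list of its core ids.
    Ranks on the same socket take that socket's cores in round-robin order.
    """
    next_core = {sock: cycle(cores) for sock, cores in cores_per_socket.items()}
    return [next(next_core[sock]) for sock in rank_to_socket]


if __name__ == "__main__":
    # Two sockets with two cores each; ranks mapped alternately to sockets.
    print(bind_to_core([0, 1, 0, 1], {0: [0, 1], 1: [2, 3]}))
```

In this model each rank receives a unique core as long as no socket is given more ranks than it has cores; beyond that point the assignment wraps around and cores are shared.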

### What is OpenMP?

OpenMP is an application programming interface (API) that supports shared-memory multithreaded programming in C, C++, and Fortran on many platforms, instruction-set architectures and operating systems. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. If an application/solver provides support for both MPI and OpenMP, then one can use both MPI (multiprocessing) and OpenMP (multithreading) at the same time; this is sometimes known as hybrid parallelism.

### What are Resource Managers and Job Schedulers?

The phrases Resource Manager and Job Scheduler are used interchangeably. Kogence autoscaling cloud HPC clusters are configured with the Grid Engine resource manager. Any command that you invoke on the shell terminal is executed on the master node. If, on the other hand, you want to send your jobs to the compute nodes of your cluster from the shell terminal, then please use the qsub command of the Grid Engine resource manager as shown below:

qsub -b y -pe mpi num_of_CPU -cwd YourCommand


Be careful with num_of_CPU. It has to be the same as or less than the number of CPUs in the compute node type that you selected in the Cluster tab of your model. Grid Engine offers a lot of flexibility with many command line switches. Please check the qsub man page. Specifically, you might find the following switches useful:

• -pe mpi: Name of the "parallel environment" on Kogence clusters.
• -b y: Command you are invoking is treated as a binary and not as a job submission script.
• -cwd: Makes the current folder as the working directory. Output and error files would be generated in this folder.
• -wd working_dir_path: Makes the working_dir_path as the working directory. Output and error files would be generated in this folder.
• -o stdout_file_name: Job output would go in this file.
• -e stderr_file_name: Job error would go in this file.
• -j y: Sends both the output and error to the output file.
• -N job_name: Gives job a name. Output and error files would be generated with this name if not explicitly specified using -o and/or -e switches. You can also monitor and manage your job using the job name.
• -sync y: By default qsub command returns control to the user immediately after submitting the job to cluster. So you can continue to do other things on the CloudShell terminal including submitting more jobs to the scheduler. This option tells qsub to not return the control until the job is complete.
• -V: Exports the current environment variables (such as $PATH) to the compute nodes.
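
If you drive job submission from a script, the switches listed above can be assembled programmatically. The Python sketch below only builds the argument list and does not run anything. The helper name `build_qsub_command`, its parameters and the example solver command are hypothetical; the switches themselves are the ones described in the list above.

```python
import shlex


def build_qsub_command(command, num_cpus, job_name=None, sync=False,
                       merge_output=False, export_env=False):
    """Assemble a qsub argument list from the switches described above."""
    args = ["qsub", "-b", "y", "-pe", "mpi", str(num_cpus), "-cwd"]
    if job_name:
        args += ["-N", job_name]   # name the job
    if merge_output:
        args += ["-j", "y"]        # send stderr to the output file
    if sync:
        args += ["-sync", "y"]     # block until the job completes
    if export_env:
        args.append("-V")          # export current environment variables
    return args + shlex.split(command)


if __name__ == "__main__":
    # "./my_solver input.dat" is a placeholder for your own command.
    cmd = build_qsub_command("./my_solver input.dat", 4,
                             job_name="run1", sync=True)
    print(" ".join(shlex.quote(part) for part in cmd))
```

On a node where qsub is available, the resulting list could be passed to `subprocess.run`; keeping it as a list avoids shell quoting problems when the command contains spaces.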

You can use qready to check whether the cluster is ready before submitting jobs to it. You submit simulations using the qsub command. You can use qwait job_name to wait for a job to finish before doing some post-processing or submitting a dependent job. We recommend adding a CloudShell to your software stack in the Stack tab GUI. This CloudShell terminal can be used to monitor and manage cluster jobs. You can monitor cluster jobs by typing qstat on the terminal. You can delete jobs using the qdel jobID command. You can monitor the compute nodes in the cluster by typing qhost on the terminal; qhost lets you monitor the performance and utilization of the compute nodes in your cluster.