- 1 Miscellaneous Kogence Terminology
- 2 Cloud HPC Hardware Terminology
- 2.1 What is 1 CPU on Kogence?
- 2.2 What is a Hardware Thread?
- 2.3 What are the SISD, SIMD, MISD and MIMD Architectures?
- 2.4 What is the Memory Latency and Bandwidth on Kogence?
- 2.5 What are Shared and Distributed Memory Computing, SMP, NUMA and ccNUMA?
- 2.6 What are "Number of Processes" and "Number of Threads"?
- 2.7 What is the Overhead of Process Context Switching?
- 2.8 What is the Overhead of Thread Context Switching?
- 2.9 How Can I Monitor CPU Utilization on Kogence Cloud HPC Servers?
- 3 Cloud HPC Network Architecture
- 4 Cloud HPC Parallel Computing Libraries and Tools
- 4.1 What is BLAS?
- 4.2 What is BLACS?
- 4.3 What is PBLAS?
- 4.4 What is LAPACK?
- 4.5 What is ScaLAPACK?
- 4.6 What is ELPA?
- 4.7 What is FFTW?
- 4.8 What is Intel MKL?
- 4.9 What is MPI?
- 4.10 What is OpenMP?
- 4.11 What is Charm++?
- 4.12 What are Resource Managers and Job Schedulers?
Miscellaneous Kogence Terminology
What is a Model?
Every cloud HPC project a user creates on Kogence is called a Model. The concept of a Model is central to Kogence. On Kogence, we do NOT launch a server or a cluster; we do NOT launch a software application or a simulator; we launch a Model. A Model is connected to a cluster and to a stack of software and simulators.
Each Model consists of:
- An independent project documentation wiki. The wiki comes with a full-featured WYSIWYG editor. Any graphics generated on execution of your Model are automatically pulled into the project wiki. The permission controls you choose for your Model apply to the Model's wiki as well. A private wiki shows up on the Model Library page only if one of the collaborators with the correct permissions logs in.
- An independent discussion board. Permission controls you choose for your Model apply to the Model's discussion boards as well.
- An independent control of permissions at the Model level. If you give your collaborators edit permission, then your collaborators have edit permission on all assets under that Model, including project files, wiki, discussion boards, cluster settings and stack settings. You cannot control permissions at the individual file level, for example.
- A connected cluster. You can choose to run your Model on a single cloud HPC server, or you can choose to run it on an autoscaling cloud HPC cluster. You can choose the maximum number of nodes that you want your cluster to scale to. Nodes are created and deleted automatically based on the progression of your Model execution. Any settings you choose for setting up your cluster remain stored with the Model.
- A connected software stack. A Model can be connected to multiple software packages. You can create a Workflow using these multiple connected packages. Some of these can run in interactive mode and others can run in unattended batch mode. Some of these may start on the master node or the interactive node and others may be scheduled on the compute nodes of the autoscaling cloud HPC cluster. Some of these may be blocking, in the sense that the next command runs only when the current one exits, while others can run concurrently with the current command. Any settings you choose for setting up your software stack and invocation commands remain stored with the Model.
- All assets of each Model are independently version controlled.
- When you Copy a Model you are creating a new Model with all of its assets duplicated. From there on, the new Model maintains its own new version control history.
What is a Simulator and a Container?
On Kogence, any software application, solver or simulation tool, each referred to as a Simulator, is deployed as an independent entity, referred to as a Container, that can be connected to Models using the Stack tab of the Model and invoked to do tasks/computations on input files/data provided under the Files tab of the Model. Containers cannot contain user-specific or Model-specific data or setup. A Container has to be an independent entity that can be linked and invoked from any Model on Kogence.
We use the term Simulator in a much wider sense than what is commonly understood. In the context of the definition above, Matlab is a simulator and so is Python. Both are deployed in independent Containers. On Kogence we use Docker container technology.
If you have developed a new solver, simulator or software application, you can now deploy it on Kogence. Just click on the Containers tab on the top NavBar, then click Create New Container and provide a link to your container repository. We support any and all Linux containers. You only need to package your solver in your container -- you do not need anything else -- no schedulers, no OS utilities, no graphics utilities, no ssh. Nothing. Period. You can decide if you want to restrict your solver's usage to yourself, to your colleagues/collaborators, or to let other Kogence users use your solver. If you open it to other Kogence users, they will not have access to either your source code or your binaries. They will be able to link your solver/software to their Models and use the functionality that your solver/software provides.
What is a Simulation?
Execution of a Model on a cloud HPC Cluster is called a Simulation. A Simulation is NOT a single job. Billing on Kogence works on a per-Simulation basis (i.e. per execution of a Model), not on a per-job basis. A Simulation can consist of a complex Workflow of multiple jobs using multiple software packages, as defined by the user on the Stack tab of the Model. Some of these jobs can run in interactive mode and others can run in unattended batch mode. Some of these may start on the master node or the interactive node and others may be scheduled on the compute nodes of the autoscaling cloud HPC cluster. Some of these may be blocking, in the sense that the next command runs only when the current one exits, while others can run concurrently with the current command.
How Long Does it Take for My Simulation to Start?
If your Model is connected to a single cloud HPC server that is not already persisting (i.e. when you start your Simulation for the first time), then it can take about 2 minutes for the server to boot up and start executing the jobs defined in the Stack tab of your Model. Once the server has started executing your jobs, you will see the Visualizer tab become active on the top NavBar, with a Stop button next to it. If you press the Stop button to stop the Simulation and then press the Run button again, the server persists through the use of kPersistent technology. That means the server starts executing your jobs immediately and you can connect using the Visualizer tab immediately. This allows you to work interactively, stop the execution, edit and debug code using the code editor accessible under the Files tab of your Model, and then restart your Model, just like you would on an on-prem workstation.
If your Model is connected to an autoscaling cloud HPC cluster, then it can take about 5 minutes for the cluster to boot up and be configured to start executing the jobs defined in the Stack tab of your Model. Once the cluster has started executing your jobs, you will see the Visualizer tab become active on the top NavBar, with a Stop button next to it. kPersistent technology is not yet available with clusters.
What are CPU-Hrs?
CPU-Hr is the unit we use on Kogence to measure the amount of computing power that has been reserved or consumed, independent of the type of hardware being used. CPU-Hrs = # of CPUs X # of Hours.
We measure time in steps of one hour. This does not mean that every time you start your Simulation you will be billed for a full hour. Only the first time you start a Simulation do we charge you for one full hour. If, before that full hour is completed, you stop and restart the same or a different Model on the same hardware, then we do not charge you anything until the completion of the full hour since the time you first started your first Model. After the completion of a full hour, you are charged for another full hour. The same logic continues for all subsequent hours.
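As a sketch, the hour-step rounding described above looks like this (the function name and structure are ours for illustration, not a Kogence API):

```python
import math

def hours_billed(total_minutes_used: float) -> int:
    """Hypothetical sketch of hour-step billing: usage is rounded up
    to the next full hour, and the first start always bills at least
    one hour."""
    return max(1, math.ceil(total_minutes_used / 60))

# A 10-minute run, a stop, and a 30-minute restart total 40 minutes of
# wall time: still within the first paid hour, so 1 hour is billed.
```

Restarts within an already-paid hour therefore cost nothing extra; only crossing into a new hour adds another full hour.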
What is 1 CPU Credit or 1 HPC Token?
HPC Token and CPU Credit are one and the same thing; which term you see depends on the release version of the Kogence HPC Grand Central App deployed under your subscription. On Kogence, HPC Tokens or CPU Credits are the currency that you purchase and then spend as you consume HPC compute resources on the Kogence Container Cloud HPC platform.
The cost of compute resources is specified in terms of CPU Credits or HPC Tokens. Please check the pricing page for the currently available pricing. Typically, 1 CPU Credit = 1 CPU for 1 hour. Accelerated hardware such as a GPU compute node is also priced in terms of CPU Credits. Typically, 10 CPU Credits = 1 GPU-accelerated CPU for 1 hour.
So, for example, if you connect your Model to a 4 CPU machine and select a wall time limit of 10 hours, then your account needs to have at least 40 CPU Credits remaining, otherwise your Simulation will not start. If your Simulation starts and either ends automatically or you stop it by pressing the Stop button after 1 hour and 20 minutes, for example, then we refund your account 32 CPU Credits. If you restart the Simulation on the same hardware and with the same wall time limit within the next 40 minutes, say after 20 minutes to be exact, then we again block 40 CPU Credits at the start. But if you stop that Simulation after only 10 minutes, then we refund your account the full 40 CPU Credits. You are not charged anything for this second Simulation because you already paid for 2 hours during the first Simulation.
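The worked example above can be checked with a few lines of arithmetic (function names are illustrative, not a Kogence API):

```python
def blocked_credits(num_cpus: int, wall_time_limit_hrs: int,
                    credits_per_cpu_hr: int = 1) -> int:
    # Credits reserved when a Simulation starts.
    return num_cpus * wall_time_limit_hrs * credits_per_cpu_hr

def refunded_credits(blocked: int, num_cpus: int, hours_billed: int,
                     credits_per_cpu_hr: int = 1) -> int:
    # Unused credits returned when the Simulation stops; usage is
    # always rounded up to full hours.
    return blocked - num_cpus * hours_billed * credits_per_cpu_hr

# 4 CPUs with a 10-hour wall time limit blocks 40 credits; stopping
# after 1 h 20 min bills 2 full hours, so 32 credits come back.
blocked = blocked_credits(4, 10)
refund = refunded_credits(blocked, 4, 2)
```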
As of this writing, everybody starts with 20 free CPU Credits. We top up your account every month for free. You can earn more free Credits. You can also purchase more Credits. Credits do not have an expiration date. Currently you can purchase CPU Credits for as low as $0.02. This means you can purchase 1 CPU-Hr of computing for $0.02. Kogence reserves the right to change this pricing without notice. Please check the pricing page for the currently available pricing.
Cloud HPC Hardware Terminology
What is 1 CPU on Kogence?
1 CPU on Kogence is the same as what is called 1 CPU in the Resource Monitor of a Microsoft Windows 10 PC, for example. Similarly, the # of CPUs shown on Kogence is the same as what is shown as "CPU(s)" by the lscpu utility on Linux platforms. Different platforms and utilities may call this same logical computing unit by different names. On the same Microsoft Windows 10 PC, for example, the System Information and Task Manager utilities call it the "number of logical processors", and Amazon AWS (see here) and Microsoft Azure (see here) call it the "number of vCPUs".
In general, this is the most relevant logical unit of computing power in the context of cloud High Performance Computing (HPC). Each logical processor in fact has an independent architectural state (i.e. instruction pointer, instruction register sets etc.) and each can independently and simultaneously execute instruction sets from one worker thread or worker process without needing to do context switching. Each of these register sets is clocked at the full CPU clock speed.
In the context of MPI (Message Passing Interface, a multi-processing framework) and OpenMP (a multi-threading library), for example, each of these CPUs can run an individual MPI process or an individual OpenMP thread without requiring frequent context switches. So if you are scheduling an MPI job on a 4 CPU machine on Kogence, then you can run
mpirun -np 4 and there will be no clock cycle sharing or forced context switching among processes.
Between CPUs, cores, sockets, processors, vCPUs, logical processors, hardware threads etc., the terminology can get confusing if you are not a computer scientist! Let's take a look at a common Microsoft Windows 10 PC that we are all so familiar with. All other computer systems will show similar info. Let's first look at the Task Manager. This machine has an Intel Core i5-7200 CPU. But don't jump to call it a 1 CPU machine; that is wrong, as you will see below. Here we also notice that this machine has 1 socket, 2 cores and 4 logical processors. Now if you open the System Information, you will see something similar. Now if you open the Resource Monitor, you will see something like this. Notice that this is a 4 CPU machine -- from CPU 0 to CPU 3. If you are on a Linux machine and run the
lscpu command, you will see 4 CPU(s) as well.
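In Python, for instance, `os.cpu_count()` reports this same number -- the count of logical processors, i.e. hardware threads:

```python
import os

# os.cpu_count() returns the number `lscpu` reports as "CPU(s)" and
# Windows reports as "logical processors": one per hardware thread,
# not one per core or socket.
n_cpus = os.cpu_count()
print(f"This machine exposes {n_cpus} CPUs (logical processors)")

# On the 1-socket, 2-core, hyper-threaded PC above this would print 4.
```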
Hardware threads within a core do share one common floating point processor and cache memory, though. Please check the FAQ section to learn more about CPUs and hardware threads. In FLOPS-heavy HPC applications it is the utilization of the floating point processor that is more important than the utilization of the CPU. If a worker thread or worker process running on one of the hardware threads can keep the floating point processor of that core fully occupied, then even though the system/OS resource management and monitoring utilities (such as
gnome-system-monitor on Linux platforms) may show overall CPU utilization capped at most at 50%, you are already getting the most out of your HPC server. 50% CPU utilization may often be misleading in the context of HPC. Remember that the CPU utilization percentage is averaged across the total number of CPUs (i.e. total number of hardware threads) in your HPC server, and if you are using only one hardware thread per core then the maximum possible CPU utilization is 50%. This just means that the CPU can in theory execute twice as many instructions per unit time, provided they are not all asking for access to the floating point processor. If they are, then 50% utilization is the best you can get.
What happens if you run 2 FLOPS-heavy worker threads or worker processes -- one on each hardware thread? Since hardware threads within a core share one common floating point processor, for FLOPS-heavy HPC applications the worker threads or worker processes running simultaneously on these hardware threads may need to slow down as they wait for the floating point processor to become available. Because the switch of access to the floating point processor from one hardware thread to another is extremely efficient in modern servers, you should see an improvement in net throughput in almost all use cases when allowing servers to run two simultaneous hardware threads per core versus allowing only one hardware thread per core. The situation gets more complex when one worker process or worker thread tries to spawn many more worker threads (for example, it may be using a parallelized version of the BLAS library). Those worker thread context switches will now start to add overhead. Unfortunately, there are no common OS or system utilities that can easily display the FLOPS performance of your application versus the maximum FLOPS capability of your HPC server. You would need to run your own benchmarks to estimate this.
For these reasons, some FLOPS-heavy HPC applications available on Kogence (e.g. Matlab) will not run one independent worker process or worker thread on each hardware thread. Instead they run one worker process or worker thread per core. But Kogence allows you to start multiple instances of these applications. You can experiment with a single instance and multiple instances and see which gives better net throughput.
What is a Hardware Thread?
Most High Performance Computing (HPC) servers that Kogence offers are built using Intel microprocessor chips. HPC server motherboards consist of multiple sockets, with each socket plugged with one Intel microprocessor chip. Each microprocessor chip has multiple cores. Each core is built using Hyper-Threading Technology (HTT). HTT creates 2 logical processors out of each core. Different operating system (OS) utilities and programs identify these logical processors by different names: some identify them as 2 CPUs and others identify them as 2 hardware threads. "Hardware thread" is a very confusing terminology. It has nothing to do with the user (worker) threads that your program might start. Each hardware thread is capable of independently executing an independent task -- either an independent worker thread or an independent process.
Hyper-Threading Technology is a form of simultaneous multithreading technology introduced by Intel. Architecturally, a processor with Hyper-Threading Technology consists of two logical processors per core. Just like a dual-core or dual-socket configuration that uses two separate physical processors, each of these logical processors has its own processor architectural state. Each logical processor can be individually halted, interrupted or directed to execute a specified process/thread, independently from the other logical processor sharing the same physical core. On the other hand, unlike a traditional dual-core or dual-socket configuration that uses two separate physical processors, the logical processors in a hyper-threaded core share the execution resources. These resources include the execution engine, caches, and system bus interface. The sharing of resources allows two logical processors to work with each other more efficiently, and allows a logical processor to borrow resources from a stalled logical core. A processor stalls when it is waiting for data it has requested so it can finish processing the present thread. The processor may stall due to a cache miss, branch misprediction, or data dependency. The degree of benefit seen when using a hyper-threaded processor depends on the needs of the software, and how well it and the operating system are written to manage the processor efficiently.
In the vast majority of modern HPC use cases, we find that HTT helps speed up the application. It is a very effective approach to get the most performance for a given cost. At a high level, any HPC program, when loaded into CPU registers as a set of machine instructions, can be thought of as a directive to the CPU to repeatedly loop over the following: 1/ Fetch instruction; 2/ Decode instruction and fetch register operands; 3/ Execute arithmetic computation; 4/ Possible memory access (read or write); 5/ Write back results to registers. It is step #4 that most people ignore when thinking about speed of execution. In this context, even cached memory is slow, much less main memory. L1 cache typically has a latency of ~2 CPU cycles, L2 cache typically has a latency of ~8 CPU cycles, while L3 cache typically has a latency of ~100 CPU cycles. Main memory has about 2X more latency than the L3 cache (~200 CPU cycles away). If your code opens some I/O pipes to read/write files, for example, then I/O devices on the main memory bus have 100X-1000X more latency than main memory (~20K to ~200K CPU cycles away). If I/O devices are on the network or on the PCIe bus then you are looking at millisecond-level latency at the least (~2 million CPU cycles away).
Now imagine your HPC code is running on a CPU. CPUs are really good at doing steps #1 to #3 and step #5. For example, modern CPUs can do 4 floating point operations (4 FLOPs) per CPU cycle. If your program needs to access some data from main memory then it will be waiting for ~200 CPU cycles and the floating point arithmetic compute engine of the CPU will be just sitting idle. If it is waiting for data from an I/O device then the compute engine may be waiting idle for millions of CPU cycles. If you could use that time, you could have completed 4 floating point operations for every one of those idle cycles -- millions of operations in total. That is exactly what the HTT technology accomplishes. Even the most heavily optimized real world HPC code needs to access cache memory frequently at the least. This means there are several tens to several hundreds of CPU cycles, at the least, that could be utilized by the other thread. Whether hardware threading (HTT) will enhance performance or not basically boils down to the ratio of floating point instructions to instructions that need to fetch or write data from cache, main memory or I/O devices. Even if that ratio is 100 or 1000, you would still expect HTT to boost performance.
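The back-of-envelope arithmetic, using the cycle counts quoted above:

```python
FLOPS_PER_CYCLE = 4          # modern cores: ~4 floating point ops/cycle
MAIN_MEMORY_STALL = 200      # ~200 cycles per main memory access
IO_STALL = 2_000_000         # ~2M cycles for a network/PCIe I/O access

# FLOPs the second hardware thread could retire while the first stalls:
flops_lost_per_memory_stall = FLOPS_PER_CYCLE * MAIN_MEMORY_STALL
flops_lost_per_io_stall = FLOPS_PER_CYCLE * IO_STALL
```

So a single main-memory stall leaves room for hundreds of floating point operations, and a single network/PCIe I/O stall for millions -- which is the headroom HTT exploits.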
If HTT is turned off, it is not that you would completely waste those idle CPU cycles. The OS can still do context switching and load another ready-to-execute process or thread (user worker thread or OS thread). The benefit of HTT is that the time it takes to do the context switch between hardware threads is much less than that needed by the OS to do the context switch, as the Thread Control Block (TCB) is already loaded in the hardware.
However, in cases where both threads are primarily operating on very close (e.g., registers) or relatively close (first-level cache) instructions or data, the overall throughput occasionally decreases compared to non-interleaved, serial execution of the two lines of execution. In the case of the LINPACK benchmark, a popular benchmark used to rank supercomputers on the TOP500 list, many studies have shown that you get better performance by disabling HT Technology. We think these are largely artificial constructs and do not represent how real life HPC codes work. These LINPACK benchmarks are hand crafted specifically to probe the speed of floating point operations and deliberately avoid accessing memory or I/O. Real world HPC workloads cannot avoid memory access or I/O as efficiently as these benchmarks do.
One of the drawbacks of a traditional on-prem HPC cluster is that it hinders experimentation with HTT for the specific use case at hand. Either the entire cluster switches HTT ON or the entire cluster switches HTT OFF. Your on-prem cluster admin typically makes that decision based on all the types of workloads that all other users typically run on that cluster. On the Kogence cloud HPC platform you create your own personal autoscaling HPC cluster for each Model you are executing. You can run the same Model multiple times on brand new clusters and disable HTT in some cases while leaving it enabled in others. You can easily test both configurations and decide which is best based on empirical evidence. With that said, for the overwhelming majority of workloads, you should leave HTT enabled.
The OSes configured on Kogence are HTT-aware, meaning they know the distinction between hardware threads and physical cores and will properly schedule user threads to reduce stalled CPU states and enhance performance, unless you specifically instruct the Kogence platform to pin your processes and worker threads to specific cores or to specific hardware threads.
What are the SISD, SIMD, MISD and MIMD Architectures?
The oldest microprocessor architectures were SISD, which is short for single instruction single data. This means that a single piece of data (say two operands on which an add instruction needs to be executed) is loaded into CPU registers and a single instruction (such as the add instruction) is executed on it.
SIMD, which is short for single instruction multiple data, enables us to process multiple data elements with a single instruction. This allows data to be "vectorized". For example, we may be able to add two columns of a matrix in a single CPU clock cycle. The width of SIMD is the number of data elements that can be processed simultaneously, and it is determined by the bit length of the registers. Intel hardware has offered vectorized SIMD capability beginning with SSE (Streaming SIMD Extensions), which supports 128-bit registers. AVX (Advanced Vector Extensions) offered 256-bit registers and AVX-512 offers 512-bit registers. On Kogence, all hardware offered is AVX-512 capable. Since the theoretical peak performance of a CPU is only achieved when the application uses the vector width to the full, the utilization of SIMD is crucial for the performance of applications on modern CPU architectures. However, it is not trivial to transform a scalar kernel into a SIMD-vectorized one. Additionally, the optimal method of vectorization is different for each instruction set architecture (ISA). Most software, simulators and solvers that we offer on Kogence, such as GROMACS, LAMMPS, NAMD etc., have been effectively vectorized and are available to our users out of the box.
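The SIMD width follows directly from the register length divided by the element size:

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of data elements a single SIMD instruction processes."""
    return register_bits // element_bits

# Doubles are 64-bit; single-precision floats are 32-bit.
sse_doubles = simd_lanes(128, 64)      # SSE: 2 doubles per instruction
avx_doubles = simd_lanes(256, 64)      # AVX: 4 doubles
avx512_doubles = simd_lanes(512, 64)   # AVX-512: 8 doubles
avx512_floats = simd_lanes(512, 32)    # AVX-512: 16 single-precision
```

So a fully vectorized AVX-512 kernel can operate on 8 doubles (or 16 single-precision floats) per instruction, which is the gap between scalar and peak performance mentioned above.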
Multiple instruction single data (MISD) is a theoretical processor architecture concept that, to our best knowledge, has never been implemented in any commercially available computing hardware.
All hardware that we offer on Kogence is Multiple instruction multiple data (MIMD) capable. Each CPU in itself is SIMD, but all computing hardware we offer consists of multiple CPUs that can independently execute multiple instructions, each instruction working on vectorized data, simultaneously. So the computing hardware in aggregate is known as an MIMD system.
MIMD computing systems can be further divided into two categories -- the shared memory architecture (known as symmetric multiprocessors or SMP) and distributed memory architecture (also known as cluster computing). We offer both on Kogence. Please refer to other sections of this document for more details.
What is the Memory Latency and Bandwidth on Kogence?
Before we answer that question, let's cover some basics.
Virtual Memory and Page Tables
On Kogence, all hardware and operating systems (OS) we offer work with a virtual memory system. Every process is given the impression that it is working with large, contiguous sections of memory. Physically, the memory of each process may be dispersed across different areas of physical memory, or may even have been "paged out" to a hard disk drive or a solid state drive (always NVMe on Kogence). When a process requests access to data in its memory, it is the responsibility of the OS to map the virtual address provided by the process to the physical address of the actual memory where that data is stored. The page table is where the OS stores its mappings of virtual addresses to physical addresses. The OS stores the page table in physical memory. The virtual memory address space is broken down into pages. The hardware and OSes we offer support multiple page sizes, and even a single process can use pages of different sizes. The physical memory space is broken down into frames. The page table holds the mapping between the virtual address of a page and the physical memory address of the corresponding frame. The page table also holds some auxiliary information about each page, such as whether the page has been paged out of physical memory to a hard disk drive or a solid state drive. The OSes we offer can provide the same virtual memory address space to two different processes. This means the page table also has to use and store process identification information, in addition to the virtual memory address, to calculate the actual physical memory address.
In HPC, we often deal with shared memory parallelism, which means processes can share memory. Multiple virtual memory pages (as addressed by different processes) may then be mapped to the same physical memory frame. If NUMA is enabled then the reverse is also possible, i.e. the same virtual memory page may be mapped to multiple physical memory frames (for read-only frames), with individual frames being in the local physical memory of each individual CPU.
Cache Memory, Translation Look Aside Buffer (TLB) and Main Physical Memory
Cache and TLB are part of the Memory Management Unit (MMU), which is integrated on the same silicon chip as the microprocessor. Both the cache and the TLB are used to reduce the time it takes for a running process to access a physical memory location, which, on all hardware that Kogence offers, sits off the microprocessor chip on a different silicon chip on the motherboard, connected via the memory bus. The cache is basically a quick-access copy of small sections of physical memory held in static RAM on the microprocessor's silicon chip, while the TLB is a quick-access copy of small sections of the page table (which also resides in physical memory) held in static RAM on the microprocessor's silicon chip.
A process's request for a virtual memory address first goes to the L1 cache, which, in hardware offered by Kogence, uses virtual addressing (the L2 and L3 caches use physical addressing). Each core gets its own L1 cache on Kogence-offered hardware. If the cache has the data then it is sent to the process; this is called a cache hit. If the cache does not have the data (called a cache miss) then the request is sent to the TLB. The TLB stores recent translations of virtual memory addresses to physical memory addresses. If the requested address is present in the TLB, the TLB search yields a match quickly, and the retrieved physical address can be used to access the L2 cache, the L3 cache (the L2 and L3 caches are accessed by physical addresses and do not need another TLB) or the physical memory, usually in that order (though there are situations where the CPU hardware can predictively skip checking intermediate locations), bring the data into the cache, and provide it to the process. This is called a TLB hit. If the requested address is not in the TLB, it is a TLB miss, and the translation proceeds by looking up the page table (stored in main physical memory) through a process called a page walk. The page walk is time-consuming in terms of the number of clock cycles it takes, as it involves reading the contents of multiple physical memory locations and using them to compute the physical address. After the physical address is determined by the page walk, the virtual-to-physical address mapping is entered into the TLB. Then that translated address is used to access physical memory, bring the data into the cache, and provide it to the process. CPU architectures use multiple TLBs.
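The lookup order above can be sketched as a toy model. Dictionaries stand in for the hardware structures here; nothing in this sketch reflects real cache geometry or timing:

```python
def load(vaddr, l1_cache, tlb, page_table, page_size=4096):
    """Toy model of the lookup order: L1 cache (virtually addressed),
    then TLB, then a page walk through the in-memory page table."""
    page, offset = divmod(vaddr, page_size)
    if vaddr in l1_cache:                  # cache hit: fastest path
        return l1_cache[vaddr], "cache hit"
    if page in tlb:                        # TLB hit: cheap translation
        frame = tlb[page]
        event = "TLB hit"
    else:                                  # TLB miss: costly page walk
        frame = page_table[page]
        tlb[page] = frame                  # remember the translation
        event = "page walk"
    paddr = frame * page_size + offset
    # ...the hardware would now check L2/L3/main memory at paddr and
    # fill the L1 cache with the data...
    return paddr, event
```

A first access to a page pays for a page walk; a later access to a different address on the same page is only a TLB hit, which is the whole point of the TLB.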
Latency and Throughputs
Servers offered on Kogence have 3 levels of cache: L1, L2 and L3. The L1 cache is further broken down into instruction (L1i) and data (L1d) caches. Each core gets its own L1 and L2 cache, while the L3 cache is shared among all cores in a socket. Hyperthreads do not get their own cache. Typically our hardware will have 32KB each of L1i and L1d cache, 256KB of L2 cache and >10MB of L3 cache. The amount of each level of cache differs from hardware to hardware. Check the specification page for more details. On Kogence hardware, the TLB may reside between the CPU and the L1 cache. On the servers Kogence offers,
- L1 cache typically has a latency of ~2 CPU cycles and a bandwidth of about ~500 bytes/CPU-cycle.
- L2 cache typically has a latency of ~8 CPU cycles and a bandwidth of about ~500 bytes/CPU-cycle.
- L3 cache typically has a latency of ~100 CPU cycles and a bandwidth of about ~200 bytes/CPU-cycle.
- Main memory has a latency of ~200 CPU cycles and a bandwidth of about ~100 bytes/CPU-cycle.
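At an assumed 3 GHz clock (actual clock speeds vary by instance type), those cycle counts translate to wall-clock latencies roughly as follows:

```python
CLOCK_GHZ = 3.0  # assumed clock speed; actual Kogence hardware varies

LATENCY_CYCLES = {"L1": 2, "L2": 8, "L3": 100, "main memory": 200}

def cycles_to_ns(cycles: int, clock_ghz: float = CLOCK_GHZ) -> float:
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz

for level, cycles in LATENCY_CYCLES.items():
    print(f"{level}: ~{cycles_to_ns(cycles):.1f} ns")
```

At 3 GHz, an L1 hit costs well under a nanosecond while a main memory access costs on the order of 70 ns -- two orders of magnitude apart.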
The oldest microprocessor chips had one CPU connected to an off-chip RAM through a memory bus. Then came the multi-core microprocessors. A single chip had multiple CPUs in it, connected to the same shared off-chip RAM through a common memory bus. This shared memory computing architecture is referred to as symmetric multiprocessing (SMP). What this means is that each CPU can access the common shared RAM in a completely autonomous fashion, independent of the other CPUs. The architecture is "symmetric" when seen from each core's perspective. There is no master/slave type of concept, there is a single OS across all CPUs, and none of the CPUs has to go through network hardware (in the case of RDMA) and/or the OS communication stack (in the case of TCP/IP). Processes running on each CPU can address the entire common memory space and access that memory space directly through the memory controller. An SMP system is a tightly-coupled system in which multiple processors working under a single operating system and a single virtual memory space can access the entire system memory over a common bus or interconnect path.
Applications using OpenMP for parallelism rely on this shared memory architecture. You cannot run two threads of the same process on CPUs that do not share memory. Older versions of MPI did not leverage the shared memory capability, meaning that if you were running two MPI ranks (i.e. two processes in the same process group) on an SMP machine then those two processes would have used inter-process communication methods to access each other's data. Starting from MPI-3, MPI offers shared memory capability as well.
Of course, in the SMP architecture, the memory access latency will depend on memory bus contention or memory access conflicts. As the number of cores connected to a single memory bus started to grow, this contention quickly became an issue and the performance boost expected from multiple CPUs quickly started to disappear. Cache memory architectures, integrated memory controllers (IMC) and multi-channel memory architectures were developed to reduce this memory contention in SMP architectures. In the modern hardware that we offer on Kogence, the memory architecture is quite complex in order to deal with possible contentions. Please see the Intel Cascade Lake and Intel Skylake architecture documentation pages as well as other sections of this document.
SMP architectures are further broken down into two conceptual frameworks -- Uniform Memory Access (UMA) and Non-uniform Memory Access (NUMA). UMA architectures are basically the earliest SMP architectures where, in the absence of contentions, the memory access latency for each CPU is exactly the same. These UMA SMP architectures are clearly not scalable as more cores and CPUs are added. SMP architectures that we offer on Kogence are NUMA architectures. In NUMA SMP architectures, each CPU does not necessarily have the same access latency to the entire system RAM but, logically, there is still a single common virtual memory address space for all processes running on a single node. Each CPU can address the same entire virtual memory address space. Therefore, the architecture is still known as an SMP architecture. Older SMP applications will run on the newer NUMA memory architecture unchanged -- hence the architecture is still SMP at a logical level.
Within each socket of the architectures we offer, there are two sub-NUMA clusters created by the two IMCs. Each socket gets a mesh interconnect between cores and IMCs, two IMCs and 6 memory channels (3 memory channels per IMC), with the sockets on the same node being connected by UPI links. Absent any contention, all the cores in a localization zone (i.e. in the sub-NUMA cluster) have the same memory access latency. One can use OS-provided utilities to specify process or thread affinities to within sub-NUMA clusters.
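On Linux, one such utility is exposed through Python's os module. The sketch below pins the current process to a subset of its allowed CPUs and then restores the original mask; the choice of "first CPU" is purely illustrative -- check lscpu or numactl output for the real sub-NUMA topology of your node.

```python
import os

# Query the set of CPUs the current process may run on (Linux only).
allowed = os.sched_getaffinity(0)   # 0 = the calling process
print("allowed CPUs:", sorted(allowed))

# Pin to the first allowed CPU -- a stand-in for pinning to the CPUs
# of one sub-NUMA cluster; real CPU IDs depend on the node topology.
first = min(allowed)
os.sched_setaffinity(0, {first})
print("now pinned to:", sorted(os.sched_getaffinity(0)))

# Restore the original affinity mask.
os.sched_setaffinity(0, allowed)
```

Command-line tools such as taskset and numactl achieve the same effect without touching the application code.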
Cache management with these modern NUMA SMP architectures is significantly more complex, as we need to make sure that the potentially multiple duplicate copies of data loaded into the L1 and L2 caches of each core as well as the L3 cache of the socket are in sync as multiple CPUs perform read/write operations. The hardware architectures we offer are cache coherent NUMA (ccNUMA), which means that cache coherence is ensured by the hardware itself.
Distributed memory computing (also known as cluster computing or loosely-coupled computing) should be contrasted with the tightly coupled shared memory (SMP) computing. Clusters consist of "nodes". A node is the largest unit of computing where cache coherency can be ensured. Different nodes run independent copies of the OS and have their own virtual memory address space. Processes running on different nodes need to access data through the network hardware (in case of RDMA) and/or the OS communication stack (in case of TCP/IP).
What are "Number of Processes" and "Number of Threads"?
A process is a logical representation of an instance of a program that has been submitted to the CPU and OS to manage its execution. A process is distinct from a thread -- both user/OS (worker) threads as well as hardware threads (HTT technology). Hardware thread is a confusing terminology; each hardware thread can execute an independent process or an independent worker thread. On the Kogence platform we refer to a hardware thread as a CPU.
As opposed to threads, processes provide stronger logical isolation, as you are starting independent instances of a program that work independently. Intuitively, one can understand starting multiple processes of the same program to be similar to starting the same program multiple times. For example, if you start multiple instances of Matlab by typing matlab -desktop & multiple times on the CloudShell terminal then you are basically starting multiple Matlab processes. You should not be afraid of multiple instances of Matlab corrupting data or files, because if one instance of Matlab opens a file then the other instance cannot access it. Similarly, both instances create their own temporary data in memory to work on, so you do not have to think about controlling access to variables in memory.
Using multi-processing libraries such as MPI to start multiple instances of Matlab is a much better method than the above-mentioned method of starting multiple instances on a shell terminal. As an example, if your code running under each instance of Matlab is trying to access the same file then all instances except one will crash, complaining that they cannot access the file. If you start processes through standard libraries such as MPI then you can be assured that MPI will manage those things. It will take care of keeping a process on wait until other processes close the file.
At a lower level, a process is defined by a Process Control Block (PCB). The PCB stores the state of execution of a process. It has all the information that the OS and CPU need to freeze or unfreeze a process. Any set of instructions submitted to the CPU that comes with its own PCB is a process. Each process gets a unique process ID (PID) that you can check using OS utilities, so any task that has a unique PID is a process. As an example, if you execute a bash shell script then the bash shell started to execute the script instructions will get a PID and would be a process. If from that shell script you provide instructions to call another bash shell script then that script will be executed in another child bash shell, and that bash shell will get its own PCB and PID. Modern OSes are smart enough not to duplicate the machine code for each instance in the main memory and in cache. For example, they will only load one copy of the text or code segment in memory. But each process will get its own program counter, a CPU register that keeps track of which instruction is being executed currently. The program counter is part of the PCB and gets saved and restored when a process is taken out of the running state by the OS and then brought back to the running state. Typically, the virtual memory allocated to a process and the files and other I/Os opened by a process are restricted to that process. Other processes cannot access those resources. But processes can specifically issue instructions to share these resources with other processes. In HPC we do this by using standard libraries such as MPI.
Processes can start child processes (and those can start grandchild processes, etc.). Each of those will get their own PCB (and therefore a PID). Processes can also start multiple threads. Threads operate under the same PCB; they do not get their own PCB (instead they get a Thread Control Block, TCB, which has a link to the PCB of the parent process). Threads can access the resources of the process.
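The PID distinction is easy to observe from Python's standard library: a thread reports the parent's PID (it runs under the same PCB), while a child process reports a new one.

```python
import os
import threading
from multiprocessing import Process, Queue

# Helper: report the PID seen inside the thread/process.
def report(q):
    q.put(os.getpid())

if __name__ == "__main__":
    parent = os.getpid()
    q = Queue()

    t = threading.Thread(target=report, args=(q,))
    t.start(); t.join()
    thread_pid = q.get()

    p = Process(target=report, args=(q,))
    p.start(); p.join()
    child_pid = q.get()

    print(thread_pid == parent)  # True: a thread shares the process
    print(child_pid == parent)   # False: the child is a new process
```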
In HPC, typically when you submit a simulation or solver job, that job first starts a single process on one of the CPUs on one of the compute nodes. The solver code supports specific types of parallelism. If it supports multi-threading then it can start multiple threads (threads of a single process) on multiple CPUs of the same node. Threads run on the same node and have access to the same data and same memory. Solver code needs to make sure that there is no data access conflict when multiple threads are working on the same memory/data. Multi-threading does not provide as much isolation as multi-processing does, because processes automatically prevent access to each other's memory and data. This makes programming multi-threaded code somewhat more challenging. On the other hand, some amount of multi-threading capability is easy to leverage by almost any HPC solver. Most solvers use standard basic math libraries such as BLAS and LAPACK. Those libraries internally use multi-threading to do computations in parallel on multiple threads on the same compute node.
Most HPC solvers can also do multi-processing as opposed to multi-threading, meaning multiple CPUs across multiple nodes can be instructed to run independent instances of the same solver code (child processes), each operating on a specific set of data, with the child processes working in parallel. As children may be running on different nodes, they don't have access to the memory of the parent process on the parent compute node. Child processes exchange data with each other and with the parent using message passing interface (MPI) library subroutines. If MPI processes are running on the same node then they do have the ability to access shared memory as well, but all requests to access data/memory still go through MPI subroutines. MPI subroutines manage the mode of access to the data automatically by keeping track of which child process is running on which node.
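Message passing between processes can be sketched with a pipe from Python's standard library -- a toy analogy for what MPI point-to-point calls (MPI_Send/MPI_Recv) do between ranks, not actual MPI:

```python
from multiprocessing import Process, Pipe

# Worker process: receive a chunk of work, send a result back.
def worker(conn):
    data = conn.recv()       # block until the parent sends
    conn.send(sum(data))     # send the partial result back
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send([1, 2, 3, 4])   # "rank 0" distributes work
    print(parent_conn.recv())        # 10
    p.join()
```

Real MPI generalizes this pattern to many ranks across many nodes, choosing shared memory or the network transparently depending on where each rank runs.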
On Kogence, we restrict the product of processes and threads to be equal to or less than the number of CPUs in the cluster. This eliminates serious overhead from context switching.
What is the Overhead of Process Context Switching?
A context switch changes which instruction stream a given core is executing at a given time. Context switch can refer to switching from one hardware thread to another hardware thread (HTT technology), switching from one user (worker) thread to another user (worker) thread, or from one process to another process. Context switching from one hardware thread to another hardware thread has negligible overhead. This FAQ discusses the overhead of a process context switch. There is another FAQ that discusses the overhead of a worker thread context switch.
When a request is made to execute a different process on the same CPU (either triggered by a user code instruction, i.e. a voluntary context switch, or by the OS scheduler because the running process went into an idle state waiting for some I/O or because the OS scheduler wants to give processor time to other processes, i.e. an involuntary context switch), then first a switch from user mode to OS mode is triggered. This simply means that a subroutine from the OS code base needs to be called that will perform the task of context switching. OS code is just like usual user code that is loaded into the physical memory. This OS code will save the process control block (PCB, which consists of the current values in a set of registers in the CPU such as the program counter, pointer to the page table, etc.) of the current process to physical memory and load the PCB of the new process from physical memory into the CPU registers. The OS will then make another switch from OS mode to user mode to let the CPU start executing instructions from the new process. In older hardware and OS architectures, a switch from user mode to OS mode itself was implemented like a full process context switch. That meant saving the existing process PCB, loading the OS PCB, executing OS code, saving the OS PCB and then loading the new process PCB. But in modern hardware and OSes, the cost of switching out of user mode to OS mode and back is much less expensive and is not a significant portion of the cost of a context switch anymore. Also, on virtualized cloud HPC hardware, some hypervisors need to give control to the host machine OS, and the guest OS cannot do this switch. This increases the computational cost of a context switch by an order of magnitude. On the Kogence platform, we have carefully configured the system to eliminate this latency.
Therefore the fixed cost of doing a context switch consists of: switching from user mode to OS mode; storing and loading the PCB; and switching back from OS mode to user mode. On the Kogence cloud HPC platform, switching into OS mode and back to user mode typically costs about 100 CPU clock cycles. With CPU clocks of about ~2 to 3GHz, the time it takes to switch back and forth between OS mode and user mode is ~50ns. Among the fixed cost components, the cost of storing and loading the process PCB is much larger. This is easily 2 orders of magnitude bigger and takes about 2,000 CPU cycles or ~1µs.
A much bigger portion of the cost is the variable cost. The variable CPU cycle cost of doing a context switch consists of: flushing of the cache; and flushing of the TLB. Note that each process uses the same virtual memory address space but they are each mapped to a different physical memory address space through the page table. A small portion of the page table is cached in the TLB. Now, depending upon the number of pages of virtual memory that the old and new processes use (called the working set), the CPU might encounter lots of TLB misses on context switching. So it may have to flush the TLB and load new sections of the page table. In addition, the old physical memory frames that were cached in various levels of cache may also become useless, and the CPU might encounter lots of cache misses, so it may have to flush the cache and load new physical memory frames into the cache.
The variable CPU cycle cost of switching from an old process to a new process and back to the old process will change dramatically depending on the process pinning you instruct the OS scheduler to use. Let's say CPU1 is executing process1. Then the OS asks the same CPU1 to start executing process2. After some time, the OS wants to restart process1. The number of CPU cycles we have to waste before we can successfully restart process1 changes dramatically depending on whether the OS schedules process1 back to the original CPU1 or schedules it to a new CPU2 (and instead starts a process3 on the original CPU1). Note that each core gets its own L1 and L2 cache while the L3 cache is shared between cores. Since the address translations from the virtual memory pages that process1 needs to the physical memory addresses may already be present in the TLB of CPU1, and the required frames of physical memory may also already be present in the L1 or L2 cache of CPU1 (since process1 was running on CPU1 some time back), process1 may restart on CPU1 much quicker than on CPU2. If the new CPU2 sits on a different socket then the cost will be even higher, as both the TLB and all levels of caches will need to be repopulated. By the way, process pinning affects the fixed cost of context switching as well, because now the PCB is also not available in cache and needs to be read from main physical memory.
The variable cost also depends strongly on the size of virtual memory pages that a process is using. If processes use very large page sizes then the chances of needing to flush the TLB may be lower (as the number of entries needed in the TLB becomes smaller) but the chances of needing to flush the cache may increase.
All of the above depend strongly on the size of the caches in the CPU you're using. Typically, if pages are available in cache then writing them back can easily take 100,000 cycles or a few microseconds. If we miss cache and TLB then this can increase by an order of magnitude, to a few tens of microseconds. As we discussed above, this also depends on the process pinning instructions your code might give to the OS. Past a certain working set size, the fixed cost of context switching is negligible compared to the variable cost due to the cost of accessing memory. Because of all these reasons, it is very hard to put any representative/average number on the cost of context switching.
- Switching from user mode to OS mode and back takes about 100 CPU cycles or about 50ns.
- A simple process context switch (i.e. without cache and TLB flushes) costs about 2K CPU-cycles or about 1µs.
- If your application uses any non-trivial amount of data that would require flushing of TLB and cache, assume that each context switch costs about 100K CPU-cycles or about 50µs.
- As a rule of thumb, just copying 30KB of data from one memory location to another memory location takes about 10K CPU-cycles or about 5µs.
- Launching a new process takes about 100K CPU cycles or about 50µs.
- In the HPC world, creating more active threads or processes than there are hardware threads available is extremely detrimental (e.g. in 100K CPU-cycles, the CPU could have executed 400K FLOPs). If the number of worker threads is the same as the number of hardware threads then it is easier for the OS scheduler to keep re-scheduling the same threads on the CPU they last used (weak affinity). The recurrent cost of context switches when an application has many more active threads than hardware threads is very high. This is why on Kogence, we do not allow the product of processes and threads to be higher than the number of CPUs.
- On interactive nodes or master nodes, we do not restrict the number of jobs, processes and threads you can start. An interactive node has lots of background processes taking small amounts of CPU time. This means the number of threads can be more than the number of CPUs. Threads tend to bounce around a lot between CPUs. In this case the costs of context switching and thread switching don't significantly differ in practice, unfortunately. For HPC workloads started on interactive or master nodes, you should pin processes and threads to specific CPUs to avoid this overhead.
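You can get a rough feel for switch overhead yourself. The sketch below bounces a token between two Python threads using events, forcing roughly one scheduler-level switch per hop. The absolute number is heavily influenced by the machine, the OS scheduler and Python's GIL, so treat it as an illustration rather than a benchmark.

```python
import threading
import time

# Two threads hand a token back and forth N times; each handoff
# forces the OS to switch between the two threads.
N = 10000
ev_main, ev_worker = threading.Event(), threading.Event()

def pong():
    for _ in range(N):
        ev_worker.wait()
        ev_worker.clear()
        ev_main.set()

t = threading.Thread(target=pong)
t.start()
start = time.perf_counter()
for _ in range(N):
    ev_worker.set()
    ev_main.wait()
    ev_main.clear()
elapsed = time.perf_counter() - start
t.join()
# 2*N switches total (main -> worker -> main per iteration)
print(f"~{elapsed / (2 * N) * 1e6:.2f} us per switch")
```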
What is the Overhead of Thread Context Switching?
Comparing multi-threading with multi-processing: in some cases, the cost of switching from one user (worker) thread to another user (worker) thread (of the same process) may be lower than the cost of switching from one process to another process. The OS does not need to load a new PCB because both threads are part of the same process. The mapping between virtual memory pages and physical memory frames also remains the same for the old and new threads of the same process. Depending upon the working set size of each thread, the CPU may not encounter lots of TLB or cache misses. For the most part, the cost of thread-to-thread switching is about the same as the cost of entering and exiting the kernel. On systems such as Linux, that is very cheap. Consequently, thread switching is significantly faster than process switching.
- If you are doing a thread switch within the same CPU (proper CPU pinning) with no need to do TLB and cache flushes, then the context switch takes about 100 CPU-cycles or about 50ns.
- A thread switch going to a different CPU will cost about the same as a process context switch does, i.e. it costs about 2K CPU-cycles or about 1µs. Here we are assuming that the virtual memory the thread needs is already in cache and we don't need TLB and cache flushing.
- If your application uses any non-trivial amount of data that would require flushing of TLB and cache, assume that each thread context switch costs about 100K CPU-cycles or about 50µs.
- As a rule of thumb, just copying 30KB of data from one memory location to another memory location takes about 10K CPU-cycles or about 5µs.
- Launching a new thread takes about 10K CPU cycles or about 5µs.
- Creating more active threads than there are hardware threads available is extremely detrimental. If the number of worker threads is the same as the number of hardware threads then it is easier for the OS scheduler to keep re-scheduling the same threads on the CPU they last used (weak affinity). The recurrent cost of context switches when an application has many more active threads than hardware threads is very high. This is why on Kogence, we do not allow the product of processes and threads to be higher than the number of CPUs.
- On interactive nodes or master nodes, we do not restrict the number of jobs, processes and threads you can start. An interactive node has lots of background processes taking small amounts of CPU time. This means the number of threads can be more than the number of CPUs. Threads tend to bounce around a lot between CPUs. In this case the costs of context switching and thread switching don't significantly differ in practice, unfortunately. For HPC workloads started on interactive or master nodes, you should pin processes and threads to specific CPUs to avoid this overhead.
How Can I Monitor CPU Utilization on Kogence Cloud HPC Servers?
When using a tool like top or gnome-monitor on Kogence HPC servers, you will see CPU usage reported as being divided into 8 different CPU states. For example:
%Cpu(s): 13.2 us, 1.3 sy, 0.0 ni, 85.3 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
Here top is showing the percentage of time the server is spending in each of the eight possible states. Of these 8 states, "system", "user" and "idle" are the 3 main CPU states. The ni state is a subset of the us state and represents the fraction of CPU time being spent on low priority tasks. The wa state is a subset of the id state and represents the fraction of CPU time being spent waiting for an I/O operation to complete. These 3 main CPU states and the si, hi and st states add up to 100%.
Please note that these are averaged over all CPUs of your HPC server. So if you started an 8 CPU HPC server then top will show utilization averaged across all 8 CPUs. You can press 1 to get per-CPU statistics.
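These same counters come from /proc/stat, which tools like top read and summarize. A minimal Python sketch of that parsing (Linux only; the counters are cumulative clock ticks, not percentages):

```python
# Read the aggregate CPU counters that top summarizes.
# /proc/stat field order on the "cpu" row is:
# user nice system idle iowait irq softirq steal
def cpu_states(path="/proc/stat"):
    with open(path) as f:
        fields = f.readline().split()   # first line: aggregate "cpu" row
    names = ["us", "ni", "sy", "id", "wa", "hi", "si", "st"]
    return dict(zip(names, map(int, fields[1:9])))

states = cpu_states()
total = sum(states.values())
for name, ticks in states.items():
    print(f"{name}: {100 * ticks / total:.1f}%")
```

To turn these cumulative counters into the instantaneous percentages top shows, one would sample twice and divide the deltas.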
- system (sy)
The “system” CPU state shows the amount of CPU time used by the kernel. The kernel is responsible for low-level tasks, like interacting with the hardware, memory allocation, communicating between OS processes, running device drivers and managing the file system. Even the CPU scheduler, which determines which process gets access to the CPU, is run by the kernel. While usually low, system state utilization can spike when a lot of data is being read from or written to disk, for example. If it stays high for longer periods of time, you might have a problem. For example, if the CPU is doing a lot of context switching then you will see it spending a lot more time in the system state.
- user (us)
The “user” CPU state shows CPU time used by user space processes. These are processes like your application, or management daemons and applications started automatically by Kogence that run in the background. In short, all CPU time used by anything other than the kernel is marked “user” (including the root user), even if it wasn't started from any user account. If a user-space process needs access to the hardware, it needs to ask the kernel, and that time counts towards the “system” state. Usually, the “user” state uses most of your CPU time. In properly coded HPC applications, it can stay close to the maximum of 100%.
- nice (ni)
The “nice” CPU state is a subset of the “user” state and shows the CPU time used by processes that have a positive niceness, meaning a lower priority than other tasks. The nice utility is used to start a program with a particular priority. The default niceness is 0, but can be set anywhere between -20 for the highest priority to 19 for the lowest. CPU time in the “nice” category marks lower-priority tasks that are run when the CPU has some time left to do extra tasks.
- idle (id)
The “idle” CPU state shows the CPU time that’s not actively being used. Internally, idle time is usually calculated by a task with the lowest possible priority (using a positive nice value).
- iowait (wa)
“iowait” is a sub-category of the “idle” state. It marks time spent waiting for input or output operations, like reading or writing to disk. When the processor waits for a file to be opened, for example, the time spent will be marked as “iowait”. However, if a task running on a given CPU blocks on a synchronous I/O operation, the kernel will suspend that task and allow other tasks to be scheduled on that CPU. In that case, the CPU is not idle and this time will not show up in the id or wa states.
- hardware interrupt (hi)
The CPU time spent servicing hardware interrupts.
- software interrupt (si)
The CPU time spent servicing software interrupts.
- steal (st)
The “steal” (st) state marks time taken over by the hypervisor.
How Can I Monitor Memory Utilization on Kogence Cloud HPC Servers?
One can use the top or gnome-monitor utilities to monitor the memory utilization.
Cloud HPC Network Architecture
Is the Kogence Cloud HPC Cluster Connected to Internet?
If you are starting your simulation on a single cloud HPC server, your HPC server is connected to the internet. Once your Model is in the running state, you can click on the Visualizer button on the top right corner to connect to your HPC server over the internet under a secure and encrypted SSL/TLS channel. If you connect a CloudShell to the software stack of your model through the Stack tab of your Model then you can use the CloudShell terminal to pull repositories over the internet.
If you are starting an autoscaling cloud HPC cluster then the master node of the cluster is connected to the internet in the same way.
Workloads running inside the container provisioned on these nodes will also have access to the internet.
Please refer to Network Architecture document for more details.
How is the Network Architected Among the Compute Nodes?
Do the Kogence Cloud HPC Cluster Come with Infiniband Network?
Yes. If you select Network Limited Work Load node then that node is equipped with an OS Bypass network with network bandwidth up to 100Gbps. Please refer to OS Bypass Network and Remote Direct Memory Access documents for more details.
How is the Network Configured Among the Containers?
Cloud HPC Parallel Computing Libraries and Tools
What is BLAS?
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software.
On Linux systems, if you don't configure your system properly, you would be using the default GNU BLAS (such as /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1), which is a generic library and is not optimized for the hardware. There are several highly optimized BLAS libraries (such as OpenBLAS, AtlasBLAS, GotoBLAS and Intel MKL) that can be used instead of the default base libraries. These libraries are optimized to take advantage of the hardware they are run on, and can be significantly faster than the base implementation (operations such as matrix multiplication may be over 40 times faster). Kogence does this automatically for you based on the hardware you selected on the Cluster tab of your model.
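If you want to see which BLAS/LAPACK shared libraries your own Linux system would resolve, Python's ctypes can ask the system linker. The library names probed below are common conventions; which implementation actually backs them (reference BLAS, OpenBLAS, MKL, ...) is distribution dependent and any of them may be absent.

```python
import ctypes.util

# find_library consults the system linker (e.g. ldconfig on Linux)
# and returns a resolvable library name, or None if not installed.
for name in ("blas", "openblas", "lapack", "mkl_rt"):
    print(name, "->", ctypes.util.find_library(name))
```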
What is BLACS?
The BLACS (Basic Linear Algebra Communication Subprograms) is a linear algebra oriented message passing interface for distributed memory cluster computing. It provides basic communication subroutines that are used in the PBLAS and ScaLAPACK libraries, implementing a portable message passing layer for these higher level libraries. To provide this functionality, BLACS can be built on top of any of the communication layers available on the hardware, such as MPI, PVM, NX, MPL, etc., and provides a hardware agnostic interface to the higher level libraries. It is important to build BLACS with a hardware optimized communication layer to get the most optimized performance for your HPC applications.
What is PBLAS?
PBLAS (Parallel BLAS) is the distributed memory version of the Level 1, 2 and 3 BLAS library. BLAS is used for shared memory parallelism while PBLAS is used for distributed memory parallelism appropriate for clusters of parallel computers (heterogeneous computing). PBLAS uses BLACS for distributed memory communication. Just like BLAS, on Kogence we automatically configure software/solvers/simulators with appropriate hardware optimized versions of PBLAS based on the hardware you selected on the Cluster tab of your model. No action is needed from users.
What is LAPACK?
LAPACK is a large, multi-author, Fortran library for numerical linear algebra. It provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision. LAPACK is the modern replacement for LINPACK and EISPACK libraries. LAPACK uses block algorithms, which operate on several columns of a matrix at a time. On machines with high-speed cache memory, these block operations can provide a significant speed advantage.
LAPACK uses BLAS. The speed of subroutines of LAPACK depends on the speed of BLAS. EISPACK did not use any BLAS. LINPACK used only the Level 1 BLAS, which operate on only one or two vectors, or columns of a matrix, at a time. LAPACK's block algorithms also make use of Level 2 and Level 3 BLAS, which operate on larger portions of entire matrices. LAPACK is portable in the sense that LAPACK will run on any machine where the BLAS are available but performance will not be optimized if hardware optimized BLAS and LAPACK are not used.
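The blocking idea behind LAPACK's Level-3 BLAS usage can be illustrated with a toy tiled matrix multiply in pure Python: the triple loop is split into small sub-blocks so that each inner update reuses data that fits in cache. This is a conceptual sketch of the technique, not LAPACK code; real implementations tune block sizes to the cache sizes discussed earlier.

```python
# Toy blocked (tiled) matrix multiply: C = A * B for n x n matrices,
# processed in bs x bs blocks the way Level-3 BLAS kernels tile work.
def blocked_matmul(A, B, n, bs=2):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # update one C block from one A block and one B block
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(blocked_matmul(A, B, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```

In Python the tiling brings no speedup (the interpreter dominates), but in compiled BLAS kernels the same loop structure is what turns memory-bound code into cache-friendly, near-peak-FLOP code.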
On Linux systems, if you don't configure your system properly, you would be using the default GNU LAPACK (such as /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1), which is a generic library and is not optimized for the hardware. There are several highly optimized LAPACK libraries (such as OpenBLAS/LAPACK, Atlas/LAPACK, GotoBLAS/LAPACK and Intel MKL) that can be used instead of the default base libraries. If you configure your system properly with hardware optimized LAPACK libraries then your HPC applications will get a serious boost in performance. Kogence does this automatically for you based on the hardware you selected on the Cluster tab of your model.
What is ScaLAPACK?
ScaLAPACK (Scalable LAPACK) library includes a subset of LAPACK routines redesigned for distributed memory clusters of parallel computers (heterogeneous computing). ScaLAPACK uses explicit message passing for inter-process communication using MPI or PVM (this functionality is provided by the BLACS that is used in ScaLAPACK). Just like LAPACK, ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy that includes the off-processor memory of other processors (or processors of other computers), in addition to the hierarchy of registers, cache, and local memory on each processor. ScaLAPACK uses PBLAS and BLACS. ScaLAPACK is portable in the sense that ScaLAPACK will run on any machine where PBLAS, LAPACK and the BLACS are available but performance will not be optimized if hardware optimized PBLAS and LAPACK are not used.
Kogence uses ScaLAPACK automatically whenever appropriate based on the hardware you selected on the Cluster tab of your model. No action is needed from users.
What is ELPA?
ELPA (Eigenvalue SoLvers for Petaflop-Applications) is a matrix diagonalization library for distributed memory clusters of parallel computers (heterogeneous computing). For some operations, ELPA methods can be used instead of ScaLAPACK methods to achieve better performance. The ELPA API is very similar to the ScaLAPACK API, and therefore it is very convenient for HPC application developers to support both libraries and offer the ability to link ELPA in addition to ScaLAPACK during the application build. ELPA does not provide the full functionality of ScaLAPACK and is not a complete substitute. Whenever HPC applications deployed on Kogence offer support for ELPA, we provision those applications (e.g. Quantum Espresso) with hardware optimized ELPA.
What is FFTW?
FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data, as well as of even/odd data, i.e. the discrete cosine (DCT) and sine (DST) transforms. At Kogence we automatically use the hardware-optimized versions of FFTW based on the hardware you selected on the Cluster tab of your model.
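As an illustration of the DFT interface such libraries expose (here via NumPy's FFT module, which uses pocketfft rather than FFTW itself, but the usage is analogous):

```python
import numpy as np

# Transform a real-valued signal and invert it. A pure 5-cycle tone
# sampled 64 times puts all its energy in frequency bin 5.
t = np.linspace(0.0, 1.0, 64, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t)

spectrum = np.fft.rfft(signal)           # real-input DFT (half spectrum)
peak_bin = int(np.argmax(np.abs(spectrum)))
assert peak_bin == 5                     # energy concentrated at bin 5

roundtrip = np.fft.irfft(spectrum, n=64)
assert np.allclose(roundtrip, signal)    # inverse recovers the signal
```

FFTW's C API follows the same plan-transform-invert shape; applications linked against a hardware-optimized FFTW get faster transforms without source changes.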
What is Intel MKL?
Intel Math Kernel Library (Intel MKL) is a library that includes Intel's hardware-optimized versions of BLAS, LAPACK, ScaLAPACK and FFTW, as well as some miscellaneous sparse solvers and vector math subroutines. The routines in MKL are hand-optimized specifically for Intel processors. On Kogence, Intel MKL is automatically configured and linked whenever appropriate based on the hardware you selected on the Cluster tab of your model. No action is needed from users.
What is MPI?
Message Passing Interface (MPI) is a standardized and portable message-passing specification designed to function on a wide variety of parallel computing architectures. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. HPC applications, solvers and simulators written using MPI library routines, and properly compiled and linked against MPI libraries, enable multiprocessing. MPI should not be confused with OpenMP, which is a library for multithreading. Multiple processes of an MPI application can run in parallel either on a single multi-CPU cloud HPC server using shared-memory parallelism (with MPI-3 and newer) or on an autoscaling cluster of multiple multi-CPU cloud HPC servers using shared-memory parallelism within each server and distributed-memory parallelism across the servers. OpenMP applications, on the other hand, enable you to start multiple threads of a single process of the application. A process, and therefore all its threads, necessarily runs on a single cloud HPC server. If an application/solver provides support for both MPI and OpenMP then one can use both MPI (multiprocessing) and OpenMP (multithreading) at the same time; this is sometimes known as hybrid parallelism.
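The process/thread distinction underlying the MPI-versus-OpenMP split can be sketched in plain Python (this is not MPI or OpenMP themselves, just the OS-level behavior they build on; the fork start method assumes a Unix host):

```python
import multiprocessing as mp
import threading

data = {"value": 0}

def bump():
    data["value"] += 1   # mutate the (possibly shared) dict

# A thread runs inside the same process: the mutation is visible here.
t = threading.Thread(target=bump)
t.start(); t.join()
assert data["value"] == 1

# A separate process gets its own copy of the address space, so the
# parent's dict is untouched (fork start method, Unix-only).
ctx = mp.get_context("fork")
p = ctx.Process(target=bump)
p.start(); p.join()
assert data["value"] == 1    # still 1: the child mutated its own copy
```

This is exactly why threads (OpenMP) are confined to one server's shared memory, while processes (MPI) need explicit message passing to cooperate.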
There are several well-tested and efficient implementations of MPI. On Kogence, we currently support the Open MPI (not to be confused with OpenMP), MPICH and Intel MPI (an MPICH derivative) libraries. All of these support various hardware network interfaces, either through the OpenFabrics Interfaces (OFI) or through natively implemented support, which provides OS-bypass Remote Direct Memory Access capabilities and avoids routing inter-process communication through the traditional TCP/IP stack, providing a significant performance boost. On Kogence, users do not have to worry about the details of the hardware and network infrastructure; we configure the appropriate transport layer for the best performance of their HPC applications.
The number of slots is one of the basic concepts in MPI. You can define any number of slots per host in a host file and then provide that host file to MPI using the -hostfile option. By defining the number of slots, you let MPI schedule that many processes on that host. If you do not define it, then by default MPI assumes the number of slots to be the same as the number of cores. By using the --use-hwthread-cpus switch, you can override this and tell MPI to assume the number of slots to be the same as the number of CPUs (i.e. the number of hardware threads). Defining more slots than the number of CPUs will cause a lot of context switching and may degrade the performance of your applications.
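The slot bookkeeping can be sketched as follows. The hostfile format mirrors Open MPI's "host slots=N" style, and the per-host core count here is an assumed placeholder rather than a probed hardware value:

```python
# Sketch of MPI-style slot counting: each host contributes "slots"
# schedulable processes; if a host line omits slots=, fall back to
# one slot per core (hypothetical fixed core count for illustration).
CORES_PER_HOST = 4   # assumed default when slots= is absent

def parse_hostfile(text):
    """Return {host: slots} from 'host slots=N'-style lines."""
    slots = {}
    for line in text.splitlines():
        line = line.split("#")[0].strip()   # drop comments and blanks
        if not line:
            continue
        parts = line.split()
        n = CORES_PER_HOST                  # default: slots = cores
        for tok in parts[1:]:
            if tok.startswith("slots="):
                n = int(tok.split("=")[1])
        slots[parts[0]] = n
    return slots

hosts = parse_hostfile("node1 slots=8\nnode2\n# spare\nnode3 slots=2\n")
assert hosts == {"node1": 8, "node2": 4, "node3": 2}
assert sum(hosts.values()) == 14            # total schedulable processes
```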
If you wrap your command with mpirun -np N, for example, then MPI will launch N copies of your command as N processes in a process group, assigning them to slots in a round-robin fashion. These processes are called rank 0, rank 1, and so on. They can share memory and communicate within the group. It is expected that the command you are launching is properly compiled and linked with MPI, so that as soon as these N copies launch, they can talk to each other and decide which tasks each will work on. If you wrap a non-MPI application command with mpirun -np N then each of the N copies will do exactly the same thing. If you schedule another mpirun-wrapped command then MPI will launch another set of processes in another process group. These new processes will also be called rank 0, rank 1, and so on within that new process group. Processes from process group 1 and processes from process group 2 can only communicate through generic inter-process communication, not through intra-group MPI communication, and they cannot share memory.
With the -np option you can specify either fewer processes than there are slots, or you can oversubscribe the slots. Oversubscribing the slots will cause lots of context switching and may degrade the performance of your application. One can prevent oversubscription by using the -nooversubscribe option. Oversubscription can also be prevented on a per-host basis by specifying max_slots=N in the host file (resource managers and job schedulers do not share this with MPI; it only works when you are explicitly providing a host file to MPI). There are alternative ways to specify the number of processes to launch. If you don't specify anything but provide a host file, then MPI launches as many processes on each host as there are slots. The number of processes launched can also be specified as a multiple of the number of nodes or sockets available, using the -npernode N and -npersocket N options respectively. The -npersocket option also turns on the -bind-to-socket option, which is discussed in a later section.
One can map the processes to specific objects in the cluster; this determines the initial placement of the processes. One can then further bind the processes to objects in the cluster so that when the OS reschedules after context switching, these processes remain bound to specific cluster objects.
One can use the option --map-by foo, where foo can be slot, hwthread, core, L1cache, L2cache, L3cache, socket, numa, board, node, sequential, distance, or ppr. For example, --map-by node will cause rank 0 to go to the first node, rank 1 to the next node, and so on until all nodes have one process, and then it will restart from the first node as needed. --map-by socket is the default.
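The round-robin placement that --map-by node performs can be modeled with a toy sketch (purely illustrative; the function name and data layout are invented for this example):

```python
# Toy model of "--map-by node" placement: ranks are dealt out
# round-robin across nodes, wrapping until every rank has a home.
def map_by_node(num_ranks, nodes):
    placement = {}
    for rank in range(num_ranks):
        placement[rank] = nodes[rank % len(nodes)]
    return placement

nodes = ["n0", "n1", "n2"]
placement = map_by_node(5, nodes)
# rank0->n0, rank1->n1, rank2->n2, then wrap: rank3->n0, rank4->n1
assert placement == {0: "n0", 1: "n1", 2: "n2", 3: "n0", 4: "n1"}
```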
Binding processes to specific CPUs is also possible. This tells the OS that a given process should always stick to a given CPU as the OS context-switches between multiple processes and threads. This can improve performance if the operating system is placing processes suboptimally. For example, when we launch fewer processes than the number of CPUs in a node, the OS might oversubscribe some multi-core sockets while leaving other sockets idle; this can lead processes to contend unnecessarily for common resources. Or, the OS might spread processes out too widely; this can be suboptimal if application performance is sensitive to inter-process communication costs. Binding can also keep the operating system from migrating processes excessively, regardless of how optimally those processes were placed to begin with.
One can use the option --bind-to foo, where foo can be slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, or none. By default, MPI uses --bind-to core when the number of processes is <= 2, --bind-to socket when the number of processes is > 2, and --bind-to none when nodes are being oversubscribed. If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or a specific number of processing elements per application process.
The processors to be used for binding can be identified in terms of topological groupings. For example, binding to l3cache will bind each process to all processors within the scope of a single L3 cache within its assigned location. Thus, if a process is assigned by the mapper to a certain socket, then a --bind-to l3cache directive will cause the process to be bound to the processors that share a single L3 cache within that socket. To help balance loads, the binding directive uses a round-robin method when binding to levels lower than the one used in the mapper. For example, consider the case where a job is mapped to the socket level and then bound to core. Each socket will have multiple cores, so if multiple processes are mapped to a given socket, the binding algorithm will assign each process on a socket to a unique core in a round-robin manner. Alternatively, processes mapped by l2cache and then bound to socket will simply be bound to all the processors in the socket where they are located.
What is OpenMP?
OpenMP is an application programming interface (API) that supports shared-memory multithreading programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. If an application/solver provides support for both MPI and OpenMP then one can use both MPI (multiprocessing) and OpenMP (multithreading) at the same time; this is sometimes known as hybrid parallelism.
What is Charm++?
Similar to MPI, Charm++ is a machine-independent parallel programming system. Programs written using this system will run unchanged on MIMD computing systems with or without shared memory. It provides high-level mechanisms and strategies to facilitate the task of developing even highly complex parallel applications. Charm++ can use MPI as the transport layer (aka Charm over MPI), or you can run MPI programs using Charm++'s Adaptive MPI (AMPI) library (aka MPI over Charm). Charm++ can also use the TCP/IP stack or other OS-bypass network protocols for its messaging needs. The communication protocols and infrastructures supported by Charm++ are UDP, MPI, OFI, UCX, Infiniband, uGNI, and PAMI. Charm++ programs can run without changing the source on all these platforms.
MPI is a much more widely adopted standard, and as of today only a few HPC applications, such as NAMD, use Charm++. Charm++ does offer some benefits over MPI, such as relative ease of programming for developers, fault tolerance, and a separate machine layer, so that the application developer does not need to distinguish whether the user executes the application using multi-threading parallelism on shared-memory SMP architectures or multi-processing parallelism on distributed-memory clusters.
The Charm++ software stack has three components: Charm++, Converse and a machine layer. In general, the machine layer handles the exchange of messages among nodes and interacts with the next layer in the stack, Converse. Converse is responsible for the scheduling of tasks (including user code) and is used by Charm++ to execute the user application. Charm++ is the top-most level, in which the applications are written.
Building Charm++ and Charm++ Applications
Charm++ can be built to suit different OS, CPU and networking architectures (i.e. the different networking protocols detailed above). SMP capabilities can be requested as an option during the build. SMP capabilities allow hybrid parallelism (multi-threading + multi-processing). "Multicore" builds do not carry any transport layer and therefore can only be used on single SMP machines. The benefit of a multicore build over an SMP build (this comparison is of course only applicable when the user wishes to run applications on a single machine) is that in SMP mode, Charm++ spawns one communication thread per process, so there is one less thread per process available for application parallelism. Multicore builds can use all available CPUs for worker threads. Also, multicore builds only support multi-threading. Similarly, any SMP-capable version of Charm++ can be built with an integrated OpenMP runtime. Charm++ creates its own thread pool, and if you use the compiler-provided OpenMP library while compiling your own application, such as NAMD, then you can end up oversubscribing. It is also better to have a single runtime control all threads/processes (unlike MPI+OpenMP). Charm++ built with the integrated OpenMP runtime can control the OpenMP threads in the application, and you don't need to provide compile-time OpenMP options while compiling your own applications. To use this library in your applications, you have to add -module OmpCharm to the compile flags to link this library instead of the compiler-provided one. Without -module OmpCharm, your application will use the compiler-provided OpenMP library, which runs on its own separate runtime. Also, you don't need to add the usual OpenMP flags for the gcc and icc compilers respectively; these flags are already included in the predefined compile options when you build SMP-capable Charm++ with the OMP option.
In terms of physical resources, Charm++ assumes the parallel machine consists of one or more physical nodes, where a physical node is the largest unit over which cache-coherent shared memory is feasible (and therefore, the maximal set of cores over which a single process can run). Each physical node may include one or more processor chips (sockets), with shared or private caches between them. Each chip may contain multiple cores, and each core may support multiple hardware threads (SMT/HTT, for example).
A chare is a Charm++ object, such as an array, that you want to distribute and do some computation on.
Charm++ recognizes two logical entities:
- PE (processing element) = can be a thread (SMP mode) or a process (non-SMP mode).
- Logical node = one OS process.
SMP Mode: is like MPI + OpenMP. You can have multiple processes on each physical node and each process can have multiple threads. In Charm++ terminology, you would say the physical node is partitioned into logical nodes (one process = one logical node) and each logical node has multiple PEs (in SMP mode, PE = thread).
Non-SMP Mode: is like MPI alone. Each PE is a process. The physical node is partitioned into many logical nodes, and each logical node runs one PE (that is, one process).
Just like in the MPI case, depending on the runtime command-line parameters, a PE (i.e. a thread or a process) may be associated with a "core" or with a "hardware thread" (e.g. an Intel Hyper-Threading thread).
In a Charm++ program, a PE is a unit of mapping and scheduling: each PE has a scheduler with an associated pool of messages. Each chare is assumed to reside on one PE at a time. Physical nodes may be partitioned into one or more logical nodes. Since PEs within a logical node share the same memory address space, the Charm++ runtime system optimizes communication between them by using shared memory.
A Charm++ program can be launched with one or more (logical) nodes per physical node. For example, on a machine with a four-core processor, where each core has two hardware threads, common configurations in non-SMP mode would be one node per core (four nodes/PEs total) or one node per hardware thread (eight nodes/PEs total). In SMP mode, the most common choice to fully subscribe the physical node would be one logical node containing seven PEs; one OS thread is set aside per process for network communication. When built in the "multicore" mode that lacks network support, a communication thread is unnecessary, and eight PEs can be used in this case. A communication thread is also omitted when using some high-performance network layers such as PAMI.
Alternatively, one can choose to partition the physical node into multiple logical nodes, each containing multiple PEs. One example would be three PEs per logical node and two logical nodes per physical node, again reserving a communication thread per logical node.
It is not general practice in Charm++ to oversubscribe the underlying physical cores or hardware threads on each node. In other words, a Charm++ program is usually not launched with more PEs than there are physical cores or hardware threads allocated to it. To run applications in SMP mode, we generally recommend using one logical node (i.e. one OS process) per socket or NUMA domain.
Non-SMP Mode: PE = Logical Node (= OS Process)
Example: 1 physical node, 1 socket, 4 cores, 8 hardware threads:
- 8 logical nodes = 8 OS processes, each running on a hardware thread.
SMP Mode: PE = OS Thread (specifically a worker OS thread, not the communication OS thread)
Logical Node = one OS Process, containing more than one PE (i.e. more than one OS thread)
Example: 1 physical node, 1 socket, 4 cores, 8 hardware threads:
- 7 PEs (each running as an OS thread).
- There will be only one process.
- 1 hardware thread is reserved for network communication.
If you had multiple physical nodes, there would be one process per physical node with 7 PEs each, and one communication thread per process.
There are various benefits associated with SMP mode. For instance, when using SMP mode there is no waiting to receive messages due to long-running entry methods. There is also no time spent in sending messages by the worker threads, and memory is limited by the node instead of per core. In SMP mode, intra-node messages use simple pointer passing, which bypasses the overhead associated with the network and extraneous copies. Another benefit is that the runtime will not pollute the caches of worker threads with communication-related data in SMP mode.
However, there are also some drawbacks associated with using SMP mode. First and foremost, you sacrifice one core to the communication thread. This is not ideal for compute-bound applications. Additionally, this communication thread may become a serialization bottleneck in applications with large amounts of communication. Finally, any library code the application may call needs to be thread-safe.
On Kogence, your workflow launch already takes these tradeoffs into account when evaluating whether to use SMP mode for your application and when deciding how many processes to launch per physical node in SMP mode.
Charm++ Program Options
charmrun is the standard launcher for Charm++ applications. When compiling Charm++ programs, the charmc linker produces both an executable file and a utility called charmrun, which is used to load the executable onto the parallel machine.
Unlike mpirun, you can pass options either to charmrun or to the program itself. For example, charmrun ++ppn N program program-arguments or charmrun program ++ppn N program-arguments.
Execution on platforms which use platform-specific launchers (i.e. mpirun, aprun, ibrun) can proceed without charmrun; charmrun can also be used in coordination with those launchers via the ++mpiexec option (see below). Programs built using the network version (i.e. non-multicore) of Charm++ can be run alone, without charmrun. This restricts you to using the processors on the local machine. If the program needs some environment variables to be set for its execution on compute nodes (such as library paths), they can be set in .charmrunrc under the home directory; charmrun will run that shell script before running the executable. Alternatively, you can use the ++runscript option (see below).
In all of the below:
- "p" stands for PE (can be a thread or a process, depending on SMP or non-SMP mode)
- "n" stands for "node" (i.e. logical node = process)
- "ppn" stands for PEs per logical node (i.e. PEs per process)
- PU = Processing Unit = a hardware thread (e.g. an Intel Hyper-Threading thread)
Single Machine Options to Charm++ Program
+autoProvision: Automatically determine the number of worker threads to launch in order to fully subscribe the machine running the program.
+oneWthPerHost: Launch one worker thread per compute host. By the definition of standalone mode, this always results in exactly one worker thread.
+oneWthPerSocket: Launch one worker thread per CPU socket.
+oneWthPerCore: Launch one worker thread per CPU core.
+oneWthPerPU: Launch one worker thread per CPU processing unit, i.e. hardware thread.
+pN: Explicitly request exactly N worker threads. The default is 1.
Cluster Options to Charm++ Program
Note: Notice the two plus signs (++) below.
++autoProvision: Automatically determine the number of processes and threads to launch in order to fully subscribe the available resources.
++processPerHost N: Launch N processes per compute host.
++processPerSocket N : Launch N processes per CPU socket.
++processPerCore N : Launch N processes per CPU core.
++processPerPU N : Launch N processes per CPU processing unit, i.e. hardware thread.
++nN: Run the program with N processes. Functionally identical to ++pN in non-SMP mode (see below). The default is 1.
++pN: Total number of PEs (threads or processes) to create. In SMP mode, this refers to worker threads (where n*ppn = p); otherwise, in non-SMP mode, it refers to processes (n = p). The default is 1. Use of ++p is discouraged in favor of ++processPer* (and ++oneWthPer* in SMP mode) where desirable, or ++n (and ++ppn) otherwise.
++nodelist: File containing the list of nodes. The file is of the format:
host <hostname> <qualifiers>
host node512.kogence.com ++cpus 2 ++shell ssh
++runscript: Script to run the node-program with. For example: $ ./charmrun +p4 ./pgm 100 2 3 ++runscript ./set_env_script. In this case, set_env_script is invoked on each node before launching the program.
++local: Run the Charm++ program only on the local machine.
++mpiexec: Use the cluster's mpiexec job launcher instead of the built-in ssh method. This will pass -n $P to indicate how many processes to launch. If -n $P is not required because the number of processes to launch is determined via queueing-system environment variables, then use ++mpiexec-no-n rather than ++mpiexec. An executable named something other than mpiexec can be used with the additional argument ++remote-shell runmpi, with 'runmpi' replaced by the necessary name. To pass additional arguments to the launcher, list them as part of the value after the executable name, as follows: $ ./charmrun ++mpiexec ++remote-shell "mpiexec --YourArgumentsHere" ./pgm. Use of this option can potentially provide a few benefits: faster startup compared to the SSH approach charmrun would otherwise use; no need to generate a nodelist file; and multi-node job startup on clusters that do not allow connections from the head/login nodes to the compute nodes. At present, this option depends on the environment variables of some common MPI implementations. It supports Open MPI (OMPI_COMM_WORLD_SIZE), M(VA)PICH (PMI_SIZE), and IBM POE.
CPU Affinity Options
+setcpuaffinity: Set CPU affinity automatically for processes (in non-SMP builds of Charm++) or threads (in SMP builds). This option is recommended, as it prevents the OS from unnecessarily moving processes/threads around the processors of a physical node.
+excludecore<core#>: Do not set cpu affinity for the given core number. One can use this option multiple times to provide a list of core numbers to avoid.
Additional options in SMP Mode
++oneWthPerHost: Launch one worker thread per compute host.
++oneWthPerSocket: Launch one worker thread per CPU socket.
++oneWthPerCore: Launch one worker thread per CPU core.
++oneWthPerPU: Launch one worker thread per CPU processing unit, i.e. hardware thread.
++ppn N: Number of PEs (or worker threads) per logical node (OS process).
+pemap L[-U[:S[.R]+O]][,...]: Bind the execution threads to the sequence of cores described by the arguments using the operating system's CPU affinity functions. Can be used outside SMP mode. PU = hardware thread. By default, this option accepts PU indices assigned by the OS. The user might instead want to provide the logical PU indices used by hwloc (see the hwloc documentation for details). To do this, prepend the sequence with the letter L (case-insensitive). For instance, +pemap L0-3 will instruct the runtime to bind threads to the PUs with logical indices 0-3.
+commap p[,q,...] : Bind communication threads to the listed cores, one per process.
To run applications in SMP mode, we generally recommend using one logical node (i.e. one OS process) per socket or NUMA domain.
++ppn N will spawn N threads in addition to 1 thread spawned by the runtime for the communication thread, so the total number of threads will be N+1 per logical node. Consequently, you should map the worker and communication threads to separate cores. Depending on your system and application, it may be necessary to spawn one thread less than the number of cores in order to leave one free for the OS to run on. An example run command might look like:
./charmrun ++ppn 3 +p6 +pemap 1-3,5-7 +commap 0,4 ./app <args>. This will create two logical nodes/OS processes (2 = 6 PEs / 3 PEs per node), each with three worker threads/PEs (++ppn 3). The worker threads/PEs will be mapped as follows: PE 0 to core 1, PE 1 to core 2, PE 2 to core 3, then PE 3 to core 5, PE 4 to core 6, and PE 5 to core 7. PEs/worker threads 0-2 comprise the first logical node and 3-5 the second logical node. Additionally, the communication threads will be mapped to core 0 for the first logical node and to core 4 for the second logical node. Please keep in mind that +p always specifies the total number of PEs created by Charm++, regardless of mode. The +p option does not include the communication threads; there will always be exactly one of those per logical node. We recommend not oversubscribing: don't run more threads/processes than the number of hardware threads available.
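The arithmetic behind such a launch line can be sketched as a small helper (illustrative only, not part of Charm++):

```python
# +p gives total worker PEs, ++ppn gives PEs per logical node, and in
# SMP mode each logical node adds one communication thread.
def smp_layout(total_pes, ppn):
    assert total_pes % ppn == 0, "+p must be a multiple of ++ppn"
    logical_nodes = total_pes // ppn
    threads_per_node = ppn + 1          # workers + 1 comm thread
    return logical_nodes, threads_per_node

# ./charmrun ++ppn 3 +p6 ... -> 2 logical nodes, 4 OS threads each
nodes, threads = smp_layout(total_pes=6, ppn=3)
assert (nodes, threads) == (2, 4)
```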
What are Resource Managers and Job Schedulers?
The phrases Resource Manager and Job Scheduler are used interchangeably. Kogence autoscaling cloud HPC clusters are configured with the Grid Engine resource manager. Any command that you invoke on the shell terminal is executed on the master node. If, on the other hand, you want to send your jobs to the compute nodes of your cluster from the shell terminal, then please use the qsub command of the Grid Engine resource manager, like below:
qsub -b y -pe mpi num_of_CPU -cwd YourCommand
Be careful with num_of_CPU: it has to be the same as or less than the number of CPUs in the compute node type that you selected in the Cluster tab of your model. Grid Engine offers a lot of flexibility with many command-line switches. Please check the qsub man page. Specifically, you might find the following switches useful:
-pe mpi: Name of the "parallel environment" on Kogence clusters.
-b y: Command you are invoking is treated as a binary and not as a job submission script.
-cwd: Makes the current folder as the working directory. Output and error files would be generated in this folder.
-wd working_dir_path: Makes working_dir_path the working directory. Output and error files would be generated in this folder.
-o stdout_file_name: Job output would go in this file.
-e stderr_file_name: Job error would go in this file.
-j y: Sends both the output and error to the output file.
-N job_name: Gives the job a name. Output and error files would be generated with this name if not explicitly specified using the -o and -e switches. You can also monitor and manage your job using the job name.
-sync y: By default, the qsub command returns control to the user immediately after submitting the job to the cluster, so you can continue to do other things on the CloudShell terminal, including submitting more jobs to the scheduler. This option tells qsub not to return control until the job is complete.
-V: Exports the current environment variables (such as $PATH) to the compute nodes.
You can use qready to check if the cluster is ready before submitting jobs to it. You will submit simulations using the qsub command. You can use qwait job_name to wait for a job to finish before doing some post-processing or submitting a dependent job. We recommend adding a CloudShell to your software stack in the Stack tab GUI. This CloudShell terminal can be used to monitor and manage cluster jobs. You can monitor cluster jobs by typing qstat on the terminal. You can delete jobs using the qdel jobID command. You can monitor the compute nodes in the cluster by typing qhost on the terminal; qhost lets you monitor the performance and utilization of the compute nodes in your cluster.