Help:Machines

Kogence Cloud HPC Hardware Architecture

Microarchitecture details can be found on the Intel Skylake and Intel Cascade Lake pages. On modern Intel microprocessors, the CPU clock frequency scales with the workload. Please see the CPU Clock Frequencies and Max CPU FLOPS pages for more details. You can also find more information on the FAQ and User Manual pages.

Kogence Cloud HPC Servers for CPU Limited Workloads (CLWL nodes)

For CPU limited workloads (CLWL) we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Skylake (SKU 8124M) and Intel Cascade Lake (SKU 8275CL) microarchitectures that are optimized for best compute performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. These nodes are most appropriate for workloads that are limited by the CPU and do not perform excessive hard-disk read/write operations, do not generate excessive internode network communication and do not use an excessive amount of RAM. These nodes can be clocked at up to 3.6 GHz (non-AVX instructions, with all cores active) and can provide more than 4 teraflops of performance per node (double precision, FP64).
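As a rough sanity check on such peak numbers (a sketch, not an official specification: the core count and AVX-512 all-core clock below are illustrative assumptions), peak double precision throughput of a node follows from the usual peak FLOPS formula:

```latex
% Skylake/Cascade Lake cores with AVX-512 can retire 2 FMA instructions per
% cycle, each on 8 FP64 lanes: 2 x 8 x 2 = 32 FP64 FLOPs per core per cycle.
\[
  \text{peak FP64 FLOPS} \;=\; N_{\text{cores}} \times f_{\text{AVX-512}} \times 32
\]
% Illustrative example: a hypothetical dual socket node with 36 cores
% running AVX-512 code at an assumed 3.5 GHz all-core clock:
\[
  36 \times 3.5\times 10^{9} \times 32 \;\approx\; 4\ \text{TFLOPS (FP64)}
\]
```

Note that the AVX-512 all-core clock is lower than the non-AVX clocks quoted on this page, which is why the per-node teraflops figures are quoted separately.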

[Image: KogenceCPUBoundWorkloadNodesV6.png]

Kogence Cloud HPC Servers for Memory Limited Workloads (MLWL nodes)

For memory limited workloads (MLWL) we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Skylake microarchitecture (SKU 8175M) that are optimized for best memory performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. Compared to the CPU limited workload (CLWL) nodes, these MLWL nodes provide more RAM per CPU and a larger L3 cache per socket but, for non-AVX instructions, are clocked at lower speeds than the CLWL nodes. These nodes are most appropriate for workloads that are limited by RAM, are moderately CPU intensive, do not perform excessive hard-disk read/write operations and do not generate excessive internode network communication. These nodes can be clocked at up to 3.1 GHz (non-AVX instructions, with all cores active) and can provide more than 4 teraflops of performance per node (double precision, FP64).

[Image: KogenceMemoryBoundWorkloadNodesV2.png]

Kogence Cloud HPC Servers for Network Limited Workloads (NLWL nodes)

For network limited workloads (NLWL) we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Skylake microarchitecture (SKU 8124M) that are optimized for both the best network performance and the best CPU performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. Compared to the CPU limited workload (CLWL) nodes, these NLWL nodes provide up to 100 Gbps of network bandwidth. These nodes are most appropriate for heavily communicating jobs, such as distributed memory MPI jobs, that generate large internode network traffic but do not perform excessive hard-disk read/write operations and do not need an excessive amount of memory. These nodes can be clocked at up to 3.4 GHz (non-AVX instructions, with all cores active) and can provide more than 3 teraflops of performance per node (double precision, FP64).

[Image: KogenceNetworkBoundWorkloadNodes.png]

Kogence Cloud HPC Servers for Storage Limited Workloads (SLWL nodes)

For storage limited workloads (SLWL) we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Skylake (SKU 8124M) and Intel Cascade Lake (SKU 8275CL) microarchitectures that are optimized for both the best compute and the best storage performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. These nodes are most appropriate for workloads that are limited by CPU performance and perform a lot of hard-disk read/write operations but otherwise do not generate excessive internode network communication and do not use an excessive amount of RAM. Compared to the CPU limited workload (CLWL) nodes, these SLWL nodes come with NVMe solid state storage and can provide up to 1400K IOPS and up to 5500 MiB/s of storage throughput. These nodes can be clocked at up to 3.6 GHz (non-AVX instructions, with all cores active) and can provide more than 4 teraflops of performance per node (double precision, FP64).

[Image: KogenceStorageBoundWorkloadNodes.png]

Kogence Cloud HPC Servers for Hybrid Workloads (HWL nodes)

For hybrid workloads (HWL) that are constrained by multiple performance bottlenecks we offer enterprise HPC class Intel Xeon Scalable Platinum compute nodes based on the Intel Cascade Lake microarchitecture (SKU 8259CL) that are optimized for best overall performance. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. These nodes are most appropriate for hybrid workloads that are limited by the CPU, require large memory, perform a large amount of hard-disk read/write operations and generate large internode network communication, such as large distributed memory MPI jobs. These nodes can be clocked at up to 3.1 GHz (non-AVX instructions, with all cores active) and can provide more than 4 teraflops of performance per node (double precision, FP64).

[Image: KogenceHybridWorkloadNodes.png]

Kogence Cloud HPC Servers for GPU Workloads (GWL nodes)

For GPU workloads (GWL) we offer NVIDIA Volta V100 Tensor Core based enterprise HPC class compute nodes. These nodes come with either Intel Xeon Broadwell or Intel Xeon Scalable (Skylake) CPUs. You can cluster as many of these nodes as you like to create a personal autoscaling HPC cluster. While the Broadwell based nodes are designed primarily for GPU workloads, the Skylake based nodes are designed for more demanding hybrid workloads that need good storage IOPS performance, good network performance and good CPU performance in addition to the Volta GPUs.

[Image: KogenceGPUWorkloadNodesV3.png]

What is 1 CPU On Kogence?

1 CPU on Kogence is the same as what is called 1 CPU in the Resource Monitor of a Microsoft Windows 10 PC, for example. Similarly, the number of CPUs shown on Kogence is the same as what is shown as "CPU(s)" by the lscpu utility on Linux platforms. Different platforms and utilities may call this same logical computing unit by different names. On the same Microsoft Windows 10 PC, for example, the System Information and Task Manager utilities call it the "number of logical processors", while Amazon AWS and Microsoft Azure call it the "number of vCPUs". In general,

number of CPUs = number of sockets × number of cores per socket × number of hardware threads per core.

This is the most relevant logical unit of computing power in the context of cloud High Performance Computing (HPC). Each logical processor in fact has an independent architectural state (i.e., instruction pointer, instruction register sets, etc.) and each can independently and simultaneously execute instructions from one worker thread or worker process without needing to do context switching. Each of these register sets is clocked at the full CPU clock speed.
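As a quick illustration (a minimal sketch, not Kogence-specific), a program can query the same logical processor count that lscpu and the cloud providers report:

```c
#include <stdio.h>
#include <unistd.h>   /* sysconf */

int main(void) {
    /* Number of logical processors (hardware threads) currently online.
       This matches the "CPU(s)" count reported by lscpu and the vCPU
       count reported by cloud providers. */
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Logical CPUs online: %ld\n", ncpus);
    return 0;
}
```

On a 1 socket, 2 core, 2 threads-per-core machine this prints 4, in agreement with the formula above.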

In the context of MPI (Message Passing Interface, a multi-processing framework) and OpenMP (a multi-threading API), for example, each of these CPUs can run an individual MPI process or an individual OpenMP thread without requiring frequent context switches. So if you are scheduling an MPI job on a 4 CPU machine on Kogence then you can run mpirun -np 4 and there will be no clock cycle sharing or forced context switching among the processes.
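For example, a minimal MPI program (a sketch; the file name and build command are illustrative) that launches one rank per CPU:

```c
/* hello_mpi.c -- build: mpicc hello_mpi.c -o hello_mpi
   run on a 4 CPU node: mpirun -np 4 ./hello_mpi */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */

    /* With -np equal to the CPU count, each rank gets its own logical
       processor and no context switching is forced. */
    printf("Rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```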

The terminology around CPUs, cores, sockets, processors, vCPUs, logical processors, hardware threads, etc. can get confusing if you are not a computer scientist! Let's take a look at a common Microsoft Windows 10 PC that we are all so familiar with; other computer systems show similar information. First look at the Task Manager. This machine has an Intel Core i5-7200 CPU, but don't jump to call it a 1 CPU machine; as you will see below, that is wrong. Here we also notice that this machine has 1 socket, 2 cores and 4 logical processors. If you open System Information, you will see something similar. If you open the Resource Monitor, you will see that this is a 4 CPU machine, with processors labeled CPU 0 through CPU 3. If you run the lscpu command on a Linux machine, you will see 4 CPU(s) as well.

Hardware threads within a core do share one common floating point processor and cache memory, though. Please check the FAQ section to learn more about CPUs and hardware threads. In FLOPS heavy HPC applications it is the utilization of the floating point processor that matters more than the utilization of the CPU. If a worker thread or worker process running on one of the hardware threads can keep the floating point processor of that core fully occupied, then even though the system/OS resource management and monitoring utilities (such as top or the GNOME System Monitor on Linux platforms) may show overall CPU utilization capped at 50%, you are already getting the most out of your HPC server. 50% CPU utilization may therefore be misleading in the context of HPC. Remember that the CPU utilization percentage is averaged across the total number of CPUs (i.e., the total number of hardware threads) in your HPC server, so if you are using only one hardware thread per core then the maximum possible CPU utilization is 50%. This just means that the CPU could in theory execute twice as many instructions per unit time provided they are not all asking for access to the floating point processor. If they are, then 50% utilization is the best you can get.
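You can see this effect yourself with a minimal OpenMP sketch (illustrative, not Kogence-specific; it assumes 2 hardware threads per core): run one thread per physical core and watch top report roughly 50% overall CPU utilization even though each core's floating point processor is saturated.

```c
/* busy_flops.c -- build: gcc -O2 -fopenmp busy_flops.c -o busy_flops
   Run one thread per physical core:
     OMP_NUM_THREADS=<physical cores> OMP_PLACES=cores OMP_PROC_BIND=close ./busy_flops
   While it runs, top shows ~50% overall CPU utilization on a 2-way SMT
   machine, yet the floating point processors are fully busy. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        double x = 1.0;
        /* A long dependent chain of floating point operations that keeps
           the core's floating point processor occupied. */
        for (long i = 0; i < 2000000000L; i++)
            x = x * 1.0000001 + 1e-9;
        total += x;
    }
    printf("checksum: %f\n", total);  /* prevents dead-code elimination */
    return 0;
}
```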

What happens if you run two FLOPS heavy worker threads or worker processes, one on each hardware thread? Since hardware threads within a core share one common floating point processor, for FLOPS heavy HPC applications the worker threads or worker processes running simultaneously on these hardware threads may need to slow down while they wait for the floating point processor to become available. Because switching access to the floating point processor from one hardware thread to another is extremely efficient in modern servers, you should still see an improvement in net throughput in almost all use cases when running two simultaneous hardware threads per core versus only one hardware thread per core. The situation gets more complex when one worker process or worker thread spawns many more worker threads (for example, it may be using a parallelized version of a BLAS library). Those worker thread context switches will now start to add overhead. Unfortunately, there is no common OS or system utility that can easily display the FLOPS performance of your application versus the maximum FLOPS capability of your HPC server. You would need to run your own benchmarks to estimate this, as in the sketch below.
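One simple way to run such a benchmark (a sketch under stated assumptions: a fixed total amount of floating point work split evenly across threads, timed with omp_get_wtime) is to time the same kernel with one thread per core and with two, and compare throughput:

```c
/* smt_bench.c -- build: gcc -O2 -fopenmp smt_bench.c -o smt_bench
   Compare, e.g. on a 2-core/4-thread machine:
     OMP_NUM_THREADS=2 OMP_PLACES=cores ./smt_bench   (one thread per core)
     OMP_NUM_THREADS=4 ./smt_bench                    (two threads per core) */
#include <omp.h>
#include <stdio.h>

#define TOTAL_OPS 4000000000L  /* fixed total work, split across threads */

int main(void) {
    int nthreads = omp_get_max_threads();
    long per_thread = TOTAL_OPS / nthreads;
    double checksum = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel reduction(+:checksum)
    {
        double x = 1.0;
        for (long i = 0; i < per_thread; i++)
            x = x * 1.0000001 + 1e-9;   /* 2 FLOPs per iteration */
        checksum += x;
    }
    double t1 = omp_get_wtime();

    double gflops = 2.0 * TOTAL_OPS / (t1 - t0) / 1e9;
    printf("threads=%d  time=%.2fs  ~%.2f GFLOPS  (checksum %f)\n",
           nthreads, t1 - t0, gflops, checksum);
    return 0;
}
```

If the two-threads-per-core run finishes the same total work faster, SMT is helping your kernel; if not, one worker per core is the better configuration for it.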

For these reasons, some FLOPS heavy HPC applications available on Kogence (e.g., Matlab) will not run one independent worker process or worker thread on each hardware thread; instead they run one worker process or worker thread per core. Kogence, however, allows you to start multiple instances of these applications. You can experiment with a single instance and with multiple instances and see which gives better net throughput.