Help:Intel Cascade Lake

For CPU-bound workloads, Kogence offers 2nd Generation Intel Xeon Scalable (formerly known as Cascade Lake Platinum Scalable) single nodes. The Cascade Lake microarchitecture, launched in April 2019, is the successor to the Skylake architecture and is based on an enhanced 14nm process technology.

Intel brand names can get very confusing. See Intel Xeon Scalable for more details.

On Kogence, each Cascade Lake node is available in 48-CPU, 96-CPU, or 96-CPU bare-metal configurations. Users can create autoscaling clusters of as many of these nodes as they like.

Compared to Skylake nodes, Cascade Lake nodes have a higher turbo boost frequency, support Turbo Boost Max Technology 3.0 and Intel Speed Select Technologies (see Intel CPU Clock Frequencies for more details), support AVX-512 VNNI instructions, which significantly accelerate inference for Artificial Intelligence workloads, and have higher communication bandwidth between sockets and higher memory bandwidth. Cascade Lake also introduces in-hardware mitigations for the Spectre and Meltdown security flaws.

Individual Core Clock Frequencies

The enhanced 14nm process allows Intel to extract additional power efficiency, which in turn allows these processors to be clocked higher. For the Cascade Lake models offered at Kogence:

Non-AVX Instructions: Guaranteed base frequency is 3.0GHz. Turbo boost frequency when all cores are active is 3.6GHz. Turbo boost frequency when only one core is active is 3.9GHz.

AVX-2.0 Instructions: Guaranteed base frequency is 2.4GHz. Turbo boost frequency when all cores are active is 3.3GHz. Turbo boost frequency when only one core is active is 3.6GHz.

AVX-512 Instructions: Guaranteed base frequency is 2.1GHz. Turbo boost frequency when all cores are active is 2.7GHz. Turbo boost frequency when only one core is active is 3.5GHz.

See Intel CPU Clock Frequencies for more details.

Individual Core FLOPs Performance

Non-AVX Instructions: Kogence Cascade Lake nodes can provide 12 to 15.6 DP GFLOPs per second per core for non-AVX instructions. One can do 4 DP FLOPs per clock cycle. The guaranteed minimum base frequency for non-AVX instructions is 3.0GHz. This means we can get a minimum of 4 x 3.0 = 12 DP GFLOPs per second per core for non-AVX instructions. With turbo boost, one can get between 14.4 and 15.6 DP GFLOPs per second per core, depending upon how many cores are active.

AVX-2.0 Instructions: Kogence Cascade Lake nodes can provide 38.4 to 57.6 DP GFLOPs per second per core for AVX-2.0 instructions. AVX-2.0 units perform 256-bit arithmetic. A double precision (DP) value is 64 bits wide, so each unit operates on 256/64 = 4 DP values at a time. A fused multiply-add (FMA) counts as two floating point operations (FLOPs) per value, so each unit can do 4 x 2 = 8 DP FLOPs in one clock cycle. On Kogence Cascade Lake nodes, each core has two such units (see below for details), so a core is capable of 16 DP FLOPs per clock cycle. As mentioned above, the minimum guaranteed base frequency for AVX-2.0 instructions is 2.4GHz. This means a minimum of 16 DP FLOPs can be performed in about 0.417ns (i.e. 1/2.4GHz). That means Kogence Cascade Lake nodes can do 38.4 DP GFLOPs per second per core at the minimum guaranteed clock frequency for AVX-2.0 instructions. With turbo boost frequencies, Kogence Cascade Lake nodes can do 52.8 DP GFLOPs per second per core when all cores are active, and 57.6 DP GFLOPs per second per core when only one core is active.

AVX-512 Instructions: Kogence Cascade Lake nodes can provide 67 to 112 DP GFLOPs per second per core for AVX-512 instructions. On Kogence Cascade Lake nodes, each core has 2 AVX-512 fused multiply-add (FMA) units (see below for details). These FMA units have 512-bit registers. A double precision (DP) value is 64 bits wide, so each unit operates on 512/64 = 8 DP values at a time, and each FMA counts as two floating point operations (FLOPs) per value. A core is therefore capable of 32 DP FLOPs per clock cycle for AVX-512 instructions (512/64 x 2 x 2 = 32). As mentioned above, the minimum guaranteed base frequency for AVX-512 instructions is 2.1GHz. This means a minimum of 32 DP FLOPs can be performed in 0.476ns (i.e. 1/2.1GHz) per core. That means Kogence Cascade Lake nodes can do 67.2 DP GFLOPs per second per core at the minimum guaranteed clock frequency. With turbo boost frequencies, Kogence Cascade Lake nodes can do 86.4 DP GFLOPs per second per core when all cores are active, and 112 DP GFLOPs per second per core when only one core is active.
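The per-core figures in this section all follow from one product: FLOPs per clock cycle times clock frequency. The sketch below reproduces them, assuming a double precision value is 64 bits wide and that each fused multiply-add counts as two operations. The table and function names are illustrative only and are not part of any Kogence or Intel tool.

```python
# Clock frequencies (GHz) quoted in "Individual Core Clock Frequencies".
FREQ_GHZ = {
    "non-AVX": {"base": 3.0, "all_core_turbo": 3.6, "single_core_turbo": 3.9},
    "AVX2":    {"base": 2.4, "all_core_turbo": 3.3, "single_core_turbo": 3.6},
    "AVX-512": {"base": 2.1, "all_core_turbo": 2.7, "single_core_turbo": 3.5},
}


def dp_flops_per_cycle(vector_bits, fma_units=2, dp_bits=64):
    """DP FLOPs per cycle for one core with `fma_units` vector FMA units."""
    lanes = vector_bits // dp_bits      # DP values held in one vector register
    return lanes * 2 * fma_units        # one FMA = 1 multiply + 1 add per lane


FLOPS_PER_CYCLE = {
    "non-AVX": 4,                       # figure quoted in the text above
    "AVX2": dp_flops_per_cycle(256),    # 4 lanes x 2 x 2 units = 16
    "AVX-512": dp_flops_per_cycle(512), # 8 lanes x 2 x 2 units = 32
}


def peak_gflops(isa, clock):
    """Theoretical peak DP GFLOPs per second per core."""
    return FLOPS_PER_CYCLE[isa] * FREQ_GHZ[isa][clock]


if __name__ == "__main__":
    for isa in FREQ_GHZ:
        for clock in FREQ_GHZ[isa]:
            print(f"{isa:8s} {clock:18s} {peak_gflops(isa, clock):6.1f}")
```

These are theoretical peaks. Sustained throughput depends on how well a code keeps both vector units busy and on memory bandwidth.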

Individual Core Architecture

The architecture of each core of Cascade Lake is largely identical to that of Skylake. Cascade Lake adds support for the new AVX-512 Vector Neural Network Instructions (VNNI), which significantly accelerate inference for Artificial Intelligence workloads. To accommodate the new AVX512-VNNI instructions, new logic was added on Port 0 and Port 1. Where there were previously two FMA units for fused multiply-add floating-point operations, Cascade Lake adds VNNI logic to that block, which performs a similar operation on integer data types. Support for 8-bit and 16-bit integers was added. Because the dynamic range of such small integers is quite low, the accumulation is performed on a 32-bit integer destination.
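To make the integer multiply-accumulate concrete, here is a minimal pure-Python model of a single 32-bit lane of the 8-bit form. It assumes the non-saturating variant (the VPDPBUSD instruction), which multiplies four unsigned bytes by four signed bytes and adds the sum of the products to a signed 32-bit accumulator. The function names are hypothetical, and the hardware processes 16 such lanes per 512-bit register.

```python
def to_int32(x):
    """Wrap a Python int to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def dot4_u8s8_accumulate(acc, a_u8, b_s8):
    """Model one 32-bit lane: acc + sum of four (unsigned byte * signed byte)."""
    if len(a_u8) != 4 or len(b_s8) != 4:
        raise ValueError("each lane holds exactly four bytes per operand")
    if any(not 0 <= a <= 255 for a in a_u8):
        raise ValueError("a_u8 values must fit in an unsigned byte")
    if any(not -128 <= b <= 127 for b in b_s8):
        raise ValueError("b_s8 values must fit in a signed byte")
    total = acc + sum(a * b for a, b in zip(a_u8, b_s8))
    return to_int32(total)  # non-saturating: wraps on overflow


if __name__ == "__main__":
    print(dot4_u8s8_accumulate(10, [1, 2, 3, 4], [5, -6, 7, -8]))
    # Four maximal products already exceed the 16-bit range,
    # which is why the destination is 32 bits wide.
    print(dot4_u8s8_accumulate(0, [255] * 4, [127] * 4))
```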

Furthermore, there are 3 inter-socket UPI links compared to 2 in Skylake (see below), and the bandwidth of each memory channel within each socket is higher than in Skylake (see below). Intel Turbo Boost Max Technology 3.0 and Intel Speed Select Technologies have been introduced in Cascade Lake; see Intel CPU Clock Frequencies for more details. Cascade Lake also introduces in-hardware mitigations for the Spectre and Meltdown security flaws. See details below.

Cache Architecture of Each Core

  • L1i instruction cache: 32 KB, private to each core; 64 sets; 64 B/line; 8-way
  • L1d data cache: 32 KB, private to each core; 64 sets; 64 B/line; 8-way; fastest latency: 4 cycles; 128 B/cycle load bandwidth; 64 B/cycle store bandwidth; write-back policy
  • L2 cache: 1 MB, private to each core; 64 B/line; 16-way; fastest latency: 14 cycles; 64 B/cycle bandwidth to L1 cache; write-back policy
  • L3 cache: shared non-inclusive 1.375 MB/core; total of 27.5 MB, shared by 24 cores in each socket; 2,048 sets; 64 B/line; 11-way; fastest latency: 50 to 70 cycles; write-back policy
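The cache parameters in this list are tied together by size = sets x ways x line size. The sketch below uses that relation to cross-check the listed values, reading KB and MB as binary units (KiB and MiB), which is an assumption. The helper names are illustrative. With that reading, the L2 figures imply 1,024 sets, and the L3 figures per core (1.375 MB, 2,048 sets, 64-byte lines) imply 11 ways per set.

```python
KIB = 1024
MIB = 1024 * KIB


def cache_size_bytes(sets, ways, line_bytes):
    """Capacity of a set-associative cache."""
    return sets * ways * line_bytes


def implied_ways(size_bytes, sets, line_bytes):
    """Associativity implied by capacity, set count, and line size."""
    ways, remainder = divmod(size_bytes, sets * line_bytes)
    if remainder:
        raise ValueError("parameters are not mutually consistent")
    return ways


def implied_sets(size_bytes, ways, line_bytes):
    """Set count implied by capacity, associativity, and line size."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("parameters are not mutually consistent")
    return sets


if __name__ == "__main__":
    print(cache_size_bytes(64, 8, 64) // KIB, "KiB L1")         # L1i and L1d
    print(implied_sets(1 * MIB, 16, 64), "sets in L2")
    print(implied_ways(int(1.375 * MIB), 2048, 64), "ways in L3 slice")
```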

Socket Architecture

Interconnection Between Cores Within a Socket

Cascade Lake's basic socket architecture is largely identical to that of its predecessor, Skylake. Just like Skylake, Cascade Lake sockets have an integrated memory controller (IMC), meaning the memory controller is integrated into the same chip (i.e. the processor/socket). Just like Skylake, the Cascade Lake-based HPC servers make use of Intel's mesh interconnect architecture. The cores (including their private L1 and L2 caches) and the IMCs are organized as an array of tiles in rows and columns, with dedicated connections running through each row and column. This allows for the shortest path between any two tiles, reducing latency and improving bandwidth. Each row and each column of tiles is a half duplex ring interconnect. For the HPC servers that we offer on Kogence, each socket has 30 tiles in a 5x6 array. Two of these tiles are IMCs, 24 tiles are the 24 cores (i.e. 48 CPUs per socket), and the remaining 4 tiles are unoccupied.

Interconnection Between Sockets

On Kogence, we offer the Xeon Scalable class of processors. This Scalable class of processors includes Intel's Ultra Path Interconnect (UPI) links so that multiple sockets can be connected to each other (2-way on Kogence nodes). UPI is a high-efficiency coherent interconnect between sockets, allowing multiple sockets to share a single address space. Whereas Skylake (1st generation Xeon Scalable) has 2 full duplex UPI links, the Cascade Lake Scalable processors (2nd generation Xeon Scalable, formerly CSL-SP) have 3 full duplex UPI links. The UPI runs at a speed of 10.4 gigatransfers per second (GT/s). Each link contains separate lanes for the two directions. The total full-duplex bandwidth (3 links x 2 directions) is 62.4 GiB/s. On Kogence we currently offer 2-socket HPC servers (24 cores/socket, for a total of 48 cores and 96 CPUs). Intel does offer the ability to connect up to 8 of these sockets through an 8-way interconnect using 3 UPI links per socket.

Memory Architecture

Like Skylake, there are two IMCs per socket. These two IMCs form two sub-NUMA clusters within each Cascade Lake socket, creating two localization domains. There are three memory channels per sub-NUMA cluster (a total of 6 channels per socket, i.e. hexa-channel). Each channel can be connected to up to two memory DIMMs (a total of 12 DIMMs per socket). On Kogence compute-bound workload nodes, there is one 16-gibibyte (GiB) dual rank DDR4 DIMM with error correcting code (ECC) per channel. In total, the amount of memory is 48 GiB per sub-NUMA cluster, 96 GiB per socket, and 192 GiB per node. Compared to the Skylake nodes, the speed of each memory channel is increased from 2,666 MT/s (in Skylake) to 2,933 MT/s (in Cascade Lake). An 8-byte read or write can take place per transfer per channel. With a total of 6 memory channels, the total half-duplex memory bandwidth is approximately 2,933 x 8 x 6 = 140,784 MB/s, i.e. about 140.78 GB/s (about 131.1 GiB/s) per socket.
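The capacity and bandwidth figures above can be re-derived from the channel layout. The sketch below does so, treating the quoted 2,933 figure as millions of 8-byte transfers per second per channel. The constant and function names are illustrative only.

```python
CHANNELS_PER_SUBNUMA = 3
SUBNUMA_PER_SOCKET = 2
SOCKETS_PER_NODE = 2
DIMM_GIB = 16                 # one 16 GiB DIMM populated per channel
MEGATRANSFERS_PER_SEC = 2933  # per channel
BYTES_PER_TRANSFER = 8


def memory_capacity_gib():
    """Installed memory per sub-NUMA cluster, per socket, and per node."""
    per_subnuma = CHANNELS_PER_SUBNUMA * DIMM_GIB
    per_socket = per_subnuma * SUBNUMA_PER_SOCKET
    per_node = per_socket * SOCKETS_PER_NODE
    return {"sub_numa": per_subnuma, "socket": per_socket, "node": per_node}


def socket_bandwidth():
    """Theoretical peak memory bandwidth per socket as (GB/s, GiB/s)."""
    channels = CHANNELS_PER_SUBNUMA * SUBNUMA_PER_SOCKET
    bytes_per_sec = MEGATRANSFERS_PER_SEC * 1e6 * BYTES_PER_TRANSFER * channels
    return bytes_per_sec / 1e9, bytes_per_sec / 2**30


if __name__ == "__main__":
    print(memory_capacity_gib())
    gb, gib = socket_bandwidth()
    print(f"{gb:.2f} GB/s = {gib:.2f} GiB/s per socket")
```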

I/O Architecture

For I/O, all models incorporate 48 lanes (3 x16) of PCIe 3.0. An additional 4 lanes of PCIe 3.0 are reserved exclusively for the DMI link to the Lewisburg (LBG) chipset. A selected number of models, specifically those with the F suffix, have an on-package Omni-Path Host Fabric Interface (HFI) (see Integrated Omni-Path).

Kogence Setup

For CPU-bound workloads, Kogence offers Intel Cascade Lake single nodes in 48-CPU, 96-CPU, and 96-CPU bare-metal configurations. Users can create autoscaling clusters of as many of these nodes as they like. Hyperthreading is turned ON, meaning you can run 2 hardware threads per core. Turbo Boost is turned ON. The maximum turbo frequency is 3.9 GHz for non-AVX, 3.6 GHz for AVX2, and 3.5 GHz for AVX-512 instructions. Turbo Boost Max Technology 3.0 (TBMT 3.0) can provide an additional boost in the range of 100-200 MHz, which varies from core to core and node to node depending on semiconductor process variability.

x86 Instruction Set Extensions

Cascade Lake introduces support for the AVX-512 Vector Neural Network Instructions (VNNI) instruction set extension, which is designed to improve the performance of Artificial Intelligence workloads by improving the throughput of tight inner convolutional loops, and which provides significantly more efficient deep-learning inference acceleration. This is in addition to the instruction sets SSE, SSE2, SSE3, Supplemental SSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX-512, and AVX512[F,CD,BW,DQ,VL], which are also available in its Skylake predecessor.

With 512-bit floating-point vector registers and two floating-point functional units, each capable of Fused Multiply-Add (FMA), a Cascade Lake core can deliver 32 double-precision floating-point operations per cycle.

If you are deploying your own custom containers on Kogence, use the Intel compiler flag -xCORE-AVX512 for Skylake and Cascade Lake specific optimizations. The optimization flags -qopt-zmm-usage=high -xCORE-AVX512 may benefit floating-point heavy applications running on Skylake and Cascade Lake. If you want a single executable that will run on Skylake and Cascade Lake as well as on processors that do not support AVX-512 instruction sets, you can let the suitable optimization be determined at run time by compiling your application with the options -O3 -ipo -axCORE-AVX512,CORE-AVX2,AVX -xSSE4.2.
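When building a custom container, it can help to confirm which instruction set extensions a node actually reports. The sketch below parses the flags line of a Linux /proc/cpuinfo dump and reports the highest matching tier from the compiler options above. It is a diagnostic aid under stated assumptions, not a description of how the Intel compiler dispatches at run time. The mapping from compiler targets to Linux flag names and all function names are my own illustrative choices.

```python
# Compiler target -> Linux /proc/cpuinfo flags taken here as evidence of it.
# This mapping is an illustrative assumption, ordered from most to least capable.
ISA_TIERS = [
    ("CORE-AVX512", {"avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl"}),
    ("CORE-AVX2", {"avx2", "fma"}),
    ("AVX", {"avx"}),
    ("SSE4.2", {"sse4_2"}),
]


def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line, if any."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def best_tier(flags):
    """Highest tier whose required flags are all present, or None."""
    for name, required in ISA_TIERS:
        if required <= flags:
            return name
    return None


def has_vnni(flags):
    """True if the AVX-512 VNNI extension is reported."""
    return "avx512_vnni" in flags


# Usage on a Linux node:
#   with open("/proc/cpuinfo") as f:
#       flags = parse_flags(f.read())
#   print(best_tier(flags), has_vnni(flags))
```

On a Cascade Lake node one would expect the CORE-AVX512 tier with VNNI reported, and on a Skylake node the same tier without VNNI.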