Help:Intel Skylake

Kogence offers Intel Xeon Scalable Platinum 1st Generation (formerly known as Skylake Platinum Scalable) and Intel Xeon Scalable Platinum 2nd Generation (formerly known as Cascade Lake Platinum Scalable) single nodes. Intel brand names can be confusing. See Intel Xeon Scalable for more details.

On Kogence, each Skylake node is available with 2 CPU to 72 CPU configurations. Users can create autoscaling clusters of as many of these nodes as they like.

Individual Core Clock Frequencies

The actual clock frequency of individual cores on modern Intel microprocessors is determined in real time, and the behavior is quite complex. See Intel CPU Clock Frequencies for more details.

The enhanced 14nm process allows Intel to extract additional power efficiency and therefore to clock these processors higher. For the CPU limited work load (CLWL) Skylake nodes (model 8124M) offered at Kogence:

Non-AVX Instructions: Guaranteed base frequency is 3.0GHz. Turbo boost frequency when all cores are active is 3.4GHz. Turbo boost frequency when only one core is active is 3.5GHz.

AVX-2.0 Instructions: Guaranteed base frequency is 2.6GHz. Turbo boost frequency when all cores are active is 3.3GHz. Turbo boost frequency when only one core is active is 3.5GHz.

AVX-512 Instructions: Guaranteed base frequency is 2.1GHz. Turbo boost frequency when all cores are active is 2.7GHz. Turbo boost frequency when only one core is active is 3.5GHz.

For the memory limited work load (MLWL) Skylake nodes (model 8175M) offered at Kogence, for non-AVX instructions, the guaranteed base frequency is 2.5GHz. Turbo boost frequency when all cores are active is 3.1GHz. Turbo boost frequency when only one core is active is 3.5GHz.

Individual Core FLOPs Performance

Non-AVX Instructions: Kogence Skylake nodes can provide 12 to 14 DP GFLOPs per second per core for non-AVX instructions. One can do 4 DP FLOPs per clock cycle. The guaranteed minimum base frequency for non-AVX instructions is 3.0GHz. This means we can get a minimum of 12 DP GFLOPs per second per core for non-AVX instructions. With turbo boost, one can get between 13.6 and 14 DP GFLOPs per second per core, depending upon how many cores are active.

AVX-2.0 Instructions: Kogence Skylake nodes can provide 41.6 to 56 DP GFLOPs per second per core for AVX-2.0 instructions. AVX-2.0 units can do 256 bit arithmetic. For double precision (DP, 64 bits) floating point operations (FLOPs), each 256-bit register holds 256/64 = 4 values, and a fused multiply-add (FMA) counts as 2 operations per value, so each unit can do 4 x 2 = 8 such operations in one clock cycle. On Kogence Skylake nodes, each core has two AVX-2.0 units (see below for details), so they are capable of 16 DP FLOPs per clock cycle. As mentioned above, the minimum guaranteed base frequency for AVX-2.0 instructions is 2.6GHz. This means 16 DP FLOPs can be performed every 0.3846ns (i.e. 1/2.6GHz), so Kogence Skylake nodes can do at least 41.6 DP GFLOPs per second per core at the guaranteed clock frequency for AVX-2.0 instructions. With turbo boost frequencies, Kogence Skylake nodes can do 52.8 DP GFLOPs per second per core when all cores are active, and 56 DP GFLOPs per second per core when only one core is active.

AVX-512 Instructions: Kogence Skylake nodes can provide 67.2 to 112 DP GFLOPs per second per core for AVX-512 instructions. On Kogence Skylake nodes, each core has 2 AVX-512 fused multiply-add (FMA) units (see below for details). These FMA units have 512-bit registers. Each register holds 512/64 = 8 double precision (DP, 64 bits) values, and an FMA counts as 2 floating point operations (FLOPs) per value, so each core is capable of 32 DP FLOPs per clock cycle for AVX-512 instructions (512/64 x 2 x 2 = 32). As mentioned above, the minimum guaranteed base frequency for AVX-512 instructions is 2.1GHz. This means 32 DP FLOPs can be performed every 0.476ns (i.e. 1/2.1GHz) per core, so Kogence Skylake nodes can do at least 67.2 DP GFLOPs per second per core at the guaranteed clock frequency. With turbo boost frequencies, Kogence Skylake nodes can do 86.4 DP GFLOPs per second per core when all cores are active, and 112 DP GFLOPs per second per core when only one core is active.
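The per-core figures above all follow from one relation: peak DP GFLOPs per second = DP FLOPs per clock cycle x clock frequency in GHz. The following is a minimal sketch that reproduces them from the frequencies quoted for the 8124M nodes. The dictionary and function names are illustrative only and are not part of any Kogence tool.

```python
# Peak DP GFLOP/s per core = (DP FLOPs per cycle) x (clock frequency in GHz).
# All numbers are the ones quoted on this page for the 8124M nodes.
FLOPS_PER_CYCLE = {"non-AVX": 4, "AVX2": 16, "AVX-512": 32}

# (guaranteed base, all-core turbo, single-core turbo) in GHz
FREQ_GHZ = {
    "non-AVX": (3.0, 3.4, 3.5),
    "AVX2": (2.6, 3.3, 3.5),
    "AVX-512": (2.1, 2.7, 3.5),
}


def peak_gflops(isa):
    """Return (base, all-core turbo, single-core turbo) peak DP GFLOP/s per core."""
    return tuple(round(FLOPS_PER_CYCLE[isa] * f, 1) for f in FREQ_GHZ[isa])


if __name__ == "__main__":
    for isa in FLOPS_PER_CYCLE:
        base, all_core, one_core = peak_gflops(isa)
        print(f"{isa:8s} base={base}  all-core turbo={all_core}  single-core turbo={one_core}")
```

These are theoretical peaks. Real applications reach them only when every cycle issues a full-width FMA on both units.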

Individual Core Architecture

The architecture of each Skylake core is largely identical to that of Cascade Lake. Skylake does not support the AVX-512 Vector Neural Network Instructions (VNNI), which provide significantly more efficient deep-learning inference acceleration. Furthermore, some Skylake SKUs have 2 inter-socket UPI links compared with 3 in Cascade Lake, and the memory bandwidth within each socket is slightly smaller. Cascade Lake nodes enjoy higher turbo boost frequencies and support Turbo Boost Max Technology 3.0 and Intel Speed Select Technologies (see Intel CPU Clock Frequencies for more details). Cascade Lake also introduces in-hardware mitigations for the Spectre and Meltdown security flaws. See details below.

Cache Architecture of Each Core

  • L1i instruction cache: 32 KB, private to each core; 64 sets; 64 B/line; 8-way
  • L1d data cache: 32 KB, private to each core; 64 sets; 64 B/line; 8-way; fastest latency: 4 cycles; 128 B/cycle load bandwidth; 64 B/cycle store bandwidth; write-back policy
  • L2 cache: 1 MB, private to each core; 64 B/line; 16-way; fastest latency: 14 cycles; 64 B/cycle bandwidth to L1 cache; write-back policy
  • L3 cache: shared non-inclusive 1.375 MB/core; total of 33 MB (24 x 1.375 MB), shared by 24 cores in each socket; 2,048 sets; 64 B/line; 11-way; fastest latency: 50 – 70 cycles; write-back policy
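The cache figures in the list above are tied together by the relation size = sets x ways x line size. The sketch below is a rough consistency check under the assumption that KB and MB in the list mean binary units (1,024 and 1,048,576 bytes). The helper names are hypothetical. Taking the quoted 2,048 sets and 1.375 MB per core for the L3 at face value, the same arithmetic implies 11 ways per set.

```python
LINE_BYTES = 64          # line size quoted for every cache level
KIB = 1024               # assumption: "KB" in the list means KiB
MIB = 1024 * KIB         # assumption: "MB" in the list means MiB


def cache_bytes(sets, ways, line_bytes=LINE_BYTES):
    """Capacity of a set-associative cache."""
    return sets * ways * line_bytes


def implied_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets implied by a capacity and an associativity."""
    return size_bytes // (ways * line_bytes)


def implied_ways(size_bytes, sets, line_bytes=LINE_BYTES):
    """Associativity implied by a capacity and a set count."""
    return size_bytes // (sets * line_bytes)


if __name__ == "__main__":
    print("L1 capacity (KiB):", cache_bytes(sets=64, ways=8) // KIB)
    print("L2 sets:", implied_sets(1 * MIB, ways=16))
    print("L3 ways per slice:", implied_ways(int(1.375 * MIB), sets=2048))
```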

Socket Architecture

Interconnection Between Cores Within a Socket

Skylake's basic socket architecture is largely identical to that of Cascade Lake. As in Cascade Lake, Skylake sockets have integrated memory controllers (IMC), meaning the memory controller is integrated into the same chip (i.e. the processor/socket). Also as in Cascade Lake, Skylake-based HPC servers use Intel's mesh interconnect architecture. The cores (including their private L1 and L2 caches) and the IMCs are organized as an array of tiles in rows and columns, with dedicated connections running along each row and each column. This allows the shortest path between any two tiles, reducing latency and improving bandwidth. Each row and each column of tiles is a half duplex ring interconnect. For the HPC servers that we offer on Kogence, each socket has 30 tiles in a 5x6 array. Two of these tiles are IMCs, 24 tiles are the 24 cores (i.e. 48 CPUs/socket), and the remaining 4 tiles are unoccupied.

Interconnection Between Sockets

On Kogence, we offer the Xeon Scalable class of processors. This class of processors includes Intel's Ultra Path Interconnect (UPI) links so that multiple sockets can be connected to each other in a 2-way configuration. UPI is a high-efficiency coherent interconnect between sockets, allowing multiple sockets to share a single address space. Whereas the various SKUs of Skylake (1st generation Xeon Scalable) may have either 2 or 3 full duplex UPI links, all SKUs of the Cascade Lake Scalable processors (2nd generation Xeon Scalable, formerly CSL-SP) have 3 full duplex UPI links. The Skylake SKUs we offer on Kogence (8124M, 8175M, 8151C) are all equipped with 3 full duplex UPI links. The UPI runs at a speed of 10.4 gigatransfers per second (GT/s). Each link contains separate lanes for the two directions. The total full-duplex bandwidth (3 links x 2 directions) is 62.4 GiB/s.

Memory Architecture

There are two IMCs per socket. These two IMCs form two sub-NUMA clusters within each Skylake socket, creating two localization domains. There are three memory channels per sub-NUMA cluster (a total of 6 channels per socket, i.e. hexa-channel). Each channel can be connected to up to two memory DIMMs (a total of 12 DIMMs per socket). On Kogence compute bound workload nodes, there is one 16-gibibyte (GiB) dual rank DDR4 DIMM with error correcting code (ECC) per channel. In total, the amount of memory is 48 GiB per sub-NUMA cluster, 96 GiB per socket, and 192 GiB per node. The speed of each memory channel is 2,666 megatransfers per second (MT/s, i.e. DDR4-2666). An 8-byte read or write can take place per transfer per channel. With a total of 6 memory channels, the total half-duplex memory bandwidth is 2,666 x 8 x 6 = 127,968 MB/s, i.e. approximately 128 GB/s = 119.18 GiB/s per socket.
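The capacity and bandwidth totals above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures from the paragraph. The constant and function names are illustrative, and the count of two sockets per node is an assumption inferred from the 96 GiB per socket and 192 GiB per node totals.

```python
DIMM_GIB = 16                 # one 16 GiB DIMM per channel
CHANNELS_PER_CLUSTER = 3      # memory channels per sub-NUMA cluster
CLUSTERS_PER_SOCKET = 2       # one sub-NUMA cluster per IMC
SOCKETS_PER_NODE = 2          # assumption: inferred from 96 GiB/socket, 192 GiB/node
TRANSFERS_PER_SEC = 2666e6    # per memory channel
BYTES_PER_TRANSFER = 8        # 8-byte read or write per transfer


def memory_capacity_gib():
    """Return (per sub-NUMA cluster, per socket, per node) capacity in GiB."""
    per_cluster = DIMM_GIB * CHANNELS_PER_CLUSTER
    per_socket = per_cluster * CLUSTERS_PER_SOCKET
    per_node = per_socket * SOCKETS_PER_NODE
    return per_cluster, per_socket, per_node


def socket_bandwidth():
    """Return peak half-duplex bandwidth per socket as (GB/s, GiB/s)."""
    channels = CHANNELS_PER_CLUSTER * CLUSTERS_PER_SOCKET
    bytes_per_sec = TRANSFERS_PER_SEC * BYTES_PER_TRANSFER * channels
    return bytes_per_sec / 1e9, bytes_per_sec / 2**30


if __name__ == "__main__":
    print("Capacity (GiB):", memory_capacity_gib())
    gb, gib = socket_bandwidth()
    print(f"Bandwidth per socket: {gb:.3f} GB/s = {gib:.2f} GiB/s")
```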

I/O Architecture

For I/O, all models incorporate 48 lanes (3 x16) of PCIe 3.0. An additional 4 lanes of PCIe 3.0 are reserved exclusively for DMI to the Lewisburg (LBG) chipset. Selected models, specifically those with the F suffix, have an Omni-Path Host Fabric Interface (HFI) on-package.

Kogence Setup

For CPU bound workloads, Kogence offers Intel Skylake single nodes, each with 2 CPU to 72 CPU configurations. Users can create autoscaling clusters of as many of these nodes as they like. Hyperthreading is turned ON, meaning you can run 2 hardware threads per core. Turbo Boost is turned ON. Maximum Turbo Frequency is 3.50 GHz for non-AVX, AVX2 and AVX-512 instructions.

x86 Instruction Set Extensions

Skylake supports the AVX-512 instruction set extensions AVX512[F,CD,BW,DQ,VL], in addition to the instruction sets SSE, SSE2, SSE3, Supplemental SSE3, SSE4.1, SSE4.2, AVX and AVX2.

With 512-bit floating-point vector registers and two floating-point functional units, each capable of Fused Multiply-Add (FMA), a Skylake core can deliver 32 double-precision floating-point operations per cycle.
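The 32 operations per cycle can be derived from the register width. A 512-bit register holds 512/64 = 8 double-precision values, each FMA counts as two operations (one multiply and one add), and there are two FMA units per core. The sketch below spells out that arithmetic. The function name and its parameters are illustrative only.

```python
def dp_flops_per_cycle(vector_bits, fma_units, dp_bits=64, flops_per_fma=2):
    """Peak double-precision FLOPs per cycle for one core.

    vector_bits   -- width of the vector registers (256 for AVX2, 512 for AVX-512)
    fma_units     -- number of FMA-capable functional units per core
    dp_bits       -- size of one double-precision value (64 bits)
    flops_per_fma -- an FMA performs one multiply and one add per lane
    """
    lanes = vector_bits // dp_bits
    return lanes * flops_per_fma * fma_units


if __name__ == "__main__":
    print("AVX-512:", dp_flops_per_cycle(512, fma_units=2))
    print("AVX2:   ", dp_flops_per_cycle(256, fma_units=2))
```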

If you are deploying your own custom containers on Kogence, use the Intel compiler flag -xCORE-AVX512 for Skylake and Cascade Lake specific optimizations. The optimization flags -qopt-zmm-usage=high -xCORE-AVX512 may benefit floating-point heavy applications running on Skylake and Cascade Lake. If you want a single executable that will run on Skylake, Cascade Lake and processors that do not support the AVX-512 instruction sets, you can let the suitable optimization be determined at run time by compiling your application with the options -O3 -ipo -axCORE-AVX512,CORE-AVX2,AVX -xSSE4.2.
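Before running a binary built with -xCORE-AVX512, it can be useful to confirm that the node actually reports the AVX-512 extensions listed above. The following is a minimal sketch, assuming a Linux container where /proc/cpuinfo is readable and the kernel reports the features under the lowercase flag names avx512f, avx512cd, avx512bw, avx512dq and avx512vl. The function names are hypothetical and not part of any Kogence tooling.

```python
# AVX512[F,CD,BW,DQ,VL] as they are assumed to appear in /proc/cpuinfo on Linux.
REQUIRED_FLAGS = {"avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl"}


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supports_skylake_avx512(cpuinfo_text):
    """True if every AVX-512 extension listed for Skylake is reported."""
    return REQUIRED_FLAGS <= cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, or /proc is not mounted
    print("AVX-512 (F,CD,BW,DQ,VL) available:", supports_skylake_avx512(text))
```

If the check reports False, a multi-path build such as the -axCORE-AVX512,CORE-AVX2,AVX -xSSE4.2 option described above is the safer choice.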