Choosing the Right Building Blocks for Your Virtual SMP

October 16, 2018 | By Benzi Galili

If you’ve had any experience in designing solutions for IT deployment — specifically with HPC solutions — then you already know that there are always trade-offs. Budget is almost always finite and tight, and you must make sure that you use those scarce dollars to yield the best system for your workloads. If you’re lucky, your mix of workloads can be characterized, and you can apply this knowledge to your solution design.

If you’ve decided to deploy an Intel-based SMP solution, you’ll need to ask yourself which processor to use, in which server model, and how much memory is needed per node. The answers to these questions depend on a few factors. For the sake of this discussion, I’ll work with Intel’s Xeon Scalable Processors (Xeon SP), which are the latest generation, and ignore older-generation processors (e.g. Haswell E5-v3 and E7-v3, or Broadwell E5-v4 and E7-v4).

With the introduction of the Xeon SP, Intel gave up a long-standing differentiating feature between high- and low-end processors, namely the number of DIMMs per socket. Before Xeon SP, high-end processors aimed at mission-critical applications such as databases had double the DIMM slots. For example, the E7-v4 family supported 24 DIMMs per socket, compared to a maximum of 12 DIMMs per socket for the E5-v4 family. With Xeon SP, every processor connects to at most 12 DIMMs. Intel compensated for this with the “M” part numbers (e.g. Xeon 8180M), which are certified to use 128GB DIMMs, so users who truly need the extra memory capacity can get it, at a hefty premium on both the processor and the memory.

Obviously, there are other differentiating features between the “bins” of Xeon Bronze, Silver, Gold and Platinum (respectively 31xx, 41xx, 51xx/61xx, and 81xx) — most notably, the number of cores. But other differentiating features include cache sizes, number of AVX-512 FMA units, availability of hyperthreading and turbo, and number of UPI links (Intel UltraPath Interconnect) for interconnecting processors. Most of those are less relevant to the SMP solution architecture discussion, except for UPI, which I will touch on later.

To give you a feel for how strongly Intel differentiates on available memory capacity, consider this: the “M” processors cost an extra $3,000 apiece. In other words, to move from a maximum of 768GB per socket (12x 64GB DIMMs) to 1536GB per socket (12x 128GB DIMMs), Intel charges a premium of $250 per DIMM slot, or $3.91 per additional GB when going from 64GB to 128GB DIMMs. As of October 2018, DRAM runs roughly $10/GB for DIMMs up to 64GB in size, and about $20/GB for 128GB DIMMs.
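For the arithmetic-inclined, the premium works out like this (a quick sketch using only the figures above; plug in your own pricing):

```python
# Back-of-the-envelope cost of the "M" SKU memory-capacity premium.
# Inputs are the figures quoted above; adjust for your actual pricing.
m_sku_premium_usd = 3000      # extra cost of an "M" processor vs. the standard part
dimms_per_socket = 12         # Xeon SP: 6 memory channels x 2 DIMMs per channel
gb_per_dimm_before = 64
gb_per_dimm_after = 128

premium_per_dimm = m_sku_premium_usd / dimms_per_socket       # $250 per DIMM slot
extra_gb_per_dimm = gb_per_dimm_after - gb_per_dimm_before    # 64 GB gained per DIMM
premium_per_gb = premium_per_dimm / extra_gb_per_dimm         # ~$3.91 per additional GB

print(f"Premium per DIMM slot: ${premium_per_dimm:.2f}")
print(f"Premium per additional GB: ${premium_per_gb:.2f}")
```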

Of course, with vSMP ServerONE, you can mix and match any building block: any certified server model with any processor model. Moreover, vSMP ServerONE does not require that all processors be identical. This means that if you need modest processing power coupled with large memory, you could provision as little as one node with the processing power you need, and connect it to nodes carrying lower-spec, cheaper processors, or even older servers with older processors, to be used as memory donors. This gives you a level of flexibility and savings that simply doesn’t exist in hard-wired systems.

As such, if you are interested mostly in memory expansion, you could optimize for the lowest overall $/GB at the platform level. If it is cheaper to add 2TB of system memory by adding a quad-socket server with that much memory, take that path. If it’s more cost-effective to do it by adding two dual-socket servers, each with 1TB of memory, choose that path instead. Moreover, if you have four older servers with 512GB each that you were planning to retire, you could simply use those to increase the total memory of the virtual SMP.
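One simple way to frame that comparison is delivered $/GB at the node level. The sketch below shows the idea; the configurations and prices are hypothetical placeholders, not quotes, so substitute your own numbers:

```python
# Compare candidate memory-expansion building blocks by $/GB at the node level.
# All configurations and prices below are hypothetical placeholders.
candidates = [
    {"name": "1x quad-socket, 2TB",      "nodes": 1, "gb_per_node": 2048, "usd_per_node": 40000},
    {"name": "2x dual-socket, 1TB each", "nodes": 2, "gb_per_node": 1024, "usd_per_node": 17000},
    {"name": "4x retired 2S, 512GB each","nodes": 4, "gb_per_node": 512,  "usd_per_node": 0},
]

for c in candidates:
    total_gb = c["nodes"] * c["gb_per_node"]
    total_usd = c["nodes"] * c["usd_per_node"]
    usd_per_gb = total_usd / total_gb
    print(f'{c["name"]}: {total_gb} GB total, ${total_usd}, ${usd_per_gb:.2f}/GB')
```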

And if your workload is more compute-bound than memory-bound, you should first check what influences it more: the total number of cores, or the clock speed of those cores? In other words, does your workload scale well with more cores, or would a lower core count at a higher frequency yield better performance? This characterization can be done on a standard dual-socket server, for example. Once you have the answer, simply select the processors that make the most sense for your workload, in the most cost-effective “packaging” (typically a dual-socket or quad-socket server).
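If you want a quick first-order read on core scaling before committing to a processor bin, something along the lines of the sketch below runs on any dual-socket box you already have. The compute kernel here is just a stand-in; substitute a representative slice of your actual workload:

```python
# Rough core-scaling check: run a fixed amount of work on 1, 2, 4, ... workers
# and see whether the speedup keeps tracking the worker count.
# The kernel below is a placeholder; replace it with a representative workload slice.
import time
from multiprocessing import Pool

def kernel(n):
    # CPU-bound stand-in workload
    total = 0
    for i in range(n):
        total += i * i
    return total

TOTAL_WORK = 20_000_000   # fixed total amount of work, split across workers

if __name__ == "__main__":
    baseline = None
    for workers in (1, 2, 4, 8, 16):
        chunks = [TOTAL_WORK // workers] * workers
        start = time.perf_counter()
        with Pool(workers) as pool:
            pool.map(kernel, chunks)
        elapsed = time.perf_counter() - start
        baseline = baseline or elapsed
        print(f"{workers:2d} workers: {elapsed:6.2f}s  speedup {baseline / elapsed:4.1f}x")
```

If the speedup flattens well before you run out of cores, higher-frequency, lower-core-count parts are likely the better buy.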

In the case that your workload’s main bottleneck is memory bandwidth, you should choose dual-socket servers: each socket contributes its own set of memory channels, and dual-socket nodes keep memory accesses within a node at most one UPI hop away, so spreading your cores across more of them generally yields more aggregate bandwidth per core than packing them into larger boxes.
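For a feel of the numbers, theoretical per-socket DRAM bandwidth is simply channels times per-channel transfer rate. The sketch below assumes DDR4-2666 and fully populated channels, so treat it as an illustration rather than a measured figure:

```python
# Theoretical peak DRAM bandwidth per socket and per node (illustrative).
# Assumes DDR4-2666; real, sustained bandwidth will be noticeably lower.
mt_per_s = 2666                  # DDR4-2666: 2666 mega-transfers per second
bytes_per_transfer = 8           # 64-bit memory channel
channels_per_socket = 6          # Xeon SP

per_channel_gbps = mt_per_s * bytes_per_transfer / 1000    # ~21.3 GB/s
per_socket_gbps = per_channel_gbps * channels_per_socket   # ~128 GB/s
print(f"Per channel:       {per_channel_gbps:.1f} GB/s")
print(f"Per socket:        {per_socket_gbps:.1f} GB/s")
print(f"Dual-socket node:  {2 * per_socket_gbps:.1f} GB/s")
```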

A few words about Intel’s UPI and Xeon SP:

The Xeon SP models 31xx, 41xx and 51xx have only two UPI links. That allows for pretty good inter-processor connectivity in a dual-socket system, but limits 4-socket systems to a “ring” topology, which is suboptimal.

Xeon SP models 61xx and 81xx have three UPI links each. This allows a system vendor to interconnect four processors in a full mesh, which delivers better performance thanks to lower latency and higher cross-socket bandwidth.
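To see why the extra link matters, it is enough to count socket-to-socket hops in each topology. The sketch below just enumerates worst-case hop counts for a 4-socket ring versus a full mesh:

```python
# Worst-case socket-to-socket hop count: 4-socket ring vs. full mesh.
def max_hops(adjacency):
    # Breadth-first search from every socket; return the largest shortest-path length.
    worst = 0
    for src in adjacency:
        dist = {src: 0}
        frontier = [src]
        while frontier:
            nxt = []
            for node in frontier:
                for neigh in adjacency[node]:
                    if neigh not in dist:
                        dist[neigh] = dist[node] + 1
                        nxt.append(neigh)
            frontier = nxt
        worst = max(worst, max(dist.values()))
    return worst

ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}            # 2 UPI links per socket
mesh = {s: [t for t in range(4) if t != s] for s in range(4)}  # 3 UPI links per socket

print("Ring (2 UPI links):", max_hops(ring), "hop(s) worst case")   # 2
print("Mesh (3 UPI links):", max_hops(mesh), "hop(s) worst case")   # 1
```

With only two UPI links, a quarter of the remote memory accesses in a 4-socket system must traverse two hops, which is exactly the latency penalty the full mesh avoids.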

To summarize, if you can characterize your typical SMP workloads, you can optimize the solution’s building blocks to maximize overall performance or to reduce cost. Doing so may require thinking not only about GHz and FLOPS, but also about other aspects of the chosen hardware components, such as total RAM per node and the internal interconnects.

Want to learn more? Schedule a call with one of ScaleMP’s System Architects.