
Intel Xeon D-2700 or D-1700 SoCs are the preferred "low power servers" in practice. Yes, they idle at 25W, but I bet you that a Xeon D-2700 will have more performance than 10x Rasp. Pi 5.

More importantly: there will be server-class features like out-of-band management/IPMI, 10Gbase-T ports, etc. etc.




Note that this board runs RK1's (which use the RK3588 SoC), not Raspberry Pis.

It burns 18.5W with CPU cores maxed out, producing 224 Gflops of compute (see: https://github.com/geerlingguy/top500-benchmark/issues/27). Efficiency-wise, it's a little more power efficient than at least last-gen AMD/Intel chips, and a little less than Ampere CPUs and Apple's M1. I haven't had a chance to compare to M2/M3 for power efficiency yet.

The Turing Pi 2 has a BMC that allows out-of-band management, the firmware is fairly stable at this point, and the community helped build a nice CLI for it, so IMO it's even nicer than IPMI for some things (from an automation/remote sysadmin perspective). There's no remote VNC/display, but these aren't meant to have displays (even though node 1 has HDMI routed to a connector on the back).

The most crippling thing for my use cases is the single 1 Gbps connection externally, and the internal 1 Gbps Ethernet switching. It'd be nice if it did at least 2.5 Gbps.


Xeons / AMD EPYCs are measured in (low single-digit) TFlops, not GFlops.

AVX512 goes a long way, especially with fused multiply-add (FMA) instructions and multiple pipelines.

That's the thing though: TFlops is still "weak" because the GPU market crushes Linpack so hard it's just not comparable. But I'd assume that your typical Xeon D will run circles around these other cores watt-for-watt.

It's the GPUs that Xeon D is competing with and struggling against. But if we're talking about a bunch of CPU-style compute (non-Linpack workloads), or even the occasional matrix multiply, it's hard to beat Xeon or EPYC in cost and power efficiency.

And you certainly ain't gonna beat a Xeon or EPYC with tiny ARM chips right now.


I thought we were talking about Xeon D (1700 series specifically called out)—the typical CPUs in that range are in the low 100s of Gflops, similar to the performance of this 4 node cluster of RK3588s.

Obviously, throw a 64, 96, or 128+ core EPYC or the latest Xeons at it and that will make it cry (while gobbling up hundreds of watts doing so ;)

Only the Xeon D-2700 series starts touching above 500 Gflops in that generation, and those CPUs (just the CPUs, mind you) are in the $1000-2000 range, over 100W TDP.


Xeon D 2700 comes in 16-core flavors. At 2GHz AVX512 with 1x FMA per clock tick, that's at least 1TFlop, especially because other pipelines are still available to contribute to the processing.

2 GHz x 16 SIMD lanes (32 bits per lane) x 16 cores x 2 ops per instruction (multiply + add) == 1 TFlop or thereabouts.
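
A minimal sketch of that napkin math in Python, assuming a 2 GHz sustained AVX512 clock and a single 512-bit FMA port per core (both assumptions, not measured figures):

    # Peak FP32 throughput for a hypothetical 16-core Xeon D-2700 class part.
    GHZ = 2.0               # assumed sustained clock under AVX512 load
    CORES = 16
    LANES_FP32 = 512 // 32  # 16 single-precision lanes per 512-bit register
    OPS_PER_FMA = 2         # a fused multiply-add counts as 2 FLOPs
    FMA_PORTS = 1           # assumption: one 512-bit FMA unit per core

    peak_gflops = GHZ * CORES * LANES_FP32 * OPS_PER_FMA * FMA_PORTS
    print(f"Peak FP32: {peak_gflops:.0f} GFLOPS")  # ~1024 GFLOPS, i.e. about 1 TFlop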

> Obviously, throw a 64, 96, or 128+ core EPYC or the latest Xeons at it and that will make it cry (while gobbling up hundreds of watts doing so ;)

And those are closer to 4 to 10 TFlops. The bigger Xeons do 2x FMA instructions (or more) per clock tick IIRC. So not only are there more cores, each core can do double the work.

We are talking about the "little" Xeon D with relatively low clocks. But 512-bit FMA really gets going.

---------

GPUs are well into 20TFlop to 50TFlop regions.

-------

Although I'm calculating ~32-bit Flops, you may have done 64-bit Flops. It's a bit ambiguous, so we might not be comparing apples to apples yet.
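
To make those two caveats concrete, here's the same napkin math with the float width and FMA port count as parameters (the 64-core figure is purely illustrative, not a specific SKU):

    # Same back-of-envelope peak, parameterized on float width and FMA ports.
    def peak_gflops(ghz, cores, vector_bits, element_bits, fma_ports):
        lanes = vector_bits // element_bits
        return ghz * cores * lanes * 2 * fma_ports  # 2 FLOPs per FMA

    # 16-core Xeon D class part, one FMA port, FP64 (what Linpack measures):
    print(peak_gflops(2.0, 16, 512, 64, 1))  # 512 GFLOPS, half the FP32 figure

    # Hypothetical 64-core server part with two FMA ports per core, FP64:
    print(peak_gflops(2.0, 64, 512, 64, 2))  # 4096 GFLOPS, the "4 to 10 TFlops" range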


> And you certainly ain't gonna beat a Xeon or EPYC with tiny ARM chips right now.

Do you include Apple's M family in that list of tiny ARM chips?


That's an edge case. But I'm mostly talking about RK1 or Rasp. Pi 5.

I'm well aware of "large" ARM chips like the Fujitsu A64FX that did well (when it came out) in terms of Linpack performance, thanks to hugely efficient design choices like HBM and 512-bit SVE.

I'm not very well aware of what the Apple M1 brings to the table. I know some people are experimenting with it, but it's a far larger chip than the RK1 or Rasp. Pi, though probably smaller than the A64FX.

I know that Apple M1 has only 128-bit vectors though, so that's a big penalty vs AVX512 or even the older Fujitsu A64FX. I'd expect them to be bad at Linpack / Matrix multiplication as this is where SIMD shines exceptionally well... and AVX512 on Xeon is a very good SIMD implementation.

I recognize the Apple M1 has multiple 128-bit pipelines that operate in parallel per core, so it's better than it looks, but there's a huge power-efficiency advantage to 512-bit vector units in Linpack-style matrix-multiplication code.
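
As a rough sketch of the per-core arithmetic (assuming four 128-bit FMA-capable NEON pipes on an M1 performance core, and the single 512-bit FMA port per core assumed upthread):

    # FP32 FLOPs per cycle per core under those assumptions.
    def flops_per_cycle(pipes, vector_bits, element_bits=32):
        return pipes * (vector_bits // element_bits) * 2  # FMA = 2 FLOPs per lane

    print("M1 core:    ", flops_per_cycle(4, 128))  # 32 FLOPs/cycle
    print("Xeon D core:", flops_per_cycle(1, 512))  # 32 FLOPs/cycle

Under those assumptions the per-cycle throughput comes out about even; the efficiency case for the 512-bit unit is that it issues a quarter as many instructions per FLOP, trimming front-end and scheduling overhead.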


The M chips have some matrix multipliers in hardware as well.


> The most crippling thing for my use cases is the single 1 Gbps connection externally, and the internal 1 Gbps Ethernet switching. It'd be nice if it did at least 2.5 Gbps.

Their GitHub seems to suggest they're working on getting both jacks working at the same time... it's been a couple of days since I last checked, so they might have it working now.


> Note that this board runs RK1's (which use the RK3588 SoC), not Raspberry Pis.

The Turing Pi 2 was designed to run the Raspberry Pi CM4 via adapter, as well as the Jetson Nano (?). The RK1 is a somewhat recent development that shipped well after the Turing Pi 2's release.


I'm specifically calling out the build and SoMs that were mentioned in the OP here.


The RK3588 does seem to comfortably beat the Xeon D-1700/D-2700 series at performance per watt and per dollar.

Compare the RK3588 to the Xeon D-1718T as an example:

RK3588: https://browser.geekbench.com/v6/cpu/5757475

Xeon D-1718T: https://browser.geekbench.com/v6/cpu/945

This Xeon is a 46W TDP part with a $274 USD unit price for the CPU alone. The RK3588 has a ~11W TDP, and entire Orange Pi 5 boards can be bought for around $120 USD. The RK3588 gets around 62% of the multicore performance for around a quarter of the power budget, at vastly cheaper SBC prices compared to Xeon D series servers.
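
A quick ratio sketch in Python, using only the figures above and treating TDP as a rough proxy for power draw:

    # Back-of-envelope perf-per-watt and perf-per-dollar comparison.
    relative_perf = 0.62             # RK3588 multicore score relative to the D-1718T
    rk3588_watts, xeon_watts = 11, 46
    rk3588_usd, xeon_usd = 120, 274  # Orange Pi 5 board vs. bare Xeon CPU price

    perf_per_watt_ratio = relative_perf * xeon_watts / rk3588_watts
    perf_per_dollar_ratio = relative_perf * xeon_usd / rk3588_usd
    print(f"perf/W advantage: ~{perf_per_watt_ratio:.1f}x")   # ~2.6x
    print(f"perf/$ advantage: ~{perf_per_dollar_ratio:.1f}x") # ~1.4x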


TDP is heat watts, not electricity watts.

To measure electricity, you need to actually measure it. It's not a specified value. It also depends on workload.

You will find that at high performance, all chips use more watts than you expect.

But going by most of the benchmarks I've seen, Xeon and EPYC hold firm as the most obviously power-efficient and power-performant chips in practice.

The Apple M1 is pretty much the only chip in this discussion that's in the running, and only in 64-bit performance at that; the AVX512 units in Xeon are too good otherwise.


EPYC is decent, Xeon has a long way to go still.

You mention actually measuring it, that's exactly what I've been doing in my top500 project: https://github.com/geerlingguy/top500-benchmark#results

The RK3588 gets up to 4.64 Gflops/W, only beaten by a couple expensive Ampere Altra chips in my testing so far.

A few of the newest EPYC chips touch that range, but none of the Xeons I've seen do (yet). I would like to test a few more, but getting access to bare metal is tough, especially since I only need it for a day or two and need it plugged into a consistent power monitoring device.


Interesting project!

Well, good luck with it. If you're using HPL (Linpack), that could be a 2x multiplier difference between your results and my napkin math, as I've been assuming 32-bit floats, while Linpack uses 64-bit floats IIRC.



