Turing RK1 is 2x faster, 1.8x pricier than Pi 5 (jeffgeerling.com)
68 points by ingve 10 days ago | hide | past | favorite | 49 comments





I wish the half-width 10" rack were more popular in the US. I get the impression they are more popular in Europe.

It should be a good size for home networking gear and small servers. Also, it should be possible to mount a 10" device in a 19" rack, though I don't think it is possible to mount two of them side by side.


Back in the day some people tried to go the other way: half-depth servers light enough to fit into musicians' audio racks, which also tend to be rather small affairs of 6-12 units tall. Noise, however, is often the problem with these things, largely because the 1-2U height forces comically small, loud fans.

I suppose that servers would tend to be taller in the half-width racks?


agreed, and it should be great for form factors in the range of Apple TVs, Mac Minis, SOHO-size network switches, and NUCs. If what you want is a homelab cluster, a 19-inch rack seems needlessly large. If only this 10 (or is it 10.5?) inch rack were a more standardized form factor, with gear designed to slide right in and out of it.

These things are getting too expensive.

For most embedded projects it is cheaper to either scale down to an Arduino/ESP or up to a used laptop.


That Pis/Pi-likes can do traditional embedded things like I2C or GPIO seems almost incidental these days. The used/low-end laptop is much more comparable, but then you have something big, bulky, power hungry (almost always with a fan), and already outdated (maybe not in absolute performance, but in technology/support). If you need more than one at a time, things get awkward and dissimilar, and if you happen to run across a traditional embedded use case on the side, e.g. GPIO, the interfaces won't really be there.

And on the other side, the cheaper ones still exist, e.g. the just-launched CM4S goes down to $30 with some storage included. Not exactly microcontroller cost, but not exactly bad for the convenience if that matches your use case. Then if you need to scale up, your ceiling is much higher (even in relative terms) than 10 years ago.


What kind of projects do people have that need both I2C _and_ powerful compute? What percentage of your projects are like that?

You can hook up a bunch of sensors to a $5 ESP32 and that thing uses much less power. If you need to do any heavy computing, you can offload it over WiFi to your homelab or the cloud.


For 'serious' use I would always recommend ECC RAM.

If you're not running in two independent regions of tier 4 data centers, are you even really serious? /s

There are 'serious' applications of technology that can tolerate memory faults.


> There are 'serious' applications of technology that can tolerate memory faults.

We can define "serious" use to mean "use where memory faults would be more expensive than buying ECC RAM" :)


Which actually happens pretty easily.

If you're using it professionally, the first time you spend a day debugging something that turns out to be just a memory fault ECC would have caught, you've lost the price advantage.
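To put rough numbers on "lost the price advantage": a back-of-the-envelope break-even, with all figures purely hypothetical placeholders:

```python
# Break-even on the ECC premium vs. debugging time.
# Both inputs are hypothetical, not quotes or measurements.

def breakeven_incidents(ecc_premium_usd, debug_day_cost_usd):
    """How many one-day debugging sessions pay for the ECC upgrade?"""
    return ecc_premium_usd / debug_day_cost_usd

# e.g. a $100 ECC premium vs. an engineer-day costed at $800:
print(breakeven_incidents(100, 800))  # 0.125 -> a single incident already pays for it
```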


Not to mention multiplanetary!

only two? you know you should have N+2 deployment for real redundancy

Only if N+2 is an odd number, otherwise you may end up with a split brain scenario
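The split-brain point is just majority-quorum arithmetic; a minimal sketch, not tied to any particular consensus implementation:

```python
def has_quorum(votes, cluster_size):
    """A partition can make progress only with a strict majority of nodes."""
    return votes > cluster_size // 2

# Even cluster: a symmetric 2/2 split leaves neither half with quorum,
# which is why even node counts risk stalemate or split brain.
print(has_quorum(2, 4))  # False for both halves of a 4-node cluster
# Odd cluster: any two-way partition gives exactly one side a majority.
print(has_quorum(2, 3), has_quorum(1, 3))  # True False
```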

That's why I always do N=-1 replication :D

Why? For that price you get a used laptop with higher perf.

Intel Xeon D-2700 or D-1700 series SoCs are the preferred "low power servers" in practice. Yes, they idle at 25W, but I bet you that the Xeon D-2700 will have more performance than 10x Raspberry Pi 5s.

More importantly: there will be server-class features like out-of-band management/IPMI, 10Gbase-T ports, etc. etc.


Note that this board runs RK1's (which use the RK3588 SoC), not Raspberry Pis.

It burns 18.5W with CPU cores maxed out, producing 224 Gflops of compute (see: https://github.com/geerlingguy/top500-benchmark/issues/27). Efficiency-wise, it's a little more power efficient than at least last-gen AMD/Intel chips, and a little less than Ampere CPUs and Apple's M1. I haven't had a chance to compare to M2/M3 for power efficiency yet.

The Turing Pi 2 has a BMC that allows out-of-band management, the firmware is fairly stable at this point, and the community helped build a nice CLI for it, so IMO it's even nicer (from an automation/remote-sysadmin perspective) than IPMI for some things. There's no remote VNC/display, but these aren't meant to have displays (even though node 1 has HDMI routed to a connector on the back).

The most crippling thing for my use cases is the single 1 Gbps connection externally, and the internal 1 Gbps Ethernet switching. It'd be nice if it did at least 2.5 Gbps.


Xeons / AMD EPYCs are measured in (low single-digit) TFlops, not GFlops.

AVX512 gets places, especially with Fused Multiply-accumulate (FMA) instructions and multiple pipelines.

That's the thing though: TFlops is still "weak" because the GPU market crushes Linpack so hard it's just not comparable. But I'd assume that your typical Xeon D will run circles around these other cores watt-for-watt.

It's the GPUs that Xeon D is competing with and struggling against. But if we're talking about a bunch of CPU-style compute (non-Linpack workloads), or even the occasional matrix multiply, it's hard to beat Xeon or EPYC in cost and power efficiency.

And you certainly ain't gonna beat a Xeon or EPYC with tiny ARM chips right now.


I thought we were talking about Xeon D (1700 series specifically called out)—the typical CPUs in that range are in the low 100s of Gflops, similar to the performance of this 4 node cluster of RK3588s.

Obviously, throw a 64, 96, or 128+ core EPYC or the latest Xeons at it and that will make it cry (while gobbling up hundreds of watts doing so ;)

Only the Xeon D-2700 series starts touching above 500 Gflops in that generation, and those CPUs (just the CPUs, mind you) are in the $1000-2000 range, over 100W TDP.


Xeon D-2700 comes in 16-core flavors. At 2GHz, AVX-512 with 1x FMA per clock tick gives at least 1 TFlop, especially because other pipelines are still available to contribute to the processing.

2GHz x 16 SIMD lanes (32 bits per lane) x 16 cores x 2 ops per instruction (multiply + accumulate) == 1 TFlop or thereabouts.
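That napkin math, written out (the clock, lane, and core counts are the assumptions above, not measured values):

```python
# Peak FP32 throughput under the comment's assumptions:
# 2 GHz clock, one 512-bit FMA per cycle, 16 cores,
# and 2 flops per FMA (multiply + add).
clock_hz = 2e9
lanes = 512 // 32   # fp32 lanes in a 512-bit vector = 16
cores = 16
flops_per_fma = 2

peak_gflops = clock_hz * lanes * cores * flops_per_fma / 1e9
print(peak_gflops)  # 1024.0 -> roughly 1 TFlop
```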

> Obviously, throw a 64, 96, or 128+ core EPYC or the latest Xeons at it and that will make it cry (while gobbling up hundreds of watts doing so ;)

And those are closer to 4 to 10 TFlops. The bigger Xeons do 2x FMA instructions or more per clock tick, IIRC. So not only are there more cores, each core can do double the work.

We are talking about the "little" Xeon D with relatively low clocks. But 512-bit FMA really gets going.

---------

GPUs are well into 20TFlop to 50TFlop regions.

-------

Although I'm calculating ~32-bit flops, you may have done 64-bit flops. It's a bit ambiguous, so we might not be comparing apples to apples yet.


> And you certainly ain't gonna beat a Xeon or EPYC with tiny ARM chips right now.

Do you include Apple's M family in that list of tiny ARM chips?


That's an edge case. But I'm mostly talking about the RK1 or Raspberry Pi 5.

I'm well aware of "large" ARM chips like the Fujitsu A64FX that did well (when it came out) in Linpack performance, thanks to hugely efficient design choices like HBM and 512-bit SVE.

I'm not very well aware of what the Apple M1 brings to the table. I know some people are experimenting with it, but it's a far larger chip than the RK1 or Raspberry Pi's, though probably smaller than the A64FX.

I know that the Apple M1 has only 128-bit vectors though, so that's a big penalty vs. AVX-512 or even the older Fujitsu A64FX. I'd expect it to be worse at Linpack / matrix multiplication, as this is where SIMD shines exceptionally well... and AVX-512 on Xeon is a very good SIMD implementation.

I recognize the Apple M1 has multiple 128-bit pipelines operating in parallel per core, so it's better than it looks, but there's a huge power-efficiency advantage to 512-bit vector units in Linpack-style matrix-multiplication code.


The M chips have some matrix multipliers in hardware as well.

> The most crippling thing for my use cases is the single 1 Gbps connection externally, and the internal 1 Gbps Ethernet switching. It'd be nice if it did at least 2.5 Gbps.

Their GitHub seems to suggest they're working on getting both jacks working at the same time... it's been a couple of days since I last checked, so they might have it working by now.


> Note that this board runs RK1's (which use the RK3588 SoC), not Raspberry Pis.

Turing Pi 2 was designed to run the Raspberry Pi CM4 via adapter, as well as the Jetson Nano (?). The RK1 is a somewhat recent development that shipped well after the Turing Pi 2's release.


I'm specifically calling out the build and SoM's that were mentioned in the OP here.

The RK3588 does seem to comfortably beat the Xeon D-1700/D-2700 series at performance per watt and per dollar.

Compare the RK3588 to the Xeon D-1718T as an example:

RK3588: https://browser.geekbench.com/v6/cpu/5757475

Xeon D-1718T: https://browser.geekbench.com/v6/cpu/945

This Xeon is a 46W TDP part with a $274 USD unit price for the CPU alone. The RK3588 has a ~11W TDP, and entire Orange Pi 5 boards can be bought for around $120 USD. The RK3588 gets around 62% of the multicore performance for around a quarter of the power budget, at vastly lower SBC prices compared to Xeon D series servers.
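Working the ratios out from the figures quoted above (62% of the multicore score, 46W vs. ~11W TDP, $274 CPU vs. ~$120 board); note these are TDPs and list prices, not measured draw:

```python
# Rough perf-per-watt / perf-per-dollar ratios from the figures quoted above.
perf_ratio = 0.62                 # RK3588 multicore score relative to the Xeon D-1718T
tdp_xeon, tdp_rk = 46, 11         # watts (TDP, not measured draw)
price_xeon, price_rk = 274, 120   # USD (CPU alone vs. whole SBC)

perf_per_watt_advantage = perf_ratio / (tdp_rk / tdp_xeon)
perf_per_dollar_advantage = perf_ratio / (price_rk / price_xeon)
print(round(perf_per_watt_advantage, 2))    # 2.59 -> roughly 2.6x per watt
print(round(perf_per_dollar_advantage, 2))  # 1.42 -> roughly 1.4x per dollar
```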


TDP is heat watts, not electricity watts.

To measure electricity, you need to actually measure it. It's not a specified value. It also depends on workload.

You will find that at high performance, all chips use more watts than you expect.

But given most of the benchmarks I've seen, Xeon and EPYC hold firm as the most obviously power-efficient and power-performant chips in practice.

Apple M1 is pretty much the only chip in this discussion that's in the running, and only in 64-bit performance at that; the AVX-512 units in Xeon are too good otherwise.


EPYC is decent, Xeon has a long way to go still.

You mention actually measuring it, that's exactly what I've been doing in my top500 project: https://github.com/geerlingguy/top500-benchmark#results

The RK3588 gets up to 4.64 Gflops/W, only beaten by a couple expensive Ampere Altra chips in my testing so far.
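For context, the Gflops/W figure is just the HPL result divided by average wall power during the run; the numbers below are illustrative stand-ins, not measurements from the project:

```python
def hpl_efficiency(rmax_gflops, avg_power_watts):
    """HPL energy efficiency in Gflops per watt."""
    return rmax_gflops / avg_power_watts

# Illustrative only: a board scoring ~51 Gflops at ~11 W average draw
print(round(hpl_efficiency(51.0, 11.0), 2))  # 4.64
```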

A few of the newest EPYC chips touch that range, but none of the Xeons I've seen do (yet). I would like to test a few more, but getting access to bare metal is tough, especially since I only need it for a day or two and need to get it plugged into a consistent power-monitoring device.


Interesting project!

Well, good luck with it. If you're using HPC Linpack, there could be a 2x multiplier difference between your results and my napkin math, as I've been assuming 32-bit floats, while Linpack is 64-bit floats IIRC.


Depends on how long you want to run it and your utility costs, performance/watt is good on newer chips like these.

I wish it were easier to quantify the energy cost of building these newer more efficient devices, as opposed to repurposing existing hardware. When you add e-waste to the equation, it gets cloudier.

Goes back to the reduce, reuse, recycle thing...

It would probably be better for the environment if we forced software devs to optimize, so we could have all the modern goodies while using like 1/1000th of the energy to get there.

Back in the 80s and 90s, programmers would proudly shave off kB of code and use little hacks to pull every last bit out of limited resources. Now we run like 90 instances of a browser, requiring like 2 GB of RAM for a simple chat app :D


There is way too little environmentalist pressure on IT companies. It's like they haven't realized how wasteful the IT world is.

I think there need to be some tiny Lego or similar figurines operating those tiny blades on those tiny racks :-D

I love the idea of Lego minifigs posed inside a chassis for these compute module cluster boards, like they're IT workers in a datacenter.

Every time you pull a server and pop the lid, they're in there working.

(Maybe not for production, lest a minifig get sucked into a fan, taking down a mission-critical cat picture sharing app.)


It looks like the RK1 is only compatible with the Turing Pi 2 4-slot board. I was hoping it was CM4 compatible. A module faster than the Pi 5, before the CM5 comes out, would be useful.

I think the power draw is going to be too high to be CM4 compatible.

It looks like the RK1 has the same power draw, 7W, as the CM4.

It seems that the carrier board connector is the same as the Nvidia Jetson's.


> It may be an irrational hobby

Refreshing to hear someone just come out and say it. I don't get it but hey, I'm into all kinds of weird stuff that doesn't make sense either.

Just don't try to argue that the capability, performance, or even power usage makes any practical sense. It doesn't.


It certainly makes for a fun homelab to set up a bare-metal cluster, say k8s, in a desk drawer, where Pis usually end up anyway.

The original one of these was 6 slots wasn't it? How did we end up with 4 in the last two generations? Bandwidth concerns?

You're probably thinking of the DeskPi board.

https://deskpi.com/products/deskpi-super6c-raspberry-pi-cm4-...

It's a different product.


No but that one is interesting.

I believe Jeff is how I heard about Turing in the first place (or at least the first time I saw their stuff working.)

Sure enough he still has the old videos up and I was wrong; it had 7 slots not 6, which means the new one loses almost half its slots, and explains some of my remembered disappointment.

https://www.jeffgeerling.com/blog/2020/raspberry-pi-cluster-...


Yeah; they went for expansion over raw compute capacity. It's an odd tradeoff, since I think for my own needs, I would like at least one M.2 slot, and maybe an A+E-key or mPCIe for a 4G/5G modem, but I don't need two, nor USB 3. That, plus going from an 8-port switch chip to more is probably expensive.

The DeskPi Super6C is more compact and IMO a little more compute focused. Both serve a purpose. I do like the BMC included on the Turing Pi 2, and it makes me more comfortable running it in production (easy to re-flash a node remotely if I need to).


I watched part of the video yesterday. I think he had 4 NVMe drives in M.2 slots on the underside of the board.

Their placement of the power supply is also unfortunate. They probably should move the slots over half an inch and rotate the power 90° so it doesn’t block airflow for front mounted case fans.

And it really should have a 2.5 Gb Ethernet port.


Imagine a Beowulf cluster of these...

And hot grits?

Oh, I'm so sorry in advance:

And Natalie Portmans...



