FPGA Architecture for Deep Learning: Survey and Future Directions (arxiv.org)
128 points by matt_d 11 days ago | 52 comments





The big challenge when it comes to using FPGAs for deep learning is pretty simple: all of that reprogrammability comes at a performance cost. If you're doing something highly specific that conventional GPUs are bad at, like genomics research [1] or high-frequency trading [2], the performance tradeoff is worth it. But for deep learning, GPUs and AI ASICs are highly optimized for most of these computations, and an FPGA won't offer huge performance increases.

The main advantage FPGAs offer is the ability to exploit new model optimizations much earlier than ASIC implementations could. Those proposed ternary LLMs could potentially run much faster on FPGAs, because the hardware could be optimized exclusively for ternary ops. [3]
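
To make the ternary point concrete, here's a rough sketch (mine, not from the linked paper) of why those weights map so well onto custom logic: a "multiply" by a weight in {-1, 0, +1} collapses into an add, a subtract, or nothing at all, so the datapath needs no multipliers:

    # Rough sketch: a ternary dot product needs only adders/subtractors.
    def ternary_dot(activations, weights):
        acc = 0
        for a, w in zip(activations, weights):
            if w == 1:
                acc += a      # maps to an adder
            elif w == -1:
                acc -= a      # maps to a subtractor
            # w == 0 contributes nothing (no hardware at all)
        return acc

    print(ternary_dot([3, 5, 2, 7], [1, -1, 0, 1]))  # 3 - 5 + 0 + 7 = 5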

Not to toot my own horn, but I wrote up a blog post recently about building practical FPGA acceleration and which applications are best suited for it: https://www.zach.be/p/how-to-build-a-commercial-open-source

[1] https://aws.amazon.com/solutions/case-studies/munich-leukemi...

[2] https://careers.imc.com/us/en/blogarticle/how-are-fpgas-used...

[3] https://arxiv.org/abs/2402.17764


Are you trying to scare people away from FPGAs? GPUs aren't actually that _good_ at deep learning, but they are in the right place at the right time.

You can rent high-end FPGAs on AWS (https://github.com/aws/aws-fpga); there is no better time to get into FPGAs. On the low end there is the excellent ULX3S: https://hackaday.com/2019/01/14/ulx3s-an-open-source-lattice...

Modern FPGA platforms like Xilinx Alveo have 35TB/s of SRAM bandwidth and 460GB/s of HBM bandwidth. https://www.xilinx.com/products/boards-and-kits/alveo/u55c.h...


If I remember correctly, about 80% of a modern FPGA's silicon is used for interconnect. FPGAs have their uses, and very often a big part of their value is the field programmability. If that is not required, there is no good reason another solution (ASIC, GPU, etc.) couldn't beat the FPGA in theory. Now, in practice there are some niches where this is not absolutely true, but I agree with GP that I see challenges for deep learning.

An ASIC will always have better performance than an FPGA, but it will have an acceptable cost only if it is produced in large enough numbers. You will always want an ASIC, but only seldom will you be able to afford it.

So the ASIC vs. FPGA decision is trivial: it comes down to the estimated price of the ASIC, which in turn depends on the number of ASICs that would be needed.

The decision between off-the-shelf components, i.e. GPUs and FPGAs, is made based on performance per dollar and performance per watt, and it depends very strongly on the intended application. If the application must compute many operations on bigger numbers, e.g. FP32 or FP16, then it is unlikely that an FPGA can compete with a GPU. When arithmetic does not form the bulk of an algorithm, an FPGA may be competitive, but a detailed analysis must be made for any specific application.


I'm definitely not! I'm a hardware designer and I work with FPGAs all the time, both for work and for personal projects. Like with all things, there's a right tool for every job, and I think for modern DL algorithms like Transformers, GPUs and AI ASICs are the better tools. For rapid hardware prototyping, or for implementing specialized architectures, FPGAs are far better.

Large, fast FPGAs are great but very expensive; small, slow FPGAs are not practical for most solutions, where significantly cheaper ARM microcontrollers are used instead.

Cost and practicality are context dependent.

~500 GB/s is going to limit it to at best 1/4 the DL performance of an Nvidia GPU. I'm not sure what the floating point performance of these FPGAs is, but I imagine that also might set a fundamental performance limit at a small fraction of a GPU.
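
Back-of-the-envelope, assuming inference is memory-bandwidth bound and taking roughly 2 TB/s of HBM for an A100-class GPU (numbers are approximate; the model size is a made-up example):

    alveo_bw = 460e9     # B/s, Alveo U55C HBM (from the product page linked above)
    a100_bw  = 2.0e12    # B/s, A100-class GPU HBM (approximate)
    model    = 7e9       # bytes touched per token, e.g. a 7B model at 1 byte/weight

    print(alveo_bw / model)    # ~66 tokens/s if purely bandwidth-bound
    print(a100_bw / model)     # ~285 tokens/s
    print(alveo_bw / a100_bw)  # ~0.23, i.e. roughly the 1/4 above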

Well, models get quantized everywhere these days, and for 1-bit, 2-bit and 4-bit quantization I got very good inference performance (either throughput or latency) on CNNs and some RNNs on Alveo boards using FINN (so, mostly high-level synthesis and very little actual FPGA wrangling). No idea about the current status of all these, will read the paper though :-)

A $300 board (I'm including shipping and customs) is not low end. Low-end FPGA boards are ~$30 these days.

There are two other problems with FPGAs:

1. They are hard to use (program). If you're a regular ML engineer, there will be a steep learning curve with Verilog/VHDL and the specifics of the chip you choose, especially if you want to squeeze all the performance out of it. For most researchers it's just not worth it. And for production deployment it's not worth the risk of investing into an unproven platform. Microsoft tried it many years ago to accelerate their search/whatever, and I think they abandoned it.

2. Cost. High performance FPGA chips are expensive. Like A100 to H100 price range. Very few people would be willing to spend this much to accelerate their DL models unless the speedup is > 2x compared to GPUs.


FPGAs are also reasonably good at breadboarding modules to be added to ASICs. You scale down the timing and you can run the same HDL and perform software integration at the same time as the HDL is optimized.

Much cheaper and faster than gate level simulation.


I was part of a startup that did ternary CNNs on FPGAs in 2017. It involved a ton of nitty-gritty work and a massive loss of generality, and in the end a Raspberry Pi could solve the same problem faster and cheaper.

Every couple of years I revisit the FPGA topic, eager to build something exciting. I always end up with a ton of research, where I learn a lot but ultimately shy away from building something.

This is because I cannot find a project that is doable and affordable for a hobbyist but at the same time requires an FPGA in some sense. To put it bluntly: I can blink an LED for a fiver with a micro instead of spending hundreds on an FPGA.

So, assuming I am reasonably experienced in software development and electronics and I have 1000 USD and a week to spend.

What could I build that shows off the capabilities of an FPGA?


I work at one of the big 3 FPGA companies, so I can give you an idea of where our teams spend most of their time, and you can translate that into a hobbyist project as you will.

1. Video and Broadcast. Lots of things to be done here. New protocols are being introduced every year by IEEE for sending video between systems. Most cutting-edge cameras have some sort of FPGA inside doing niche image processing. You can get a sensor and build yourself your own Camera-on-Chip. It's a fantastic way to lose a year or two (I can attest to that). Some good material on the matter here: https://www.mathworks.com/discovery/fpga-image-processing.ht...

2. Compute Acceleration. This is more data centre-specific. SmartNICs, IPUs and the like. Hard to make a dent unless you want to spend 200k on a DevKit, but you could prototype one on a small scale. Some sort of smart FPGA switch that redirects Ethernet traffic between a bunch of Raspberry Pis depending on one factor or another. One company that comes to mind is Napatech. They make a bunch of really interesting FPGA server systems: https://www.napatech.com/products/nt200a02-smartnic-capture/

3. Robotics and Computer Vision. Plenty of low-hanging fruit to be plucked here. A ridiculous amount of IO, all needed to work in near real time. Hardware acceleration kernels on top of open standards like ROS 2. I always point people in the direction of the startup Acceleration Robotics in Barcelona for this. They're epic: https://github.com/ros-acceleration

4. Telecommunications. This is a bit of a dark art area for me, where the RF engineers get involved. From what my colleagues tell me, FPGAs are good for this because nothing short of a custom ASIC can service the massive MIMO antenna arrays, and the rate of innovation in this area means an ASIC made one year is obsolete the next. Software-defined radios are the current trend. You could have fun making your own radio using an FPGA: https://github.com/dawsonjon/FPGA-radio


Reasonably experienced and 'a week' can mean vastly different things... It's certainly easier to keep the cost down with longer time-frames.

For a focus on electronics rather than implementing some kind of toy 'algorithm accelerator', I find low-hanging/interesting projects where the combination of requirements exceed a micro's peripheral capabilities - i.e. multiple input/output/processing tasks which could be performed on a micro individually, but adding synchronisation or latency requirements makes it rather non-trivial.

- Very wide/parallel input/output tasks: ADC/DACs for higher samplerate/bitdepth/channel count than typically accessible with even high-end micros

- Implementing unique/specialised protocols which would have required bit-banging, abuse of timer/other peripherals on a micro (i.e. interesting things people achieve with PIO blocks on RP2040 etc)

- Signal processing: digital filters and control systems are great because you can see/hear/interact with the output which can help build a sense of achievement.

When starting out, it's also less overwhelming to start with smaller parts and allocate the budget to the rest of the electronics. They're still incredibly capable and won't seem as under-utilised. Some random project ideas:

- Driving large frame-buffers to display(s) or large sets of LED matrices at high frame rate - https://gregdavill.com/posts/d20/

- Realtime audio filters - the Eurorack community might have some inspiration.

- Multi-channel synchronous detection, lock-in amplifiers, distributed timing reference/control

- Find a sensing application that's interesting and then take it to the logical extreme - arrays of photo/hall-effect sensors sampled at high speed and displayed, accelerometers/IMU sensor fusion

- Laser galvanometers and piezo actuators are getting more accessible

- Small but precise/fast motion stages for positioning or sensing might present a good combination of input, output, filtering and control systems.

- With more time/experience you could branch into more interesting (IMO) areas like RF or imaging systems.

With more info about your interest areas I can give more specific suggestions.


Good list, thanks. I have a couple of years professional experience as a software dev and worked in the embedded space too. Nowadays I am in security and that is definitely an area of interest.

I only dabble recreationally in reverse engineering industrial/consumer-grade HW and following blogs/conferences, so I can only provide a rough shotgun of search terms to try and hit something you're interested in:

- The Glasgow interface explorer is an example of a smaller FPGA making interface level RE tooling more accessible.

- The Chipwhisperer hardware has a focus on power supply glitching, side-channel attacks and general hardware security education/testing.

- There's a handful of FPGA-based implementations intended for high-speed protocol sniffing/MitM (TCP/IP, USB and CAN bus are all pretty common) on GitHub etc.; Cynthion is one example.

- Some recent projects have been trying to implement and improve the FOSS ARM Cortex programming and trace experience; the Orbuculum ORBTrace probe is an example, though the benefits aren't fully realised yet.

- In an odd use-case for an FPGA, I've personally seen hardware that enforces brutal/paranoid DRM/licencing via customised downloaded bitstreams to guard against reverse-engineering/copying efforts, all to most likely run a soft-CPU. I've read (unsubstantiated) that this approach appears on some military hardware.

- Slightly adjacent to specific FPGA projects, but the SDR tooling ecosystem has lots of cool stuff to play with for wireless signal identification/spoofing/re-implementation. HackRF, LimeSDR, GNU Radio etc. If you want to get deep then there's lots of overlap with custom FPGA implementations.


Thanks a lot. This is a rabbit hole I will happily go down.

What about FPGAs as a means to experiment in ML and hardware architectures?

Um, no? The actual problem is that many modern FPGAs already have DPUs for machine learning integrated on them. Some Xilinx FPGAs have 400 "AI Engines" which provide significantly more compute than the programmable logic, the almost 2000 DSP slices, or the ARM cores. This means that the problem with FPGAs is primarily a lack of SRAM and limited memory bandwidth.

https://www.xilinx.com/products/boards-and-kits/vck190.html


Reconfigurable logic may be used to implement fairly small models, in applications where FPGAs are already employed and adding a coprocessor specifically for ML is either infeasible or doesn't make sense. As the paper mentions, you need hard logic blocks for arithmetic (if not floating point), and these are always in short supply. In DSP applications I've worked on, we used the FPGA for timing and I/O, to control jitter and sample from many ADCs in parallel - but apart from some filtering, we then ran the numbers on an adjacent non-reconfigurable core. You can get huge chips with a lot of hard logic built in, but they're relatively expensive compared to an FPGA + traditional coprocessor with shared memory. The high-end application-specific chips are shockingly expensive - worse than GPUs, because there is comparatively little market for them. We had one evaluation board that was like $100K iirc.

Yep. Google has TPUs, and AMD/Nvidia/Intel/Apple/Qualcomm all have tensor coprocessors now. From a CPU or GPU to an FPGA, the cost/benefit is huge. With every device having tensor cores, not so much. ASIP and ASICs are likely the way to go, at least for common operations like matrix multiplication.

Meanwhile, others are trying to bring back analog compute for similar kinds of workloads; it is going to be interesting to see which ones end up winning in the long run.

For example, https://mythic.ai/products/m1076-analog-matrix-processor/


My pet project is to take these ideas and go to the logical end, arriving at a systolic array I call a BitGrid.

It's a Cartesian grid of 4-input look-up tables, with one bit to and from each neighbor. This allows each output to be independent, maximizing utilization.

To solve timing issues, each cell would be clocked, with 2 phases for the grid, in a checkerboard pattern. This makes all inputs stable, and timing deterministic. Unlike an FPGA, latency is not the primary limit of performance, as everything is thus pipelined.

I wrote a simulator, and started learning VHDL in order to program an FPGA board I bought for prototyping the concept. My eventual goal is an ASIC through tiny tapeout.

The big software hurdle is compiling expressions into a directed graph of bitwise operations.

Because data only travels to the neighbors, all the lines in the chip are short, and it should be possible to use far fewer metallization layers in an actual chip than in a CPU, for example.
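
For the curious, roughly how I picture the update working (a simplified sketch of the idea, not my actual simulator code):

    # Toy BitGrid-style simulator: each cell has 4 input bits (one per
    # neighbor), 4 output bits, and one 16-entry LUT per output, i.e.
    # 64 bits of "program" per cell. Cells update in two checkerboard
    # phases, so a cell only ever reads neighbors that are held stable.
    import random

    W, H = 8, 8
    luts = [[[random.getrandbits(16) for _ in range(4)] for _ in range(W)] for _ in range(H)]
    outs = [[[0, 0, 0, 0] for _ in range(W)] for _ in range(H)]  # per-cell outputs: N, E, S, W

    def step(phase):
        for y in range(H):
            for x in range(W):
                if (x + y) % 2 != phase:
                    continue
                # one input bit from each neighbor's facing output (edges read 0)
                n = outs[y - 1][x][2] if y > 0     else 0
                e = outs[y][x + 1][3] if x < W - 1 else 0
                s = outs[y + 1][x][0] if y < H - 1 else 0
                w = outs[y][x - 1][1] if x > 0     else 0
                idx = n | (e << 1) | (s << 2) | (w << 3)  # 4 input bits -> LUT address
                outs[y][x] = [(luts[y][x][k] >> idx) & 1 for k in range(4)]

    for _ in range(10):
        step(0)  # "black" cells latch new outputs while "white" ones hold
        step(1)  # then "white" cells update, seeing stable "black" outputs

    print(outs[3][3])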


https://tinytapeout.com/ now lets you purchase additional tiles for $50; each tile supports about 1k digital logic gates.

Next one closes June 1.

https://tinytapeout.com/faq/

You might enjoy this talk on wave pipelining from the last Latch-Up:

https://fossi-foundation.org/latch-up/2024#riding-the-wave-b...

https://www.cs.princeton.edu/courses/archive/fall01/cs597a/w...


I think the problem with such an architecture is that latency is no longer a limit on timing closure in the hardware sense, but you still have to consider it from the software compilation perspective, and it might severely impact performance.

From what I've thought about this, the problem applies to any high fanout signal. For instance, if you want to implement a multiplexer for two n-bit operands, you'll need the select bit to be in ~n places at once (if you compile the n-bit multiplex into n 1-bit multiplexes). Compiling LUTs to route this select signal to the right places in the grid synchronously with the arrival of the data signals is complex and amounts to a similar sort of problem one faces with hardware compilation (akin to setup timing). In this architecture, you're replacing actual routing resources (wires) with LUT entries. Instead of considering the propagation of a signal down a wire in terms of nanoseconds, you'll be thinking about it in terms of cycles to traverse the grid. Unless the clock rate is absurdly high, signals like this will probably cause a performance problem for your design.

FPGAs/ASICs also have a problem of this flavor but it generally only happens for one signal: the clock. FPGAs address this by not using regular routing resources for the clock, and instead using special, pre-routed nets for clock distribution. I imagine you'd probably need a solution like this to deal with high fanout signals in an efficient way.
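
To put rough numbers on it (purely illustrative, assuming a naive scheme where sel is piped down a column one cell per phase and the data lanes are delayed to stay aligned with it):

    # Hypothetical cost of fanning out a select bit on a grid like this.
    def select_fanout_overhead(n):
        copies = n                      # one copy of sel per one-bit mux
        delay_cells = n * (n - 1) // 2  # lane i waits i phases: 0+1+...+(n-1)
        return copies, delay_cells

    for n in (8, 32, 64):
        copies, delay = select_fanout_overhead(n)
        print(f"{n}-bit mux: {copies} sel copies, ~{delay} delay cells just for alignment")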


What you're describing is a cellular automaton, in the same vein as Conway's Game of Life. You can do lots of interesting things with those, but it's emphatically not where I'd start for a flexible computing platform.

Why not go the extra mile, and make each tile a small CPU? There's a Zachtronics game called TIS-100 with this premise.

https://store.steampowered.com/app/370360/TIS100/


Because each cell has its own state, and 64 bits of "program" (16 bits in each of the 4 LUTs), it's unlike the Game of Life, where the rule is the same for every cell.

I looked at a lot of choices for architecture, and wanted to allow data paths to cross without conflict, and the 4 in/4 out choice worked best without going too far.

Someone did work out how you could run the Game of Life on a BitGrid; it's in the Esoteric Languages wiki:

https://esolangs.org/wiki/Bitgrid

I see it as something like a Turing machine, a bit less abstract, and much, much faster at computing real results. I hope it can democratize access to PetaFLOPS.

The question I can't seem to find an answer to is simple... how much power does a 4 bit in/out set of LUTs with a latch take statically? How many femtojoules does it take to switch?

If those numbers are good enough, it's entirely possible that really fast compute is on the table of possibilities. If not, it's another Turing machine.


The power is easy enough to calculate.

Let's take a Kintex UltraScale+ from Xilinx as a fairly typical example of a modern FPGA. Relevant documentation is the UltraScale Architecture CLB User Guide [1] and the Xilinx Power Estimator spreadsheet [2].

Each "slice" contains two flip-flops and a lookup table with 6 input bits and 2 output bits. So two slices is enough to implement each cell with room to spare.

Let's say you have a 200 x 200 grid = 40k cells. That's 80k LUTs and 160k flip-flops. That's about 29% of the resources on a XCKU9P. If we assume a 100 MHz clock and 25% toggle rate (somewhat arbitrary), that's 4e12 state-changes per second. The spreadsheet indicates that circuit will consume 850 mW, or about 200 fJ per state-change.

That said, this is NOT an efficient way to do arithmetic. You'd need N cells to do a fixed-point addition with N-bit arguments, and O(N^2) (give or take) to do a fixed-point multiplication. Floating point requires orders of magnitude more. There's a reason modern FPGAs have dedicated paths for fast addition and hardwired multiplier macros.

[1] https://www.xilinx.com/content/dam/xilinx/support/documents/...

[2] https://www.xilinx.com/products/technology/power/xpe.html
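
For anyone who wants to check the arithmetic:

    cells       = 200 * 200   # 40k cells
    luts        = 2 * cells   # two LUT sites per cell -> 80k LUTs
    ffs         = 4 * cells   # 160k flip-flops
    clock_hz    = 100e6
    toggle_rate = 0.25        # somewhat arbitrary, as noted above

    state_changes_per_s = ffs * clock_hz * toggle_rate
    power_w = 0.850           # Xilinx Power Estimator result quoted above

    print(state_changes_per_s)                    # 4e12
    print(power_w / state_changes_per_s * 1e15)   # ~212 fJ per state-change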


>Why not go the extra mile, and make each tile a small CPU?

Xilinx AI Engines and Ryzen AI are exactly that.


Interesting idea.

The main question for me is whether it will be efficient, in the sense that you need programs/models that can be bin-packed into the size of your design and that need data at the same time in their various stages; otherwise a lot of your silicon will be under-utilized (since you don't have memory, you can't trade between compute and RAM to make efficient use of your die area).

Rather than our current breed of neural network architectures and models, you'd probably need to look into alternatives like spiking neural networks and see if they can store data as frequency and activation patterns.


> since you don't have memory, you can't trade between compute and RAM to efficiently use your silicon die size

As I understand it, BitGrid would be a nice architecture for applications which:

- Are compute-heavy

- Don't need a lot of memory, or high memory bandwidth

- Where the complex architecture (and programming) of GPUs/FPGAs is a barrier

- Where the computation is something other than one for which dedicated accelerators exist (like GPUs, DSPs, or a CPU's vector extensions, FP math etc.)

If most or all of those conditions hold, BitGrid could be a good architecture. Ideally, configured/programmed as if it were a simple memory array.

No doubt such applications exist. But between those many competing options, I suspect BitGrid will have a hard time carving out a niche where its advantages weigh heavily enough to matter.

Disclaimer: if nothing else, I would love it as a dead-simple-to-use sea of LUTs. FPGAs are powerful but complicated beasts (and usually rely on closed-source toolchains).


DE Shaw has a systolic supercomputer: https://en.m.wikipedia.org/wiki/Anton_(computer)

How do you know of this? This is really cool


Cool idea. What does the expression/spec language look like? I would guess it has to be mapped... not straightforwardly. It seems like your goal is to maximize throughput, but then the datapath would have to be planar, no?

I don't have a language for it... I've been stuck in analysis paralysis for far, far too long on this one. The code I did write was all figured out by hand.

I'm thinking it'll have to end up being a set of equations, much like the tables that get spewed when you compile VHDL for an FPGA.


Systolic arrays are essentially how matmul is implemented in tensor cores in GPUs and TPUs.
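
As a toy illustration (output-stationary flavour, heavily simplified; real tensor cores and TPUs differ in the details): the operands are skewed so that A[i][k] and B[k][j] meet at PE(i,j) on cycle i + j + k, and each PE just multiply-accumulates whatever arrives.

    # Toy output-stationary systolic matmul.
    def systolic_matmul(A, B):
        n = len(A)
        C = [[0] * n for _ in range(n)]
        for t in range(3 * n - 2):          # enough cycles to drain the array
            for i in range(n):
                for j in range(n):
                    k = t - i - j           # which product reaches PE(i,j) this cycle
                    if 0 <= k < n:
                        C[i][j] += A[i][k] * B[k][j]
        return C

    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]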

Dang's got some work cut out for him tonight.

I've never seen so much spam in a thread before.

It's across everything on the front page. Hundreds of spam links.

It looks very much like a concerted attack of some kind. The spam seems quite ineffective, so I'm wondering if the "real" attack is trying to get the moderators to hell-ban a legitimate person to silence them or something.

I dunno, it's really weird. It's hard for me to take this spam at face value; I can't imagine that this website is a good target for such spam. It's clearly just here to annoy the moderators and make them tired vs... something. Reminds me of StarCraft muta harass: you're attacking the APM and attention span of the opponent, not actually trying to straight-up win with mutas. If that makes any sense.


Or it's some kind of weird-ass SEO campaign (not that it'll work; dang implemented rel=nofollow a few months ago). Man, I've been on HN for >11 years now and have literally never seen a spam campaign this intensive.

Given they all link to Discord, I wonder if @dang should consider contacting their abuse team to investigate and take down those channels.

Maybe it's exactly what they want (I didn't check the links). Maybe they want to get someone they don't like banned.

    // Hide HN comment rows whose text matches a given pattern.
    // pSel selects the candidate rows, cSel the text element inside each row,
    // and rxStr is the pattern (string or RegExp) to match against.
    function modifyElements(pSel, cSel, rxStr) {
        const regex = new RegExp(rxStr, 'i');
        const pEls = document.querySelectorAll(pSel);
        pEls.forEach(pEl => {
            const fEl = pEl.querySelector(cSel);
            if (fEl && regex.test(fEl.textContent)) {
                pEl.style.display = 'none'; // hide the whole row
            }
        });
    }

    // Pattern for the current spam wave.
    let rx = /(hi are u lonely|want (an )?ai gf?)/i;

    // '.athing.comtr' matches HN comment rows; '.comment' holds the text.
    modifyElements(".athing.comtr", ".comment", rx);

I'm glad to see this is being studied. I did a brief half-semester project summarizing the usefulness of accelerators after the end of Dennard scaling and the slowdown of Moore's law (this was when Intel was still pushing out incremental improvements on 14nm). The short summary is that accelerators offered substantial performance-per-watt improvements, at the cost of longer development timelines, cost of manufacturing in the case of ASICs and ASIPs, and the inability to fit large algorithms on FPGAs. Though Microsoft found a way to daisy-chain FPGAs for Bing's page rank, it just didn't produce the order-of-magnitude improvement necessary to justify moving to FPGAs. With smaller algorithms or bigger FPGAs or occasional operation offloading (like what we do now with TPUs and GPUs), they could be good candidates for accelerators.

Of course the real killer is how easy it is(n't) to develop for them. AMD offers much better performance per dollar than Nvidia, but of course their poor drivers make using their hardware a fool's errand.

RADs sound like a good idea. We should be making reprogrammable hardware available and easy to use for everyone.


Brainwave is/was very successful? Still serving your searches! (I work on one of the teams that came out of Brainwave, not with them directly.)

https://www.microsoft.com/en-us/research/project/project-bra...

https://www.microsoft.com/en-us/research/publication/a-confi...

https://www.microsoft.com/en-us/research/blog/a-microsoft-cu... -> This led to the OCP MX work:

https://www.opencompute.org/blog/amd-arm-intel-meta-microsof...


> AMD offers much better performance per dollar than Nvidia, but of course their poor drivers make using their hardware a fool's errand.

I would never use their IP or drivers, but Xilinx hardware is top notch. What drivers are you using? XRT?


Is there any small RISC-V soft core with big-ass SVE (scalable vector extensions)? I would like to play around with them, but the only options seem to be cloud instances like Graviton, and small (128-bit) cores like the LicheeRV Nano (C906), which also seems to only support a beta version of the standard.

You seem to be confusing two different things:

Graviton: has the ARM SVE extension

LicheeRV Nano: has the RISC-V Vector extension

Both ARM SVE and the RISC-V Vector Extension (RVV) are vector length agnostic, but quite different in their design.

Since RISC-V allows for open-source implementations, there are a few projects working on RVV implementations. Check out the t1 project: it's a configurable RVV implementation that is supposed to be driven by a small scalar core, e.g. RocketChip. [0] There are configurations for vector lengths from 512 bits to 16K bits. Connecting it to RocketChip isn't officially supported just yet, and it uses the Spike emulator for the scalar part.

They offer a pre-built Docker environment, so you can play around with the RTL simulation via "docker run --name t1 -it -v $PWD:/workspace --rm ghcr.io/chipsalliance/t1-machamp:latest /bin/bash". This drops you into a shell, and you can start simulating a 1K-vector-length processor with "ip-emulator --no-logging -C yourProgram"; see the tests/ directory for example code. At least in theory, but there might still be a few bugs.

[0] https://github.com/chipsalliance/t1


Have them self-modify.


