Hacker News

My pet project is to take these ideas and go to the logical end, arriving at a systolic array I call a BitGrid.

It's a Cartesian grid of cells built from 4-bit look-up tables, with one bit to and from each neighbor. This allows each output to be independent, maximizing utilization.

To solve timing issues, each cell would be clocked, with two phases for the grid in a checkerboard pattern. This keeps all inputs stable and makes timing deterministic. Unlike an FPGA, latency is not the primary limit on performance, because everything is thus pipelined.
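A minimal Python sketch of that two-phase checkerboard update (my own illustration; the cell layout of four 16-entry LUTs per cell, one per neighbor direction, follows the description in this thread, and everything else is assumed):

```python
# Hypothetical BitGrid-style two-phase simulator sketch. Each cell has four
# 4-input LUTs, one per output direction (N, E, S, W); outputs are latched,
# and cells update in a checkerboard pattern so every read sees a stable value.

W, H = 4, 4  # grid size (arbitrary for this sketch)

# luts[y][x][d] is a 16-entry truth table for output direction d (N, E, S, W)
luts = [[[[0] * 16 for _ in range(4)] for _ in range(W)] for _ in range(H)]
# out[y][x] is the cell's latched 4-bit output vector (N, E, S, W)
out = [[[0, 0, 0, 0] for _ in range(W)] for _ in range(H)]

def inputs(y, x):
    """Gather the 4 input bits from neighbors' latched outputs (edges read 0)."""
    n = out[y - 1][x][2] if y > 0 else 0      # neighbor above drives its south output
    e = out[y][x + 1][3] if x < W - 1 else 0  # neighbor right drives its west output
    s = out[y + 1][x][0] if y < H - 1 else 0  # neighbor below drives its north output
    w = out[y][x - 1][1] if x > 0 else 0      # neighbor left drives its east output
    return n | (e << 1) | (s << 2) | (w << 3)

def half_step(phase):
    """Update only the cells whose checkerboard color matches `phase`."""
    new = [(y, x, [luts[y][x][d][inputs(y, x)] for d in range(4)])
           for y in range(H) for x in range(W) if (x + y) % 2 == phase]
    for y, x, bits in new:
        out[y][x] = bits

def step():
    half_step(0)  # "black" cells read stable "white" outputs
    half_step(1)  # then vice versa
```

Programming a cell then amounts to filling in its four truth tables; a signal takes one full clock to hop from one cell to the next.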

I wrote a simulator, and started learning VHDL in order to program an FPGA board I bought for prototyping the concept. My eventual goal is an ASIC through Tiny Tapeout.

The big software hurdle is compiling expressions into a directed graph of bitwise operations.
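As a hypothetical illustration of what that compilation target could look like (the representation and names here are mine, not from the project), an expression lowered to a directed graph of single-bit operations:

```python
# Illustrative only: an expression like (a ^ b) & c lowered to a DAG of
# single-bit operations -- the form a BitGrid compiler would then have to
# place and route onto the grid.

# Each node: (op, *operand node ids); inputs are ('in', name).
dag = {
    0: ('in', 'a'),
    1: ('in', 'b'),
    2: ('in', 'c'),
    3: ('xor', 0, 1),
    4: ('and', 3, 2),   # final output: (a ^ b) & c
}

def evaluate(dag, env):
    """Evaluate the DAG for one set of input bits (ids are topologically ordered)."""
    vals = {}
    for nid in sorted(dag):
        node = dag[nid]
        if node[0] == 'in':
            vals[nid] = env[node[1]]
        elif node[0] == 'xor':
            vals[nid] = vals[node[1]] ^ vals[node[2]]
        elif node[0] == 'and':
            vals[nid] = vals[node[1]] & vals[node[2]]
    return vals[max(dag)]
```

Each DAG node is small enough to fit in one 4-input LUT; the hard part the comment alludes to is assigning nodes to grid cells so operands arrive on the same cycle.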

Because data only travels to the neighbors, all the wires in a chip are short, and an actual chip should need far fewer metallization layers than, say, a CPU.




https://tinytapeout.com/ now lets you purchase additional tiles for $50; each tile supports about 1k digital logic gates.

Next one closes June 1.

https://tinytapeout.com/faq/

You might enjoy this talk on wave pipelining from the last Latch-Up:

https://fossi-foundation.org/latch-up/2024#riding-the-wave-b...

https://www.cs.princeton.edu/courses/archive/fall01/cs597a/w...


I think the problem with such an architecture is that while latency no longer limits timing closure in the hardware sense, you still have to consider it from the software compilation perspective, and it might severely impact performance.

From what I've thought about this, the problem applies to any high fanout signal. For instance, if you want to implement a multiplexer for two n-bit operands, you'll need the select bit to be in ~n places at once (if you compile the n-bit multiplex into n 1-bit multiplexes). Compiling LUTs to route this select signal to the right places in the grid synchronously with the arrival of the data signals is complex and amounts to a similar sort of problem one faces with hardware compilation (akin to setup timing). In this architecture, you're replacing actual routing resources (wires) with LUT entries. Instead of considering the propagation of a signal down a wire in terms of nanoseconds, you'll be thinking about it in terms of cycles to traverse the grid. Unless the clock rate is absurdly high, signals like this will probably cause a performance problem for your design.
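To make the fanout example concrete, here is a toy sketch (my own, not from the thread) of an n-bit 2:1 multiplexer decomposed into n 1-bit multiplexers, each of which fits a single 4-input LUT; the single select bit must be replicated to all n of them:

```python
# Illustrative sketch: an n-bit 2:1 mux as n independent 1-bit muxes.
# On a grid, the one `sel` bit must be physically routed to all n cells,
# arriving on the same cycle as the data bits it gates.

def mux1(sel, a, b):
    """One 1-bit 2:1 mux -- a 3-input function, fits in a single 4-input LUT."""
    return (b if sel else a) & 1

def muxn(sel, a_bits, b_bits):
    """n-bit mux as n 1-bit muxes; `sel` fans out to every cell."""
    return [mux1(sel, a, b) for a, b in zip(a_bits, b_bits)]

# On a BitGrid, that fanout costs both cells and cycles: a balanced copy
# tree reaches n destinations in roughly ceil(log2(n)) hops, and every
# data bit must be delayed to match the select bit's arrival time.
```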

FPGAs/ASICs also have a problem of this flavor but it generally only happens for one signal: the clock. FPGAs address this by not using regular routing resources for the clock, and instead using special, pre-routed nets for clock distribution. I imagine you'd probably need a solution like this to deal with high fanout signals in an efficient way.


What you're describing is a cellular automaton, in the same vein as Conway's Game of Life. You can do lots of interesting things with those, but it's emphatically not where I'd start for a flexible computing platform.

Why not go the extra mile, and make each tile a small CPU? There's a Zachtronics game called TIS-100 with this premise.

https://store.steampowered.com/app/370360/TIS100/


Because each cell has its own state, and 64 bits of "program" (16 bits in each of the 4 LUTs), it's unlike the Game of Life, where the rule is the same for every cell.

I looked at a lot of choices for architecture, and wanted to allow data paths to cross without conflict, and the 4 in/4 out choice worked best without going too far.

Someone did work out how you could run the Game of Life on a BitGrid; it's in the Esolang wiki:

https://esolangs.org/wiki/Bitgrid

I see it as something like a Turing machine, a bit less abstract, and much, much faster at computing real results. I hope it can democratize access to PetaFLOPS.

The question I can't seem to find an answer to is simple: how much static power does a 4-bit-in/4-bit-out set of LUTs with a latch draw? How many femtojoules does it take to switch?

If those numbers are good enough, it's entirely possible that really fast compute is on the table of possibilities. If not, it's another Turing machine.


The power is easy enough to calculate.

Let's take a Kintex UltraScale+ from Xilinx as a fairly typical example of a modern FPGA. Relevant documentation is the UltraScale Architecture CLB User Guide [1] and the Xilinx Power Estimator spreadsheet [2].

Each "slice" contains two flip-flops and a lookup table with 6 input bits and 2 output bits. So two slices is enough to implement each cell with room to spare.

Let's say you have a 200 x 200 grid = 40k cells. That's 80k LUTs and 160k flip-flops. That's about 29% of the resources on a XCKU9P. If we assume a 100 MHz clock and 25% toggle rate (somewhat arbitrary), that's 4e12 state-changes per second. The spreadsheet indicates that circuit will consume 850 mW, or about 200 fJ per state-change.
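The arithmetic above checks out (the 850 mW figure comes from the Xilinx spreadsheet, not derived here):

```python
# Back-of-envelope check of the figures in this comment. The 850 mW input
# is taken from the Xilinx Power Estimator; the rest is arithmetic.
cells = 200 * 200                    # 40,000 cells
ffs = cells * 4                      # 160,000 flip-flops (4 latched outputs per cell)
f_clk = 100e6                        # 100 MHz clock
toggle = 0.25                        # assumed toggle rate
transitions = ffs * f_clk * toggle   # 4e12 state-changes per second
power = 0.850                        # watts, from the XPE spreadsheet
energy_fj = power / transitions * 1e15
print(round(energy_fj))              # → 212, i.e. about 200 fJ per state-change
```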

That said, this is NOT an efficient way to do arithmetic. You'd need N cells to do a fixed-point addition with N-bit arguments, and O(N^2) (give or take) to do a fixed-point multiplication. Floating point requires orders of magnitude more. There's a reason modern FPGAs have dedicated paths for fast addition and hardwired multiplier macros.

[1] https://www.xilinx.com/content/dam/xilinx/support/documents/...

[2] https://www.xilinx.com/products/technology/power/xpe.html
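To make the N-cells-per-addition point concrete, here is a sketch (my own, not from the thread) of an N-bit ripple-carry adder: each bit position is a full adder, i.e. two 3-input functions (sum and carry-out), so it fits in one 4-in/4-out cell, and the whole add takes roughly N cells:

```python
# Sketch: N-bit ripple-carry addition from full-adder cells. Sum and
# carry-out are each functions of 3 bits, so both fit within one
# 4-input/4-output cell -- hence ~N cells for an N-bit add.

def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def add_n(a_bits, b_bits):
    """Add two little-endian bit vectors; returns sum bits plus final carry."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

# e.g. 5 + 3 = 8, little-endian:
# add_n([1, 0, 1, 0], [1, 1, 0, 0]) -> [0, 0, 0, 1, 0]
```

On the grid the carry ripples one cell per clock, so this is pipelined but high-latency, which is exactly why hardened carry chains and multiplier macros exist on real FPGAs.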


>Why not go the extra mile, and make each tile a small CPU?

Xilinx's AI Engine and Ryzen AI are exactly that.


Interesting idea.

The main question for me is whether it will be efficient: you need programs/models that can be bin-packed into the size of your design and that need data at the same time in their various stages, otherwise a lot of your silicon will be under-utilized (since you don't have memory, you can't trade between compute and RAM to use your silicon die area efficiently).

Rather than our current breed of neural network architectures and models, you'd probably need to look into alternatives like spiking neural networks and see if they can store data as frequencies and activation patterns.


> since you don't have memory, you can't trade between compute and RAM to efficiently use your silicon die size

As I understand it, BitGrid would be a nice architecture for applications which:

  # Are compute-heavy
  # Don't need a lot of memory, or high memory bandwidth
  # Where the complex architecture (programming) of GPUs/FPGAs is a barrier
  # Where no dedicated accelerator exists for the computation (GPUs, DSPs, a CPU's vector or FP extensions, etc.)

If most or all of those conditions hold, BitGrid could be a good architecture. Ideally, configured/programmed as if it were a simple memory array.

No doubt such applications exist. But between those many competing options, I suspect BitGrid will have a hard time carving out a niche where its advantages weigh heavily enough to matter.

Disclaimer: if nothing else, I would love it as a dead-simple-to-use sea of LUTs. FPGAs are powerful but complicated beasts (and usually rely on closed-source toolchains).


DE Shaw has a systolic supercomputer: https://en.m.wikipedia.org/wiki/Anton_(computer)


How do you know of this? This is really cool



Cool idea. What does the expression/spec language look like? I would guess it has to be mapped... not straightforwardly. It seems like your goal is to maximize throughput, but then the datapath would have to be planar, no?


I don't have a language for it... I've been stuck in analysis paralysis for far, far too long on this one. The code I did write was all figured out by hand.

I'm thinking it'll have to end up being a set of equations, much like the tables that get spewed when you compile VHDL for an FPGA.


Systolic arrays are essentially how matmul is implemented in tensor cores in GPUs and TPUs.
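As a toy illustration of the idea (my own sketch, not any vendor's actual design), an output-stationary systolic matmul where PE (i, j) accumulates C[i][j] as skewed rows of A and columns of B stream past, seeing A[i][k] and B[k][j] at time i + j + k:

```python
# Minimal output-stationary systolic matmul sketch. The timing skew
# t = i + j + k models operands marching through the array one hop per
# cycle; each PE just multiply-accumulates whatever passes through it.

def systolic_matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for t in range(n + m + p):           # global clock ticks
        for i in range(n):
            for j in range(p):
                k = t - i - j            # which operand pair reaches PE (i, j) now
                if 0 <= k < m:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```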



