Fast linked lists (dygalo.dev)
182 points by dmitry_dygalo 14 days ago | 127 comments



It strikes me that the bottleneck for this problem isn't Vec or List; it's the serde_json Value type that needs to be used. That type is useful for serializing/deserializing values into Rust types, but if you're trying to validate JSON against a schema you don't actually need the JSON value data, just the types of the nodes (or more specifically, you only need some of the value data, and probably not much of it, so don't pay for it when you don't have to).

If you implemented your own parser for the schema and the JSON, and only used an AST to validate plus span information (which can just be a pair of u16s for the start/end of a token), then you could collect your errors very, very quickly and generate the error messages once validation is finished.

Heavily optimized compilers do this for semantic analysis and type checking: the IR they use can be constructed quickly and then combined with the input text to get helpful messages out, while the guts of the algorithm work only with semantic information in a structure that's compact and easy to access/manipulate.

All that said, serde_json is incredibly convenient, and giving it up to write your own parser is a big hammer for a problem that probably doesn't need it.


> All that said, serde_json is incredibly convenient, and giving it up to write your own parser is a big hammer for a problem that probably doesn't need it.

I had a thought in my reply [0] that might actually let him eat his cake and have it too in this regard: I think you can heavily abuse serde::de::Visitor to schema-validate without actually parsing (or with less parsing, at any rate). I went into more detail in my comment, but I wanted to ping you (@duped).

[0]: https://news.ycombinator.com/item?id=40357159
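
For anyone curious, a rough sketch of the shape of that idea (my own toy example, not the code from [0]; assumes serde/serde_json 1.x, and the hard-coded "schema" rule is invented): a Visitor that checks "an object whose values are all integers" while serde_json streams past the input, never building a Value tree.

    use std::fmt;

    use serde::de::{Error as _, MapAccess, Visitor};
    use serde::Deserializer as _;

    // The one (invented) rule this visitor enforces.
    struct ExpectIntMap;

    impl<'de> Visitor<'de> for ExpectIntMap {
        type Value = ();

        fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
            f.write_str("an object with integer values")
        }

        fn visit_map<A: MapAccess<'de>>(self, mut map: A) -> Result<(), A::Error> {
            while let Some(key) = map.next_key::<String>()? {
                // Drive the parser through the value, failing fast on a
                // type mismatch instead of materializing anything.
                map.next_value::<i64>()
                    .map_err(|_| A::Error::custom(format!("`{key}`: expected an integer")))?;
            }
            Ok(())
        }
    }

    fn validate(input: &str) -> Result<(), serde_json::Error> {
        let mut de = serde_json::Deserializer::from_str(input);
        de.deserialize_map(ExpectIntMap)
    }

    fn main() {
        assert!(validate(r#"{"a": 1, "b": 2}"#).is_ok());
        assert!(validate(r#"{"a": 1, "b": "oops"}"#).is_err());
    }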


Or, use sonic_rs::Value, which uses SIMD and arena allocation for keys; in my case, for big payloads, it was 8x faster than serde_json.

Or, use sonic_rs::LazyValue if you want to parse JSON on demand as you traverse it.


Hey, that's a nifty looking library. Thanks for pointing it out to me!

I've seen a lexer/parser scheme that encodes the lexer token type along with the token file location information into a u64 integer, something like

    struct Token {
        token_type: u8,      // what kind of token this is (identifier, number, ...)
        type_info: u8,       // extra type-specific info
        token_start: u32,    // offset into the source file.
        token_len: u16,      // the token text is source[start..start + len]
    }
It's blazing fast; the lexer/parser can process millions of lines per second. The type information is included, and the location information is included, so the token text can be recovered from the source on demand.


This is roughly what my JSON parser does. It does type-checking, albeit not with JSON Schema but with an object descriptor that you have to define to parse and generate JSON.

It's been developed for embedded systems (it was written originally for a NATS implementation in the Zephyr RTOS), so it's a bit limited and there's no easy way to know where some parsing/type validation error happened, but the information is there if one wants to obtain it: https://github.com/lpereira/lwan/blob/master/src/samples/tec...


I think the key insight is that the true benchmark is bytes/second (bandwidth) through the lexer/parser, so reducing the size of the output data (tokens/AST nodes) is a massive gain in the amount of data you can process in the same amount of time.

The fewer bytes you pack your data into, the more data you can process per second. Computers may be the most complex machines ever built, but the simple fact remains: having fewer things to touch means you can touch more of them in the same amount of time.


That is a really cool idea! Thank you


Just use Cap'n Proto. No deserialization needed!


What's your experience like using it? Is it ergonomic, or does it require you to do lots of type gymnastics?


This repo has a nice pub/sub implementation based on capnp: https://github.com/commaai/cereal/blob/master/log.capnp


The "maybe you don't need a linked list" proposal at the bottom seems significantly better than the options presented in the post:

- almost no cost in the non-erroring path

- no extra data structures to manage

- a lot less code

I think the post would benefit from a better analysis of why this doesn't work for them.


Indeed, I agree with your points.

This idea was added after I wrote the post and wasn't taken from my own optimization efforts in `jsonschema`. Originally, in `jsonschema` the output type is actually a badly composed iterator and I intended to simplify it to just a `Result<(), ValidationError>` for the article, but with this output, there are actually way better optimizations than I originally implemented.

If I'd discovered this idea earlier, I'd probably have spent more time investigating it.


Indeed, also building a linked list over the stack like that is a crafty but very weird design. Keep it simple.


Nice post, Dmitry!

Two suggestions: it’s not immediately obvious whether subsequent benchmark result tables/rows show deltas from the original approach or from the preceding one (it’s the latter, which is more impressive). Maybe call that out the first couple of times?

Second, the "using the single Vec but mutating it" option would presumably benefit from a reserve() or with_capacity() call. Since in that approach you push to the vector in both the erroring and non-erroring branches, the capacity doesn't have to be exact (you could do a BFS to find the maximum depth, but that doesn't strike me as a great idea) and could be some constant value, since a single block allocation is cheap in this context. (Additionally, the schema you are validating against defines a minimum depth whereas the default vec has a capacity of zero, so you're guaranteed that it's a bad choice unless you're validating an empty, invalid object.)

But I agree with the sibling comment from @duped that actually parsing to JSON is the biggest bottleneck and simply parsing to the minimum requirements for validation would be far cheaper, although it depends on if you’ll be parsing immediately after in case it isn’t invalid (make the common case fast, assuming the common case here is absence of errors rather than presence of them) or if you really do just want to validate the schema (which isn’t that rare of a requirement, in and of itself).

(Edit: I do wonder if you can still use serde and serde_json but use the deserialize module's `Visitor` trait/impl to "deserialize" to an `enum { Success, ValidationError(..) }` so you don't have to write your own parser, get to use the already crazy-optimized serde code, and still avoid actually fully parsing the JSON in order to merely validate it.)

If this were in the real world, I would use a custom slab allocator, possibly from storage on the stack rather than the heap, to back the Vec (and go with the last design with no linked lists whatsoever). But a compromise would be to give something like the mimalloc crate a try!


Thanks!

> Two suggestions: it’s not immediately obvious whether subsequent benchmark result tables/rows show deltas from the original approach or from the preceding one (it’s the latter, which is more impressive). Maybe call that out the first couple of times?

Agree!

> Second, the "using the single Vec but mutating it" option would presumably benefit from a reserve() or with_capacity() call. Since in that approach you push to the vector in both the erroring and non-erroring branches, the capacity doesn't have to be exact (you could do a BFS to find the maximum depth, but that doesn't strike me as a great idea) and could be some constant value, since a single block allocation is cheap in this context. (Additionally, the schema you are validating against defines a minimum depth whereas the default vec has a capacity of zero, so you're guaranteed that it's a bad choice unless you're validating an empty, invalid object.)

Oh, this is a cool observation! Indeed, it feels like `with_capacity` would help here.
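
Something like this, roughly (a sketch; the struct and the constant are invented for illustration, not taken from the article):

    /// A made-up upper bound on typical nesting depth; purely a tuning knob.
    const EXPECTED_DEPTH: usize = 16;

    struct Validator {
        /// Reused across calls, so the backing allocation happens only once
        /// (and grows past EXPECTED_DEPTH only for unusually deep documents).
        path: Vec<String>,
    }

    impl Validator {
        fn new() -> Self {
            Validator { path: Vec::with_capacity(EXPECTED_DEPTH) }
        }

        fn validate(&mut self /* , instance: &serde_json::Value */) {
            self.path.clear(); // keeps the capacity, drops the contents
            // ... push/pop path segments while walking the instance ...
        }
    }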

> But I agree with the sibling comment from @duped that actually parsing to JSON is the biggest bottleneck and simply parsing to the minimum requirements for validation would be far cheaper, although it depends on if you’ll be parsing immediately after in case it isn’t invalid (make the common case fast, assuming the common case here is absence of errors rather than presence of them) or if you really do just want to validate the schema (which isn’t that rare of a requirement, in and of itself).

My initial assumption was that usually the input is already parsed, e.g. validating incoming data inside an API endpoint which is then passed somewhere else in the same representation. But I think that is a fair use case too, and I was actually thinking of implementing it at some point via a generic `Json` trait which does not imply a certain representation.

> (Edit: I do wonder if you can still use serde and serde_json but use the deserialize module's `Visitor` trait/impl to "deserialize" to an `enum { Success, ValidationError(..) }` so you don't have to write your own parser, get to use the already crazy-optimized serde code, and still avoid actually fully parsing the JSON in order to merely validate it.)

Now that I've read the details, it feels like a really, really cool idea!

> If this were in the real world, I would use a custom slab allocator, possibly from storage on the stack rather than the heap, to back the Vec (and go with the last design with no linked lists whatsoever). But a compromise would be to give something like the mimalloc crate a try!

Nice! In the original `jsonschema` implementation the `validate` function returns `Result<(), ErrorIter>` which makes it more complex to apply that approach, but I think it still should be possible.


7-8 years ago I created GlueList (https://github.com/ertugrulcetin/GlueList) in order to build a faster version of LinkedList + ArrayList. It was a fun effort.


If I have this right, what you've built is this, storing M items in M / N nodes where N is the radix of the array?

    0                 1                        M / Nth node
    [ N elem ] [.] -> [N elem] [.] -> .... -> [M - (M / N) elem] [null] 
And so if you want to index the `i`th element you chase the pointers

    index(i):
        node = head
        acc = 0
        while acc + N <= i
            acc += N
            node = node.next
        return node.array[i - acc]
Or something like that? You can do better using a B-tree, exploiting the fact that you don't need keys: just store pointers in the branch nodes. This reduces the number of pointer dereferences in large arrays.

For example, say you have N = 8 and there are 1024 elements in the list: you would need 127 pointer dereferences to reach the end of the array. In a B-tree you only need 3. (Double-check my math, I might be off by one.)

This is the typical implementation for immutable/persistent vector types. If you keep the links between leaf nodes at the bottom of the tree, you've got a B+ tree and fast iteration through it as well.


You're probably right. I was in college back then and wanted to work on something cool; this was my idea.


It was fun because you didn't write it in Rust :)

https://rust-unofficial.github.io/too-many-lists/


Similar to an unrolled linked list: https://en.wikipedia.org/wiki/Unrolled_linked_list

But you size the nodes dynamically, rather than using a fixed size.


I believe that's called a "rope"


A rope is a binary tree of arrays, for an O(log N) index operation.


Ah, my bad. What's the name for the thing he's talking about?


It's called an "unrolled" linked list

Intrusive linked lists might bring down the allocations further, reduce the memory footprint, and, more importantly, improve locality when doing pointer chasing (a single predictable indirection vs. a double hard-to-predict indirection).


> Linked lists are taught as fundamental data structures in programming courses, but they are more commonly encountered in tech interviews than in real-world projects.

I beg to disagree.

In kernels, drivers, and embedded systems they are very common.


Most people who take data structures courses or conduct tech interviews don't end up working on kernels, drivers, or embedded systems, though. To me, it sounds like the point being made is that there are a large number of programmers who have learned about linked lists but haven't run into many cases where they needed them in the real world, and I think it's accurate.


This was my intention


Agree; I can't recall using anything more complicated than lists/arrays or hash tables (key/value stores) in practice, in many years of (mostly web application) programming. And even those I'm not coding from scratch; I'm using classes or functions that my programming language gives me. For anything more complicated than that, I'm using a database, which of course is using many data structures under the covers, but I don't directly touch those.


I used to use them all the time. Now, however? I would be hard-pressed not to use one of the many built-in vector/list/dict/hash types that most languages have these days. I would have to be doing something truly low-level, or chasing speed, to use one.


As a counterpoint, I’ve been working on collaborative text editing. I ended up implementing a custom b-tree because we needed a few features that I couldn’t find in any off the shelf library:

- My values represent runs of characters in the document.

- Inserts in the tree may split a run.

- Runs have a size: 0 if the run is marked as deleted, or the number of characters otherwise. The size changes as we process edits.

- Every inserted character has an ID. I need to be able to look up any run by its ID, and then edit it. (Including querying the run’s current position in the tree and editing the run’s size).

It's an interesting data structure problem, and it took a few weeks to reach a good solution (and a few more weeks later rewriting it in safe Rust and improving performance in the process).

I love this stuff. I think it’s pretty rare to find a reason to code your own collection types these days, but it certainly comes up from time to time!
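
For the curious, a rough sketch of the shape of such a run entry (invented names, not the actual code):

    /// Hypothetical: identifies the peer that inserted the characters.
    type AgentId = u32;

    /// One run of consecutively inserted characters in the document.
    struct Run {
        id: (AgentId, u64), // ID of the run's first character; an insert inside the run splits it
        len: u32,           // number of characters in the run
        deleted: bool,      // deleted runs stay in the tree so they can still be found by ID
    }

    impl Run {
        /// The "size" aggregated up the internal nodes of the tree,
        /// so position queries can skip whole subtrees.
        fn size(&self) -> u32 {
            if self.deleted { 0 } else { self.len }
        }
    }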


> I love this stuff. I think it’s pretty rare to find a reason to code your own collection types these days, but it certainly comes up from time to time!

Absolutely! That is one of the places where you want to use that style of programming, as the base classes and built-in structs don't really cover it yet.

Also, as a counterpoint, sometimes the built-in ones have some very interesting degenerate cases. I had one in an old library that basically doubled its memory footprint every time you exceeded its buffer. That was the point to change it to a fixed allocation or something else. If I had no idea of the fundamentals, I would have been totally in the weeds with no idea why it was doing that.


You need a linked list to write hello world in any Lisp, though.

Seems like the glaring exception to the rule!


As source code, but not necessarily as running code.

SBCL:

    * (defun hello-world () (write-string "hello world"))
    HELLO-WORLD

    * (disassemble #'hello-world)
    ; disassembly for HELLO-WORLD
    ; Size: 36 bytes. Origin: #x100311C85C                        ; HELLO-WORLD
    ; 5C:       AA0A40F9         LDR R0, [THREAD, #16]            ; binding-stack-pointer
    ; 60:       4A0B00F9         STR R0, [CFP, #16]
    ; 64:       EAFDFF58         LDR R0, #x100311C820             ; "hello world"
    ; 68:       570080D2         MOVZ NARGS, #2
    ; 6C:       29EC80D2         MOVZ TMP, #1889
    ; 70:       BE6B69F8         LDR LR, [NULL, TMP]              ; WRITE-STRING
    ; 74:       DE130091         ADD LR, LR, #4
    ; 78:       C0031FD6         BR LR
    ; 7C:       E00120D4         BRK #15                          ; Invalid argument count trap
The actual code for this example is machine code (which references a string, which is a vector), here without linked lists.

No, you don't need to use linked lists to send a string to the standard output port in most Lisps. You just call a function.


(write-string "hello world") has linked list semantics, the fact that compilers can be smart enough to ignore this is not the point I'm making.

(write-string (cadr '(write-string "hello-world"))) also has to work, so it's pretty easy to materialize that semantics at any point.


> has linked list semantics

Nope; it has linked list syntax (that certainly isn't ignored even by very good compilers). Syntax isn't semantics.

The semantics is that a function write-string is called, with a string as its argument.

The second expression has linked list processing in its semantics because you stuck in a cdr, as well as a quote which makes a piece of the program available as run-time list datum. (This is semantics that could be easily optimized away in the executable form, but I would say that it has linked list processing in its abstract semantics.)


> Nope; it has linked list syntax (that certainly isn't ignored even by very good compilers)

We're looking at the same string and seeing different things. You're seeing `(write-string "hello world")` as a program, I'm seeing it as an expression.

It has linked list semantics, which you can preserve until runtime like this `'(write-string "hello world")`. Note that I didn't change the string, I changed its context. If the original were living in a string, and you called read on it, it would become a linked list. If you called eval on that list, it would become a function call. This is basic stuff which I'm well aware you know, so I'm not sure what all the quibbling is about.

You literally need a linked list to write a program in a language in which the code becomes linked lists. And you're going to have a bad time writing Lisp if you don't get the hang of cons cells, early and often.

Is "code is data" true, or false? You're trying to have it both ways here.

The "large number of programmers who have learned about linked lists but haven't run into many cases where they needed them in the world world" include approximately zero programmers who have wielded Lisp in anger, is my point. I thought that was pretty clear from context, but I guess not.


The compiler manipulating linked lists to get your program into executable form, versus that program manipulating linked lists to do its job, are different things.

You're also likely provoking list manipulation by compiling and running a C hello world. The C grammar has lists, like lists of parameter declarators in a function, or argument expressions in a function call.

By the time you've compiled your C program, you've likely "used" lists, trees, and hash tables.

Classic line-numbered BASIC interpreters stored the lines as a linked list. For instance in

  10 PRINT "Hello"
  20 GOTO 10
the 10 line is stored as a datum which has a pointer to the 20. Some BASIC implementations implemented a forward GOTO as a linear search through the linked list starting at the current line, and a backwards GOTO as a scan from the beginning.

So in addition to not being able to write C hello without linked lists, the same holds for BASIC.

The way you present your idea about Lisp is harmful because it is likely to be misinterpreted and become misinformation in the eyes of those who are not so well informed.

The Lisp community still has to deal with nonsense like the claim that Lisp programs execute slowly because linked lists are continuously being traversed, or that the only data structure is the list.

Think about how you might be playing into that. What do you think it looks like when you say that you can't write a hello world without using linked lists?


Really only because they're so goddamn easy. I find myself using linked lists a lot less since adopting Rust for embedded code (even with no_std and no allocator, but especially when alloc-only std data structures are within reach).


Linked lists were heavily used in application software before the appearance of standard libraries and Java, which is when dynamically sizable array-based lists became common. There also wasn't a gap between the performance of linked lists and arrays before CPUs became significantly faster than RAM.


Modern processor and cache performance lend themselves to vectors and SSA. Linked lists just don't scale well outside of niche uses.


>In kernels, drivers, and embedded systems they are very common.

Out of all the programmers in the world, what percentage of them do you think work in the kernel/driver/embedded spaces?


First, my only guess is that everyone's guesses are going to be wildly wrong. People who work in such spaces will greatly overestimate. People who don't will greatly underestimate. (This is mostly due to how many comments I've read on HN that implicitly assume that most people's problems and perspectives are the same as the commenter's.)

Second, linked lists are useful in a lot more places than that. Probably a better proxy would be low-level coders. You almost always want a linked list somewhere when you're dealing with memory addresses and pointers. Maybe not for the primary collections, but there are always secondary ones that need to maintain a collection with a different membership or ordering, and vectors of pointers don't have many clear advantages over intrusive linked lists for those secondary collections.


Yeah, intrusive collections in C are the biggest use I've seen. I played with a physics engine a few years ago (Chipmunk2D) which made heavy use of intrusive linked lists to store all the objects in the world model. I suspect there are some clever data structures out there that might have better performance, but the intrusive linked list approach was simple and fast!


1%


More like 0.01% -- if we consider enterprise programmers, web programmers, and application/game programmers, which I'd expect to be the largest groups...


Yep. There aren't many software developers I know who have ever touched {Linux, macOS, FreeBSD, Windows} kernel code except for embedded devs, driver devs, security researchers, hobbyists, and SREs/PEs.

The % who have touched kernel bits, wrote a triangle engine scene renderer, wrote a compiler, touched server metal in production, have worked on ASICs, and can put together ML/AI building blocks shrinks way, way down to a handful of living humans.


If the value really is 0.01%, then the education pipeline needs to be revised; 'blue collar' programmer positions should be the majority.

This is not about blue collar vs. white collar. After all, corporate programmers and web programmers can both be blue collar, and systems programmers can be white collar (if we're using "blue collar" to mean smaller salaries and fewer perks; otherwise programming is a white collar job anyway).

This is about how many work in kernels/embedded systems/etc. vs. more common programming gigs. And that's less about how many are trained to do so and more about how many are needed.


There are plenty of good uses for linked lists and their variants. LRU lists come to mind; I couldn't bet that they're the most efficient implementation, but they're pretty darn good. Then, obviously, things like breadth-first search need a type of queue data structure. It often comes down to memory pressure: if you've got gigs to spare, then allocating a contiguous block of memory for a list of something isn't a big deal; if memory is tight and maybe fragmented, linked lists can and will get it done. They have their places.

I did start to encounter some fresh grads with degrees that said "computer science" on them who couldn't answer some basic linked list questions. I was beginning to think it was a bad set of questions until I hit those kids. If you claim to know "computer science" and don't know what a linked list is, especially beyond some textbook stuff, I'm probably not interested.


Why? Why would someone reach for a linked list in a kernel, driver, or embedded system?


No memory allocation/reallocation; preallocated resources are managed in e.g. a free list. Also, for things like packetized networks, lists are handy for filling as you progress down the stack while using fixed-size packet buffers, or for reassembling fragments.

In the embedded world, memory often needs to be exactly controlled, and allocation failures are fatal without a more complex MMU. In the kernel world, I believe the main reason is that allocations can block.


In kernels, it's usually hard to get general-purpose allocation working reliably in all contexts. And you need that for resizable vectors. With lists, you just need to be able to grab an element-sized block. Quite often, it's even done with the memory page granularity.

In addition, a lot of data structures might be shared across multiple cores. Linked lists can be traversed and mutated concurrently (although with a bit of care).


I wonder how much of that is due to kernel history and the influence of C idioms, rather than some inherent design superiority.

I'd be convinced once I see pure Rust kernels geared towards modern machines suddenly using linked lists everywhere. Otherwise I'm leaning towards it being a side-effect of the language choice and culture.

Also because I've seen the same kind of reasoning applied to compilers (e.g. "of course you need linked lists in compilers, they are extremely graph traversal heavy"). But one look at modern compilers implemented in Rust paints a very different picture, with index-based vectors, data-oriented design, and flattened ASTs everywhere.


Getting a general memory allocator working in kernel contexts is a hard task. You need to make sure it can't block and is re-entrant, that it doesn't cause fragmentation, and that it can be used from multiple threads.

It can be solved (or worked around), but it's understandable that people don't _want_ to do that.


Intrusive lists are really powerful for those kinds of scenarios, and technically are linked lists. They're widely used in the kernel, IIRC.


O(n) iteration but pretty much guaranteed O(1) for every other operation. If that's the semantic you need, then linked lists are your friend.


Any time you have a computer interacting with the outside world in an asynchronous fashion, you basically have to have some form of buffering, which takes the form of a queue/FIFO. A linked list is the most performant/natural way of modeling a queue in our ubiquitous computing infrastructure.

E.g., in a DMA-based Ethernet driver, the Ethernet MAC receives packets asynchronously from the processor, perhaps faster than the processor can ingest them. So the MAC interrupts the processor to hand it new packets, and the processor can't sit processing the packets in the interrupt context, so it needs to put them into some ordered list for processing later when it has downtime. In a true embedded system, the memory for this list is going to be fixed or statically allocated, but you still don't really want an array-style list with fixed indexing, as you'd have to manage what happens when the index wraps back around to 0, etc., so instead you just construct a linked list in that pre-allocated memory.

I wouldn't say linked lists aren't really used in high-level applications; as I said, they're used all over the place whenever you have external asynchronous communication. It's just that modern high-level frameworks/libs totally abstract this away from most people writing high-level code.


Easier to avoid allocation errors, e.g. in the Linux kernel. I think Alice Ryhl mentioned it here - https://www.youtube.com/watch?v=CEznkXjYFb4


How do linked lists prevent allocation errors? If anything, it would seem to make them worse.

In my experience in embedded, everything is hardcoded as a compile-time constant, including fixed-size arrays (or vectors of a fixed capacity).


Intrusive linked lists eliminate the allocation entirely. With a vector<Obj>, you have the Obj allocation and then potential vector-related reallocations. With an intrusive linked list, you only have the Obj allocation. So your code that adds/removes list entries does no additional allocation at all, it reuses a pointer or two that was allocated as part of the original Obj allocation. Often the list manipulation happens at a time when allocation failures are inconvenient or impossible to handle.
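
A tiny sketch of the structural difference (illustrative Rust, not any particular kernel's API; traversal is omitted since it needs unsafe code):

    use core::ptr::NonNull;

    struct Obj {
        data: u64,
        // The link lives *inside* the object, so putting an Obj on a list
        // rewires a pointer instead of allocating a separate list node.
        next: Option<NonNull<Obj>>,
    }

    struct List {
        head: Option<NonNull<Obj>>,
    }

    impl List {
        /// O(1) and allocation-free: the caller already owns the Obj's storage.
        /// (Unchecked contract: the Obj must outlive its membership in the list.)
        fn push_front(&mut self, obj: &mut Obj) {
            obj.next = self.head;
            self.head = NonNull::new(obj);
        }
    }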


In more complex embedded software you are likely to see free lists used to manage pools of preallocated resources (like event structs, etc.) or pools of fixed-size memory buffers.


In embedded, you often need message queues.

A common way to implement these is to have an array of messages, sized for the worst-case scenario, and use this as the message pool.

You keep the unused messages in a singly linked "free list", and keep the used messages in a doubly linked queue or FIFO structure.

That way you get O(1) allocation, de-allocation, enqueue and dequeue operations for your message queue.

Another example of this paradigm is job queues. You might have several actuators or sensors connected to a single interface and want to talk to them. The high-level "business" logic enqueues such jobs, and interrupt-driven logic works on these jobs in the background.

And because you only move some pointers around for each of these operations it is perfectly fine to do so in interrupt handlers.

What you really want to avoid is to move kilobytes of data around. That quickly leads to missing other interrupts in time.
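
The free-list half of that scheme looks roughly like this (sizes and names invented; a real system would layer the doubly linked used-queue on top):

    const POOL_SIZE: usize = 8;
    const NONE: usize = usize::MAX; // end-of-list marker

    struct Message {
        payload: [u8; 32],
        next: usize, // index of the next message in whichever list holds this one
    }

    struct Pool {
        slots: [Message; POOL_SIZE],
        free_head: usize,
    }

    impl Pool {
        fn new() -> Self {
            // Initially, every slot is chained onto the free list.
            let mut slots: [Message; POOL_SIZE] =
                std::array::from_fn(|i| Message { payload: [0; 32], next: i + 1 });
            slots[POOL_SIZE - 1].next = NONE;
            Pool { slots, free_head: 0 }
        }

        /// O(1): pop a slot off the free list; None means the pool is exhausted.
        fn alloc(&mut self) -> Option<usize> {
            let i = self.free_head;
            if i == NONE {
                return None;
            }
            self.free_head = self.slots[i].next;
            Some(i)
        }

        /// O(1): return a slot to the free list.
        fn free(&mut self, i: usize) {
            self.slots[i].next = self.free_head;
            self.free_head = i;
        }
    }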


I'd say most developers don't write kernels, drivers, or embedded software, at least from what I've seen. I am not saying that there aren't many such devs, but rather that there are fewer kernel devs than web devs.


I beg to disagree^2. Tasks, threads, and processes are often structured as rings where there is always a "next" to maintain simplicity of task switching. The overall architecture of resources is modelable as cyclic graphs but implemented as rings, deques, single LLs, and other data structures.


I don't do any of those things and I still use lists constantly. It's kinda strange to learn that many others seemingly don't use them at all.


What kinds of things are you using them for usually? Is it mostly in C/C++?


Linked lists shine when you can perform an O(1) remove operation given a reference to an object on the list. This is very common when using C structs and not possible in Java, for example.


These cases are usually cases where you want to use a (hash) set. If you're OK with changing everyone's indexing, the indexing didn't matter.


Linked list benchmarks are amazing... if you don't thrash your cache on inserts, so that all the elements end up contiguous. You get all the benefits of a vector and a linked list, without the reality that linked lists mostly don't get populated 100% consecutively and thus their nodes can be anywhere in memory.


Most languages where linked lists are an important part (Lisps, mostly) have well-optimized linked lists that end up with much better memory locality.


How does that work? I don't follow how any runtime or compile-time optimization can solve the problem of locality for a linked list. If the data wasn't allocated sequentially, then it's simply not going to be sequential (unless you move it).


Instead of thinking of a linked list structurally, think of it functionally. You can have a token that represents a location in the linked list. From that token you have a Fetch() operation that will fetch what is at that location, a Next() operation that will fetch the next token or some indication that you are done, and depending on your language, some sort of insert or append operation (mutability factors in here).

While there is a natural encoding of this process into a pointer to the target value, and a pointer to the next node, it is not the only encoding possible by any means.

A good exercise if you are having trouble with this is to implement a little linked list that performs those operations on something that simply backs to an array, including the blind O(n) copy when appending another list. It should be quite short. But that is not the only alternate implementation, either, and you can easily build others where the interface is maintained but you pay only log n additional factors maximum for operations, easily recovered from the constant factors which on modern hardware are often staggering.

Once you break the entanglement between memory representation and the API a data structure offers in your mind, many ways become obvious as to how to possibly improve this structure while maintaining the same operations, many of which of course already exist and have names.
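
For instance, the exercise might come out like this in Rust (a sketch, names mine): the linked-list interface - token, fetch, next, append - backed by a plain growable array.

    struct List<T> {
        items: Vec<T>,
    }

    #[derive(Clone, Copy)]
    struct Token(usize); // an opaque "location in the list"

    impl<T: Clone> List<T> {
        fn first(&self) -> Option<Token> {
            if self.items.is_empty() { None } else { Some(Token(0)) }
        }

        fn fetch(&self, t: Token) -> &T {
            &self.items[t.0]
        }

        fn next(&self, t: Token) -> Option<Token> {
            if t.0 + 1 < self.items.len() { Some(Token(t.0 + 1)) } else { None }
        }

        /// The "blind O(n) copy when appending another list".
        fn append(&mut self, other: &List<T>) {
            self.items.extend_from_slice(&other.items);
        }
    }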


Linked lists are, by definition, a value with a pointer to the next. That's what makes it "linked". The API you're describing is the iterator pattern, which indeed can be backed by almost anything (be it a linked list, array, tree, etc.).


Both replies as I write this are ignoring the question I was answering. The question was, how can a compiler implement a linked list as anything other than a linked list? The answer is what I gave. Being a compiler, it may also do things like an analysis of the use cases to prove that there are no relevant changes to performance (or that performance only improves) and fall back to a "true" linked list if it can't prove how it is being used, or, being a compiler, it may just tell you to get stuffed and deal with the performance differences. Depends on the choices the compiler author(s) made.

But just because Lisp has cons and cdr does not mean the compiler is obligated to only and exactly use things that structurally look like linked lists in the actual implementation. You need to break the relationship between memory layout and API to understand what is going on here. You may choose later to put them back together; the naive memory layout is often a good one. (Linked lists nowadays are arguably one of the larger exceptions to that, but there are still circumstances even so where that may not be an issue; see Erlang's memory management for instance and ponder how that impacts linked list locality.) But you can't understand what compilers may be doing if you do not realize that there is a distinction.


> Instead of thinking of a linked list structurally, think of it functionally

If your argument is that a linked-list-like data structure can be implemented using something other than a linked list, then I agree with you. Vector tries (for example) are great for that use case.

But a vector trie isn't a linked list, it's a vector trie. As such, it will be faster for some usage patterns, equal for some, and completely degenerate for some others. Just like any other data structure that isn't a linked list.

It wouldn't really be an "optimization" to implement my linked list code with something other than a linked list - it would be rewriting my code.


Every linked list discussion I've ever gotten into ends with someone finding a way to modify the std::list data structure in such a way that totally breaks the definition of the LL data structure. We may as well call it "the law of linked list arguments"


SBCL tries its hardest to allocate sequentially, then moves lists to be sequential in GC.


The entire theoretical advantage of linked lists over vectors (constant time insertion and deletion) comes from not sequentially allocating them.


I don't think this is what hayley-patton meant (although it's not crystal clear). I think SBCL does memory management with a bump allocator and a moving collector. So if you build a linked list sequentially, the cons cells will be allocated sequentially, and will be sequential in memory. If you build it in random order, they won't be. But when the collector runs, it will move the whole list, and it will move the cells in sequential order, and then it will be.


If you have a compacting GC you are going to move the nodes anyway, so why not reallocate them sequentially?


That implies you took the time to sort them, so your O(1) insertion time just became O(N log N). Sure, that cost is spread over how ever many inserts you did between GCs but it isn't O(1) anymore.


SBCL cheats and copies starting from the first cons cell that is referenced from outside the list; this tends to be the start of the list, and so traversing just works. The contiguous copying also ends when a cons cell which was already copied is found, so there can't be more discontinuities than there are cons cells referenced from outside the list, regardless of which order we discover the references.


There are other advantages of linked lists over vectors besides those that you've listed.


Well, I once wrote a small Scheme that did loop unrolling and allocated lists in as many steps as the unroll went, which was capped at a maximum of 5.

That worked surprisingly well, but there were lots of edge cases, since I am a professional bassoonist and not a compiler guy.

I generalized it and made all lists allocated in chunks of 16 and then let the GC truncate them. I spent a stupid amount of work on the first thing; the second was done in an afternoon.

Then there are all kinds of


A sufficiently smart compiler can detect that a new node gets allocated and then appended to a list, and allocate the node near the other nodes of that list.

To have a good chance that there is space near the other nodes, you'd have to reserve memory up front for each list you'd want to do that with, though, and that will kill cache locality before that optimization sets in. Corrections welcome, but I don't see that (easily) being a net win. But hey, with a sufficiently smart compiler at hand, you can do wonders. (Edit: when looking at pure functions, it often may not be that hard to discriminate between "this allocation will be returned" and "this allocation is temporary". For example, if you use map to apply a function to all elements in a list, and that function is pure, detecting that only the return values of that function will be visible after the function returns isn't that hard.)

An easier win is that Lisp garbage collectors tend to move memory, and thus can try to move nodes that reference each other close together. You get that for free with a semispace collector, but that has quite a few disadvantages, so you don't want to use one. A generational collector will also do it a bit, though (say you create a few million nodes of garbage while building a list of a few hundred results; after a generational collection, those few hundred cells will be closer together than before).


If needed, a linked list can be backed by a vector that is resized when needed.


If anyone knows of a detailed writeup of a language implementation like this, please link to it. The thread under this comment is disappointingly full of "a sufficiently smart compiler could ..." type stuff.


Why can't we 'simply' get a processor extension to mark data as a pointer, so that the prefetcher would actually know what to fetch?

From my understanding this is what led to the recent 'unpatchable' exploit in Apple's M1, but rather than trying to guess it by some heuristic, why not just give compilers the option to make that optimization explicitly?


Apple Silicon does this. If it prefetches something that looks like a pointer, it will also fetch the pointed-to memory. It's a cool feature, and is especially useful for Apple, since their Objective-C collections only store pointers – but it also can re-open the door for certain timing attacks by violating preconditions of constant-time cryptography algorithms.

https://en.wikipedia.org/wiki/GoFetch


I don't think making the CPU issue (likely bogus) prefetches for every field in the cache line that's marked as a pointer is really that good an idea. At best, you save a couple of cycles because the fetches start one or two instructions before the actual load instruction for "loading the linked pointer" is issued. At worst, you keep thrashing your cache loading data you're not going to read, delaying fetches of the data you will read.


For everything? Obviously no.

But for all the crazy optimizations modern compilers do, I don't see how marking pointers for more than a couple of them in a row is that crazy.


Because if you're issuing a bogus prefetch, you can't cancel it, can you? So that's 90-something cycles by which the fetch for your actual data gets delayed. Pointer chasing already strains the memory bandwidth; trying to request even more data from memory will only make things worse.

And unrolling the loop for traversing linked lists can be done, if you use a sentinel node instead of nullptr to signal the end:

        beqz    a0, .end
    .loop:
        ld      a1, 0(a0)   ; a1 = curr->data
        ld      a0, 8(a0)   ; curr = curr->next
        ; do something with payload in a1 here
        bnez    a0, .loop   ; continue while curr != null
    .end:
becomes

        la      s1, sentinel
        beq     a0, s1, .end
        ld      a1, 0(a0)
        ld      a2, 8(a0)
        ld      a3, 0(a2)
        ld      a4, 8(a2)
        ld      a5, 0(a4)
        ld      a6, 8(a4)
        beq     a6, s1, .trail
    .loop:
        ld      t0, 0(a6)
        ld      t1, 8(a6)
        ld      t2, 0(t1)
        ld      t3, 8(t1)
        ld      t4, 0(t3)
        ld      t5, 8(t3)
        ; do something with three payloads in a1, a3, a5 here
        mv      a1, t0
        mv      a2, t1
        mv      a3, t2
        mv      a4, t3
        mv      a5, t4
        mv      a6, t5
        bne     t5, s1, .loop
        mv      a0, a2
        beq     a2, s1, .end
     .trail:
        ld      a1, 0(a0)
        ld      a0, 8(a0)
        ; do something with payload in a1 here
        bne     a0, s1, .trail
     .end:
As you can see, "ld t3, 8(t1)" comes almost right after "ld t1, 8(a6)", with an intervening load from 0(t1), so prefetch won't noticeably help here, and if the address that ends up in t3 is not in the cache, then "ld t5, 8(t3)" will stall no matter what. And moving the speculative loads up in the loop body, before processing the payloads (using even more registers, as you can see), somewhat hurts the latency of processing the first three payloads.

Oh, and if you want to see something really crazy, look at e.g. splitting the branch instruction into prediction and resolution instructions [0].

[0] https://zilles.cs.illinois.edu/papers/branch_vanguard_isca_2...


This was the idea of Itanium. It failed mostly because of economics.

It turns out programmers, or rather their employers, don't really care about using hardware efficiently. They care about shipping things yesterday, because that's how business deals get closed, and making software efficient is secondary to money changing hands. Performance never really matters to business people after the checks are cashed.

Multicore computers have been ubiquitous for more than a decade, yet the overwhelming majority of software built today is single-threaded microservices, wherein they spend most of their time serializing and deserializing message payloads.

This is all really to say that most performance is already being left on the table for the majority of what computers are used for today.


I do want to say that I think the Itanic would have fared way, way better in a post-LLVM world where the importance of smart, optimizing compilers is much more valued and understood and language designers actively work hand-in-hand with compiler devs far more often (with much more significant back-and-forth from hardware manufacturers).


I don't think LLVM is particularly good at optimizing VLIW code.

Very good optimizing compilers existed before LLVM. Intel had one specifically for Itanium. It wasn't enough.


Why would llvm be particularly good at optimizing vliw code when there’s no demand for it to be? You can’t believe everything else would remain the same in the hypothetical I posed.


A) optimizing for VLIW is hard. B) the null hypothesis would be no change.

Given how much of today's computing depends on a database query, this is no surprise. Who cares about the micros you gain through added efficiency when there's a 100ms DB query in the path?


Apparently Apple and Intel do, since they introduced those changes into their silicon


Not every pipeline involves a db query.


Where do you think DBs run?


I mean, sure, I don't doubt that 99% of end-user programmers wouldn't look twice at something like this, but compiler designers probably would care.

And it's not like the companies aren't trying this idea (again, the M1 exploit). But for whatever reason they want to keep CPUs as black-box, perfect abstract machines, even though we know they aren't.


An exception to this is precisely parsing and validation, in which linked lists can and ideally should be populated consecutively (using an arena allocator). I agree that general-purpose linked lists are rarely optimal, but they are hard to beat for parsers, where they hit a Pareto sweet spot of implementation simplicity and good performance.
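
As an illustration of what "populated consecutively" buys you, here's a sketch of arena-style nodes linked by index (invented names): nodes created in parse order sit next to each other, so traversal walks forward through memory.

    struct Node<T> {
        value: T,
        next: Option<u32>, // index of the next node in the arena, not a pointer
    }

    struct Arena<T> {
        nodes: Vec<Node<T>>,
    }

    impl<T> Arena<T> {
        /// Nodes are handed out in allocation order, so lists built while
        /// parsing end up contiguous in the backing Vec.
        fn push(&mut self, value: T, next: Option<u32>) -> u32 {
            let id = self.nodes.len() as u32;
            self.nodes.push(Node { value, next });
            id
        }

        fn get(&self, id: u32) -> &Node<T> {
            &self.nodes[id as usize]
        }
    }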


This is such an important point.

Benchmarking is hard, and very often is done in ideal conditions, which never materialize in real use cases.


Another cool aspect of this, and (if I understand correctly) where Rust really helps you, is that you can explore multiple branches in parallel, since the linked list is immutable.


To be fair, lack of concurrency safety never stopped C and C++ devs from boldly doing just that, anyway!


What if the linked list is cyclic or doubly linked?


I was talking about this case in particular, where we know the list is basically isomorphic to the call stack.


Then Rust isn't the right tool for the job. Rust is great for tree-like structures which is 99% of what you encounter anyway. Unless you're writing a kernel or something.


I was hoping to see optimization of the actual linked list manipulation and traversal (pipelining? I'm not sure what you'd do), but this is still a neat post. It's cool to see thought put into various parts of the problem, like reallocation/preallocation, stack allocation, etc.


Reallocation/preallocation actually does increase manipulation and traversal performance.

The major slowness of linked lists is poor cache locality: new nodes can end up all over the memory space. However, if you can make sure all or part of the list lives in contiguous blocks of memory, then there's a good chance that when the CPU loads the next node it will also grab the next 3 nodes in the cache line.

The closer these node addresses are in memory, the faster things will be.


Linked lists can move items in O(1), but their O(N) search can be bad because of all the cache line misses.


Linked lists get a bum rap.

Yes, if you have a simple choice between a vector and a linked list, then the vector is vastly superior due to locality and (for non-intrusive linked lists) allocations. So much so that vectors often win even when you're doing lots of O(n) deletions that would be O(1) with a linked list.

But that doesn't mean that linked lists are useless! A vector gives you a single ordering. What if you need multiple? What if you need a free list, which you're never going to be iterating over but will just grab off an item at a time? I find it quite common to have one "natural" order, for which I will use a vector (or equivalently, a bump allocator of fixed-size items), and then one or several auxiliary orders like the entries belonging to a particular owner or the ones that will need to be visited for some sort of cleanup or an undo list or some sort of stack or queue of pending items to process. Your common iteration will be in memory order, but that doesn't mean you won't ever want to do different iterations.

It annoys me that this is always omitted, with the attitude that linked lists are obsolete and useless because vectors be moar better faster gooder, to the point that it's a waste of time to learn how to manipulate linked lists anymore.

I guess a lot of this is probably due to the popularity of high level languages, where you're just dealing with references to everything in the first place. But in those, the arguments in favor of vectors are often not valid because that blessed golden vector of Objects is giving you no more locality than giving your Object a `next` field: the vector of Objects is represented internally as a vector of pointers to the actual Object data, so your oh so nicely cached lookups are doing memory reads that are 100% pure overhead compared to following a `next` link from an Object that fits into a cache line. In both cases, your performance is going to be limited by the locality of your Object data, which is the same whether you have a vector of pointers or Object data with an intrusive pointer.

Also, if you have an existing setup and need to introduce a new list, it is sometimes far easier to drop in a `next` (and maybe `prev`) field than to refactor everything to accommodate a new vector. Especially since the vector will move all of your data when resizing the vector, which invalidates any pointers you might be using for other purposes. If you'll be iterating that list frequently, then the vector may very well be a good idea. If it's just for error cases or slow paths, then linked lists really aren't bad at all.

I'm not trying to argue for linked lists here so much as arguing against the blanket arguments against them.

</rant>


> What if you need multiple?

You can very much have a single vector owning the memory, and do all other ordering over auxiliary vectors of indices. Should be cheaper and faster than holding more linked lists.

If you want to remove elements, you can very much use tombstones to flag deleted elements and then clean up after some threshold.
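
Roughly like this (a sketch with invented names): one vector owns the elements, an auxiliary index vector provides the extra ordering, and tombstones mark deletions until a compaction pass.

    struct Entries<T> {
        data: Vec<T>,          // single owner, natural order
        by_priority: Vec<u32>, // an auxiliary ordering: indices into `data`
        dead: Vec<bool>,       // tombstones; compact once enough accumulate
    }

    impl<T> Entries<T> {
        fn iter_by_priority(&self) -> impl Iterator<Item = &T> {
            self.by_priority
                .iter()
                .filter(move |&&i| !self.dead[i as usize])
                .map(move |&i| &self.data[i as usize])
        }
    }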


> You can very much have a single vector owning the memory, and do all other ordering over auxiliary vectors of indices. Should be cheaper and faster than holding more linked lists.

Cheaper in space depends on the percentage of elements in the auxiliary list. An intrusive list has space for every element to be in the list, which is wasteful if few are. A vector that grows by doubling could waste nearly half of its elements. Indexes can often be smaller than pointers, though, which favors the vector approach.

Faster is debatable. Iterating the vector of indexes is quite fast, but indirecting into the data in a random order is still slow. An intrusive linked list doesn't need to do the former, only the latter. (Then again, it also bloats up the data slightly, which matters for small items since fewer fit in cache.)

The main reason why linked lists could still be at an advantage is if you don't want allocations in your auxiliary list manipulation path. Maybe you don't want the unpredictable timing, or maybe you can't handle failure.

I agree on tombstoning, but note that you're giving up some of the vector advantages by doing so. Your processing loop now has a much less predictable branch in it. (Though really, the vector probably still wins here, since the linked list is built out of data dependent accesses!)

Sometimes these auxiliary lists need to contain things that are no longer in the main list, too, as in the case of a free list. (And if you swap such things to the end, then you've just invalidated all of your indexes.) And non-intrusive linked lists can point to variable-size elements (though they lose most of the other benefits of an intrusive linked list.)

Anyway, it's the usual "use the right tool for the right job", I just claim that linked lists are sometimes the right tool.

(Random side note: I use "indexes" instead of "indices" these days, for a silly reason—"indices" is a fine word, I have no issue with it, but it seems to encourage some people to use the non-word "indice" which is an abomination.)


Linked lists are often the wrong choice because they're rarely performant or efficient when compared to vecs in the real world.

In general, use vecs until you absolutely can't.

Perhaps there are a few uses for LL's in limited circumstances.


Wouldn't using push_back prevent the need to reverse the Vec at the end?


This article is disingenuous with its Vec benchmark. Each call to `validate` creates a new Vec, which means you allocate + free the vec for each validation. Why not store the vec on the validator to reuse the allocation? Why not mention this in the article? I had to dig through the git history to find out whether the vec was getting reallocated. This feels like you had a cool conclusion for your article, 'linked lists faster than vec', but you had to engineer the vec example to be worse. Maybe I'm being cynical.

It would be interesting to see the performance of a `Vec<&str>` where you reuse the vector, but also a `Vec<u8>` where you copy the path bytes directly into the vector and don't bother with any pointer traversals. The example path sections are all very small - 'inner', 'another', 5 bytes, 7 bytes - less than the length of a pointer! Storing a whole `&str` is 16 bytes per element, and then you have to rebuild it anyway in the invalid case.

---

This whole article is kinda bad, it's titled 'blazingly fast linked lists' which gives it some authority but the approach is all wrong. Man, be responsible if you're choosing titles like this. Someone's going to read this and assume it's a reasonable approach, but the entire section with Vec is bonkers.

Why are we designing 'blazingly fast' algorithms with rust primitives rather than thinking about where the data needs to go first? Why are we even considering vector clones or other crazy stuff? The thought process behind the naive approach and step 1 is insane to me:

1. I need to track some data that will grow and shrink like a stack, so my solution is to copy around an immutable Vec (???)

2. this is really slow for obvious reasons, how about we: pull in a whole new dependency ('imbl') that attempts to optimize for the general case using complex trees (???????????????)

You also mention:

> In some scenarios, where modifications occur way less often than clones, you can consider using Arc as explained in this video

I understand you're trying to be complete, but 'some scenarios' is doing a lot of work here. An Arc<[T]> approach is literally just the same as the naive approach but with extra atomic refcounts! Why mention it in this context?

You finally get around to mutating the vector + using it like a stack, but then comment:

> However, this approach requires more bookkeeping and somewhat more lifetime annotations, which can increase code complexity.

I have no idea why you mention 'code complexity' here (complexity introduced by rust and its lifetimes), but fail to mention how adding a dependency on 'imbl' is a negative.


> how about we: pull in a whole new dependency ('imbl') that attempts to optimize for the general case using complex trees (???????????????)

To me this is a self answering question.


Yes, when I saw that the immutable path gets cloned with a key appended, I knew the benchmark was a strawman.


> This article is disingenuous with its Vec benchmark. Each call to `validate` creates a new Vec, which means you allocate + free the vec for each validation. Why not store the vec on the validator to reuse the allocation? Why not mention this in the article? I had to dig through the git history to find out whether the vec was getting reallocated.

The idea comes back to [0] which is similar to one of the steps in the article, and before adding `push` & `pop` I just cloned it to make things work. That's what Rust beginners do.

> This feels like you had a cool conclusion for your article, 'linked lists faster than vec', but you had to engineer the vec example to be worse. Maybe I'm being cynical.

Maybe from today's point in time, I'd think the same.

> It would be interesting to see the performance of a `Vec<&str>` where you reuse the vector, but also a `Vec<u8>` where you copy the path bytes directly into the vector and don't bother doing any pointer traversals. The example path sections are all very small - 'inner', 'another', 5 bytes, 7 bytes - less than the length of a pointer! storing a whole `&str` is 16 bytes per element and then you have to rebuild it again anyway in the invalid case.

Yeah, that makes sense to try!

> This whole article is kinda bad, it's titled 'blazingly fast linked lists' which gives it some authority but the approach is all wrong. Man, be responsible if you're choosing titles like this. Someone's going to read this and assume it's a reasonable approach, but the entire section with Vec is bonkers.

> Why are we designing 'blazingly fast' algorithms with rust primitives rather than thinking about where the data needs to go first? Why are we even considering vector clones or other crazy stuff? The thought process behind the naive approach and step 1 is insane to me:

> 1. I need to track some data that will grow and shrink like a stack, so my solution is to copy around an immutable Vec (???)

> 2. this is really slow for obvious reasons, how about we: pull in a whole new dependency ('imbl') that attempts to optimize for the general case using complex trees (???????????????)

That's clickbait-y, though none of the article's ideas aim to be a silver bullet. I mean, there are admittedly dumb ideas in the article, though I don't believe somebody would come up with a reasonable solution without trying something stupid first. However, I might have used better wording to highlight that, and mentioned that I came up with some of these ideas while working on `jsonschema` in the past.

> I understand you're trying to be complete, but 'some scenarios' is doing a lot of work here. An Arc<[T]> approach is literally just the same as the naive approach but with extra atomic refcounts! Why mention it in this context?

If you don't need to mutate the data and need to store it in some other struct, it might be useful, i.e. just to have cheap clones. But dang, that indeed is a whole different story.

> I have no idea why you mention 'code complexity' here (complexity introduced by rust and its lifetimes), but fail to mention how adding a dependency on 'imbl' is a negative.

Fair. Adding `imbl` wasn't a really good idea for this context at all.

Overall I think what you say is kind of fair, but our perspectives on the goals of the article are quite different (which does not invalidate the criticism).

Thank you for taking the time to answer!

- [0] - https://github.com/Stranger6667/jsonschema-rs/commit/1a1c6c3...


[flagged]


Opposite for me; I can read C/C++ fine including messy template code, but Rust's syntax is overall easier for me to read, and has a much simpler grammar.


I often think about this - my pet theory is that the kinds of smart people who can create new languages don't have a problem with new syntax, while us plebes struggle, learning both new concepts and new syntax at the same time. :)


this is interesting!



