
The Future of Weak Memory (FOWM) 2024 talks: https://www.youtube.com/playlist?list=PLyrlk8Xaylp6u1S3R6gH0...

Abstract: https://popl24.sigplan.org/details/fowm-2024-papers/16/What-...

The C++ memory model was first included, along with thread support, in C++11, and then incrementally updated in later revisions. I plan to summarize what I learned, both as a C++ standards committee member, and more recently as a frequent user of this model, mentioning as many of these lessons as I have time for:

The C++ committee began with a view that higher-level synchronization facilities like mutexes and barriers should constitute perhaps 90% of thread synchronization, sequentially consistent atomics maybe another 9%, and weakly ordered atomics the remaining 1%. What I’ve observed in C++ code is often very far from that. I see roughly as much use of atomics as of mutexes, in spite of some official encouragement to the contrary. Much of that uses weakly ordered atomics. I see essentially no clever lock-free data structures, along the lines of lock-free linked lists, in the code I work with. I do see a lot of atomic flags, counters, fixed-size caches implemented with atomics, and the like. Code bases vary, but I think this is not atypical.
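
A minimal sketch (my illustration, not the talk's) of the kind of atomic flag and counter usage described above:

    #include <atomic>
    #include <cstdint>

    // Event counter: increments need atomicity but no ordering with
    // other memory, so relaxed suffices.
    std::atomic<std::uint64_t> events{0};
    void record_event() { events.fetch_add(1, std::memory_order_relaxed); }

    // One-shot flag: release/acquire orders the published payload.
    std::atomic<bool> ready{false};
    int payload;

    void publish() {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish
    }

    bool try_consume(int& out) {
        if (ready.load(std::memory_order_acquire)) {   // observe publication
            out = payload;                             // guaranteed to see 42
            return true;
        }
        return false;
    }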

In spite of their frequent use, the pay-off from weakly ordered atomics is decreasing, and is much less than it was in the Pentium 4 era. The perceived benefit on most modern mainstream CPUs seems to significantly exceed the actual benefit, though probably not so on GPUs. In my mind this casts a bit of doubt on the need to expose dependency-based ordering, as in the unsuccessful memory_order_consume, to the programmer, in spite of an abundance of use cases. Even memory_order_seq_cst is often not significantly slower. I’ll illustrate with a microbenchmark.
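
The talk's microbenchmark isn't reproduced here; the following sketch (names and structure are mine) shows the shape of such a comparison:

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    // Time n stores issued with a given memory order; on many modern
    // mainstream CPUs the relaxed-vs-seq_cst gap is smaller than often assumed.
    template <std::memory_order Order>
    double time_stores(std::atomic<int>& x, int n) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i)
            x.store(i, Order);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        std::atomic<int> x{0};
        const int n = 100000000;
        std::printf("relaxed: %.3fs\n", time_stores<std::memory_order_relaxed>(x, n));
        std::printf("seq_cst: %.3fs\n", time_stores<std::memory_order_seq_cst>(x, n));
    }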

We initially knew far too little about implementability on various architectures. This came back to bite us recently [Lahav et al.], and it remains scary in places. Hardware constraints forced us into a change that makes the interaction between acquire/release and seq_cst hard to explain, and far less intuitive than I would like. It seems to be generally believed that this is hard or impossible to avoid with very high levels of concurrency, as on GPUs.
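
For a flavor of the subtlety: the classic store-buffering litmus test (my sketch, not the talk's) is exactly where seq_cst and acquire/release part ways:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    // With seq_cst there is a single total order over all four accesses,
    // so r1 == 0 && r2 == 0 is impossible. Weaken the accesses to
    // acquire/release (or relaxed) and both loads may observe 0.
    int main() {
        std::thread t1([] { x.store(1, std::memory_order_seq_cst);
                            r1 = y.load(std::memory_order_seq_cst); });
        std::thread t2([] { y.store(1, std::memory_order_seq_cst);
                            r2 = x.load(std::memory_order_seq_cst); });
        t1.join(); t2.join();
    }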

We knew at the start that the out-of-thin-air problem would be an issue. We initially tried to side-step it, which was a worse disaster than the current hand-waving. This has not stopped memory_order_relaxed from being widely used. Practical code seems to work, but it is not provably correct given the C++ spec, and I will argue that the line between this and non-working code will inherently remain too fuzzy for working programmers. [P1217]
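
The standard load-buffering illustration of the out-of-thin-air problem (sketch mine, not the talk's):

    #include <atomic>

    std::atomic<int> x{0}, y{0};

    // No real implementation produces r1 == r2 == 42, but the C++ spec's
    // wording for relaxed atomics does not formally rule out such an
    // "out of thin air" execution, where each load justifies the other's store.
    void thread1() {
        int r1 = x.load(std::memory_order_relaxed);
        y.store(r1, std::memory_order_relaxed);
    }
    void thread2() {
        int r2 = y.load(std::memory_order_relaxed);
        x.store(r2, std::memory_order_relaxed);
    }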

Unsurprisingly, programmers very rarely read the memory model in the standard. We learned that compiler writers commonly do not either. The real audience for language memory models mostly consists of researchers who generate instruction mapping tables for particular architectures. The translation from a mathematical model to standardese is both error-prone and largely pointless. We need to find a way to avoid the standardese.
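
Concretely, those mapping tables answer how each atomic operation lowers on a given architecture; e.g., the published AArch64 mappings look roughly like this (sketch and annotations mine):

    #include <atomic>

    std::atomic<int> g{0};

    // Typical AArch64 lowering per the standard C++ atomics mappings:
    int  load_acquire()       { return g.load(std::memory_order_acquire); }  // ldar (or ldapr)
    void store_release(int v) { g.store(v, std::memory_order_release); }     // stlr
    int  load_seq_cst()       { return g.load(std::memory_order_seq_cst); }  // ldar
    void store_seq_cst(int v) { g.store(v, std::memory_order_seq_cst); }     // stlr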

Atomics mappings are part of the platform application binary interface, and need to be standardized. They often include arbitrary conventions that need to be consistently followed by all compilers on a system for all programming languages. Later evolution of these conventions is not always practical. I’ll give a recent RISC-V example of such a problem.


I'd recommend starting with the talk by Daniel Marshall (one of the authors) from Lambda Days 2023, "A Hitchhiker's Guide to Linearity", https://www.youtube.com/watch?v=QtlkqJGdnuM

The Granule Project's website, https://granule-project.github.io/, links relevant resources, too, in particular https://granule-project.github.io/granule.html and https://github.com/granule-project/granule/blob/main/example...

The latter tutorial is based on "Quantitative program reasoning with graded modal types" (ICFP 2019), with the talk and paper available at https://dl.acm.org/doi/10.1145/3341714

See also "Linearity and Uniqueness: An Entente Cordiale" (ESOP 2022), https://granule-project.github.io/papers/esop22-paper.pdf


Paper: https://2023.splashcon.org/details/iwaco-2023-papers/5/Borro...

> Hylo is a language for high-level systems programming that promises safety without loss of efficiency. It is based on mutable value semantics, a discipline that emphasizes the independence of values to support local reasoning. The result—in contrast with approaches based on sophisticated aliasing restrictions—is an efficient, expressive language with a simple type system and no need for lifetime annotations.

> Safety guarantees in Hylo programs are verified by an abstract interpreter processing an intermediate representation, Hylo IR, that models lifetime properties with ghost instructions. Further, lifetime constraints are used to eliminate unnecessary memory allocations predictably.
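
A rough C++ analogy (mine, not the paper's) of what the independence of values buys for local reasoning:

    #include <vector>

    // Under mutable value semantics, a collection passed by value is an
    // independent copy: the callee can mutate it freely, and the caller's
    // value is provably untouched -- no aliasing questions arise.
    std::vector<int> doubled(std::vector<int> v) {
        for (int& e : v) e *= 2;
        return v;
    }

    // By contrast, reference parameters reintroduce aliasing: may dst and
    // src overlap? Local reasoning alone cannot tell.
    void append_doubled(std::vector<int>& dst, const std::vector<int>& src);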

https://www.hylo-lang.org/

https://github.com/Hylo-lang/Hylo


Abstract:

"Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%."

Programming Languages Amenable to MEMOIR

"At its core, MEMOIR proposes collections as value types. In this paper, we implement a library in C/C++ to provide this functionality, however many languages exist which provide the guarantees needed for a MEMOIR compiler. Languages with mutable value semantics [48], which degrades references to second-class citizens, are amenable to SSA construction, as they are analogous to our MUT library. Such languages include Swift’s struct types [69] and Hylo [70].

Languages with single-ownership, i.e., “borrowing”, which guarantee that only one mutable reference will exist at a time, can be used to construct a MEMOIR program. An example of this is Rust [71], which is steadily entering the programming zeitgeist. Similarly to Rust, newer languages such as Mojo [72] and Vale [73] have similar ownership models. Of note, use ϕ’s cannot be constructed for these languages, as multiple immutable references may exist at once. While the aforementioned languages are promising directions of future work, the lack of accepted benchmark suites implemented in them, unlike C/C++, was deemed too large a barrier to adoption in our research at present.

Collection-oriented languages [74–77] have existed for many years now. APL [74] and SETL [77] serve as prime examples of their philosophy, focusing on arrays and sets, respectively, as prime concepts of the language. As such they are interesting source languages for compilation and, furthermore, their implementations provide a wealth of resources on optimizing collection-oriented programs [78]. A recent example of these concepts being exploited outside of their original languages is parallel block-delayed sequences [79], which implements loop-fusion techniques on sequences as a library for Parallel ML and C++. Investigating the extent these optimizations could be performed statically with MEMOIR provides an interesting starting point for this line of research."
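
A tiny hypothetical illustration (my own, not the paper's actual MUT/MEMOIR API) of collections as value types, and why they admit SSA-style, element-level reasoning:

    #include <cstddef>
    #include <vector>

    // If a collection is a value, each mutation can be modeled as defining
    // a new SSA version of the whole collection, so an analysis can track
    // the state of individual elements at every program point.
    std::vector<int> write_element(std::vector<int> c, std::size_t i, int v) {
        c[i] = v;    // defines a new version; the caller's version is unchanged
        return c;
    }

    // Usage: auto c2 = write_element(c1, 0, 42);
    // c1 still denotes its old state, so an analysis can conclude, e.g.,
    // that c2[0] == 42 while any fact known about c1[0] remains valid.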


Likely worth mentioning that Andy (the author) has been organizing an ML⇄DB Seminar Series (Machine Learning for Databases + Databases for Machine Learning) for the past few months (Fall 2023); materials & lectures: https://db.cs.cmu.edu/seminar2023/, https://www.youtube.com/playlist?list=PLSE8ODhjZXjYVdJKka5g3...


FWIW, its use has grown over time, with new applications most significantly including MLIR (Multi-Level IR compiler framework, which allows you to define your own intermediate representations as well as compiler passes operating on them), https://mlir.llvm.org/docs/DefiningDialects/Operations/

The docs have been improving, https://llvm.org/docs/TableGen/

I also like these resources:

Lessons in TableGen (FOSDEM 2019; Nicolai Hähnle), https://fosdem.org/2019/schedule/event/llvm_tablegen/

- What has TableGen ever done for us?, http://nhaehnle.blogspot.com/2018/02/tablegen-1-what-has-tab...
- Functional Programming, http://nhaehnle.blogspot.com/2018/02/tablegen-2-functional-p...
- Bits, http://nhaehnle.blogspot.com/2018/02/tablegen-3-bits.html
- Resolving variables, http://nhaehnle.blogspot.com/2018/03/tablegen-4-resolving-va...
- DAGs, http://nhaehnle.blogspot.com/2018/03/tablegen-5-dags.html


Abstract:

> It is 50 years since Tony Hoare observed “certain close analogies between the methods used for structuring data and the methods for structuring a program which processes that data”. But programs have both static structure (following the data) and dynamic structure (execution), and these need not coincide. For example, breadth-first tree traversal should be executed across the grain of the tree structure. I will present a technique for resolving the tension between these conflicting forces: the static structure specifies a multi-phase computation, whose dynamic execution structure might be entirely different. The appropriate abstraction turns out to be an applicative functor - similar to but different from the free applicative.

> This is joint work with Oisin Kidney, Tom Schrijvers, and Nicolas Wu.

Related paper:

"Phases in Software Architecture" (architectural pearl), the First ACM SIGPLAN Workshop on Functional Software Architecture (FP in the Large): http://www.cs.ox.ac.uk/jeremy.gibbons/publications/phases.pd... - https://icfp23.sigplan.org/details/funarch-2023/5/Phases-in-...

> The large-scale structure of executing a computation can often be thought of as being separated into distinct phases. But the most natural form in which to specify that computation may well have a different and conflicting structure. For example, the computation might consist of gathering data from some locations, processing it, then distributing the results back to the same locations; it may be executed in three phases—gather, process, distribute—but mostly conveniently specified orthogonally—by location. We have recently shown that this multi-phase structure can be expressed as a novel applicative functor (also known as an idiom, or lax monoidal functor). Here we summarize the idea from the perspective of software architecture. At the end, we speculate about applications to choreography and multi-tier architecture.



The following is a pretty good overview:

"A Survey of Computer Architecture Simulation Techniques and Tools" - IEEE Access 2019 - Ayaz Akram, Lina Sawalha - https://ieeexplore.ieee.org/document/8718630

For more see also: https://github.com/MattPD/cpplinks/blob/master/comparch.md#e...


Abstract:

> Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and the GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance.

> This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, an AST and a bytecode one, based on RPython as well as GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations.

> Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack.

> Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.
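
For readers new to the distinction, a minimal C++ sketch (mine, not the paper's) of the two interpreter styles:

    #include <cstddef>
    #include <memory>
    #include <vector>

    // AST interpreter: evaluation recurses over a tree of nodes.
    struct Expr {
        virtual ~Expr() = default;
        virtual int eval() const = 0;
    };
    struct Lit : Expr {
        int v;
        explicit Lit(int v) : v(v) {}
        int eval() const override { return v; }
    };
    struct Add : Expr {
        std::unique_ptr<Expr> l, r;
        Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
            : l(std::move(l)), r(std::move(r)) {}
        int eval() const override { return l->eval() + r->eval(); }
    };

    // Bytecode interpreter: a flat, compact instruction stream driving a
    // dispatch loop over a reified operand stack (the extra memory the
    // abstract mentions for small applications).
    enum Op : unsigned char { PUSH, ADD_OP, HALT };

    int run(const std::vector<unsigned char>& code) {
        std::vector<int> stack;
        for (std::size_t pc = 0; ; ++pc) {
            switch (code[pc]) {
            case PUSH:   stack.push_back(code[++pc]); break;
            case ADD_OP: { int b = stack.back(); stack.pop_back();
                           stack.back() += b; } break;
            case HALT:   return stack.back();
            }
        }
    }
    // run({PUSH, 1, PUSH, 2, ADD_OP, HALT}) == 3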

Artifact: https://zenodo.org/record/8198950

