
They haven't (eaten the world etc). They just happen to be the models that trend hard right now. I bet if you could compare like for like you'd be able to see some improvement in performance from Transformers, but that'd be extremely hard to separate from the expected improvement from the constantly increasing amounts of data and compute. For example, you could, today, train a much bigger and deeper Multi-Layer Perceptron than you could thirty years ago, but nobody is trying because that's so 1990s, and in any case they have the data and compute to train much bigger, much more inefficient (contrary to what you say, if I got that right) architectures.

Wait a few years and the Next Big Thing in AI will come along, hot on the heels of the next generation of GPUs, or tensor units or whatever the hardware industry can cook up to sell shovels for the gold rush. By then, Transformers will have hit the plateau of diminishing returns, there'll be gold in them there other hills and nobody will talk about LLMs anymore because that's so 2020s. We've been there so many times before.




> much more inefficient

The tricky part here is that "efficiency" is not a single dimension! Transformers are much more "efficient" in one sense, in that they appear to be able to absorb much more data before they saturate; they're generally less computationally efficient in that you can't exploit symmetries as hard at implementation time, for example.

Let's talk about that in terms of a concrete example: the big inductive bias of CNNs for vision problems is that CNNs essentially presuppose that the model should be translation-invariant. This works great (it speeds up training and makes it more stable) until it doesn't and that inductive bias starts limiting your performance, which happens in the large-data limit.
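
To make "translation-invariant" concrete, here's a toy sketch of my own (plain NumPy, circular convolution so the property is exact): shifting the input just shifts the output, because the same kernel weights are reused at every position.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=32)   # a 1-D "image"
    k = rng.normal(size=5)    # one shared kernel

    def circ_conv(signal, kernel):
        # the same weights applied at every (wrapped) position -- that's the weight sharing
        n = len(signal)
        return np.array([sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
                         for i in range(n)])

    shift = 3
    lhs = circ_conv(np.roll(x, shift), k)   # shift first, then convolve
    rhs = np.roll(circ_conv(x, k), shift)   # convolve first, then shift
    assert np.allclose(lhs, rhs)            # equivariance: the two operations commute

Nothing forces a fully-connected layer to satisfy that identity; that's the bias being traded away.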

Fully-connected NNs are more general than transformers, but they have _so many_ degrees of freedom that the numerical optimization problem is impractical. If someone figures out how to stabilize that training and make these implementable on current or future hardware, you're absolutely right that you'll see people use them. I don't think transformers are magic; you're entirely correct in saying that they're the current knee on the implementability/trainability curve, and that can easily shift given different unit economics.
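
For a sense of scale (back-of-the-envelope numbers of my own, not from any paper), compare one 3x3 conv layer to one fully-connected layer on the same image:

    # one layer on a 224x224 RGB input, 64 output channels
    H = W = 224
    C_in, C_out, k = 3, 64, 3

    conv_params = k * k * C_in * C_out               # 1,728 weights, shared across positions
    dense_params = (H * W * C_in) * (H * W * C_out)  # ~4.8e11 weights, one per input/output pair

    print(f"conv:  {conv_params:,}")
    print(f"dense: {dense_params:,}")

That ratio is the "so many degrees of freedom" problem in one number.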

I think one of the fundamental disconnects here is that people who come at AI from the perspective of logic _down_ think of things very differently from people like me who come at it from thermodynamics _up_.

Modern machine learning is just "applications of maximum entropy", and to someone with a thermodynamics background, that's intuitively obvious (not necessarily correct! just obvious): in a meaningful sense the _universe_ is a process of gradient descent, so "of course" the answer for some local domain model is maximum-entropy too. In that worldview, the higher-order structure is _entirely emergent_. I'm, by training, a crystallographer, so the idea that you can get highly regular structure emerging from merciless application of a single principle is just baked into my worldview very deeply.
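
For concreteness, the textbook version of that claim (standard derivation, nothing specific to this thread): maximize entropy subject to normalization and a constraint on the expected "energy", and the Gibbs/softmax form drops straight out of the Lagrangian:

    \max_{p} \; -\sum_i p_i \log p_i
    \quad \text{s.t.} \quad \sum_i p_i = 1, \qquad \sum_i p_i E_i = \bar{E}

    \mathcal{L} = -\sum_i p_i \log p_i
                  - \alpha\Big(\sum_i p_i - 1\Big)
                  - \beta\Big(\sum_i p_i E_i - \bar{E}\Big)
    \;\Rightarrow\;
    p_i = \frac{e^{-\beta E_i}}{\sum_j e^{-\beta E_j}}

Set E_i to minus the logits and beta to 1 and that's literally a softmax layer; the physicist calls the same functional form a Boltzmann distribution.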

Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.


>> Let's talk about that in terms of a concrete example: the big inductive bias of CNNs for vision problems is that CNNs essentially presuppose that the model should be translation-invariant. This works great (it speeds up training and makes it more stable) until it doesn't and that inductive bias starts limiting your performance, which happens in the large-data limit.

I don't know about that, I'll be honest. Do you have a reference? I suspect it won't disagree with what I'm saying: that neural nets just can't use a strong enough bias to avoid overfitting. I didn't say that in so many words above, but that's the point of having a good inductive bias: you're not left, as a learner, at the mercy of the data.

>> Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.

No, that's absolutely a standard assumption in logic :) Think of grammars; as Chomsky likes to say, human language "makes infinite use of finite means" (quoting Wilhelm von Humboldt). Chomsky of course believes that human language is the result of a simple set of rules, very much like logical theories. Personally, I have no idea, but Chomsky consistently, even today, pisses off all the linguists and all the machine learning people, so he must be doing something right.
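
The "infinite use of finite means" point is easy to make concrete with a toy grammar (made up for illustration, any recursive rule will do):

    import random

    # three nonterminals, a handful of rules, infinitely many sentences
    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["the", "cat"], ["the", "dog"], ["NP", "that", "VP"]],  # <- recursion
        "VP": [["sleeps"], ["sees", "NP"]],
    }

    def generate(symbol="S"):
        if symbol not in GRAMMAR:                 # terminal word
            return [symbol]
        expansion = random.choice(GRAMMAR[symbol])
        return [word for part in expansion for word in generate(part)]

    print(" ".join(generate()))  # e.g. "the dog that sees the cat sleeps"

Finite rules, unbounded language.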

Btw, I'm not coming only from the perspective of mathematical logic. It's complicated, but, e.g., my MSc was in data science and my PhD in a symbolic form of machine learning. See, learning and logic, or learning and reasoning, are not incompatible; they're fundamentally the same.


> They haven't (eaten the world etc).

To clarify what I mean on this specific bit: the SOTA results in 2D and 3D vision, audio, translation, NLP, etc. are all transformers. Past results do not necessarily predict future performance, and it would be absurd to claim that as an immutable state of affairs, but it's certainly interesting that all of the domain-specific architectures have been flattened in a very short period of time.


Thanks for clarifying. Well, my argument is that the state of the art is more the result of trends in research than of the true capabilities of different approaches.

Take my little rant about Rich Sutton's (a god, btw) Bitter Lesson with respect to RL. So, there's AlphaGo, AlphaZero and μZero, yes? AlphaGo knows the rules of Go and starts with some expert knowledge, and beats every human Go player. AlphaZero knows the rules of Go but has no expert knowledge, and it beats AlphaGo. And μZero neither knows the rules of Go nor has expert knowledge, and it beats AlphaZero, and can also play chess, shogi and Atari games, with one hand while eating a banana. Do you know how hard it is to eat a banana with one hand? Unpeeled!

Easy to draw a conclusion from that. Except all those systems were developed and used by DeepMind, and there are very few entities besides DeepMind that can even train them, so all we know is what DeepMind claims and we have no way to check their claims. For example, can I test different configurations of μZero, with and without knowledge of the rules of the game and expert knowledge? Not really. And it's clear to me that DeepMind are pushing very, very hard a form of AI that relies on having gigantic resources, like the ones they, just completely coincidentally, happen to be among the few entities to have access to. So I remain unconvinced.
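
To pin down what "knows the rules" means structurally (a sketch based on the published AlphaZero/μZero papers; the layer sizes and names below are made up): AlphaZero plans by calling an exact, hand-coded simulator of the game, while μZero plans entirely inside a learned model. That's the knob I'd want the ablations to turn.

    import torch
    import torch.nn as nn

    class MuZeroStyleModel(nn.Module):
        """MuZero-style nets: representation h, learned dynamics g, prediction heads."""
        def __init__(self, obs_dim=64, act_dim=4, hid=128):
            super().__init__()
            self.h = nn.Sequential(nn.Linear(obs_dim, hid), nn.ReLU())        # obs -> latent state
            self.g = nn.Sequential(nn.Linear(hid + act_dim, hid), nn.ReLU())  # latent dynamics (no rules)
            self.reward = nn.Linear(hid, 1)
            self.policy = nn.Linear(hid, act_dim)
            self.value = nn.Linear(hid, 1)

        def plan_step(self, latent, action_onehot):
            # one step of planning happens entirely in the learned latent space
            nxt = self.g(torch.cat([latent, action_onehot], dim=-1))
            return nxt, self.reward(nxt), self.policy(nxt), self.value(nxt)

    # An AlphaZero-style planner would instead call the real simulator here:
    #   next_board, reward = go_rules.apply(board, move)   # hand-coded, exact (hypothetical API)
    # That call is exactly the "knowledge of the rules" the ablation question is about.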

(I need to re-read the μZero paper, it's in my pdf buffer. I didn't get it the first time I read it, and it might well be that they did run sufficient ablation studies to convince even me and I just don't remember it.)



