
That's the "bitter lesson", right? Which is really a sour lesson- as in sour grapes. See, Rich Sutton's point with his Bitter Lesson is that encoding expert knowledge only improves performance temporarily, which is eventually surpassed by more data and compute.

There are only two problems with this: One, statistical machine learning systems have an extremely limited ability to encode expert knowledge. The language of continuous functions is alien to most humans and it's very difficult to encode one's intuitive, common sense knowledge into a system using that language [1]. That's what I mean when I say "sour grapes". Statistical machine learning folks can't use expert knowledge very well, so they pretend it's not needed.

Two, all the loud successes of statistical machine learning in the last couple of decades are closely tied to minutely specialised neural net architectures: CNNs for image classification, LSTMs for translation, Transformers for language, Diffusion models and GANs for image generation. If that's not encoding knowledge of a domain, what is?

Three, because of course three, despite point number two, performance keeps increasing only as data and compute increase. That's because the minutely specialised architectures in point number two are inefficient as all hell; the result of not having a good way to encode expert knowledge. Statistical machine learning folk make a virtue out of necessity and pretend that only being able to increase performance by increasing resources is some kind of achievement, whereas it's exactly the opposite: it is a clear demonstration that the capabilities of systems are not improving [2]. If capabilities were improving, we should see the number of examples required to train a state-of-the-art system either staying the same, or going down. Well, it ain't.

Of course the neural net [community] will complain that their systems have reached heights never before seen in classical AI, but that's an argument that can only be sustained by the ignorance of the continued progress in all the classical AI subjects such as planning and scheduling, SAT solving, verification, automated theorem proving and so on.

For example, and since planning is high on my priorities these days, see this video where the latest achievements in planning are discussed (from 2017).

https://youtu.be/g3lc8BxTPiU?si=LjoFITSI5sfRFjZI

See particularly around this point where he starts talking about the Rollout IW(1) symbolic planning algorithm that plays Atari from screen pixels with performance comparable to Deep-RL; except it does so online (i.e. no training, just reasoning on the fly):

https://youtu.be/g3lc8BxTPiU?si=33XSM6yK9hOlZJnf&t=1387
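
For the curious, here is a minimal sketch of the width-1 novelty pruning that IW(1) is built on, just to show the flavour of the idea; the Rollout variant discussed in the talk works against a simulator and uses rollouts rather than plain breadth-first search. The `features`, `successors` and `is_goal` callables below are hypothetical placeholders, not anything from the talk:

    from collections import deque

    def iw1(initial_state, features, successors, is_goal):
        # Breadth-first search with width-1 novelty pruning: a generated state
        # is kept only if it makes at least one feature true for the first
        # time anywhere in the search; everything else is discarded.
        seen = set(features(initial_state))
        queue = deque([(initial_state, [])])
        while queue:
            state, plan = queue.popleft()
            if is_goal(state):
                return plan                      # action sequence found online
            for action, next_state in successors(state):
                novel = [f for f in features(next_state) if f not in seen]
                if not novel:
                    continue                     # prune: nothing new made true
                seen.update(novel)
                queue.append((next_state, plan + [action]))
        return None                              # no plan within width 1

Since every surviving state must mark at least one new feature, the number of non-pruned states is bounded by the number of features, which is what makes this cheap enough to run online, with no training at all.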

Bitter lesson my sweet little ass.

____________

[1] Gotta find where this paper was but none other than Vladimir Vapnik basically demonstrated this by trying the maddest experiment I've ever seen in machine learning: using poetry to improve a vision classifier. It didn't work. He's spent the last 20 years trying to find a good way to encode human knowledge into continuous functions. It doesn't work.

[2] In particular their capability for inductive generalisation which remains absolutely crap.





Yeah, that's one of the papers in that line of research by Vapnik. He's got a few with similar content. Visually, it's not the paper I remember, I'll have to read it again to be sure.

If I remember correctly, Vapnik's point is, we know that Big Data Deep Learning works; now, try to do the same thing with small data. Very much like my point that the capabilities of models are not improving, only the scale is increasing.


> The language of continuous functions is alien to most humans and it's very difficult to encode one's intuitive, common sense knowledge into a system using that language

In other words: machine learned models are octopus brains (https://www.scientificamerican.com/article/the-mind-of-an-oc...) and that creeps you out. Fair enough, it creeps me out too, and we should honour our emotions (I'm no rationalist), but we should also be aware of the risks of confusing our emotional responses with reality.


Please don't god mode me? Machine learning doesn't creep me out. I'm sorry it creeps you out. In my culture, octopus is a prized delicacy, my dad used to fish them out of the sea with his bare hands when I was a kid. If you wanna creep me out, you should try snake, not octopus.


>Two, all the loud successes of statistical machine learning in the last couple of decades are closely tied to minutely specialised neural net architectures: CNNs for image classification, LSTMs for translation, Transformers for vision, Diffusion models and GANs for image generation. If that's not encoding knowledge of a domain, what is?

Transformers and Diffusion models are really odd examples for vision and image generation here. Neither of those architectures or training processes was designed with vision in mind, lol. It was, what, 3 years after "Attention Is All You Need" (2017) before the famous ViT paper? CNNs have lost a lot of favor to ViTs, and LSTMs are not the best-performing translators today.

The bitter lesson is that less encoding of "expert" knowledge results in better performance, and this has absolutely held up. The "encoding of knowledge" you attribute to these architectures is nowhere near the GOFAI kind and, even more than that, less biased NN architectures seem to be winning out.

>That's because the minutely specialised architectures in point number two are inefficient as all hell; the result of not having a good way to encode expert knowledge.

Inefficient is a whole lot better than can't even play the game, the story of GOFAI for the last few decades.

>If capabilities were improving, we should see the number of examples required to train a state-of-the-art system either staying the same, or going down. Well, they ain't.

The capabilities of models are certainly increasing. Even your example is blatantly wrong. Do you realize how much more data and compute it would take to train a vanilla RNN to, say, GPT-3-level performance?


>> Inefficient is a whole lot better than can't even play the game, the story of GOFAI for the last few decades.

See e.g. my link above where GOFAI plays the game (Atari) very well indeed.

Also see Watson winning Jeopardy (a hybrid system, but mainly GOFAI - using frames and Prolog for knowledge extraction, encoding and retrieval).

And Deep Blue beating Kasparov. And MCTS still the SOTA search algo in Go etc.

And EURISKO playing Traveller, as above.

And Pluribus playing Poker with expert game-playing knowledge.

And the recent neuro-symbolic DeepMind thingy that solves geometry problems from the maths olympiad.

etc. etc. [Gonna stop editing and adding more as they come to my mind here.]

And that's just playing games. As I say in my comment above, planning and scheduling, SAT, constraints, verification, theorem proving: those are still dominated by classical systems, and neural nets suck at them. Ask Yann LeCun: "Machine learning sucks". He means it sucks at all the things that classical AI does best, and he means he wants to do them with neural nets, and of course he'll fail.


> And MCTS still the SOTA search algo in Go etc

It's often forgotten that Rich Sutton said the two things which work are learning (the AlphaGo/Leela Zero policy network) and search (MCTS). (I think the most interesting research in ML is around the circumstances in which large models wind up performing implicit search.)
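
For concreteness, here's a minimal sketch of the UCT rule that sits at the heart of vanilla MCTS; in AlphaGo/Leela Zero the exploration term is instead weighted by the learned policy prior (PUCT), which is exactly where learning meets search. The data layout (tuples of move, value sum, visit count) is just an assumption for the sketch, not anything from those systems:

    import math

    def uct_score(value_sum, visits, parent_visits, c=1.4):
        if visits == 0:
            return float("inf")                  # try every move at least once
        exploit = value_sum / visits             # average outcome of this move so far
        explore = c * math.sqrt(math.log(parent_visits) / visits)
        return exploit + explore

    def select(children, parent_visits):
        # children: list of (move, value_sum, visits) tuples
        return max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits))[0]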


Well, gradient optimisation is a form of search.
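
A toy illustration of that point: gradient descent is just greedy local search in parameter space, where the gradient picks which neighbour to try. Toy objective, nothing from the thread:

    # Gradient descent written as greedy local search: at each step, move to the
    # single neighbour suggested by the gradient, stop when it no longer improves.
    def loss(w):
        return (w - 3.0) ** 2 + 1.0              # toy objective, minimum at w = 3

    def grad(w):
        return 2.0 * (w - 3.0)

    w, lr = 0.0, 0.1
    for _ in range(200):
        candidate = w - lr * grad(w)             # the neighbour we consider
        if loss(candidate) >= loss(w):           # no improvement: stop searching
            break
        w = candidate

    print(round(w, 3))                           # ~3.0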


That was a figure of speech. I didn't literally mean games (not that GOFAI performs better than NNs in those games anyway). I simply went off your own examples - Vision, Image generation, Translation etc.

>As I say in my comment above planning and scheduling, SAT, constraints, verification, theorem proving- those are still dominated by classical systems

You can use NNs for all these things. It wouldn't make a lot of sense, because GOFAI would be perfect at them and NNs would be inefficient, but you certainly could, which is again more than I can say for GOFAI and the domains you listed.


I don't understand your comment. Clarify.

As it is, your comment seems to tell me that neural nets are good at neural net things and GOFAI is good at GOFAI things, which is obvious, and is what I'm saying: neural nets can make only very limited use of expert knowledge and so suck in all domains where domain knowledge is abundant and abundantly useful, which are the same domains where GOFAI dominates. GOFAI can make very good use of expert knowledge but is traditionally not as good in domains where only tacit knowledge is available, because we don't understand the domain well enough yet, like anything to do with pattern recognition, which are the same domains where neural nets dominate. If explicit, expert knowledge were available for those domains, then GOFAI would dominate and neural nets would fall behind, completely contrary to what Sutton thinks.

So, the bitter lesson is only bitter for those who are not interested in what classical AI systems can do best. For those of us who are, the lesson is sweet indeed: we're making progress, algorithmic progress, progress in understanding, scientific progress, and we don't need to burn through thousands of credits to train on server farms to do anything of note. That's even a running joke in my team: hey, do you need any server time? Nah, I'll run the experiment on my laptop over lunch. And then beat the RL algo (PPO) that needs three days of training on GPUs. To solve mazes badly.


NNs can do the things GOFAI is good at a whole lot better than GOFAI can do the things NNs are good at.


That's wishful thinking not supported by empirical results.


Hey, og_kalu, I vouched for your comment but it stays dead. It's not you; it was me who was out of line, with my "wishful thinking" comment; that's not a very polite thing to say. And my original comment was a bit prissy, too.

To be honest, I'm always a bit jumpy around your comments because I've noticed them all over the place and they're often grayed-out. You kind of tend to go for the jugular. I don't mean that as a good thing. I think others have noticed it too and you get more reaction than you should. That's a shame, because it's clear there's lots of interesting conversations to be had, given you have such strong views and you seem to have done quite a bit of reading; though only on one side of things.

Anyway, sorry for starting it this time around, and sorry that you got dead'ed. I hope we get to disagree more in the future.


Addendum:

>> Do you realize how much more data and compute it would take to train a vanilla RNN to, say, GPT-3-level performance?

Oh, good point. And what would GPT-3 do with the typical amount of data used to train an LSTM? Rhetorical.


Yeah, all of those architectures are _themselves_ hacks to get around having insufficient compute! They absolutely were encoding inductive biases into the network to get around not being able to train enough, and transformers (handwaving hard enough to levitate, the currently-trainable model family with the least inductive bias) have eaten the world in all domains.

This is evidence _for_ the Bitter Lesson, not against it.


They haven't (eaten the world, etc). They just happen to be the models that are trending hard right now. I bet if you could compare like for like you'd be able to see some improvement in performance from Transformers, but that'd be extremely hard to separate from the expected improvement from the constantly increasing amounts of data and compute. For example, you could, today, train a much bigger and deeper Multi-Layer Perceptron than you could thirty years ago, but nobody is trying, because that's so 1990s, and in any case they have the data and compute to train much bigger, much more inefficient (contrary to what you say, if I got that right) architectures.

Wait a few years and the Next Big Thing in AI will come along, hot on the heels of the next generation of GPUs, or tensor units, or whatever the hardware industry can cook up to sell shovels for the gold rush. By then, Transformers will have hit the plateau of diminishing returns, there'll be gold in them there other hills, and nobody will talk about LLMs anymore, because that's so 2020s. We've been there so many times before.


> much more inefficient

The tricky part here is that "efficiency" is not a single dimension! Transformers are much more "efficient" in one sense, in that they appear to be able to absorb much more data before they saturate; they're in general less computationally efficient in that you can't exploit symmetries as hard, for example, at implementation time.

Let's talk about that in terms of a concrete example: the big inductive bias of CNNs for vision problems is that CNNs essentially presuppose that the model should be translation-invariant. This works great (it speeds up training and makes it more stable) until it doesn't, and that inductive bias starts limiting your performance, which is in the large-data limit.
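
To make that bias concrete, here's a tiny numpy check (my own toy code, not anything from the thread) that the convolution operation is translation-equivariant: shift the input and the output just shifts, because the same weights are applied at every position. (Strictly speaking, the invariance usually comes from pooling on top of this equivariance.)

    import numpy as np

    def conv1d(x, w):
        # 'valid' 1D cross-correlation: the same kernel slides over every position
        n, k = len(x), len(w)
        return np.array([np.dot(x[i:i + k], w) for i in range(n - k + 1)])

    rng = np.random.default_rng(0)
    x, w = rng.normal(size=32), rng.normal(size=5)

    y = conv1d(x, w)
    y_shifted = conv1d(np.roll(x, 3), w)         # shift the input by 3

    # Away from the wrap-around edge, the output is just the shifted original:
    print(np.allclose(y_shifted[3:], y[:-3]))    # True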

Fully-connected NNs are more general than transformers, but they have _so many_ degrees of freedom that the numerical optimization problem is impractical. If someone figures out how to stabilize that training and make these implementable on current or future hardware, you're absolutely right that you'll see people use them. I don't think transformers are magic; you're entirely correct in saying that they're the current knee on the implementability/trainability curve, and that can easily shift given different unit economics.
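
A quick back-of-the-envelope on those degrees of freedom, with made-up but typical layer sizes (not numbers from the thread):

    # One layer on a 224x224x3 image: dense vs. 3x3 convolution.
    h, w, c = 224, 224, 3
    inputs = h * w * c                           # 150,528 input values

    dense_weights = inputs * 1024                # fully connected to 1024 units
    conv_weights = 3 * 3 * c * 64                # 3x3 conv with 64 output channels

    print(f"{dense_weights:,}")                  # 154,140,672
    print(f"{conv_weights:,}")                   # 1,728 (shared across all positions)

That gap of roughly five orders of magnitude is what the inductive bias buys; drop it and the optimisation has to be tamed some other way, with more data, more regularisation, or a different architecture.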

I think one of the fundamental disconnects here is that people who come at AI from the perspective of logic _down_ think of things very differently to people like me who come at it from thermodynamics _up_.

Modern machine learning is just "applications of maximum entropy", and to someone with a thermodynamics background, that's intuitively obvious (not necessarily correct! just obvious): in a meaningful sense the _universe_ is a process of gradient descent, so "of course" the answer for models of some local domain is maximum entropy too. In that worldview, the higher-order structure is _entirely emergent_. I'm, by training, a crystallographer, so the idea that you can get highly regular structure emerging from merciless application of a single principle is just baked into my worldview very deeply.

Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.


>> Let's talk about that in terms of a concrete example: the big inductive bias of CNNs for vision problems is that CNNs essentially presuppose that the model should be translation-invariant. This works great (it speeds up training and makes it more stable) until it doesn't, and that inductive bias starts limiting your performance, which is in the large-data limit.

I don't know about that, I'll be honest. Do you have a reference? I suspect it won't disagree with what I'm saying, that neural nets just can't use a strong enough bias to avoid overfitting. I didn't say that in so many words, above, but that's the point of having a good inductive bias: that you're not left, as a learner, to the mercy of the data.

>> Someone who comes at things from the perspective of mathematical logic is going to find that worldview very weird, I suspect.

No, that's absolutely a standard assumption in logic :) Think of grammars; as Chomsky likes to say, human language "makes infinite use of finite means" (quoting Wilhelm von Humboldt). Chomsky of course believes that human language is the result of a simple set of rules, very much like logical theories. Personally, I have no idea, but Chomsky consistently, and even today, pisses off all the linguists and all the machine learning people, so he must be doing something right.
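
For illustration, "infinite use of finite means" in a few lines: a toy context-free grammar of my own invention (nothing to do with Chomsky's actual grammars) whose handful of rules generates an unbounded set of sentences through recursion:

    import random

    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["the", "dog"], ["the", "dog", "near", "NP"]],   # NP can recurse
        "VP": [["barks"], ["sees", "NP"]],
    }

    def generate(symbol="S"):
        if symbol not in GRAMMAR:                # terminal word: emit it
            return [symbol]
        rule = random.choice(GRAMMAR[symbol])    # pick one of the finite rules
        return [word for part in rule for word in generate(part)]

    print(" ".join(generate()))  # e.g. "the dog near the dog sees the dog"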

Btw, I'm not coming only from the perspective of mathematical logic. It's complicated, but, e.g., my MSc was in data science and my PhD in a symbolic form of machine learning. See, learning and logic, or learning and reasoning, are not incompatible; they're fundamentally the same.


> They haven't (eaten the world etc).

To clarify what I mean on this specific bit: the SOTA results in 2D and 3D vision, audio, translation, NLP, etc. are all transformers. Past results do not necessarily predict future performance, and it would be absurd to claim that's an immutable state of affairs, but it's certainly interesting that all of the domain-specific architectures have been flattened in a very short period of time.


Thanks for clarifying. Well, my argument is that the state of the art is more the result of trends in research than of the true capabilities of different approaches.

Take my little rant about Rich Sutton's (a god, btw) Bitter Lesson with respect to RL. So, there's AlphaGo, AlphaZero and MuZero, yes? AlphaGo knows the rules of Go and starts with some expert knowledge, and beats every human Go player. AlphaZero knows the rules of Go but has no expert knowledge, and it beats AlphaGo. And MuZero neither knows the rules of Go, nor has expert knowledge, and it beats AlphaZero, and can also play chess, shogi and Atari games, with one hand while eating a banana. Do you know how hard it is to eat a banana with one hand? Unpeeled!

Easy to draw a conclusion from that. Except all those systems were developed and used by DeepMind, and there are very few entities besides DeepMind that can even train them, so all we know is what DeepMind claims and we have no way to check their claims. For example, can I test different configurations of MuZero, with and without knowledge of the rules of the game and expert knowledge? Not really. And it's clear to me that DeepMind are pushing very, very hard a form of AI that relies on having gigantic resources, like the ones they, completely coincidentally, happen to be among the few entities to have access to. So I remain unconvinced.

(I need to re-read the MuZero paper, it's in my pdf buffer. I didn't get it the first time I read it, and it might well be that they did run sufficient ablation studies to convince even me and I just don't remember it.)



