
Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into a high ranking on the LMSYS leaderboard, or usefulness in everyday tasks. Let's not dethrone Llama 3 until some real world testing can be done.

That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.




This won't dethrone Llama 3, but it's equally impressive.

They mention this model's relative weakness on the TruthfulQA eval: packing 'knowledge' into a small model is lossier than packing in problem-solving skills (which shine on MMLU).

Regardless - still a very useful thing to have offline and on the fly. Those scores are nothing to scoff at.

Given that these pipelines are likely harder to imitate than new architectures like Transformers, I assume there has been and will be an intense focus on synthetic data generation and cleansing. Llama 3 used 15T tokens in its training corpus vs. 4.8T in the "scaled-up" version of phi-3. If you made it to the end of this disjointed ramble, I'm sorry.


Even Llama 3 has its issues. I've been quite impressed so far, but if the context gets a little long it freaks out, gets stuck repeating the same token, or just fails to finish an answer. This is with the full f16 8B model, so it can't be put down to quantization. It also doesn't handle complex instructions quite as well as the benchmarks would imply it should.


Supposedly LLMs (especially smaller ones) are best suited to tasks where the answer is in the text, i.e. summarization, translation, and answering questions about the provided text.

Asking it to answer questions on its own is much more prone to hallucination.

To that end I've been using Llama 3 for summarizing transcripts of YouTube videos. It does a decent job, but... every single time (literally 100% of the time), it will hallucinate a random name for the speaker.* Every time! I thought it might be the system prompt, but there isn't one.

My own prompt is just "{text}\n\n###\n\nPlease summarize the text above."

If I ask it to summarize in bullet points, it doesn't do that.

I'm assuming there was something in the (instruct) training data that strongly encourages that, i.e. a format of summaries beginning with the author's name? Seems sensible enough, but obviously backfires when there's literally no data and it just makes something up...

*In videos where the speaker's name isn't in the transcript. If it's a popular field, it will often come up with something plausible (e.g. Andrew Ng for an AI talk.) If it's something more obscure, it'll dream up something completely random.


The technique to use is to give the model an “out” for the missing/negative case.

"{text}\n\n###\n\nPlease summarize the text above. The text is a video transcript. It may not have the names of the speakers in it. If you need to refer to an unnamed speaker, call them Speaker_1, Speaker_2 and so on."


Especially with small models, I had very bad results using them for translation. Even trying all kinds of tricks didn't help (apparently prompting in the target language helps for some). Encoder-decoder models such as FLAN-T5 or MADLAD-400 seemed far superior at equal or even smaller model sizes.


I forget which model (LLaMA 3?) but I heard 95% of the training data was English.


for sure, so my use case for example is

"using the following documentation to guide you {api documentation}, edit this code {relevant code}, with the following objective: Replace uses of {old API calls} in {some function} with with relevant functions from the supplied documentation"

It mostly works, but if the context is a little too long, sometimes it will just spam the same umlaut or number (always umlauts or numbers) over and over. Perhaps some fine-tuning of parameters like temperature or repetition penalty might fix it; time will tell.
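For anyone wanting to poke at those knobs, this is roughly what I mean. It's only a sketch: it assumes ollama's REST API is what's serving the model, and the parameter values are made up for experimentation, not recommendations:

    import requests

    prompt = "..."  # the edit-this-code prompt from above, with docs and code filled in

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.6,      # placeholder value, just something to experiment with
                "repeat_penalty": 1.15,  # discourages repeating the same token over and over
                "num_ctx": 8192,         # make sure the window actually fits the docs + code
            },
        },
    )
    print(resp.json()["response"])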


Are you using ollama? Devs said there was a bug that occurs when context is full, they're working on it.


That would do it, I am indeed.

I had this same issue with incomplete answers on longer summarization tasks. If you ask it to "go on" it will produce a better completion, but I haven't seen this behaviour in any other model.


Neither. Still, the answers it does provide, despite a few hiccups, are truly outstanding. I am really impressed with this model, even with its issues. Though I am sure the issues, such as they are, are a month or two away from being fixed. For what it's worth, I haven't played as much with the bigger model, but it seems not to struggle with the same things; take that with a grain of salt, though, as it runs too slow on my hardware for me to rapidly test things.


> TruthfulQA

Wait, people still use this benchmark? I hear there's a huge flaw in it.

For example, fine-tuning a model on 4chan makes it score better on TruthfulQA. It becomes very offensive afterwards though, for obvious reasons. See GPT-4chan [1].

[1] https://www.youtube.com/watch?v=efPrtcLdcdM


Couldn't it be that training it on 4chan makes it more truthful for some reason?


Could it be that people who can talk anonymously with no reputation to gain or lose and no repercussions to fear actually score high on truthfulness? Could it be that truthfulness is actually completely unrelated to the offensiveness of the language used to signal in-group status?


This unironically feels like good research & paper potential.


Not sure I understand your example? It's not an offensiveness benchmark; in fact, I can imagine a model trained to be inoffensive doing worse on a truthfulness benchmark. I wouldn't go so far as to say TruthfulQA is actually testing how truthful a model is, or its reasoning. But it's one of the benchmarks least correlated with the others, which makes it one of the most interesting - much more so than running yet more tests that are highly correlated with MMLU performance. https://twitter.com/gblazex/status/1746295870792847562


> scores better

> very offensive

Any cons?


Looks like a good and useful benchmark.


"Omit that training data..."


The Chinchilla paper is about how to design/train a model to optimize use of computing power, which we can equate to cost (FLOPs cost dollars).

The question that Chinchilla tries to answer is: for a given training budget (which you can think of as dollars or FLOPs), what is the optimal trade off of model size and quantity of training data to get the most performant model? Build a large model and train with less data, or build a smaller one and train with more data?

However, another consideration is minimizing total lifetime cost of the model: training cost + inference cost. You could train a model for longer (costing more) in order to get a given level of performance from a smaller model that will be cheaper for inference, or vice versa. For any given projected model lifetime inference volume, there is going to be a different answer.

It's not that Chinchilla-optimal models stopped making sense, but rather that this sort of consideration has people willing to pump more money (tokens) into smaller models to reduce inference cost for that level of capability.
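To put rough numbers on that (back-of-envelope only: the ~20 tokens per parameter rule of thumb, the ~6*N*D training-FLOP estimate, and the ~2*N FLOPs per inference token are the usual heuristics from the scaling-law literature, not figures from this thread):

    # Rough heuristics: Chinchilla-optimal data ~ 20 tokens per parameter,
    # training cost ~ 6 * params * tokens FLOPs, inference ~ 2 * params FLOPs per token.
    def training_flops(n_params, n_tokens):
        return 6 * n_params * n_tokens

    def inference_flops(n_params, tokens_served):
        return 2 * n_params * tokens_served

    N = 8e9                       # an 8B-parameter model
    chinchilla_tokens = 20 * N    # ~160B tokens would be "compute optimal"
    overtrained_tokens = 15e12    # Llama-3-style over-training

    print(f"{training_flops(N, chinchilla_tokens):.1e}")   # ~7.7e21 FLOPs
    print(f"{training_flops(N, overtrained_tokens):.1e}")  # ~7.2e23 FLOPs, ~90x more training compute
    # But every served token still costs ~2*N FLOPs either way, so if lifetime
    # inference volume is large enough, paying extra training compute to get more
    # capability out of the *small* model can be the cheaper total package.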


Does the paper assume uniform settings throughout the training phase? Or is it the bound no matter what training strategy is used, given the dataset?


They only experimented with different cosine learning rate decay schedules, but found results consistent across these, as well as across two different types of experiment where they either varied number of training tokens for a given model size, or varied model size for a given number of training FLOPs.


Not trying to disparage them, but their models always give the feeling of being overfitted on benchmarks, hence performing so well. On everyday tasks - chat or simple completion - they're much worse.

Distilling can work and there are papers which suggest it does, but we still do not have a reliable mechanism which can distill knowledge from larger teacher models to smaller student models.


This was the case for Phi-2; it was notoriously rubbish in practical use.


I don't think we can call it distillation, at least not in the conventional ML sense of the word, as you're not interacting with the actual model architecture; specifically, you're not computing the loss between the predictions of the parent model and those of the distilled target model.

This is an important distinction when it comes to assessing model collapse risk, a risk I think has been overstated enough up to this point that it's now being understated.
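For reference, conventional distillation in the Hinton et al. sense looks roughly like this. It's a minimal PyTorch-style sketch of the soft-label loss, purely illustrative, and not what the phi models do (they just train on text a larger model generated):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: match the teacher's temperature-softened output distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy against the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard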


Phi-2 wasn't chat/instruct-tuned, so it didn't do well in chat; it was a base model. But the benchmark numbers were real.


I had a lot of issues trying to get Phi-2 to perform as well as the benchmarks indicated on non-chat tasks.

It felt a lot like it was overfitted to the exact types of tasks in the benchmarks (i.e., not a data leak), but if you tried something a bit off track it didn't know what to do. At the time my hypothesis was that the small model just didn't have the capacity to generalise well enough, but since then Gemma 2B has come out and seems to be ok.

So now I have no idea why, but yes: the benchmarks for Phi-2 didn't represent how it worked for me on real-world tasks where you'd expect it to be ok.


I'm pretty naive, so please forgive me if it's a stupid question.

To me, what the parent comment is saying is that even though the benchmarks are cool, the model isn't super helpful to the everyday person. Because if you can't chat with it very well (even in a narrow context), what utility do the great benchmarks have?


Both are saying the same thing: in order for the base model that is phi to perform well as a chat agent, it would need to be tuned for that purpose before its benchmark results could have real-world value.


Indeed, Phi-2 was not instruct tuned. From this report:

"Our models went through post-training with both supervised instruction fine-tuning, and preference tuning with DPO. We have worked on generating and curating various instruction and preference data. This has improved the model chat capabilities, robustness, as well as its safety."



