
Not trying to disparage them, but their models always give the impression of being overfitted to benchmarks, which is why they score so well. On everyday tasks, such as chat or simple completion, they are much worse.

Distilling can work, and there are papers suggesting it does, but we still lack a reliable mechanism for distilling knowledge from larger teacher models into smaller student models.
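For context, the standard recipe here is Hinton-style knowledge distillation: the student is trained to match the teacher's temperature-softened output distribution via a KL-divergence term. A minimal sketch in plain Python (the temperature value and the three-class logits are illustrative assumptions, not from any particular paper):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution,
    # exposing the teacher's "dark knowledge" about non-top classes.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so gradients stay comparable across temperatures.
    p = softmax(teacher_logits, T)  # teacher soft labels
    q = softmax(student_logits, T)  # student predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# When the student matches the teacher exactly, the loss is zero;
# the further its logits diverge, the larger the penalty.
```

In practice this term is mixed with the usual cross-entropy on hard labels, and the open problem the comment points at is that matching output distributions alone does not reliably transfer the teacher's capabilities.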




This was the case with Phi-2, which was notoriously rubbish in practical use.



