Small leaves wiggle room, but it's extremely unlikely trad small, <= 7B, will get there this year even on these evals.
UX matching is a whole different matter and needs a lot of work: Worked heavily with Llama 8B over last days, and Phi 3 today, and the Q+A benchmarks don't tell the full story. Ex. It's nigh impossible to get Llama _70_B to answer in JSON; when Phi sees RAG from search it goes off inventing new RAG material and a new question.
UX matching is a whole different matter and needs a lot of work: Worked heavily with Llama 8B over last days, and Phi 3 today, and the Q+A benchmarks don't tell the full story. Ex. It's nigh impossible to get Llama _70_B to answer in JSON; when Phi sees RAG from search it goes off inventing new RAG material and a new question.