At the least, there are multiple benchmarks noted in the paper (21!), and the results are consistent across all of them.

I'd trust Microsoft to do decontamination testing, although the paper doesn't explicitly mention it other than "The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for the phi-3 models."