Evaluating LLMs with Chat Templates
LLMs are typically fine-tuned on chat-templated datasets, where each conversation is converted into a single string ready for text generation. Matching this format at inference time is standard practice, since deviating from it is widely reported to hurt performance. Here, we put this assumption to the test using an instruction-following benchmark and several of the strongest open-source LLMs currently available.
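To make the format concrete, here is a minimal sketch of how a tokenizer's chat template renders a conversation into a prompt string with Hugging Face transformers; the model name and conversation are illustrative placeholders, not part of our benchmark setup.

```python
from transformers import AutoTokenizer

# Illustrative placeholder; any chat-tuned model with a chat template works.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name three prime numbers."},
]

# Render the conversation into the single string format the model was
# fine-tuned on; add_generation_prompt appends the assistant header so the
# model continues with its own reply.
prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```

Skipping this step and feeding the raw conversation text to the model is exactly the kind of deviation whose cost we measure below.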