How "Correct" are LLM Evaluators?

Summary:

* We tested LangChain's LLM-assisted evaluators on common tasks to provide guidelines on how best to use them in practice.
* GPT-4 excels in accuracy across various tasks, while GPT-3.5 and Claude-2 lag on tasks requiring complex "reasoning" (when used in a zero-shot setting). C…
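To make the setup concrete, here is a minimal sketch of running one of these LLM-assisted evaluators via LangChain's `load_evaluator` API. The grading model choice and the example input/prediction/reference strings are illustrative assumptions, not the exact configuration used in our tests:

```python
# Minimal sketch: reference-based "correctness" grading with an
# LLM-assisted evaluator. Example strings below are hypothetical.
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

# GPT-4 as the grading LLM, reflecting the summary's finding that it
# is the most accurate zero-shot evaluator among the models tested.
eval_llm = ChatOpenAI(model="gpt-4", temperature=0)

# "labeled_criteria" grades a prediction against a reference answer
# on a named criterion, here "correctness".
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=eval_llm)

result = evaluator.evaluate_strings(
    input="What year was the Eiffel Tower completed?",
    prediction="The Eiffel Tower was completed in 1889.",
    reference="1889",
)
print(result)  # dict with keys like "score", "value", and "reasoning"
```

The returned `score` is a binary pass/fail judgment from the grading LLM, so the accuracy figures discussed in this post are ultimately a measure of how often that judgment agrees with a human label.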