1st Multilingual Model Workshop – Guardrails and Evaluation of the Jais Arabic-English LLMs
Preslav further discusses evaluation. In addition to perplexity, the team used downstream evaluation in both Arabic and English, covering world knowledge, commonsense reasoning, and misinformation & bias. For evaluating the model in Arabic, the team used pre-existing datasets such as EXAMS, which contains Arabic matriculation questions; curated their own dataset covering Arabic literature; and translated English evaluation datasets into Arabic (manually for MMLU, and automatically, using an in-house system, for the other English datasets). The team also performed generation evaluation using GPT-4 as a judge. Still, whenever feasible, the team performed human evaluation, which proved the most valuable input, informing many key decisions about building the model.
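To make the GPT-4-as-judge setup concrete, here is a minimal sketch of pairwise generation evaluation. It is illustrative only: the prompt wording, the A/B/TIE verdict scheme, and the order-swapping step are assumptions, not the Jais team's actual rubric. It assumes the OpenAI Python client (v1+) with an OPENAI_API_KEY set in the environment.

```python
# Minimal sketch of GPT-4-as-judge pairwise evaluation.
# Illustrative only: prompt wording and verdict scheme are assumptions,
# not the Jais team's actual rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Compare two assistant answers
to the same user question and reply with exactly one word:
"A" if answer A is better, "B" if answer B is better, "TIE" otherwise.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 which of two answers is better; returns "A", "B", or "TIE"."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic verdicts
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper()

def judge_both_orders(question: str, answer_a: str, answer_b: str) -> str:
    """Judge twice with the answer order swapped; keep the verdict only if
    both runs agree, otherwise declare a tie."""
    first = judge_pair(question, answer_a, answer_b)
    second = {"A": "B", "B": "A"}.get(judge_pair(question, answer_b, answer_a), "TIE")
    return first if first == second else "TIE"
```

Running the judgment in both answer orders and keeping only agreeing verdicts is a common mitigation for the position bias that LLM judges are known to exhibit.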