Evaluation

Challenging the Abilities of Large Language Models in Italian: a Community Initiative

The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of …

Malvina Nissim, Danilo Croce, Viviana Patti, Pierpaolo Basile, Giuseppe Attanasio, Elio Musacchio, Matteo Rinaldi, Federico Borazio, Maria Francis, Jacopo Gili, and others

BERnaT: Basque Encoders for Representing Natural Textual Diversity

Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model …

Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz, Mikel Zubillaga, Ander Soraluze, Aitor Soroa

Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Current Multimodal Large Language Models exhibit very strong performance for several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource …

Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune

Truth Knows No Language: Evaluating Truthfulness Beyond English

We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations …

Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria de-Dios-Flores, Rodrigo Agerri

GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge

In the context of the CALAMITA Challenge, we investigate the physical commonsense reasoning capabilities of large language models (LLMs) and introduce a methodology to assess their …

Giulia Pensa, Ekhi Azurmendi, Julen Etxaniz, Begoña Altuna, Itziar Gonzalez-Dios

BertaQA: How Much Do Language Models Know About Local Culture?

Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how …

Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, Mikel Artetxe

Lessons from the Trenches on Reproducible Evaluation of Language Models

Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation …

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

In this position paper, we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data …

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre