
Israfel Salazar

Hi there!

I am an ELLIS PhD fellow at the University of Copenhagen, advised by Desmond Elliott. My current research focuses on vision-language understanding and representation. I have broad interests in machine learning, including motion and spatial reasoning, and robotics.

Previously, I completed the M.Sc. in Applied Mathematics (MVA) at ENS Paris-Saclay and the M.Sc. in Electrical Engineering at Université Paris-Saclay. I’ve worked on generative models for image restoration at DxO, Bayesian generative modeling at Inria, and multimodal representation learning at Hugging Face. Before that, I worked as a robotics engineer after studying mechanical engineering and applied physics at the University of Chile.

News

  • [2025-11] Presented SPECS at EMNLP 2025! 🇨🇳
  • [2025-08] SPECS accepted to EMNLP 2025. 🥳
  • [2025-08] CaMMT accepted to Findings of EMNLP 2025. 🥳

Publications

Preprints

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs

Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva

Investigates the bidirectional relationship between compositional training and long-caption understanding in vision-language models, revealing that these capabilities can be jointly learned through training on dense, grounded descriptions.

Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

Israfel Salazar, Manuel Fernández Burda, [...], Sara Hooker, Marzieh Fadaee

A comprehensive exam benchmark covering 18 languages and 14 subjects with 20,911 multiple-choice questions for massively multilingual vision-language model evaluation.

Conference Papers

SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

Xiaofu Chen, Israfel Salazar, Yova Kementchedjhieva

EMNLP, 2025

A reference-free metric for evaluating long image captions that emphasizes specificity by rewarding correct details and penalizing incorrect ones.

CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

Emilio Villa-Cueva, Sholpan Bolatzhanova, [...], Atnafu Lambebo Tonja, Thamar Solorio

Findings of EMNLP, 2025

A benchmark corpus with over 5,800 triples across 19 languages investigating whether images can act as cultural context in multimodal translation.