The system retrieves relevant images for news articles by fusing several textual sources (caption, body, headline, and lead) with a hierarchical attention mechanism. It uses subword embeddings and self-attention to encode entities more accurately and to capture the important keywords in each text. The model is trained with weak supervision on a large-scale multimodal, multilingual dataset of over 500k German and French news article-image pairs.
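The two-level fusion described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`attention_pool`, `fuse_article`, `rank_images`), the random embeddings, the embedding size, and the cosine-similarity ranking are all assumptions made for illustration. A single-query attention pooling stands in for the self-attention encoder, and text and image vectors are assumed to live in a shared space.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_pool(vectors, query):
    """Pool (n, d) vectors into one (d,) vector using scaled dot-product
    attention against a query vector (hypothetical stand-in for a learned one)."""
    scores = vectors @ query / np.sqrt(vectors.shape[1])
    weights = softmax(scores)
    return weights @ vectors, weights

def fuse_article(sources, token_query, source_query):
    """Level 1: attend over the tokens of each source.
    Level 2: attend over the resulting per-source vectors."""
    names = list(sources)
    pooled = np.stack([attention_pool(sources[n], token_query)[0] for n in names])
    article_vec, source_weights = attention_pool(pooled, source_query)
    return article_vec, dict(zip(names, source_weights))

def rank_images(article_vec, image_vecs):
    """Return image indices sorted by descending cosine similarity."""
    a = article_vec / np.linalg.norm(article_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    return np.argsort(-(imgs @ a))

# Toy data: random vectors stand in for subword-based token embeddings.
rng = np.random.default_rng(0)
d = 8  # assumed embedding size
sources = {
    name: rng.normal(size=(n_tokens, d))
    for name, n_tokens in [("caption", 5), ("body", 20), ("headline", 4), ("lead", 7)]
}
token_query = rng.normal(size=d)   # would be learned in a real model
source_query = rng.normal(size=d)  # would be learned in a real model

article_vec, source_weights = fuse_article(sources, token_query, source_query)
image_vecs = rng.normal(size=(10, d))  # placeholder image embeddings
ranking = rank_images(article_vec, image_vecs)
```

The per-source weights in `source_weights` sum to one, so they can be read as how much each textual source contributes to the fused article representation for a given query.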
This page was last edited on 2024-03-19.