Pseudo-aligned multilingual corpora

F. Diaz and D. Metzler
IJCAI 2007
In machine translation, document alignment refers to finding correspondences between documents that are exact translations of each other. We define pseudo-alignment as the task of finding topical, as opposed to exact, correspondences between documents in different languages. We apply semi-supervised methods to pseudo-align multilingual corpora. Specifically, we construct a topic-based graph for each language. Then, given exact correspondences between a subset of documents, we project the unaligned documents into a shared lower-dimensional space. We demonstrate that close documents in this lower-dimensional space tend to share the same topic. This has applications in machine translation and cross-lingual information analysis. Experimental results show that pseudo-alignment of multilingual corpora is feasible and that the document alignments produced are qualitatively sound. Our technique requires no linguistic knowledge of the corpus. On average, when 10% of the corpus consists of exact correspondences, an on-topic correspondence occurs within the top 5 foreign neighbors in the lower-dimensional space, while the exact correspondence occurs within the top 10 foreign neighbors in this space. We also show how to substantially improve these results with a novel method for incorporating language-independent information.
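The pipeline described in the abstract (one graph per language, a subset of known exact correspondences, and a projection into a shared low-dimensional space) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic Laplacian-eigenmap style joint embedding with a soft penalty `mu` tying known pairs together, cosine k-nearest-neighbour graphs over document vectors, and a connected joint graph. All function names and parameters (`knn_graph`, `joint_embedding`, `foreign_neighbors`, `k`, `mu`, `dim`) are hypothetical.

```python
import numpy as np


def knn_graph(X, k):
    """Symmetric k-nearest-neighbour graph with cosine-similarity weights."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    S = np.clip(Xn @ Xn.T, 0.0, None)  # keep edge weights non-negative
    np.fill_diagonal(S, 0.0)           # no self-loops
    W = np.zeros_like(S)
    for i in range(S.shape[0]):
        nbrs = np.argsort(S[i])[-k:]   # indices of the k most similar documents
        W[i, nbrs] = S[i, nbrs]
    return np.maximum(W, W.T)          # symmetrise


def laplacian(W):
    """Unnormalised graph Laplacian D - W."""
    return np.diag(W.sum(axis=1)) - W


def joint_embedding(X_a, X_b, pairs, k=3, mu=10.0, dim=2):
    """Embed two corpora in one space (assumed technique, not the paper's).

    pairs holds (i, j) tuples: document i of corpus A is a known exact
    translation of document j of corpus B.
    """
    n_a, n_b = X_a.shape[0], X_b.shape[0]
    L = np.zeros((n_a + n_b, n_a + n_b))
    L[:n_a, :n_a] = laplacian(knn_graph(X_a, k))
    L[n_a:, n_a:] = laplacian(knn_graph(X_b, k))
    for i, j in pairs:
        jj = n_a + j
        # Adding an edge of weight mu between i and jj pulls the pair together.
        L[i, i] += mu
        L[jj, jj] += mu
        L[i, jj] -= mu
        L[jj, i] -= mu
    _, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    Z = vecs[:, 1:dim + 1]             # skip the constant eigenvector
    return Z[:n_a], Z[n_a:]


def foreign_neighbors(Z_a, Z_b, i, top=5):
    """Indices of the `top` corpus-B documents closest to document i of A."""
    dists = np.linalg.norm(Z_b - Z_a[i], axis=1)
    return np.argsort(dists)[:top]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_a = rng.random((8, 6))           # toy stand-ins for term vectors
    X_b = rng.random((8, 5))           # different vocabulary size per language
    Z_a, Z_b = joint_embedding(X_a, X_b, pairs=[(0, 0), (1, 1), (2, 2)])
    print(foreign_neighbors(Z_a, Z_b, i=3))
```

In this sketch, the per-language graphs never compare documents across languages directly; only the known pairs couple the two blocks, which is one way to read the claim that no linguistic knowledge of the corpus is required. Larger values of `mu` force paired documents closer together in the shared space.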

bibtex

@inproceedings{diaz:pseudoaligned-corpora,
  year = {2007},
  title = {Pseudo-Aligned Multilingual Corpora},
  pages = {2727--2732},
  editor = {Manuela M. Veloso},
  booktitle = {IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence},
  author = {Fernando Diaz and Donald Metzler}
}