Combination of Statistical Word Alignments Based on Multiple Preprocessing Schemes
This chapter presents an approach that uses multiple preprocessing (tokenization) schemes to improve statistical word alignment. In this approach, the text to align is tokenized before statistical alignment and then remapped to its original form afterwards. Multiple tokenizations yield multiple remapped alignments, which are then combined using supervised machine learning. The remapping step on its own improves alignment correctness, and the combination of multiple remappings improves measurably over a commonly used state-of-the-art baseline: a relative reduction in alignment error rate of about 38% is obtained on a blind test set.
Keywords: multiple preprocessing, tokenization, word alignment, remapping, machine learning
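The remapping step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names are hypothetical, and it assumes tokenization only splits original words (never merges or reorders them), so each token can be traced back to exactly one original word.

```python
def token_to_word_map(words, tokens):
    """Map each token index to the index of the original word it came from,
    assuming the tokens concatenate back into the words in order."""
    mapping = []
    word_idx = 0
    buffer = ""
    for tok in tokens:
        mapping.append(word_idx)
        buffer += tok
        if buffer == words[word_idx]:  # finished reassembling this word
            word_idx += 1
            buffer = ""
    return mapping

def remap_alignment(src_words, src_tokens, tgt_words, tgt_tokens, links):
    """Project token-level alignment links (i, j) back onto original-word
    indices; links that collapse onto the same word pair are merged."""
    smap = token_to_word_map(src_words, src_tokens)
    tmap = token_to_word_map(tgt_words, tgt_tokens)
    return sorted({(smap[i], tmap[j]) for (i, j) in links})

# Example: "don't go" tokenized as "do n't go", aligned to French "ne va pas".
remapped = remap_alignment(
    ["don't", "go"], ["do", "n't", "go"],
    ["ne", "va", "pas"], ["ne", "va", "pas"],
    [(0, 0), (1, 2), (2, 1)],
)
# remapped == [(0, 0), (0, 2), (1, 1)]
```

Under this scheme, different tokenizations of the same sentence pair yield different remapped word-level alignments over the same original words, which is what makes them directly comparable and combinable by a supervised classifier.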