Machine Translation Lab in the winter term 2012/2013
In this course you will implement a simple translation system for natural languages using a programming language of your choice. You do not need any previous knowledge of natural language processing; all necessary concepts are introduced in the course. Nevertheless, this course is a good complement to the lecture „Maschinelles Übersetzen natürlicher Sprachen“ ("Machine Translation of Natural Languages").

Schedule
Task 1
First, check whether you received a welcome message from the course’s mailing list “MT-Lab-WS12”. If you did not receive such a message, please contact Toni Dietze as soon as possible.
Second, the actual task: Implement the training procedure for the bigram model, the length model, and the dictionary, as discussed in the first meeting. Adhere to the presented command line syntax as closely as possible, and document any deviations. Also document the prerequisites for compiling your program (e.g. the required compiler and libraries) and the commands to compile it.
You shall submit your solution to Toni Dietze by 2012-11-07. Include the training result for the corpus europarl-v7.de-en.tokenizer.lowercase.clean.nubbed.4096.txz in your submission.
Keep in mind that you have to submit a small report about your work at the end of the semester, so it is a good idea to start taking notes about design decisions and the like now.

Task 2
Implement a decoding procedure loosely based on IBM Model 1. You can use proposals from the slides or your own ideas. Experimenting is highly encouraged! The goal is to get good translations.
Your program shall read the sentences from standard input and print their translations to standard output. The sentences are newline-separated and their words are whitespace-separated; the output shall follow the same format. Use standard error for any other output, such as status or debug messages. Remember to handle end-of-file on standard input correctly.
You shall submit a first implementation by 2012-12-05, including a small (i.e. e-mail compatible) model instance for translation from German to English.
Again, keep in mind that you have to submit a small report about your work at the end of the semester, so it is a good idea to start taking notes about your ideas now. Also, note the preliminary plan for the competition.

Task 3
Write your report and polish your implementation (cf. slides).
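The input and output conventions required in Task 2 can be sketched as follows. This is only a minimal illustration, assuming Python as the implementation language; the names `TOY_DICTIONARY`, `translate_sentence`, `run`, and `main` are hypothetical, and the word-by-word lookup merely stands in for a real decoder based on IBM Model 1.

```python
import sys
from typing import Dict, List, TextIO

# Hypothetical toy dictionary for illustration only; a real model instance
# would be loaded from the files produced by the training step.
TOY_DICTIONARY: Dict[str, str] = {
    "das": "the",
    "haus": "house",
    "ist": "is",
    "klein": "small",
}


def translate_sentence(words: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Placeholder decoder: word-by-word lookup, unknown words are copied."""
    return [dictionary.get(word, word) for word in words]


def run(source: TextIO, target: TextIO, log: TextIO,
        dictionary: Dict[str, str]) -> int:
    """Translate newline-separated sentences from source to target."""
    count = 0
    for line in source:       # the loop ends cleanly at end-of-file
        words = line.split()  # words are whitespace-separated
        translation = translate_sentence(words, dictionary)
        # One output line per input line, even if the input line is empty.
        target.write(" ".join(translation) + "\n")
        count += 1
    target.flush()
    # Status messages go to the log stream, never to the translation stream.
    print(f"translated {count} sentence(s)", file=log)
    return count


def main() -> None:
    """Wire the translation loop to the three standard streams."""
    run(sys.stdin, sys.stdout, sys.stderr, TOY_DICTIONARY)
```

Calling `main()` from an `if __name__ == "__main__":` guard turns the sketch into a filter that can be used in a shell pipeline. Passing the streams as parameters keeps the loop testable with in-memory files.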
The Report
You shall document how to compile and execute your program. You shall also give a short description of the files your program generates and their formats. Additionally, you shall document the steps you have taken to write your program, e.g. design decisions and optimizations. This includes approaches that did not work out in the end; try to explain why these approaches failed.

The Competition
The plan for the competition is still subject to change!
You will get some German sentences, which you shall translate to English using your decoder with a model instance trained at home on data of your choice. The translation shall take at most 30 seconds per sentence. Your translations are then compared to human-generated translations using BLEU, resulting in a score.
Then you will get a parallel corpus for English and an unknown language (comparable in size to europarl-v7.de-en.tokenizer.lowercase.clean.nubbed.4096.txz). You have 10 minutes to train a model instance on this corpus for translating from the unknown language to English. After that you shall guess the meaning of some sentences in the unknown language using your decoder. The participant who first guesses the correct meaning of a sentence gets a point. The points are added to the score from the first competition to determine the winner.
The sentences for each of the two translation tasks will be sorted by length, so we can easily skip the long sentences if the decoders are too slow for them.
While the decoders are working, you shall describe your implementation, i.e. the decoding strategy used, optimizations, encountered problems, etc. (cf. your report).

Corpora
The corpora are packed in xz-compressed tar archives. Under GNU/Linux you can unpack them with the
Contact
Prof. Dr.-Ing. habil. Dr. h.c./Univ. Szeged Heiko Vogler
Phone: +49 (0) 351 463-38232
Fax: +49 (0) 351 463-37959