
Chair of Foundations of Programming

Machine Translation Lab in the winter term 2012/2013

In this course you will implement a simple translation system for natural languages using a programming language of your choice. You do not need any prior knowledge of natural language processing; all necessary concepts are introduced in the course. Nevertheless, this course is a good addition to the lecture „Maschinelles Übersetzen natürlicher Sprachen“ (Machine Translation of Natural Languages).

Schedule

Date             | Location | Event                                                             | Material
2012-10-15 13:00 | INF/E005 | First meeting, introduction, first task.                          | Slides (updated 2012-11-06 15:45)
2012-11-07       |          | Submission of training implementation (cf. Task 1).              |
2012-11-12 13:00 | INF/E005 | Introduction of decoding, discussion of training.                 | Slides
2012-12-05       |          | Submission of first decoding implementation (cf. Task 2).         |
2012-12-10 13:00 | INF/E005 | Discussion of decoding.                                           | Slides
2013-01-10       |          | Submission of a preliminary version of your report (cf. Task 3).  |
2013-01-23       |          | Submission of final implementation (cf. Task 3).                  |
2013-01-28 13:00 | INF/E069 | Competition.                                                      |
2013-02-01       |          | Submission of final report.                                       |

Task 1

First, check whether you have received a welcome message from the course’s mailing list “MT-Lab-WS12”. If you did not receive such a message, please contact Toni Dietze as soon as possible.

Second, the actual task: implement the training procedure for the bigram model, the length model, and the dictionary, as discussed in the first meeting. Adhere to the presented command line syntax as closely as possible, and document any deviations. Also document the prerequisites for compiling your program (e.g. the required compiler and libraries) and the commands to compile it.
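
The slides define the exact models and the command line syntax; they are not reproduced on this page. As a rough orientation only, the following Python sketch (the choice of language is yours) assumes relative-frequency estimation for the bigram and length models and a few EM iterations in the style of IBM Model 1 for the dictionary. The file handling and the invocation at the bottom are placeholders, not the presented syntax.

    import sys
    from collections import defaultdict

    def read_corpus(path):
        """Read a corpus with one whitespace-tokenized sentence per line."""
        with open(path, encoding="utf-8") as f:
            return [line.split() for line in f]

    def train_bigram(sentences):
        """Relative-frequency bigram model with sentence boundary markers."""
        counts, context = defaultdict(int), defaultdict(int)
        for s in sentences:
            tokens = ["<s>"] + s + ["</s>"]
            for a, b in zip(tokens, tokens[1:]):
                counts[(a, b)] += 1
                context[a] += 1
        return {bg: c / context[bg[0]] for bg, c in counts.items()}

    def train_length(source, target):
        """Relative frequency of the target length given the source length."""
        counts, total = defaultdict(int), defaultdict(int)
        for s, t in zip(source, target):
            counts[(len(s), len(t))] += 1
            total[len(s)] += 1
        return {k: c / total[k[0]] for k, c in counts.items()}

    def train_dictionary(source, target, iterations=5):
        """IBM-Model-1-style EM estimation of p(target word | source word)."""
        vocab_t = {w for s in target for w in s}
        prob = defaultdict(lambda: 1.0 / len(vocab_t))  # uniform initialization
        for _ in range(iterations):
            count, total = defaultdict(float), defaultdict(float)
            for s, t in zip(source, target):
                for tw in t:
                    norm = sum(prob[(sw, tw)] for sw in s)
                    for sw in s:
                        frac = prob[(sw, tw)] / norm
                        count[(sw, tw)] += frac
                        total[sw] += frac
            prob = defaultdict(lambda: 1e-10,
                               {k: c / total[k[0]] for k, c in count.items()})
        return prob

    if __name__ == "__main__":
        # Hypothetical invocation: train.py SOURCE TARGET (not the course's syntax).
        src, tgt = read_corpus(sys.argv[1]), read_corpus(sys.argv[2])
        bigram = train_bigram(tgt)
        length = train_length(src, tgt)
        dictionary = train_dictionary(src, tgt)
        print("bigram entries:", len(bigram), file=sys.stderr)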

You shall submit your solution to Toni Dietze by 2012-11-07. Include in your submission the training result for the corpus europarl-v7.de-en.tokenizer.lowercase.clean.nubbed.4096.txz.

Keep in mind that you have to submit a small report about your work at the end of the semester, so it might be a good idea to take notes about design decisions and the like as you go.

Task 2

Implement a decoding procedure loosely based on IBM Model 1. You can use proposals from the slides or your own ideas. Experimenting is highly encouraged! The goal is to get good translations.

Your program shall read the sentences from standard input and print their translations on standard output. The sentences are separated by newlines and their words are separated by whitespace; follow this format for the output as well. Use standard error for any other output, such as status or debug messages. Remember to handle end-of-file on standard input correctly.
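
The following is only a minimal Python sketch of how such a program could look (the choice of language is yours), not the decoding procedure from the slides: it translates greedily word by word from the dictionary, reads newline-separated sentences from standard input, writes the translations to standard output, and keeps status messages on standard error. The pickled model file and the command line invocation are assumptions made for the example.

    import sys
    import pickle

    def load_model(path):
        # Format is an assumption: a pickled dict holding the trained tables.
        with open(path, "rb") as f:
            return pickle.load(f)

    def best_translations(dictionary):
        # Index the (source, target) -> probability table by source word,
        # keeping only the most probable target word for each source word.
        best = {}
        for (f, e), p in dictionary.items():
            if f not in best or p > best[f][0]:
                best[f] = (p, e)
        return {f: e for f, (_, e) in best.items()}

    if __name__ == "__main__":
        model = load_model(sys.argv[1])          # hypothetical invocation
        table = best_translations(model["dictionary"])
        for line in sys.stdin:                   # loop ends cleanly at end-of-file
            words = line.split()
            # Greedy word-by-word lookup; unknown words are copied unchanged.
            # A real decoder should also score candidates with the bigram and
            # length models, e.g. to choose among several dictionary entries.
            print(" ".join(table.get(w, w) for w in words))
            print("translated %d words" % len(words), file=sys.stderr)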

You shall submit a first implementation by 2012-12-05, including a small (i.e. e-mail-compatible) model instance for translating from German to English.

Again, keep in mind that you have to submit a small report about your work at the end of the semester, so it might be a good idea to take notes about your ideas as you go. Also note the preliminary plan for the competition.

Task 3

Write your report and polish your implementation (cf. slides).

The Report

You shall document how to compile and execute your program. You shall also give a short description of the files your program generates and their formats.

Additionally, you shall document the steps you have taken to write your program, e.g. design decisions and optimizations. This also includes approaches which did not work out in the end. Try to explain why these approaches failed.

The Competition

The plan for the competition is still subject to change!

You get some German sentences, which you shall translate to English using your decoder and a model instance trained at home on data of your choice. The translation shall take at most 30 seconds per sentence. Your translations are then compared to human-generated reference translations using BLEU, resulting in a score.
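
BLEU scores a system translation by its n-gram overlap with one or more human reference translations, combined with a brevity penalty for overly short output. The scoring tool used in the competition is not specified here; as an illustration only, NLTK's implementation lets you compute a score yourself (an assumption, any BLEU implementation will do):

    from nltk.translate.bleu_score import corpus_bleu

    # Toy example: one system translation and two reference translations,
    # all whitespace-tokenized.  The sentences are too short for the default
    # 4-gram BLEU, so only unigrams and bigrams are scored here.
    hypotheses = [["the", "house", "is", "small"]]
    references = [[["the", "house", "is", "tiny"],
                   ["the", "home", "is", "small"]]]

    print(corpus_bleu(references, hypotheses, weights=(0.5, 0.5)))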

Then you get a parallel corpus for English and an unknown language (comparable in size to europarl-v7.de-en.tokenizer.lowercase.clean.nubbed.4096.txz). You have 10 minutes to train a model instance on this corpus for translating from the unknown language to English. After that you shall guess the meaning of some sentences in the unknown language using your decoder. The participant who first guesses the correct meaning of a sentence gets a point.

The points are added to the score from the first competition to determine the winner.

The sentences for both translation tasks will be sorted by length, so we can easily skip the long sentences if the decoders are too slow for them.

While the decoders are working, you shall describe your implementation, i.e. the decoding strategy used, optimizations, encountered problems, etc. (cf. your report).

Corpora

The corpora are packed in xz compressed tar archives. Under GNU/Linux you can unpack them with the tar command line tool; under Windows you can use 7-Zip.
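
For example, under GNU/Linux a command like tar -xJf europarl-v7.de-en.tokenizer.lowercase.clean.nubbed.4096.txz extracts the corpus; recent versions of GNU tar also detect the xz compression automatically with a plain tar -xf.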

Last modified: 15th Jan 2013, 9.49 AM
Author: Dipl.-Inf. Toni Dietze
