Forum

Please register here.

Schedule

Date           | Topic                                     | Slides                   | References                                     | Instructor        | Notes
Jan. 8th 2018  | Introduction & Language Modelling         | intro, language modeling | Jurafsky & Martin: chapter 4; ngram generation | Matthias Gallé    |
Jan. 15th 2018 | Continuous word representation            | slides                   | Jurafsky & Martin: chapters 15 and 16          | Matthias Gallé    | Exercise #1
Jan. 22nd 2018 | POS-tagging and Named Entity Recognition  | slides                   | Jurafsky & Martin: POS-tags and HMM; CRF       | Matthias Gallé    |
Jan. 29th 2018 | Parsing                                   | slides                   |                                                | Salah Ait-Mokhtar |
Feb. 5th 2018  | Social Media Analysis & Opinion Mining    | slides                   |                                                | Caroline Brun     | Exercise #2
Feb. 26th 2018 | Machine translation                       | slides                   |                                                | Marc Dymetman     |
Mar. 5th 2018  | Machine Reading                           | slides                   |                                                | Julien Perez      | Exercise #3
Mar. 12th 2018 | Dialogue                                  | slides                   |                                                | Julien Perez      |

Exercises

Exercise #1

The goal of the first exercise is to implement skip-gram with negative sampling from scratch.
You will be provided with a Python template, where you should fill in the missing functions [link].
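To make the objective concrete, below is a minimal numpy sketch of a single negative-sampling update step; the function and variable names (sgns_step, W_in, W_out) are illustrative only and will not match the template's signatures:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
        # One SGD step on the skip-gram negative-sampling loss:
        #   -log sigmoid(u_ctx . v_c) - sum_k log sigmoid(-u_neg_k . v_c)
        v_c = W_in[center].copy()
        grad_center = np.zeros_like(v_c)
        for word, label in [(context, 1.0)] + [(k, 0.0) for k in negatives]:
            u = W_out[word]
            g = sigmoid(np.dot(u, v_c)) - label  # gradient w.r.t. the dot product
            grad_center += g * u
            W_out[word] = u - lr * g * v_c       # update the output ("context") vector
        W_in[center] -= lr * grad_center         # update the input ("center") vector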
Please use the following formatting conventions:
Submission:

Questions
Use the forum to ask questions. In general, individual requests will not be answered.

Evaluation
It will be done semi-automatically, by training your code on a fixed dataset (a text file containing English sentences, one per line). Evaluation will measure the correlation between the similarity your code computes for pairs of words and human similarity judgments. The test data will be a tab-separated csv file with one header line, containing the two words under columns 'word1' and 'word2' and the (human-annotated) similarity score as a float under column 'similarity'. Training time will also be taken into account (giving bonus/penalty points). The code might be inspected.
For the grade, the starting point will be the difference between the obtained score and that of some "gold" implementation. The exact f(difference) is to be defined. Bonus/penalty points will apply for considerable differences in time/memory consumption, and for code quality (readability, comments).
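To sanity-check your embeddings before submitting, you can approximate this evaluation yourself. A sketch, assuming your implementation exposes a similarity(word1, word2) function (a hypothetical name) and using Spearman correlation (the exact correlation measure used for grading is not specified here):

    import csv
    from scipy.stats import spearmanr

    def evaluate(similarity, test_path):
        # Correlate model similarities with human judgments on a tab-separated
        # file with header columns 'word1', 'word2' and 'similarity'.
        model_scores, human_scores = [], []
        with open(test_path, newline='') as f:
            for row in csv.DictReader(f, delimiter='\t'):
                model_scores.append(similarity(row['word1'], row['word2']))
                human_scores.append(float(row['similarity']))
        return spearmanr(model_scores, human_scores).correlation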

References
Derivation of the likelihood function: Goldberg, Yoav, and Omer Levy. "word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method." arXiv preprint arXiv:1402.3722 (2014).
Impact of hyper-parameters: Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving Distributional Similarity with Lessons Learned from Word Embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.
Original paper: Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." arXiv preprint arXiv:1301.3781 (2013).

Exercise #2

First, read the exercise description, then download the initial source code and data distribution.

Questions
Please use the forum to ask questions.

Exercise #3

The subject of the exercise is Natural Language Generation, using deep learning models.
Deadline: 15/04/18
Team: 2-4 people

The purpose of this exercise is to design and develop a deep learning model for Natural Language Generation, train it on the provided training set, and evaluate its performance on a test set.
The dataset you must use for the exercise can be downloaded here.

You are allowed to use one of the following three frameworks: Keras, TensorFlow, or PyTorch.
Python is the only permitted development language.
Each group will have to produce:

  1. a script to train a model using the provided training dataset.
  2. a script to generate a response file over the test dataset.

The training and generation scripts should support the following command lines:
python learn_model.py --train_dataset <pathname_to_train_dataset> --output_model_file <pathname_to_model_file>
python test_model.py --test_dataset <pathname_to_test_dataset> --output_test_file <pathname_to_results_testfile>
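A minimal skeleton for learn_model.py that parses these arguments is sketched below; the training logic itself (train_and_save here) is a placeholder for your own code:

    import argparse

    def main():
        # Parse the command line mandated above for learn_model.py.
        parser = argparse.ArgumentParser(description='Train an NLG model on the E2E data.')
        parser.add_argument('--train_dataset', required=True,
                            help='Path to trainset.csv from the E2E distribution.')
        parser.add_argument('--output_model_file', required=True,
                            help='Where to save the trained model.')
        args = parser.parse_args()
        # train_and_save(args.train_dataset, args.output_model_file)  # your code here

    if __name__ == '__main__':
        main()

test_model.py follows the same pattern with --test_dataset and --output_test_file.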

Since this is a generation task, the result file (consisting of the generated utterances) produced by the trained model will be evaluated by the examiners using (some of) the metrics described at https://github.com/tuetschek/e2e-metrics .

The examiners will first launch the training script and then evaluate the trained model over the test set.
Note that pathname_to_train_dataset will be replaced by the examiners with a pathname on their own servers to the file trainset.csv provided in the E2E datasets, and pathname_to_test_dataset with a pathname to the file testset.csv, also provided in these datasets. It is therefore the students' responsibility to ensure that their scripts can be applied directly to these files. You can use the evaluation script of the challenge to check that your results parse correctly.
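For illustration, here is a sketch of the generation loop in test_model.py, assuming the E2E csv files use a meaning-representation column named 'mr' and that your model exposes a generate method (both are assumptions, not part of the assignment specification):

    import csv

    def write_predictions(model, test_path, out_path):
        # Generate one utterance per test meaning representation and write
        # them, one per line, in the order they appear in testset.csv.
        with open(test_path, newline='') as f_in, open(out_path, 'w') as f_out:
            for row in csv.DictReader(f_in):
                f_out.write(model.generate(row['mr']).strip() + '\n')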
Finally, code quality (readability, documentation, and optimization) will bring bonus/malus points. It is also recommended that students provide a short description of their implementation along with the code.

References for the course

Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft).