|Jan. 8th 2018||Introduction & Language Modelling||intro, language modeling||Jurafsky & Martin chapter 4, ngram generation||Matthias Gallé|
|Jan. 15th 2018||Continuous word representation||slides||Jurafsky & Martin: chapter 15 and 16||Matthias Gallé||Exercise #1|
|Jan. 22nd 2018||POS-tagging and Named Entity Recognition||slides||Jurafsky & Martin: POS-tags and HMM. CRF.||Matthias Gallé|
|Jan. 29th 2018||Parsing||slides|| ||Salah Ait-Mokhtar|
|Feb. 5th 2018||Social Media Analysis & Opinion Mining||slides|| ||Caroline Brun||Exercise #2|
|Feb. 26th 2018||Machine translation|| || || |
|Derivation of likelihood function||Goldberg & Levy. word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method. Tech Report.|
|Impact of hyper-parameters||Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.|
|Original paper||Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.|
The purpose of this exercise is to design and implement a deep learning model for Natural Language Generation, train it on a training set, and evaluate its performance on a test set.
The dataset to use for this exercise can be downloaded here.
You may use one of the following three frameworks: Keras, TensorFlow, or PyTorch.
Python is the only allowed development language.
Each group will have to produce:
The training and test scripts should expose the following command lines:
python learn_model.py --train_dataset <pathname_to_train_dataset> --output_model_file <pathname_to_model_file>
python test_model.py --test_dataset <pathname_to_test_dataset> --output_test_file <pathname_to_results_testfile>
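As a starting point, the two command lines above could be handled with `argparse` from the standard library. The sketch below is only a suggestion, not part of the assignment; the function names `parse_train_args` and `parse_test_args` are illustrative.

```python
import argparse


def parse_train_args(argv=None):
    """Parse the arguments expected by learn_model.py."""
    parser = argparse.ArgumentParser(
        description="Train an NLG model on the E2E dataset.")
    parser.add_argument("--train_dataset", required=True,
                        help="path to trainset.csv from the E2E datasets")
    parser.add_argument("--output_model_file", required=True,
                        help="path where the trained model will be written")
    return parser.parse_args(argv)


def parse_test_args(argv=None):
    """Parse the arguments expected by test_model.py."""
    parser = argparse.ArgumentParser(
        description="Generate utterances with a trained NLG model.")
    parser.add_argument("--test_dataset", required=True,
                        help="path to testset.csv from the E2E datasets")
    parser.add_argument("--output_test_file", required=True,
                        help="path where the generated utterances will be written")
    return parser.parse_args(argv)
```

With `required=True`, the scripts fail with a clear error message if the examiners' invocation omits one of the expected flags.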
Since this is a generation task, the result file produced by the trained model, consisting of generated utterances, will be evaluated by the examiners using (some of) the metrics described in https://github.com/tuetschek/e2e-metrics .
The examiners will first launch the training script and then evaluate the trained model over the test set.
Note that pathname_to_train_dataset will be replaced by the examiners with a path on their own servers to the file trainset.csv provided in the E2E datasets, and pathname_to_test_dataset with a path to the file testset.csv from the same datasets. It is therefore the students' responsibility to ensure that their scripts can be applied directly to these files. You can use the evaluation script of the challenge to check that your results parse correctly.
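To make sure the scripts work on the examiners' copies of the files, it helps to read the CSVs by column name rather than by position. A minimal sketch, assuming the standard E2E column layout (`mr` for the meaning representation, `ref` for the reference utterance; verify against the downloaded files):

```python
import csv


def load_e2e_csv(path):
    """Load an E2E-challenge CSV file into parallel lists.

    Assumes a header row with an 'mr' column and, for the training
    set, a 'ref' column; testset.csv may lack 'ref', in which case
    empty strings are returned for the references.
    """
    mrs, refs = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            mrs.append(row["mr"])
            refs.append(row.get("ref", ""))
    return mrs, refs
```

Using `csv.DictReader` keeps the loader robust to column reordering, and the quoted commas inside meaning representations are handled correctly by the `csv` module.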
Finally, the quality of the code (readability, documentation, and optimization) will earn bonus or malus points. Students are also encouraged to provide a short description of their implementation along with the code.