DATE15/EDA132 - Spring 2009
(Applied) Artificial Intelligence
Assignment #5: Dependency parsing using machine learning techniques
The objectives of this assignment are to:
- Write Nivre's parser with a guiding predicate that parses an annotated dependency graph
- Learn parsing actions from an annotated corpus
- Extract feature vectors and train a classifier
- Write a statistical dependency parser
- Understand how to design parameter sets
Organization
Each group will have to:
- Write and train a machine learning program to parse dependencies
- Use different parameter sets
- Evaluate the results on a corpus and comment on them briefly
Programming
This assignment is inspired by the shared task of the Tenth Conference on Computational Natural Language Learning, CONLL-X, and uses similar data. The conference site contains a description of multilingual dependency parsing, reference papers, training and test sets for a variety of languages, as well as evaluation programs. See also CONLL 2007, on the same topic.
In this session, you will implement and test a dependency parser for Swedish using machine learning techniques. You can optionally report your results for other available corpora.
Choosing a training set and a test set
The CONLL-X annotated corpora and annotation scheme are available here. The Swedish corpus called Talbanken was originally collected and annotated in Lund and modified by Joakim Nivre. Read details on the corpus and references here.
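As a rough illustration of the data layout, the sketch below reads CONLL-X files, which store one token per line in ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences. The function name read_conllx and the three-word sample sentence are invented for this example and are not taken from Talbanken or from the supplied programs.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CONLL-X format, in file order.
CONLLX_COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
                  "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(CONLLX_COLUMNS, fields)))
    if sentence:
        yield sentence


# Invented sample sentence, for illustration only.
sample = "\n".join([
    "\t".join(["1", "Jag", "_", "PO", "PO", "_", "2", "SS", "_", "_"]),
    "\t".join(["2", "ser", "_", "VV", "VV", "_", "0", "ROOT", "_", "_"]),
    "\t".join(["3", "hunden", "_", "NN", "NN", "_", "2", "OO", "_", "_"]),
    "",
])
sentences = list(read_conllx(sample.split("\n")))
heads = [int(token["head"]) for token in sentences[0]]
```

To read a real corpus file, pass an open file object instead of the list of sample lines.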
Parsing an annotated corpus (Gold standard parsing)
For each sentence with a projective dependency graph, there is an action sequence that enables Nivre's parser to generate this graph. Gold standard parsing corresponds to the sequence of parsing actions, left-arc (la), right-arc (ra), shift (sh), and reduce (re) that produces the manually-obtained, gold standard graph.
- Discuss how to extend Nivre's parser to carry out gold standard parsing. Given a manually-annotated dependency graph, what are the conditions on the stack and the current input list (the queue) to execute left-arc, right-arc, shift, or reduce? Start with left-arc and right-arc, which are the simplest ones.
- Read and run this program to carry out gold standard parsing [1]. Use the -train option.
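One possible answer to the question above can be sketched as follows. This is a minimal Python stand-in, not the supplied Java program. It assumes an artificial root node 0 that starts on the stack, a list heads where heads[i] is the gold head of word i and heads[0] is -1, and a hypothetical function name gold_actions. The reduce condition used here (the top of the stack already has its head, and some word deeper in the stack is linked to the first word of the queue) is one common formulation; others exist.

```python
from typing import List, Tuple


def gold_actions(heads: List[int]) -> Tuple[List[str], List[int]]:
    """Derive a la/ra/sh/re sequence from gold heads (sketch, see assumptions above)."""
    n = len(heads) - 1
    stack = [0]                      # artificial root, never popped
    queue = list(range(1, n + 1))
    built = [-1] * (n + 1)           # heads attached so far, -1 means none
    actions: List[str] = []
    while queue:
        top, first = stack[-1], queue[0]
        if heads[top] == first:
            # left-arc: the first word of the queue is the head of the stack top
            built[top] = first
            stack.pop()
            actions.append("la")
        elif heads[first] == top:
            # right-arc: the stack top is the head of the first word of the queue
            built[first] = top
            stack.append(queue.pop(0))
            actions.append("ra")
        elif built[top] != -1 and any(
            heads[k] == first or heads[first] == k for k in stack[:-1]
        ):
            # reduce: top already has its head and hides a word linked to first
            stack.pop()
            actions.append("re")
        else:
            stack.append(queue.pop(0))
            actions.append("sh")
    return actions, built


# Two invented three-word examples.
actions_a, built_a = gold_actions([-1, 2, 0, 2])
actions_b, built_b = gold_actions([-1, 0, 1, 1])
```

Comparing the returned built list with the input heads gives a simple check: if they differ, the sentence could not be reproduced by this action set, which is expected for non-projective graphs.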
Extracting features
Action sequences can be learned from an annotated corpus or, more precisely, the next action can be learned from the parsing context. To be able to predict the next action, gold standard parsing must also extract a feature vector at each step of the parsing procedure. The simplest parsing context consists of the parts of speech of the word on top of the stack and of the first word in the input list. Once the data are collected, the training procedure will produce a 4-class classifier that you will embed in Nivre's parser. During parsing, Nivre's parser will call the classifier to choose the next action in the set {la, ra, sh, re} using the current context.
Modify the program to extract features [1]. The output file will use the ARFF format of the Weka machine-learning toolkit. You will have to write the extractFeatures() method in the ReferenceParser class and to complete the saveFeatures() method in the ARFFData class. The places where you need to add code are marked with a "TO DO" comment.
- As a first feature set, you will use a simple model: the top of the stack and the first word of the queue (input list).
- To complement the feature set, you can encode action constraints. Think of Boolean features.
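The two bullet points above can be sketched as a single feature-extraction function. This Python version is only an illustration of the idea behind extractFeatures(); the function name extract_features, the "nil" placeholder, the artificial root at index 0, and the use of -1 for "no head yet" are all assumptions of this sketch. The two Boolean features encode the usual preconditions of the parser: left-arc is only possible when the top of the stack has no head yet, and reduce only when it already has one.

```python
from typing import List


def extract_features(stack: List[int], queue: List[int],
                     postags: List[str], built: List[int]) -> List[object]:
    """Return [POS of stack top, POS of first queue word, can_la, can_re]."""
    top_pos = postags[stack[-1]] if stack else "nil"
    first_pos = postags[queue[0]] if queue else "nil"
    # Left-arc needs a real word (not the root) that has no head yet.
    can_la = bool(stack) and stack[-1] != 0 and built[stack[-1]] == -1
    # Reduce needs a stack top that already has a head.
    can_re = bool(stack) and built[stack[-1]] != -1
    return [top_pos, first_pos, can_la, can_re]


# Invented parser state: word 2 is on top of the stack and already has head 0.
postags = ["ROOT", "PO", "VV", "NN"]
features = extract_features([0, 2], [3], postags, [-1, 2, 0, -1])
empty_queue = extract_features([0], [], postags, [-1, -1, -1, -1])
```

During gold standard parsing, such a vector would be stored together with the gold action at every step, giving one training instance per parser transition.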
Training a first classifier
Learn the decision tree corresponding to your first data set and produce the corresponding model from your training file. You can use either the ID3 program you developed in a previous assignment or Weka.
If you use Weka:
- You will need to append a header to your data set. Here is an example corresponding to the simple feature set [2].
- You will load your data by selecting the Preprocess button.
- You will choose and create a classifier by pressing the Classify button and then the Choose button. Use the J48 decision tree classifier.
- You will save the model by right-clicking on the item in the Result list.
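For readers who have not seen an ARFF file, the sketch below builds one for the simple feature set. It is not the header from [2]: the relation name, the attribute names stack_pos and queue_pos, and the small tag inventory are invented for this example. An ARFF file starts with a @relation line, declares each attribute with @attribute (nominal attributes list their values in braces), and puts the comma-separated instances after @data. By default Weka treats the last attribute as the class, so the action is declared last.

```python
from typing import List, Sequence, Tuple


def arff_text(relation: str,
              attributes: Sequence[Tuple[str, Sequence[str]]],
              rows: Sequence[Sequence[str]]) -> str:
    """Build the text of an ARFF file with nominal attributes only (sketch)."""
    lines: List[str] = ["@relation " + relation, ""]
    for name, values in attributes:
        # Nominal attribute: the set of allowed values goes between braces.
        lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical attribute names and a toy tag set.
attributes = [
    ("stack_pos", ["nil", "ROOT", "PO", "VV", "NN"]),
    ("queue_pos", ["nil", "PO", "VV", "NN"]),
    ("action", ["la", "ra", "sh", "re"]),
]
rows = [["ROOT", "PO", "sh"], ["PO", "VV", "la"], ["VV", "NN", "ra"]]
text = arff_text("nivre", attributes, rows)
```

Every value that occurs in the data must be listed in the corresponding attribute declaration, so in practice the value lists are collected from the whole training corpus before the header is written.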
Parsing the corpus and evaluating the results
Once you have generated your first model, you will embed it in Nivre's parser and compute its accuracy.
- You will merge your ID3 program with the one supplied for this assignment or embed the class produced by Weka. You will have to write a GuideX class similar to Guide2 to be compatible with your extracted features and create the appropriate instance in the Nivre.java program. The place is marked with a TO DO comment.
- If you select Weka, you can use the WekaGlue class to interface your program (written by Richard Johansson).
- You will run the parser on the Swedish blind test set of the CONLL data contained in the data folder. Use the -parse option.
- You will measure the accuracy using the eval.pl program supplied in the distribution. This program compares the reference annotation of the test set with the one produced by your parser.
- You will compare your results with those of the other teams published here (unlabeled attachment scores).
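The way a guide drives the parser can be sketched as the loop below. This is a Python illustration, not the supplied Nivre.java: the function names parse and toy_guide are hypothetical, the guide is a hand-written lookup table standing in for a trained classifier, and the sketch assumes an artificial root at index 0 and the four-feature vector of the simple model plus two Boolean constraints. When the guide proposes an action whose precondition does not hold, this sketch falls back to shift, which is one possible design choice among several.

```python
from typing import Callable, List


def parse(postags: List[str], guide: Callable[[List[object]], str]) -> List[int]:
    """Parse one sentence, asking the guide for an action at each step (sketch)."""
    n = len(postags) - 1
    stack, queue = [0], list(range(1, n + 1))
    heads = [-1] * (n + 1)               # -1 means no head assigned
    while queue:
        top, first = stack[-1], queue[0]
        can_la = top != 0 and heads[top] == -1
        can_re = heads[top] != -1
        action = guide([postags[top], postags[first], can_la, can_re])
        if action == "la" and can_la:
            heads[top] = first
            stack.pop()
        elif action == "re" and can_re:
            stack.pop()
        elif action == "ra":
            heads[first] = top
            stack.append(queue.pop(0))
        else:
            # "sh", or a proposed action that is not allowed in this state
            stack.append(queue.pop(0))
    return heads


def toy_guide(features: List[object]) -> str:
    """Stand-in for a trained classifier: a tiny hand-written table."""
    table = {("ROOT", "VV"): "ra", ("PO", "VV"): "la", ("VV", "NN"): "ra"}
    return table.get((features[0], features[1]), "sh")


predicted = parse(["ROOT", "PO", "VV", "NN"], toy_guide)
```

Words that still have no head when the queue is empty keep the value -1 here; a real parser has to decide how to attach them before writing its output.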
Extracting more features
You will now extract more complex parameter sets and improve the accuracy of your parser.
- You will modify the program to extract more features.
- Using your ID3 program or Weka, learn the decision tree corresponding to improved data sets and generate new models.
- Rerun the programs from the previous section "Parsing the corpus and evaluating the results" to obtain the best results you can. The objective is to be as close as possible to the state of the art.
- You will need to report an unlabeled attachment score greater than 70 to pass the assignment.
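The unlabeled attachment score is the percentage of words that receive the correct head. A minimal sketch of the computation is given below; the function name is hypothetical, and it counts every token. The eval.pl script may apply additional conventions, for instance in how punctuation is counted, so its figures can differ from this sketch and remain the reference for the assignment.

```python
from typing import List, Sequence


def unlabeled_attachment_score(gold: Sequence[Sequence[int]],
                               predicted: Sequence[Sequence[int]]) -> float:
    """Percentage of tokens whose predicted head equals the gold head (sketch)."""
    if len(gold) != len(predicted):
        raise ValueError("different number of sentences")
    total = correct = 0
    for gold_heads, predicted_heads in zip(gold, predicted):
        if len(gold_heads) != len(predicted_heads):
            raise ValueError("sentence lengths differ")
        for g, p in zip(gold_heads, predicted_heads):
            total += 1
            correct += int(g == p)
    return 100.0 * correct / total if total else 0.0


# Invented heads for two short sentences: 4 of 5 tokens are correct.
gold: List[List[int]] = [[2, 0, 2], [0, 1]]
predicted: List[List[int]] = [[2, 0, 1], [0, 1]]
score = unlabeled_attachment_score(gold, predicted)
```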
Complement (Optional)
Read the text Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines by Joakim Nivre et al. (2006) [pdf]. Read the slides here.