15th NODALIDA, Joensuu, May 20-21, 2005

Sven Karlsson, Pierre Nugues: Writing assistance using language models derived from the web

Collocations are frequent word associations. Awareness of them is of primary importance in language teaching and in helping writers produce 'natural' phrases. Collocations are usually extracted from corpora using a set of statistical measures.

The World Wide Web is the largest collection of texts ever made available. For users, the easiest and preferred access to these texts is through search engines, which have collected them, indexed their words, and classified them. Search engines let users find specific texts by typing words or phrases they contain. Some of the engines provide Application Programming Interfaces (APIs) that enable programmers to carry out more sophisticated queries.

This article describes the integration of collocation measurement methods into a text processor. The resulting program uses the web as a corpus and is intended to assist non-native writers. It carries out corpus queries through the Google API and flags unexpected word associations. The visual display of the text has been designed to make it easy to compare the different measurements.

Spelling and grammar checkers are now ubiquitous in text processors, and hundreds of millions of people use them every day. Spelling checkers are based on computerized dictionaries and remove most misspellings that occur in documents. Grammar checkers are generally based on syntactic parsers and use rules to detect common grammar and style errors. Although not perfect, grammar checkers have improved considerably in recent years.

However, text critiquing is not just about getting one word right. It is also about getting the words in the right order and in the right pairs. Let's consider the phrasal verb {\em depends on} and the verb-preposition pair {\em consists of}.

These phrases are easy and natural to native speakers. They are frequent too. A Google query on September 23, 2004, reports 6,040,000 hits for {\em depends on} and 5,060,000 for {\em consists of}. However, the association of a verb and a preposition is not as easy for non-native speakers. For {\em consist}, the {\em Oxford Advanced Learner's Dictionary} as well as the {\em Longman Dictionary of Contemporary English} list two possible prepositions after it: {\em of} as in {\em It consists of three parts} and {\em in} as in {\em Living consists in working}. Using Google again, we can find other pairs: {\em It consists on}, popular on Spanish sites, {\em It consists about}, from Eastern European sites, and {\em It consists at}. These are far less numerous than the correct expression: 2,370, 48, and 379 hits, respectively, compared with 940,000 for {\em It consists of}.
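
The comparison above can be sketched in a few lines of Python. The counts are the ones quoted in the text; the function names and the 1% threshold are assumptions made for illustration and are not taken from the described system.

```python
# Hit counts quoted in the text for "It consists <preposition>" (Google, 2004).
counts = {"of": 940_000, "on": 2_370, "about": 48, "at": 379}


def relative_shares(counts):
    """Return each variant's share of the total hit count."""
    total = sum(counts.values())
    return {prep: n / total for prep, n in counts.items()}


def rare_variants(counts, threshold=0.01):
    """Return the variants whose share is below the threshold.

    The threshold is a hypothetical cut-off chosen for this sketch.
    """
    shares = relative_shares(counts)
    return sorted(prep for prep, share in shares.items() if share < threshold)


if __name__ == "__main__":
    for prep, share in relative_shares(counts).items():
        print(f"It consists {prep}: {share:.4%}")
    print("Candidates to flag:", rare_variants(counts))
```

With these counts, every variant except {\em of} falls well below the assumed threshold, which matches the intuition stated in the paragraph.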

Word ordering is also a problem for non-native speakers. The phrase {\em It depends generally on} does not sound as natural as {\em It generally depends on}. Google confirms our intuition and lists 817 occurrences of the second one vs. 20 of the first one. The first three sites to use the adverb between the verb and the preposition are: {\tt www.finances.gouv.fr}, {\tt www.hinduunity.org} and {\tt www-li5.ti.uni-mannheim.de}. All non-natives...

Using the Web
A sentence can be grammatical yet not fit the way native speakers would formulate it. Wrong constructions that come to mind can be due to an unnatural word order, homophones, and bad verb-particle or verb-preposition pairs. These are problems that regular language checkers fail to address with a lexical-syntactic approach. They seem to be easier to solve with a statistical approach and a large corpus.

The Web gives us access to huge corpora in a number of languages. We designed and implemented a writing assistant that uses it as a reference for correct collocation usage. We wrote a simple text processor in which we focused on verb-particle and verb-preposition pairs. The processor measures the relevance of a pair by computing its collocation strength. We derive statistics from the Web, which we access from the typing window using the Google API. When an association is deemed not relevant, we flag it by enlarging the font of the words or using a different color.

Measuring Collocations
The measurements used in this article are unigrams/bigrams, bigrams/trigrams, mutual information, T-scores, and log likelihood.

These five methods range from very simple to relatively complex. Their common feature is the use of unigram, bigram, and trigram counts. In all the methods, we used the scores and the resulting ranking and discarded the level of significance.
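
As an illustration, three of these measures can be computed from raw counts as in the Python sketch below. It uses the standard textbook definitions of pointwise mutual information, the t-score, and Dunning's log-likelihood ratio, which may differ in detail from the variants used in the program. The two count-ratio measures are left out because the text does not define them. The corpus size `n` is an assumption: the size of the web is not known, so any value passed here is an estimate.

```python
import math


def mutual_information(c1, c2, c12, n):
    """Pointwise mutual information (base 2) of a word pair.

    c1, c2: unigram counts; c12: bigram count (must be > 0);
    n: assumed corpus size.
    """
    return math.log2(c12 * n / (c1 * c2))


def t_score(c1, c2, c12, n):
    """T-score: (observed - expected) / sqrt(observed); needs c12 > 0."""
    expected = c1 * c2 / n
    return (c12 - expected) / math.sqrt(c12)


def log_likelihood(c1, c2, c12, n):
    """Dunning's log-likelihood ratio (G^2) over the 2x2 contingency table."""
    # Cells: (w1, w2), (w1, not w2), (not w1, w2), (not w1, not w2).
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    row_totals = [c1, c1, n - c1, n - c1]
    col_totals = [c2, n - c2, c2, n - c2]
    g2 = 0.0
    for o, r, c in zip(observed, row_totals, col_totals):
        if o > 0:  # an empty cell contributes nothing to the sum
            g2 += o * math.log(o * n / (r * c))
    return 2.0 * g2


if __name__ == "__main__":
    # Made-up counts, used only to exercise the functions.
    c1, c2, c12, n = 100, 100, 50, 10_000
    print(mutual_information(c1, c2, c12, n))
    print(t_score(c1, c2, c12, n))
    print(log_likelihood(c1, c2, c12, n))
```

All three scores are zero when the bigram count equals the count expected under independence, and grow as the pair co-occurs more often than chance would predict. Since only the scores and their ranking are used, no significance threshold is applied.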

The implementation consists of three parts: the web connection (Google API), the calculations, and the GUI/display. Google provides a SOAP API, which makes it possible to access the search engine from a program and gives an easy way to retrieve the results. A search via the API should give the same results as a normal web search.

Once we have obtained the collocation measures, we display the typed text using fonts of different sizes so that unexpected associations are shown either in large or small characters depending on the scale. The scaling model can be set in different ways: linear, logarithmic, or preset scale. The scales can either be self-adjusting or have a fixed minimum and maximum.
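
The scaling step could look roughly like the following Python sketch. It is not the program's actual code: the function names, the point sizes, and the choice to map higher scores to larger fonts are assumptions, and the preset scale is not shown. A fixed scale passes explicit bounds; a self-adjusting scale takes them from the scores of the current text.

```python
import math


def font_size(score, lo, hi, min_pt=8.0, max_pt=24.0, mode="linear"):
    """Map a collocation score in [lo, hi] to a font size in points."""
    if hi <= lo:
        return (min_pt + max_pt) / 2  # degenerate range: use a neutral size
    s = min(max(score, lo), hi)  # clamp the score to the range
    if mode == "linear":
        frac = (s - lo) / (hi - lo)
    elif mode == "log":
        # log1p keeps the argument of the logarithm at or above 1
        frac = math.log1p(s - lo) / math.log1p(hi - lo)
    else:
        raise ValueError(f"unknown scaling mode: {mode}")
    return min_pt + frac * (max_pt - min_pt)


def self_adjusting_sizes(scores, **kwargs):
    """Scale the scores using their own minimum and maximum as bounds."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    return [font_size(s, lo, hi, **kwargs) for s in scores]


if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0]  # made-up scores for three word pairs
    print(self_adjusting_sizes(scores))              # self-adjusting, linear
    print(self_adjusting_sizes(scores, mode="log"))  # self-adjusting, log
    print([font_size(s, 0.0, 10.0) for s in scores]) # fixed bounds
```

To show unexpected associations in large rather than small characters, the mapping can be inverted by swapping `min_pt` and `max_pt`.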

Although we did not conduct extensive experiments and testing, we could flag unnatural associations in texts and display them so that the writer can revise them. As a preliminary conclusion, this shows that it is possible to use web tools to check the usage of pairs of words. We believe that collocation measures derived on-line and in real time from the web in the way we have presented can be a useful help to non-native writers.

NB. This work was done while Sven Karlsson was a student at Lund University.
