Master Thesis presentation: "Using Material From Internet as a Corpus"

Date: October 05, 2009 (Monday) at 13:15

Christan Larsson presents his Master Thesis "Using Material From Internet as a Corpus".

Abstract
The purpose of this master’s thesis was to create a large corpus of the Swedish language
by using the Internet as a source. There are only small corpora available today, and to
bring language research forward, it is important to have a big Swedish corpus.
Using material from the Internet as a corpus requires working through a number of steps:
• Download a large number of web pages from the Internet
• Extract the text from the web pages
• Identify Swedish text and remove all text that isn’t Swedish
• Divide the text into sentences
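
The last three steps above can be illustrated with a minimal Python sketch. It is not the
thesis implementation: the stopword list, the 0.2 threshold, and the function names
extract_text, looks_swedish, and split_sentences are assumptions made for illustration.
The download step is left out; it is assumed that the HTML is already available as a string.

```python
import re
from html.parser import HTMLParser

# Hypothetical, very small stopword list used only for this illustration.
SWEDISH_STOPWORDS = {
    "och", "att", "det", "som", "en", "är", "på", "för",
    "med", "inte", "av", "till", "den", "har", "jag",
}


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def looks_swedish(text, threshold=0.2):
    """Crude heuristic: share of tokens that are common Swedish words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in SWEDISH_STOPWORDS)
    return hits / len(words) >= threshold


def split_sentences(text):
    """Naive split after ., ! or ? when followed by an uppercase letter."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-ZÅÄÖ])", text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

A real pipeline would need a more robust language identifier and sentence splitter; for
example, this naive splitter breaks on abbreviations that are followed by a capitalized word.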
Conclusion: This thesis resulted in two corpora, one built by letting a crawler download
web pages, and one built by parsing the Wikipedia XML dump. The Wikipedia corpus is of
high quality and contains 2 686 698 sentences, corresponding to 44 395 946 words. The
crawled corpus contains 8 342 918 sentences, corresponding to 119 500 499 words. It is
considered to be of lower quality, since it may contain some entries with foreign text,
and some of its sentences may not be real sentences. Unigram, bigram, and trigram
statistics were also created, giving the frequencies of word sequences in both corpora.
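
As a rough illustration of unigram, bigram, and trigram statistics, the following Python
sketch counts word sequences of length n in a list of sentences. The tokenizer, the
lowercasing, and the choice not to let n-grams cross sentence boundaries are assumptions
made here; the announcement does not say how the thesis handled these details.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens (an assumed scheme)."""
    return re.findall(r"\w+", sentence.lower())


def ngram_counts(sentences, n):
    """Count n-grams per sentence; n-grams never span two sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    sample = ["En bok och en bok.", "En bok."]
    for n in (1, 2, 3):
        print(n, ngram_counts(sample, n).most_common(3))
```

For corpora of the sizes reported above, the counts would probably need to be accumulated
in a streaming fashion or on disk rather than in a single in-memory Counter.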

Room: E:4130

Last modified Dec 9, 2011 12:57 pm


Page Manager: Jonas Wisbrant
Publisher: Department of Computer Science
