Capturing Out-of-Vocabulary Words in Arabic Text

Date

2006-7

Type

Conference paper

Conference title

Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Author(s)

Abdusalam Alfitory Ahmad Nwesri
S.M.M. Tahaghoghi
Falk Scholer

Pages

258 - 266

Abstract

The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words, where terms of one language appear transliterated in another. Dealing with such out-of-vocabulary words is essential for successful cross-lingual information retrieval. For example, techniques such as stemming should not be applied indiscriminately to all words in a collection, so foreign words need to be identified before any stemming takes place. In this paper, we investigate three approaches for the identification of foreign words in Arabic text: lexicons, language patterns, and n-grams. Our results show that lexicon-based approaches outperform the other techniques.
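As a rough illustration of the lexicon-based idea named in the abstract, the following Python sketch flags any token that is absent from a list of known Arabic words. It is not the authors' implementation: the function name `find_foreign_words`, the three-word `TOY_LEXICON`, and the exact-match lookup are all assumptions made for this example, and a realistic system would need a large lexicon and some handling of affixes and spelling variants.

```python
# Minimal sketch of lexicon-based foreign-word detection.
# Assumption: a token is treated as foreign if it does not appear
# verbatim in the lexicon. The lexicon below is a toy, hypothetical list.

TOY_LEXICON = {
    "كتاب",   # "book"
    "مدرسة",  # "school"
    "بيت",    # "house"
}


def find_foreign_words(tokens, lexicon):
    """Return the tokens that are not present in the lexicon, in order."""
    return [token for token in tokens if token not in lexicon]


if __name__ == "__main__":
    # "كمبيوتر" is a transliteration of the English word "computer".
    sample = ["كتاب", "كمبيوتر", "بيت"]
    for word in find_foreign_words(sample, TOY_LEXICON):
        print(word)
```

Under these assumptions, flagged words could be excluded from stemming, which is the use case the abstract mentions.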
