1 Introduction

The tension between symbolic and statistical methods has been apparent in natural language processing (NLP) for some time. Though some believe that the statistical methods have rendered linguistic analysis unnecessary, this is in fact not the case. Modern statistical NLP is crying out for better language models (Charniak 2001). At the same time, while ‘deep’ (linguistically precise) processing has now crossed the industrial threshold (Oepen et al. 2000) and serves as the basis for ongoing product development in a number of application areas (e.g. email autoresponse), it is widely recognized that deep analysis must come to grips with two key problems, if linguistically precise NLP is to become a reality.

The first of these is disambiguation. Paradoxically, linguistic precision is inversely correlated with the degree of sentence ambiguity. This is a fact of life encountered by every serious grammar development project. Knowledge representation, once thought to hold the key to the problem of disambiguation, has largely failed to provide completely satisfactory solutions. Most research communities we are aware of that are currently developing large-scale, linguistically precise computational grammars are now exploring the integration of stochastic methods for ambiguity resolution.

The second key problem facing the deep processing program, the problem of multiword expressions, is underappreciated in the field at large. There is insufficient ongoing work investigating the nature of this problem or seeking computationally tractable techniques that will contribute to its solution.

We define multiword expressions (MWEs) very roughly as “idiosyncratic interpretations that cross word boundaries (or spaces)”. As Jackendoff (1997:156) notes, the magnitude of this problem is far greater than has traditionally been realized within linguistics. He estimates that the number of MWEs in a speaker’s lexicon is of the same order of magnitude as the number of single words. In fact, it seems likely that this is an underestimate, even if we only include lexicalized phrases. In WordNet 1.7 (Fellbaum 1999), for example, 41% of the entries are multiword. For a wide-coverage NLP system, this is almost certainly an underestimate. Specialized domain vocabulary, such as terminology, overwhelmingly consists of MWEs, and a system may have to handle arbitrarily many such domains. As each new domain adds more MWEs than simplex words, the proportion of MWEs will rise as the system adds vocabulary for new domains.

MWEs appear in all text genres and pose significant problems for every kind of NLP. If MWEs are treated by general, compositional methods of linguistic analysis, there is first an overgeneration problem. For example, a generation system that is uninformed about both the patterns of compounding and the particular collocational frequency of the relevant dialect would correctly generate telephone booth (American) or telephone box (British/Australian), but might also generate such perfectly compositional, but unacceptable, examples as telephone cabinet, telephone closet, etc. A second problem for this approach is what we will call the idiomaticity problem: how to predict, for example, that an expression like kick the bucket, which appears to conform to the grammar of English VPs, has a meaning unrelated to the meanings of kick, the, and bucket.
Syntactically-idiomatic MWEs can also lead to parsing problems, due to nonconformance with patterns of word combination as predicted by the grammar (e.g. the determinerless in line).
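
The overgeneration problem described above can be made concrete with a minimal sketch. It is an illustration only, not an implementation from this paper: the names `ENCLOSURE_NOUNS`, `ATTESTED`, `generate_compositional` and `generate_informed` are hypothetical, and the tiny lexicon contains only the examples mentioned in the text. A real system would presumably rely on collocational frequencies rather than a hand-written set.

```python
# Hypothetical toy lexicon: a purely compositional rule treats all of these
# head nouns as interchangeable (assumption made for illustration only).
ENCLOSURE_NOUNS = ["booth", "box", "cabinet", "closet"]

# Attested compounds per dialect, restricted to the examples in the text.
ATTESTED = {
    "American": {"telephone booth"},
    "British/Australian": {"telephone box"},
}


def generate_compositional(modifier, heads):
    """Combine the modifier with every candidate head noun."""
    return [f"{modifier} {head}" for head in heads]


def generate_informed(modifier, heads, dialect):
    """Keep only the candidates that are attested in the given dialect."""
    candidates = generate_compositional(modifier, heads)
    return [c for c in candidates if c in ATTESTED[dialect]]


candidates = generate_compositional("telephone", ENCLOSURE_NOUNS)

# Candidates that no dialect in the toy lexicon attests are overgenerated.
overgenerated = [
    c for c in candidates
    if not any(c in compounds for compounds in ATTESTED.values())
]
```

Here the uninformed generator proposes all four compounds, while the dialect-aware filter keeps only the attested one; the two remaining candidates (telephone cabinet, telephone closet) are exactly the overgenerated forms discussed above.
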

Many have treated MWEs simply as words-with-spaces, an approach with serious limitations of its own. First, this approach suffers from a flexibility problem. For example, a parser that lacks sufficient knowledge of verb-particle constructions might correctly assign look up the tower two interpretations (“glance up at the tower” vs. “consult a reference book about the tower”), but fail to treat the subtly different look the tower up as unambiguous (“consult a reference book ...” interpretation only). As we will show, MWEs vary considerably with respect to this and other kinds of flexibility. Finally, this simple approach to MWEs suffers from a lexical proliferation problem. For example, light verb constructions often come in families, e.g. take a walk, take a hike, take a trip, take a flight. Listing each such expression results in considerable loss of generality and lack of prediction.

Many current approaches are able to get commonly attested MWE usages right, but they use ad hoc methods to do so, e.g. preprocessing of various kinds and stipulated, inflexible correspondences. As a result, they handle variation badly, fail to generalize, and result in systems that are quite difficult to maintain and extend.
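
The flexibility and lexical proliferation problems of the words-with-spaces approach can be sketched as follows. This is a minimal illustration under stated assumptions, not a technique proposed in this paper: the lexicon `WORDS_WITH_SPACES`, the noun list `LIGHT_VERB_NOUNS`, and the functions `find_fixed_mwes` and `find_verb_particle` are hypothetical, tokenization is plain whitespace splitting, and the gap-based matcher is only a crude stand-in for a real analysis of verb-particle constructions.

```python
# Hypothetical noun family for the light verb "take" (examples from the text).
LIGHT_VERB_NOUNS = ["walk", "hike", "trip", "flight"]

# Words-with-spaces lexicon: every MWE is one fixed string, so each member
# of the light-verb family needs its own entry (lexical proliferation).
WORDS_WITH_SPACES = {"look up"} | {f"take a {noun}" for noun in LIGHT_VERB_NOUNS}


def find_fixed_mwes(sentence, lexicon):
    """Return lexicon entries that occur as contiguous token sequences."""
    tokens = sentence.lower().split()
    found = []
    for entry in sorted(lexicon):
        entry_tokens = entry.split()
        n = len(entry_tokens)
        if any(tokens[i:i + n] == entry_tokens
               for i in range(len(tokens) - n + 1)):
            found.append(entry)
    return found


def find_verb_particle(sentence, verb, particle, max_gap=3):
    """Match verb ... particle with at most max_gap intervening tokens."""
    tokens = sentence.lower().split()
    for i, token in enumerate(tokens):
        if token == verb:
            # The particle may directly follow the verb or come after
            # up to max_gap other tokens.
            window = tokens[i + 1:i + 2 + max_gap]
            if particle in window:
                return True
    return False
```

With this toy lexicon, `find_fixed_mwes("look up the tower", WORDS_WITH_SPACES)` finds the entry look up, but the same call on look the tower up finds nothing, because the particle is no longer adjacent to the verb. The gap-tolerant `find_verb_particle` matches both orders, at the price of having no notion of which order allows which interpretation; that distinction requires the kind of linguistic knowledge discussed in the text.
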

In this paper we hope to have shown that MWEs, which we have classified in terms of lexicalized phrases (made up of fixed, semi-fixed and syntactically flexible expressions) and institutionalized phrases, are far more diverse and interesting than is standardly appreciated. Like the issue of disambiguation, MWEs constitute a key problem that must be resolved in order for linguistically precise NLP to succeed. Our goal here has been primarily to illustrate the diversity of the problem, but we have also examined known techniques: listing words with spaces, hierarchically organized lexicons, restricted combinatoric rules, lexical selection, idiomatic constructions, and simple statistical affinity. Although these techniques take us further than one might think, there is much descriptive and analytic work on MWEs that has yet to be done. Scaling grammars up to deal with MWEs will necessitate finding the right balance among the various analytic techniques. Of special importance will be finding the right balance between symbolic and statistical techniques.
