Producing robust components to process human language as part of applications software requires attention to the engineering aspects of their construction

1 Introduction

Producing robust components to process human language as part of application software requires attention to the engineering aspects of their construction. This paper reports work on GATE, an infrastructure for language processing software development that supports robust and predictable component development on several fronts:

• Separating low-level tasks, such as data storage, data visualisation, location and loading of components, and execution of processes, cleanly from the data structures and algorithms that actually process human language.

• Automating measurement of performance of language processing components.

• Reducing integration overheads by providing standard mechanisms for components to communicate data about language, and using open standards such as Java and XML as the underlying platform.

• Providing a baseline set of language processing components that can be extended and/or replaced by users as required.

The rest of the paper is structured as follows. We first describe the GATE architecture in Section 2, and then give details of some of the applications we have built using GATE in Section 3. Section 4 describes the processing resources available within GATE, while Section 5 describes the language resources. In Section 6 we discuss the mechanisms for evaluation. Finally, Section 7 puts this work in the context of some previous work and Section 8 discusses future directions.

2 A framework for robust tools and applications

GATE (Cunningham, 2002) is an architecture, a framework and a development environment for Language Engineering (LE). As an architecture, it defines the organisation of an LE system and the assignment of responsibilities to different components, and ensures that the component interactions satisfy the system requirements. As a framework, it provides a reusable design for an LE software system and a set of prefabricated software building blocks that language engineers can use, extend and customise for their specific needs. As a development environment, it helps its users to minimise the time they spend building new LE systems or modifying existing ones, by aiding overall development and providing a debugging mechanism for new modules. Because GATE has a component-based model, processors can be coupled and decoupled easily, which facilitates comparison of alternative configurations of the system or of different implementations of the same module (e.g., different parsers).

The availability of tools for easy visualisation of data at each point during the development process aids immediate interpretation of the results. The GATE framework comprises a core library (analogous to a bus backplane) and a set of reusable LE modules. The framework implements the architecture and provides (amongst other things) facilities for processing and visualising resources, including representation, import and export of data. The reusable modules provided with the backplane perform basic language processing tasks such as POS tagging and semantic tagging. This eliminates the need for users to keep recreating the same kinds of resources, and provides a good starting point for new applications. The modules are described in more detail in Section 4. Applications developed within GATE can be deployed outside its Graphical User Interface (GUI), using programmatic access via the GATE API. In addition, the reusable modules, the document and annotation model, and the visualisation components can all be used independently of the development environment.

GATE components may be implemented in a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may simply call the underlying program or provide an access layer to a database; alternatively, it may implement the whole component. In the rest of this section, we show how the GATE infrastructure takes care of resource discovery, loading, and execution, and briefly discuss data storage and visualisation.
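The two implementation styles mentioned above can be illustrated with a minimal sketch. It does not use the GATE API: the names Component, WrappedComponent and LowercaseComponent are hypothetical, and the "underlying program" is modelled as a plain function rather than a real external tool or database.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; none of these names come from GATE.
class WrapperSketch {

    // The single face that every component shows to the framework.
    interface Component {
        String process(String text);
    }

    // Style 1: a thin Java class that delegates to an existing tool.
    // The tool is modelled as a function; in practice it could be an
    // external program or a database access layer.
    static final class WrappedComponent implements Component {
        private final UnaryOperator<String> underlyingTool;

        WrappedComponent(UnaryOperator<String> underlyingTool) {
            this.underlyingTool = underlyingTool;
        }

        @Override
        public String process(String text) {
            return underlyingTool.apply(text);
        }
    }

    // Style 2: the Java class implements the whole component itself.
    static final class LowercaseComponent implements Component {
        @Override
        public String process(String text) {
            return text.toLowerCase(Locale.ROOT);
        }
    }

    public static void main(String[] args) {
        Component wrapped = new WrappedComponent(String::trim);
        Component direct = new LowercaseComponent();
        // The caller handles both components in exactly the same way.
        System.out.println(direct.process(wrapped.process("  Hello GATE  ")));
    }
}
```

Because callers depend only on the shared interface, a wrapped tool can later be swapped for a native Java implementation without changing the code that uses it.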

2.1 Algorithms + data + GUI = applications

The title expresses succinctly the distinction made in GATE between data, algorithms, and ways of visualising them. In other words, GATE components are one of three types:

• LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;

• ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or n-gram modellers;

• VisualResources (VRs) represent visualisation and editing components that participate in GUIs.
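As a rough illustration of the three resource types listed above, the following sketch keeps data, algorithms and visualisation in separate types. All class and interface names are hypothetical stand-ins rather than GATE's own classes, and the two processing resources are deliberately trivial.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the GATE API.
class ResourceTypesSketch {

    // A labelled span of text, e.g. characters 0..5 tagged "Token".
    static final class Annotation {
        final int start;
        final int end;
        final String type;
        Annotation(int start, int end, String type) {
            this.start = start; this.end = end; this.type = type;
        }
    }

    // Language resource: holds data only (text plus annotations).
    static final class Document {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();
        Document(String text) { this.text = text; }
    }

    // Processing resource: an algorithm that reads and updates a document.
    interface ProcessingResource {
        void execute(Document doc);
    }

    // Visual resource: renders a document without modifying it.
    interface VisualResource {
        String render(Document doc);
    }

    // Adds one "Token" annotation per whitespace-delimited word.
    static final class WhitespaceTokeniser implements ProcessingResource {
        @Override
        public void execute(Document doc) {
            int i = 0;
            int n = doc.text.length();
            while (i < n) {
                while (i < n && Character.isWhitespace(doc.text.charAt(i))) i++;
                int start = i;
                while (i < n && !Character.isWhitespace(doc.text.charAt(i))) i++;
                if (i > start) doc.annotations.add(new Annotation(start, i, "Token"));
            }
        }
    }

    // Tags tokens that begin with an upper-case letter; it relies on
    // "Token" annotations already being present in the document.
    static final class CapitalisedTagger implements ProcessingResource {
        @Override
        public void execute(Document doc) {
            List<Annotation> found = new ArrayList<>();
            for (Annotation a : doc.annotations) {
                if (a.type.equals("Token")
                        && Character.isUpperCase(doc.text.charAt(a.start))) {
                    found.add(new Annotation(a.start, a.end, "Capitalised"));
                }
            }
            doc.annotations.addAll(found);
        }
    }

    // Renders each annotation as "[Type text]" in insertion order.
    static final class BracketView implements VisualResource {
        @Override
        public String render(Document doc) {
            StringBuilder sb = new StringBuilder();
            for (Annotation a : doc.annotations) {
                sb.append('[').append(a.type).append(' ')
                  .append(doc.text, a.start, a.end).append(']');
            }
            return sb.toString();
        }
    }

    // Runs the processing resources in the given order over one document.
    static Document run(String text, List<ProcessingResource> pipeline) {
        Document doc = new Document(text);
        for (ProcessingResource pr : pipeline) {
            pr.execute(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<ProcessingResource> pipeline = new ArrayList<>();
        pipeline.add(new WhitespaceTokeniser());
        pipeline.add(new CapitalisedTagger());
        Document doc = run("GATE was built in Sheffield", pipeline);
        System.out.println(new BracketView().render(doc));
    }
}
```

In this sketch the view never touches the algorithms and the algorithms never touch the view, so either side can be replaced independently. The order of the pipeline matters: running the tagger before the tokeniser would produce no "Capitalised" annotations.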

These resources can be local to the user’s machine or remote (available via HTTP), and all can be extended by users without modification to GATE itself. One of the main advantages of separating the algorithms from the data they require is that the two can be developed independently by language engineers with different types of expertise, e.g. programmers and linguists. Similarly, separating data from its visualisation allows users to develop alternative visual resources, while still using a language resource provided by GATE.

Collectively, all resources are known as CREOLE (a Collection of REusable Objects for Language Engineering), and are declared in a repository XML file, which describes their name, implementing class, parameters, icons, etc. This repository is used by the framework to discover and load available resources. A parameters tag describes the parameters which each resource needs when created or executed. Parameters can be optional, e.g. if a document list is provided when the corpus is constructed, it will be populated automatically with these documents.

When an application is developed within GATE’s graphical environment, the user chooses which processing resources go into it (e.g. tokeniser, POS tagger), in what order they will be executed, and on which data (e.g. document or corpus). The execution parameters of each resource are also set there, e.g. a loaded document is given as a parameter to each PR. When the application is run, the modules will be executed in the specified order on the given document. The results can be viewed in the document viewer/editor.
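The idea of a repository file that declares each resource's name, implementing class and parameters can be sketched with the standard Java XML parser. The element and attribute names used here (repository, resource, parameter, optional) and the class names in the sample are invented for illustration; they are not the actual CREOLE file format.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical illustration only; the XML schema below is an assumption.
class RepositorySketch {

    static final class Parameter {
        final String name;
        final boolean optional;
        Parameter(String name, boolean optional) {
            this.name = name; this.optional = optional;
        }
    }

    static final class ResourceDescriptor {
        final String name;
        final String className;
        final List<Parameter> parameters;
        ResourceDescriptor(String name, String className, List<Parameter> parameters) {
            this.name = name; this.className = className; this.parameters = parameters;
        }
    }

    // A made-up repository declaring one language and one processing resource.
    static final String SAMPLE =
          "<repository>"
        + "<resource name=\"Corpus\" class=\"example.CorpusImpl\">"
        + "<parameter name=\"documentList\" optional=\"true\"/>"
        + "</resource>"
        + "<resource name=\"Tokeniser\" class=\"example.Tokeniser\">"
        + "<parameter name=\"document\"/>"
        + "</resource>"
        + "</repository>";

    // Reads every <resource> element into a descriptor.
    static List<ResourceDescriptor> parse(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<ResourceDescriptor> result = new ArrayList<>();
            NodeList resources = dom.getElementsByTagName("resource");
            for (int i = 0; i < resources.getLength(); i++) {
                Element r = (Element) resources.item(i);
                List<Parameter> params = new ArrayList<>();
                NodeList ps = r.getElementsByTagName("parameter");
                for (int j = 0; j < ps.getLength(); j++) {
                    Element p = (Element) ps.item(j);
                    // A missing "optional" attribute means the parameter is required.
                    params.add(new Parameter(p.getAttribute("name"),
                        Boolean.parseBoolean(p.getAttribute("optional"))));
                }
                result.add(new ResourceDescriptor(
                    r.getAttribute("name"), r.getAttribute("class"), params));
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot read repository XML", e);
        }
    }

    // Names of required parameters that the caller has not supplied.
    static List<String> missingRequired(ResourceDescriptor d, Set<String> supplied) {
        List<String> missing = new ArrayList<>();
        for (Parameter p : d.parameters) {
            if (!p.optional && !supplied.contains(p.name)) {
                missing.add(p.name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        for (ResourceDescriptor d : parse(SAMPLE)) {
            System.out.println(d.name + " -> " + d.className
                + ", missing: " + missingRequired(d, Set.of()));
        }
    }
}
```

Keeping the declaration in a data file rather than in code is what would let a framework discover new resources without being recompiled; a loader could then pass className to reflection once the required parameters are known to be present.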

3 Conclusions

In this paper we have described an infrastructure for language engineering software which aims to assist the development of robust tools and resources for NLP. One future direction is the integration of processing resources which learn in the background while the user is annotating corpora in GATE’s visual environment. Currently, statistical models can be integrated but need to be trained separately. We are also extending the system to handle language generation modules, in order to enable the construction of applications which require language production in addition to analysis, e.g. intelligent report generation from IE data.
