A Distributed System for Semantic Annotation of Text in OGSA-DAI

Student: Ming Zhang
Institution: School of Informatics, University of Edinburgh
Supervisor(s): Ewan Klein (School of Informatics)
Date: September 2006

Abstract:

Semantic annotation of written text makes useful semantic information (e.g., about entities and relations) formally explicit. It typically relies on natural language processing (NLP) tools to recognize entities and relations, or to perform more extensive parsing, and it is widely used in text mining to process online literature. A text mining process can be treated as a pipeline of cooperating NLP tools. In general, however, these tools may reside at different sites and be maintained as independent resources by different parties in a heterogeneous environment, so it is desirable to let them interoperate as distinct services. How to assemble them seamlessly and make them work efficiently in a distributed environment therefore becomes an important issue.

In this research project, I propose to integrate NLP annotation tools into OGSA-DAI in order to provide a distributed text mining system as a Grid service. The project takes as its starting point an existing geo-parsing web service that uses the LT-XML processing framework. The framework adopts a processing model in which XML documents are streamed through a series of Unix pipes, and each component modifies the XML mark-up of its input in specified ways. However, rather than wrapping the NLP components into a monolithic web service (as is currently done for the geo-parser), I will deploy them as OGSA-DAI data services using the Java Runtime.exec() method. The activities in an OGSA-DAI perform document will then be assembled into a pipeline in order to support complex data access tasks.
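
The processing model described above, in which each component is an external executable and the output of one stage is streamed into the next, can be illustrated with a minimal Java sketch. It is not the project's implementation and uses no OGSA-DAI API: the class name PipelineSketch, the helpers pump and run, and the two sed commands standing in for LT-XML components are all assumptions made for illustration. Only Runtime.exec() comes from the description above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: chains external executables started with Runtime.exec(),
// streaming each stage's stdout into the next stage's stdin.
class PipelineSketch {

    // Copies 'in' to 'out' on a background thread, then closes both so the
    // consuming process sees end-of-file on its stdin.
    static Thread pump(InputStream in, OutputStream out) {
        Thread t = new Thread(() -> {
            try (InputStream i = in; OutputStream o = out) {
                i.transferTo(o);
            } catch (IOException e) {
                // A downstream stage exited early; its exit code is checked in run().
            }
        });
        t.start();
        return t;
    }

    // Runs the stages in order over 'input' and returns the last stage's stdout.
    // Note: stderr is not drained here, so a stage that writes a lot to stderr
    // could block; a fuller version would pump stderr as well.
    static String run(List<String[]> stages, String input)
            throws IOException, InterruptedException {
        InputStream current =
            new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8));
        List<Process> processes = new ArrayList<>();
        for (String[] command : stages) {
            Process p = Runtime.getRuntime().exec(command);
            pump(current, p.getOutputStream()); // feed the previous output in
            current = p.getInputStream();       // this stage's output feeds the next
            processes.add(p);
        }
        String result =
            new String(current.readAllBytes(), StandardCharsets.UTF_8);
        for (Process p : processes) {
            if (p.waitFor() != 0) {
                throw new IOException("a pipeline stage exited with a non-zero status");
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in "annotators": the first marks up a place name, the second
        // adds a type attribute to the mark-up produced by the first.
        List<String[]> stages = new ArrayList<>();
        stages.add(new String[] {"sed", "s|Edinburgh|<enamex>Edinburgh</enamex>|"});
        stages.add(new String[] {"sed", "s|<enamex>|<enamex type=\"location\">|"});
        System.out.print(run(stages, "<doc>I live in Edinburgh.</doc>\n"));
    }
}
```

Each stage only sees XML on its standard input and writes modified XML to its standard output, so stages can be reordered or replaced without changing the others. The sketch requires Java 9 or later (for transferTo and readAllBytes) and a Unix-like system with sed on the path.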

Requirements for the proposed system include interoperability and modularity of the NLP components, easy deployment and configuration, and scalability to large document sets. System evaluation will compare the OGSA-DAI version of the system with two baselines: (i) the NLP pipeline running natively as Unix executables, and (ii) the pipeline running as a web service (or a composition of web services). System performance is expected to benefit from two features of OGSA-DAI. First, it avoids unnecessary communication between client and server, because the user can compose the NLP processes in a single perform document and submit it once. Second, it avoids unnecessary data movement, because the annotated output text can be streamed directly to the next NLP activity.