IS 2004: Information Integration

Institute for Informatics
Georg-August-Universität Göttingen

Information Systems 29(1), pp. 59-91, 2004.

A Uniform Framework for Integration of Information from the Web

Abstract:

We discuss a system that implements an integrated framework for Web exploration, wrapping, data integration, and querying. Here, "integration" applies in three respects: the data model, the functionality, and the architecture. The core of the approach is a unified framework (i.e., a data model and a language) in which all tasks are performed. We regard the Web and its contents as a unit, represented in a semi-structured, object-oriented data model: the Web structure (given by its hyperlinks), the parse trees of Web pages, and the page contents are all included in the internal world model of the system. Additionally, the application-level model is generated directly as an overlay of this source-level model. The model is complemented by a rule-based, object-oriented language that is extended with Web access capabilities and structured document analysis. This language is implemented by a central reasoning engine. The advantage of our unified approach is that the same data manipulation and query language can be used for all tasks, i.e., accessing Web pages, wrapping, data integration, and querying information. Thus, these tasks need not be separated, but can be closely intertwined. Additionally, because the source-level model is reused for generating the application-level model, there is no overhead for communication and mapping between different data formats. In particular, we present a methodology for reusing generic rule patterns for typical extraction, integration, and restructuring tasks. In an abstract sense, the system contains a universal wrapper, which can be applied to arbitrary Web pages that the system considers during information processing. Equipped with suitably intelligent rules, the system can potentially explore initially unknown parts of the Web, thus coping with the steady growth of the Web. We show the practicability of our approach by using the FLORID system.
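
The idea of keeping the Web structure, parse trees, and application-level data in one model can be illustrated with a minimal Python sketch. It is not FLORID and does not use its rule language; the in-memory PAGES table, the FactCollector class, the attribute names (child, tag, text, links_to, capital), and the capital_rule pattern are all hypothetical names chosen for illustration, and no network access takes place.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "Web": URL -> HTML text (assumption; no real fetching).
PAGES = {
    "http://example.org/index": '<a href="http://example.org/de">Germany</a>',
    "http://example.org/de": "<h1>Germany</h1><p>Capital: Berlin</p>",
}


class FactCollector(HTMLParser):
    """Turns one page into source-level facts (object, attribute, value)."""

    def __init__(self, url, facts):
        super().__init__()
        self.url, self.facts, self.stack, self.count = url, facts, [], 0

    def handle_starttag(self, tag, attrs):
        # Each element becomes an object in the same store as the page itself.
        node = f"{self.url}#n{self.count}"
        self.count += 1
        parent = self.stack[-1] if self.stack else self.url
        self.facts.add((parent, "child", node))
        self.facts.add((node, "tag", tag))
        for name, value in attrs:
            if tag == "a" and name == "href":
                # Hyperlinks are recorded as facts about the page (Web structure).
                self.facts.add((self.url, "links_to", value))
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Simplification: assumes well-nested markup.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and data.strip():
            self.facts.add((self.stack[-1], "text", data.strip()))


def explore(start, facts):
    """Follow hyperlink facts to pages not yet loaded (toy Web exploration)."""
    todo, seen = [start], set()
    while todo:
        url = todo.pop()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        parser = FactCollector(url, facts)
        parser.feed(PAGES[url])
        parser.close()
        todo.extend(v for (o, a, v) in facts if o == url and a == "links_to")
    return seen


def capital_rule(facts):
    """Toy extraction pattern: pair a page's <h1> text with its 'Capital: X' text."""
    prefix = "Capital: "
    derived = set()
    for (node, attr, text) in facts:
        if attr == "text" and text.startswith(prefix):
            url = node.split("#")[0]
            for (n2, a2, v2) in facts:
                if a2 == "tag" and v2 == "h1" and n2.startswith(url + "#"):
                    for (n3, a3, name) in facts:
                        if n3 == n2 and a3 == "text":
                            derived.add((name, "capital", text[len(prefix):]))
    return derived


facts = set()
visited = explore("http://example.org/index", facts)
# Application-level facts are added to the same store, as an overlay.
facts |= capital_rule(facts)
```

In this sketch the derived fact about Germany sits in the same set as the hyperlink and parse-tree facts it was derived from, so no conversion between a wrapper format and an integration format is needed. A real rule engine would evaluate such patterns declaratively rather than with nested loops.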