We discuss a system that implements an integrated framework for Web
exploration, wrapping, data integration, and querying. Here,
"integration" applies to three aspects: the data model, the
functionality, and the architecture. The core of the approach is a
unified framework, i.e., a data model and a language, in which all
tasks are performed. We regard the Web and its contents as a unit,
represented in a semi-structured, object-oriented data model: the
Web structure, given by its hyperlinks, the parse-trees of Web
pages, and its contents are all included in the internal world model
of the system. Additionally, the application-level model is
generated directly as an overlay of this source-level model. The
model is complemented by a rule-based object-oriented language that
is extended with Web access capabilities and structured document
analysis. This language is implemented by a central reasoning
engine.
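
To make the idea of a single world model concrete, the following is a minimal sketch in Python, not the language or implementation of our system. It assumes a simple store of (object, attribute, value) facts; the names PageToFacts, load_pages, and derive_titles, the attribute names, and the example.org pages are hypothetical and chosen only for illustration. Hyperlinks, parse-tree nodes, and text contents all land in the same store, and one rule derives an application-level fact as an overlay on it.

```python
from html.parser import HTMLParser

# Hypothetical sample pages; URLs and contents are illustrative assumptions.
PAGES = {
    "http://example.org/a":
        '<html><body><h1>Page A</h1>'
        '<a href="http://example.org/b">next</a></body></html>',
    "http://example.org/b":
        '<html><body><h1>Page B</h1></body></html>',
}


class PageToFacts(HTMLParser):
    """Turns one well-formed HTML page into (object, attribute, value) facts."""

    def __init__(self, url, facts):
        super().__init__()
        self.url = url
        self.facts = facts
        self.stack = [url]   # the page object is the root of its parse tree
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        self.counter += 1
        node = f"{self.url}#n{self.counter}"
        self.facts.add((self.stack[-1], "child", node))  # parse-tree edge
        self.facts.add((node, "tag", tag))
        self.facts.add((node, "page", self.url))
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.facts.add((self.url, "links_to", value))  # Web structure
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes properly nested tags; never pops the page root.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.facts.add((self.stack[-1], "text", text))  # page contents


def load_pages(pages):
    """Source-level model: one fact set for links, parse trees, and contents."""
    facts = set()
    for url, html in pages.items():
        PageToFacts(url, facts).feed(html)
    return facts


def derive_titles(facts):
    """Rule sketch: text directly inside an <h1> node becomes the page title."""
    headings = {s for (s, a, v) in facts if a == "tag" and v == "h1"}
    page_of = {s: v for (s, a, v) in facts if a == "page"}
    return {(page_of[s], "title", v)
            for (s, a, v) in facts if a == "text" and s in headings}


facts = load_pages(PAGES)
overlay = derive_titles(facts)  # application-level facts over the same objects
```

In this sketch the derived title facts refer to the same page objects as the source-level facts, so no separate mapping step between two formats is needed.
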
The advantage of our unified approach is that the
same data manipulation and query language can be used for all tasks,
i.e., accessing Web pages, wrapping, data integration, and querying
information. Thus, these tasks need not be separated, but
can be closely intertwined. Additionally, by reusing the
source-level model for generating the application-level model, there
is no overhead for communication and mapping between different data
formats.
In particular, we present a methodology for reusing
generic rule patterns for typical extraction, integration, and
restructuring tasks. In an abstract sense, the system contains a
universal wrapper, which can be applied to arbitrary Web pages that
the system considers during information processing. Equipped with
suitably intelligent rules, the system can potentially explore
initially unknown parts of the Web, thus coping with the steady
growth of the Web.
We show the practicability of our approach using the FLORID system.