Technical Report 136, Institut für Informatik, Universität Freiburg,
March 2000.
Information Extraction from the Web
Wolfgang May, Georg Lausen
Abstract:
The goal of information extraction from the Web is to provide an
integrated view of data from autonomous, heterogeneous information
sources. The main problem with current wrapper/mediator approaches is
that they rely on very different formalisms and tools for wrappers
and mediators, leading to an "impedance mismatch" between the
wrapper and mediator levels. Additionally, most current approaches
can access information only from a fixed set of sources. Generic Web
querying approaches, on the other hand, are restricted to purely
syntactic and structural queries and do not address semantic issues.
In this paper, we discuss an integrated architecture for Web
exploration, wrapping, mediation, and querying. Our system is based
on a unified framework - i.e., data model and language - in which
all tasks are performed. We regard the Web and its contents as a
unit, represented in an object-oriented data model: the Web
structure, given by its hyperlinks, the parse trees of Web pages,
and their contents are all included in the internal world model of
the system. The advantage of this unified view is that the same data
manipulation and querying language can be used for the Web structure
and the application-level model: the model is complemented by a
rule-based object-oriented language extended with Web access and
structured document analysis capabilities. Thus, accessing Web
pages, wrapping, mediating, and querying information can all be done
in the same language.
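As a rough illustration of this unified-model idea - a minimal sketch only, not the actual FLORID/F-Logic syntax, and with all names hypothetical - the same query primitive can serve both the hyperlink structure and the application-level content when everything lives in one object graph:

```python
# Hypothetical sketch: Web pages, their hyperlink structure, and
# extracted application-level facts all live in one object graph,
# so a single query mechanism serves structural and content queries
# alike. (Illustrative only -- not the FLORID / F-Logic language.)

class Obj:
    """A node in the world model: a Web page, a parse-tree node,
    or an application-level object such as a country."""
    def __init__(self, oid, **attrs):
        self.oid = oid
        self.attrs = attrs  # attribute name -> value or list of Obj

def query(objs, pred):
    """One generic query primitive over the whole model."""
    return [o for o in objs if pred(o)]

# Web structure: pages connected by hyperlinks.
p2 = Obj("page2", url="http://example.org/b", links=[])
p1 = Obj("page1", url="http://example.org/a", links=[p2])

# Application-level content "wrapped" from page2, represented
# in the *same* model as the pages themselves.
germany = Obj("germany", type="country", capital="Berlin", source=p2)

model = [p1, p2, germany]

# Structural query: which pages link to page2?
linking = query(model, lambda o: p2 in o.attrs.get("links", []))

# Content query, using the very same primitive.
countries = query(model, lambda o: o.attrs.get("type") == "country")
```

In the actual system, such queries would be expressed as rules of the object-oriented language rather than Python lambdas; the point of the sketch is only that one model and one query mechanism cover both levels.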
This integration also allows for data-driven Web exploration that is
independent of a given network of individual predefined wrappers and
mediators. Thus, in addition to the classical wrapper and mediator
functionality, a system with this architecture can be equipped with
Web navigation and exploration functionality. Queries to existing
Web indexing and search engines can also be integrated.
In particular, we present a methodology for reusing generic rule
patterns for typical extraction, integration, and restructuring
tasks using this framework. In an abstract sense, the system
contains a universal wrapper, which can be applied to arbitrary
Web pages that the system learns about during information
processing. Equipped with suitably intelligent rules, the system
can potentially explore initially unknown parts of the Web, thus
coping with the steady growth of the Web.
We show the practicability of our approach using the FLORID system
and illustrate it with two case studies.
Excerpts of this work have been published in
- Modeling and Querying Structure and Contents of the Web,
  International Workshop on Internet Data Management (IDM'99),
  Florence, Sept. 2, 1999,
- A Unified Framework for Wrapping, Mediating and Restructuring
  Information from the Web,
  International Workshop on the World-Wide Web and Conceptual Modeling
  (WWWCM'99), Paris, Nov. 15-18, 1999,
- An Integrated Architecture for Exploring, Wrapping, Mediating and
  Restructuring Information from the Web,
  Australasian Database Conference (ADC 2000), Canberra,
  Jan. 31 - Feb. 3, 2000.
- Slides have been presented at the Dagstuhl Seminar
  "Declarative Data Access on the Web",
  Sept. 12-17, 1999, Schloss Dagstuhl, Germany.
- The MONDIAL Case Study describes a practical application.
A journal version appeared in Information Systems, 2004.