Project Seminar (Projektseminar)
Web Data Integration and Data Management
Summer 2015
Technical Data
- Advanced Bachelor or Master/Diploma in
Applied Computer Science or Information Systems (Wirtschaftsinformatik)
- Prerequisites: Basic knowledge of e.g. XML and/or RDF
- 6 ECTS
- Number of participants: max. 16-20 (about 8-10 talks of 1 or 2 persons)
- Language: German and English are allowed. Reading English texts/documentation
is required.
Time Schedule
- first meeting at the beginning of the semester:
Monday 20.4. 14h c.t. SR 2.101, IFI: First Meeting
Assignment of topics and papers.
- April-June: preparation of case studies and presentations, individual meetings
- Registration/Deregistration in FlexNever is open until 30.6.
- Talks: the talks take place on Friday, August 7th:
- 11:00 Sven Jaeger: Hybrid OWL Reasoning - Pagoda
(= mapping OWL to Logic Programming as far as possible, and coupling
it for disjunction and Open World negation with an OWL reasoner)
Presentation
- Lunch Break
- 13:30 or 14:00 (to be determined then)
Azadeh Amiri and Dorna Amiri: DBpedia and YAGO. These are two
"large" RDF datasets that have been scraped from Wikipedia pages.
DBpedia came first, and YAGO then started as a
competing project.
DBpedia and YAGO - an Evaluation
Contents
There is a lot of data available on the Web and in the Semantic Web. Web data is usually
provided in human-readable form as Web pages (including forms, the so-called Deep Web),
which users cannot process in a database-style way. Data extraction,
e.g. from the CIA World Factbook or from Wikipedia, is thus a never-ending "hot topic".
Besides pattern-based approaches, Natural Language Processing approaches are also
used.
The Semantic Web (cf. lecture Semantic Web) attempts
to provide, extend and/or annotate Web data in a machine-readable way.
For this, the RDF data format is used, together with the OWL ontology language for
describing metadata.
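To illustrate what "machine-readable" and "database-style" mean here, the following is a minimal sketch in plain Python (deliberately not using any RDF library). It models RDF-like data as subject-predicate-object triples and answers a simple pattern query. The `ex:` identifiers, the `ex:capital` predicate and the `match` helper are hypothetical and chosen only for illustration.

```python
# Facts as (subject, predicate, object) triples, the basic shape of RDF data.
# All identifiers below are made up for illustration.
triples = [
    ("ex:Germany", "rdf:type", "ex:Country"),
    ("ex:Germany", "ex:capital", "ex:Berlin"),
    ("ex:France", "rdf:type", "ex:Country"),
    ("ex:France", "ex:capital", "ex:Paris"),
]


def match(pattern, data):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in data
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# "Database-style" access: ask for every capital without parsing any Web page.
capitals = {s: o for s, _, o in match((None, "ex:capital", None), triples)}
print(capitals)
```

Real RDF stores answer such patterns with SPARQL; the sketch only shows why triples are easier to query than human-readable pages.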
Potential Topics
- RDF database generated from Wikipedia: DBpedia
Comment: describe the project, evaluate its public interface and data quality, and describe how its data is
obtained and stored, etc.
- RDF database/ontology from Wikipedia and GeoNames: YAGO/YAGO2
Comment: describe the project, evaluate its public interface and data quality, and describe how its data is
obtained and stored, etc.
Starting Paper: Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum:
YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. WWW 2007: 697-706
- Web Scraping via Browser Automation with Selenium
Comment: Evaluate, do a case study, and describe. Neither XML nor RDF knowledge is required,
but practical competence is obviously needed.
- Web Scraping with OXPath/Diadem (Oxford Univ.)
Comment: Evaluate, do a case study, and describe. Based on XPath/XML. No RDF knowledge is required, but XML/XPath
knowledge is.
- Pagoda (Oxford Univ.): PAGOdA exploits a hybrid approach to answering conjunctive queries over OWL 2 ontologies
that combines a datalog reasoner with a fully-fledged OWL 2 reasoner.
Pagoda Project
- The "Property-Graph" Data Model and the "Gremlin" query language.
The Property-Graph Data Model extends RDF such that edges can also have properties (cf. the problem
of reification of annotated edges in RDF, like "Russia is located (20%) in Europe"). Gremlin
is a semi-declarative query language (cf. XQuery, which is declarative but looks imperative;
Gremlin is actually imperative, but reads in a very declarative way). Gremlin is e.g.
implemented in the Neo4j, OrientDB, and TinkerGraph databases.
- A theoretical paper that presents a true algebra over RDF data.
Leonid Libkin, Juan L. Reutter, Domagoj Vrgoc:
Trial for RDF: adapting graph query languages for RDF data. PODS 2013: 201-212
- Open World, Incompleteness and Negation.
Papers:
Boris Motik, Ian Horrocks, Ulrike Sattler: Bridging the Gap between OWL and Relational
Databases. WWW 2007.
Alon Levy: Obtaining Complete Answers from Incomplete Databases. VLDB 1996.
- RDF Annotations in HTML: RDFa
Note: Papers can be found via the DBLP
http://www.dblp.org
(originally, DBLP meant "Databases and Logic Programming", but by now it covers all topics in
Computer Science),
or simply by searching for the paper title with Google (this often yields the PDF directly).
A list of other papers by the same authors can then be found via DBLP.
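As a small illustration for the Property-Graph topic above, the following sketch contrasts the two ways of representing the annotated edge "Russia is located (20%) in Europe". It uses plain Python data structures rather than a real RDF store or graph database. The `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` terms are the standard RDF reification vocabulary; the `ex:` names, the `ex:percent` property, the `Edge` class and the lookup helper are assumptions made for this example.

```python
from dataclasses import dataclass, field

# Plain RDF: an edge is just a triple, with no place for the "20%".
plain = ("ex:Russia", "ex:locatedIn", "ex:Europe")

# RDF reification: the statement becomes a node of its own (here "_:s1"),
# described by four triples, plus one more triple for the annotation.
reified = [
    ("_:s1", "rdf:type", "rdf:Statement"),
    ("_:s1", "rdf:subject", "ex:Russia"),
    ("_:s1", "rdf:predicate", "ex:locatedIn"),
    ("_:s1", "rdf:object", "ex:Europe"),
    ("_:s1", "ex:percent", 20),  # hypothetical annotation property
]


@dataclass
class Edge:
    """Property-graph edge: the edge itself carries a property map."""
    source: str
    label: str
    target: str
    properties: dict = field(default_factory=dict)


edge = Edge("Russia", "locatedIn", "Europe", {"percent": 20})


def percent_from_reified(triples, s, p, o):
    """Find the statement node describing (s, p, o) and return its ex:percent."""
    wanted = {("rdf:subject", s), ("rdf:predicate", p), ("rdf:object", o)}
    for node in {t[0] for t in triples}:
        described = {(pred, obj) for subj, pred, obj in triples if subj == node}
        if wanted <= described:
            for pred, obj in described:
                if pred == "ex:percent":
                    return obj
    return None


print(percent_from_reified(reified, "ex:Russia", "ex:locatedIn", "ex:Europe"))
print(edge.properties["percent"])
```

The reified form needs five triples and a join over the statement node, whereas the property-graph edge exposes the annotation directly.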
Form of the Seminar
The intention of the seminar is to get an overview of the state of the art in data integration
from the Web and background data management.
For each topic, the following has to be done:
- write a tutorial-style paper that gives an overview of an
approach,
- evaluate some tools and write a report (installation, functionality,
usability, ...) [optionally German or English],
- prepare an illustrative medium-size case study using one or more tools
(optionally: comparatively),
- give a presentation that covers the tutorial and shows a demo of how to
use the tools (about 90 minutes incl. discussion; optionally German or English).