Thursday, February 16, 2006

Topic Maps and RDF scutters (assertion spidering bots): State of the art?

I am looking for the state of the art in Topic Maps and RDF scutters (assertion spidering bots).
Do you have any useful pointers/hints?

The growing use of semantic knowledge technologies like RDF and Topic Maps should result in larger collections of represented assertions available on the internet. Scutters (information agents spidering such assertions) could collect and integrate them.
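To make the scutter idea concrete, here is a minimal sketch, not any particular tool's implementation. It assumes assertions arrive as (subject, predicate, object) triples and that documents point to further documents via rdfs:seeAlso, which is a common scuttering convention. The `fetch` callback, the `FAKE_WEB` dictionary and all example.org URLs are hypothetical stand-ins for real HTTP retrieval and RDF parsing.

```python
from collections import deque

RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def scutter(seed_urls, fetch, max_docs=100):
    """Breadth-first crawl: collect triples and follow rdfs:seeAlso links.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that returns an iterable of (subject, predicate, object) triples.
    """
    queue = deque(seed_urls)
    seen = set()    # documents already visited
    store = set()   # integrated set of all collected triples
    while queue and len(seen) < max_docs:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        for s, p, o in fetch(url):
            store.add((s, p, o))
            # A seeAlso object names another document worth visiting.
            if p == RDFS_SEE_ALSO and o not in seen:
                queue.append(o)
    return store, seen


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/a.rdf": [
        ("http://example.org/people#alice", FOAF_NAME, "Alice"),
        ("http://example.org/people#alice", RDFS_SEE_ALSO,
         "http://example.org/b.rdf"),
    ],
    "http://example.org/b.rdf": [
        ("http://example.org/people#bob", FOAF_NAME, "Bob"),
        ("http://example.org/people#bob", RDFS_SEE_ALSO,
         "http://example.org/a.rdf"),
    ],
}


def fake_fetch(url):
    """Stand-in for HTTP GET plus RDF parsing."""
    return FAKE_WEB.get(url, [])


triples, visited = scutter(["http://example.org/a.rdf"], fake_fetch)
```

The `seen` set keeps the crawl from looping when two documents reference each other, and `max_docs` bounds the crawl. A real scutter would also need politeness delays, robots.txt handling and provenance tracking, all of which are left out here.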

One example of such an assertion scutter is:

I have sketched the idea in my blog entry dated 24th Nov., 2005.

Do you have any information concerning the following four questions:

Which available collections of statements/assertions do you know and can you recommend to me for an aggregation scenario?

I want to use them in a content aggregation scenario where statements about the same subjects are collocated. Ideally, such a collection would use Published Subjects (or subject indicators). I am particularly looking for topic map data, but would also like to know about RDF data, since according to the latest guidelines on semantic interoperability, useful mappings are possible between Topic Maps and RDF.
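The collocation step could look roughly like the following sketch. It assumes every assertion has already been reduced to a (subject identifier, property, value) tuple, regardless of whether it came from a topic map or from RDF; that reduction is the hard part and is not shown. The function name `collocate`, the source labels and the psi.example.org identifiers are invented for illustration.

```python
from collections import defaultdict


def collocate(sources):
    """Group assertions from several sources by subject identifier.

    `sources` maps a source label to a list of
    (subject_identifier, property, value) tuples.
    """
    merged = defaultdict(set)
    for source_name, assertions in sources.items():
        for subject_id, prop, value in assertions:
            # Keep the source label so provenance survives the merge.
            merged[subject_id].add((prop, value, source_name))
    return merged


# Invented sample data: two sources describing overlapping subjects.
PUCCINI = "http://psi.example.org/composer/puccini"
TOSCA = "http://psi.example.org/opera/tosca"

SOURCES = {
    "some-topic-map": [
        (PUCCINI, "name", "Giacomo Puccini"),
    ],
    "some-rdf-file": [
        (PUCCINI, "composed", TOSCA),
        (TOSCA, "title", "Tosca"),
    ],
}

merged = collocate(SOURCES)
for subject_id, statements in sorted(merged.items()):
    print(subject_id)
    for prop, value, source in sorted(statements):
        print("   ", prop, "=", value, "(from", source + ")")
```

Statements end up together only when both sources use exactly the same identifier string, which is why shared Published Subjects matter so much for this scenario.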

MusicBrainz is one example of a semantic web service with RDF.
There are also e.g. approaches for converting genealogical data (GEDCOM) to RDF FOAF,
or one might use DMOZ RDF data.

Which scutters (spidering information agents for RDF and/or topic maps, or fragments of them) do you know or can you recommend?

I am planning to use:
an LWP agent based on RDF::Redland.

Does something similar already exist for Topic Maps?

I know of some Java agents,
in particular the CC-licenced
Slug: A Simple Semantic Web Crawler (December 09, 2004)

SECO contains an
RDF Crawler: Scutter (Bash and Python for Scuttering)
Harth, A.: SECO: mediation services for semantic Web data. IEEE Intelligent Systems (USA), May/Jun 2004, Vol. 19, No. 3, pp. 66ff.
Harth and Gassert describe a 103 MB test data set they compiled:
On Searching and Displaying RDF Data from the Web

In addition, researchers in SNA (Social Network Analysis) write scutters.
The data set compiled e.g. by PhD student Peter Mika is impressive:
Social Networks and the Semantic Web

There exists a Redfoot-RDF-Scutter in Python:
for which a REST interface has been proposed:
Sun, 29 Jan 2006
A RESTful Scutter Protocol for Redfoot Kernel

There is a JavaScript extension for Mozilla:
Scuttering Composite RDF Datasource

In his research proposal "Mining the Semantic Web", Ajay Chakravarthy in section 2.4 names some existing tools
(Ontotext, Hackdiary, others with poor performance)
HyperSpider - HyperSpider (Java app) collects the link structure of a website. Data import/export from/to database and CSV-files. Export to Graphviz DOT, Resource Description Framework (RDF/DC), XML Topic Maps (XTM), Prolog, HTML. Visualization as hierarchy and map.
(I could export website interlinkings with this, but this is formal metadata)

A List of RDF Crawlers
(4 entries)
* RDF Crawler (in Java) from Institute AIFB, University of Karlsruhe, Germany
* Decentralised and reliable resource discovery using RDF metadata (also known as Fydra)
* DAML Crawler
* RDF Crawling Services - RDF Gateway
LuMriX, which is topic map-based, contains a crawler, but I do not know enough about it.

Which sites freely offer semantic web services?
I want to retrieve assertions, i.e. fragments of knowledge networks realized with Topic Maps or RDF.
Preferably with a possibility to retrieve by published subject (or subject indicator).
Indirect search by name where I assert the identity of the subject might do for the moment.

Do you know of demo sites which can be externally queried with TMRAP 0.2 (or higher: 1.0, 2.0)?

Scratchpad of additional references:
(not yet checked)

Current State of Semantic Web Mining
Check starting slide 38, but not so useful for this purpose
Ontobroker, which includes
an ontology-based web-crawler


RDFWeb notebook: aggregation strategies
(describing Swoogle)

Finding and Ranking Knowledge on the Semantic Web

Search on the Semantic Web

JNotes. Automatic Generation of Semantic Networks

[xtm-wg] Sketch of a Possible Algorithm for Fragment Grabbing (2000)

Pragmatic applications of the Semantic Web using SemTalk.
The agents are supported by crawlers searching proactively or after request for existing models to generate index files for the agents. The crawlers do not only look in the local filesystem, but also in the Semantic Web, for available knowledge sources in the RDFS format.

Metadata-based Web Querying

Perl/C RDF storage and API

CARA is an RDF API written in Perl


TMRA 2006: International Conference on Topic Maps Research and Applications, Leipzig (DE)

TMRA 2006 - International Conference on Topic Maps Research and Applications
"Leveraging the Semantics"
Leipzig, Germany, 11-12 October 2006

Full disclosure: I am co-chair of the program committee