Thursday, February 16, 2006

Topic Maps and RDF scutters (assertion spidering bots): State of the art?

I am looking for the state of the art in Topic Maps and RDF scutters (assertion spidering bots).
Do you have any useful pointers/hints?

The growing use of semantic knowledge technologies like RDF and Topic Maps should result in larger collections of represented assertions available on the internet. Scutters (information agents spidering such assertions) could collect and integrate them.

One example of such an assertion scutter is:
http://rdfweb.org/topic/Scutter
http://rdfweb.org/topic/ScutterVocab

I have sketched the idea in my blog entry dated 24th Nov., 2005.
http://asigel.blogspot.com/2005/11/ideas-for-aggregation-of-distributed.html
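
To make the scutter idea concrete, a minimal sketch of such a crawl loop in Python might look as follows. It assumes the common scutter convention of following rdfs:seeAlso links from one document to the next. The `fetch` callable, the toy in-memory "web" and the example.org URLs are hypothetical stand-ins for real HTTP retrieval and RDF parsing; none of this is taken from an existing scutter implementation.

```python
from collections import deque

RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def scutter(seed_urls, fetch, max_docs=100):
    """Breadth-first crawl: visit each document once, pool its triples,
    and queue every rdfs:seeAlso target for a later visit."""
    queue = deque(dict.fromkeys(seed_urls))  # de-duplicated, order kept
    queued = set(queue)
    collected = []  # (subject, predicate, object, source_url)
    visited = 0
    while queue and visited < max_docs:
        url = queue.popleft()
        visited += 1
        for s, p, o in fetch(url):
            collected.append((s, p, o, url))  # remember the provenance
            if p == RDFS_SEE_ALSO and o not in queued:
                queued.add(o)
                queue.append(o)
    return collected


# Hypothetical toy data standing in for documents on the web.
TOY_WEB = {
    "http://example.org/a.rdf": [
        ("http://example.org/a#me", FOAF_NAME, "Alice"),
        ("http://example.org/a#me", RDFS_SEE_ALSO, "http://example.org/b.rdf"),
    ],
    "http://example.org/b.rdf": [
        ("http://example.org/b#me", FOAF_NAME, "Bob"),
        ("http://example.org/b#me", RDFS_SEE_ALSO, "http://example.org/a.rdf"),
    ],
}


def toy_fetch(url):
    """Stand-in for 'HTTP GET plus RDF parsing'; unknown URLs yield nothing."""
    return TOY_WEB.get(url, [])


statements = scutter(["http://example.org/a.rdf"], toy_fetch)
```

Keeping the source URL with every statement is a deliberate choice in this sketch: once assertions from many sites are pooled, provenance is the only way to tell who said what.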

Do you have any information concerning the following four questions:

(1)
Which available collections of statements/assertions do you know of and can recommend for an aggregation scenario?


I want to use them in a content aggregation scenario where statements about the same subjects are collocated. Ideally, such a collection would use Published Subjects (or subject indicators). I am particularly looking for topic map data, but would also like to know about RDF data, since according to the latest guidelines on semantic interoperability, useful mappings are possible between Topic Maps and RDF.

MusicBrainz is one example of a semantic web service with RDF.
There are also approaches, e.g., for converting genealogical data (GEDCOM) to RDF FOAF,
or one might use DMOZ RDF data.
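
What "collocating statements about the same subject" could mean in practice is sketched below in Python. The sketch assumes that every source already identifies its subjects with the same identifiers (for instance published subject indicators). The psi.example.org URI, the source names and the property names are invented for illustration. Real Topic Maps merging and RDF/Topic Maps mapping involve more rules than this simple grouping.

```python
from collections import defaultdict


def collocate(statements):
    """Group (subject_id, property, value, source) statements by subject_id."""
    by_subject = defaultdict(list)
    for subject_id, prop, value, source in statements:
        by_subject[subject_id].append((prop, value, source))
    return dict(by_subject)


# Hypothetical identifier and data, for illustration only.
BACH = "http://psi.example.org/composer/bach"

harvested = [
    (BACH, "name", "Johann Sebastian Bach", "source-a"),
    ("http://psi.example.org/city/leipzig", "name", "Leipzig", "source-a"),
    (BACH, "born", "1685", "source-b"),
]

merged = collocate(harvested)
```

After the call, `merged[BACH]` holds the statements from both sources side by side, which is the collocation effect I am after.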

(2)
Which scutters (spidering information agents for RDF and/or topic maps, or fragments thereof)
do you know of or can you recommend?

I am planning to use:
http://search.cpan.org/~kjetilk/RDF-Scutter/
an LWP agent based on RDF::Redland.

Does something similar already exist for Topic Maps?

I know of some Java agents,
in particular the CC-licenced
Slug: A Simple Semantic Web Crawler (December 09, 2004) http://www.ldodds.com/blog/archives/000167.html
http://aloo.gnomehack.com/~ldodds/projects/slug/javadoc/

SECO contains an
RDF Crawler: Scutter (Bash and Python for Scuttering) http://triple.semanticweb.org/svn/aharth/2004/wwwnyc/seco-talk.html
http://www.harth.org/andreas/2004/ieeeis/
Harth, A.: SECO: mediation services for semantic Web data. IEEE Intelligent Systems (USA), May/Jun 2004, Vol 19 No 3, 66ff.
Harth and Gassert describe a 103 MB test data set they compiled:
On Searching and Displaying RDF Data from the Web http://sw.deri.org/2004/12/derisearch/Eswc2005Demo.pdf

In addition, researchers in SNA (Social Network Analysis) write scutters.
The data set compiled e.g. by PhD student Peter Mika is impressive:
Social Networks and the Semantic Web
http://doi.ieeecomputersociety.org/10.1109/WI.2004.10039

There exists a Redfoot-RDF-Scutter in Python:
http://redfoot.net/scutter/
for which a REST interface has been proposed:
Sun, 29 Jan 2006
A RESTful Scutter Protocol for Redfoot Kernel http://copia.ogbuji.net/blog/2006-01-29/A_RESTful_

There is a Javascript extension for Mozilla:
Scuttering Composite RDF Datasource
http://nachbaur.com/software/mozilla/objects/index.xhtml

In his research proposal "Mining the Semantic Web", Ajay Chakravarthy in section 2.4 names some existing tools http://www.dcs.shef.ac.uk/~ajay/reports/Research%20Proposal.pdf
(Ontotext, Hackdiary, others with poor performance)
HyperSpider - HyperSpider (Java app) collects the link structure of a website. Data import/export from/to database and CSV-files. Export to Graphviz DOT, Resource Description Framework (RDF/DC), XML Topic Maps (XTM), Prolog, HTML. Visualization as hierarchy and map.
http://hyperspider.sourceforge.net/
(I could export website interlinkings with this, but this is formal metadata)

A List of RDF Crawlers
http://www.dbis.informatik.uni-frankfurt.de/~tolle/RDF/RDFReferences.html
(4 entries)
* RDF Crawler (in Java) from Institute AIFB, University of Karlsruhe, Germany
* Decentralised and reliable resource discovery using RDF metadata (also known as Fydra)
* DAML Crawler
* RDF Crawling Services - RDF Gateway
LuMriX, which is topic map-based, contains a crawler, but I do not know enough about it.
http://www.lumrix.de/xmlsearch_keyfacts.php

(3)
Which sites freely offer semantic web services?
I want to retrieve assertions, i.e. fragments of knowledge networks realized with Topic Maps or RDF.
Preferably with a possibility to retrieve by published subject (or subject indicator).
Indirect search by name where I assert the identity of the subject might do for the moment.
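
The two retrieval modes mentioned above (directly by subject identifier, or indirectly by name) can be sketched in Python as follows. The topic records, field names and the psi.example.org URI are hypothetical and do not reflect the interface of any actual service; the point is only the order of preference.

```python
def find_topics(topics, subject_identifier=None, name=None):
    """Prefer an exact subject-identifier match; fall back to a
    case-insensitive name match if no identifier hit is found."""
    if subject_identifier is not None:
        hits = [t for t in topics if subject_identifier in t["subject_identifiers"]]
        if hits:
            return hits
    if name is not None:
        wanted = name.casefold()
        return [t for t in topics
                if any(n.casefold() == wanted for n in t["names"])]
    return []


# Hypothetical topic records, for illustration only.
topics = [
    {"subject_identifiers": {"http://psi.example.org/composer/bach"},
     "names": ["J. S. Bach", "Johann Sebastian Bach"]},
    {"subject_identifiers": set(),
     "names": ["Leipzig"]},
]

by_id = find_topics(topics, subject_identifier="http://psi.example.org/composer/bach")
by_name = find_topics(topics, name="leipzig")
```

A name match only yields candidates; whether a candidate really is the intended subject remains an assertion made by the person querying.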

(4)
Do you know of demo sites which can be externally queried with TMRAP 0.2 (or higher: 1.0, 2.0)?


Scratchpad of additional references:
(not yet checked)
------------------------------------

Current State of Semantic Web Mining
http://www.fernuni-hagen.de/DVT/Aktuelles/zhao_yi.pdf
Check from slide 38 onwards; not very useful for this purpose, though.
Ontobroker, which includes
an ontology-based web-crawler

DefineCrawler
http://www.lalic.paris4.sorbonne.fr/stic/octobre/octobre/apr/Nauer.pdf

RDFWeb notebook: aggregation strategies
http://rdfweb.org/2001/01/design/smush.html
(describing Swoogle)

Finding and Ranking Knowledge on the Semantic Web http://www.cs.umbc.edu/~ypeng/Publications/2005/iswcLiDing.pdf

Search on the Semantic Web
http://www.cs.umbc.edu/~ypeng/Publications/2005/IeeeSemanticWebSearch.pdf

JNotes. Automatic Generation of Semantic Networks
http://www.jnotes.de/JNotes/jnotes_webware.nsf/0/2DC6FB39AE566557C12570EC00307C3B?openDocument

[xtm-wg] Sketch of a Possible Algorithm for Fragment Grabbing (2000) http://lists.oasis-open.org/archives/topicmaps-comment/200007/msg00018.html

Pragmatic applications of the Semantic Web using SemTalk.
The agents are supported by crawlers searching proactively or after request for existing models to generate index files for the agents. The crawlers do not only look in the local filesystem, but also in the Semantic Web, for available knowledge sources in the RDFS format.
http://www.semtalk.com/pub/KnowTech2001.htm

Metadata-based Web Querying
http://www.cs.bilkent.edu.tr/~ismaila/research_projects.htm

RDFStore
Perl/C RDF storage and API
http://rdfstore.sourceforge.net/

CARA is an RDF API written in Perl
http://cara.sourceforge.net/

---

TMRA 2006: International Conference on Topic Maps Research and Applications, Leipzig (DE)

TMRA 2006 - International Conference on Topic Maps Research and Applications: "Leveraging the Semantics"
Leipzig, Germany, 11-12 October 2006
http://www.informatik.uni-leipzig.de/~tmra/2006/

Full disclosure: I am co-chair of the program committee.