view CWEP-003.rst @ 6:f00e0e8cdeaf

CWEP-003: proof-reading
author Nicolas Chauvat <nicolas.chauvat@logilab.fr>
date Thu, 13 Feb 2014 17:11:45 +0100
parents CWEP-0003.rst@61eae68af347
children ca60c294e8b7
line wrap: on
line source
=======================
 CWEP-0003 Clause FROM
=======================

:champion: Vincent Michel
:reviewers: COPIL
:last update: 2014/02/05
:status: draft


Introduction
============

A recurrent need in some CubicWeb projects is to write RQL queries that make use
of data published by remote databases or web services.

Let us call:

* *local instance* the CubicWeb instance to which the RQL query is sent ;

* *remote service* a web service, a SPARQL endpoint or a CubicWeb instance, that
  is distinct from the *local instance*.

Use cases
=========

UC1
---

A *local instance* contains data about medical studies, where patients are
linked to diseases defined by a single attribute, an URI::

     P is Patient, P related_diseases D, D is Disease, D uri "http://dbpedia.org/resource/Epilepsy"

A *remote service* contains detailed data about diseases, e.g. the mortality rate
of diseases. We would like to filter the patients of the *local instance* based on the mortality rate
retrieved from the distance instance::

    Any P WHERE P related_diseases D
                WITH D BEING (Any X WHERE X mortality_rate > 0.5)
                       FROM "http://remote.service.com"

UC2
---

A *local instance* contains information about authors, and we want to query the
thumbnails of the authors with a name that starts with "Victor"::

    Any T WHERE A name N, A name LIKE "Victor %"
                WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
                       FROM "http://remote.service.com"

UC3
---

A *local instance* contains information about news articles, with links to
objects cited in the text of the article::

   X is NewsArticle, X recognized_entities E, E uri "http://dbpedia.org/resource/Barack_Obama"

We want to select the news articles that mention any President of the USA, retrieving
the data from Dbpedia's SPARQL endpoint::

    SELECT DISTINCT ?c WHERE
    { ?c dcterms:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States> }

i.e.::

    Any X WHERE X is NewsArticle, X recognized_entities E
                WITH E FROM "http://dbpedia.org/sparql?q=SELECT DISTINCT ?c WHERE
                            {?c dcterms:subject
                             <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>}"


UC4
---

Same as the previous case, but with an API::

    Any X, E WHERE X is NewsArticle, X recognized_entities E
                 WITH E FROM "http://remote.service.com/api/presidents"



Specifications
==============

API Specifications
~~~~~~~~~~~~~~~~~~

From these examples, we can define the following specifications for the API:

* Retrieve data from a CubicWeb instance using a RQL query defined in the
  `BEING` clause and a base url defined in the `FROM` clause (use case 1);

* Retrieve data from Sparql endpoint using a SPARQL query defined as a string in
  the `BEING` clause and a base url defined in the `FROM` clause (use case 3);

* Be able to join the data in the *local instance* (use cases 1 and 3) or in the
  *remote service* (use case 2). Indeed, in the use case 2, we do not want to
  get ALL the thumbnails for all the authors from the remote instance, but we
  would rather only send all the names that verify "Victor %" to the *remote
  service*.


* Sources may be defined:

  * by an url, e.g. "http://www.cubicweb.org" or "http://www.dbpedia.org/sparql";

  * by an appid, if running on the same machine, e.g. "dbpedia", "geonames".
    In this case, this could be the name of specific CWSource.


Implementation specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following specifications come from different use cases:

* The request must be **fully dynamic**. We don't want to store any data of a
  *remote service* on the *local instance* (for memory sake, or because we want
  a query to reflect the last changes in a *remote service*).

* The *local* and *remote* instances should be fully separated. We don't want
  any dependancies between them, in terms of code (e.g. in the schema) or
  availability of the instances (i.e. if a *remote service* is down, the *local
  instance* should still work).

* The *local instance* will only work with base types for data from *remote
  services*, i.e. the notions of EType is not transferable (at least, the eid
  will be an integer). I.e. if the description of the rset on the remote instance
  is::

   [['Project'], ['Person'], ['Float'], ['String'],]


The description that will be used by the local instance is::

   [['Int'], ['Int'], ['Float'], ['String'],]

It could be interesting to think about a potential (and NOT mandatory) cache on
some requests (Postgres temporary table, more high-level cache on HTTP requests,
etc.)

Foreign Data Wrapper are not directly usable for this CWEP, as they require the
definition of a specific table schema, and we do not know it before executing
the query, especially in the case of SPARQL / API.  Moreover, this may be
too use too much memory on the local instance.


Data join
~~~~~~~~~

Two joins are possible, on attributes or on etypes:

* the join on attributes is straitghforward::

      Any T WHERE A name N, A name LIKE "Victor %"
                  WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
                         FROM "http://another.instance.com"

  In this case, we join on the value of N. For now, this is not supported by RQL
  (see "Known limitations" below).


* the join on entities is a syntatic sugar based  on URI.
  It could be interesting to automatically join the entities based on their `cwuri` if the variable
  use to join is an etype::

      Any T WHERE A name "Victor %", A same_as Y
                  WITH T BEING (Any T WHERE Y thumbnails T)
                         FROM "http://another.instance.com"

  In this case, we send to the *remote service* all the uris of entities similar
  to entities having a name matching "Victor %". The uris will be used in the *remote* query.
  This could be rewritten as::

      Any T WHERE A name "Victor %", A same_as X, X cwuri U
                  WITH T BEING (Any T WHERE Y cwuri U, Y thumbnails T)
                         FROM "http://another.instance.com"



Changes to RQL
~~~~~~~~~~~~~~

Simple cases (pure remote query)::

     Any N WITH N BEING (Any X WHERE X is Project, X name N) FROM "http://www.cubicweb.org"

     Any X WITH X BEING (Any X WHERE X is Dbpediapage) FROM "dbpedia"


Join in query::

     Any X,CC WHERE X same_as Y WITH CC BEING (Any CC WHERE Y country CC) FROM "geonames"

     Any N, M WITH N BEING (Any X WHERE X is Project, X name N)
                     FROM "http://www.cubicweb.org"
                   M BEING (Any X WHERE X is Project, X name N)
                     FROM "http://www.cubicweb.org"


Remote join in query::

      Any T WHERE A name N, A name "Victor %"
                  WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
                         FROM "http://remote.service.com" JOINBY N


      Any T WHERE A name "Victor %", A same_as Y
                  WITH T BEING (Any T WHERE Y thumbnails T)
                         FROM "http://remote.service.com" JOINBY T



Implementation
==============


In PostgreSQL, it is possible to execute::

 SELECT _T0.C0  FROM (SELECT _X.c0 AS C0 FROM (VALUES('cubicweb-mobile'), ('cubicweb-workcase'),
                                               ...
                                               ('pyqonsole'), ('pyreverse'), ('yams'))
                      AS _X (c0)) AS _T0;


Thus it is possible to let CubicWeb do all the join logic by simply replacing the function
that generates the SQL statement for the subqueries.


POC
~~~

A proof-of-concept implementation is available from:

* http://hg.logilab.org/users/vmichel/cubicweb

* http://hg.logilab.org/users/vmichel/rql


For RQL, it implements:

* the grammar/tests for ``FROM`` in RQL

* the logic of annotations/checking of the RQL syntax tree


For CubicWeb, it implements:

* the execution of the remote query and the join with the local query



Known limitations
~~~~~~~~~~~~~~~~~

The `WITH` subqueries in CubicWeb create a new variables scope, which the join
on *remote services*.  This is currently the most important limitation, as the
join on *remote services* is one of the biggest use case of the `FROM` clause.


Possible improvements
~~~~~~~~~~~~~~~~~~~~~

In the specific case of a *remote service* based on RQL, it could be interesting
to implement a specific API used by the *local* querier to retrieve statistics
on the *remote* execution plan, thus allowing to decide if the join by be done
on the *local* instance or the *remote* one.


References
~~~~~~~~~~

Some exemples and references:

* Foreign Data Wrapper::

      CREATE FOREIGN TABLE pw
      (username text, pw text, uid int, gid int,
      comments text, homedir text, shell text)
      SERVER file_fdw
      OPTIONS (format 'csv', delimiter ':',
      filename '/etc/passwd');

  For more details, see:

  * http://multicorn.org/foreign-data-wrappers/#idrss-foreign-data-wrapper

  * http://people.planetpostgresql.org/andrew/uploads/fdw2.pdf


* SPARQL Service::

      PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
      SELECT ?name
      FROM <http://example.org/myfoaf.rdf>
      WHERE
      {
         <http://example.org/myfoaf/I> foaf:knows ?person .
         SERVICE <http://people.example.org/sparql> {
           ?person foaf:name ?name . }
      }

   For more details, see::

   * http://www.w3.org/TR/sparql11-federated-query/