view CWEP-003.rst @ 19:d06afae95fe4

CWEP 0004 initial draft
author Christophe de Vienne <christophe@unlish.com>
date Wed, 22 Oct 2014 20:24:55 +0200
parents ce4b5e49bb6b
children
line wrap: on
line source
=======================
 CWEP-0003 Clause FROM
=======================

:champion: Vincent Michel
:reviewers: COPIL
:last update: 2014/02/05
:status: draft


Introduction
============

A recurrent need in some CubicWeb projects is to write RQL queries that make use
of data published by remote databases or web services.

Let us call:

* *local instance* the CubicWeb instance to which the RQL query is sent ;

* *remote service* a web service, a SPARQL endpoint or a CubicWeb instance, that
  is distinct from the *local instance*.


Use cases
=========

Use case #1
-----------

A *local instance* contains data about medical studies, where patients are
linked to diseases defined by a single attribute, an URI::

     P is Patient, P related_diseases D, D is Disease, D uri "http://dbpedia.org/resource/Epilepsy"

A *remote service* contains detailed data about diseases, e.g. the mortality rate
of diseases. We would like to filter the patients of the *local instance* based on the mortality rate
retrieved from the distance instance::

    Any P WHERE P related_diseases D
          WITH D FROM RQL("http://remote.service.com", 'Any X WHERE X mortality_rate > 0.5')

Use case #2
-----------

A *local instance* contains information about authors, and we want to query the
thumbnails of the authors with a name that starts with "Victor"::

    Any T WHERE A name N, A name LIKE "Victor %"
          WITH T FROM ("http://remote.service.com", 'Any T WHERE Y name N, Y thumbnails T') JOIN ON (N)

Use case #3
-----------

A *local instance* contains information about news articles, with links to
objects cited in the text of the article::

   X is NewsArticle, X recognized_entities E, E uri "http://dbpedia.org/resource/Barack_Obama"

We want to select the news articles that mention any President of the USA, retrieving
the data from Dbpedia's SPARQL endpoint::

    SELECT DISTINCT ?c WHERE
    { ?c dcterms:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States> }

i.e.::

    Any X WHERE X is NewsArticle, X recognized_entities E
          WITH E FROM SPARQL("http://dbpedia.org/sparql",
                             'SELECT ?c WHERE {?c dcterms:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>}')

Use case #4
-----------

Same as the previous case, but with an API::

    Any X, E WHERE X is NewsArticle, X recognized_entities E
             WITH E FROM MYAPI("http://remote.service.com/api/presidents") JOIN ON (E)

where 'MYAPI' is a specific function defined in the cube.


Conclusion
----------
The following specifications come from different use cases:

* The request must be **fully dynamic**. We don't want to store any data of a
  *remote service* on the *local instance* (for memory sake, or because we want
  a query to reflect the last changes in a *remote service*).

* The *local* and *remote* instances should be fully separated. We don't want
  any dependancies between them, in terms of code (e.g. in the schema) or
  availability of the instances (i.e. if a *remote service* is down, the *local
  instance* should still work).

Foreign Data Wrapper are not directly usable for this CWEP, as they require the
definition of a specific table schema, and we do not know it before executing
the query, especially in the case of SPARQL / API.  Moreover, this may use too
much memory on the local instance.


.. XXX influence on tree solutions is depending on each function implementation. Rather Talk about that, related api, and restriction of pre-existing function (RQL, SPARQL)

.. * The *local instance* will only work with base types for data from *remote
..   services*, i.e. the notions of EType is not transferable (at least, the eid
..   will be an integer). I.e. if the description of the rset on the remote instance
..   is::

..     [['Project'], ['Person'], ['Float'], ['String'],]


..   The description that will be used by the local instance is::

..     [['Int'], ['Int'], ['Float'], ['String'],]


Specifications
==============

RQL langage specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~

From these examples, we can define the following specifications for the API:

* Retrieve data from a custom function that will be called in a `FROM` clause (use case 1) ;

* The function is responsible to define how it behave on which kind of source
  (cubicweb instance, Sparql enpoint, etc.) ;

* Ability to join the data in the *local instance* (use cases 1 and 3) or in the
  *remote service* (use case 2), in which case we have to use the new `JOIN ON`
  RQL keyword. Indeed, in the use case 2, we do not want to get ALL the
  thumbnails for all the authors from the remote instance, but we would rather
  only send all the names that verify "Victor %" to the *remote service*. This
  keywork may be used later to change the current behaviour of `BEING` clauses.


`JOIN ON` should have the following syntax::

   FROM FUNC(...) JOIN ON (V1, V2, ...)

in order to allow multiple subqueries with specific join variables, e.g.::

   WITH X FROM FUNC(...) JOIN ON (V1, V2), Y FROM FUNC(...) JOIN ON (U1), Z BEING ...




API specifications
~~~~~~~~~~~~~~~~~~

The `FROM` functions are registered in a dictionnary defined in server/clausefrom.py.
We have the following API:

  * in rql.utils we have the `register_from_function` that could be used
    to register a `FROM` function;

The base `FROM` function should probably go in a specific cube to avoid automatic registration
of these functions. CubicWeb should only provide support for a `FROM` function.

Substitution is delegated to the `FROM` functions. They may deal with joins in different ways,
CubicWeb only gives the join variables name and the join variables cache (i.e. a dictionnary (variable name, variable values) built in the local query).


Introduced FROM functions
~~~~~~~~~~~~~~~~~~~~~~~~~

A `FROM` function has the following API:

 * a *name* use for registration;

 * a possible *min_arguments* for arguments checking;

 * a possible *allow_join* boolean for join management;

 * should define a *remote_execute* function that takes both parameters
   and join variables and that generates the remote rset;


The *generate* function takes both parameters and join variables,
and return the subquery sql.


We define two examples of `FROM` function:

 * `RQL` takes an endpoint and a RQL query;

 * `SPARQL` takes an endpoint and a SPARQL query. Join should be done
   with named substitutions.



Implementation
==============


In PostgreSQL, it is possible to execute::

 SELECT _T0.C0  FROM (SELECT _X.c0 AS C0 FROM (VALUES('cubicweb-mobile'), ('cubicweb-workcase'),
                                               ...
                                               ('pyqonsole'), ('pyreverse'), ('yams'))
                      AS _X (c0)) AS _T0;


Thus it is possible to let CubicWeb do all the join logic by simply replacing the function
that generates the SQL statement for the subqueries.

It could be interesting to think about a potential (and NOT mandatory) cache on some requests (Postgres temporary table, more high-level cache on HTTP requests, etc.)

For now, the implementation is PostgreSQL dependant, but using a temporary table could be a solution to extend the FROM clause to other SQL backends. However, this make the temorary table cleanup/management more complicated (temporary table does not exist in SQLite).




POC
~~~

A proof-of-concept implementation is available from:

* http://hg.logilab.org/cubicweb

* http://hg.logilab.org//rql


For RQL, it implements:

* the grammar/tests for ``FROM`` in RQL;

* the grammar/tests for ``JOIN ON`` in RQL;

* the logic of annotations/checking of the RQL syntax tree


For CubicWeb, it implements:

* the execution of the remote query and the join with the local query;

* the planning of the ``JOIN ON`` variables fetching;




Possible improvements
~~~~~~~~~~~~~~~~~~~~~

In the specific case of a *remote service* based on RQL, it could be interesting
to implement a specific API used by the *local* querier to retrieve statistics
on the *remote* execution plan, thus allowing to decide if the join by be done
on the *local* instance or the *remote* one.

Perhaps add a function to deal with microschema in HTML.


Know limitations
~~~~~~~~~~~~~~~~

The known limitations are:

 * how to deal with an empty RSET from a ``FROM`` clause ? For now, the query::

    SELECT _T0.C0  FROM (SELECT _X.c0 AS C0 FROM (VALUES(

   require at least one value. Some safety belts should be added in ssplaner and rql2sql.

 * how to detect loops in request ? E.g.::

   Any X WHERE X name N WITH N FROM ... JOIN ON (Y), Y FROM ... JOIN ON (X)

 * RQL analysis should change ('Any') to the correct restrictions
   after it gets it results from the `FROM` clause.
   For now, we use *nodes.CONSTANT_TYPES* in analyze.py but we should
   change it to 'Any'.


References
~~~~~~~~~~

Some exemples and references:

* Foreign Data Wrapper::

      CREATE FOREIGN TABLE pw
      (username text, pw text, uid int, gid int,
      comments text, homedir text, shell text)
      SERVER file_fdw
      OPTIONS (format 'csv', delimiter ':',
      filename '/etc/passwd');

  For more details, see:

  * http://multicorn.org/foreign-data-wrappers/#idrss-foreign-data-wrapper

  * http://people.planetpostgresql.org/andrew/uploads/fdw2.pdf


* SPARQL Service::

      PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
      SELECT ?name
      FROM <http://example.org/myfoaf.rdf>
      WHERE
      {
         <http://example.org/myfoaf/I> foaf:knows ?person .
         SERVICE <http://people.example.org/sparql> {
           ?person foaf:name ?name . }
      }

   For more details, see::

   * http://www.w3.org/TR/sparql11-federated-query/