CWEP-003.rst
author Nicolas Chauvat <nicolas.chauvat@logilab.fr>
Thu, 13 Feb 2014 17:11:45 +0100
changeset 6 f00e0e8cdeaf
parent 3 CWEP-0003.rst@61eae68af347
child 10 ca60c294e8b7
permissions -rw-r--r--
CWEP-003: proof-reading
     1 =======================
     2  CWEP-0003 Clause FROM
     3 =======================
     4 
     5 :champion: Vincent Michel
     6 :reviewers: COPIL
     7 :last update: 2014/02/05
     8 :status: draft
     9 
    10 
    11 Introduction
    12 ============
    13 
    14 A recurrent need in some CubicWeb projects is to write RQL queries that make use
    15 of data published by remote databases or web services.
    16 
    17 Let us call:
    18 
    19 * *local instance* the CubicWeb instance to which the RQL query is sent ;
    20 
    21 * *remote service* a web service, a SPARQL endpoint or a CubicWeb instance, that
    22   is distinct from the *local instance*.
    23 
    24 Use cases
    25 =========
    26 
    27 UC1
    28 ---
    29 
    30 A *local instance* contains data about medical studies, where patients are
    31 linked to diseases defined by a single attribute, an URI::
    32 
    33      P is Patient, P related_diseases D, D is Disease, D uri "http://dbpedia.org/resource/Epilepsy"
    34 
    35 A *remote service* contains detailed data about diseases, e.g. the mortality rate
    36 of diseases. We would like to filter the patients of the *local instance* based on the mortality rate
    37 retrieved from the distance instance::
    38 
    39     Any P WHERE P related_diseases D
    40                 WITH D BEING (Any X WHERE X mortality_rate > 0.5)
    41                        FROM "http://remote.service.com"
    42 
    43 UC2
    44 ---
    45 
    46 A *local instance* contains information about authors, and we want to query the
    47 thumbnails of the authors with a name that starts with "Victor"::
    48 
    49     Any T WHERE A name N, A name LIKE "Victor %"
    50                 WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
    51                        FROM "http://remote.service.com"
    52 
    53 UC3
    54 ---
    55 
    56 A *local instance* contains information about news articles, with links to
    57 objects cited in the text of the article::
    58 
    59    X is NewsArticle, X recognized_entities E, E uri "http://dbpedia.org/resource/Barack_Obama"
    60 
    61 We want to select the news articles that mention any President of the USA, retrieving
    62 the data from Dbpedia's SPARQL endpoint::
    63 
    64     SELECT DISTINCT ?c WHERE
    65     { ?c dcterms:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States> }
    66 
    67 i.e.::
    68 
    69     Any X WHERE X is NewsArticle, X recognized_entities E
    70                 WITH E FROM "http://dbpedia.org/sparql?q=SELECT DISTINCT ?c WHERE
    71                             {?c dcterms:subject
    72                              <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>}"
    73 
    74 
    75 UC4
    76 ---
    77 
    78 Same as the previous case, but with an API::
    79 
    80     Any X, E WHERE X is NewsArticle, X recognized_entities E
    81                  WITH E FROM "http://remote.service.com/api/presidents"
    82 
    83 
    84 
    85 Specifications
    86 ==============
    87 
    88 API Specifications
    89 ~~~~~~~~~~~~~~~~~~
    90 
    91 From these examples, we can define the following specifications for the API:
    92 
    93 * Retrieve data from a CubicWeb instance using a RQL query defined in the
    94   `BEING` clause and a base url defined in the `FROM` clause (use case 1);
    95 
    96 * Retrieve data from Sparql endpoint using a SPARQL query defined as a string in
    97   the `BEING` clause and a base url defined in the `FROM` clause (use case 3);
    98 
    99 * Be able to join the data in the *local instance* (use cases 1 and 3) or in the
   100   *remote service* (use case 2). Indeed, in the use case 2, we do not want to
   101   get ALL the thumbnails for all the authors from the remote instance, but we
   102   would rather only send all the names that verify "Victor %" to the *remote
   103   service*.
   104 
   105 
   106 * Sources may be defined:
   107 
   108   * by an url, e.g. "http://www.cubicweb.org" or "http://www.dbpedia.org/sparql";
   109 
   110   * by an appid, if running on the same machine, e.g. "dbpedia", "geonames".
   111     In this case, this could be the name of specific CWSource.
   112 
   113 
   114 Implementation specifications
   115 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   116 
   117 The following specifications come from different use cases:
   118 
   119 * The request must be **fully dynamic**. We don't want to store any data of a
   120   *remote service* on the *local instance* (for memory sake, or because we want
   121   a query to reflect the last changes in a *remote service*).
   122 
   123 * The *local* and *remote* instances should be fully separated. We don't want
   124   any dependancies between them, in terms of code (e.g. in the schema) or
   125   availability of the instances (i.e. if a *remote service* is down, the *local
   126   instance* should still work).
   127 
   128 * The *local instance* will only work with base types for data from *remote
   129   services*, i.e. the notions of EType is not transferable (at least, the eid
   130   will be an integer). I.e. if the description of the rset on the remote instance
   131   is::
   132 
   133    [['Project'], ['Person'], ['Float'], ['String'],]
   134 
   135 
   136 The description that will be used by the local instance is::
   137 
   138    [['Int'], ['Int'], ['Float'], ['String'],]
   139 
   140 It could be interesting to think about a potential (and NOT mandatory) cache on
   141 some requests (Postgres temporary table, more high-level cache on HTTP requests,
   142 etc.)
   143 
   144 Foreign Data Wrapper are not directly usable for this CWEP, as they require the
   145 definition of a specific table schema, and we do not know it before executing
   146 the query, especially in the case of SPARQL / API.  Moreover, this may be
   147 too use too much memory on the local instance.
   148 
   149 
   150 Data join
   151 ~~~~~~~~~
   152 
   153 Two joins are possible, on attributes or on etypes:
   154 
   155 * the join on attributes is straitghforward::
   156 
   157       Any T WHERE A name N, A name LIKE "Victor %"
   158                   WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
   159                          FROM "http://another.instance.com"
   160 
   161   In this case, we join on the value of N. For now, this is not supported by RQL
   162   (see "Known limitations" below).
   163 
   164 
   165 * the join on entities is a syntatic sugar based  on URI.
   166   It could be interesting to automatically join the entities based on their `cwuri` if the variable
   167   use to join is an etype::
   168 
   169       Any T WHERE A name "Victor %", A same_as Y
   170                   WITH T BEING (Any T WHERE Y thumbnails T)
   171                          FROM "http://another.instance.com"
   172 
   173   In this case, we send to the *remote service* all the uris of entities similar
   174   to entities having a name matching "Victor %". The uris will be used in the *remote* query.
   175   This could be rewritten as::
   176 
   177       Any T WHERE A name "Victor %", A same_as X, X cwuri U
   178                   WITH T BEING (Any T WHERE Y cwuri U, Y thumbnails T)
   179                          FROM "http://another.instance.com"
   180 
   181 
   182 
   183 Changes to RQL
   184 ~~~~~~~~~~~~~~
   185 
   186 Simple cases (pure remote query)::
   187 
   188      Any N WITH N BEING (Any X WHERE X is Project, X name N) FROM "http://www.cubicweb.org"
   189 
   190      Any X WITH X BEING (Any X WHERE X is Dbpediapage) FROM "dbpedia"
   191 
   192 
   193 Join in query::
   194 
   195      Any X,CC WHERE X same_as Y WITH CC BEING (Any CC WHERE Y country CC) FROM "geonames"
   196 
   197      Any N, M WITH N BEING (Any X WHERE X is Project, X name N)
   198                      FROM "http://www.cubicweb.org"
   199                    M BEING (Any X WHERE X is Project, X name N)
   200                      FROM "http://www.cubicweb.org"
   201 
   202 
   203 Remote join in query::
   204 
   205       Any T WHERE A name N, A name "Victor %"
   206                   WITH T BEING (Any T WHERE Y name N, Y thumbnails T)
   207                          FROM "http://remote.service.com" JOINBY N
   208 
   209 
   210       Any T WHERE A name "Victor %", A same_as Y
   211                   WITH T BEING (Any T WHERE Y thumbnails T)
   212                          FROM "http://remote.service.com" JOINBY T
   213 
   214 
   215 
   216 Implementation
   217 ==============
   218 
   219 
   220 In PostgreSQL, it is possible to execute::
   221 
   222  SELECT _T0.C0  FROM (SELECT _X.c0 AS C0 FROM (VALUES('cubicweb-mobile'), ('cubicweb-workcase'),
   223                                                ...
   224                                                ('pyqonsole'), ('pyreverse'), ('yams'))
   225                       AS _X (c0)) AS _T0;
   226 
   227 
   228 Thus it is possible to let CubicWeb do all the join logic by simply replacing the function
   229 that generates the SQL statement for the subqueries.
   230 
   231 
   232 POC
   233 ~~~
   234 
   235 A proof-of-concept implementation is available from:
   236 
   237 * http://hg.logilab.org/users/vmichel/cubicweb
   238 
   239 * http://hg.logilab.org/users/vmichel/rql
   240 
   241 
   242 For RQL, it implements:
   243 
   244 * the grammar/tests for ``FROM`` in RQL
   245 
   246 * the logic of annotations/checking of the RQL syntax tree
   247 
   248 
   249 For CubicWeb, it implements:
   250 
   251 * the execution of the remote query and the join with the local query
   252 
   253 
   254 
   255 Known limitations
   256 ~~~~~~~~~~~~~~~~~
   257 
   258 The `WITH` subqueries in CubicWeb create a new variables scope, which the join
   259 on *remote services*.  This is currently the most important limitation, as the
   260 join on *remote services* is one of the biggest use case of the `FROM` clause.
   261 
   262 
   263 Possible improvements
   264 ~~~~~~~~~~~~~~~~~~~~~
   265 
   266 In the specific case of a *remote service* based on RQL, it could be interesting
   267 to implement a specific API used by the *local* querier to retrieve statistics
   268 on the *remote* execution plan, thus allowing to decide if the join by be done
   269 on the *local* instance or the *remote* one.
   270 
   271 
   272 References
   273 ~~~~~~~~~~
   274 
   275 Some exemples and references:
   276 
   277 * Foreign Data Wrapper::
   278 
   279       CREATE FOREIGN TABLE pw
   280       (username text, pw text, uid int, gid int,
   281       comments text, homedir text, shell text)
   282       SERVER file_fdw
   283       OPTIONS (format 'csv', delimiter ':',
   284       filename '/etc/passwd');
   285 
   286   For more details, see:
   287 
   288   * http://multicorn.org/foreign-data-wrappers/#idrss-foreign-data-wrapper
   289 
   290   * http://people.planetpostgresql.org/andrew/uploads/fdw2.pdf
   291 
   292 
   293 * SPARQL Service::
   294 
   295       PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
   296       SELECT ?name
   297       FROM <http://example.org/myfoaf.rdf>
   298       WHERE
   299       {
   300          <http://example.org/myfoaf/I> foaf:knows ?person .
   301          SERVICE <http://people.example.org/sparql> {
   302            ?person foaf:name ?name . }
   303       }
   304 
   305    For more details, see::
   306 
   307    * http://www.w3.org/TR/sparql11-federated-query/