CWEP-003.rst
author Rabah Meradi <rabah.meradi@logilab.fr>
Thu, 04 Jun 2015 09:28:35 +0200
changeset 28 2d1748aaa34b
parent 15 ce4b5e49bb6b
permissions -rw-r--r--
updates about the pyramid bug
     1 =======================
     2  CWEP-0003 Clause FROM
     3 =======================
     4 
     5 :champion: Vincent Michel
     6 :reviewers: COPIL
     7 :last update: 2014/02/05
     8 :status: draft
     9 
    10 
    11 Introduction
    12 ============
    13 
    14 A recurrent need in some CubicWeb projects is to write RQL queries that make use
    15 of data published by remote databases or web services.
    16 
    17 Let us call:
    18 
    19 * *local instance* the CubicWeb instance to which the RQL query is sent ;
    20 
    21 * *remote service* a web service, a SPARQL endpoint or a CubicWeb instance, that
    22   is distinct from the *local instance*.
    23 
    24 
    25 Use cases
    26 =========
    27 
    28 Use case #1
    29 -----------
    30 
    31 A *local instance* contains data about medical studies, where patients are
    32 linked to diseases defined by a single attribute, an URI::
    33 
    34      P is Patient, P related_diseases D, D is Disease, D uri "http://dbpedia.org/resource/Epilepsy"
    35 
    36 A *remote service* contains detailed data about diseases, e.g. the mortality rate
    37 of diseases. We would like to filter the patients of the *local instance* based on the mortality rate
    38 retrieved from the distance instance::
    39 
    40     Any P WHERE P related_diseases D
    41           WITH D FROM RQL("http://remote.service.com", 'Any X WHERE X mortality_rate > 0.5')
    42 
    43 Use case #2
    44 -----------
    45 
    46 A *local instance* contains information about authors, and we want to query the
    47 thumbnails of the authors with a name that starts with "Victor"::
    48 
    49     Any T WHERE A name N, A name LIKE "Victor %"
    50           WITH T FROM ("http://remote.service.com", 'Any T WHERE Y name N, Y thumbnails T') JOIN ON (N)
    51 
    52 Use case #3
    53 -----------
    54 
    55 A *local instance* contains information about news articles, with links to
    56 objects cited in the text of the article::
    57 
    58    X is NewsArticle, X recognized_entities E, E uri "http://dbpedia.org/resource/Barack_Obama"
    59 
    60 We want to select the news articles that mention any President of the USA, retrieving
    61 the data from Dbpedia's SPARQL endpoint::
    62 
    63     SELECT DISTINCT ?c WHERE
    64     { ?c dcterms:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States> }
    65 
    66 i.e.::
    67 
    68     Any X WHERE X is NewsArticle, X recognized_entities E
    69           WITH E FROM SPARQL("http://dbpedia.org/sparql",
    70                              'SELECT ?c WHERE {?c dcterms:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>}')
    71 
    72 Use case #4
    73 -----------
    74 
    75 Same as the previous case, but with an API::
    76 
    77     Any X, E WHERE X is NewsArticle, X recognized_entities E
    78              WITH E FROM MYAPI("http://remote.service.com/api/presidents") JOIN ON (E)
    79 
    80 where 'MYAPI' is a specific function defined in the cube.
    81 
    82 
    83 Conclusion
    84 ----------
    85 The following specifications come from different use cases:
    86 
    87 * The request must be **fully dynamic**. We don't want to store any data of a
    88   *remote service* on the *local instance* (for memory sake, or because we want
    89   a query to reflect the last changes in a *remote service*).
    90 
    91 * The *local* and *remote* instances should be fully separated. We don't want
    92   any dependancies between them, in terms of code (e.g. in the schema) or
    93   availability of the instances (i.e. if a *remote service* is down, the *local
    94   instance* should still work).
    95 
    96 Foreign Data Wrapper are not directly usable for this CWEP, as they require the
    97 definition of a specific table schema, and we do not know it before executing
    98 the query, especially in the case of SPARQL / API.  Moreover, this may use too
    99 much memory on the local instance.
   100 
   101 
   102 .. XXX influence on tree solutions is depending on each function implementation. Rather Talk about that, related api, and restriction of pre-existing function (RQL, SPARQL)
   103 
   104 .. * The *local instance* will only work with base types for data from *remote
   105 ..   services*, i.e. the notions of EType is not transferable (at least, the eid
   106 ..   will be an integer). I.e. if the description of the rset on the remote instance
   107 ..   is::
   108 
   109 ..     [['Project'], ['Person'], ['Float'], ['String'],]
   110 
   111 
   112 ..   The description that will be used by the local instance is::
   113 
   114 ..     [['Int'], ['Int'], ['Float'], ['String'],]
   115 
   116 
   117 Specifications
   118 ==============
   119 
   120 RQL langage specifications
   121 ~~~~~~~~~~~~~~~~~~~~~~~~~~
   122 
   123 From these examples, we can define the following specifications for the API:
   124 
   125 * Retrieve data from a custom function that will be called in a `FROM` clause (use case 1) ;
   126 
   127 * The function is responsible to define how it behave on which kind of source
   128   (cubicweb instance, Sparql enpoint, etc.) ;
   129 
   130 * Ability to join the data in the *local instance* (use cases 1 and 3) or in the
   131   *remote service* (use case 2), in which case we have to use the new `JOIN ON`
   132   RQL keyword. Indeed, in the use case 2, we do not want to get ALL the
   133   thumbnails for all the authors from the remote instance, but we would rather
   134   only send all the names that verify "Victor %" to the *remote service*. This
   135   keywork may be used later to change the current behaviour of `BEING` clauses.
   136 
   137 
   138 `JOIN ON` should have the following syntax::
   139 
   140    FROM FUNC(...) JOIN ON (V1, V2, ...)
   141 
   142 in order to allow multiple subqueries with specific join variables, e.g.::
   143 
   144    WITH X FROM FUNC(...) JOIN ON (V1, V2), Y FROM FUNC(...) JOIN ON (U1), Z BEING ...
   145 
   146 
   147 
   148 
   149 API specifications
   150 ~~~~~~~~~~~~~~~~~~
   151 
   152 The `FROM` functions are registered in a dictionnary defined in server/clausefrom.py.
   153 We have the following API:
   154 
   155   * in rql.utils we have the `register_from_function` that could be used
   156     to register a `FROM` function;
   157 
   158 The base `FROM` function should probably go in a specific cube to avoid automatic registration
   159 of these functions. CubicWeb should only provide support for a `FROM` function.
   160 
   161 Substitution is delegated to the `FROM` functions. They may deal with joins in different ways,
   162 CubicWeb only gives the join variables name and the join variables cache (i.e. a dictionnary (variable name, variable values) built in the local query).
   163 
   164 
   165 Introduced FROM functions
   166 ~~~~~~~~~~~~~~~~~~~~~~~~~
   167 
   168 A `FROM` function has the following API:
   169 
   170  * a *name* use for registration;
   171 
   172  * a possible *min_arguments* for arguments checking;
   173 
   174  * a possible *allow_join* boolean for join management;
   175 
   176  * should define a *remote_execute* function that takes both parameters
   177    and join variables and that generates the remote rset;
   178 
   179 
   180 The *generate* function takes both parameters and join variables,
   181 and return the subquery sql.
   182 
   183 
   184 We define two examples of `FROM` function:
   185 
   186  * `RQL` takes an endpoint and a RQL query;
   187 
   188  * `SPARQL` takes an endpoint and a SPARQL query. Join should be done
   189    with named substitutions.
   190 
   191 
   192 
   193 Implementation
   194 ==============
   195 
   196 
   197 In PostgreSQL, it is possible to execute::
   198 
   199  SELECT _T0.C0  FROM (SELECT _X.c0 AS C0 FROM (VALUES('cubicweb-mobile'), ('cubicweb-workcase'),
   200                                                ...
   201                                                ('pyqonsole'), ('pyreverse'), ('yams'))
   202                       AS _X (c0)) AS _T0;
   203 
   204 
   205 Thus it is possible to let CubicWeb do all the join logic by simply replacing the function
   206 that generates the SQL statement for the subqueries.
   207 
   208 It could be interesting to think about a potential (and NOT mandatory) cache on some requests (Postgres temporary table, more high-level cache on HTTP requests, etc.)
   209 
   210 For now, the implementation is PostgreSQL dependant, but using a temporary table could be a solution to extend the FROM clause to other SQL backends. However, this make the temorary table cleanup/management more complicated (temporary table does not exist in SQLite).
   211 
   212 
   213 
   214 
   215 POC
   216 ~~~
   217 
   218 A proof-of-concept implementation is available from:
   219 
   220 * http://hg.logilab.org/cubicweb
   221 
   222 * http://hg.logilab.org//rql
   223 
   224 
   225 For RQL, it implements:
   226 
   227 * the grammar/tests for ``FROM`` in RQL;
   228 
   229 * the grammar/tests for ``JOIN ON`` in RQL;
   230 
   231 * the logic of annotations/checking of the RQL syntax tree
   232 
   233 
   234 For CubicWeb, it implements:
   235 
   236 * the execution of the remote query and the join with the local query;
   237 
   238 * the planning of the ``JOIN ON`` variables fetching;
   239 
   240 
   241 
   242 
   243 Possible improvements
   244 ~~~~~~~~~~~~~~~~~~~~~
   245 
   246 In the specific case of a *remote service* based on RQL, it could be interesting
   247 to implement a specific API used by the *local* querier to retrieve statistics
   248 on the *remote* execution plan, thus allowing to decide if the join by be done
   249 on the *local* instance or the *remote* one.
   250 
   251 Perhaps add a function to deal with microschema in HTML.
   252 
   253 
   254 Know limitations
   255 ~~~~~~~~~~~~~~~~
   256 
   257 The known limitations are:
   258 
   259  * how to deal with an empty RSET from a ``FROM`` clause ? For now, the query::
   260 
   261     SELECT _T0.C0  FROM (SELECT _X.c0 AS C0 FROM (VALUES(
   262 
   263    require at least one value. Some safety belts should be added in ssplaner and rql2sql.
   264 
   265  * how to detect loops in request ? E.g.::
   266 
   267    Any X WHERE X name N WITH N FROM ... JOIN ON (Y), Y FROM ... JOIN ON (X)
   268 
   269  * RQL analysis should change ('Any') to the correct restrictions
   270    after it gets it results from the `FROM` clause.
   271    For now, we use *nodes.CONSTANT_TYPES* in analyze.py but we should
   272    change it to 'Any'.
   273 
   274 
   275 References
   276 ~~~~~~~~~~
   277 
   278 Some exemples and references:
   279 
   280 * Foreign Data Wrapper::
   281 
   282       CREATE FOREIGN TABLE pw
   283       (username text, pw text, uid int, gid int,
   284       comments text, homedir text, shell text)
   285       SERVER file_fdw
   286       OPTIONS (format 'csv', delimiter ':',
   287       filename '/etc/passwd');
   288 
   289   For more details, see:
   290 
   291   * http://multicorn.org/foreign-data-wrappers/#idrss-foreign-data-wrapper
   292 
   293   * http://people.planetpostgresql.org/andrew/uploads/fdw2.pdf
   294 
   295 
   296 * SPARQL Service::
   297 
   298       PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
   299       SELECT ?name
   300       FROM <http://example.org/myfoaf.rdf>
   301       WHERE
   302       {
   303          <http://example.org/myfoaf/I> foaf:knows ?person .
   304          SERVICE <http://people.example.org/sparql> {
   305            ?person foaf:name ?name . }
   306       }
   307 
   308    For more details, see::
   309 
   310    * http://www.w3.org/TR/sparql11-federated-query/