There are several goals in this benchmark:

* show each store's usage, pros and cons, and relative performance
* hopefully improve import performance, by borrowing ideas from one another
* get a unified API for data import (one can easily see in the benchmark implementation that
  stores are not really interchangeable)


Runs
====

Results below are in seconds. See Python's `time`_ module documentation for the difference between
`time.time` and `time.clock`. Notice that `time.time` is the significant figure in our case, because
otherwise time spent in the backend (postgres) isn't accounted for, while it is a primary concern in
the case of massive data import.

.. _time: https://docs.python.org/2/library/time.html#module-time
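
A minimal sketch of how such figures can be obtained, assuming a hypothetical `run_import()`
function doing the actual work for one store (Python 2, where `time.clock` still exists)::

    import time

    clock_start = time.clock()  # CPU time of the importing process only
    time_start = time.time()    # wall-clock time, including postgres work
    run_import()                # hypothetical: one store importing one file
    print 'time.clock: %.2f' % (time.clock() - clock_start)
    print 'time.time:  %.2f' % (time.time() - time_start)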

siaf_matieres.xml
-----------------

http://data.culture.fr/thesaurus/data/ark:/67717/Matiere?includeSchemes=true

4000 entities, 9000 relations

========== ========== =========
Store      time.clock time.time
========== ========== =========
massive          1.27      1.65
fastimport       2.49      4.77
sqlgen           2.66      3.50
nohook           3.97      7.79
========== ========== =========


eurovoc.xml
-----------

http://open-data.europa.eu/fr/data/dataset/eurovoc

331396 entities, 25943 relations

==========  ========== =========
Store       time.clock time.time
==========  ========== =========
massive          51.47     74.74
sqlgen          197.90    306.66
nohook          369.23   4947.68
==========  ========== =========

I have not yet been able to import Eurovoc using fastimport (the process gets killed by the system
because it consumes too much memory).


API
===

This needs more thought.

Based on my current understanding, I would propose a data import API built on the following concepts:

* `ExtEntity`, some intermediate representation of the data to import, using external identifiers but no eids

* `Generator`, classes or functions that yield `ExtEntity` objects from some data source (e.g. RDF, CSV)

* `Importer`, class responsible for mapping each `ExtEntity`'s extid to an eid, creating or updating
  entities accordingly, and possibly controlling the insertion order of entities before feeding them
  to a `Store`

* `Store`, class responsible for inserting values into the backend database (a sketch tying these
  concepts together follows)
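
To make this concrete, here is a minimal sketch of how the first three concepts could fit together
on top of the store API described below. Names, signatures and data layout are illustrative
assumptions, not the actual skos / cubicweb API::

    class ExtEntity(object):
        """Intermediate representation of an entity to import: a type, an
        external identifier and a dict of attribute / relation values."""
        def __init__(self, etype, extid, values):
            self.etype = etype
            self.extid = extid
            self.values = values

    def generate_ext_entities(records):
        """Generator: yield ExtEntity objects from some data source (here,
        pre-parsed dicts standing in for RDF or CSV input)."""
        for record in records:
            yield ExtEntity(record['etype'], record['extid'], record['values'])

    class Importer(object):
        """Importer: turn extids into eids and feed entities to a store."""
        def __init__(self, store, rtypes):
            self.store = store
            self.rtypes = rtypes  # names of relations (vs plain attributes)
            self.extid2eid = {}   # extid -> eid mapping kept by the importer

        def import_entities(self, ext_entities):
            relations = []
            for ext in ext_entities:
                # split plain attributes from relations to other entities
                attrs = dict((k, v) for k, v in ext.values.items()
                             if k not in self.rtypes)
                for rtype in self.rtypes:
                    for target_extid in ext.values.get(rtype, ()):
                        relations.append((ext.extid, rtype, target_extid))
                self.extid2eid[ext.extid] = self.store.prepare_insert_entity(
                    ext.etype, **attrs)
            # insert relations once both ends have been assigned an eid
            for from_extid, rtype, to_extid in relations:
                self.store.prepare_insert_relation(
                    self.extid2eid[from_extid], rtype,
                    self.extid2eid[to_extid])
            self.store.flush()
            self.store.finish()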

There are test implementations of the first three concepts in the skos cube (`skos.ExtEntity`,
`skos.rdfio`, `skos.dataimport`). We still need some work on the store abstraction to make stores
interchangeable.

We would also need APIs for:

* fine-grained hook tuning (not only per category, as available today)
* schema tuning (removal of RQLConstraint)
* database index control (list, remove, add)

After discussion, it sounds like `Generator` should be merged with the current `Parser` concept from
the datafeed source. Also, every store should be able to mark entities as coming from an external
source, and we should rely on the `Importer` to do the external id to eid mapping, rather than on
`repo.extid2eid`.


Store API
`````````

* `prepare_insert_entity(<entity type>, **kwargs) -> eid` : given an entity type, attributes and
  inlined relations, return an eid, **with no guarantee that anything has been inserted in the
  database**

* `prepare_update_entity(<entity type>, eid, **kwargs)` : given an entity type and an eid, promise
  to update the given attributes and inlined relations, **with no guarantee that anything has been
  written to the database**

* `prepare_insert_relation(eid_from, rtype, eid_to) -> None` : indicate that a relation `rtype`
  should be added between the entities with eids `eid_from` and `eid_to`. As with
  `prepare_insert_entity`, there is **no guarantee that the relation will be inserted in the
  database**.

* `flush() -> None` : flush any temporary data to the database. May be called several times during
  an import.

* `finish() -> None` : indicate that the import is finished, so the store may perform any remaining
  work (final flush, cleanup, index rebuild...).
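
As an illustration, here is a minimal in-memory sketch of a store honouring this contract; a real
store would issue SQL (or COPY FROM) where this one merely drops its buffers::

    import itertools

    class MinimalStore(object):
        """Buffer prepared entities / relations, 'write' them on flush()."""
        def __init__(self):
            self._eids = itertools.count(1)
            self._entities = []   # (etype, eid, attributes) triples
            self._relations = []  # (eid_from, rtype, eid_to) triples

        def prepare_insert_entity(self, etype, **kwargs):
            # return an eid immediately: nothing is written to the
            # database before flush()
            eid = next(self._eids)
            self._entities.append((etype, eid, kwargs))
            return eid

        def prepare_update_entity(self, etype, eid, **kwargs):
            self._entities.append((etype, eid, kwargs))

        def prepare_insert_relation(self, eid_from, rtype, eid_to):
            self._relations.append((eid_from, rtype, eid_to))

        def flush(self):
            # a real store would INSERT / COPY the buffered data here
            del self._entities[:]
            del self._relations[:]

        def finish(self):
            self.flush()  # e.g. final flush, commit, index rebuild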

Store comparison
================

================= === ====== ======== ======== ==========
Store             rql nohook sqlgen   massive  fastimport
================= === ====== ======== ======== ==========
hooks             yes no     no       no       yes
backend           any any    postgres postgres any
concurrent access any any    sqlgen   postgres any
perf (1 = best)   5   4      2        1        3
================= === ====== ======== ======== ==========

Strategies
``````````

* rql: naive, uses `cnx.execute`
* nohook: direct access to the native source API
* sqlgen: uses COPY FROM
* massive: uses COPY FROM + database index tuning (removal / reinsertion) + external id handling
  (sketched below)
* fastimport: executemany + custom hooks management
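
For reference, a rough sketch of the COPY FROM strategy using psycopg2; the table and column names
are made up for illustration. On top of this, the massive store drops indexes and constraints
before copying and recreates them afterwards::

    import io

    import psycopg2

    cnx = psycopg2.connect('dbname=benchmark')
    cursor = cnx.cursor()
    # accumulate rows in a tab-separated buffer, then hand it to COPY in a
    # single round-trip: much faster than row-by-row INSERT statements
    buf = io.BytesIO(b'1\tfoo\n2\tbar\n')
    cursor.copy_from(buf, 'cw_concept', columns=('cw_eid', 'cw_name'))
    cnx.commit()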


The sqlgen store doesn't have much interest compared to the massive store. In the end, I would
expect two store implementations in cubicweb: nohook and massive. Having a store with fastimport
capabilities would be great, but it would need some work to get an implementation that doesn't rely
on the worker cube. This would probably remove the need for the rql object store (and for the
nohook store as well?).