hyper.dev

2019/06/16 - Generic Tuple Store

Keywords

Kesako a tuple store?

In SRFI-168 a tuple store is defined as follow:

a tuple store is an ordered set of tuples.

If you prefer, replace "ordered" with "sorted" (I am not sure what is the difference). Anyway, the order is the lexicographic order given by the bytes packing procedure. That can be summarized as: items of the same type are ordered using their natural order. For instance the number 2 is smaller that the number 10 but the string "2" is bigger than "10".

The comparison between items of different types is defined but it doesn't matter for what is following.

Why Generic?

As explained in Frey:2017 named graph perform better at metadata representation than other strategies. Similarly, nstore excels at representing metadata about triples or quads. In general, whenever you need to attach some information to all tuples you can add an item to the tuple to do so.

The canonical Resource Description Framework (RDF) naming of tuple items are for triple stores:

(subject predicate object)

For named graph (ngraph) or quad store you prefix the tuple with graph:

(graph subject predicate object)

If you come from mongodb world, you will have an easier time with the following naming:

(collection object-id key value)

If you come from the RDBMS world, you will prefer the following naming:

(table primary-key column-name value)

It consider that every row as a primiary key, like in google spanner Corbett:2012 and most relational databases I encountered.

In RDBMS when you need to add information to a row, you add column. That would be equivalent in the RDBMS to RDF mapping to add a tuple to every primary-key from a table with the given column-name.

The strict equivalence in RDF would be, instead, to add an item to every tuple, which is not yet part of the specification.

Based on that reflexion, one might argue that RDBMS is more powerful than RDF or nstore.

n+1 tuple items

To give some food for thought, try to imagine how to represent metadata like:

Done figuring a good schema for that problems?


This is difficult.

There is some prior art, like the Entity-Attribute-Value model that is used in magento CRM.

That example use of EAV model, a triple store, in magento is interesting. Basically, magento try to be "ready" for every possible schema by choosing the triple representation. The meme is it has too many downsides. I argue that this comes from a time where the problem was different. In particular, hard disk is cheaper nowdays. Also, event-sourcing architecture has proven that it is completly feasible to store a lot of data in a Single-Source-Of-Thruth that can written to very quickly and expose "views" of that data in secondary systems.

You might think, that representing provenance, license for a single value is overkill. But that is problem that wikidata has and almost all data science projects should have the same issue if they consider reproducibility and quality assurance.

Also, regarding keeping track of values history, they are known audit-trails in the wild. They don't allow time-traveling queries. Most of the time, it is not systematic. If you have a new table you have to build a new audit-trail table OR rely on EAV model.

So we are back at triple stores.

What is nice about a triple store, is that they allow to to represent every kind of facts whether it is relational / graphical, tabular or documents. And every fact, has its own row. And if you want you can reify a given triple to add more facts about it. That what Frey:2017 explains. Most of those technics require to do more join on every tuple to be able to query to fetch the particular information (provenance, license, history). nstore avoids those extra joins, as the metadata is stored along the rest of the tuple.