Kuntergunt Software Experiences: Relational or noSQL

Recently I ran into this article and wondered why the author came to the conclusion that "NoSQL is the wrong tool". So I went through all the reasons why I like the document-based approach so much. Here are my thoughts.

First, I have to disagree with some statements in that article. I think relational databases are neither easier to use nor better performing than NoSQL databases. I won't go into detail on that, except to say that I have seen less complexity and much better performance after moving away from relational. I'd rather focus on terms like ACID, BASE and this thing called eventual consistency. Of course there are other NoSQL approaches, such as key-value stores, wide-column stores, graph databases and many more, but here I'd like to focus on document stores.

But first, what are the advantages of the document-based approach?
Simply the fact that in object-oriented programming languages an object can be serialized to a document without much transformation into a database model, as is required with ORM (object-relational mapping). The document-based approach also has its benefits when it comes to locking: there is only one document to lock instead of multiple rows spread over a couple of tables.
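To illustrate the serialization point, here is a minimal sketch in Python. The `Order` and `OrderLine` classes and the helper names are made up for this example, and JSON stands in for whatever format a particular document store uses.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class OrderLine:
    product: str
    quantity: int


@dataclass
class Order:
    order_id: str
    customer: str
    lines: List[OrderLine] = field(default_factory=list)


def to_document(order: Order) -> str:
    """Serialize the whole object, nested lines included, into one JSON document."""
    return json.dumps(asdict(order))


def from_document(document: str) -> Order:
    """Rebuild the object, including its nested lines, from the document."""
    data = json.loads(document)
    lines = [OrderLine(**line) for line in data["lines"]]
    return Order(order_id=data["order_id"], customer=data["customer"], lines=lines)


# Hypothetical sample data: one order with two lines becomes one document.
order = Order("o-1", "c-42", [OrderLine("keyboard", 1), OrderLine("cable", 3)])
document = to_document(order)
restored = from_document(document)
```

The order and its lines travel together as one unit. In a relational model the same object would typically be split into an order row and several order line rows, and put back together on every read.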

Schema

It is said that document databases are schemaless. So what is a schema, and what is it good for?

First, it defines and groups data. This is what tables and views are about, but also relations and constraints, or triggers and stored procedures.

The other thing a schema is good for is authorization. Permissions can be granted on the whole schema but also on specific objects like tables or rows.

In document databases there is no schema defined, but some fields are still known to the database: each field that needs to be indexed also needs to be defined. All the other fields are not handled by the database and do not need to be. I am fine with this, because I have something like a schema defined anyway, in my application. I have string, boolean and numeric fields there, I have objects within other objects, and arrays or collections. If I want to change any of this, I must understand what I'm doing. For the fields that are also defined on the database I will have to make the change twice. But if I have a database schema, I will always have to make the change twice, for every field and every document type (aka table in the relational world). So for me the schemaless approach is closer to what I usually need.
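The idea that only the indexed fields need to be declared can be sketched with a toy in-memory collection. This is not the API of any real document store. The `Collection` class and its methods are invented for the illustration, and it assumes that indexed values are hashable.

```python
from collections import defaultdict


class Collection:
    """Toy document collection: only the indexed fields are declared up front."""

    def __init__(self, indexed_fields):
        self.indexed_fields = tuple(indexed_fields)
        self.documents = {}
        # One lookup table per indexed field: value -> set of document ids.
        self.indexes = {name: defaultdict(set) for name in self.indexed_fields}

    def save(self, doc_id, document):
        # When overwriting, drop the old index entries first.
        old = self.documents.get(doc_id)
        if old is not None:
            for name in self.indexed_fields:
                if name in old:
                    self.indexes[name][old[name]].discard(doc_id)
        # The document is stored as-is; fields that are not indexed are ignored.
        self.documents[doc_id] = document
        for name in self.indexed_fields:
            if name in document:
                self.indexes[name][document[name]].add(doc_id)

    def find(self, field_name, value):
        if field_name not in self.indexes:
            raise KeyError(f"field {field_name!r} is not indexed")
        ids = self.indexes[field_name].get(value, ())
        return [self.documents[doc_id] for doc_id in sorted(ids)]


# Hypothetical sample data: the two documents have different shapes.
people = Collection(indexed_fields=["city"])
people.save("p1", {"name": "Ada", "city": "Berlin", "tags": ["admin"]})
people.save("p2", {"name": "Bob", "city": "Hamburg", "newsletter": True})
berliners = people.find("city", "Berlin")
```

Adding the `newsletter` field needed no change on the "database" side. Only a new indexed field would have to be declared in two places, in the application and in the collection.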

ACID vs BASE

The problem with ACID compliance comes from the nature of distributed systems. Storing multiple documents in a distributed document store ends up, in most databases, as a series of independent writes. As a possible result, only one of the documents might get saved, which breaks the all-or-nothing guarantee called "Atomicity". In this case it is also hard to get a guarantee for "Consistency" and "Isolation". Some document stores have a switch to enable the last one, "Durability". This means that after a successful write (commit) the data has been written to permanent storage. This slows things down, but in case of a power failure the data will not be lost.

So if ACID cannot be guaranteed, how do we solve this problem? The answer is quite simple: use the "single document" approach and save the data in one document. If you have an accounting system and you want to move a specific sum from one account to another, save one document with one line for the first account and another line for the second one. You think this is no good because you want your data stored per account, in separate "entities"? Why? The entity is always the transaction, in this case the document.
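The accounting example can be sketched as follows. The field names (`_id`, `type`, `lines`) and the two helper functions are assumptions for this illustration and are not taken from any specific database.

```python
from decimal import Decimal


def make_transfer(transfer_id, source, target, amount):
    """Build ONE document that holds both sides of the transfer."""
    amount = Decimal(amount)
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {
        "_id": transfer_id,
        "type": "transfer",
        "lines": [
            {"account": source, "amount": str(-amount)},  # debit line
            {"account": target, "amount": str(amount)},   # credit line
        ],
    }


def balance(account, documents):
    """Derive an account balance by summing its lines over all transfer documents."""
    return sum(
        (
            Decimal(line["amount"])
            for doc in documents
            for line in doc["lines"]
            if line["account"] == account
        ),
        Decimal("0"),
    )


# Hypothetical ledger: each list element stands for one stored document.
ledger = [
    make_transfer("t-1", "alice", "bob", "25.00"),
    make_transfer("t-2", "bob", "carol", "10.00"),
]
bob_balance = balance("bob", ledger)
```

Because the debit and the credit live in the same document, a single write either stores both or neither. The price is that a balance is no longer a stored value but something derived from the transfer documents, for example through an index or a view.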

Eventual Consistency

In a distributed system things can behave in ways you might not expect. When saving two documents A and B, there might be a short time during which one client sees only A and another sees only B. But just a moment later all clients will see both A and B. This means that the system guarantees a consistent state, but not immediately. This can add some complexity to applications, because the application cannot rely on all data being available. For example, there might be an order document that refers to a customer document that you cannot see yet. It will be there soon, but not yet. From my perspective this is not a big deal. If you add authorizations to your documents, which you very likely will need, you might end up not seeing some documents anyway, even if that is just caused by wrong authorizations being set. And usually a customer is created some time before orders are placed, so in the real world this example may not apply. But in other cases you may need to build a solution for this.
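One way an application could cope with the order and customer scenario is sketched below. The `LaggingReplica` class only simulates replication delay and `load_order` is a hypothetical helper; a real system would read from an actual replica and would probably pick different retry settings.

```python
import time


class LaggingReplica:
    """Toy replica that only shows a write after sync() has been called."""

    def __init__(self):
        self.visible = {}
        self.pending = {}

    def write(self, doc_id, document):
        self.pending[doc_id] = document

    def sync(self):
        # Simulates replication catching up.
        self.visible.update(self.pending)
        self.pending.clear()

    def read(self, doc_id):
        return self.visible.get(doc_id)


def load_order(replica, order_id, attempts=3, delay=0.0):
    """Load an order and its customer, tolerating a customer that is not visible yet."""
    order = replica.read(order_id)
    if order is None:
        return None
    for _ in range(attempts):
        customer = replica.read(order["customer_id"])
        if customer is not None:
            return {"order": order, "customer": customer}
        time.sleep(delay)  # give replication a moment before trying again
    # Still missing: return the order anyway and let the caller decide.
    return {"order": order, "customer": None}


replica = LaggingReplica()
replica.write("o-1", {"customer_id": "c-1", "total": 99})
replica.sync()                          # the order is visible ...
replica.write("c-1", {"name": "Ada"})   # ... the customer is still in flight

early = load_order(replica, "o-1", attempts=2)
replica.sync()                          # replication catches up
later = load_order(replica, "o-1")
```

Here `early` contains the order with `customer` set to `None`, while `later` contains both documents. The point is that the application treats a missing reference as a normal state and not as a corrupt database.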

If you want to know more, go here.

The CAP theorem

This theorem applies to distributed systems and states that, out of three concerns, only two can be met at the same time.

consistency - every read sees the same data

availability - every read of some data will be answered, even if not with the latest version of the data

partition tolerance - even if a (temporary) split of the servers occurs, the system is still operating

This is rather confusing, because when you hear it you might think that you do not want any trade-off in your system. But looking into the details you will see that it is not as bad as you thought.

And one more thought on this: just think of what would happen in this case with your relational system. A server goes down or a network connection is lost, and a relational database running on that single server will not be available at all.
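The trade-off described in this section can be made concrete with a small simulation. The `Replica` class and its `prefer` parameter are invented for this sketch. Real databases expose this choice in very different ways, if at all.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk stale data."""


class Replica:
    """Toy replica that is either consistency-first or availability-first."""

    def __init__(self, prefer):
        if prefer not in ("consistency", "availability"):
            raise ValueError("prefer must be 'consistency' or 'availability'")
        self.prefer = prefer
        self.data = {}
        self.connected = True  # False simulates a network partition

    def replicate(self, key, value):
        """Apply a write coming from the primary; False means the partition swallowed it."""
        if not self.connected:
            return False
        self.data[key] = value
        return True

    def read(self, key):
        if not self.connected and self.prefer == "consistency":
            # Cut off from the primary: the value might be outdated, so refuse.
            raise Unavailable("partitioned, cannot guarantee a current value")
        # Availability-first: answer with whatever this node has, possibly stale.
        return self.data.get(key)


cp = Replica(prefer="consistency")
ap = Replica(prefer="availability")
for node in (cp, ap):
    node.replicate("stock", 5)

# The network splits and a newer write does not reach either replica.
for node in (cp, ap):
    node.connected = False
    node.replicate("stock", 4)

stale_value = ap.read("stock")  # still answers, with the old value
try:
    cp.read("stock")
    cp_answered = True
except Unavailable:
    cp_answered = False
```

During the partition the availability-first node keeps answering with the old stock value, while the consistency-first node refuses to answer. Neither is wrong. Which behavior is acceptable depends on the data.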

Thursday, 25 April 2019

Relational or noSQL - why to use documents