Storing sparse, key-value-like data in a relational database

by Jannik Arndt

Some ideas seem great at first but turn out to be incredibly bad in hindsight. I reviewed such an idea today.

Background

The task was to store flight information in a database. The information is transmitted via the IATA SSIM format, which standardizes messages with about 70 different fields. There also are various different types of messages, all of which might contain any of the 70 fields. An example: Airlines can publish their Standard Schedules Messages which contain the seasonal flight schedule, i.e. routes they usually fly on which days of the week. Exceptions from this get their own message (using different fields). Unforeseen changes are published via the Ad hoc Schedules Messages which, of course, use different fields as well. And then there are Movement Messages (from the Airport Handling Manual AHM) which contain the actual information on what each plane does. All of these messages make up a flight: The schedules, the ad hoc changes and what actually happened. The contents overlap but rarely match. In object oriented thinking, this would be extremely easy: You have an abstract message class, implementations for SSM, ASM and MVT and further classes that extend each of these for special cases.

Efficiently storing this in a database however is not that easy. You’re bound to wind up with a horrifying amount of columns and an awful lot of null entries. The contents will resemble the specification which lists all the elements each message sub-type must or might contain:

MESSAGES
message_id sender_id foo boo tip bli bla blubb
12 9 ‘bar’ null ‘top’ null null null
13 9 null null null 1 2 42

Why not…?

Yes, this structure is the reason someone invented document-oriented databases. However it is the only information that would require us to leave the relational world and operate yet another service. Also, the contents are tightly integrated into a lot of other tables, and we love referential integrity.

Storing key-value-pairs in a relational database

You can store key-value-pairs in a database, for example like this:

MESSAGES

id sender_id
12 9
13 9

VALUES_TABLE

id key value message_id
1 ‘foo’ ‘bar’ 12
2 ‘tip’ ‘top’ 12

Now in our case the keys are well defined, so you will have a lot of redundancies there. The idea I was reviewing extracted the keys into yet another table:

MESSAGES

id sender_id
12 9
13 9

VALUES_TABLE

id key_id value message_id
1 1 ‘bar’ 12
2 3 ‘tip’ 12

and

KEYS_TABLE

key_id key
1 ‘foo’
2 ‘boo’
3 ‘tip’
4 ‘bli’
5 ‘bla’
6 ‘blubb’

Wonderfull! No nulls, no redundancies, no more than four columns! This idea sounds great!

…and hell breaks loose

Database design fails can live in production for a long time until they get noticed. And that’s when all your precious data is already put into this bad scheme.

In the case of the above solution things get ugly the moment you query for something specific. For example all messages where ‘foo’ is ‘bar’:

SELECT * FROM MESSAGES m 
         JOIN VALUES_TABLE v ON m.id = v.message_id
         JOIN KEYS_TABLE k ON v.key_id = k.key_id
WHERE k.key    = 'foo' 
  AND v.value  = 'bar'

As opposed to

SELECT * FROM MESSAGES m WHERE m.foo = 'bar'

The problem only grows when you create this query from your business code: You have to get the ‘foo’ string from somewhere:

  • You might hard code it. Yak.
  • You might create an enum replicating the KEYS_TABLE. Until that table changes—and beware, not the structure, but the content!
  • You might be extra clever and save a JOINby using the key_id directly!

My best guess is that you’d wind up with all three options scattered in your code. Good look.

The solution / workaround

As I said in the beginning, tables with many columns and null entries might not be very elegant, but databases and developers are used to them. Also, document-oriented databases were invented for a reason. This might just be that reason.

There is another option: vertical partitioning. Especially in this use case, where there is a specification that provides the information on where to split:

MESSAGES

id sender_id
12 9
13 9

MESSAGES_PART_1

id foo tip
12 ‘bar’ ‘top’

MESSAGES_PART_2

id bli bla blubb
13 1 2 42

Let’s check for our query:

SELECT * FROM MESSAGES m 
    LEFT JOIN MESSAGES_PART_1 p1 ON m.id = p1.id
    LEFT JOIN MESSAGES_PART_2 p2 ON m.id = p2.id
WHERE p1.foo = 'bar'

Well… it doesn’t suck as much. However, if you’re on the INSERT side of the database you’re gonna have a bad time…

Blog


Storing case classes in a MongoDB database is incredibly easy, once you know how. The same goes for java.time classes such as ZonedDateTime, LocalDate or Duration.

This example uses the official Mongo Scala Driver in version 2.x and the bsoncodec project by Ralph Schaer.



This post contains the absolut essence from the Terraform Getting Started Guide: https://www.terraform.io/intro/getting-started/install.html



Ever wondered where bad code comes from?

“This story is done”

“Shouldn’t someone review it first?”

“Oh, yeah … erm … I’ll do a quick refactoring first and then…”

…when that other person is on holiday!



Some ideas seem great at first but turn out to be incredibly bad in hindsight. I reviewed such an idea today.



I tried to create a single image that contains all the most important git commands:



The Problem

You’re a corporation. Your IT department is old, slow and can’t innovate. Your competitor however can. So you try what every corporation tries: Two-speed IT.



Last week one of our programs failed looking up an airplane by its registration. That’s not a surprise, since ac regs are a horrible identifier. They change all the time. Also there is almost no naming rule at all. Wikipedia states

When painted on the fuselage, the prefix and suffix are usually separated by a dash (for example, YR-BMA). When entered in a flight plan, the dash is omitted (for example, YRBMA). In some countries that use a number suffix rather than letters, like the United States (N), South Korea (HL), and Japan (JA), the prefix and suffix are connected without a dash.



My favourite animal: The Beluga!



TL;DR: You don’t. We eventually gave up on it.

My personal lessons-learned:

  • Pentaho Kettle (or “Community Edition”, CE, i.e. the open-source core) is a great product for one-time data transfer or on-demand data transfer, but not for resilient, scheduled jobs.
  • The “Enterprise Edition” (EE) adds scheduling that doesn’t work reliably, and a very powerless server.
  • Kerberos is a bitch.



Task

You have a denormalized table and want to extract a column into a dimension table.

Caveat

You have to keep the ids.

Extra-Caveat

You use an Oracle database.



I cannot believe that googleing “talend does not work” does not find anything helpful. With this entry I try to fill that void in the internet.



#1

Do not fix your code.

Rather understand why nothing kept you from creating this bug. Make your code so easy that this bug would have been obvious the first time.



Being the IT-guy aka personal first-level-support™ for way more people than I am comfortable with, I have held quite a few mobile phones in my hand and stared into the abyss that is their home screen. The home screen is the modern view into someone’s soul. In a post-privacy-world it is probably one of the most private things we have, since it is utterly worthless to someone who does not interact with it on an hourly basis and has grown to live with whatever way the apps are scattered around the screen.



(via I Love Programming)

Thank goodness, we don’t do production.



Since I bought my personal domain name around 2003, I went through several web-solutions, using static html pages, php pages, a custom designed php cms, finally Wordpress and now, as of yesterday, I am back to static html. The 2016-flavour however, which is another attempt of separation of presentation and content (a concept I highly endorse as a LaTeX user).