Categories
Uncategorized

The case of a Key-Value pattern.

I’m quite conservative about how my application store its data. I prefer SQL over NoSQL. I prefer it’s security of the dynamic of NoSQL, which you would eventually probably use an ORM which have schema anyway. Relational databases have been used for decades, so it must work right? And really, most of us will probably never going to need the performance of a NoSQL database. That said, there are good uses for things like Redis, or other form of Key-Value store, notably for simple stuff like configurations. But for real work, for me SQL is the right way to go.

But what if you don’t know the structure of the data of the application? What if there are no structure? What if there are indeed patterns, but there are many special cases where in the end, the structure does not matter much.

For those who are not technically familiar with relational database, in relational database, there are things called ‘Table’ and that is where we store data. For example, if I have many car, I will put create a ‘Car’ table, and in the car table, I will declare the car’s properties such as the height and weight. But all car must have the same attribute. Sometimes, if some car don’t have a certain attribute, we just don’t fill the attribute for that car. That works. Sometimes, instead of cars, we have to represent vehicles, some can be car some can be bicycle. And there are many ways to represent this data, but the gist of it is this, all things of the same type generally have the same properties.

But what if out of a hundred car, two car have a special properties which no other car have. This is what I call special cases and I hate special cases. It makes the code ‘unclean’ and kinda makes me ‘itchy’. Sometimes, these two car have special calculation that need special treatment. Generally we put the name of the car in the configuration file and then special-handle them when they occur somewhere in our code.

In my latest project, I have a bunch of these special cases where this data is the same as these data, or these data is equal to these data multiplied by 100. But only these particular data. In this case, the number of ‘car’ is fixed and there are predefined number of them. But some of it have special characteristic which is not the same as other stuff. On thing the got in common is that they represent values over time. Or in another word, I was to keep record of what are the weight of this care at this particular time. And some car does not have weight.

The reason for this is that previously the business flow of the company is to use Excel sheet with various formula. And some cell is this cell multiplied by these cell. Using Excel is fine by itself, the problem is, half of the time, there are no obvious pattern to derive from. It is very hard to design the sql table for this case.

The original code of the project uses a key-value system where each value is mapped to a key. There can be multiple value per key which also have timestamp of when that value is active. And then in clientside there are list of formula of how to calculate certain values which is calculated from other values. Previously these ‘formula’ pattern is used in two place. In a place where the used periodically key in values, and a background process where it will fetch data from another service, and transform them using these formula. There are also other background process which run some calculation on these values to get some other values such as average, but those are hardcoded.

At first, I modified the key-value formula system so that the formula is stored in the database. Then it is calculated on clientside when needed as most of the calculation is for view purpose. Then came the graph, and some of the graph’s value is calculated from other value. That means if I were to calculate it on clientside, the browser would probably hang, and calculating it on demand would take too much time. So, I make a background task, kinda to calculate these value to store then for graphic purpose. And then some of the value require average of other values, which mean it needs access to multiple values over multiple time. And then I make it work. I make sure to sort the calculation order topologically for correctness, in case a calculation depends on another key which is also calculated. I make it as fast as I could by caching the data used by calculation before sending it to database, which significantly improve its calculation speed as it does not have to hit the database. I make the formula work while having partial set of data, without hitting the database at all, allowing the clientside to ask the server to quickly calculate the resulting value which allow the user to see the changes before it is saved.

Managing this large number of keys is a huge hassle, so I make a template system for the configurations so that I don’t have to repeat repeated patterns. To save space, value segment next to each other that have the same value is combined, effectively compressing the database. When the formula of a value is the key of another value, basically it just link to the other value, and storing the duplicated value is a waste of space and compute resource. So the storage layer is modified a bit so that it just point to the actual value. Linking is a great feature as it allows us to define mapping for a page and then link them to actual data later one. Which is really important as the data may came from multiple other page arbitrarily.

In another word I’ve made it in a way that any time-series number data can be relatively easily represented and linked to each other. Now here comes the big question, Why should I use it? Everyday I ask myself this question, and everyday it always comes to “How else to do this?”. You see, when you made such complex things you kinda ‘own it’. And you want everything to use it. Much like when I made a monadic LoadableResource pattern. But is it really needed? A question I ask myself everyday. My co-worker said that other developers don’t understand it. But ‘other developers’ is the kind of developer which needs to be explained the different between an array of string and an array of dictionary. What does they understand?

Some of the page now does have some pattern. But to me that sounds like it will need to create two more additional table, making sure the query is correct and making sure the calculation is correct. No to mention some data comes from other page. And the structure of the other page may not be the same. The list of car on the other page is not the same as the list of car in this page. And how are you going to map from one page to another? By creating a configuration file. I know, lets make sure all configuration file is stored in the same place and for mapping between two page, have the same structure. That sounds a lot like the whole value mapping system before, except there will be multiple places that do the same thing.

For me the biggest downside is managing the mapping names. There area a lot of them. Up to 6000 right now. But they are defined in multiple file, each file separated by its domain. And there are the template system. Its just that no one is using them except me. Even so, it still feels a bit cumbersome for me. If it gets even more complex, it might as well be its own programming language. Additionally, if it makes so much sense, why haven’t I heard about such pattern before? Is it just very domain specific, or I am just missing the term to google it? Who knows.