09/04/2019

Telemetrics for RediSQL

RediSQL, SQL steroids for Redis, is a very fast in-memory SQL engine. Its main features are:
  1. Speed, up to 130,000 inserts per second
  2. Familiarity, it supports standard SQL, no weird dialects
  3. Simplicity, it is very easy to operate and to use, with bindings for any language.
Code on github: RedBeardLab/rediSQL
 

RediSQL is the product that Redbeardlab is launching. We decided to adopt telemetrics in the product so that we know which functionality is used and where it is worth investing our efforts. Moreover, telemetrics is active in all the non-PRO versions of RediSQL, and it shuts itself down if it is not able to communicate with the telemetrics server.

This post will describe the technical details of adding telemetrics to RediSQL. We will discuss what those metrics are, how they are gathered without incurring any performance degradation, and how we mitigate the risk of shutting down legitimate users who suffer a temporary network issue.

This post is strictly technical; the motivation behind the strategic choice of implementing telemetrics is discussed here.

What metrics are collected

Whenever we talk about data it is extremely important to specify what data is collected and how.

From version 0.9.2 of RediSQL we added a new command, REDISQL.STATISTICS. This command returns:

  • How many times each command has been invoked
  • How many times each command succeeded
  • How many times each command failed

In the specific case of RediSQL, a command is an action like REDISQL.CREATE_DB or REDISQL.EXEC.
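
As an illustration, here is a minimal sketch of reading these counters from a Rust client. It assumes the redis crate as the client library; the exact layout of the reply is not reproduced here, so print it and inspect it yourself.

    // A minimal sketch, assuming the `redis` crate as the client library.
    fn main() -> redis::RedisResult<()> {
        // Connect to the Redis instance where the RediSQL module is loaded.
        let client = redis::Client::open("redis://127.0.0.1/")?;
        let mut con = client.get_connection()?;
        // REDISQL.STATISTICS takes no arguments and returns, for each command,
        // how many times it was invoked, succeeded and failed.
        let stats: redis::Value = redis::cmd("REDISQL.STATISTICS").query(&mut con)?;
        println!("{:?}", stats);
        Ok(())
    }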

To gather this kind of statistics we never, ever, look into the arguments of the commands themselves. Hence we never, ever, collect nor look into the name that you gave to your database, or the schema of the database.

Only the number of times a command is invoked is registered.

How to collect metrics without impacting performance

Since all we are interested in is how many times a specific command has been invoked and its failure rate, all we need is a humble counter.

Moreover, the counter is strictly increasing and it is not really important that the count is absolutely accurate (if the counter reports 12476 while the correct number is 12479 it is not a big deal).

Given all these assumptions we can safely use atomic counters to keep track of all our statistics. Indeed, for each command we keep 3 different counters. The first counter keeps track of the total invocations of the command, the second keeps track of the successful returns, and the last keeps track of the errors.

In Rust each operation on an atomic counter also takes an ordering parameter (more here) that specifies how strict we want to be in terms of memory synchronization. Those parameters range from Relaxed (some order exists but we don’t know which one) to SeqCst, sequentially consistent (all threads see exactly the same sequence of operations happening). Of course, the more strict we are about memory synchronization, the slower our code will run.

In our case we don’t care about extreme accuracy of the counters, hence we opted for the Relaxed memory ordering, which incurs the smallest speed penalty.
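
To make this concrete, here is a minimal sketch (not the actual RediSQL source) of what such per-command counters look like with Relaxed atomic operations:

    use std::sync::atomic::{AtomicUsize, Ordering};

    struct CommandStats {
        invoked: AtomicUsize, // total invocations of the command
        success: AtomicUsize, // invocations that returned successfully
        failure: AtomicUsize, // invocations that returned an error
    }

    impl CommandStats {
        const fn new() -> CommandStats {
            CommandStats {
                invoked: AtomicUsize::new(0),
                success: AtomicUsize::new(0),
                failure: AtomicUsize::new(0),
            }
        }

        fn record(&self, ok: bool) {
            // Relaxed: we only need the counts to be roughly right, we do not
            // need the counters to synchronize any other memory between threads.
            self.invoked.fetch_add(1, Ordering::Relaxed);
            if ok {
                self.success.fetch_add(1, Ordering::Relaxed);
            } else {
                self.failure.fetch_add(1, Ordering::Relaxed);
            }
        }
    }

    // One static instance per command, for example for REDISQL.EXEC:
    static EXEC_STATS: CommandStats = CommandStats::new();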

In order to be sure that keeping these counters does not incur any performance penalty, we ran a statistical test. The test showed that there is no (significant) time difference between using the counters and not using them.

Hence using the Relaxed counters does not incur any performance difference.

Risk mitigation for legitimate users

A key requirement for the telemetrics is that it should not damage legitimate users, even in case of failures.

We approach this requirement from two different points of view, server-oriented and client-oriented.

Two different telemetrics services

In order to mitigate the sources of error from the server point of view we created two different telemetrics services. Those services:

  • have different implementations (AWS Lambda and a home-grown implementation),
  • are exposed under different domain names (redisql.com and redbeardlab.com),
  • which are managed by different domain registrars (AWS Route53 and Cloudflare),
  • are hosted by different cloud providers (AWS and Scaleway),
  • in different regions (Global and France).

Finally, the home-grown implementation is reachable over both HTTPS and HTTP, with plain HTTP being the very last option tried.
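
Putting the two services together, the client can simply walk the endpoints in order and fall back to the next one on failure. The sketch below assumes the ureq HTTP client and uses illustrative endpoint URLs and names, not the ones RediSQL actually uses:

    // A minimal sketch, assuming the `ureq` crate; URLs are illustrative.
    fn send_telemetrics(payload: &str) -> bool {
        let endpoints = [
            "https://telemetrics.redisql.com/",     // AWS Lambda, Route53, AWS
            "https://telemetrics.redbeardlab.com/", // home-grown, Cloudflare, Scaleway
            "http://telemetrics.redbeardlab.com/",  // plain HTTP, the very last resort
        ];
        for &url in &endpoints {
            // Stop at the first endpoint that accepts the payload.
            if ureq::post(url).send_string(payload).is_ok() {
                return true;
            }
        }
        // Every endpoint failed; the grace period described below takes over.
        false
    }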

Grace period on the client

Moreover, we understand that operating software is not simple, and even if our services are always reachable, honest errors and mistakes can also happen on the client side.

In order to avoid that a single temporary issue completely blocks our software, we provide a very generous grace period for each instance.

We use a simple algorithm called “Leaky Bucket”.

The algorithm is quite simple.

  • Every hour we leak one time unit and we try to contact the telemetrics services.
  • Every successful telemetric connection adds 5 time units to the bucket.
  • The bucket has a capacity of 120 time units.
  • When the bucket is empty the grace period is over and RediSQL shuts itself down.

This algorithm gives a lot of flexibility to the instances and allows them to keep working on extremely flaky connections.

Indeed, a client can go up to 120 hours (5 days) without ever contacting the telemetrics server. Moreover, for each successful connection the client adds 5 more hours of grace period (up to 120).
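
Here is a minimal sketch of this leaky bucket; the names, and the assumption that the bucket starts full, are illustrative and not taken from the actual RediSQL implementation:

    const BUCKET_CAPACITY: u32 = 120;  // time units (hours) of grace period
    const REFILL_PER_SUCCESS: u32 = 5; // units added by a successful connection

    struct LeakyBucket {
        level: u32,
    }

    impl LeakyBucket {
        fn new() -> LeakyBucket {
            LeakyBucket { level: BUCKET_CAPACITY }
        }

        // Called once per hour: leak one unit, then account for the attempt to
        // contact the telemetrics services. Returns false when the grace period
        // is over and RediSQL should shut itself down.
        fn tick(&mut self, telemetrics_succeeded: bool) -> bool {
            self.level = self.level.saturating_sub(1);
            if telemetrics_succeeded {
                self.level = (self.level + REFILL_PER_SUCCESS).min(BUCKET_CAPACITY);
            }
            self.level > 0
        }
    }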

Conclusion

This post was meant to be a high-level technical overview of the telemetrics in RediSQL.

The actual implementation is accessible on the main repository.

I hope that we have clarified and explained the technicalities of the telemetrics in RediSQL and convinced people that there is nothing evil about it. Finally, it is important to remember that telemetrics is active only on the free version of RediSQL.

If you don’t like the idea, please consider purchasing the complete version of RediSQL.
