Does your querying strategy match your user stories?

How clear user stories help to guide the design of InfluxDB queries

by Nick Gambino, Software Engineer

‍

When building responsive, data-driven applications, optimizing the queries that return data to the user interface is an important factor to consider. It's important to ask questions that can determine the amount of data fidelity to provide to the end user: What kinds of data does the user need? What sort of interactions or decisions will the user need to make based on the data? Is the data actionable? Does it need to be actionable?

For instance, in a monitoring solution, you often don’t need to provide an extensive amount of historical data to the user, when all they care about is what happened most recently. In another use case, you may need to have more historical context over a longer period of time to illustrate macro trends, where second-to-second data isn’t all that relevant. Defining what these needs are in the form of user stories is a good way to contextualize how these queries need to execute.

In this article, we will define two different user stories within an application that help to illustrate what the data needs are for each, and how the queries can be structured to improve performance and usability.

Let’s assume that we are building a grocery store portal application. This application will be accessible by both store managers and regional managers at the corporate headquarters. Both of these users might need to interact with this application in multiple ways depending on what their end goal is.

Takeaway 1: Sometimes, the current trend is what’s most important

“As a store manager, I want to know whether or not my store is reaching full capacity so that I can react to surges in customers effectively”

A simple story like this can give the engineering team some context as to what data they should query from the database in order to provide actionable insights to the user. More importantly, it tells the engineer what data they don’t need to include in the query.

Notice how we are interested more in how quickly the capacity is changing, rather than what the capacity currently is. This variable would be helpful when making certain managerial decisions (opening up additional cashiers, retrieving excess shopping carts in parking lots). These decisions could lead to lower lines at cash registers, and result in higher customer satisfaction. The rate that the store’s capacity changes can therefore allow the store manager to make an informed decision on whether or not to allocate employees to cash registers, or other store operations to optimize efficiency.

These types of data points are ideally suited for a streaming, time-series database like InfluxDB because they are concerned more with the rate of change rather than the current state. More specifically, a time-series database provides the ability to index a particular column in storage, and continuously update a single datapoint that can then be queried by the application. By indexing this value, the queries in our application now have a reference to access that particular data point, rather than needing to traverse the entire database. This datapoint typically takes the form of a “delta” or change in the value compared to the previous time window. This allows the engineer to designate different time windows in context of what the user needs. You can check out our more detailed article on the difference between time-series and relational db’s here.

Okay, but...exactly what time period is the user actually concerned with? In order to make an informed decision, it’s probably not useful for the manager to know what the delta is for the previous month’s data. The manager may really only be concerned with the delta in the last ten-minute window, depending on how quickly they could reallocate their staff on the grocery store floor. This provides additional context to the engineer when deciding on how to best aggregate and window their time-series data.

We can query this fictitious dataset with a query similar to the following:

SELECT DIFFERENCE(MAX("capacity"))

FROM "capactiy_percent"

WHERE time >= '2020-08-18T00:00:00Z'

AND time <= '2020-08-18T00:30:00Z'

AND "location" = 'honolulu'

GROUP BY time(10m)

Structuring the InfluxQL query in this way allows the application to provide the store manager with high fidelity data, allowing them to understand the rate of change in store capacity in an actionable way. In other words, the rate of change in store capacity provides more context to the store manager than the current state of capacity would.

Takeaway 2: Other times, you really just need the bigger picture.

There are other times when high-fidelity data may not be the best answer to the problem you are trying to solve. Let’s still assume that we are building our grocery store portal application, but rather than trying to understand current capacity patterns to better react to surges in customers, we want to provide a more macro-view on longer term store capacity trends. Again, let’s start with a user story to conceptualize the data that we need:

“As a regional manager, I want to know the average capacity of each of my stores so that I can allocate corporate budgets accordingly”

This is another simple example that illustrates how users may want to use similar pieces of functionality in your app in different ways. In this scenario, you may have a single widget that allows a user to either check the real-time capacity changes (micro), as well as view an overall summary of what the capacity is like over a longer period of time (macro). When developing this widget, you should consider the fact that these two features don’t need to query the same data, and can be used to provide only relevant information to the user. The regional manager may make a decision based on monthly average capacity (open a new location, invest in additional expansion of the store), where more macro data is needed.

In this case, it might be more intuitive to provide the regional manager with a series of static data points that describe daily capacity variables for each day within a specific time period. This type of data can be statically queried at specified time intervals, rather than streaming deltas over a series of time windows. These data points may be sufficient in providing users with actionable data for their needs.

capacityData =

from(bucket: "capacity-metrics") | > range(

start: '2020-08-18T00:00:00Z',

stop: '2020-08-18T00:30:00Z'

) | > filter(

fn: (r) = > r._measurement == "capacity"

and r._field == "percentage"

)

This dataset can be further fine tuned into separate “windows” in order to more accurately group the requested data based on time-bounds:

capacityData

|> window(every: 24h)

Finally, InfluxDB provides the ability to immediately mutate data directly from within the query in order to extract meaningful insights into the data.

capacityData

|> window(every: 24h)

|> mean()

This query will then return a single data point, representing the average capacity, on a daily basis over the entire month of August.

Conclusion:

This might sound obvious, but it’s important to understand the differences between these two different use cases. The examples provided in this post are only meant to illustrate different nuances that could affect your data aggregation strategies.

In building data-driven applications with our clients, we’ve come to appreciate learning as much about the user experience as possible to recommend the appropriate technology decisions, and hope that this also helps other engineers develop a human-centered approach in creating software. If you have other topics you’d like to cover, please drop us a line at hello(at)sudokrew.com! We love hearing from engineers!