Spring Boot 2.0 HTTP request metrics with Micrometer

Introduction

Spring Boot 2.0 has just been released, and at TouK we couldn’t wait to try it in production. One of the newly added features we investigated was the metrics system based on the Micrometer library (https://micrometer.io/). In this post I will cover some of our experiences with it so far.

The goal was to get basic HTTP request metrics, report them to InfluxDB and draw some fancy graphs in Grafana. In particular we needed:

  • Throughput – total number of requests in given time unit
  • Response status statistics – how many 200-like and 500-like responses occurred
  • Response time statistics: mean, median, percentiles

What was wrong with Dropwizard metrics

Nothing that I am aware of. The Metrics Spring integration, however, is a different story…

The last stable release of Metrics Spring (3.1.3) came out in late 2015 and was compatible with Dropwizard Metrics 3.1.2. Since then Dropwizard Metrics has moved on to versions 4 and 5, but Metrics Spring has effectively died. This has a couple of rather unpleasant consequences:

  • There are some known bugs that will never be solved
  • You can’t benefit from Dropwizard Metrics improvements
  • Sooner or later you will use a library that depends on a different version of Dropwizard Metrics and it will hurt

As an InfluxDB user I was also facing some problems with reporting tags. After a couple of tries we ended up using an obscure Graphite interface that was luckily compatible with Influx.

Let’s turn on the metrics

Adding metrics to your Spring Boot project can be done in three very simple steps. First, add a dependency on micrometer-registry-xxx, where xxx is your favourite metrics backend. In our case:

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-influx</artifactId>
</dependency>


Now it is time for just a little bit of configuration in application.yml:

management:
  metrics:
    export:
      influx:
        uri: http://localhost:8086
        db: services
        step: 5s # (1)
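
Before the data reaches InfluxDB, it can be handy to inspect what Micrometer collects locally. Assuming spring-boot-starter-actuator is on the classpath (it is pulled in alongside the registry), the metrics endpoint can be exposed with one more piece of application.yml — a sketch using the Spring Boot 2.0 actuator property names:

management:
  endpoints:
    web:
      exposure:
        include: metrics

With that in place, GET /actuator/metrics/http.server.requests shows the collected HTTP meters and their tags.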


And a proper configuration bean:

@Configuration
public class MetricsConfig {

    private static final Duration HISTOGRAM_EXPIRY = Duration.ofMinutes(10);

    private static final Duration STEP = Duration.ofSeconds(5);

    @Value("${host_id}")
    private String hostId;

    @Value("${service_id}")
    private String serviceId;

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() { // (2)
        return registry -> registry.config()
                .commonTags("host", hostId, "service", serviceId) // (3)
                .meterFilter(MeterFilter.deny(id -> { // (4)
                    String uri = id.getTag("uri");
                    return uri != null && uri.startsWith("/swagger");
                }))
                .meterFilter(new MeterFilter() {
                    @Override
                    public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
                        return config.merge(DistributionStatisticConfig.builder()
                                .percentilesHistogram(true)
                                .percentiles(0.5, 0.75, 0.95) // (5)
                                .expiry(HISTOGRAM_EXPIRY) // (6)
                                .bufferLength((int) (HISTOGRAM_EXPIRY.toMillis() / STEP.toMillis())) // (7)
                                .build());
                    }
                });
    }
}


Simple as that. It is certainly not a minimal working example, but I believe some of our ideas are worth mentioning.

Dive into configuration

The config is rather self-explanatory, but let’s take a look at a couple of interesting features.

(1) Step defines how often data is sent by the reporter. This value should match your expected traffic, because you don’t want to see 90% zeros.

(2) Be aware that many reporters can share the same config. Each reporter’s behaviour can be customised by using a more specific type parameter, e.g. InfluxMeterRegistry.

(3) Tags that will be added to every metric. As you can see, this is very handy for identifying hosts in a cluster.

(4) Skipping unimportant endpoints limits unwanted data.

(5) A list of percentiles you would like to track.

(6)(7) Histograms are calculated over a defined time window, in which more recent values have a bigger impact on the final value. The bigger the time window, the more accurate the statistics, but the slower the percentiles react to a sudden very big or very small response time. It is also very important to increase the buffer length as you increase the expiry time.
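
The arithmetic behind (7) is simple: the histogram ring buffer needs one slot per reporting step within the whole expiry window. A minimal sketch of that calculation in plain java.time (the class and method names here are mine for illustration, not Micrometer API):

```java
import java.time.Duration;

public class BufferLengthSketch {

    // One rotating histogram slot per reporting step within the expiry window.
    static int bufferLength(Duration expiry, Duration step) {
        return (int) (expiry.toMillis() / step.toMillis());
    }

    public static void main(String[] args) {
        Duration histogramExpiry = Duration.ofMinutes(10);
        Duration step = Duration.ofSeconds(5);
        // 10 minutes / 5 seconds = 120 slots
        System.out.println(bufferLength(histogramExpiry, step));
    }
}
```

With a 10-minute expiry and a 5-second step this gives 120 slots; shrinking the step without growing the buffer would silently shorten the effective histogram window.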

Afterthought

We believe migrating to Micrometer is worth the time, as configuration and reporting become simpler. The only thing that surprised us was that throughput and status counts are reported as rates rather than cumulative values. But that is another story to be told…

Special thanks to Arek Burdach for support.

You May Also Like

Recently at storm-users

I've been reading through the storm-users Google Group recently. This was heavily inspired by Adam Kawa's post "Football zero, Apache Pig hero". Since I've encountered a lot of insightful and very interesting information, I've decided to describe some of it in this post.

  • nimbus will work in HA mode - There's a pull request open for it already... but some recent work (distributing topology files via Bittorrent) will greatly simplify the implementation. Once the Bittorrent work is done we'll look at reworking the HA pull request. (storm’s pull request)

  • pig on storm - Pig on Trident would be a cool and welcome project. Join and groupBy have very clear semantics there, as those concepts exist directly in Trident. The extensions needed to Pig are the concept of incremental, persistent state across batches (mirroring those concepts in Trident). You can read a complete proposal.

  • implementing topologies in pure python with petrel looks like this:

class Bolt(storm.BasicBolt):
    def initialize(self, conf, context):
        ''' This method is executed only once '''
        storm.log('initializing bolt')

    def process(self, tup):
        ''' This method is executed every time a new tuple arrives '''
        msg = tup.values[0]
        storm.log('Got tuple %s' % msg)

if __name__ == "__main__":
    Bolt().run()

  • Fliptop is happy with storm - see their presentation here

  • topology metrics in 0.9.0: The new metrics feature allows you to collect arbitrary custom metrics over fixed windows. Those metrics are exported to a metrics stream that you can consume by implementing IMetricsConsumer and configure with Config.java#L473. Use TopologyContext#registerMetric to register new metrics.

  • storm vs flume - some users' point of view: I use Storm and Flume and find that they are better at different things - it really depends on your use case as to which one is better suited. First and foremost, they were originally designed to do different things: Flume is a reliable service for collecting, aggregating, and moving large amounts of data from source to destination (e.g. log data from many web servers to HDFS). Storm is more for real-time computation (e.g. streaming analytics) where you analyse data in flight and don't necessarily land it anywhere. Having said that, Storm is also fault-tolerant and can write to external data stores (e.g. HBase) and you can do real-time computation in Flume (using interceptors)

That's all for this day - however, I'll keep on reading through storm-users, so watch this space for more info on storm development.