In God we trust, all others bring data

One of our core beliefs here at Base is that we should make data-driven decisions. As Base CTO Paweł Niżnik puts it:

You cannot execute well without measurement and proper analysis and you cannot really measure anything without having the data in the first place.

I was recently involved in several projects that required collecting and consuming data so that we could use and understand it better. In this Fluentd blog series I would like to share the lessons I learned along the way. This first part is a brief introduction to Fluentd; the following parts will focus on real-life examples and the know-how I gained from them.

Fluentd in a nutshell

Fluentd provides a unified logging layer for servers (e.g. td-agent) and embedded devices (e.g. Fluent Bit). It is often used as the collection and transport layer in a centralized logging architecture. Fluentd has a long list of features and supported systems, but almost none of them are built in. Instead, there is a flexible plugin architecture [1] that you can use to customize Fluentd to your needs.

Fluentd Pluggable Architecture

There are 6 types of plugins: Input, Output, Parser, Formatter, Filter and Buffer. I will describe most of them throughout the article.

The life of a Fluentd event

The best way to understand Fluentd is to put oneself in the Fluentd event’s shoes!

Event birth and collection

Input plugins extend Fluentd to gather event logs from external sources. An Input plugin may listen for incoming events (e.g. tcp) or periodically pull data from a source, as tail does when it follows a log file:

<source>
  type tail
  tag "nginx.accesslog"
  format nginx
  path /var/log/nginx/nginx.access.log
  pos_file /var/log/nginx/nginx.access.pos
  # Fluentd will record the position it last read into pos_file
</source>
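To make the role of pos_file more concrete, here is a minimal Python sketch of the idea: remember the byte offset that was read last and continue from there on the next run. This is only an illustration under simplifying assumptions, not how the tail plugin is implemented. The function name read_new_lines and the file names are hypothetical, and the sketch ignores log rotation, which the real plugin has to handle.

```python
import os
import tempfile


def read_new_lines(log_path, pos_path):
    """Return complete lines appended to log_path since the last call.

    The byte offset of the last consumed line is stored in pos_path,
    which plays the role of pos_file in this simplified sketch.
    """
    offset = 0
    if os.path.exists(pos_path):
        with open(pos_path) as pos_file:
            offset = int(pos_file.read().strip() or 0)

    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()

    # Consume only complete lines; a partial last line waits for the next call.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()

    with open(pos_path, "w") as pos_file:
        pos_file.write(str(offset + end))
    return lines


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "access.log")  # hypothetical file names
    pos_path = os.path.join(tmp, "access.pos")

    with open(log_path, "w") as log:
        log.write("first\nsecond\n")
    batch1 = read_new_lines(log_path, pos_path)

    with open(log_path, "a") as log:
        log.write("third\n")
    batch2 = read_new_lines(log_path, pos_path)
    batch3 = read_new_lines(log_path, pos_path)

print(batch1, batch2, batch3)
```

Because the offset lives in a file, a restart of the collector does not cause already processed lines to be read again.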

Each generated event consists of three entities: time, tag and record (shown here as JSON, stored internally as MessagePack). Here is an example event log:

2015-09-09 10:01:15 +0000 example.tag: {"action":"login","user":42}
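The three entities can be modelled in a few lines of Python. The Event type and the format_event helper below are hypothetical names used only for illustration; the sketch renders the record as JSON and leaves out the MessagePack encoding.

```python
import json
from datetime import datetime, timezone
from typing import NamedTuple


class Event(NamedTuple):
    """Hypothetical stand-in for a Fluentd event: time, tag and record."""
    time: datetime
    tag: str
    record: dict


def format_event(event):
    """Render an event in the same layout as the example line above."""
    timestamp = event.time.strftime("%Y-%m-%d %H:%M:%S %z")
    payload = json.dumps(event.record, separators=(",", ":"))
    return f"{timestamp} {event.tag}: {payload}"


event = Event(
    time=datetime(2015, 9, 9, 10, 1, 15, tzinfo=timezone.utc),
    tag="example.tag",
    record={"action": "login", "user": 42},
)
print(format_event(event))
```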

There are also many integrations for common software, such as Nginx and Docker, that submit events to Fluentd directly, as well as bindings for many programming languages. If your apps follow the twelve-factor methodology you can make use of Fluentd as well: capture the output streams in the execution environment using fluent-cat (included in the distribution):

./myapp > >(fluent-cat myapp.stdout) 2> >(fluent-cat myapp.stderr)

Event transport

Fluentd uses tag-based routing, so every input (source) needs to be tagged. Fluentd compares an event’s tag with the match directives in the order in which they appear in the config file and sends the event to the first output that matches. The most common use of the match directive is to output (transport) events to other systems (e.g. S3, Elastic, filesystem):

<match nginx.accesslog>
  type file
  path /backup/nginx/accesslog
  compress gzip
</match>
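The routing rule can be sketched in Python as follows. This is a simplified model, not Fluentd's code: it supports exact tags, * for exactly one tag part and ** for zero or more tag parts, and leaves out the other pattern forms. The function names and the output names are hypothetical.

```python
def tag_matches(pattern_parts, tag_parts):
    """Check a dot-separated tag against a dot-separated match pattern."""
    if not pattern_parts:
        return not tag_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        # "**" may swallow zero or more tag parts.
        return any(
            tag_matches(rest, tag_parts[i:]) for i in range(len(tag_parts) + 1)
        )
    if not tag_parts:
        return False
    if head == "*" or head == tag_parts[0]:
        return tag_matches(rest, tag_parts[1:])
    return False


def route(tag, outputs):
    """Return the first output whose pattern matches, in config order."""
    for pattern, output_name in outputs:
        if tag_matches(pattern.split("."), tag.split(".")):
            return output_name
    return None


# Order matters: the catch-all pattern has to come last.
outputs = [
    ("nginx.accesslog", "file_backup"),
    ("nginx.*", "other_nginx_logs"),
    ("**", "catch_all"),
]

print(route("nginx.accesslog", outputs))
print(route("nginx.errorlog", outputs))
print(route("myapp.stdout", outputs))
```

If the catch-all pattern were listed first, it would capture every event and the more specific outputs would never receive anything.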

There are two types of Output plugins: Non-Buffered and Buffered.

Non-Buffered vs. Buffered Output Plugins

Non-Buffered Output plugins do not buffer data and write out results immediately. There is one caveat to consider with them: they can block (e.g. on an HTTP POST), and while an output is blocked no new inputs can be received. To overcome this limitation some Output plugins are Buffered. Buffer behavior is defined by a separate Buffer plugin, so users can choose the Buffer plugin that best suits their performance and reliability needs. Two common types are buf_memory and buf_file.
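The difference can be illustrated with a small Python sketch of an in-memory buffer. It is an assumption-laden model, not Fluentd's implementation: the BufferedOutput class, its methods and the flaky_write destination are hypothetical, and flush is called by hand here where a real system would run it in the background and retry with a delay.

```python
class BufferedOutput:
    """Collects records in memory and writes them out as one chunk."""

    def __init__(self, write):
        self._write = write  # callable that delivers a list of records
        self._chunk = []

    def emit(self, record):
        """Accept a record without touching the (possibly slow) destination."""
        self._chunk.append(record)

    @property
    def pending(self):
        return len(self._chunk)

    def flush(self):
        """Try to deliver the chunk; keep it for a later retry on failure."""
        if not self._chunk:
            return True
        try:
            self._write(list(self._chunk))
        except OSError:
            return False
        self._chunk.clear()
        return True


delivered = []
attempts = {"count": 0}


def flaky_write(chunk):
    """Hypothetical destination that is unavailable on the first attempt."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise OSError("destination unavailable")
    delivered.extend(chunk)


output = BufferedOutput(flaky_write)
output.emit({"code": "200"})
output.emit({"code": "404"})

first_try = output.flush()   # fails, the chunk stays in the buffer
second_try = output.flush()  # succeeds, the chunk is delivered
print(first_try, second_try, delivered)
```

A memory buffer like this one loses its content when the process dies, which is the kind of trade-off that makes the choice between buf_memory and buf_file a question of performance versus reliability.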

Event filtering

Filter plugins enable Fluentd to modify event streams before they are sent to the matched output. Example use cases include (from the Fluentd documentation):

  • Filtering out events by grepping the value of one or more fields.
  • Enriching events by adding new fields.
  • Deleting or masking certain fields for privacy and compliance.

The filter directive has exactly the same syntax as the match directive. The difference is that filters can be chained into a processing pipeline, for example:

Source -> filter1 -> ... -> filterN -> Output

Once a filter has processed an event, the event continues through the configuration top-down, so if there are multiple filters for the same tag they are all applied in order. For instance, as a first step we can keep only successful requests (status code 2xx; remember that we parse Nginx access logs, so each event represents an HTTP request), and as a second step we can add the hostname of the machine the event came from (events generated by the tail Input plugin do not contain it).

<filter nginx.accesslog>
  type grep
  regexp1 code ^2..$
</filter>

<filter nginx.accesslog>
  type record_transformer
  <record>
    source_host "#{Socket.gethostname}"
  </record>
</filter>
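The same two-step chain can be sketched in Python to show how an event moves through it. The helper names grep_filter, add_field and run_pipeline are hypothetical and only mirror the idea of the grep and record_transformer filters; they are not part of Fluentd.

```python
import re
import socket


def grep_filter(key, pattern):
    """Keep a record only if the value of `key` matches `pattern`."""
    regex = re.compile(pattern)

    def apply(record):
        value = str(record.get(key, ""))
        return record if regex.search(value) else None

    return apply


def add_field(key, value):
    """Return a copy of the record enriched with one extra field."""

    def apply(record):
        enriched = dict(record)
        enriched[key] = value
        return enriched

    return apply


def run_pipeline(record, filters):
    """Apply filters top-down; a filter returning None drops the event."""
    for apply in filters:
        record = apply(record)
        if record is None:
            return None
    return record


filters = [
    grep_filter("code", r"^2..$"),                   # step 1: only 2xx requests
    add_field("source_host", socket.gethostname()),  # step 2: add the hostname
]

kept = run_pipeline({"code": "200", "path": "/"}, filters)
dropped = run_pipeline({"code": "404", "path": "/missing"}, filters)
print(kept, dropped)
```

The order of the filters is deliberate: dropping unwanted events first means the enrichment step only runs for events that will reach the output.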

Summary

I hope you enjoyed this short journey through the life of a Fluentd event. If there is one thing you should take away from it, it is Fluentd’s powerful pluggable architecture, which makes Fluentd very flexible and customizable. If you can’t find a plugin that suits your needs and wonder how to write one yourself, subscribe to our blog’s newsletter so that you don’t miss the next parts of this series, where I will show you how to write your own Elastic slowlog parser plugin and take a closer look at Fluentd’s resiliency and performance.


  1. http://www.fluentd.org/assets/img/architecture/pluggable.png
  2. Cover photo by r2hox licensed with Attribution-NonCommercial 2.0 Generic License

Posted by

Mirek Nagaś