Monitoring Solution Using Thanos, Part I: Understanding Thanos

Milind Dhoke
May 10, 2021

While designing a monitoring solution, many different aspects and corner cases need to be considered. Trust me, it's a real challenge to anticipate all of them.
I ran into several challenges myself while planning a seamless, efficient, highly fault-tolerant monitoring solution in which Prometheus, Alertmanager, and Grafana were the predecided, de facto tools. Prometheus has its own limitations that need to be addressed, but third-party solutions can be integrated with it to do so.

The capabilities I was looking for while choosing a monitoring stack were:
1. Long-term persistent storage
2. A single URL for querying the metrics collected from all targets
3. Faster querying of years-old data

To get these capabilities, I did some POCs with federation, Cortex, and Thanos, and chose Thanos, which is of course why this blog exists.
So the bottom line is that Thanos provides long-term persistent storage, a global query view (across clusters as well), faster query retrieval, and more. You can check out the official Thanos documentation for more details.

How does Thanos get plugged in?
Prometheus scrapes targets, creates time series database (TSDB) blocks, and stores them locally or on mounted volumes. These TSDB blocks must then be stored in a central storage location for long-term retention and quick querying in the future. So the question is, how can these TSDB blocks be sent to remote storage? Thanos can do this with the help of a component called the Thanos sidecar. Thanos is an ecosystem of several components: sidecar, receive, querier, store gateway, compactor, ruler, and bucket web.

Thanos Sidecar:
Each Prometheus instance has a Thanos sidecar attached to it, and this sidecar is given the remote storage configuration. Every 2 hours, when a block is ready, the sidecar picks it up and ships it to remote storage. Cool, isn't it? We've reached the first milestone of shipping data to a remote storage location for future querying. But what about real-time querying? That was the real downside of the sidecar approach for me. Nothing reaches remote storage for the first 2 hours after the Prometheus instance starts, so in my setup I could not retrieve any data during that window. That is not a good fit for production, since you would just be sitting in front of Grafana watching blank dashboards for 2 hours.
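
To make the 2-hour wait concrete, here is a minimal Python sketch of when the first block could be complete. It assumes blocks cover fixed 2-hour windows aligned to the Unix epoch in UTC and ignores any extra delay before Prometheus actually writes the block to disk. The function names are hypothetical and not part of Prometheus or Thanos.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

BLOCK_RANGE = timedelta(hours=2)  # block length mentioned above
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def block_window(ts: datetime) -> Tuple[datetime, datetime]:
    """Return the [start, end) 2-hour window that contains ts."""
    index = (ts - EPOCH) // BLOCK_RANGE  # whole windows elapsed since the epoch
    start = EPOCH + index * BLOCK_RANGE
    return start, start + BLOCK_RANGE


def earliest_upload(prometheus_start: datetime) -> datetime:
    """Earliest moment the first block window is complete (assumption: no extra delay)."""
    _, end = block_window(prometheus_start)
    return end
```

For a Prometheus instance started at 09:30 UTC, the sketch puts the first complete window at 08:00 to 10:00, so nothing could be shipped before 10:00 at the earliest.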

Thanos Receive:
Eventually, I had to drop the sidecar approach and read up on Thanos Receive. Thanos Receive is a component that accepts metrics pushed from Prometheus through remote write, makes them available for real-time querying, builds TSDB blocks from them, and ships those blocks to remote storage. This approach seemed perfect, but it also has some downsides and needs more work.

Thanos Store Gateway:
Once TSDB blocks are in remote storage, they need to be made available for future querying. This is the job of the Thanos store gateway. It acts as a gateway to the remote data whenever a query asks for long-term data.
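
As a rough illustration of the gateway idea, the sketch below picks the blocks whose time range overlaps a query's time range. It assumes each block carries a minimum and maximum timestamp, as in a block's metadata. The `BlockMeta` class, the `blocks_for_query` function, and the block IDs are made up for this example; the real store gateway does much more than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockMeta:
    ulid: str      # block identifier (example values are made up)
    min_time: int  # inclusive start, milliseconds since epoch
    max_time: int  # exclusive end, milliseconds since epoch


def blocks_for_query(blocks: List[BlockMeta], start: int, end: int) -> List[BlockMeta]:
    """Return the blocks whose time range overlaps the query range [start, end]."""
    return [b for b in blocks if b.min_time <= end and start < b.max_time]
```

Only the matching blocks need to be read from remote storage, which is what keeps long-range queries manageable.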

Thanos Compactor:
Remotely stored TSDB blocks must be well organized and compacted so that old data can be queried faster. The Thanos compactor fetches blocks from remote storage and applies the compaction process to them. It then ships a few more blocks back to remote storage. Yes, it can ship more blocks than it fetched, for example when it also writes downsampled copies of the data, so storage usage may increase, but not by a big factor. So this can be tolerated.
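
One reason a compaction step can add blocks is downsampling, where lower-resolution copies of the data are kept next to the raw data. The sketch below shows the general idea, assuming 5-minute buckets and a simplified set of aggregates. It is not the actual Thanos algorithm, and the `downsample` function is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, value)


def downsample(samples: List[Sample], resolution: int = 300) -> Dict[int, Dict[str, float]]:
    """Aggregate raw samples into fixed buckets of `resolution` seconds."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % resolution].append(value)  # align to the bucket start
    return {
        start: {"count": len(v), "sum": sum(v), "min": min(v), "max": max(v)}
        for start, v in sorted(buckets.items())
    }
```

A query over years of data can then read a handful of aggregates per bucket instead of every raw sample.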

Thanos Querier:
This is the champ of the Thanos ecosystem. It is the central point that all queries hit and from which results are returned. The querier is connected to Thanos Receive for real-time data and to the Thanos store gateway for old data. These endpoints are called stores. As long as both stores are working, all of the data can be queried through the querier. Think of it as the query component of Prometheus.
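
The fan-out idea can be sketched in a few lines of Python: ask every store for samples in the requested range, then merge the answers into one result. This is a simplified model under my own assumptions (stores are plain in-memory lists, and the first store wins when two stores return the same timestamp). The `fan_out` function is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[int, float]  # (timestamp, value)


def fan_out(stores: Dict[str, Iterable[Sample]], start: int, end: int) -> List[Sample]:
    """Collect samples in [start, end] from every store and merge them by timestamp."""
    merged: Dict[int, float] = {}
    for samples in stores.values():
        for ts, value in samples:
            if start <= ts <= end:
                merged.setdefault(ts, value)  # first store wins on duplicate timestamps
    return sorted(merged.items())
```

In this model, a store named "receive" would hold the recent samples and a store named "store-gateway" the old ones, while the caller only ever sees a single merged series.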

Thanos Ruler:
This is similar to Prometheus's rule component: each rule is evaluated with the help of a query, and when the results match, alerts are triggered.
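
Here is a minimal sketch of that evaluate-then-alert loop. It assumes a query function that returns one value per series and a rule with a simple threshold. In real rules the condition lives inside the query expression itself, so treat `evaluate_rule` and its parameters as hypothetical simplifications.

```python
from typing import Callable, Dict, List

# A query function takes an expression and returns {series_name: current_value}.
QueryFn = Callable[[str], Dict[str, float]]


def evaluate_rule(name: str, expr: str, threshold: float, query: QueryFn) -> List[str]:
    """Run expr through query and return one alert message per series above threshold."""
    result = query(expr)
    return [
        f"{name}: {series} = {value}"
        for series, value in sorted(result.items())
        if value > threshold
    ]
```

Passing the query function in as a parameter mirrors how the ruler depends on a query component rather than reading the data itself.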

Thanos Bucket Web:
This is an HTTP server that displays information about the TSDB blocks in remote storage. It is useful for analyzing things like how big the data was when it got pushed, or whether there was any gap in the sequence of stored blocks. With answers to these questions, we can fix issues with block shipping.
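
The gap check mentioned above is easy to picture in code. The sketch below takes the time ranges of the stored blocks and reports the ranges that no block covers. It is only an illustration of the analysis, not how bucket web is implemented, and `find_gaps` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # (min_time, max_time) of a block, max_time exclusive


def find_gaps(blocks: List[Range]) -> List[Range]:
    """Return the time ranges not covered by any block, between the first and last block."""
    gaps: List[Range] = []
    covered_until: Optional[int] = None
    for start, end in sorted(blocks):
        if covered_until is not None and start > covered_until:
            gaps.append((covered_until, start))  # nothing was stored for this range
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps
```

A non-empty result points at a window in which a block was never shipped, which is the kind of problem worth chasing down.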

So that was a short introduction to Thanos. Refer to the next part for the implementation.