Why Monitor in a Box?
There should be one single, revision-controlled specification for which metrics to collect.
The entire monitoring stack should come up with one command. Updates should be applicable with one command and testable in a virtualized staging environment.
The monitoring system should detect changes in a dynamic infrastructure: any change we make to our infrastructure should be detected from the ground truth of our application and system infrastructure footprint. Any change not made by us (e.g. a host loses IP connectivity or a process dies) should be detected as such, and not confused with an intentional change.
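One way to picture the distinction between intentional and unintentional changes is a comparison of three sets: the previous revision of the specification, the current revision, and what is actually observed on the network. The following is a minimal sketch of that idea, not the Monitor in a Box implementation; the function name `diff_inventory` and the host names are hypothetical.

```python
def diff_inventory(previous_spec, current_spec, observed):
    """Separate intentional changes from unexpected ones.

    All three arguments are sets of host (or process) names.
    previous_spec and current_spec come from the revision-controlled
    specification; observed is what was actually seen running.
    """
    return {
        # Differences between spec revisions are changes we made ourselves.
        "intentionally_added": sorted(current_spec - previous_spec),
        "intentionally_removed": sorted(previous_spec - current_spec),
        # Differences between the current spec and reality are not.
        "unexpectedly_missing": sorted(current_spec - observed),
        "unexpectedly_present": sorted(observed - current_spec),
    }


# Hypothetical example: web2 was added on purpose, db1 went away on its own.
report = diff_inventory(
    previous_spec={"web1", "db1", "old1"},
    current_spec={"web1", "web2", "db1"},
    observed={"web1", "web2"},
)
```

Here only `db1` would need an alert, since it is the only entry that the current specification expects but that is not observed.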
Metric Collection and Thresholds
Collect and maintain a history of both application performance and infrastructure health metrics.
A sensible set of widely relevant metrics should be collected by default. Where possible, warning and failure thresholds (which trigger notifications) should be defined for every metric by default.
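As an illustration of default warning and failure thresholds, the sketch below maps a metric value to a state. The metric names and threshold values are assumptions chosen for the example and are not the defaults shipped with Monitor in a Box.

```python
# Hypothetical defaults: metric name -> (warning threshold, failure threshold).
DEFAULT_THRESHOLDS = {
    "cpu_percent": (80.0, 95.0),
    "disk_used_percent": (85.0, 95.0),
}


def classify(metric, value, thresholds=DEFAULT_THRESHOLDS):
    """Return "ok", "warn", or "fail" for a metric value.

    Metrics without a defined threshold are reported as "unknown"
    instead of being silently treated as healthy.
    """
    if metric not in thresholds:
        return "unknown"
    warn, fail = thresholds[metric]
    if value >= fail:
        return "fail"
    if value >= warn:
        return "warn"
    return "ok"
```

A notification would then be generated whenever the returned state is "warn" or "fail".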
A Useful Dashboard
Provide sensible default visualizations of both nominal and ordinal data.
Queries over the metrics should be supported, and interactive plots should be generated for all historical data, all in the browser.
As former networking researchers, we felt it imperative that all visualizations, plots and figures yield insights as efficiently as possible. One should be able to drill down in the browser to understand what's going on, and it should look beautiful, be responsive, and take advantage of modern web technologies.
Provide sensible default alert and notification criteria for the collected metrics.
Support not only email alerts, but also modern team chat software like Slack or RocketChat.
Confidentiality and Authenticity: All data must be authenticated and encrypted by default.
Availability and Robustness: There should be no silent failures as long as any system is running. The monitoring solution should monitor itself, and components should fail gracefully, never silently. Out-of-band notifications should be supported out of the box.
Isolation: The code executed by each system under monitoring is immutable to the monitoring system; the monitoring system cannot execute arbitrary code on the systems under monitoring. Only the master collection system exposes a listening TCP socket to the network.
Easy-to-reason-about failure modes: When a metric fails to be reported to the master within a regular interval, it is considered failed.
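The failure rule above amounts to a staleness check on the master: any metric whose last report is older than the expected interval is marked failed. A minimal sketch follows, assuming the master keeps a mapping from metric name to the timestamp of its last report; the names `stale_metrics` and `last_seen` are hypothetical.

```python
import time


def stale_metrics(last_seen, interval_s, now=None):
    """Return the names of metrics not reported within interval_s seconds.

    last_seen maps a metric name to the UNIX timestamp of its most
    recent report. now defaults to the current time and can be passed
    explicitly to make the check deterministic.
    """
    if now is None:
        now = time.time()
    return sorted(
        name for name, ts in last_seen.items() if now - ts > interval_s
    )


# Hypothetical example with a 30-second interval, evaluated at t = 120.
failed = stale_metrics({"cpu": 100.0, "disk": 40.0}, interval_s=30, now=120.0)
```

Because the rule depends only on the absence of a report, it covers a dead process, a lost network link, and a crashed host in the same way.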
Simplicity To Get Started
It should not be necessary to spend a day reading documentation before you have a working setup.
We wanted a simple yet meaningfully illustrative running example of a monitoring solution, which we could then study and adapt to our specific needs.
The problem that we and others experienced is that these goals are not simple to achieve. So we created Monitor in a Box to bring us closer to this ideal solution.