I built metr, a handy tool for writing Nagios/Mackerel/Consul check commands based on server and process metrics

I wanted to add a small health check to Consul, but when I tried to write a condition such as "iowait is high and user is low", I got stuck ("Ugh... how am I supposed to write that?"), so I built this.

What is this?

metr is a command intended for the following uses:

Embedding conditions based on host or process metric values in shell scripts
Using it as a Nagios plugin
Using it as a check command for Mackerel check plugins

Usage

Besides Homebrew, deb/rpm packages are provided for installation. In most cases you will want to install it on the server as a command.

$ dpkg -i metr_0.5.1-1_amd64.deb

`metr list`

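You can see the available metrics with metr list. If you specify a process PID with -p (--pid), the metrics of that process are included as well.

The output below was produced on a Mac; on Linux a few more metrics are available.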

$ metr list
cpu (now:33.084577 %): Percentage of cpu used.
mem (now:66.468358 %): Percentage of RAM used.
swap (now:875823104 bytes): Amount of memory that has been swapped out to disk (bytes).
user (now:18.610422 %): Percentage of CPU utilization that occurred while executing at the user level.
system (now:14.143921 %): Percentage of CPU utilization that occurred while executing at the system level.
idle (now:67.245658 %): Percentage of time that CPUs were idle and the system did not have an outstanding disk I/O request.
nice (now:0.000000 %): Percentage of CPU utilization that occurred while executing at the user level with nice priority.
load1 (now:3.640000 ): Load avarage for 1 minute.
load5 (now:4.210000 ): Load avarage for 5 minutes.
load15 (now:4.600000 ): Load avarage for 15 minutes.
numcpu (now:8 ): Number of logical CPUs.
(metric measurement interval: 500 ms)
$ metr list -p `pgrep -n docker`
proc_cpu (now:1.820857 %): Percentage of the CPU time the process uses.
proc_mem (now:1.264739 %): Percentage of the total RAM the process uses.
proc_rss (now:217280512 bytes): Non-swapped physical memory the process uses (bytes).
proc_vms (now:7010299904 bytes): Amount of virtual memory the process uses (bytes).
proc_swap (now:0 bytes): Amount of memory that has been swapped out to disk the process uses (bytes).
proc_connections (now:0 ): Amount of connections(TCP, UDP or UNIX) the process uses.
cpu (now:22.000000 %): Percentage of cpu used.
mem (now:59.768772 %): Percentage of RAM used.
swap (now:781451264 bytes): Amount of memory that has been swapped out to disk (bytes).
user (now:14.925373 %): Percentage of CPU utilization that occurred while executing at the user level.
system (now:6.467662 %): Percentage of CPU utilization that occurred while executing at the system level.
idle (now:78.606965 %): Percentage of time that CPUs were idle and the system did not have an outstanding disk I/O request.
nice (now:0.000000 %): Percentage of CPU utilization that occurred while executing at the user level with nice priority.
load1 (now:1.360000 ): Load avarage for 1 minute.
load5 (now:1.610000 ): Load avarage for 5 minutes.
load15 (now:1.490000 ): Load avarage for 15 minutes.
numcpu (now:8 ): Number of logical CPUs.
(metric measurement interval: 500 ms)

`metr cond` ( alias: `metr test` )

metr cond is useful when you want to use metrics as a condition, for example in a shell script.

Like the test command, metr cond exits with status 0 when the condition matches and with status 1 when it does not.

$ metr cond 'cpu < 20 and mem < 50' || somecommand

In the example above, somecommand runs when the host's CPU usage is 20% or higher, or the host's memory usage is 50% or higher.
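Because metr cond only communicates through its exit status, it can also drive an ordinary if statement. The following is a minimal sketch rather than anything shipped with metr: the function name run_when_quiet and the thresholds are hypothetical, and only the metr cond invocation itself comes from the example above.

```shell
# Run the given command only when the host is NOT already busy.
# "run_when_quiet" and the thresholds are illustrative assumptions.
run_when_quiet() {
  if metr cond 'cpu < 20 and mem < 50'; then
    # Condition matched (exit status 0): the host is quiet enough.
    "$@"
  else
    # Condition did not match (exit status 1): skip the command.
    echo "host is busy; skipped: $*" >&2
    return 1
  fi
}
```

For example, run_when_quiet ./heavy-batch.sh would start the (hypothetical) batch script only while CPU usage is below 20% and memory usage is below 50%.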

Incidentally, the available operators include

+, -, *, /, ==, !=, <, >, <=, >=, not, and, or, !, &&, and ||, among others.
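As a sketch of how these operators could express the condition that motivated the tool ("iowait is high and user is low"), something like the following might work. Note the assumptions: iowait is assumed to be one of the additional metric names available on Linux (it does not appear in the Mac output above, so confirm it with metr list), and the function name and default thresholds are hypothetical.

```shell
# Exit status 0 when the host looks I/O-bound: high iowait while user CPU stays low.
# ASSUMPTION: "iowait" is a metric name on Linux; check `metr list` first.
# The function name and default thresholds are illustrative only.
is_io_bound() {
  iowait_min=${1:-30}   # iowait must exceed this percentage
  user_max=${2:-10}     # user CPU must stay below this percentage
  metr cond "iowait > ${iowait_min} and user < ${user_max}"
}
```

A call such as is_io_bound 40 5 || echo "not I/O-bound" would then read naturally inside a larger script.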

`metr check`

metr check follows almost the same exit status convention as Nagios plugins and Mackerel check monitoring.

Exit status	Meaning
0	OK
1	WARNING
2	CRITICAL
3	UNKNOWN

$ metr check -w 'cpu > 10 or mem > 50' -c 'cpu > 50 and mem > 90'
METR WARNING: w(cpu > 10 or mem > 50) c(cpu > 50 and mem > 90)

In the example above, the result is WARNING (exit status 1) when the host's CPU usage exceeds 10% or its memory usage exceeds 50%, and CRITICAL (exit status 2) when CPU usage exceeds 50% and memory usage exceeds 90%.
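When metr check is called from a wrapper script instead of directly from a monitoring agent, the exit status table above can be handled roughly as follows. This is a sketch under assumptions: status_label and run_check are hypothetical helper names, and the thresholds are simply the ones from the example.

```shell
# Translate a Nagios-style exit status into a label (per the table above).
status_label() {
  case "$1" in
    0) echo OK ;;
    1) echo WARNING ;;
    2) echo CRITICAL ;;
    *) echo UNKNOWN ;;
  esac
}

# Run the check, log the result, and pass the exit status through.
# "run_check" is an illustrative name; the thresholds mirror the example.
run_check() {
  status=0
  metr check -w 'cpu > 10 or mem > 50' -c 'cpu > 50 and mem > 90' >/dev/null || status=$?
  echo "metr check finished: $(status_label "$status") (exit $status)"
  return "$status"
}
```

Capturing the status with || status=$? keeps the script working even when it runs under set -e.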

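In short

You can easily build check commands that use metrics. Give it a try when you want to save yourself some work.

What follows is a digression.

What the first line of vmstat represents (the `--interval` option of `metr`)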

According to the answer to the question "How should vmstat output be interpreted?" on the Red Hat Customer Portal:

The first line of this report contains the average values since the computer was last rebooted.

Let's look at the source code of vmstat in procps-ng. The values are obtained by the getstat function, which reads them from /proc/stat.

https://gitlab.com/procps-ng/procps/blob/master/proc/sysinfo.c#L532

It then prints one line before entering the main loop.

https://gitlab.com/procps-ng/procps/blob/master/vmstat.c#L306-361

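After that, inside the main loop, it uses the difference between the values read before and after the sleep.

In other words, the first line and the subsequent lines have different characteristics, and in most cases the values you want are those from the second line onward.

metr lets you specify that time interval (implemented with sleep) in milliseconds with -i (--interval); the default is 500 ms.

With vmstat, too, running vmstat 1 2 | tail -1 gives you values measured as the difference over a 1-second interval.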
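To make the "difference before and after the sleep" idea concrete, here is a small sketch that computes CPU usage from two samples of the aggregate cpu line in /proc/stat. It is not taken from vmstat or metr; the function name cpu_busy_percent is hypothetical, and treating idle plus iowait as "not busy" is an assumption of this sketch.

```shell
# Print the busy CPU percentage between two "cpu ..." lines from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
# ASSUMPTION: idle + iowait counts as "not busy"; guest fields are ignored
# because the kernel already includes them in user/nice.
cpu_busy_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= NF && i <= 9; i++) total += $i   # user through steal
      idle = $5 + $6                                     # idle + iowait
      if (NR == 1) { t0 = total; i0 = idle }
      else         { t1 = total; i1 = idle }
    }
    END {
      dt = t1 - t0
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
    }'
}

# Usage on Linux: take two samples 500 ms apart and compare them.
# before=$(grep '^cpu ' /proc/stat); sleep 0.5; after=$(grep '^cpu ' /proc/stat)
# cpu_busy_percent "$before" "$after"
```

Passing a single sample against all zeros would instead give the average since boot, which corresponds to the first line of vmstat.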

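With metr, getting these values is a bit easier, and it is also easier to build conditions from multiple metrics.

Incidentally, mackerel-agent also appears to measure using the difference between consecutive samples.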

https://github.com/mackerelio/mackerel-agent/blob/c9db8018714023fef9c8ba7eac20471e8a1b297a/metrics/linux/cpuusage.go#L44-L70

Embedded as the core library of Sheer Heart Attack

Before this, I had built a debugging tool called Sheer Heart Attack.

k1low.hatenablog.com

Both tools do the same thing when it comes to collecting metrics, so this time I embedded metr into Sheer Heart Attack as its core library and unified the implementation.

[BREAKING] Use github.com/k1LoW/metr by k1LoW · Pull Request #21 · k1LoW/sheer-heart-attack · GitHub

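The idea is that as metr is extended, Sheer Heart Attack automatically becomes more powerful as well.

How to pronounce metr

As usual, I just shortened "metrics", but I ended up cutting it off at a spot that makes it rather hard to read...

I call it "metah" or "metro". Well, I doubt anyone will ever need to say it out loud, so it should be fine...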
