1. Introduction
I wrote reed-alert five years ago and have been using it since its first days. Here is some feedback about it.
reed-alert is meant for system administrators who want to monitor their infrastructure and get alerts when things go wrong. I have gained a lot more experience in the monitoring field over time, and I want to share some thoughts about this project.
The software name is a pun I found in a Star Trek Enterprise episode.
2.2. Project finished
The code hasn't received many commits over the last few years. I consider the program feature-complete, although new probes could be added and bugs could be fixed. The core of the software itself is perfect to me.
Probes are small pieces of code that monitor a given state, such as an HTTP return code, a host answering ping, or a running service. It's already easy to extend reed-alert with a custom probe: any shell command will do, as the check is decided by whether it returns 0 or not.
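As a minimal sketch of such a custom probe, the fragment below uses the `command` probe that appears in my configuration later in this article. The directory path and the description are hypothetical, and `mail` is assumed to be an alert already defined earlier in the configuration file.

```lisp
;; Hypothetical custom check built on the `command` probe.
;; The shell command returns 0 when the directory exists, non-zero otherwise.
;; `mail` is assumed to be an alert defined earlier in the configuration.
(=> mail command
    :command "test -d /var/backups/latest"   ; hypothetical path
    :desc "Latest backup directory is missing")
```

Anything that can be expressed as a shell exit status can be monitored this way, without writing a new probe in Common Lisp.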
2.3. Reliability
I don't remember having a single issue with reed-alert since I set it up on my server. It's run by a cron job every 10 minutes, which means a Common Lisp interpreter loads the code, evaluates the configuration file, runs the check commands (and the alert commands if required), then stops. I chose a serviceless paradigm for reed-alert as it makes the code and usage a lot simpler. A long-running service could fail, leak memory, be exploited, and certainly suffer from many other bugs I can't think of.
Reed-alert is simple as it only needs a Common Lisp interpreter. The most notable ones, sbcl and ecl, are absolutely reliable and change very little over time. Some standard unix commands are required for some checks or default alerts, such as ping, service, mail or curl, but this defers all the work to well-established binaries.
The source code is minimal, with 179 lines for the reed-alert core and 159 lines for the probes, a total of 338 lines of code (including empty lines and comments). Hacking on reed-alert is super easy and always a lot of fun for me. For whatever reason, my Common Lisp software often works on the first try when I add new features, so it's always pleasant to work on it.
2.4. Awesome features
One aspect of reed-alert that may disturb users at first is the choice of Common Lisp code as the configuration format. It may look complicated, but a simple configuration doesn't require more Common Lisp knowledge than what is explained in the reed-alert documentation. It shows its full power when you need to loop over a data set to run checks, which makes the configuration dynamic instead of handwritten.
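To illustrate the looping idea, here is a small sketch that generates the three disk usage checks from my real configuration out of a list of (path limit) pairs. The `disk-usage` probe and its `:path`, `:limit` and `:desc` parameters come from that configuration. The list layout is only an illustration, and `mail` is again assumed to be an alert defined earlier in the file.

```lisp
;; Sketch: one check per (path limit) pair instead of three handwritten lines.
;; `mail` is assumed to be an alert defined earlier in the configuration.
(loop for (path limit) in '(("/" 60) ("/var" 70) ("/home" 95))
      do (=> mail disk-usage
             :path path
             :limit limit
             :desc (format nil "partition ~a" path)))
```

Adding a partition to monitor then means adding one pair to the list.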
Using Common Lisp as configuration has other advantages: it's possible to chain checks so that some of them are skipped when an earlier condition fails. Here are a few examples:
- if you monitor a web server, you first want to check if it replies on ICMP before trying to check and report errors on HTTP level
- if you monitor remote servers, you first want to check if you can reach the internet and that your local gateway is online
- if you check a local web server, it would be a good idea to check if all the required services are running first
All of these conditions can be expressed in reed-alert thanks to the code-as-configuration choice.
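The first and third cases above could be sketched as follows, relying on `and` to stop at the first failing check, the same way my own configuration chains checks. The host name, the URLs and the `httpd` service name are placeholders, and `mail` is assumed to be an alert defined earlier in the configuration.

```lisp
;; Remote web server: only test HTTP if the host answers ping.
;; "www.example.com" is a placeholder host.
(and (=> mail ping :host "www.example.com" :desc "Web server ping")
     (=> mail curl-http-status :url "https://www.example.com"
         :desc "Web server HTTP" :timeout 10))

;; Local web server: check that the service is running before testing HTTP.
;; "httpd" is a placeholder service name.
(and (=> mail service :name "httpd")
     (=> mail curl-http-status :url "http://localhost"
         :desc "Local web server HTTP" :timeout 10))
```

When the first check of a group fails, only that failure is reported, which avoids a flood of alerts that all share the same root cause.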
2.5. Scalability
I've been asked a few times if reed-alert could be used in a professional context. My answer is that it depends on what you call a professional environment.
Reed-alert is dumb: it needs to be run by a scheduler (such as cron) and runs the checks sequentially. It doesn't guarantee precise timing between checks.
If you need multiple machines to run a set of checks, reed-alert can't share state between them, so it can't keep working reliably in a high-availability environment.
With regard to resource usage, while reed-alert is small, it needs to start the Common Lisp interpreter on every run. If you want to run reed-alert every minute or multiple times per minute, I'd recommend using something else.
3. A real life example
Here is a chunk of the configuration I've been running for years. It checks the system itself and some remote servers.
    (=> mail disk-usage :path "/"     :limit 60 :desc "partition /")
    (=> mail disk-usage :path "/var"  :limit 70 :desc "partition /var")
    (=> mail disk-usage :path "/home" :limit 95 :desc "partition /home")
    (=> mail service :name "dovecot")
    (=> mail service :name "spamd")
    (=> mail service :name "dkimproxy_out")
    (=> mail service :name "smtpd")
    (=> mail service :name "ntpd")

    (=> mail number-of-processes :limit 140)

    ;; check dataswamp server is working
    (=> mail ping :host "dataswamp.org" :desc "Dataswamp")

    ;; check webzine related web servers
    (and
     (=> mail ping :host "openports.pl" :desc "Liaison Grifon.fr")
     (=> mail curl-http-status :url "https://webzine.puffy.cafe" :desc "Webzine Puffy.cafe" :timeout 10)
     (=> mail curl-http-status :url "https://puffy.cafe" :desc "Puffy.cafe" :timeout 10)
     (=> mail ssl-expiration :host "webzine.puffy.cafe" :seconds (* 7 24 60 60))
     (=> mail ssl-expiration :host "puffy.cafe" :seconds (* 7 24 60 60)))

    ;; check openports.pl is working
    (and
     (=> mail ping :host "220.127.116.11" :desc "Openports.pl ping")
     (=> mail curl-http-status :url "http://18.104.22.168" :desc "Packages OpenBSD http" :timeout 10))

    ;; check www.openbsd.org website is replying under 10 seconds
    (=> mail curl-http-status :url "https://www.openbsd.org" :desc "OpenBSD.org" :timeout 10)

    ;; check if a XML file is created regularly and valid
    (=> mail file-updated :path "/var/www/htdocs/solene/openbsd-current.xml" :limit 1440)
    (=> mail command :command (format nil "xmllint /var/www/htdocs/solene/openbsd-current.xml")
        :desc "XML openbsd-current.xml is not valid")

    ;; monitoring multiple gopher servers
    (loop for host in '("grifon.fr" "dataswamp.org" "gopherproject.org")
          do
          (=> mail command
              :try 6
              :command (format nil "echo '/is-alive?done-by-solene-at-libera' | nc -w 3 ~a 70" host)
              :desc (concatenate 'string "Gopher " host)))

    (quit)
4. Conclusion
I wrote a simple piece of software using an old programming language (ANSI Common Lisp dates from 1994). The result is reliable over time, requires no code maintenance and is fun to hack on.