| Commit message | Author | Age |
| |
as this was the last host in postgres13, clean up that hostgroup
| |
It moved to /usr/bin in bookworm.
| |
this was only used to check for the existence of the postgrey process,
but that service's arguments changed in bookworm.
due to the low value of this check and the impending migration to
something else, simply drop it altogether
| |
This is useless. Right now, this is warning us that amavis is failing
on rude, but that's only because it was upgraded and the process
signature changed.
Meanwhile: who *cares* if amavis is running unexpectedly on another
host? What are we worried about here, a hostile... virus checker?
I'm tempted to remove *all* such checks: we have a systemd check that
is better suited for this kind of check, but one hurt at a time.
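As an illustration of why a systemd-based check suits this better than matching process signatures, here is a minimal sketch of a Nagios-style wrapper around the list of failed units. The helper name `report_failed` is hypothetical and this is not the systemd check referred to above; it only assumes the standard Nagios exit codes (0 for OK, 2 for CRITICAL).

```shell
#!/bin/sh
# Hypothetical helper: report_failed takes a space-separated list of failed
# unit names (possibly empty) and prints a Nagios-style status line.
report_failed() {
    if [ -z "$1" ]; then
        echo "OK: no failed units"
        return 0    # Nagios OK
    fi
    echo "CRITICAL: failed units: $1"
    return 2        # Nagios CRITICAL
}

# Usage sketch (assumes systemctl is available on the monitored host):
# report_failed "$(systemctl list-units --state=failed --no-legend --plain | awk '{print $1}' | xargs)"
```

Because it asks systemd for unit state, a check like this does not depend on the exact arguments of the running process, which is what broke after the upgrade.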
| |
this was trying to ignore /srv, which wasn't mounted, so findmnt printed
nothing, the last -i was left without an argument, and the check was
failing with:

    root@tb-build-02:~# /usr/lib/nagios/plugins/check_disk -X devpts -X proc -X linprocfs -X devfs -X fdescfs -X tracefs -X sysfs -X nfs -X overlay -w 10% -c 5% -A -i '^.*/docker/.*$' -i $(findmnt -o SOURCE -n /srv)
    /usr/lib/nagios/plugins/check_disk: option requires an argument -- 'i'
    Unknown argument
    Usage:
     check_disk {-w absolute_limit |-w percentage_limit% | -W inode_percentage_limit } {-c absolute_limit|-c percentage_limit% | -K inode_percentage_limit } {-p path | -x device}
    [-C] [-E] [-e] [-f] [-g group ] [-k] [-l] [-M] [-m] [-R path ] [-r path ]
    [-t timeout] [-u unit] [-v] [-X type] [-N type]
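The failure above comes from the command substitution expanding to nothing. One way to guard against that, as a minimal sketch, is to emit the `-i` flag only when a source was actually found. The helper name `ignore_flag` is hypothetical and not part of the original configuration.

```shell
#!/bin/sh
# Hypothetical helper: print "-i <source>" only when a source is given, so
# check_disk never sees a bare -i with no argument.
ignore_flag() {
    if [ -n "$1" ]; then
        printf '%s %s' '-i' "$1"
    fi
}

# Usage sketch (assumes findmnt and check_disk as in the failing command):
# src=$(findmnt -o SOURCE -n /srv 2>/dev/null || true)
# /usr/lib/nagios/plugins/check_disk -w 10% -c 5% -A $(ignore_flag "$src")
```

The `$(ignore_flag "$src")` expansion is left unquoted on purpose so that it splits into two words when present and disappears entirely when empty.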
| |
the build script doesn't like empty host groups:

    remote: ./build-nagios
    remote: ./build-nagios:365:in `throw': uncaught throw "no hosts for service process - postgresql11 - master" (UncaughtThrowError)
    remote: from ./build-nagios:365:in `block in <main>'
    remote: from ./build-nagios:338:in `each'
    remote: from ./build-nagios:338:in `<main>'
    remote: make: *** [Makefile:5: generated/nrpe_tor.cfg] Error 1
    remote: run-parts: /srv/git.torproject.org/git-helpers/post-receive-per-repo.d/admin%tor-nagios/trigger-nagios-build exited with return code 2
| |
also moved puppetdb to its own machine, so no more pgsql
| |
The IP of -03 was incorrect, and -02 was missing.
| |
This was cargo-culted from the Apache config originally, and I do not
think the checks are warranted. They have never, as far as I can
remember, yielded correct alerts and are currently yielding a false
positive as forum-01 has two nginx servers running simultaneously
because of the container configuration.
| |
it's running inside a container (tpo/tpa/team#41290)
| |
i wonder how that compiled in the first place
| |
This check is almost useless: it checks for the presence of the master
process. We *did* get psql crashes in the past, so it's worthwhile to
monitor it, but for this specific case it fails miserably because the
database server is hidden inside a container.
Let's assume the container gizmo does its thing correctly and not
monitor it at all. Again, a case of monitoring for a possible cause
instead of symptoms.
| |
Those checks have *always* returned false positives. In my decades of
experience with it, I have actually never seen postfix crash
completely, so those checks are basically useless in the first
place. Even if postfix *would* crash, systemd would likely notice and
flag the service as failed. And even if *that* would fail, we should
have *other* systems for checking mail services health like end-to-end
deliverability checks and so on.
This fixes one false positive which is a postfix process running
inside a container for the Discourse service. We've also had false
positives with various causes, for example:
    CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.

was: the host is rebooting.

    connect to address 64.18.183.94 port 5666: Connection refused

... was: the host is *booting*.

    PROCS CRITICAL: 0 processes with UID = 0 (root), command name master, args /usr/lib/postfix/sbin/master

This was an actual case of the master process missing, but was
probably due to a race condition on reboot, unclear.
In any case, this is noisy and hard to disable for individual hosts, so
just make it go away.
| |
This reverts commit 0a859296f3bc0d54895666401cc75d0b0a6c4838.
Nagios complains of an unwanted https service without this check. I
think the best fix is to leave this check enabled, and just give apache
a default https vhost to serve.
| |
donate-review only serves review apps over https, so there isn't an
https vhost running on the main donate-review-01.tpo domain
| |
It doesn't seem to be properly detecting the nginx service.
| |
This was forgotten in tpo/tpa/team#33750.