External watchdog best practices

philchillbill
Posts: 400
Joined: Monday 12 September 2016 13:47
Target OS: Linux
Domoticz version: beta
Location: Eindhoven. NL

External watchdog best practices

Post by philchillbill »

The NUC running Ubuntu that hosts my Domoticz system sometimes freezes and needs a reboot. I'm not sure Domoticz is the culprit: sometimes the machine keeps going and the Domoticz service shows as running, but Domoticz is unresponsive. From time to time the whole machine freezes, and the syslog shows nothing of use, so it's not easy to debug the root cause.

I can connect a small MOSFET to the reset header on the NUC so it can be rebooted from a GPIO pin on an external Pi. My idea is to have the Pi poll Domoticz on the NUC every 15 minutes or so and ask for some big JSON, e.g. my entire devices list, timers list or Z-Wave config. If I get a response and it's parseable, I'll assume the NUC is fine. If there's no response for whatever reason, I'll reboot the NUC via the MOSFET; the machine will be back in 60 seconds, so no big deal.
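A minimal Python sketch of the polling logic described above, as it might run on the Pi. The URL, the two-consecutive-failures threshold and the `pulse_reset` callback are assumptions for illustration, not something taken from a working setup; the callback stands in for whatever code drives the GPIO pin wired to the MOSFET.

```python
import http.client
import json
import time
import urllib.request

# Assumed address and query string; adjust to the real host and Domoticz API version.
DOMOTICZ_URL = "http://192.168.1.10:8080/json.htm?type=devices&filter=all&used=true"


def fetch(url, timeout=30):
    """Return the raw response body, or None on any network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (OSError, ValueError, http.client.HTTPException):
        return None


def is_healthy(body):
    """Healthy means: a non-empty body that parses as a JSON object."""
    if not body:
        return False
    try:
        data = json.loads(body)
    except ValueError:  # covers JSON and text-decoding errors
        return False
    return isinstance(data, dict)


class Watchdog:
    """Counts consecutive failed polls (the threshold is an assumption)."""

    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, healthy):
        """Return True when the caller should pulse the reset line."""
        if healthy:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.failures = 0  # start counting again after a reset
            return True
        return False


def run(pulse_reset, poll_seconds=15 * 60):
    """Poll forever; pulse_reset is a hypothetical callable that toggles the GPIO pin."""
    watchdog = Watchdog()
    while True:
        if watchdog.record(is_healthy(fetch(DOMOTICZ_URL))):
            pulse_reset()
        time.sleep(poll_seconds)
```

Requiring more than one failed poll before resetting is a design choice added here: it avoids hard-resetting the NUC over a single dropped request, at the cost of a longer outage before recovery.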

Just wondering whether anybody else has done something like this and how it worked in practice? A Pi Zero costs EUR 6 and a MOSFET about EUR 1, so it's a small investment from a HW perspective if it keeps me up and running. Murphy's law of course means the freezes almost always happen when I'm away for a few days ;)
Alexa skills author: EvoControl, Statereport, MediaServer, LMS-lite
ben53252642
Posts: 543
Joined: Saturday 02 July 2016 5:17
Target OS: Linux
Domoticz version: Beta

Re: External watchdog best practices

Post by ben53252642 »

I don't think this is generally a good solution. You need to find out why the NUC is freezing:

1) Is it faulty RAM?
2) Motherboard fault?
3) Power supply?
4) Storage failing?
5) Memory leak?
6) CPU maxed out? (Using nice values to keep things under control might help here.)
7) Is the NUC connected to a UPS? If not, a power fluctuation could be causing it to crash.

If you simply reboot it each time, eventually you are going to end up with a corrupt file system.

You should also run a software watchdog such as Monit, enable the hardware watchdog on the NUC, and set it up in systemd.
Unless otherwise stated, all my code is released under GPL 3 license: https://www.gnu.org/licenses/gpl-3.0.en.html
philchillbill
Posts: 400
Joined: Monday 12 September 2016 13:47
Target OS: Linux
Domoticz version: beta
Location: Eindhoven. NL

Re: External watchdog best practices

Post by philchillbill »

Thanks, but I don't suspect the hardware: that same NUC never misbehaves when dual-booted into Win 10. The instability also only shows up from time to time, for a few Domoticz beta updates, and then magically goes away again for tens of betas. I therefore blame Domoticz. Since most people run it on Pi hardware under Raspbian, I'm out on the fringe using Ubuntu and have to live with problems that not many others are complaining about. The simplest solution is therefore to reboot the NUC and stay operational for the 3 times a year it crashes.

My real question is whether polling Domoticz from a remote machine and asking for big JSON is the smartest way to do this?

My belief (rightly or wrongly) is that *any* type of local watchdog running on the machine itself can fall victim to the same freeze and then fail to reboot the machine.
bbqkees
Posts: 407
Joined: Sunday 17 August 2014 21:01
Target OS: Linux
Domoticz version: 4.1x
Location: The Netherlands

Re: External watchdog best practices

Post by bbqkees »

Your proposal with an additional Pi is of course not really a solution but a nasty workaround.
You should consider this:
If your NUC has an Intel i5/i7 processor, you can easily run an ESXi hypervisor.
Then you can install virtual machines for your Win10 and (several) Linux instances.

Setting it up requires a bit of work, but when it runs, ESXi will likely run forever.

So you could run Ubuntu with Domoticz in one virtual machine, parallel to other virtual machines.
This also gives you a perfect testbed for figuring out what is wrong in your system.
Even if you don't use it for that, you can have another virtual machine check the faulty Ubuntu machine and act if it locks up.
Bosch / Nefit / Buderus / Junkers / Worcester / Sieger EMS bus Wi-Fi MQTT Gateway and interface boards: https://bbqkees-electronics.nl/
philchillbill
Posts: 400
Joined: Monday 12 September 2016 13:47
Target OS: Linux
Domoticz version: beta
Location: Eindhoven. NL

Re: External watchdog best practices

Post by philchillbill »

Problem is, it's a Celeron NUC so no hypervisor. Nice idea though.

EDIT: while I still think it's nice to auto-restart a dead system, I found the culprit that was killing the NUC to most likely be Softsqueeze. Apparently, it sends digital-silence to the speakers while the player is stopped and a bug causes that to crash after a while. Switching off the associated Squeezebox from the LMS interface stopped the original problem.

In the meantime, I have the secondary Pi sending me a Prowl notification if Domoticz fails to return more than 300KB worth of JSON when I ask for my entire devices list - the proof of concept seems to work well.
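For anyone wanting to try the same size check, a rough Python sketch follows. It is an illustration, not the script actually running on the Pi: `fetch` and `notify` are hypothetical callables supplied by the caller (one returns the raw reply bytes or None, the other sends the Prowl message), and reading 300KB as 300,000 bytes is an assumption.

```python
# Assumed threshold: "300KB" taken as 300,000 bytes.
MIN_BYTES = 300_000


def check_reply(body, min_bytes=MIN_BYTES):
    """Return None if the reply is big enough, else a short reason string."""
    if body is None:
        return "no response"
    if len(body) < min_bytes:
        return f"reply too short: {len(body)} bytes"
    return None


def poll_once(fetch, notify, min_bytes=MIN_BYTES):
    """Fetch one reply and send a notification if it looks wrong.

    fetch  -- hypothetical callable returning the raw reply bytes, or None
    notify -- hypothetical callable taking one message string
    """
    problem = check_reply(fetch(), min_bytes)
    if problem is not None:
        notify(f"Domoticz watchdog: {problem}")
    return problem
```

A size threshold is a cheap way to catch truncated or error replies, but it needs revisiting if the devices list shrinks below the limit.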

