OpenWrt/LEDE Project

  • Status Unconfirmed
  • Percent Complete
    0%
  • Task Type Bug Report
  • Category Base system
  • Assigned To No-one
  • Operating System All
  • Severity Medium
  • Priority Very Low
  • Reported Version Trunk
  • Due in Version Undecided
  • Due Date Undecided
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Hal Martin - 07.09.2019

FS#2483 - procd: watchdog timeout is too high for sun8i

Device: Orange Pi M2 Ultra
Version: trunk
Steps to reproduce:
- build trunk with the DTS for Orange Pi M2 Ultra
- boot
- wait 10 seconds
- watchdog will reset the SoC

Problem:
- the default wdt_drv_timeout of procd is 30 seconds and cannot be changed in menuconfig (https://git.openwrt.org/?p=project/procd.git;a=blob_plain;f=watchdog.c;hb=refs/heads/master)
- the maximum timeout of the sun8i watchdog is 16 seconds (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/watchdog/sunxi_wdt.c?h=v4.19.71)
- wdt_frequency is not low enough to prevent the sun8i watchdog from resetting the SoC

I was able to mitigate the watchdog reset with the following patch to procd:

--- a/watchdog.c	2019-09-06 06:13:52.293677306 +0000
+++ b/watchdog.c	2019-09-06 06:14:46.716987215 +0000
@@ -30,13 +30,13 @@
 
 static struct uloop_timeout wdt_timeout;
 static int wdt_fd = -1;
-static int wdt_drv_timeout = 30;
-static int wdt_frequency = 5;
+static int wdt_drv_timeout = 5;
+static int wdt_frequency = 1;
 static bool wdt_magicclose = false;
 
 void watchdog_ping(void)
 {
-	DEBUG(4, "Ping\n");
+	DEBUG(4, "Ping watchdog\n");
 	if (wdt_fd >= 0 && write(wdt_fd, "X", 1) < 0)
 		ERROR("WDT failed to write: %m\n");
 }
@@ -170,6 +170,7 @@
 	watchdog_timeout_cb(&wdt_timeout);
 
 	DEBUG(4, "Opened watchdog with timeout %ds\n", watchdog_timeout(0));
+	watchdog_ping();
 }


I’m sorry but I don’t know enough C to provide a patch that includes a configurable watchdog timeout in the main menuconfig for procd.

Admin
Petr Štetiar commented on 10.09.2019 15:07

Can you please attach (unedited) dmesg log?

- the default wdt_drv_timeout of procd is 30 seconds

indeed, this could be improved

- wdt_frequency is not low enough to prevent the sun8i watchdog from resetting the SoC

your reasoning doesn't make much sense to me, as the watchdog should use default max. timeout of 16 seconds but it should be pinged every 5 seconds by procd, thus have no reason for reset

Hal Martin commented on 10.09.2019 15:46

Can you please attach (unedited) dmesg log?

It's hard for me to attach a dmesg log, since the board takes about 7 seconds to reach the uart console and the watchdog reboots it after 10 seconds.

I've attached the full boot log as captured from the serial console. I booted with procd in init_debug=4

your reasoning doesn't make much sense to me, as the watchdog should use default max. timeout of 16 seconds but it should be pinged every 5 seconds by procd, thus have no reason for reset

I agree, from the code it appears that procd should ping the watchdog every 5 seconds and since the sun8i watchdog timeout is 16 seconds, this shouldn't cause the board to reset.

However this is not the experience I've had. If I boot with init=/bin/sh the board will not reset. As soon as the unpatched procd starts though, there's only a few seconds before the watchdog resets the SoC.

Since I have only a few seconds of console access via uart, it's very difficult to debug the root cause.

Apologies, I can't seem to delete the uart log file I've attached. The first boot is my experimenting with init=/bin/sh and the following boot is with procd.

Admin
Petr Štetiar commented on 10.09.2019 18:25

So from your dmesg it looks fine:

[ 0.544007] sunxi-wdt 1c20c90.watchdog: Watchdog enabled (timeout=16 sec, nowayout=0)
> ...
> [ 2.398234] init: - watchdog -
> [ 2.401334] init: Ping
> [ 2.403774] init: Opened watchdog with timeout 30s
> ...
> [ 6.264823] procd: Watchdog handover: fd=3
> [ 6.269030] procd: - watchdog -
> [ 6.272210] procd: Ping
> [ 6.274725] procd: Opened watchdog with timeout 30s

I don't see a reason why it should reboot after ~10s. Can you please check the return value of the ioctl, something like this? Disabling the ioctl completely would be worth trying as well.

diff --git a/watchdog.c b/watchdog.c
index 9d770b4470a9..290432dd08c4 100644
--- a/watchdog.c
+++ b/watchdog.c
@@ -87,10 +87,16 @@ static void watchdog_close(void)
 
 static int watchdog_set_drv_timeout(void)
 {
+       int r;
+
        if (wdt_fd < 0)
                return -1;
 
-       return ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+       r = ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+       if (r < 0)
+               ERROR("WDT failed to set timeout: %m\n");
+
+       return r;
 }
 
 void watchdog_set_magicclose(bool val)
Hal Martin commented on 13.09.2019 10:15

I tried applying the above patch, and indeed it appears the ioctl is failing:

[    2.505875] init: Console is alive
[    2.509530] init: - watchdog -
[    2.512638] init: WDT failed to set timeout: Invalid argument
[    2.518439] init: Ping
[    2.520882] init: Opened watchdog with timeout 30s

Full boot log attached.

Admin
Petr Štetiar commented on 13.09.2019 11:09
indeed it appears the ioctl is failing

Ok, good to know, but this shouldn't influence the 16s watchdog timeout. Anyway, can you try it without the ioctl altogether, something like this:

static int watchdog_set_drv_timeout(void)
{
        if (wdt_fd < 0)
                return -1;
 
        return 0;
}
Full boot log attached.

Do I understand that log correctly, that it reboots after ~4 seconds? This is really strange and make no sense.

Hal Martin commented on 13.09.2019 11:19

I think the issue is that procd is trying to set a timeout value above the maximum that sunxi-wdt supports, and thus the kernel is returning -EINVAL.

This is confirmed with the following patch:

--- a/watchdog.c	2019-09-05 21:12:50.000000000 +0000
+++ b/watchdog.c	2019-09-13 10:48:04.070425419 +0000
@@ -30,7 +30,7 @@
 
 static struct uloop_timeout wdt_timeout;
 static int wdt_fd = -1;
-static int wdt_drv_timeout = 30;
+int wdt_drv_timeout = 30;
 static int wdt_frequency = 5;
 static bool wdt_magicclose = false;
 
@@ -87,10 +87,19 @@
 
 static int watchdog_set_drv_timeout(void)
 {
+	int r;
 	if (wdt_fd < 0)
 		return -1;
 
-	return ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+	//return ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+	r = ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+	while (r < 0) {
+		ERROR("WDT failed to set timeout: %m\n");
+		wdt_drv_timeout--;
+		r = ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+	}
+	ERROR("Setting WDT timeout succeeded. Timeout: %d\n", wdt_drv_timeout);
+	return r;
 }
 
 void watchdog_set_magicclose(bool val)

I believe the kernel side is here: https://github.com/torvalds/linux/blob/master/drivers/watchdog/watchdog_core.c#L138-L139

You can see procd is receiving -EINVAL until wdt_drv_timeout is below the maximum supported by sunxi-wdt:

[    2.215693] Run /sbin/init as init process
[    2.493915] init: Console is alive
[    2.497578] init: - watchdog -
[    2.500686] init: WDT failed to set timeout: Invalid argument
[    2.506501] init: WDT failed to set timeout: Invalid argument
[    2.512281] init: WDT failed to set timeout: Invalid argument
[    2.518082] init: WDT failed to set timeout: Invalid argument
[    2.523861] init: WDT failed to set timeout: Invalid argument
[    2.529669] init: WDT failed to set timeout: Invalid argument
[    2.535467] init: WDT failed to set timeout: Invalid argument
[    2.541245] init: WDT failed to set timeout: Invalid argument
[    2.547041] init: WDT failed to set timeout: Invalid argument
[    2.552819] init: WDT failed to set timeout: Invalid argument
[    2.558615] init: WDT failed to set timeout: Invalid argument
[    2.564393] init: WDT failed to set timeout: Invalid argument
[    2.570188] init: WDT failed to set timeout: Invalid argument
[    2.575983] init: WDT failed to set timeout: Invalid argument
[    2.581764] init: Setting WDT timeout succeeded. Timeout: 16
[    2.587465] init: Ping
[    2.589902] init: Opened watchdog with timeout 16s

Unfortunately this doesn't completely resolve the issue for me, the watchdog still resets the SoC a few seconds later.

Hal Martin commented on 13.09.2019 11:23

Even the following patch isn't enough to prevent the watchdog from resetting the SoC:

--- a/watchdog.c	2019-09-13 11:04:30.725915275 +0000
+++ b/watchdog.c	2019-09-13 11:09:31.010271821 +0000
@@ -30,8 +30,8 @@
 
 static struct uloop_timeout wdt_timeout;
 static int wdt_fd = -1;
-static int wdt_drv_timeout = 30;
-static int wdt_frequency = 5;
+int wdt_drv_timeout = 30;
+int wdt_frequency = 5;
 static bool wdt_magicclose = false;
 
 void watchdog_ping(void)
@@ -87,10 +87,25 @@
 
 static int watchdog_set_drv_timeout(void)
 {
+	int r;
 	if (wdt_fd < 0)
 		return -1;
 
-	return ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+	//return ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+	r = ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+	while (r < 0) {
+		ERROR("WDT failed to set timeout: %m\n");
+		wdt_drv_timeout--;
+		r = ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+	}
+	ERROR("Setting WDT timeout succeeded. Timeout: %d\n", wdt_drv_timeout);
+	if ( wdt_drv_timeout/4 > 1 ) {
+		wdt_frequency = wdt_drv_timeout/4;
+	} else {
+		wdt_frequency = 1;
+	}
+	ERROR("Set ping frequency to: %d\n", wdt_frequency);
+	return r;
 }
 
 void watchdog_set_magicclose(bool val)
@@ -170,6 +185,7 @@
 	watchdog_timeout_cb(&wdt_timeout);
 
 	DEBUG(4, "Opened watchdog with timeout %ds\n", watchdog_timeout(0));
+	watchdog_ping();
 }
 

It is stable when I set wdt_frequency = 1, but otherwise it seems to reset after a slightly longer lifetime. :(

Admin
Petr Štetiar commented on 13.09.2019 11:33
I think the issue is that procd is trying to set a timeout value above the maximum that sunxi-wdt supports, and thus the kernel is returning -EINVAL.

Sure, that's clear.

Unfortunately this doesn't completely resolve the issue for me, the watchdog still resets the SoC a few seconds later.

That's why I've asked you (two times already) to remove that WDIOC_SETTIMEOUT ioctl altogether.

Hal Martin commented on 13.09.2019 12:03

Removing the ioctl does not resolve the issue.

Patch:

--- a/watchdog.c	2019-09-13 11:04:30.725915275 +0000
+++ b/watchdog.c	2019-09-13 11:46:53.658245623 +0000
@@ -90,7 +90,7 @@
 	if (wdt_fd < 0)
 		return -1;
 
-	return ioctl(wdt_fd, WDIOC_SETTIMEOUT, &wdt_drv_timeout);
+	return 0;
 }
 
 void watchdog_set_magicclose(bool val)

It is still resetting shortly after boot. Log attached.

Admin
Petr Štetiar commented on 13.09.2019 13:15
Removing the ioctl does not resolve the issue.

Ok, so this basically confirms, that the problem is not related to procd. You'll probably get the same result with init=/bin/sh and then opening the watchdog device (cat /dev/watchdog should be probably enough), thus starting the watchdog.

It is still resetting shortly after boot. Log attached.

Then the problem is probably somewhere in the kernel and you should be able to figure it out with a bunch of printfs at the obvious places (sunxi_wdt_set_timeout, sunxi_wdt_start, sunxi_wdt_ping).

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing