OpenWrt/LEDE Project

  • Status New
  • Percent Complete
    0%
  • Task Type Bug Report
  • Category Kernel
  • Assigned To No-one
  • Operating System All
  • Severity Critical
  • Priority Very Low
  • Reported Version openwrt-21.02
  • Due in Version Undecided
  • Due Date Undecided
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Alexander Lochmann - 15.05.2020
Last edited by Baptiste Jonglez - 17.05.2021

FS#3099 - ipq806x: kernel 5.4 crash related to CPU frequency scaling

Linux kernel experiences an oops from time to time
- Device: TP-Link Archer C2600
- Kernel Version: 5.4.40
- OpenWRT Version: OpenWrt SNAPSHOT, r13224+16-2308644b0c
- Steps to reproduce: Simply wait. After an unknown amount of time the kernels crashes leading to a reboot.

 


John T commented on 12.07.2020 19:16

Thanks for linking my bug. It does seem similar. How did you get the full kernel call stack?

Alexander Lochmann commented on 12.07.2020 19:36
John T commented on 12.07.2020 20:22

Thanks! Have you tried reverting to kernel 4.19 and try to catch this again?

I checked the clk-krait.c, file where "krait_mux_set_parent" is defined, introduced by the patch in 4.19 and the one already in 5.4 and they're pretty similar.

FWIW I'm on 5.4.48 and still seeing this issue.

Alexander Lochmann commented on 13.07.2020 08:50

No, I haven't tried this yet.
How have you reverted back to 4.19?

Which patch are you talking about?
Can you pls point me to the location?

John T commented on 13.07.2020 16:22

I haven't but I'm trying different workarounds in that CPU scaling area. I have 2 identical routers so I can experiment with different settings. I'll post if I find anything working, the issue is that it's taking days, weeks to crash.

The patch I was referring to is in openwrt\target\linux\ipq806x\patches-4.19:
0034-0007-clk-qcom-Add-support-for-Krait-clocks.patch

And looks already merged into 5.4.

John T commented on 17.07.2020 18:18

I might be too soon, but one of the routers survived for over 7 days now, with no reboots.
I basically disabled the CPU scaling on both cores. Might be something to try and post any updates. I hope I get to 2 weeks with no power outages.

# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance

John T commented on 23.07.2020 21:59

Update: 2 weeks uptime now and no reboots with CPU scaling disabled.

Alexander Lochmann commented on 26.07.2020 18:19

Nice. I applied your settings as well. So far no reboot for 2 and a half days.
Keep you posted.

Alexander Lochmann commented on 03.08.2020 09:30

Update: Uptime went up to ~9 days. I switched back to ondemand governor, and experienced a crash within 12 hours.....

Filip Matijević commented on 03.08.2020 19:47

Had no crashes to report for a while - they are typical after a week or so for me, and since I had one power failure and a couple of user initiated reboots I couldn't catch any logs but I have finally caught one to report (files attached).

I'll try performance setting for both cores and report back if I experience any crashes.

John T commented on 03.08.2020 19:58

Filip, your call stacks look a little different, so it might be a different issue.

I'm almost at 30 days uptime with the governor set to "performance".

Filip Matijević commented on 04.08.2020 07:53

I've just had another reboot and it seems that crashlog doesn't get overwritten on subsequent crashes - I'll do a hard reboot and wait.
I'm not able to say if it's the same problem here as my crashlogs are not consistent (for example: https://pastebin.com/raw/fhD5vVLQ) In any changing governor for me did the opposite for stability - I'm wondering if making CPU to run at 100% frequency causes overheating issue making it more unstable in my case. I'll try with the same governor once more to see if that's the case.

Alexander Lochmann commented on 03.09.2020 20:43

@John T: Disabling cpufreq did the trick. :-/ Uptime is currently 32 days.

John T commented on 03.09.2020 21:41

Thanks for confirming.

I kind of had some power failures, then had to flash a new image to get the stuff that I needed, so I haven't been able to test it for so long!

I'm on kernel 5.4.60 now with cpufreq disabled from kernel_menuconfig.

tmo26 commented on 15.10.2020 21:24

@Alexander Lochmann @John T
Any news regarding uptime? Did disabling cpufreq solve your reboot problems?

John T commented on 15.10.2020 21:40

Last uptime for me was over 30 days, before we had a power blip due to a wind storm. But I think it would've gone even longer. At this point I'm sure the cpufreq was causing the random reboots.

Alexander Lochmann commented on 16.10.2020 06:19

@tmo26: Yes, it seems that switching to the performance governor solved the issue. My reboots so far were all manually triggered. I observed uptimes of severals days.
I once switched back to the ondemand governor, and my router crashed within hours.

John T commented on 16.10.2020 16:21

So, now we have a workaround, what about the real fix? Is anyone looking into this?

Gram commented on 13.11.2020 12:46

Just to keep bug report alive as it seems very low priority bugs will not receive too much love from designers.

R7800 and kernel 5.4.72. 17d up and spontaneous reboot due to same issue.

Project Manager
Baptiste Jonglez commented on 17.05.2021 08:34

Assigning to 21.02 as it seems to still happen with the rc1.

Can somebody test again with 21.02.0-rc1 to confirm?

Project Manager
Baptiste Jonglez commented on 17.05.2021 08:40

Actually, the ipq806x cpufreq driver changed recently in master: https://git.openwrt.org/6e411b8416388a9c8be1b2291be9b5adeeb07784

It has not been backported to the 21.02 branch.

What would be useful is:

- test latest snapshot: it's using the new driver
- test 21.02.0-rc1: it's still using the old driver

If latest snapshot works correctly, then we can look at backporting this change before 21.02 comes out.

Shane commented on 17.05.2021 18:57

I appear to be running into this bug with 21.02.0-rc1 on the ZyXEL NBG6817, as noted on a follow-up to the WiFi firmware bug seemingly being fixed.

I have a bursty CPU workload (backing up via Deja Dup to separate OpenSSH SFTP chroot, which cycles between local PC compress/encrypt, low router CPU use, and SFTP upload of 25 MB files, moderate router CPU use), which triggers a panic and reboot within about 1-8 hours. Switching to the "performance" governor has finally allowed a full backup to complete successfully.

Continuous high CPU and network usage via iperf3 and OpenSSL benchmark as suggested by a past bug report does not seem to trigger this issue as quickly as that bug report suggested it should (tried 40 minutes, willing to try longer if desired).

Most panics do not show anything being sent to a UDP rsyslog listener on the wired network; the only kernel panic I've caught so far was an "rcu_sched" hang on one CPU core, so I imagine I'll need to either compile a custom kernel for ramoops support, or solder on serial headers inside the router.

My next goal was to devise a script that alternatively pegs the CPU for 1-5 seconds then let it rest for 1-5 seconds to exercise the CPU frequency governor. If I can reliably reproduce the issue, then I'll try the snapshot - though if others are able to test this sooner, please don't wait for me! (This is my primary home Internet router so it's a bit difficult to schedule experiments with it.)

Edit 2021-5-17 - issue unclear

Of course, it's only after I make this comment that the router reboots even with performance governor. I had decided to redo the backups for an unrelated reason, and it's restarted with no UDP rsyslog output. It does seem like switching governor has been more reliable, but I'll have to see - time to find the documentation on making and flashing a ramoops kernel, or just jump to the snapshot and see what happens, downgrading later.

Edit 2021-5-18 - now running snapshot

I am now running the ZyXEL NBG6817 snapshot from Tue May 18 22:05:39 2021, and I will post a new comment depending on my findings.

Shane commented on 19.05.2021 06:09

☑ Snapshot succeeds where 21.02.0-rc1 fails

I'm testing a very recent snapshot for the ZyXEL NBG6817 ("Tue May 18 22:05:39 2021"), and the first (EDIT: multiple!) full Deja Dup desktop system backup succeeded with the default ("ondemand") CPU governor! The end result is a bursty workload (due to SSH encryption and Deja Dup pausing to locally compress/encrypt before uploading in 25 MB chunks) of about 224 GB of data transferred via WiFi to a USB 3.0 external HDD over a period of roughly a fourth a day.

Granted, it's possible other changes between 21.02.0rc1 and the snapshot fixed this issue, and it's possible the issue still exists and I just haven't triggered it yet, but given the similarities to this bug report, I'm inclined to believe it's the CPU governor changes.

I will continue testing by redoing full Deja Dup backups to the router's OpenSSH SFTP chroot and I'll edit my comment to log further test results.

I've also downloaded a 42.31 GB Google Takeout export (i.e. able to saturate ISP download or router SQM handling - router CPU seems to be the limit) during the full backup process, and will continue to retry this periodically.

(If the NBG6817 ath10k-ct driver bug is fixed as it seems to be, and this bug gets resolved in 21.02, in the 2 years I've had this router, this might become the first time I've had greater than 1-3 week uptimes!)

Current test results: 6/6 full backups succeeded

Last updated:

Thu 20 May 2021 11:50:24 PM EDT
Shane commented on 21.05.2021 04:02

☑ Snapshot succeeds ~6/6 times where 21.02.0-rc1 fails ~7/8 times

As explained in my previous comment, I've tested an operation of around six hours that would only succeed roughly 1 out of 8 tries with 21.02.0-rc1, and it worked 6 out of 6 times with the snapshot ("Tue May 18 22:05:39 2021") on the ZyXEL NBG6817.

Due to the periodic, non-continuous CPU usage of the operation (Deja Dup SFTP backup), I suspect this exercises the CPU governor, hence the near-consistent crashing on rc1 without the changes, and the rock-solid operation after on the snapshot with the changes.

It remains possible that other changes between 21.02.0-rc1 and the snapshot I used fixed the issue, but due to keeping configuration (sysupgrade) and the CPU-related nature, it might be unlikely.

EDIT: After catching up with openwrt/openwrt pull requests, it sounds like I might have run into different issues. If I should file a new bug report, I'd be happy to do so, or if there's other testing I should do instead.

Apologies for the noise!

Edit 2021-5-22: testing /etc/init.d/cpufreq script

John T commented on 21.05.2021 06:16

This would be my 3rd day on a snapshot from Sun May 16 21:59:13 2021. I left the governor set to "ondemand" and so far no reboots. I don't have a reliable way of reproducing the issue but in 3 days it would've rebooted at least once. I'll give it a few more days before I call it good! :)

Thanks for making noise on this age-old bug.

Project Manager
Baptiste Jonglez commented on 22.05.2021 08:32

As suggested by Hannu Nyman, you can try adding this script on top of 21.02-rc1 or 21.02-SNAPSHOT: https://github.com/openwrt/openwrt/commit/861b82d36ae43efec8d16e61b82482e38996af92

Alexander Lochmann commented on 22.05.2021 20:02

I'm running a C2600 fw based on git commit b36b8b6929. I switched to ondemand scaling govenor. No reboots so far.

Shane commented on 22.05.2021 21:23

Unfortunately, I ran into a crash within 3 hours of the first test (Deja Dup full system SFTP backup) on OpenWRT 21.02.0-rc1 on ZyXEL NBG6817 with the /etc/init.d/cpufreq changes applied (as confirmed by "cat /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq", etc).

There's likely some other change between OpenWRT 21.02.0-rc1 and the May 18th snapshot beyond adjusting the minimum CPU frequency given the snapshot worked 6/6 times tested.

At this point, if I should file a new issue, I'd be happy to do so, or if there's any other testing I could provide, please let me know.

Thank you for your time!

EDIT: Another crash, this time without the backup running. Roughly 3 hours apart, definitely something fishy here.

EDIT 2021-5-22 11:35 pm EST: Continuing additional testing (beyond first backup) as before, will edit this comment depending on results.

EDIT 2021-5-23 1:13 pm EST: Second full backup attempt failed (GIO reported location not mounted, as with previous reboots), but the router didn't kernel panic this time - slight progress..? Will keep testing.

EDIT 2021-5-23 1:29 pm EST: Second router kernel panic. Will keep retrying backup.

EDIT 2021-5-24 2:51 am EST: After 8 tries, one of the full backups finally succeeded. Cleared the working backup, continuing testing...

EDIT 2021-5-25 8:37 pm EST: I have added serial UART access to my ZyXEL NBG6817 by opening and modifying it according to the OpenWRT wiki. I'll continue testing until I get another kernel panic that shows up in the serial log.

EDIT 2021-5-26 4:51 pm EST: Okay, now I'm baffled. The router rebooted while connected via serial console and nothing shows in the serial console log. I'll try one more time. Maybe I'm needing to file a new bug. The weird thing is… the OpenWRT snapshot worked reliably, 6/6 tries consecutively worked! I don't know what's going on anymore…

EDIT 2021-5-27 1:39 pm EST: The second router reboot with serial console, the second time not much is shown. At least this time the "rcu_sched" tasks hung error actually showed up at all, instead of being totally silent.

EDIT 2021-5-28 3:04 am EST: The second successful backup, after 20 tries (well, technically it failed with a GPG error at the end, but I think that's unrelated). I'll keep trying to see if I can get new and exciting serial console kernel logs.

EDIT 2021-5-28 1:50 pm EST: The third successful backup - technically it failed with a GPG error at the end again, but.. still. This boggles the mind; I changed nothing. I'll see how long this successful run goes.

EDIT 2021-5-28 8:52 pm EST: Fourth successful backup, third in a row. I don't think I've changed any router-related settings. Given the "rcu_sched" hang, maybe weird race condition? I suspect I should file a new bug; I'll keep editing this comment for now.

EDIT 2021-5-29 4:35 am EST: There's the failure - only a SFTP disconnect this time, but that hadn't happened 6 times in a row on the snapshot.

EDIT 2021-5-29 6:21 pm EST: Two crashes, a kernel panic and a ath10k firmware failure, with logs! Finally the serial console paid off.

EDIT 2021-5-30 3:44 pm EST: Another reboot without anything suspicious in the serial console. This time I wasn't even running a backup.

EDIT 2021-5-30 10:09 pm EST: Now running OpenWRT 21.02.0rc2! Will see what happens.

EDIT 2021-5-31 9:32 pm EST: Got my first crash on OpenWRT 21.02.0rc2.

EDIT 2021-6-1 3:17 pm EST: And the first hang on OpenWRT 21.02.0rc2 - I think it's safe to say rc2 had no change from rc1 with the /etc/init.d/cpufreq script.

EDIT 2021-6-1 3:46 pm EST: I'm now trying the 2021-5-31 snapshot (with new CPU governor and other changes) to reconfirm that the snapshot still works correctly.

EDIT 2021-6-1 9:57 pm EST: Current plan of action: verify snapshot works, then figure out how to experimentally backport the CPU governor patches atop the 21.02 OpenWRT branch.

EDIT 2021-6-9 8:52 pm EST: Currently running a debug build of 21.02 SNAPSHOT with kernel and package debugging information enabled. I'll look to see what stress-ng parameters recreate this issue.

EDIT 2021-6-10 3:47 am EST: Woohoo! First crash on the debug-enabled custom 21.02 SNAPSHOT build! I wasn't (yet) able to figure out how to crash it with stress-ng, but it was time for another system backup so it crashed anyways. I'll keep the router on the master snapshot branch to let the backup finish, then I'll explore backporting the CPU governor patchset.

EDIT 2021-6-14 9:16 pm EST: I am now running a custom build of 21.02.0-rc3 with the CPU governor patches backported (git cherry-pick'd). I'll post a new comment once I have 1-2 test results.

Old status (21.02.0rc1): 4/25 backups succeeded (9 panics, 1 hang, 11 SFTP disconnects)

In terms of router kernel panics, it's only crashed a few times instead of every single time the SFTP connection is dropped. I've not yet determined what's causing the SFTP connection between the desktop and OpenSSH on OpenWRT to break, since this was working on the May 18th snapshot and working back on 19.07.7. The SFTP issue might be separate from kernel panics though.

1 time, the router hung instead of automatically rebooting. I had forgotten to re-enable UDP syslog (after restoring to 21.02.0-rc1), so I don't have the logs - I suspect it's the "rcu_sched detected stalls" I had a few weeks ago. Will keep an eye on this too.

The 7th backup failure (3rd kernel panic) was proceeded by a kernel bug logged over UDP syslog 15 minutes before the panic itself. This may be unrelated, but I've included it here just in case.

Current status (21.02.0rc2): 1/5 backups succeeded (1 panic, 1 hang, 2 SFTP disconnects)

I've moved to testing OpenWRT 21.02.0rc2.

For now, I suspect the SFTP disconnects should be a separate bug report.

Last updated:

Mon 14 Jun 2021 09:17:38 PM EDT
Shane commented on 27.05.2021 18:38

Update that I feel warrants a new comment/email notification:

I have added a hardware serial connection to my router according to the OpenWRT wiki page on the NBG6817, and recorded two reboots during the SFTP backup process.

Hardware serial console logs

For the first, nothing shows in the kernel log at all over the serial console (the invalid VHT happened some time before reboot):

[   53.792100] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0-2: link becomes ready
[   82.627616] ath10k_pci 0000:01:00.0: Invalid peer id 1 or peer stats buffer, peer: fc03a643  sta: 00000000
[ 2239.209010] ath10k_pci 0000:01:00.0: htt tx: fixing invalid VHT TX rate code 0xff
[ 2262.956279] ath10k_pci 0000:01:00.0: Invalid VHT mcs 15 peer stats
[59332.200618] ath10k_pci 0001:01:00.0: wmi: fixing invalid VHT TX rate code 0xff
�

U-Boot 2012.07 [Standard IPQ806X.LN,unknown] (Oct 03 2018 - 18:59:17)

(Full log available here: 2021-05-25 20-33-18-public.log.txt)

For the second reboot, "rcu_shed" complained about hung tasks at some point before reboot:

[   54.350758] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0-2: link becomes ready
[   55.250942] ath10k_pci 0001:01:00.0: Invalid peer id 1 or peer stats buffer, peer: fa53385a  sta: 00000000
[   70.277303] ath10k_pci 0001:01:00.0: Invalid VHT mcs 15 peer stats
[  954.621190] ath10k_pci 0000:01:00.0: htt tx: fixing invalid VHT TX rate code 0xff
[25315.944981] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[25315.945024] rcu: 	1-...0: (1 GPs behind) idle=8f2/1/0x40000000 softirq=431343/431344 fqs=1050 
[25315.949715] 	(detected by 0, t=2102 jiffies, g=2009085, q=250)
[25315.958391] Sending NMI from CPU 0 to CPUs 1:
^@

U-Boot 2012.07 [Standard IPQ806X.LN,unknown] (Oct 03 2018 - 18:59:17)

(Full log available here: 2021-05-26 22-22-26-public.log.txt)

(I switched the serial connection to a spare machine with an older version of screen, etc, hence different encoding of non-ASCII characters.)

Monitoring methods for serial console, SFTP connection

Method of recording serial console output, run via a computer connected to the USB serial console:

mkdir --parents "$HOME/Desktop/logs/serial/"
sudo chown root:$(id --group --name) /dev/ttyUSB0
screen -L -Logfile "$HOME/Desktop/logs/serial/$(date "+%F %H-%M-%S").log" /dev/ttyUSB0 115200
# Alternative for Ubuntu 16.04 version of screen without "-Logfile" option:
# LOGFILE="$HOME/Desktop/logs/serial/$(date "+%F %H-%M-%S").log"
# screen -L /dev/ttyUSB0 115200
# mv screenlog.0 "$LOGFILE"

Method of monitoring router for OpenSSH SFTP connection drops, run via the default dropbear root user:

while true; do
	echo
	echo "$(date -R): Watching usbdrive SFTP connection..."
	while ps | grep -v "grep" | grep -q "sshd: usbdrive"; do
	echo "$(date -I): usbdrive SFTP active - $(uptime)"
	sleep 10m
	done
	echo "-/!\---------------------------------------------"
	echo "$(date -I): usbdrive SFTP disconnected! - $(uptime)"
	echo "-------------------------------------------------"
	echo -n "Waiting for connection.."
	while ! ps | grep -v "grep" | grep -q "sshd: usbdrive"; do
	echo -n "."
	sleep 10m
	done
	echo " connected!"
done

What to do next..?

At this point, I'm unsure of what exactly could be causing this problem. If I should file a new bug report, let me know! Or if there's any particular testing, I'm willing to try more drastic steps, including a "git bisect / build / flash / test" of the OpenWRT repository changes from 21.02 branch-off point to May 18th snapshot, if that's even possible given the multi-repository layout with pulling in extra kernel modules via opkg.

It's particularly strange that this worked reliably on 19.07.x and the snapshots, but not 21.02.0rc1. I'm tempted to retry the snapshot, but it was only May 18th since I had 6/6 consecutive successful backups. I've not had any consecutive successful backups on 21.02.0rc1 even with the /etc/init.d/cpufreq script, let alone six.

I apologize for the lack of clear information; I had sincerely hoped that a hardware serial console would be much more insightful, only to get almost nothing from it so far (except for the fun of successfully modding an out-of-warranty device).

In the meantime, I will continue testing, and I'll keep updating my previous comment with the backup results (currently, only 1/19 successful) on 21.02.0rc1 while using the /etc/init.d/cpufreq script.

Thank you very much for your time!

Shane commented on 30.05.2021 03:54

Two new panics/reboots captured on serial console, two new unique logs...

I'm wondering if I might be running into a collection of unrelated bugs with the same trigger (periodic CPU/USB/network usage), or if these bugs are merely symptom of another cause, whether or not it's from the CPU governor changes or something else between 19.07 (good asides from existing ath10k WiFi driver crash), 21.02 (bad), and snapshot (good).

If this is useful, I'll continue collecting data about each reboot. If this is not useful, or I should be testing or trying something else instead, providing more information, etc, let me know!

See my previous comments (updated) for ongoing test results and hardware serial console setup.

Hardware serial console logs

For the first, it seems like a potential ath10k WiFi driver failure - this one time might be unrelated:

[   58.937876] ath10k_pci 0000:01:00.0: Invalid peer id 1 or peer stats buffer, peer: cd1d2e61  sta: 00000000
[26066.031472] ath10k_pci 0000:01:00.0: htt tx: fixing invalid VHT TX rate code 0xff
[89662.334382] ath10k_pci 0001:01:00.0: wmi: fixing invalid VHT TX rate code 0xff
[183728.043364] ath10k_pci 0000:01:00.0: Cannot communicate with firmware, previous wmi cmds: 36967:18342458 36967:18342450 36967:18342442 36967:18342434, jiffies: 18342760, attempting to fake crash and restart firmware, dev-flags: 0x42
[183728.043480] ath10k_pci 0000:01:00.0: failed to send wmi nop: -108
[183728.063136] ath10k_pci 0000:01:00.0: could not request stats (type -268435456 ret -108 specifier 1)


U-Boot 2012.07 [Standard IPQ806X.LN,unknown] (Oct 03 2018 - 18:59:17)

(Full log available here: 2021-05-27 12-29-38-public.log.txt)

For the second reboot, a full, proper kernel panic log was provided:

(In hindsight I've probably used the term panic incorrectly, as hard reboots might be caused by things other than kernel panics.)

[ 1080.330020] ath10k_pci 0000:01:00.0: htt tx: fixing invalid VHT TX rate code 0xff
[ 6122.017923] 8<--- cut here ---
[ 6122.017946] Unable to handle kernel NULL pointer dereference at virtual address 00000001
[ 6122.019904] pgd = 73fb8949
[ 6122.029064] [00000001] *pgd=00000000
[ 6122.030646] Internal error: Oops: 5 [#1] SMP ARM

[...trimmed to keep comment short, see link to full log below...]

[ 6122.769090] dfe0: be8ea8d8 be8ea8c8 b6fbf66c b6fbeb44
[ 6122.777246] Rebooting in 3 seconds..

(Full log available here: 2021-05-29 15-34-17-public.log.txt)

Both of these reboots are with OpenWRT 21.02.0rc1 with the backported /etc/init.d/cpufreq script enabled and active.

As noted above, I'll continue testing until I find something relevant to investigate. I appreciate the time spent on this.

Shane commented on 01.06.2021 02:27

See my previous comments (updated) for ongoing test results and hardware serial console setup.

Running OpenWRT 21.02.0rc2, still getting crashes. So far, 1 kernel panic on the fourth attempt, with 1 successful backup. Unique kernel panic crash log this time.

21.02.0rc2: Hardware serial console logs

Crash in events:

[82924.705282] 8<--- cut here ---
[82924.705303] Unable to handle kernel NULL pointer dereference at virtual address 00000030
[82924.707228] pgd = b3635b07
[82924.717072] [00000030] *pgd=00000000
[82924.717987] Internal error: Oops: 5 [#1] SMP ARM
[...trimmed...]
[82924.840472] Workqueue: events 0xc072e4bc

[...trimmed to keep comment short, see link to full log below...]

[82925.268163] Function entered at [<c034aeec>] from [<423024cc>]
[82925.273983] Rebooting in 3 seconds..

(Full log available here: 2021-05-30 20-51-44-public.log.txt)

Note: the network interface restarts hours earlier in the full log was due to tinkering with the bridge interface migration.

Hang/stalled CPU:

[24541.766577] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[24541.766621] rcu: 	1-...!: (0 ticks this GP) idle=e7e/1/0x40000000 softirq=537184/537184 fqs=0 
[24541.771311] 	(detected by 0, t=2102 jiffies, g=2067077, q=295)
[24541.779988] Sending NMI from CPU 0 to CPUs 1:
[24551.785811] rcu: rcu_sched kthread starved for 2102 jiffies! g2067077 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[24551.785840] rcu: RCU grace-period kthread stack dump:
[24551.795042] rcu_sched       I    0    10      2 0x00000000
[24551.800257] Function entered at [<c0917b18>] from [<c0917dcc>]
[24551.805634] Function entered at [<c0917dcc>] from [<c091bcf8>]
[24551.811450] Function entered at [<c091bcf8>] from [<c03802ac>]
[24551.817267] Function entered at [<c03802ac>] from [<c033da30>]
[24551.823083] Function entered at [<c033da30>] from [<c03010e8>]
[24551.828898] Exception stack(0xdd467fb0 to 0xdd467ff8)
[24551.834724] 7fa0:                                     00000000 00000000 00000000 00000000
[24551.839860] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24551.848014] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[24551.858833] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 2, skipped old beacon

[...trimmed to keep comment short, see link to full log below...]

(Full log available here: 2021-05-31 19-57-25.log.txt)

I'll keep testing; I suspect the root problem remains, whatever the problem is. If this keeps happening, I'll try the snapshot again and verify that the problem remains resolved there, as tested back on May 18th.

This is on 21.02.0rc2 with the /etc/init.d/cpufreq script active, as confirmed by "cat /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq".

I've also joined the OpenWRT IRC channels on OFTC (Open and Free Technology Community) as "digitalcircuit" in #openwrt and #openwrt-devel in case anyone would prefer to get in touch with me that way.

Shane commented on 10.06.2021 08:04

See my previous comments (updated) for ongoing test results and hardware serial console setup.

I've made a custom build of the OpenWRT 21.02 branch with kernel symbols and other debugging information compiled in. No luck with stress-ng so far, but a backup was queued to happen and I got another crash - this time with more information!

I've temporarily switched back to master branch snapshot to let the backups finish successfully, then I'll work on backporting the CPU governor patches to the 21.02 branch to see if that fixes these crashes.

21.02-branch-with-debug: Hardware serial console logs

Crash in events:

[ 1428.078137] 8<--- cut here ---
[ 1428.078159] Unable to handle kernel NULL pointer dereference at virtual address 00000020
[ 1428.080082] pgd = 7528ca69
[ 1428.088407] [00000020] *pgd=00000000
[ 1428.090847] Internal error: Oops: 5 [#1] SMP ARM
[ 1428.094573] Modules linked in: xt_connlimit pppoe ppp_async nf_conncount iptable_nat ath10k_pci ath10k_core ath xt_state xt_nat xt_helper xt_conntrack xt_connmark xt_connbytes xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD xt_CT pppox ppp_generic nf_nat nf_flow_table_hw nf_flow_table nf_conntrack_netlink nf_conntrack mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_recent xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY slhc sch_cake nfnetlink nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt compat sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred ledtrig_usbport nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ifb sctp libcrc32c tun md5 uas usb_storage leds_gpio xhci_plat_hcd
[ 1428.094670]  xhci_pci xhci_hcd dwc3 dwc3_qcom ohci_platform ohci_hcd phy_qcom_ipq806x_usb ahci fsl_mph_dr_of ehci_platform ehci_fsl sd_mod ahci_platform libahci_platform libahci libata scsi_mod ehci_hcd gpio_button_hotplug ext4 mbcache jbd2 crc32c_generic
[ 1428.186147] CPU: 1 PID: 75 Comm: kworker/1:2 Not tainted 5.4.124 #0
[ 1428.208376] Hardware name: Generic DT based system
[ 1428.214646] Workqueue: events dbs_work_handler
[ 1428.219496] PC is at qcom_wdt_start+0x10/0xb0
[ 1428.223919] LR is at refcount_dec_and_mutex_lock+0x18/0x60

[...trimmed to keep comment short, see link to full log below...]

[ 1428.476346] Exception stack(0xd3343fb0 to 0xd3343ff8)
[ 1428.483900] 3fa0:                                     1941a857 3d872f63 bf98d3a4 e73bf00c
[ 1428.488943] 3fc0: fda1eba5 78376fcb 40802f96 28a8dae7 47859c58 c625b770 29698da4 00000006
[ 1428.497101] 3fe0: 2794f848 be81e658 6f665708 b6d4bad8 20000010 ffffffff
[ 1428.505258] Rebooting in 3 seconds..

(Full log available here: 2021-06-10 02-05-14-public.log.txt)

Custom debug build configuration

Patch delta from 21.02.0-rc2 config-21.02.0-rc2.buildinfo

Full diffconfig to recreate this custom build

Up next on my "to figure out" list: a custom 21.02 build with the CPU governor patchset backported.

Feel free to let me know if anything else comes to mind to try!

Shane commented on 17.06.2021 01:29

I have tested a build of OpenWRT 21.02.0-rc3 with the CPU governor patchset backported, and 4 backups succeeded in a row! This is the same SFTP to the USB drive on the NBG6817 setup for the past series of comments.

The DTS changes (adjusting voltage, etc) do not seem to be needed, just the CPU governor patchset. (They don't seem to hurt either, as the main branch snapshot builds were just fine too.)

I'm confident enough in these results to submit a patch on the OpenWRT mailing list, as per zorun's suggestion on OFTC/#openwrt-devel - thanks!

EDIT 2021-6-17 1:59 am EST: Backport patch has been submitted!

Let me know if there's any particular regression testing or other confirmations I should do. Meanwhile, I'll continue testing Déjà Dup SFTP uploads alongside normal router usage (this is the primary/WAN-facing router and AP, with a second AP connected sharing VLANs).

EDIT 2021-6-17 11:15 am EST: Before this patch at most 2 backups succeeded consecutively. With 5 consecutive successful backups after applying this patch, I'm pretty confident this is a substantial improvement, even if it might not fix every reboot. I'll keep testing though, and monitoring stability outside of the SFTP backup tests too.

EDIT 2021-6-17 1:27 pm EST: During one backup, the SFTP connection was dropped on the desktop computer. Given the router itself didn't kernel panic/reboot, I wouldn't count this against the success of this patch.

EDIT 2021-6-20 7:33 pm EST: I have submitted these test results and further details on my environment to the OpenWRT development mailing list, too. If there's any further steps to take that would help get this merged, let me know!

Current status (21.02.0-rc3 with CPU patch): 11/11 backups succeeded

Last updated:

Sun 20 Jun 2021 07:34:35 PM EDT

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing