FS#764 - MT7621: Any traffic shaping results in crashes/stack traces #6269
Comments
Mushoz: There is a typo in the task title: MT6721 should be MT7621. I can't edit it myself; could a mod please fix the title? Thanks :) Edit 1: I forgot to mention: disabling offloading with ethtool does not fix it for me. It only delays the crashes, it does not eliminate them. Edit 2: Another important detail: the crashes always seem to happen during the upload part of the speedtest.
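(For reference, "disabling offloading with ethtool" means something along these lines. This is only a sketch; the interface name eth0 is an assumption, check `ip link` for the right name on your device:)

```shell
# Turn off common hardware offloads on the suspected interface
# (eth0 is an assumption; on many MT7621 boards the WAN port differs).
ethtool -K eth0 gro off gso off tso off sg off
# Verify the resulting offload state:
ethtool -k eth0 | grep -E 'segmentation-offload|receive-offload'
```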
Bartvz: Possibly related: if running SQM QoS, the following stack traces appear in the kernel and system logs: Kernel log: System log: If necessary I can provide more information.
Borromini: I can confirm this behaviour. Disabling offloading improves stability but does not eliminate the reboots; they just become less frequent.
Bartvz:
And it failed... Kernel log: System log:
azuwis: I have the same issue with a different MT7621 device, the ZBT-WG3526.
Mushoz: A build of the latest master branch with the 4.9 kernel seems to have fixed the issue for me. Initial testing has shown no crashes whatsoever. I will monitor this over a longer period of time to see whether the issue is completely gone or not. Backporting some code from 4.9 to 4.4 might be needed to fix this issue once and for all. Unfortunately, I am also running into a few new issues with the 4.9 build. After a while, the router stops responding to a large number of commands in SSH sessions, and LuCI is also not able to apply changed settings (presumably because the network restart command is not working either). I am not sure whether that is a widespread issue or whether more people have it. I will keep this thread updated as more information comes in about the current situation. Hopefully more people will be able to join in on the testing.
Borromini: That's good to know Jaap, but from what you're saying (and from what I saw committed to trunk) it looks like 4.9 support isn't fully done yet for mt7621?
Mushoz: I haven't checked the status in the commits, so I am not sure where support for 4.9 currently stands. But the router running the 4.9 kernel now has an uptime of 2.5 days, so it's not that bad. Still far from perfect though. I still suspect the drivers of being buggy and causing the SQM/QoS and aforementioned issues. Edit: This is with SQM disabled, by the way. The entire platform is still very unstable with SQM enabled, just like on the 4.4 kernel. The mt76 driver probably needs some work to fix this once and for all.
Mushoz: Is there anything I can do to help pinpoint the exact cause of this bug? Would love to hear if anybody has already found some time to take a look at this issue :)
camel: Try disabling the 2.4 GHz WLAN... it is buggy as hell, and for me it is useless.
Mushoz: @camel Are you running SQM, and is it stable for you, even with cake? And what do you mean by completely disabling the 2.4 GHz WLAN? Is disabling it in LuCI enough, or is something else needed as well?
camel: I disabled it completely via the web interface (disabled all SSIDs on 2.4 GHz). I am currently using another router for 2.4 GHz (what a shame...). So this MT7621 is really totally unstable and useless, and maybe it is really related to the 2.4 GHz parts of the MT7621 driver. Please test it.
camel: Meanwhile I tested it... with default SQM it is OK.
Mushoz: @camel And is that with WiFi enabled or disabled?
camel: I guess a properly stable 2.4 GHz WiFi is one of the most wanted parts.
Borromini: It might be premature, but with kernel 4.4.70 my uptime is approaching one week. Usually my DIR-860L would reboot once every two days, or sometimes even multiple times a day. SQM is still enabled (with cake). Will follow up.
jordipalet: I'm having the same troubles; tried with two different MT7621 devices, the ZBT-WG3526 and the SK-WB8. The problem happens to me several times per day if I have SQM activated. Tested with LEDE 17.01.0, 17.01.1 and 17.01.2. I've also tried disabling the WiFi interfaces, no difference. Of course the 2.4 GHz WiFi has its own problems... but I believe they are not related to SQM.
Mushoz: This commit supposedly fixes this longstanding issue: I will personally test it this weekend. For people who have time earlier, feel free to test it :)
dchard: This happens under kernel 4.9.37, and I agree this is load dependent. I am not using any QoS like SQM, but when I run an iperf3 server on the router itself, after a few seconds I get exactly the same kernel trace as in the first post. However, I have never seen a reboot of my device.
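(For anyone trying to reproduce this load pattern, an iperf3 setup of the kind described looks roughly like this. The router address 192.168.1.1 and the stream count are assumptions, not taken from the report:)

```shell
# On the router itself: run an iperf3 server in the background.
# Terminating TCP streams on the SoC is what generates the CPU load.
iperf3 -s -D

# From a LAN client: push traffic at the router for 60 seconds
# with several parallel streams, then repeat in reverse (download).
iperf3 -c 192.168.1.1 -t 60 -P 8
iperf3 -c 192.168.1.1 -t 60 -P 8 -R
```

Watching `dmesg` on the router during the run should show whether the trace appears.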
bjonglez: This is very likely the same issue as FS#804. Please test the patch mentioned above, or the pre-compiled images here: https://pub.polyno.me/lede-ramips-FS804/ (17.01 + the above patch, version r3464+1-82b20d74cb)
dchard: I installed the patch and ran CPU benchmarks for about an hour. So far so good; I was not able to reproduce this issue in any of the ways I did before. I will test it for 2-3 days and report back with the result.
camel: It is looking very good, but I can not fully test it, as many packages can't be installed on this patched version, such as:
Please can the devel team deploy this patch to current trunk? Please commit it to trunk and let us know when done; then I will do more testing ASAP.
dchard: After 3 days of testing I can tell that the kernel warning is no longer presenting itself. I did not use SQM or QoS at all, but I tested the SoC under heavy load for hours and nothing happened: no hardlocks, no crashes, no kernel errors/warnings, no restarts. Everything looks fine. Previously it took about 5-10 minutes to trigger the RCU error under heavy load. I agree that this patch could be merged to trunk; it would make it a lot easier for us to test it further.
Mushoz: Sorry for my late reply. Unfortunately, during traffic shaping the DIR-860L still crashes with that patch applied, so it does not seem to be a complete fix. It does look like it takes longer to manifest, so I believe we're getting closer to the solution for our issues :)
dchard: After 4 days of running, I found this in the kernel log:
The system log is empty. No restarts or any other way of noticing this except the log entry. This is with the patched version; maybe it is completely unrelated.
pparent76: Can you test with these images containing the latest patch for this issue: https://www.own-mailbox.com/lede/ It will also be in the next trunk build (in 1 day max).
dchard: I am testing with the latest trunk (patches already included), but after 6 days of error-free operation, today I got this:
It is interesting, as I tortured the patched version for hours with 100% CPU load and iperf3 tests at the same time, yet got nothing. And today, out of the blue, I got this error again. No crashes, no reboots, no other sign of the event except the kernel log entry.
codemarauder: It is crashing on an x86_64 APU2 with 17.01.4, kernel version 4.4.92, as well. Created a bug report: FS#1136 (https://bugs.lede-project.org/index.php?do=details&task_id=1136)
camel: Does that bug still exist?
Mushoz: No, this bug has been fixed in 17.01.4 and in the master branch.
ds_shadof: ZBT-WG3526 (16M)
nick471: Hi there, I also have the same issue running snapshot r8378-9ac7350240 on a Ubiquiti EdgeRouter X SFP. It also occurs on stable 18.06.x builds. It is random in its behaviour and can occur at any time; no particular pattern appears to trigger the failure.
[178873.280538] ------------[ cut here ]------------
[178873.289945] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0x1ac/0x324
[178873.306578] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[178873.320602] Modules linked in: pppoe ppp_async pppox ppp_generic nf_conntrack_ipv6 iptable_nat ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt i2c_gpio i2c_algo_pca i2c_algo_bit gpio_pca953x i2c_dev ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables leds_gpio gpio_button_hotplug
[178873.444409] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.78 #0
[178873.456520] Stack : 00000000 8ff1e540 805a0000 8006f804 805d0000 8056cd38 00000000 00000000
[178873.473323] 80537728 8fc0ddc4 8fc441fc 805a8947 80532818 00000001 8fc0dd68 532616af
[178873.490124] 00000000 00000000 80610000 00003f58 00000000 000000d3 00000008 00000000
[178873.506924] 00000000 805b0000 0006c7f9 70617773 00000000 00000000 805d0000 8037ff98
[178873.523724] 00000009 00000140 00000001 8ff1e540 00000000 802a2ca8 00000004 80610004
[178873.540527] ...
[178873.545555] Call Trace:
[178873.550603] [<800106c0>] show_stack+0x58/0x100
[178873.559632] [<8047244c>] dump_stack+0x9c/0xe0
[178873.568468] [<8002e408>] __warn+0xe0/0x114
[178873.576783] [<8002e46c>] warn_slowpath_fmt+0x30/0x3c
[178873.586828] [<8037ff98>] dev_watchdog+0x1ac/0x324
[178873.596369] [<80086774>] call_timer_fn.isra.3+0x24/0x84
[178873.606931] [<80086990>] run_timer_softirq+0x1bc/0x248
[178873.617322] [<8048f920>] __do_softirq+0x128/0x2ec
[178873.626853] [<80032b34>] irq_exit+0xac/0xc8
[178873.635344] [<802511ac>] plat_irq_dispatch+0xfc/0x138
[178873.645560] [<8000b5e8>] except_vec_vi_end+0xb8/0xc4
[178873.655602] [<8000cfb0>] r4k_wait_irqoff+0x1c/0x24
[178873.665322] [<8006687c>] do_idle+0xe4/0x168
[178873.673812] [<80066af8>] cpu_startup_entry+0x24/0x2c
[178873.683935] ---[ end trace 83be30e64239c52f ]---
Happy to help debug this one if anyone can assist. Can we get this ticket re-opened, or should a new ticket be created? Cheers.
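(One way to help collect evidence before a hardlock: a small sketch that filters kernel-log text for the two signatures seen in this thread, so matching lines can be copied to persistent storage. The log-source and output path mentioned in the comments are assumptions for a typical OpenWrt/LEDE device:)

```shell
#!/bin/sh
# Filter kernel-log text for the crash signatures reported in this thread:
# NETDEV WATCHDOG transmit-queue timeouts and rcu_sched stall warnings.
scan_crashes() {
    grep -E 'NETDEV WATCHDOG|rcu_sched (detected stalls|kthread starved)'
}

# On the router you would pipe the real log in, e.g.:
#   dmesg | scan_crashes >> /root/crash-watch.log   (path is an assumption)
# Demonstration on sample lines taken from this thread:
printf '%s\n' \
  'NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out' \
  'INFO: rcu_sched detected stalls on CPUs/tasks:' \
  'a harmless log line' | scan_crashes
```

Running that from cron every minute would at least preserve the signature lines across a reboot.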
cwbsw: This problem still exists in the trunk version.
Mushoz:
There have been a large number of reports of bugs with MT7621 devices in combination with SQM. Debugging is difficult, because it often results in a hard crash which leaves no log files. I believe I have some interesting details that might make it easier to debug.
Device: DIR-860L rev B1, but according to reports all MT7621 devices are affected.
LEDE Version: LEDE Reboot SNAPSHOT r4094-961c0ea
Steps to reproduce: Run a dslreports.com speedtest with a large number of upload and download streams (32/32) with either SQM or QOS enabled on your WAN interface.
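(For completeness, on LEDE/OpenWrt the SQM setup referred to lives in /etc/config/sqm. A minimal sketch of such a config; the interface name and rates are assumptions, with the qdisc set to cake as in the reports:)

```
config queue 'eth1'
	option enabled '1'
	option interface 'eth0.2'    # WAN interface; assumption
	option download '85000'      # kbit/s; assumption
	option upload '10000'        # kbit/s; assumption
	option qdisc 'cake'
	option script 'piece_of_cake.qos'
```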
Observations:
Crash log:
There is usually no crash log because the router hardlocks and then reboots. But I got very lucky once and managed to get a log of the event:
[ 710.140000] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 710.150000] 1-...: (257 GPs behind) idle=dfc/0/0 softirq=48167/48179 fqs=1
[ 710.160000] (detected by 2, t=6004 jiffies, g=13114, c=13113, q=1063)
[ 710.170000] Task dump for CPU 1:
[ 710.180000] swapper/1 R running 0 0 1 0x00100000
[ 710.190000] Stack : 00000000 5b6c286a 000000a3 ffffffff 00000090 773742c0 804df2a4 80490000
[ 710.190000] 8048c75c 00000001 00000001 8048c540 8048c724 80490000 00000000 800135e4
[ 710.190000] 00000000 00000001 87c70000 87c71ec0 80490000 8005ec74 1100fc03 00000001
[ 710.190000] 00000000 80490000 804df2a4 8005ec6c 80490000 8001b1a8 1100fc03 00000000
[ 710.190000] 00000004 8048c4a0 000000a0 8001b1b0 8c94e220 00008018 dc124877 a0020044
[ 710.190000] ...
[ 710.260000] Call Trace:
[ 710.270000] [<8000be98>] __schedule+0x574/0x758
[ 710.280000] [<800135e4>] r4k_wait_irqoff+0x0/0x20
[ 710.290000]
[ 710.290000] rcu_sched kthread starved for 6016 jiffies! g13114 c13113 f0x0 s3 ->state=0x1
[ 782.470000] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 782.470000] 1-...: (0 ticks this GP) idle=12c/0/0 softirq=48179/48179 fqs=0
[ 782.470000] (detected by 0, t=6002 jiffies, g=13324, c=13323, q=1260)
[ 782.470000] Task dump for CPU 1:
[ 782.470000] swapper/1 R running 0 0 1 0x00100000
[ 782.470000] Stack : 00000000 00000001 0000000a 00000000 00000000 00000001 804df2a4 80490000
[ 782.470000] 8048c75c 00000001 00000001 8048c540 8048c724 80490000 00000000 800135e4
[ 782.470000] 00000000 00000001 87c70000 87c71ec0 80490000 8005ec74 1100fc03 00000001
[ 782.470000] 00000000 80490000 804df2a4 8005ec6c 80490000 8001b1a8 1100fc03 00000000
[ 782.470000] 00000004 8048c4a0 000000a0 8001b1b0 8c94e220 00008018 dc124877 a0020044
[ 782.470000] ...
[ 782.470000] Call Trace:
[ 782.470000] [<8000be98>] __schedule+0x574/0x758
[ 782.470000] [<800135e4>] r4k_wait_irqoff+0x0/0x20
[ 782.470000]
[ 782.470000] rcu_sched kthread starved for 6002 jiffies! g13324 c13323 f0x0 s3 ->state=0x1
[ 860.040000] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 860.050000] 1-...: (0 ticks this GP) idle=5a8/0/0 softirq=48179/48179 fqs=0
[ 860.060000] (detected by 3, t=6004 jiffies, g=13501, c=13500, q=2389)
[ 860.070000] Task dump for CPU 1:
[ 860.080000] swapper/1 R running 0 0 1 0x00100000
[ 860.090000] Stack : 00000000 00002cd1 00000000 777882c0 00000000 00000000 804df2a4 80490000
[ 860.090000] 8048c75c 00000001 00000001 8048c540 8048c724 80490000 00000000 800135e4
[ 860.090000] 00000000 00000001 87c70000 87c71ec0 80490000 8005ec74 1100fc03 00000001
[ 860.090000] 00000000 80490000 804df2a4 8005ec6c 80490000 8001b1a8 1100fc03 00000000
[ 860.090000] 00000004 8048c4a0 000000a0 8001b1b0 8c94e220 00008018 dc124877 a0020044
[ 860.090000] ...
[ 860.160000] Call Trace:
[ 860.170000] [<8000be98>] __schedule+0x574/0x758
[ 860.180000] [<800135e4>] r4k_wait_irqoff+0x0/0x20
[ 860.190000]
[ 860.190000] rcu_sched kthread starved for 6017 jiffies! g13501 c13500 f0x0 s3 ->state=0x1
I hope it contains useful information for tracking down this bug. If there is anything else I can supply or test in order to help the debugging process, please let me know.
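(Since the hardlock usually prevents logs from reaching flash, one option for capturing future traces is netconsole, which streams kernel messages over UDP to another machine. A sketch; the addresses and interface name are assumptions for a typical LAN:)

```shell
# netconsole parameter syntax:
#   netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
# 192.168.1.1 is the router, 192.168.1.2 the machine collecting logs
# (both assumptions); omitting the target MAC falls back to broadcast.
modprobe netconsole netconsole=6666@192.168.1.1/eth0,6666@192.168.1.2/

# On the collecting machine, capture the stream with netcat:
#   nc -l -u 6666 | tee netconsole.log
```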