Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FS#606 - Stall on Mediatek #5933

Closed
openwrt-bot opened this issue Mar 8, 2017 · 17 comments
Closed

FS#606 - Stall on Mediatek #5933

openwrt-bot opened this issue Mar 8, 2017 · 17 comments
Labels

Comments

@openwrt-bot
Copy link

karesch:

On D-Link DIR-860L B1, processor MediaTek MT7621
LEDE 17.01.0

The system used connected to WAN and wifi is used (ac and n). WDS is also used (as server) and SQM is used on WAN. Only the WAN switch port is used.

After random time wifi is disconnected and connected again, can be seen in the log as: WARNING: CPU: 1 PID: 0 at net/core/skbuff.c:4194. However, after some time, which can be minutes or days wifi totally disconnects, and even DHCP does nor work on the LAN ports. The log shows: INFO: rcu_sched self-detected stall on CPU. Only a reboot can help. Sometimes the device reboots by itself.

The above could not be linked to any event (wifi connection, change in sqm settings, etc).

Switched back to Openwrt 15.05.1.

A similar issue was reported here: [[https://forum.lede-project.org/t/build-for-the-d-link-dir-860l/948|External Link]]

@openwrt-bot
Copy link
Author

karesch:

Similar issue reported also here: [[https://forum.lede-project.org/t/lede-v17-01-0-rc1/1285/11|External Link]], relayd, rp_filter is suspected there

@openwrt-bot
Copy link
Author

nbd:

Please make a build with CONFIG_KERNEL_KALLSYMS enabled and post a fresh log

@openwrt-bot
Copy link
Author

karesch:

Hi Felix,

I would love to do that, but, unfortunately, making new builds is way beyond my abilities. I'm just a user. Sorry for that.

@openwrt-bot
Copy link
Author

rogerpueyo:

Hi Karoly,

I've compiled a sysupgrade image for you with CONFIG_KERNEL_KALLSYMS. Please find it at any of the following links:

@openwrt-bot
Copy link
Author

karesch:

Hi Roger,

You were quick, I'm impressed. I'll do the testing this evening and will come back to you.

Thanks a lot,

@openwrt-bot
Copy link
Author

karesch:

This is the 1st log, I'll submit further bugs:
[ 33.520000] WARNING: CPU: 0 PID: 0 at net/core/skbuff.c:4194 skb_try_coalesce+0x22c/0x36c()
[ 33.530000] Modules linked in: pppoe ppp_async iptable_nat pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CLASSIFY slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt sch_cake nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_tbf sch_htb sch_hfsc sch_ingress mt76x2e mt7603e ledtrig_usbport mt76 mac80211 cfg80211 compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables ifb leds_gpio xhci_mtk xhci_plat_hcd xhci_pci xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[ 33.710000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52 #0
[ 33.710000] Stack : 00000000 00000000 804c6862 00000033 00000000 00000000 80470000 804e0000
[ 33.710000] 8046a064 80469c83 803e4c74 00000000 00000000 804c367c 804837a0 804fade0
[ 33.710000] 80466880 80063f7c 80470000 804e0000 8046e198 8046e19c 803e9588 87c0dbdc
[ 33.710000] 00000003 80061cc8 804837a0 804fade0 80466880 00000000 00000000 00c0dbdc
[ 33.710000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 33.710000] ...
[ 33.710000] Call Trace:
[ 33.710000] [<8001684c>] show_stack+0x50/0x84
[ 33.710000] [<801b4068>] dump_stack+0x84/0xbc
[ 33.710000] [<8002bf50>] warn_slowpath_common+0xa0/0xd0
[ 33.710000] [<8002c008>] warn_slowpath_null+0x18/0x24
[ 33.710000] [<80290988>] skb_try_coalesce+0x22c/0x36c
[ 33.710000] [<802ecf64>] tcp_try_coalesce+0x4c/0xd0
[ 33.710000] [<802ed048>] tcp_queue_rcv+0x60/0x12c
[ 33.710000] [<802f1d28>] tcp_rcv_established+0x4e4/0x61c
[ 33.710000] [<802f9bf8>] tcp_v4_do_rcv+0xa0/0x1f8
[ 33.710000] [<802fc104>] tcp_v4_rcv+0x6c0/0xa28
[ 33.710000] [<802d941c>] ip_local_deliver_finish+0xd4/0x134
[ 33.710000] [<802d9988>] ip_local_deliver+0xd0/0xe8
[ 33.710000] [<802d9c54>] ip_rcv+0x2b4/0x320
[ 33.710000] [<802a15dc>] __netif_receive_skb_core+0x6f0/0x914
[ 33.710000] [<802a31e8>] process_backlog+0xac/0x184
[ 33.710000] [<802a2f8c>] net_rx_action+0x130/0x2e0
[ 33.710000] [<8002eb64>] __do_softirq+0x294/0x2e0
[ 33.710000] [<8002ee4c>] irq_exit+0x78/0x94
[ 33.710000] [<801de42c>] plat_irq_dispatch+0xb4/0xdc
[ 33.710000]
[ 33.980000] ---[ end trace f57fba055f35649b ]---

@openwrt-bot
Copy link
Author

karesch:

[ 52.700000] WARNING: CPU: 3 PID: 1980 at net/core/skbuff.c:4194 skb_try_coalesce+0x22c/0x36c()
[ 52.720000] Modules linked in: pppoe ppp_async iptable_nat pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CLASSIFY slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt sch_cake nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_tbf sch_htb sch_hfsc sch_ingress mt76x2e mt7603e ledtrig_usbport mt76 mac80211 cfg80211 compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables ifb leds_gpio xhci_mtk xhci_plat_hcd xhci_pci xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[ 52.890000] CPU: 3 PID: 1980 Comm: uhttpd Not tainted 4.4.52 #0
[ 52.910000] Stack : 00000000 00000000 804c6862 00000033 00000000 00000000 80470000 804e0000
[ 52.910000] 877918ec 80469c83 803e4c74 00000003 000007bc 804c367c 804837a0 804fade0
[ 52.910000] 80466880 80063f7c 80470000 804e0000 8046e198 8046e19c 803e9588 87c21bdc
[ 52.910000] 00000003 80061cc8 804837a0 804fade0 80466880 00000000 00000000 00c21bdc
[ 52.910000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 52.910000] ...
[ 52.980000] Call Trace:
[ 52.980000] [<8001684c>] show_stack+0x50/0x84
[ 52.990000] [<801b4068>] dump_stack+0x84/0xbc
[ 53.000000] [<8002bf50>] warn_slowpath_common+0xa0/0xd0
[ 53.010000] [<8002c008>] warn_slowpath_null+0x18/0x24
[ 53.020000] [<80290988>] skb_try_coalesce+0x22c/0x36c
[ 53.030000] [<802ecf64>] tcp_try_coalesce+0x4c/0xd0
[ 53.040000] [<802ed048>] tcp_queue_rcv+0x60/0x12c
[ 53.050000] [<802f1d28>] tcp_rcv_established+0x4e4/0x61c
[ 53.060000] [<802f9bf8>] tcp_v4_do_rcv+0xa0/0x1f8
[ 53.070000] [<802fc104>] tcp_v4_rcv+0x6c0/0xa28
[ 53.080000] [<802d941c>] ip_local_deliver_finish+0xd4/0x134
[ 53.090000] [<802d9988>] ip_local_deliver+0xd0/0xe8
[ 53.100000] [<802d9c54>] ip_rcv+0x2b4/0x320
[ 53.110000] [<802a15dc>] __netif_receive_skb_core+0x6f0/0x914
[ 53.120000] [<802a31e8>] process_backlog+0xac/0x184
[ 53.130000] [<802a2f8c>] net_rx_action+0x130/0x2e0
[ 53.140000] [<8002eb64>] __do_softirq+0x294/0x2e0
[ 53.150000] [<8002ee4c>] irq_exit+0x78/0x94
[ 53.160000] [<801de42c>] plat_irq_dispatch+0xb4/0xdc
[ 53.160000]
[ 53.170000] ---[ end trace cea7224c49ce0bb3 ]---

@openwrt-bot
Copy link
Author

karesch:

under heavy load, the system crashes and reboots

@openwrt-bot
Copy link
Author

karesch:

it's still the same. I have the skbuff problem, what you can see above, just 1-2 minutes after boot. Under load (iperf3 or heavy torrent) the router crashes and reboots, I have no chance to look at the DMESG or syslog.

@openwrt-bot
Copy link
Author

bjonglez:

Karoly, you can try looking at /sys/kernel/debug/crashlog after a crash-reboot, it might hold an interesting stack trace.

@openwrt-bot
Copy link
Author

karesch:

Hi Baptiste,

Thanks for the tip, I didn't know that.

I checked the directory. Unfortunately, there was no //crashlog// file in the ///sys/kernel/debug/// directory. I waited for 3 crashes, no results. I also did //echo c >/proc/sysrq-trigger// to check, whether crashlog settings are OK. This method generated a crash file, but the bug unfortunately, not.

Do you have any other idea to get a crashlog file?

Thanks,

@openwrt-bot
Copy link
Author

karesch:

Hi Baptiste,

I could copy-paste this:
[ 329.570000] WARNING: CPU: 3 PID: 0 at net/core/skbuff.c:4194 skb_try_coalesce+0x22c/0x36c()
[ 329.580000] Modules linked in: pppoe ppp_async iptable_nat pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CLASSIFY slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt sch_cake nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_tbf sch_htb sch_hfsc sch_ingress mt76x2e mt7603e ledtrig_usbport mt76 mac80211 cfg80211 compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables ifb leds_gpio xhci_mtk xhci_plat_hcd xhci_pci xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[ 329.760000] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.52 #0
[ 329.770000] Stack : 00000000 00000000 804c6862 00000033 00000000 00000000 80470000 804e0000
[ 329.770000] 87c4bf6c 80469c83 803e4c74 00000003 00000000 804c367c 804837a0 804fade0
[ 329.770000] 80466880 80063f7c 80470000 804e0000 8046e198 8046e19c 803e9588 87c21bdc
[ 329.770000] 00000003 80061cc8 804837a0 804fade0 80466880 00000000 00000000 00c21bdc
[ 329.770000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 329.770000] ...
[ 329.840000] Call Trace:
[ 329.850000] [<8001684c>] show_stack+0x50/0x84
[ 329.860000] [<801b4068>] dump_stack+0x84/0xbc
[ 329.860000] [<8002bf50>] warn_slowpath_common+0xa0/0xd0
[ 329.880000] [<8002c008>] warn_slowpath_null+0x18/0x24
[ 329.890000] [<80290988>] skb_try_coalesce+0x22c/0x36c
[ 329.900000] [<802ecf64>] tcp_try_coalesce+0x4c/0xd0
[ 329.910000] [<802ed048>] tcp_queue_rcv+0x60/0x12c
[ 329.910000] [<802f1d28>] tcp_rcv_established+0x4e4/0x61c
[ 329.930000] [<802f9bf8>] tcp_v4_do_rcv+0xa0/0x1f8
[ 329.930000] [<802fc104>] tcp_v4_rcv+0x6c0/0xa28
[ 329.940000] [<802d941c>] ip_local_deliver_finish+0xd4/0x134
[ 329.950000] [<802d9988>] ip_local_deliver+0xd0/0xe8
[ 329.960000] [<802d9c54>] ip_rcv+0x2b4/0x320
[ 329.970000] [<802a15dc>] __netif_receive_skb_core+0x6f0/0x914
[ 329.980000] [<802a31e8>] process_backlog+0xac/0x184
[ 329.990000] [<802a2f8c>] net_rx_action+0x130/0x2e0
[ 330.000000] [<8002eb64>] __do_softirq+0x294/0x2e0
[ 330.010000] [<8002ee4c>] irq_exit+0x78/0x94
[ 330.020000] [<801de42c>] plat_irq_dispatch+0xb4/0xdc
[ 330.030000]
[ 330.030000] ---[ end trace 8bd5e0303178b765 ]---
[ 1137.390000] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1137.400000] 2-...: (0 ticks this GP) idle=5f8/0/0 softirq=76940/76940 fqs=0
[ 1137.410000] (detected by 3, t=6004 jiffies, g=32609, c=32608, q=4661)
[ 1137.420000] Task dump for CPU 2:
[ 1137.430000] swapper/2 R running 0 0 1 0x00100000
[ 1137.440000] Stack : 00000000 001c5bc4 00000108 05080000 00000000 18c45114 804b72a4 80460000
[ 1137.440000] 8046675c 00000001 00000000 804665e0 80466724 80460000 00000000 800134fc
[ 1137.440000] 1100fc03 00000002 87c72000 87c73ec0 80460000 8005dc48 1100fc03 00000002
[ 1137.440000] 00000000 80460000 804b72a4 8005dc40 80460000 8001b0c8 1100fc03 00000000
[ 1137.440000] 00000004 804664a0 000000a0 8001b0d0 00000000 00000000 00000000 00000000
[ 1137.440000] ...
[ 1137.510000] Call Trace:
[ 1137.520000] [<8000bdb0>] __schedule+0x574/0x758
[ 1137.530000] [<800134fc>] r4k_wait_irqoff+0x0/0x20
[ 1137.540000]
[ 1137.540000] rcu_sched kthread starved for 6017 jiffies! g32609 c32608 f0x0 s3 ->state=0x1

@openwrt-bot
Copy link
Author

karesch:

[ 1317.440000] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1317.450000] 2-...: (0 ticks this GP) idle=997/1/0 softirq=76940/76940 fqs=1
[ 1317.460000] (detected by 1, t=24009 jiffies, g=32609, c=32608, q=19485)
[ 1317.470000] Task dump for CPU 2:
[ 1317.480000] swapper/2 R running 0 0 1 0x00100000
[ 1317.490000] Stack : 00000000 00000001 06f85030 05080000 3fff0000 00000000 804b72a4 80460000
[ 1317.490000] 8046675c 00000001 00000000 804665e0 80466724 80460000 40000000 800134fc
[ 1317.490000] 1100fc03 00000002 87c72000 87c73ec0 80460000 8005dc48 1100fc03 00000002
[ 1317.490000] 00000000 80460000 c0800400 80013514 80460000 8001b0c8 1100fc03 00000000
[ 1317.490000] 00000004 804664a0 000000a0 8001b0d0 00000000 00000000 00000000 00000000
[ 1317.490000] ...
[ 1317.560000] Call Trace:
[ 1317.570000] [<8000bdb0>] __schedule+0x574/0x758
[ 1317.580000] [<800134fc>] r4k_wait_irqoff+0x0/0x20
[ 1317.590000]
[ 1317.590000] rcu_sched kthread starved for 18003 jiffies! g32609 c32608 f0x0 s3 ->state=0x1
[ 1377.630000] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1377.640000] 2-...: (0 ticks this GP) idle=117/1/0 softirq=76940/76940 fqs=0
[ 1377.650000] (detected by 3, t=6004 jiffies, g=32610, c=32609, q=23765)
[ 1377.660000] Task dump for CPU 2:
[ 1377.670000] swapper/2 R running 0 0 1 0x00100000
[ 1377.680000] Stack : 00000000 00000001 05eaf910 05080000 3fff0000 00000000 804b72a4 80460000
[ 1377.680000] 8046675c 00000001 00000000 804665e0 80466724 80460000 40000000 800134fc
[ 1377.680000] 1100fc03 00000002 87c72000 87c73ec0 80460000 8005dc48 1100fc03 00000002
[ 1377.680000] 00000000 80460000 804b72a4 8005dc40 80460000 8001b0c8 1100fc03 00000000
[ 1377.680000] 00000004 804664a0 000000a0 8001b0d0 00000000 00000000 00000000 00000000
[ 1377.680000] ...
[ 1377.750000] Call Trace:
[ 1377.760000] [<8000bdb0>] __schedule+0x574/0x758
[ 1377.770000] [<800134fc>] r4k_wait_irqoff+0x0/0x20
[ 1377.780000]
[ 1377.780000] rcu_sched kthread starved for 6017 jiffies! g32610 c32609 f0x0 s3 ->state=0x1

@openwrt-bot
Copy link
Author

karesch:

there is still no crashlog, the above test was done on the version provided (//LEDE Reboot 17.01-SNAPSHOT r3276-4a405ac8f9 / LuCI lede-17.01 branch (git-17.063.59066-a5191ef)//) with SQM cake switched on on WAN. I attached the dmesg output.

@openwrt-bot
Copy link
Author

karesch:

Another interesting symptom was that on snapshot r3276 the ac wifi speed was smaller lower than on 17.01 release (80-100 Mbit/s versus 220-300 Mbit/s)

@openwrt-bot
Copy link
Author

karesch:

a few minutes after the above stalls the rooter rebooted, no crashlog found in /sys/kernel/debug/ directory:
root@LEDE:~# ls /sys/kernel/debug/ -l
drwxr-xr-x 12 root root 0 Jan 1 1970 bdi
-r--r--r-- 1 root root 0 Jan 1 1970 bootrom
drwxr-xr-x 4 root root 0 Jan 1 1970 clk
-rw-r--r-- 1 root root 0 Jan 1 1970 fault_around_bytes
-r--r--r-- 1 root root 0 Jan 1 1970 gpio
drwxr-xr-x 4 root root 0 Mar 21 00:36 ieee80211
drwxr-xr-x 2 root root 0 Jan 1 1970 mips
drwxr-xr-x 3 root root 0 Jan 1 1970 pinctrl
drwxr-xr-x 2 root root 0 Jan 1 1970 pm_qos
drwxr-xr-x 2 root root 0 Jan 1 1970 regmap
-r--r--r-- 1 root root 0 Jan 1 1970 sleep_time
drwxr-xr-x 2 root root 0 Jan 1 1970 ubi
drwxr-xr-x 2 root root 0 Jan 1 1970 ubifs
drwxr-xr-x 2 root root 0 Jan 1 1970 usb

@openwrt-bot
Copy link
Author

bjonglez:

This has been fixed, see FS#804

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant