OpenWrt/LEDE Project

  • Status Unconfirmed
  • Percent Complete
    0%
  • Task Type Bug Report
  • Category Base system
  • Assigned To No-one
  • Operating System All
  • Severity Critical
  • Priority Very Low
  • Reported Version openwrt-21.02
  • Due in Version Undecided
  • Due Date Undecided
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Steven Johnson - 29.04.2021

FS#3760 - Wifi to WAN traffic causes eth0 to crash with a NETDEV WATCHDOG

I have been testing OpenWRT 21.02-RC1 stock released firmware image on the Newifi-D2 target (MT7621 based).

Sporadically while using the router, i experience the WAN interface crashing with NETDEV WATCHDOG errors.

I am using the released image for OpenWrt 21.02-rc1 for the Newifi-D2 target
I have also experienced this fault with a custom router build based on MT7621 for a different target, the fault appears general to MT7621 not this target.

I have loaded 19.07.7 stock firmware on this same hardware and the fault does not occur in that firmware version. A colleague has replicated this fault in 21.02-RC1 with the same type of target (physically different unit), but in a different environment with no shared network resources.

Other than setting up the wifi ssid and password and enabling the 5ghz radio, the network settings are stock.

I have worked out a 90% reliable test case to reproduce the fault:

Have a client device connected to the router on WiFi 5ghz
And a local Iperf3 server running on the wan side of the wired interface.

On the server run:

iperf3 -s

Note the servers IP address

On the client run:

iperf3 -c <serverIP> -P 50

Notes:
Iperf3 server needs to be local on a 1gbps wired link. With an internet based Iperf3 server (or one with a slower connection) the error is much harder to trigger, it will happen but its much more random.
The number of connections does not need to be 50, however, the lower the number of connections the less reliable the error triggers.
The WiFi client should be getting about 300-400mbps throughput over the wifi link when its working correctly.

On most runs, the iperf3 performance will drop to 0mbps, logread on the router then reveals:

Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.065748] ------------[ cut here ]------------
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.070403] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:448 0x8047e780
Thu Apr 29 03:58:44 2021 kern.info kernel: [  578.077443] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.084412] Modules linked in: pppoe ppp_async iptable_nat xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD xt_CT pppox ppp_generic nf_nat nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG slhc nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables crc_ccitt compat ledtrig_usbport nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.147503] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.4.111 #0
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.153490] Stack : 00000000 80840000 ffffffff 8007d6e0 00000000 00000000 00000000 00000000
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.161831]         00000000 00000000 00000000 00000000 00000000 00000001 8fc0fd50 8ba64c1c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.170163]         8fc0fde8 00000000 00000000 00000000 00000038 805e1804 342e3520 3131312e
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.178486]         00000000 0000001c 00000000 0002402f 00000000 8fc0fd30 00000000 8047e780
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.186808]         00000009 00000001 00200000 00000122 00000003 80359e2c 00000004 80810004
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.195132]         ...
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.197567] Call Trace:
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.197583] [<8007d6e0>] 0x8007d6e0
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.203512] [<805e1804>] 0x805e1804
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.206990] [<8047e780>] 0x8047e780
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.210464] [<80359e2c>] 0x80359e2c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.213941] [<8000b05c>] 0x8000b05c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.217409] [<8000b064>] 0x8000b064
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.220879] [<805c6f9c>] 0x805c6f9c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.224351] [<8007d8ac>] 0x8007d8ac
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.227829] [<8002bfe8>] 0x8002bfe8
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.231302] [<8047e780>] 0x8047e780
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.234775] [<8002c0c0>] 0x8002c0c0
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.238257] [<8047e780>] 0x8047e780
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.241731] [<800a9018>] 0x800a9018
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.245212] [<8047e484>] 0x8047e484
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.248688] [<800965d4>] 0x800965d4
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.252161] [<8009681c>] 0x8009681c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.255640] [<805e7d1c>] 0x805e7d1c
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.259116] [<80030768>] 0x80030768
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.262588] [<802f8404>] 0x802f8404
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.266064] [<80006c28>] 0x80006c28
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.269533]
Thu Apr 29 03:58:44 2021 kern.warn kernel: [  578.271153] ---[ end trace eabfb050c5985bce ]---
Thu Apr 29 03:58:44 2021 kern.err kernel: [  578.275799] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
Thu Apr 29 03:58:44 2021 kern.info kernel: [  578.282602] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
Thu Apr 29 03:58:44 2021 kern.info kernel: [  578.318310] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
Thu Apr 29 03:58:44 2021 kern.info kernel: [  578.326251] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx

This only occurs on the first instance of the error, the error will repeatedly occur but the only evidence in the logs after the first error is:

Thu Apr 29 04:33:36 2021 kern.err kernel: [ 2670.043971] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
Thu Apr 29 04:33:36 2021 kern.info kernel: [ 2670.050555] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
Thu Apr 29 04:33:36 2021 kern.info kernel: [ 2670.084619] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
Thu Apr 29 04:33:36 2021 kern.info kernel: [ 2670.092556] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
Steven Johnson commented on 01.05.2021 02:48

I just build from master with a 5.10.32 kernel and the problem persists:

Sat May  1 02:38:18 2021 kern.warn kernel: [  304.135261] ------------[ cut here ]------------
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.139934] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:443 dev_watchdog+0x2fc/0x304
Sat May  1 02:38:18 2021 kern.info kernel: [  304.148205] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.155132] Modules linked in: pppoe ppp_async iptable_nat xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD pppox ppp_generic nf_nat nf_flow_table nf_conntrack mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG slhc nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables crc_ccitt compat ledtrig_usbport nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.215126] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.32 #0
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.221147] Stack : 8072aa08 ffffffff 80830000 80082648 80830000 8072b958 00000000 00000000
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.229518]         8140bdac 80980000 807bf2a8 807bedc7 806fa7cc 00000001 8140bd50 8fe23c7d
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.237866]         00000000 00000000 806fa7cc 8140bbf0 ffffefff 80828764 00000000 00000000
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.246212]         00000000 0000000e 00000000 00034856 00000000 80830000 00000000 80700000
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.254556]         00000009 ffffffff 00000122 ffffffff 00000018 803db974 00000000 80980000
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.262899]         ...
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.265343] Call Trace:
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.267810] [<80008108>] show_stack+0x30/0x100
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.272251] [<803606e0>] dump_stack+0xa4/0xe0
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.276643] [<8002d958>] __warn+0xc0/0xe8
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.280646] [<8002da0c>] warn_slowpath_fmt+0x8c/0xac
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.285605] [<8050ad44>] dev_watchdog+0x2fc/0x304
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.290317] [<8009ff34>] call_timer_fn.isra.30+0x20/0x90
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.295607] [<800a08f4>] run_timer_softirq+0x220/0x4c8
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.300740] [<80652e4c>] __do_softirq+0x16c/0x334
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.305426] [<800325a4>] irq_exit+0xc4/0xec
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.309592] [<8037bc7c>] plat_irq_dispatch+0x64/0x104
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.314621] [<80003548>] except_vec_vi_end+0xb8/0xc4
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.319576] [<8047b9c8>] cpuidle_enter_state_coupled+0x3b0/0x500
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.325558] [<80479214>] cpuidle_enter+0x54/0xac
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.330182] [<8005d0d8>] do_idle+0x25c/0x2cc
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.334432] [<8005d3c0>] cpu_startup_entry+0x2c/0x34
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.339383] [<80844d68>] start_kernel+0x5bc/0x5e4
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.344068]
Sat May  1 02:38:18 2021 kern.warn kernel: [  304.345968] ---[ end trace 67930318eb59ae3a ]---
Sat May  1 02:38:18 2021 kern.err kernel: [  304.350623] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
Sat May  1 02:38:18 2021 kern.info kernel: [  304.357732] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
Kristian Evensen commented on 11.05.2021 07:41

I spent some time trying to reproduce the error at hand. The transmit timeout issue has been bothering the mt7531 driver "forever", there are plenty of examples of the issue affecting both 19.07 and 18.06. When testing your setup, I was able to trigger the timeout relatively easy.

However, when I tested the timeout only happened every now and then (and not repeatedly). While the ideal would be to fix the problem once and for all, I will argue that the behavior of the upstream driver is better than that of the downstream driver used in for example 19.07. At least during my testing, the automatic restart of the switch was always successful and iperf3 would resume after ~5 seconds. With the downstream driver, the restart had to triggered from user space.

Steven Johnson commented on 11.05.2021 09:49

I agree the restart is certainly more friendly. I am concerned about all the connections that are established at the time being dropped.

My company is prepared to pay a developer to fix the issue once and for all. So if there is anyone willing and capable of doing it, please contact me.

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing