OpenWrt/LEDE Project

  • Status Unconfirmed
  • Percent Complete
    0%
  • Task Type Bug Report
  • Category Base system
  • Assigned To No-one
  • Operating System All
  • Severity Critical
  • Priority Very Low
  • Reported Version Trunk
  • Due in Version Undecided
  • Due Date Undecided
  • Votes 1
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by builker - 02.04.2019
Last edited by Petr Štetiar - 02.04.2019

FS#2218 - Linksys EA4500 suffers endless reboots: Radio1 (5GHz) crashes in mac80211

One of 5 EA3500’s, each at a different site, keeps crashing and restarting. Both 18.06.2 and Snapshot r9614-b614954 are suffering it at this particular site. I could finally keep it booting up after disabling radio1 (5 GHz). Other 4 sites have no trouble with the same system and wireless configuration.

I can pretty much rule out hardware problems. Just boot up Snapshot r9614-b614954 on an EA4500, whose radio1 (5G) worked on another site. It suffers the same infinite crash/restart sequence. At least 3 different devices worked at different locations, but crash only at this one.

From the serial console: A crash at backports-4.19.23-1/net/mac80211/rx.c:4516 ieee80211_rx_napi+0x1fc/0xa54 [mac80211]

[   22.960919] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   23.010664] br-lan: port 1(eth0.1) entered blocking state
[   23.016172] br-lan: port 1(eth0.1) entered disabled state
[   23.021848] device eth0.1 entered promiscuous mode
[   23.026687] device eth0 entered promiscuous mode
[   23.110502] br-lan: port 1(eth0.1) entered blocking state
[   23.115976] br-lan: port 1(eth0.1) entered forwarding state
[   23.121730] IPv6: ADDRCONF(NETDEV_UP): br-lan: link is not ready
[   23.191580] mv643xx_eth_port mv643xx_eth_port.1 eth1: link up, 1000 Mb/s, full duplex, flow control disabled
[   23.202650] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[   24.009012] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   24.015903] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
[   24.022546] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   25.052979] 0000:01:00.0: unable to load firmware helper image
[   25.058872] ieee80211 phy0: Cannot start firmware
[   25.063675] ieee80211 phy0: Trying to reload the firmware again
[   25.987785] ieee80211 phy0: 88w8366 v7, 20aa4b891bf2, AP firmware 5.2.8.17
[   26.047574] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   26.113947] 0000:02:00.0: unable to load firmware helper image
[   26.119821] ieee80211 phy1: Cannot start firmware
[   26.124636] ieee80211 phy1: Trying to reload the firmware again
[   26.511780] ieee80211 phy1: 88w8366 v7, 20aa4b891bf4, AP firmware 5.2.8.17
[   26.565994] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
[   26.577344] br-lan: port 2(wlan0) entered blocking state
[   26.582745] br-lan: port 2(wlan0) entered disabled state
[   26.588309] device wlan0 entered promiscuous mode
[   26.614795] br-lan: port 3(wlan1) entered blocking state
[   26.620186] br-lan: port 3(wlan1) entered disabled state
[   26.625800] device wlan1 entered promiscuous mode
[   27.137261] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[   27.143826] br-lan: port 2(wlan0) entered blocking state
[   27.149170] br-lan: port 2(wlan0) entered forwarding state
[   27.179154] ------------[ cut here ]------------
[   27.183961] WARNING: CPU: 0 PID: 0 at backports-4.19.23-1/net/mac80211/rx.c:4516 ieee80211_rx_napi+0x1fc/0xa54 [mac80211]
[   27.194991] Modules linked in: pppoe ppp_async pppox ppp_generic nf_conntrack_ipv6 mwl8k mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 tun gpio_button_hotplug
[   27.254721] CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.105 #0
[   27.260669] Hardware name: Marvell Kirkwood (Flattened Device Tree)
[   27.266969] Backtrace: 
[   27.269438] [<c010530c>] (dump_backtrace) from [<c01055f4>] (show_stack+0x18/0x1c)
[   27.277057]  r7:bf1f012c r6:00000000 r5:bf227600 r4:00000000
[   27.282772] [<c01055dc>] (show_stack) from [<c058b880>] (dump_stack+0x20/0x28)
[   27.290036] [<c058b860>] (dump_stack) from [<c01196b0>] (__warn+0xdc/0x108)
[   27.297048] [<c01195d4>] (__warn) from [<c0119794>] (warn_slowpath_null+0x28/0x30)
[   27.304665]  r9:c7104ff0 r8:bf226390 r7:bf226318 r6:c71058a0 r5:c7104bc0 r4:c72be9c0
[   27.312535] [<c011976c>] (warn_slowpath_null) from [<bf1f012c>] (ieee80211_rx_napi+0x1fc/0xa54 [mac80211])
[   27.322376] [<bf1eff30>] (ieee80211_rx_napi [mac80211]) from [<bf1cb68c>] (ieee80211_tasklet_handler+0x64/0xc0 [mac80211])
[   27.333486]  r10:00000000 r9:c7104ff0 r8:bf226390 r7:bf226318 r6:c7104bc0 r5:c7104fe4
[   27.341359]  r4:c72be9c0
[   27.343976] [<bf1cb628>] (ieee80211_tasklet_handler [mac80211]) from [<c011c5dc>] (tasklet_action+0x88/0xd8)
[   27.353865]  r9:00000100 r8:ffffe000 r7:00000000 r6:c082a000 r5:c08085fc r4:00000000
[   27.361664] [<c011c554>] (tasklet_action) from [<c0101464>] (__do_softirq+0xac/0x25c)
[   27.369534]  r7:c082a020 r6:40000006 r5:c082a038 r4:00000006
[   27.375236] [<c01013b8>] (__do_softirq) from [<c011c940>] (irq_exit+0xc8/0x110)
[   27.382594]  r10:00000000 r9:c0801f00 r8:c7805200 r7:00000001 r6:00000000 r5:c0829178
[   27.390458]  r4:00000000
[   27.393022] [<c011c878>] (irq_exit) from [<c01465e4>] (__handle_domain_irq+0x8c/0xa8)
[   27.400893] [<c0146558>] (__handle_domain_irq) from [<c010138c>] (orion_handle_irq+0x74/0xa0)
[   27.409469]  r9:c0801f00 r8:00000001 r7:c085564c r6:c780801c r5:00000400 r4:0000000a
[   27.417263] [<c0101318>] (orion_handle_irq) from [<c01060c8>] (__irq_svc+0x68/0x84)
[   27.424968] Exception stack(0xc0801f00 to 0xc0801f48)
[   27.430044] 1f00: 00000000 00000000 00000000 60000013 00000000 ffffe000 c0803094 c08206e0
[   27.438271] 1f20: c080a852 c065e3e0 00000000 c0801f5c c0801f50 c0801f50 c0102ec0 c05a6814
[   27.446494] 1f40: 60000013 ffffffff
[   27.450004]  r10:00000000 r9:c0800000 r8:c080a852 r7:c0801f34 r6:ffffffff r5:60000013
[   27.457875]  r4:c05a6814
[   27.460441] [<c05a67dc>] (default_idle_call) from [<c013fdb8>] (do_idle+0x84/0x144)
[   27.468154] [<c013fd34>] (do_idle) from [<c01400d8>] (cpu_startup_entry+0x14/0x18)
[   27.475774]  r10:00721c14 r9:c0723a20 r8:c7ffcc80 r7:00000000 r6:c0803020 r5:ffffffff
[   27.483654]  r4:c0809c4c r3:40000013
[   27.487258] [<c01400c4>] (cpu_startup_entry) from [<c05a1660>] (rest_init+0x74/0x94)
[   27.495063] [<c05a15ec>] (rest_init) from [<c0700d3c>] (start_kernel+0x3a0/0x424)
[   27.502595]  r5:ffffffff r4:c08299a0
[   27.506188] [<c070099c>] (start_kernel) from [<00008048>] (0x8048)
[   27.512404] ---[ end trace 351d5fe97981d9fa ]---
builker commented on 02.04.2019 04:09
tapper82 commented on 02.04.2019 06:53

I would like to see this fixt. These routers don't get mutch love.
https://bugs.openwrt.org/index.php?do=details&task_id=505 Because of the posted bug I could not even get openwrt on to my friends ea4500

Admin
Petr Štetiar commented on 02.04.2019 07:41
[   27.183961] WARNING: CPU: 0 PID: 0 at backports-4.19.23-1/net/mac80211/rx.c:4516 ieee80211_rx_napi+0x1fc/0xa54 [mac80211]

It looks like a warning to me, not a crash which would reboot the router. I can't locate rx.c:4516 in source code since you're using "quite dated" snapshot with 4.14.105 kernel and we're on 4.14.109 now, please try to update and post the warning again.

Anyway, if the router crashed later, it would be probably helpful if you could connect serial console to the router and provide complete log as seen there. Otherwise I'm afraid we're flying blind.

builker commented on 02.04.2019 22:33

Thank you, Petr, for looking at it. You're right that was only a warning. Please find the complete console log (attached EA4500 crash console.log), albeit from the dated snapshot with 4.14.105 kernel built on March 14. Forget to bring my serial cable today. I will try get a log based on a more recent snapshot later.

In the complete log, there are 54 occurrences of the warning before crashing and rebooting. Run out memory? After the reboot, /sys/kernel/debug/crashlog cannot be found. Ordinarily, on the EA4500 I could trigger a crash with "echo c > /proc/sysrq-trigger" and found the crashlog.

builker commented on 03.04.2019 03:33

I built Snapshot r9614-b614954 and obtained (attached here) build_dir/target-arm_xscale_musl_eabi/linux-kirkwood/backports-4.19.23-1/net/mac80211/rx.c. rx.c:4516 ieee80211_rx_napi+0x1fc/0xa54 points to the following WARN_ON

    case RX_ENC_LEGACY:
      if (WARN_ON(status->rate_idx >= sband->n_bitrates))
   rx.c (126.3 KiB)
Admin
Petr Štetiar commented on 03.04.2019 16:04

Thanks, that's what I've guessed. I expect off-by-one in the mwl8k driver. Please try following patch and let me know if it has changed anything.

builker commented on 04.04.2019 01:48

Thank you, Petr. I'm building today's trunk (b3d8b3a) with your 999-DEBUG-net-mac80211-mwl8k-debugging.patch. Will report back after I test it on the troubling site, probably over the weekend.

builker commented on 06.04.2019 22:33

Hello Petr,

Please find the attached log from a test run with your debugging patch. It was produced by issuing the following command:

root@OpenWrt:~# wifi down; sleep 4; wifi up; sleep 4; logread > /tmp/logread-patch.log

I removed a bunch of repeated line at the beginning

Sat Apr  6 14:58:03 2019 kern.info kernel: [ 1931.235657] got rate_idx: 251 (0xfb) fixing it to 7 (0x7)

Thank you for looking into it.

builker commented on 07.04.2019 00:24

I notice the default (generated) /etc/config/wireless does not trip over the problem. I applied its settings back to my wireless. It appears channel 44 or 48 gets into trouble every time.

--- a/etc/config/wireless
+++ b/etc/config/wireless
@@ -21,7 +21,7 @@ config wifi-iface 'default_radio0'
 
 config wifi-device 'radio1'
        option type 'mac80211'
-       option channel '44'
+       option channel '36'
        option htmode 'HT40'
        option hwmode '11a'
        option path 'mbus@f1000000/mbus@f1000000:pcie@82000000/pci0000:00/0000:00:02.0/0000:02:00.0'

Channel 36 or 40 does not trigger the message "kern.info kernel: [ 1931.235657] got rate_idx: 251 (0xfb)". 'auto' always chooses 36, very smart. However, if I do in the LuCI "Join Network: Wireless Scan" (http://192.168.1.1/cgi-bin/luci/admin/network/wireless_join), the kern.info message shows up 3 times every ~12 seconds.

Admin
Petr Štetiar commented on 07.04.2019 07:28

Thanks, I'll look into it later, but I'm wondering if you're still seeing the endless reboots or they've dissappeared as well.

builker commented on 08.04.2019 03:57

With pr_info, the router doesn't crash and reboot. The original crash was probably caused by rapidly repeated warnings with released settings and neighborly (troubling) APs. Folks have been working around the problem, which may not be right, still costing extra CPU cycles or resources. The WARN_ON is there for a good reason: The bug may be in the lower driver?

Admin
Petr Štetiar commented on 08.04.2019 08:28

Please can you try following fix? If the proposed patch fixes your issue, could you please provide me with the following details as I would like to send the patch upstream:

builker commented on 08.04.2019 16:32

Thank you, Petr. I'm building it with today's trunk (8f1cbef1). Will test it in 2 days at the troubling site.

builker commented on 10.04.2019 04:18

Petr, with your fix patch, EA3500 booted up, stayed up, and radio1 (5 GHz) on channel 44 or 48 accepted clients. With Snapshot r9614-b614954, it went to endless reboots at the same site.

However, the kernel warning (mac80211/rx.c:4469 ieee80211_rx_napi+0x90/0xa4c) happened once after one reboot. Please see the attached log. I rebooted 5~6 times and couldn't reproduce it. Some initialization code still sets rate_idx to an erroneous value by chance?

Admin
Petr Štetiar commented on 10.04.2019 07:46

Thanks for testing, could you please give me your real name and email so I can add it to Tested-by/Reported-by tags and send it upstream?

That other warning is another issue with the driver as it happens in https://elixir.bootlin.com/linux/v4.19.34/source/net/mac80211/rx.c#L4453.

I'll wait for your details for Tested-by/Reported-by tags, then I'm going to send it for review to the upstream kernel and once I get some feedback I'll push it to OpenWrt tree. Then I'm going to close this issue as resolved.

If you would like to get the new warning fixed, I can help with that as well, but you'll need to wait until I find some time for this. BTW you can try to report this issue upstream as well by yourself, since it does look like a problem in upstream driver so you might get it fixed faster.

builker commented on 11.04.2019 16:46

Thank you, Petr, for sending the fix upstream. Please see your true.cz email and briefly acknowledge it if you receive it.

Thanks discovering a different issue. I appreciate it when you have time looking into it in the upstream driver. I have no idea what sband == 0 means. By the way, where to report the issue upstream, https://bugzilla.kernel.org?

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing