Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FS#2218 - Linksys EA4500 suffers endless reboots: Radio1 (5GHz) crashes in mac80211 #7059

Open
openwrt-bot opened this issue Apr 2, 2019 · 17 comments
Labels

Comments

@openwrt-bot
Copy link

builker:

One of 5 EA3500's, each at a different site, keeps crashing and restarting. Both 18.06.2 and Snapshot r9614-b614954 are suffering it at this particular site. I could finally keep it booting up after disabling radio1 (5 GHz). Other 4 sites have no trouble with the same system and wireless configuration.

I can pretty much rule out hardware problems. Just boot up Snapshot r9614-b614954 on an EA4500, whose radio1 (5G) worked on another site. It suffers the same infinite crash/restart sequence. At least 3 different devices worked at different locations, but crash only at this one.

From the serial console: A crash at backports-4.19.23-1/net/mac80211/rx.c:4516 ieee80211_rx_napi+0x1fc/0xa54 [mac80211]

[ 22.960919] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 23.010664] br-lan: port 1(eth0.1) entered blocking state
[ 23.016172] br-lan: port 1(eth0.1) entered disabled state
[ 23.021848] device eth0.1 entered promiscuous mode
[ 23.026687] device eth0 entered promiscuous mode
[ 23.110502] br-lan: port 1(eth0.1) entered blocking state
[ 23.115976] br-lan: port 1(eth0.1) entered forwarding state
[ 23.121730] IPv6: ADDRCONF(NETDEV_UP): br-lan: link is not ready
[ 23.191580] mv643xx_eth_port mv643xx_eth_port.1 eth1: link up, 1000 Mb/s, full duplex, flow control disabled
[ 23.202650] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 24.009012] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 24.015903] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
[ 24.022546] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 25.052979] 0000:01:00.0: unable to load firmware helper image
[ 25.058872] ieee80211 phy0: Cannot start firmware
[ 25.063675] ieee80211 phy0: Trying to reload the firmware again
[ 25.987785] ieee80211 phy0: 88w8366 v7, 20aa4b891bf2, AP firmware 5.2.8.17
[ 26.047574] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 26.113947] 0000:02:00.0: unable to load firmware helper image
[ 26.119821] ieee80211 phy1: Cannot start firmware
[ 26.124636] ieee80211 phy1: Trying to reload the firmware again
[ 26.511780] ieee80211 phy1: 88w8366 v7, 20aa4b891bf4, AP firmware 5.2.8.17
[ 26.565994] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
[ 26.577344] br-lan: port 2(wlan0) entered blocking state
[ 26.582745] br-lan: port 2(wlan0) entered disabled state
[ 26.588309] device wlan0 entered promiscuous mode
[ 26.614795] br-lan: port 3(wlan1) entered blocking state
[ 26.620186] br-lan: port 3(wlan1) entered disabled state
[ 26.625800] device wlan1 entered promiscuous mode
[ 27.137261] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[ 27.143826] br-lan: port 2(wlan0) entered blocking state
[ 27.149170] br-lan: port 2(wlan0) entered forwarding state
[ 27.179154] ------------[ cut here ]------------
[ 27.183961] WARNING: CPU: 0 PID: 0 at backports-4.19.23-1/net/mac80211/rx.c:4516 ieee80211_rx_napi+0x1fc/0xa54 [mac80211]
[ 27.194991] Modules linked in: pppoe ppp_async pppox ppp_generic nf_conntrack_ipv6 mwl8k mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 tun gpio_button_hotplug
[ 27.254721] CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.105 #0
[ 27.260669] Hardware name: Marvell Kirkwood (Flattened Device Tree)
[ 27.266969] Backtrace:
[ 27.269438] [] (dump_backtrace) from [] (show_stack+0x18/0x1c)
[ 27.277057] r7:bf1f012c r6:00000000 r5:bf227600 r4:00000000
[ 27.282772] [] (show_stack) from [] (dump_stack+0x20/0x28)
[ 27.290036] [] (dump_stack) from [] (__warn+0xdc/0x108)
[ 27.297048] [] (__warn) from [] (warn_slowpath_null+0x28/0x30)
[ 27.304665] r9:c7104ff0 r8:bf226390 r7:bf226318 r6:c71058a0 r5:c7104bc0 r4:c72be9c0
[ 27.312535] [] (warn_slowpath_null) from [] (ieee80211_rx_napi+0x1fc/0xa54 [mac80211])
[ 27.322376] [] (ieee80211_rx_napi [mac80211]) from [] (ieee80211_tasklet_handler+0x64/0xc0 [mac80211])
[ 27.333486] r10:00000000 r9:c7104ff0 r8:bf226390 r7:bf226318 r6:c7104bc0 r5:c7104fe4
[ 27.341359] r4:c72be9c0
[ 27.343976] [] (ieee80211_tasklet_handler [mac80211]) from [] (tasklet_action+0x88/0xd8)
[ 27.353865] r9:00000100 r8:ffffe000 r7:00000000 r6:c082a000 r5:c08085fc r4:00000000
[ 27.361664] [] (tasklet_action) from [] (__do_softirq+0xac/0x25c)
[ 27.369534] r7:c082a020 r6:40000006 r5:c082a038 r4:00000006
[ 27.375236] [] (__do_softirq) from [] (irq_exit+0xc8/0x110)
[ 27.382594] r10:00000000 r9:c0801f00 r8:c7805200 r7:00000001 r6:00000000 r5:c0829178
[ 27.390458] r4:00000000
[ 27.393022] [] (irq_exit) from [] (__handle_domain_irq+0x8c/0xa8)
[ 27.400893] [] (__handle_domain_irq) from [] (orion_handle_irq+0x74/0xa0)
[ 27.409469] r9:c0801f00 r8:00000001 r7:c085564c r6:c780801c r5:00000400 r4:0000000a
[ 27.417263] [] (orion_handle_irq) from [] (__irq_svc+0x68/0x84)
[ 27.424968] Exception stack(0xc0801f00 to 0xc0801f48)
[ 27.430044] 1f00: 00000000 00000000 00000000 60000013 00000000 ffffe000 c0803094 c08206e0
[ 27.438271] 1f20: c080a852 c065e3e0 00000000 c0801f5c c0801f50 c0801f50 c0102ec0 c05a6814
[ 27.446494] 1f40: 60000013 ffffffff
[ 27.450004] r10:00000000 r9:c0800000 r8:c080a852 r7:c0801f34 r6:ffffffff r5:60000013
[ 27.457875] r4:c05a6814
[ 27.460441] [] (default_idle_call) from [] (do_idle+0x84/0x144)
[ 27.468154] [] (do_idle) from [] (cpu_startup_entry+0x14/0x18)
[ 27.475774] r10:00721c14 r9:c0723a20 r8:c7ffcc80 r7:00000000 r6:c0803020 r5:ffffffff
[ 27.483654] r4:c0809c4c r3:40000013
[ 27.487258] [] (cpu_startup_entry) from [] (rest_init+0x74/0x94)
[ 27.495063] [] (rest_init) from [] (start_kernel+0x3a0/0x424)
[ 27.502595] r5:ffffffff r4:c08299a0
[ 27.506188] [] (start_kernel) from [<00008048>] (0x8048)
[ 27.512404] ---[ end trace 351d5fe97981d9fa ]---

@openwrt-bot
Copy link
Author

builker:

Another occurrence: [[https://forum.openwrt.org/t/ea4500-loop-reboot-only-failsafe-working/34393|External LinkEa4500 loop reboot / only failsafe working]]

@openwrt-bot
Copy link
Author

tapper82:

I would like to see this fixt. These routers don't get mutch love.
https://bugs.openwrt.org/index.php?do=details&task_id=505
Because of the posted bug I could not even get openwrt on to my friends ea4500

@openwrt-bot
Copy link
Author

ynezz:

[ 27.183961] WARNING: CPU: 0 PID: 0 at backports-4.19.23-1/net/mac80211/rx.c:4516 ieee80211_rx_napi+0x1fc/0xa54 [mac80211]

It looks like a warning to me, not a crash which would reboot the router. I can't locate rx.c:4516 in source code since you're using "quite dated" snapshot with 4.14.105 kernel and we're on 4.14.109 now, please try to update and post the warning again.

Anyway, if the router crashed later, it would be probably helpful if you could connect serial console to the router and provide complete log as seen there. Otherwise I'm afraid we're flying blind.

@openwrt-bot
Copy link
Author

builker:

Thank you, Petr, for looking at it. You're right that was only a warning. Please find the complete console log (attached EA4500 crash console.log), albeit from the dated snapshot with 4.14.105 kernel built on March 14. Forget to bring my serial cable today. I will try get a log based on a more recent snapshot later.

In the complete log, there are 54 occurrences of the warning before crashing and rebooting. Run out memory? After the reboot, /sys/kernel/debug/crashlog cannot be found. Ordinarily, on the EA4500 I could trigger a crash with "echo c > /proc/sysrq-trigger" and found the crashlog.

@openwrt-bot
Copy link
Author

builker:

I built Snapshot r9614-b614954 and obtained (attached here) build_dir/target-arm_xscale_musl_eabi/linux-kirkwood/backports-4.19.23-1/net/mac80211/rx.c. rx.c:4516 ieee80211_rx_napi+0x1fc/0xa54 points to the following WARN_ON

case RX_ENC_LEGACY:
if (WARN_ON(status->rate_idx >= sband->n_bitrates))

@openwrt-bot
Copy link
Author

ynezz:

Thanks, that's what I've guessed. I expect [[https://github.com/torvalds/linux/commit/854783444bab0024556c0aefdb0a860f2f1da286|off-by-one]] in the mwl8k driver. Please try following [[https://github.com/ynezz/openwrt/commit/ef9aa2ef6c5fa4ef60cdb17028c76e0fc90e04a9|patch]] and let me know if it has changed anything.

@openwrt-bot
Copy link
Author

builker:

Thank you, Petr. I'm building today's trunk (b3d8b3a) with your 999-DEBUG-net-mac80211-mwl8k-debugging.patch. Will report back after I test it on the troubling site, probably over the weekend.

@openwrt-bot
Copy link
Author

builker:

Hello Petr,

Please find the attached log from a test run with your debugging patch. It was produced by issuing the following command:
root@OpenWrt:~# wifi down; sleep 4; wifi up; sleep 4; logread > /tmp/logread-patch.log

I removed a bunch of repeated line at the beginning
Sat Apr 6 14:58:03 2019 kern.info kernel: [ 1931.235657] got rate_idx: 251 (0xfb) fixing it to 7 (0x7)

Thank you for looking into it.

@openwrt-bot
Copy link
Author

builker:

I notice the default (generated) /etc/config/wireless does not trip over the problem. I applied its settings back to my wireless. It appears channel 44 or 48 gets into trouble every time.
--- a/etc/config/wireless
+++ b/etc/config/wireless
@@ -21,7 +21,7 @@ config wifi-iface 'default_radio0'

config wifi-device 'radio1'
option type 'mac80211'

  •   option channel '44'
    
  •   option channel '36'
      option htmode 'HT40'
      option hwmode '11a'
      option path 'mbus@f1000000/mbus@f1000000:pcie@82000000/pci0000:00/0000:00:02.0/0000:02:00.0'</code>
    

Channel 36 or 40 does not trigger the message "kern.info kernel: [ 1931.235657] got rate_idx: 251 (0xfb)". 'auto' always chooses 36, very smart. However, if I do in the LuCI "Join Network: Wireless Scan" (http://192.168.1.1/cgi-bin/luci/admin/network/wireless_join), the kern.info message shows up 3 times every ~12 seconds.

@openwrt-bot
Copy link
Author

ynezz:

Thanks, I'll look into it later, but I'm wondering if you're still seeing the endless reboots or they've dissappeared as well.

@openwrt-bot
Copy link
Author

builker:

With pr_info, the router doesn't crash and reboot. The original crash was probably caused by rapidly repeated warnings with released settings and neighborly (troubling) APs. [[https://forum.openwrt.org/t/ea4500-5ghz-hangs-the-router/12491/2|Folks have been working around the problem]], which may not be right, still costing extra CPU cycles or resources. The WARN_ON is there for a good reason: The bug may be in the lower driver?

@openwrt-bot
Copy link
Author

ynezz:

Please can you try following [[https://github.com/ynezz/openwrt/commit/739dba70dd54b62c708749cba229f8aaa5a66520|fix]]? If the proposed patch fixes your issue, could you please provide me with the following details as I would like to send the patch upstream:

@openwrt-bot
Copy link
Author

builker:

Thank you, Petr. I'm building it with today's trunk (8f1cbef). Will test it in 2 days at the troubling site.

@openwrt-bot
Copy link
Author

builker:

Petr, with your fix patch, EA3500 booted up, stayed up, and radio1 (5 GHz) on channel 44 or 48 accepted clients. With Snapshot r9614-b614954, it went to endless reboots at the same site.

However, the kernel warning (mac80211/rx.c:4469 ieee80211_rx_napi+0x90/0xa4c) happened once after one reboot. Please see the attached log. I rebooted 5~6 times and couldn't reproduce it. Some initialization code still sets rate_idx to an erroneous value by chance?

@openwrt-bot
Copy link
Author

ynezz:

Thanks for testing, could you please give me your real name and email so I can add it to Tested-by/Reported-by tags and send it upstream?

That other warning is another issue with the driver as it happens in [[different place|https://elixir.bootlin.com/linux/v4.19.34/source/net/mac80211/rx.c#L4453]].

I'll wait for your details for Tested-by/Reported-by tags, then I'm going to send it for review to the upstream kernel and once I get some feedback I'll push it to OpenWrt tree. Then I'm going to close this issue as resolved.

If you would like to get the new warning fixed, I can help with that as well, but you'll need to wait until I find some time for this. BTW you can try to report this issue upstream as well by yourself, since it does look like a problem in upstream driver so you might get it fixed faster.

@openwrt-bot
Copy link
Author

builker:

Thank you, Petr, for sending the fix upstream. Please see your true.cz email and briefly acknowledge it if you receive it.

Thanks discovering a different issue. I appreciate it when you have time looking into it in the upstream driver. I have no idea what sband == 0 means. By the way, where to report the issue upstream, https://bugzilla.kernel.org?

@hadifarnoud
Copy link

it looks like a hardware issue. I have Linksys image and still restarts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants