OpenWrt/LEDE Project

  • Status Unconfirmed
  • Percent Complete 0%
  • Task Type Bug Report
  • Category Kernel
  • Assigned To No-one
  • Operating System All
  • Severity Critical
  • Priority Very Low
  • Reported Version Trunk
  • Due in Version Undecided
  • Due Date Undecided
  • Votes 1
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by dissent1 - 29.11.2017

FS#1197 - Netgear ipq806x severe wifi problem

Netgear R7800 (ipq8065, dual QCA9984)
Netgear R7500v2 (ipq8064, dual QCA9980)
Kernel 4.9: happens on both 17.01 and trunk when built with kernel 4.9; kernel 4.4 is unaffected.

The wireless interface transmits broken/malformed frames that are not detected or corrected by the lower-layer protocols. The symptoms are:
1. Extremely low throughput on both bands: 20-50 Mbit/s at most on 2.4 GHz and 5 GHz. Sometimes only one band, or neither, is affected, but it is only a matter of time before both are. Alternatively, the problem moves from one band to the other when only one was affected.
2. Sometimes broken LuCI web styles when accessed over wifi, and sometimes broken SSH/SSL connections: https://bugs.lede-project.org/index.php?do=details&task_id=1173
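
One way to confirm that payload corruption gets past the lower layers is to pull a known file from the router over wifi and compare it with a reference copy. The following is only a minimal sketch under that assumption; the helper names (`sha256_hex`, `corrupted_offsets`, `check_transfer`) are made up for illustration and are not part of any LEDE tooling.

```python
import hashlib
from typing import List, Tuple


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a payload as a hex string."""
    return hashlib.sha256(data).hexdigest()


def corrupted_offsets(reference: bytes, received: bytes) -> List[int]:
    """Return byte offsets (over the common length) where the payloads differ."""
    return [i for i, (a, b) in enumerate(zip(reference, received)) if a != b]


def check_transfer(reference: bytes, received: bytes) -> Tuple[bool, List[int]]:
    """Compare a received payload with its reference copy.

    Returns (ok, offsets): ok is True only when length and digest match,
    offsets lists the positions of differing bytes.
    """
    ok = (len(reference) == len(received)
          and sha256_hex(reference) == sha256_hex(received))
    return ok, corrupted_offsets(reference, received)
```

Reading both copies with `open(path, "rb").read()` and passing them to `check_transfer` gives a pass/fail result plus the offsets of any damaged bytes, which might also help when looking for an alignment pattern.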

My observations:
1. The issue affects only Netgear ipq806x devices and is not related to the QCA9984. According to recent findings, the Netgear R7500v2 with QCA9980 also suffers from the issue:
https://bugs.lede-project.org/index.php?do=details&task_id=1173

2. The bug happens only on kernel 4.9, according to tests comparing k4.4 and k4.9 with exactly the same set of wireless backports, firmware, cal/pre-cal data, and board data (API 1/2 or GPL). I've tried all available firmware, cal/pre-cal data, and board data options, including GPL, with no effect.

3. It doesn't seem to be related to stmmac; I've tried different buffer options.

Actually, I'm starting to think that it's something within U-Boot that messes with RAM, like it does in the last 2 MiB of the memory region.

Additional forum investigation: https://forum.lede-project.org/t/netgear-r7800-exploration-ipq8065-qca9984/285/481 But I must say that that workaround didn't work for me.

The issue seems to be floating and is probably related to some kind of code/byte padding or alignment in memory?

In fact, I previously had this issue only on 2.4 GHz, and after some unrelated commits I now have it on 5 GHz as well.

dissent1 commented on 29.11.2017 10:14

I must add that all of those throughput numbers are TX (from the router's side), while RX (from the router's side) is completely OK and greatly exceeds those values, so the issue is with the router transmitting data, not receiving.

Ansuel commented on 29.11.2017 21:58

I can confirm this too.
Running the latest trunk, same problem.
Same broken GUI that needs lots of refreshes to work.
Wifi crashes sometimes and performance is really s*it...

And I can also confirm that the problems are with RX; TX is all good.

Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.251798] ath10k_pci 0000:01:00.0: failed to flush transmit queue (skip 0 ar-state 1): 0
Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.296253] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.296281] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.309466] ath10k_pci 0000:01:00.0: firmware crashed! (guid c95b3960-6710-4367-ad39-3706a2029428)
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.309501] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.317563] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.329756] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.336109] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 751efba1
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.349056] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 512 raw 0 hwcrypto 1
Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.368374] ath10k_pci 0000:01:00.0: failed to get memcpy hi address for firmware address 4: -16
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.368399] ath10k_pci 0000:01:00.0: failed to read firmware dump area: -16
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.376263] ath10k_pci 0000:01:00.0: Copy Engine register dump:
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.382956] ath10k_pci 0000:01:00.0: [00]: 0x0004a000 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.388808] ath10k_pci 0000:01:00.0: [01]: 0x0004a400 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.397885] ath10k_pci 0000:01:00.0: [02]: 0x0004a800 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.406757] ath10k_pci 0000:01:00.0: [03]: 0x0004ac00 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.415598] ath10k_pci 0000:01:00.0: [04]: 0x0004b000 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.424470] ath10k_pci 0000:01:00.0: [05]: 0x0004b400 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.433311] ath10k_pci 0000:01:00.0: [06]: 0x0004b800 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.442152] ath10k_pci 0000:01:00.0: [07]: 0x0004bc00 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.450960] ath10k_pci 0000:01:00.0: [08]: 0x0004c000 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.459872] ath10k_pci 0000:01:00.0: [09]: 0x0004c400 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.468738] ath10k_pci 0000:01:00.0: [10]: 0x0004c800 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.477587] ath10k_pci 0000:01:00.0: [11]: 0x0004cc00 3735928559 3735928559 3735928559 3735928559
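
The decimal value 3735928559 repeated in the Copy Engine dump above is 0xdeadbeef in hexadecimal, a well-known filler pattern. Reading that as "the registers could not be read back after the crash" is an assumption on my part, not something the log states. A minimal sketch for decoding such lines, with a hypothetical helper name (`decode_ce_line`):

```python
import re

# Matches the "[NN]: 0xBASE v1 v2 v3 v4" tail of a Copy Engine dump line.
CE_LINE = re.compile(r"\[(\d+)\]: (0x[0-9a-fA-F]+)((?: \d+)+)\s*$")


def decode_ce_line(line):
    """Return (index, base_address, [hex values]) or None if the line does not match."""
    match = CE_LINE.search(line)
    if match is None:
        return None
    index = int(match.group(1))
    base = int(match.group(2), 16)
    # The driver prints the register contents in decimal; convert them to hex.
    values = ["0x%08x" % int(v) for v in match.group(3).split()]
    return index, base, values
```

Feeding each `kern.err` line of the dump through `decode_ce_line` makes it easy to see whether any register holds something other than the filler value.
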

dissent1 commented on 02.12.2017 11:04

A cause and a fix seem to have been found. Waiting for more users to provide feedback to be completely sure.
https://forum.lede-project.org/t/netgear-r7800-exploration-ipq8065-qca9984/285/572

zefie commented on 08.01.2018 17:02

On the R7500v2 I've observed the following:

Images built with the LEDE image builder for the corresponding versions:

17.01.3 (prior to November): Stable
17.01.3 & 17.01.4 (as of 2018-01-08): Wifi does not work at all (neither band) (image builder only; the released October ROM works due to its older kernel)
snapshot (2017-12-31 - 2018-01-08): Both wifi bands work, but 5 GHz crashes after an undetermined period of time.

Edit: Log from 2018-01-08 snapshot:

[ 2875.949369] ------------[ cut here ]------------
[ 2875.949503] WARNING: CPU: 0 PID: 0 at backports-2017-11-01/net/mac80211/driver-ops.h:17 ieee80211_unreserve_tid+0x378/0x5dc [mac80211]
[ 2875.949511] wlan0:  Failed check-sdata-in-driver check, flags: 0x9
[ 2875.949842] Modules linked in: ath10k_pci ath10k_core ath pppoe nf_nat_pptp nf_conntrack_pptp mac80211 lz4 l2tp_ppp iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_policy xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_geoip xt_esp xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_NETMAP xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda ts_fsm ts_bm pptp pppox ppp_async nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4 nf_nat_ipv4 nf_nat_h323 nf_nat_amanda nf_log_ipv4 nf_defrag_ipv4
[ 2875.950085]  nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_rtcache nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_irc nf_conntrack_h323 nf_conntrack_broadcast ts_kmp nf_conntrack_amanda lz4_decompress lz4_compress libcrc32c iptable_raw iptable_mangle iptable_filter ipt_ah ipt_ECN ip6table_raw ip_tables crc7 crc_ccitt compat fuse sg ledtrig_usbport xt_set ip_set_list_set ip_set_hash_netiface ip_set_hash_netport ip_set_hash_netnet ip_set_hash_net ip_set_hash_netportnet ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink sr_mod cdrom ip6t_NPT ip6t_MASQUERADE nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables pppoatm ppp_generic slhc msdos ip_gre gre sit l2tp_netlink l2tp_core udp_tunnel ip6_udp_tunnel tunnel6 tunnel4 ip_tunnel tun vfat fat udf crc_itu_t hfsplus hfs configfs cifs dm_crypt dm_mirror dm_region_hash dm_log dm_mod br2684 atm multipath raid10 raid1 raid0 linear md_mod nls_utf8 nls_iso8859_1 nls_cp437 sha512_generic sha1_generic md5 md4 usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_of_simple ohci_platform ohci_hcd phy_qcom_dwc3 ahci ehci_platform sd_mod ahci_platform libahci_platform libahci libata scsi_mod ehci_hcd gpio_button_hotplug reiserfs f2fs ext4 jbd2 mbcache exfat btrfs xor xor_neon raid6_pq crc32c_generic crc32_generic
[<bf75f468>] (ieee80211_reconfig [mac80211]) from [<bf734190>] (ieee80211_restart_work+0x94/0xa8 [mac80211])
[ 2876.201235] [<bf734190>] (ieee80211_restart_work [mac80211]) from [<c0231264>] (process_one_work+0x1d4/0x310)
[ 2876.210605] [<c0231264>] (process_one_work) from [<c0231f6c>] (worker_thread+0x2ec/0x42c)
[ 2876.220505] [<c0231f6c>] (worker_thread) from [<c0235f18>] (kthread+0xd8/0xec)
[ 2876.228663] [<c0235f18>] (kthread) from [<c020ec90>] (ret_from_fork+0x14/0x24)
[ 2876.235774] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.9.73 #0
[ 2876.235884] ---[ end trace 74622c1c839abce6 ]---
[ 2876.250164] Hardware name: Generic DT based system
[ 2876.254891] [<c0215530>] (unwind_backtrace) from [<c0211f1c>] (show_stack+0x10/0x14)
[ 2876.259487] [<c0211f1c>] (show_stack) from [<c03a0b88>] (dump_stack+0x7c/0x9c)
[ 2876.267380] [<c03a0b88>] (dump_stack) from [<c021d864>] (__warn+0xbc/0xec)
[ 2876.274402] [<c021d864>] (__warn) from [<c021d8c8>] (warn_slowpath_fmt+0x34/0x44)
[ 2876.281381] [<c021d8c8>] (warn_slowpath_fmt) from [<bf75837c>] (ieee80211_unreserve_tid+0x378/0x5dc [mac80211])
[ 2876.289028] [<bf75837c>] (ieee80211_unreserve_tid [mac80211]) from [<bf759de0>] (ieee80211_tx_prepare_skb+0x1e0/0x218 [mac80211])
[ 2876.298924] [<bf759de0>] (ieee80211_tx_prepare_skb [mac80211]) from [<bf75ae98>] (__ieee80211_subif_start_xmit+0x854/0x8a8 [mac80211])
[ 2876.310735] [<bf75ae98>] (__ieee80211_subif_start_xmit [mac80211]) from [<bf75b190>] (ieee80211_subif_start_xmit+0x2a4/0x2b4 [mac80211])
[ 2876.322612] [<bf75b190>] (ieee80211_subif_start_xmit [mac80211]) from [<c050701c>] (dev_hard_start_xmit+0xac/0x120)
[ 2876.334911] [<c050701c>] (dev_hard_start_xmit) from [<c050758c>] (__dev_queue_xmit+0x43c/0x680)
[ 2876.345079] [<c050758c>] (__dev_queue_xmit) from [<c05cec18>] (br_dev_queue_push_xmit+0xf8/0x148)
[ 2876.353755] [<c05cec18>] (br_dev_queue_push_xmit) from [<c05cec98>] (br_forward_finish+0x30/0x90)
[ 2876.362779] [<c05cec98>] (br_forward_finish) from [<c05ced88>] (__br_forward+0x90/0x10c)
[ 2876.371633] [<c05ced88>] (__br_forward) from [<c05cee48>] (deliver_clone+0x44/0x50)
[ 2876.379790] [<c05cee48>] (deliver_clone) from [<c05cef64>] (maybe_deliver+0x68/0x80)
[ 2876.387170] [<c05cef64>] (maybe_deliver) from [<c05cf034>] (br_flood+0xb8/0x148)
[ 2876.395153] [<c05cf034>] (br_flood) from [<c05d0594>] (br_handle_frame_finish+0x498/0x4e4)
[ 2876.402528] [<c05d0594>] (br_handle_frame_finish) from [<c05d085c>] (br_handle_frame+0x27c/0x300)
[ 2876.410616] [<c05d085c>] (br_handle_frame) from [<c0502e08>] (__netif_receive_skb_core+0x42c/0x8fc)
[ 2876.419549] [<c0502e08>] (__netif_receive_skb_core) from [<c0505394>] (process_backlog+0x7c/0x11c)
[ 2876.428396] [<c0505394>] (process_backlog) from [<c0505a88>] (net_rx_action+0xe8/0x2a8)
[ 2876.437434] [<c0505a88>] (net_rx_action) from [<c0220fe4>] (__do_softirq+0xd0/0x204)
[ 2876.445329] [<c0220fe4>] (__do_softirq) from [<c022139c>] (irq_exit+0x94/0x104)
[ 2876.453324] [<c022139c>] (irq_exit) from [<c0255f38>] (__handle_domain_irq+0x90/0xb4)
[ 2876.460346] [<c0255f38>] (__handle_domain_irq) from [<c02093d0>] (gic_handle_irq+0x50/0x94)
[ 2876.468335] [<c02093d0>] (gic_handle_irq) from [<c021288c>] (__irq_svc+0x6c/0x90)
[ 2876.476472] Exception stack(0xc0787f60 to 0xc0787fa8)
[ 2876.484139] 7f60: 00000001 00000000 00000000 c021a420 00000000 c0786000 c0788fe4 00000001
[ 2876.489180] 7f80: c0783a30 00000000 c0787fb8 00000001 00000000 c0787fb0 c020f510 c020f514
[ 2876.497313] 7fa0: 60000013 ffffffff
[ 2876.505480] [<c021288c>] (__irq_svc) from [<c020f514>] (arch_cpu_idle+0x2c/0x38)
[ 2876.508781] [<c020f514>] (arch_cpu_idle) from [<c024f2e4>] (cpu_startup_entry+0xe8/0x198)
[ 2876.516422] [<c024f2e4>] (cpu_startup_entry) from [<c074ac28>] (start_kernel+0x370/0x3f4)
[ 2876.524594] ---[ end trace 74622c1c839abce7 ]---
[ 2876.532723] ------------[ cut here ]------------

If there is any more useful information, it has been flooded out of the logs by this repeating over and over (although the CPU can be 0 or 1).
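
If the log is captured to a file before it wraps, the repeating warning could be filtered out afterwards so that other messages become visible. Below is a rough sketch, assuming dmesg-style lines like the ones above; the pattern list and the `split_log` name are illustrative guesses, and wrapped continuation lines (such as the second half of the module list) will still slip through.

```python
import re
from collections import Counter

# Leading "[ 1234.567890] " timestamp as printed by dmesg.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")

# Line shapes that belong to a WARNING backtrace (an assumed, incomplete list).
NOISE = [
    re.compile(r"-+\[ cut here \]-+"),
    re.compile(r"---\[ end trace [0-9a-f]+ \]---"),
    re.compile(r"^WARNING: "),
    re.compile(r"^Modules linked in:"),
    re.compile(r"^\[<[0-9a-f]+>\] \("),
    re.compile(r"^CPU: \d+ PID: \d+ Comm: "),
    re.compile(r"^Hardware name: "),
    re.compile(r"^Exception stack\("),
    re.compile(r"^[0-9a-f]{4}: (?:[0-9a-f]{8} ?)+$"),
]


def split_log(lines):
    """Return (kept_lines, warning_counts) for an iterable of log lines."""
    kept = []
    warnings = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        body = TIMESTAMP.sub("", line).strip()
        if body.startswith("WARNING: "):
            # Ignore CPU/PID so the same warning on CPU 0 and CPU 1 is counted once.
            warnings[re.sub(r"CPU: \d+ PID: \d+ ", "", body)] += 1
        if any(pattern.search(body) for pattern in NOISE):
            continue
        kept.append(line)
    return kept, warnings
```

Line-by-line matching is used instead of dropping everything between the "cut here" and "end trace" markers because the traces above are interleaved, presumably from both CPUs printing at once.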

bouwew commented on 14.01.2018 14:55

Here is the (not yet merged) pull request to LEDE: https://github.com/lede-project/source/pull/1559
