OpenWrt/LEDE Project

  • Status Closed
  • Percent Complete 100%
  • Task Type Bug Report
  • Category Base system
  • Assigned To No-one
  • Operating System All
  • Severity High
  • Priority Very Low
  • Reported Version Trunk
  • Due in Version Undecided
  • Due Date Undecided
  • Votes 3
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Daniel Wandrei - 01.02.2017
Last edited by Mathias Kresin - 13.05.2017

FS#447 - Unstable 2.4 GHz WiFi on WDR3600 when using WDS

I’m using two TP-LINK WDR3600 (one per floor). After around 2 days, the 2.4 GHz WiFi of the WDR3600 on the ground floor becomes unstable: mobile devices get no data throughput, connections drop, and clients cannot connect to the WiFi, until no client is left connected and no new connections can be established. The problem first occurred just after I added a CPE210 that is WDS-bridged to this particular WDR3600 (WDR3600 as WDS AP, CPE210 as WDS client).

Both devices are running the latest LEDE trunk; the problem has existed in every trunk version since I added the CPE210 (about 4-5 weeks ago). As soon as the WDR3600 becomes unstable, the CPE210 starts spamming log entries like:

Wed Feb  1 01:49:21 2017 kern.info kernel: [120752.942945] wlan0: authenticate with a0:f3:xx:xx:xx:xx
Wed Feb  1 01:49:21 2017 kern.info kernel: [120752.965502] wlan0: send auth to a0:f3:xx:xx:xx:xx (try 1/3)
Wed Feb  1 01:49:21 2017 kern.info kernel: [120753.074740] wlan0: send auth to a0:f3:xx:xx:xx:xx (try 2/3)
Wed Feb  1 01:49:21 2017 kern.info kernel: [120753.137152] wlan0: send auth to a0:f3:xx:xx:xx:xx (try 3/3)
Wed Feb  1 01:49:21 2017 kern.info kernel: [120753.236001] wlan0: authentication with a0:f3:xx:xx:xx:xx timed out
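
To see how often the bridge fails over time, the timeout messages above can be counted per hour from a saved copy of the log. The following is only a minimal sketch: the function name `count_auth_timeouts` is hypothetical, and the pattern assumes the exact timestamp and message layout shown in the excerpt above.

```python
import re
from collections import Counter

# Matches the "authentication with <MAC> timed out" line from the excerpt.
# Assumption: timestamps look like "Wed Feb  1 01:49:21 2017".
TIMEOUT_RE = re.compile(
    r"^\w{3} (\w{3}\s+\d+ \d{2}):\d{2}:\d{2} \d{4} .*"
    r"authentication with \S+ timed out"
)


def count_auth_timeouts(lines):
    """Return a Counter mapping 'Mon DD HH' to the number of timeouts."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            # Collapse the padded day of month ("Feb  1 01" -> "Feb 1 01").
            counts[" ".join(m.group(1).split())] += 1
    return counts
```

Feeding it the lines of a saved syslog file gives one bucket per hour, which makes it easier to see whether the failures cluster at particular times.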

After a reboot of the WDR3600, everything is fine again for 1-3 days: all clients reconnect and the bridge is re-established.

Closed by  Mathias Kresin
13.05.2017 10:36
Reason for closing:  Fixed
Additional comments about closing:  

According to reporter: Seems to be solved. Either by a newer LEDE version or by adding an interface just for WDS.

Daniel Wandrei commented on 01.02.2017 13:24

And here it goes again... Went away around 13:55, max 10 mins. later, no 2.4 GHz Client connected or able to connect. Last reboot was only 6 hours ago.

Syslog: http://pastebin.com/HXLZ3sg8

Kernel Log: http://pastebin.com/v9d05yXx

A0:F3:... = 2.4 GHz of WDR3600
84:16:... = CPE210 WDS Client

Daniel Wandrei commented on 02.02.2017 09:59

Changed the WDS bridge to the other WDR3600, same behavior. So it's not the WDR3600 itself.

This is the traffic data of the CPE210; there should be a constant stream of traffic because IP cams are connected to it. The WDS bridge broke yesterday evening, came back to life at night (I didn't do anything), and broke again this morning.

http://imgur.com/a/0kf2q

Project Manager
Felix Fietkau commented on 02.02.2017 15:20

What's the last version that you tried?

Daniel Wandrei commented on 03.02.2017 07:53

I'm running r3204-2711b94 on all 3 devices.

Daniel Wandrei commented on 06.02.2017 10:04

r3281-a4d12ae now, problem still exists.

Project Manager
Felix Fietkau commented on 12.02.2017 13:44

Please try the latest version, just fixed some more bugs in the airtime fairness code.

Daniel Wandrei commented on 13.02.2017 08:22

Going with r3426-4c09f99 now, let's see if this helps.

Daniel Wandrei commented on 14.02.2017 16:50

28 Hours now, still going strong without hiccups...

Daniel Wandrei commented on 15.02.2017 08:00

Sorry to say: It's getting worse again after around 42-43 hours:

Ping to the CPE210 WDS client gets unstable until it's completely gone; all 2.4 GHz stations are disconnected from the WDR3600 and none is able to reconnect. The peaks on the left were right before the last sysupgrade to r3426-4c09f99.

Client still spamming "Authentication timed out", AP saying "Neighbor lost.". After WDR3600 reboot, everything is fine again.

Edit: Will try setting airtime_flag to 0, to see if this helps.

Daniel Wandrei commented on 15.02.2017 13:56

Not fixed! (see above)

Daniel Wandrei commented on 16.02.2017 13:29

This problem seems to be unrelated to airtime fairness: even with airtime_flag set to 0 on all devices, it got worse a few minutes ago, after only about 14 hours.

Dave Täht commented on 20.02.2017 17:58

We did not test WDS very much at all... there is a new statistics file that you might want to look at tho (aqm for the station)... but it doesn't look like ATF or fq_codel from here.

I find it interesting that your disconnects were during business hours....

Can you get an aircap?

Daniel Wandrei commented on 21.02.2017 07:49

I cannot pin this problem to any specific time, as you can see in the ping graph of the whole week that I attached.

Here is aqm stat file from WDS client:

access name value
R fq_flows_cnt 4096
R fq_backlog 0
R fq_overlimit 0
R fq_overmemory 0
R fq_collisions 0
R fq_memory_usage 0
RW fq_memory_limit 4194304
RW fq_limit 8192
RW fq_quantum 300

I will pull the stats from the station as soon as it gets unstable again.

Aircap could be a bit complicated, the station produces over 20 GB traffic a day.

   ping.JPG (168.5 KiB)

Daniel Wandrei commented on 24.02.2017 06:19

So, here are the aqm stats from the Station after it went bad:

access name value
R fq_flows_cnt 4096
R fq_backlog 0
R fq_overlimit 932
R fq_overmemory 7076
R fq_collisions 89535
R fq_memory_usage 0
RW fq_memory_limit 4194304
RW fq_limit 8192
RW fq_quantum 300
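
The two aqm dumps in this thread differ only in a few counters, which is easier to see with a small diff. This is a rough sketch, assuming the three-column `access name value` layout shown above; the helper names `parse_aqm` and `changed_counters` are made up for illustration, and it is only an assumption that the two dumps are comparable snapshots.

```python
def parse_aqm(text):
    """Parse an 'access name value' dump into {name: int}."""
    stats = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip the header row and anything that is not an R/RW entry.
        if len(parts) != 3 or parts[0] not in ("R", "RW"):
            continue
        stats[parts[1]] = int(parts[2])
    return stats


def changed_counters(before, after):
    """Return {name: (old, new)} for every entry whose value differs."""
    return {
        name: (before.get(name), value)
        for name, value in after.items()
        if before.get(name) != value
    }
```

Applied to the dump from 21.02.2017 and the one above, only `fq_overlimit`, `fq_overmemory` and `fq_collisions` are reported as changed; the limits and `fq_backlog` are identical in both.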

Baptiste Jonglez commented on 25.02.2017 23:03

Daniel, can you share your /etc/config/network and /etc/config/wireless config from both the WDS AP and station?

Also, do you see any kernel messages (in dmesg) when this occurs, other than the messages about br-lan?

Daniel Wandrei commented on 26.02.2017 10:58

Sure,

Network AP: http://pastebin.com/2ft3N3iP Wireless AP: http://pastebin.com/CaLv4U1V

Network Station: http://pastebin.com/FUjKsMPp Wireless Station: http://pastebin.com/T08ZGTyA

Sorry, no messages besides br-lan in kernel log.

Baptiste Jonglez commented on 08.03.2017 22:31

I am trying to reproduce the issue, and I have another question about the setup: you have a WDS AP, on which you connect another LEDE device (WDS STA). But you also have regular (non-WDS) STA connected to the same WDS AP?

Do you see the same issue if only the WDS STA is connected to the AP? (i.e. forbid any other STA to associate)

Also, I noticed that you use STP, did you try to disable it? Does your network topology have a physical loop when you enable the WDS bridge? Maybe STP sometimes decides to cut the WDS bridge and then it breaks connectivity for all STA?

Daniel Wandrei commented on 09.03.2017 08:10

"But you also have regular (non-WDS) STA connected to the same WDS AP?"

Yes

"Do you see the same issue if only the WDS STA is connected to the AP? (i.e. forbid any other STA to associate)"

Never tried that. Anyway, it's working fine with only normal stations connected. I will try with only the WDS station connected to that AP.

"Also, I noticed that you use STP, did you try to disable it?"

There is no physical loop, but to my knowledge (Cisco certified), enabling STP on WDS devices is more or less best practice. Also, I've never seen unwanted behavior caused by enabled STP on non-redundant network topologies. I'll try disabling STP on the APs and my core switch.

Baptiste Jonglez commented on 09.03.2017 08:33

Ok, thanks.

I tried the same setup with two WR841N, and even without STP, the AP was crashing every few hours. I managed to get a crashlog (see attached file) but it's completely unreadable.

Did you also see crashes (i.e. reboots) of the AP?

   crashlog (1.8 KiB)

Baptiste Jonglez commented on 09.03.2017 09:17

Here is a better crashlog:

<4>[ 2581.160504] Trap instruction in kernel code[#1]:
<4>[ 2581.165297] CPU: 0 PID: 0 Comm: swapper Not tainted 4.4.52 #0
<4>[ 2581.171239] task: 8042ef58 ti: 80428000 task.ti: 80428000
<4>[ 2581.176814] $ 0   : 00000000 804a0000 00000000 fffffffe
<4>[ 2581.182253] $ 4   : 00afc0cc 009f844e 0000007f a0e7c980
<4>[ 2581.187691] $ 8   : 009f849c 00000052 0000004a 01000000
<4>[ 2581.193129] $12   : 80000000 8000004e 00000000 00000002
<4>[ 2581.198559] $16   : 81b27000 80df7000 81a15000 81942f00
<4>[ 2581.203989] $20   : 00000000 00000000 8042a4e0 80d18400
<4>[ 2581.209428] $24   : 00000000 8007d2d4                  
<4>[ 2581.214866] $28   : 80428000 81809a58 8042a4e0 80272104
<4>[ 2581.220305] Hi    : 00000019
<4>[ 2581.223280] Lo    : 00000000
<4>[ 2581.226272] epc   : 80272128 __dev_queue_xmit+0x2d8/0x4c4
<4>[ 2581.231852] ra    : 80272104 __dev_queue_xmit+0x2b4/0x4c4
<4>[ 2581.237428] Status: 1100f403	KERNEL EXL IE 
<4>[ 2581.241785] Cause : 00800034 (ExcCode 0d)
<4>[ 2581.245924] PrId  : 0001974c (MIPS 74Kc)
<4>[ 2581.249972] Modules linked in: ath9k ath9k_common pppoe ppp_async ath9k_hw ath pppox ppp_generic nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
<4>[ 2581.309535] Process swapper (pid: 0, threadinfo=80428000, task=8042ef58, tls=00000000)
<4>[ 2581.317710] Stack : 80ecf420 8187992c 81809b2c 81879900 81b27068 00000001 80430000 fffffff4
<4>[ 2581.317710] 	  80ece09c 80ed7000 0000004a 80ed7000 80d18400 80ed7000 00000000 8035b5a4
<4>[ 2581.317710] 	  00000000 80df7000 00000000 80905e00 80f81000 80430000 80df7000 80271cd0
<4>[ 2581.317710] 	  80428000 00440002 00000020 0000000e 00000003 80df7000 8042a4e0 00000000
<4>[ 2581.317710] 	  80d18400 80271718 80430000 803fc6cc 8190af00 80ece090 80430000 80df7000
<4>[ 2581.317710] 	  ...
<4>[ 2581.354706] Call Trace:
<4>[ 2581.357239] [<80272128>] __dev_queue_xmit+0x2d8/0x4c4
<4>[ 2581.362495] [<8035b5a4>] vlan_dev_hard_start_xmit+0x98/0x128
<4>[ 2581.368341] [<80271cd0>] dev_hard_start_xmit+0x2a8/0x354
<4>[ 2581.373834] [<80272210>] __dev_queue_xmit+0x3c0/0x4c4
<4>[ 2581.379068] [<8034a004>] br_dev_queue_push_xmit+0x16c/0x1a0
<4>[ 2581.384828] [<8034a074>] br_forward_finish+0x3c/0xb0
<4>[ 2581.389963] [<8034a304>] __br_forward+0xa8/0x114
<4>[ 2581.394741] [<8034bc5c>] br_handle_frame_finish+0x4f4/0x550
<4>[ 2581.400500] [<8034c038>] br_handle_frame+0x380/0x418
<4>[ 2581.405662] [<8026e540>] __netif_receive_skb_core+0x42c/0x898
<4>[ 2581.411768] [<80da0a04>] ieee80211_attach_ack_skb+0x103c/0x19e8 [mac80211]
<4>[ 2581.418913] 
<4>[ 2581.420454] 
<4>[ 2581.420454] Code: 02002025  10000014  00008825 <8e020074> 00431024  ae020074  1000000f  00008825  8e020000 
<4>[ 2581.430831] ---[ end trace 8251bd1741f30e24 ]---
<0>[ 2581.437731] Kernel panic - not syncing: Fatal exception in interrupt
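
When comparing crash logs like this one (for example against the logs in a related report), it helps to reduce each log to its chain of function names. Below is a minimal sketch, assuming the frame layout `[<address>] symbol+0xoffset/0xsize [module]` seen above; `call_trace` is a hypothetical helper, not part of any existing tool.

```python
import re

# One frame per line: "[<address>] symbol+0xoffset/0xsize [optional module]"
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def call_trace(log_text):
    """Return (symbol, module) tuples; module is None for built-in code."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            m = FRAME_RE.search(line)
            if m:
                frames.append((m.group(2), m.group(3)))
            elif frames:
                break  # first non-frame line after the frames ends the trace
    return frames
```

For the log above, the first entry is `__dev_queue_xmit` and the last is `ieee80211_attach_ack_skb` in module `mac80211`, so the trace runs from mac80211 receive handling through the bridge and VLAN code into the transmit path.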

Daniel Wandrei commented on 10.03.2017 07:29

"Did you also see crashes (i.e. reboots) of the AP?"

Not for me, no crashes, no reboots. The 2.4 GHz radio just gets unstable till all clients disconnect and are not able to reconnect. 5 GHz radio does work well meanwhile.

Edit: Installed latest trunk, disabled STP on all devices and testing now.

Baptiste Jonglez commented on 10.03.2017 08:42

Ok, the issue might be somewhat different then. I opened a new bug report, FS#615.

Note that I could reliably trigger crashes even without regular STA and without STP.

Daniel Wandrei commented on 15.03.2017 10:03

Problem also persists for me without using STP.

Daniel Wandrei commented on 05.05.2017 14:38

Seems to be solved. Either by a newer LEDE version or by adding an interface just for WDS.
