FS#447 - Unstable 2.4 GHz WiFi on WDR3600 when using WDS #5747

Closed
openwrt-bot opened this issue Feb 1, 2017 · 24 comments

@openwrt-bot

cyablo:

I'm using two TP-LINK WDR3600 routers (one per floor). After around 2 days, the 2.4 GHz WiFi of the WDR3600 on the ground floor becomes unstable: mobile devices get no data throughput, connections drop, and clients cannot connect to the WiFi, until no client is left connected and no new connections can be established. The problem first occurred just after I added a CPE210 that is WDS-bridged to this particular WDR3600 (WDR3600 as WDS AP, CPE210 as WDS client).

Both devices are running the latest LEDE trunk. The problem has existed in every trunk version since I added the CPE210 (about 4-5 weeks ago). As soon as the WDR3600 becomes unstable, the CPE210 starts spamming log entries like:

Wed Feb 1 01:49:21 2017 kern.info kernel: [120752.942945] wlan0: authenticate with a0:f3:xx:xx:xx:xx
Wed Feb 1 01:49:21 2017 kern.info kernel: [120752.965502] wlan0: send auth to a0:f3:xx:xx:xx:xx (try 1/3)
Wed Feb 1 01:49:21 2017 kern.info kernel: [120753.074740] wlan0: send auth to a0:f3:xx:xx:xx:xx (try 2/3)
Wed Feb 1 01:49:21 2017 kern.info kernel: [120753.137152] wlan0: send auth to a0:f3:xx:xx:xx:xx (try 3/3)
Wed Feb 1 01:49:21 2017 kern.info kernel: [120753.236001] wlan0: authentication with a0:f3:xx:xx:xx:xx timed out

After a reboot of the WDR3600, everything is fine again for 1-3 days: all clients reconnect and the bridge is re-established.

@openwrt-bot

cyablo:

And here it goes again... Went away around 13:55, max 10 mins. later, no 2.4 GHz Client connected or able to connect. Last reboot was only 6 hours ago.

Syslog: http://pastebin.com/HXLZ3sg8

Kernel Log: http://pastebin.com/v9d05yXx

A0:F3:... = 2.4 GHz of WDR3600
84:16:... = CPE210 WDS client

@openwrt-bot

cyablo:

Changed the WDS bridge to the other WDR3600, same behavior. So it's not the WDR3600 itself.

This is the traffic data of the CPE210. There should be a constant stream of traffic because there are IP cams connected to it. The WDS bridge broke yesterday evening, came back to life at night (I didn't do anything) and broke again this morning.

http://imgur.com/a/0kf2q

@openwrt-bot

nbd:

What's the last version that you tried?

@openwrt-bot

cyablo:

I'm running r3204-2711b94 on all 3 devices.

@openwrt-bot

cyablo:

r3281-a4d12ae now, problem still exists.

@openwrt-bot

nbd:

Please try the latest version, just fixed some more bugs in the airtime fairness code.

@openwrt-bot

cyablo:

Going with r3426-4c09f99 now, let's see if this helps.

@openwrt-bot

cyablo:

28 Hours now, still going strong without hiccups...

@openwrt-bot

cyablo:

Sorry to say: It's getting worse again after around 42-43 hours:

Ping to the CPE210 WDS client gets unstable until it's completely gone; all 2.4 GHz stations are disconnected from the WDR3600 and none is able to reconnect. The peaks on the left were right before the last sysupgrade to r3426-4c09f99.

Client still spamming "Authentication timed out", AP saying "Neighbor lost.". After WDR3600 reboot, everything is fine again.

Edit: Will try setting airtime_flag to 0, to see if this helps.

@openwrt-bot

cyablo:

Not fixed! (see above)

@openwrt-bot

cyablo:

This problem seems to be unrelated to airtime fairness: even with airtime_flag set to 0 on all devices, it got worse a few minutes ago, after only about 14 hours.

@openwrt-bot

dtaht:

We did not test WDS very much at all... there is a new statistics file that you might want to look at tho (aqm for the station)... but it doesn't look like ATF or fq_codel from here.

I find it interesting that your disconnects were during business hours....

Can you get an aircap?

@openwrt-bot

cyablo:

I cannot pin this problem to any specific time, as you can see in the ping graph of the whole week that I attached.

Here is aqm stat file from WDS client:

access name value
R fq_flows_cnt 4096
R fq_backlog 0
R fq_overlimit 0
R fq_overmemory 0
R fq_collisions 0
R fq_memory_usage 0
RW fq_memory_limit 4194304
RW fq_limit 8192
RW fq_quantum 300

I will pull the stats from the station as soon as it gets unstable again.

Aircap could be a bit complicated, the station produces over 20 GB traffic a day.

@openwrt-bot

cyablo:

So, here are the aqm stats from the Station after it went bad:

access name value
R fq_flows_cnt 4096
R fq_backlog 0
R fq_overlimit 932
R fq_overmemory 7076
R fq_collisions 89535
R fq_memory_usage 0
RW fq_memory_limit 4194304
RW fq_limit 8192
RW fq_quantum 300
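For comparing two of these aqm dumps, a small helper that lists only the counters that changed can save some squinting. This is a minimal sketch that assumes both dumps come from the same device and use the three-column `access name value` layout shown in this thread; `parse_aqm`, `changed_counters`, `HEALTHY` and `DEGRADED` are illustrative names, not part of any LEDE tool.

```python
def parse_aqm(text):
    """Parse 'access name value' rows into {name: int}; the header row is skipped."""
    stats = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # Data rows have exactly three fields and a numeric value.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[1]] = int(parts[2])
    return stats


def changed_counters(before, after):
    """Return {name: (before, after)} for every counter whose value differs."""
    return {
        name: (before.get(name), value)
        for name, value in after.items()
        if before.get(name) != value
    }


# The two dumps posted in this thread.
HEALTHY = """
access name value
R fq_flows_cnt 4096
R fq_backlog 0
R fq_overlimit 0
R fq_overmemory 0
R fq_collisions 0
R fq_memory_usage 0
RW fq_memory_limit 4194304
RW fq_limit 8192
RW fq_quantum 300
"""

DEGRADED = """
access name value
R fq_flows_cnt 4096
R fq_backlog 0
R fq_overlimit 932
R fq_overmemory 7076
R fq_collisions 89535
R fq_memory_usage 0
RW fq_memory_limit 4194304
RW fq_limit 8192
RW fq_quantum 300
"""

if __name__ == "__main__":
    diff = changed_counters(parse_aqm(HEALTHY), parse_aqm(DEGRADED))
    for name, (old, new) in diff.items():
        print(f"{name}: {old} -> {new}")
```

With the two dumps above, only `fq_overlimit`, `fq_overmemory` and `fq_collisions` differ; the limits and the backlog are identical.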

@openwrt-bot

bjonglez:

Daniel, can you share your /etc/config/network and /etc/config/wireless config from both the WDS AP and station?

Also, do you see any kernel messages (in dmesg) when this occurs, other than the messages about br-lan?

@openwrt-bot

cyablo:

Sure,

Network AP: http://pastebin.com/2ft3N3iP
Wireless AP: http://pastebin.com/CaLv4U1V

Network Station: http://pastebin.com/FUjKsMPp
Wireless Station: http://pastebin.com/T08ZGTyA

Sorry, no messages besides br-lan in kernel log.

@openwrt-bot

bjonglez:

I am trying to reproduce the issue, and I have another question about the setup: you have a WDS AP, on which you connect another LEDE device (WDS STA). But you also have regular (non-WDS) STA connected to the same WDS AP?

Do you see the same issue if only the WDS STA is connected to the AP? (i.e. forbid any other STA to associate)

Also, I noticed that you use STP, did you try to disable it? Does your network topology have a physical loop when you enable the WDS bridge? Maybe STP sometimes decides to cut the WDS bridge and then it breaks connectivity for all STA?

@openwrt-bot

cyablo:

//But you also have regular (non-WDS) STA connected to the same WDS AP?//

Yes

//Do you see the same issue if only the WDS STA is connected to the AP? (i.e. forbid any other STA to associate)//

Never tried that. Anyway, it's working fine with only normal stations connected. I will try with only the WDS station connected to that AP.

//Also, I noticed that you use STP, did you try to disable it?//

There is no physical loop, but to my knowledge (I'm Cisco certified), enabling STP on WDS devices is more or less best practice. Also, I've never seen unwanted behavior caused by enabled STP on non-redundant network topologies. I'll try disabling STP on the APs and my core switch.

@openwrt-bot

bjonglez:

Ok, thanks.

I tried the same setup with two WR841N, and even without STP, the AP was crashing every few hours. I managed to get a crashlog (see attached file) but it's completely unreadable.

Did you also see crashes (i.e. reboots) of the AP?

@openwrt-bot

bjonglez:

Here is a better crashlog:

<4>[ 2581.160504] Trap instruction in kernel code[#1]:
<4>[ 2581.165297] CPU: 0 PID: 0 Comm: swapper Not tainted 4.4.52 #0
<4>[ 2581.171239] task: 8042ef58 ti: 80428000 task.ti: 80428000
<4>[ 2581.176814] $ 0 : 00000000 804a0000 00000000 fffffffe
<4>[ 2581.182253] $ 4 : 00afc0cc 009f844e 0000007f a0e7c980
<4>[ 2581.187691] $ 8 : 009f849c 00000052 0000004a 01000000
<4>[ 2581.193129] $12 : 80000000 8000004e 00000000 00000002
<4>[ 2581.198559] $16 : 81b27000 80df7000 81a15000 81942f00
<4>[ 2581.203989] $20 : 00000000 00000000 8042a4e0 80d18400
<4>[ 2581.209428] $24 : 00000000 8007d2d4
<4>[ 2581.214866] $28 : 80428000 81809a58 8042a4e0 80272104
<4>[ 2581.220305] Hi : 00000019
<4>[ 2581.223280] Lo : 00000000
<4>[ 2581.226272] epc : 80272128 __dev_queue_xmit+0x2d8/0x4c4
<4>[ 2581.231852] ra : 80272104 __dev_queue_xmit+0x2b4/0x4c4
<4>[ 2581.237428] Status: 1100f403 KERNEL EXL IE
<4>[ 2581.241785] Cause : 00800034 (ExcCode 0d)
<4>[ 2581.245924] PrId : 0001974c (MIPS 74Kc)
<4>[ 2581.249972] Modules linked in: ath9k ath9k_common pppoe ppp_async ath9k_hw ath pppox ppp_generic nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
<4>[ 2581.309535] Process swapper (pid: 0, threadinfo=80428000, task=8042ef58, tls=00000000)
<4>[ 2581.317710] Stack : 80ecf420 8187992c 81809b2c 81879900 81b27068 00000001 80430000 fffffff4
<4>[ 2581.317710] 80ece09c 80ed7000 0000004a 80ed7000 80d18400 80ed7000 00000000 8035b5a4
<4>[ 2581.317710] 00000000 80df7000 00000000 80905e00 80f81000 80430000 80df7000 80271cd0
<4>[ 2581.317710] 80428000 00440002 00000020 0000000e 00000003 80df7000 8042a4e0 00000000
<4>[ 2581.317710] 80d18400 80271718 80430000 803fc6cc 8190af00 80ece090 80430000 80df7000
<4>[ 2581.317710] ...
<4>[ 2581.354706] Call Trace:
<4>[ 2581.357239] [<80272128>] __dev_queue_xmit+0x2d8/0x4c4
<4>[ 2581.362495] [<8035b5a4>] vlan_dev_hard_start_xmit+0x98/0x128
<4>[ 2581.368341] [<80271cd0>] dev_hard_start_xmit+0x2a8/0x354
<4>[ 2581.373834] [<80272210>] __dev_queue_xmit+0x3c0/0x4c4
<4>[ 2581.379068] [<8034a004>] br_dev_queue_push_xmit+0x16c/0x1a0
<4>[ 2581.384828] [<8034a074>] br_forward_finish+0x3c/0xb0
<4>[ 2581.389963] [<8034a304>] __br_forward+0xa8/0x114
<4>[ 2581.394741] [<8034bc5c>] br_handle_frame_finish+0x4f4/0x550
<4>[ 2581.400500] [<8034c038>] br_handle_frame+0x380/0x418
<4>[ 2581.405662] [<8026e540>] __netif_receive_skb_core+0x42c/0x898
<4>[ 2581.411768] [<80da0a04>] ieee80211_attach_ack_skb+0x103c/0x19e8 [mac80211]
<4>[ 2581.418913]
<4>[ 2581.420454]
<4>[ 2581.420454] Code: 02002025 10000014 00008825 <8e020074> 00431024 ae020074 1000000f 00008825 8e020000
<4>[ 2581.430831] ---[ end trace 8251bd1741f30e24 ]---
<0>[ 2581.437731] Kernel panic - not syncing: Fatal exception in interrupt
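When a crash log like this arrives as one long line, the call trace is the part worth pulling out first. The following is a hedged sketch that extracts the frames after the `Call Trace:` marker; the frame pattern is inferred from this particular MIPS oops only, and `FRAME_RE`, `call_trace` and `SAMPLE` are illustrative names.

```python
import re

# One frame looks like: [<80272128>] __dev_queue_xmit+0x2d8/0x4c4 [module]
# The trailing "[module]" part is optional and only present for module code.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{8})>\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def call_trace(oops_text):
    """Return [(symbol, module_or_None), ...] for frames after 'Call Trace:'."""
    after = oops_text.partition("Call Trace:")[2]
    return [(m.group("symbol"), m.group("module")) for m in FRAME_RE.finditer(after)]


# A shortened excerpt of the oops above, flattened onto one line.
SAMPLE = (
    "<4>[ 2581.354706] Call Trace: "
    "<4>[ 2581.357239] [<80272128>] __dev_queue_xmit+0x2d8/0x4c4 "
    "<4>[ 2581.362495] [<8035b5a4>] vlan_dev_hard_start_xmit+0x98/0x128 "
    "<4>[ 2581.411768] [<80da0a04>] ieee80211_attach_ack_skb+0x103c/0x19e8 [mac80211] "
    "<4>[ 2581.418913]"
)

if __name__ == "__main__":
    for symbol, module in call_trace(SAMPLE):
        print(symbol if module is None else f"{symbol} [{module}]")
```

The same function should work on a log with one entry per line, since it does not rely on line breaks.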

@openwrt-bot

cyablo:

//Did you also see crashes (i.e. reboots) of the AP?//

Not for me, no crashes, no reboots. The 2.4 GHz radio just gets unstable till all clients disconnect and are not able to reconnect. 5 GHz radio does work well meanwhile.

Edit: Installed latest trunk, disabled STP on all devices and testing now.

@openwrt-bot

bjonglez:

Ok, the issue might be somewhat different then. I opened a new bug report, FS#615.

Note that I could reliably trigger crashes even without regular STA and without STP.

@openwrt-bot

cyablo:

Problem also persists for me without using STP.

@openwrt-bot

cyablo:

Seems to be solved, either by a newer LEDE version or by adding an interface just for WDS.
