
FS#44 - Ath9k - TX performance regression on greater coverage settings #6030

Closed
openwrt-bot opened this issue Jul 5, 2016 · 12 comments

@openwrt-bot

xback:

Synopsis:

When increasing the coverage distance on the transmitting device, frame aggregation is heavily reduced,
even when conditions are perfect.

This is a regression compared to the Barrier Breaker release.

Hardware used:
2x cns3xxx GW-2388, each containing a single SR71-15

Setup:

  • 2 devices are 1 meter apart.
  • RSSI = -40
  • Completely free channel
  • Using wpa_supplicant, AES2, HT20 rates, IBSS

Repro rate:
100%

Repro steps:

  • Set coverage distance on device 1 to >10000m
  • Set Device 1 as iperf client
  • Set Device 2 as iperf server

Observations:

  • iwinfo and iw wlanx station dump reveal a 130Mbit/s raw link rate and 45Mbit/s expected net throughput
  • The actual throughput is stuck at 7Mbit/s
  • Checking debugfs-xmit reveals that frame aggregation is at a rate of 2 frames/s
  • When setting the coverage range of device 1 to 50m, the aggregation rate is easily >5000 frames/s
  • Switching to HT40 yields the same issues.

Other observations:

  • Setting coverage to 12000m and starting iperf yields 7Mbit/s
  • Setting coverage to 50m and starting iperf yields >55Mbit/s
  • Setting coverage to 50m, starting iperf and increasing coverage to >10000m while running reduces the rate to 25Mbit/s.
    When stopping and restarting iperf, it drops again to 7Mbit/s

In the Barrier Breaker release, increasing the coverage distance to 10000m dropped the throughput to ~50Mbit/s instead of 7Mbit/s.

@openwrt-bot

xback:

Gathered more data:

  • Set distance to 50m
  • Start iperf
  • Get data from debugfs xmit
  • Wait exactly 5000ms and grab it again
  • Subtract the values

--> Same procedure for 14400m

It seems fewer frames get queued in the TX queue for some reason ..

See attached PDF
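The subtraction step above can be sketched in shell. The counter values below are stand-ins rather than real debugfs output, and the path in the comment is just where ath9k typically exposes its xmit stats:

```shell
# On a live system the two samples would come from reading the ath9k xmit
# debugfs file twice, e.g. /sys/kernel/debug/ieee80211/phy0/ath9k/xmit,
# with "sleep 5" in between. Stand-in sample counters are used here instead.
before=1000     # frames completed in the first sample (illustrative value)
after=26000     # frames completed 5000ms later (illustrative value)
echo "$(( (after - before) / 5 )) frames/s"
```

With these stand-in numbers the script prints 5000 frames/s, the order of magnitude observed for the 50m setting.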

@openwrt-bot

Bluse:

Hi,

While reading through your report, the following question comes to mind.
If your devices are 1m apart from each other on your desk and you increase the coverage distance, and hence the underlying ACK timeout, I would expect a decrease in throughput, as you are running a close-by system with long-range settings .. no?

Greetings Bluse

ps: Please provide the LEDE version/commit your test setup is running on so I can reproduce on my side

@openwrt-bot

xback:

Hi Bluse,

Thanks for your response.

You can take the latest available trunk when reading this.

No.
The ACK timeout is only the maximum time the driver is allowed to wait for the ACK to arrive.
When it arrives sooner, the process should continue.

In my setup using the Barrier Breaker release, the expected (and measured) performance penalty is about 10%:

  • Distance set to 50m, HT20, LGI: ~55Mbit
  • Distance set to 10000m, HT20, LGI: ~50Mbit

On the latest trunk, the penalty is 83%

Also, before anyone suggests:

  • Using dynack is not possible in our use case.
    --> Our mesh hardware consists of 4x 90° WLAN sector antennas in order to form a 360° high-power field.
    --> When using dynack, the weighted average is kept low due to "self inductance" of the 3 other WLAN sides, as dynack does not seem to keep the average on a per-station basis

I've spent nearly a week trying to fix it, but so far can only report some observations:

  • On 50m, the avg A-MPDU frames per xmit is 8
  • On >10000m, the avg A-MPDU frames per xmit is 0

The attached log in the previous comment shows that fewer frames actually get scheduled, so it makes sense that fewer frames get aggregated.

Three things I can think of that could be causing it:

  • The code always waits the max ACK time between scheduling frames, even when an ACK is received sooner
  • The code is waiting somewhere, based on the calculated tx_time
  • The minstrel algo provides a lower TX rate to the driver, based on calculated tx_time (while reporting a higher rate to userland)

Thanks for your help.
If you need more info, please let me know.

Koen

@openwrt-bot

xback:

Hi Bluse,

Were you able to reproduce this?

Thanks again,

Koen

@openwrt-bot

Bluse:

Hi Koen,

Your observation is by design of the standard and not a malfunction; check IEEE 802.11-2007 17.3.8.6 ([IEEE 802.11-2007](http://www.ie.itcr.ac.cr/acotoc/Ingenieria/Lab%20TEM%20II/Antenas/Especificacion%20802%2011-2007.pdf)):
"Where dot11RegulatoryClassesRequired is true, the value of the slot time
shall be increased by the value of 3 μs × coverage class."

So the coverage class set by the UCI distance option will increase both the ACK timeout and the slot time. An increased slot time will decrease throughput, as the channel access timings are relaxed.
Since this commit ([add coverage class to slot time](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f77f8234409978fefa0422b12a497451173e39b3)), the slot time is a function of the coverage class. So I guess your other observation with an older Barrier Breaker release was made on a mac80211 subsystem without this patch, hence only the ACK timeout was affected by the coverage class settings.
You could revert the patch from your current release and test the performance in your setup. But keep in mind that the idea of increasing the slot time along with the coverage class, hence the propagation delay, is based on the fact that traffic typically flows bidirectionally, and an increased slot time therefore gives the other direction a higher channel access probability.
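For intuition, the slot-time arithmetic can be sketched in shell. The 450m-per-class mapping and the 9μs OFDM base slot are assumptions here about how iw/mac80211 derive the coverage class from the distance option:

```shell
distance=14000                                # metres, as set via "iw phy0 set distance"
coverage_class=$(( (distance + 449) / 450 ))  # ~450 m per coverage class (assumed mapping)
slot_us=$(( 9 + 3 * coverage_class ))         # 9 us OFDM base slot + 3 us per class
echo "coverage class ${coverage_class}, slot time ${slot_us} us"
```

At 14000m this gives a slot time more than ten times the 9μs default, which illustrates why relaxed channel access timings cost so much throughput.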

Greetings from Berlin
Stoffel & Bluse

@openwrt-bot

xback:

Hi Bluse,
Hi Stoffel,

Thanks again for your time looking into this.

When I encountered this "bug", this was the first part I checked :)
This patch has been applied in *both* builds (Barrier Breaker & trunk).

So it's not the reason for the regression.

I'll upload the mac80211 (incl ath9k) from both my builds tomorrow and provide a link.

I've checked and compared the full ath9k driver to search for the root cause,
and added some prints in xmit.c (ath_tx_start()) where frames get scheduled in the aggregation queue.

When this bug occurs (coverage increased), mac80211 is calling this function at a lower pace.
Every frame entering gets scheduled in the aggregation queue, but the queue only contains 1 .. 2 frames max before the session time ends and the frames in it get transmitted.

Thanks again,

Koen

@openwrt-bot

Bluse:

Hi Koen,

I just set up a two-router experiment on my desk.
Latest LEDE HEAD, r1227 on both.
2.4 GHz HT20, Laptop_A ->ETHERNET-> AP -> STA ->ETHERNET-> Laptop_B
Router-A: PC Engines APU 2c4 & MikroTik 2x2 ath9k
Router-B: TP-Link tl-wr1042 v3 3x3 ath9k
on Laptop A: iperf -s -u
on Laptop B: iperf -c IP(Laptop_A) -u -l 1400 -b 100M -t 30

on AP: disable minstrel_ht rate control and set a fixed rate of MCS 15
echo 72 > /sys/kernel/debug/ieee80211/phy0/rc/fixed_rate_idx

scenario 1: distance set to 50
udp throughput is ~42MBit/s
aggregates ~ 4000 (from xmit)
max frames per aggregate: 32 (from /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/MAC_ADDRESS/rc_stats)

scenario 2: distance set to 14000
udp throughput is ~20MBit/s
aggregates ~ 2000 (from xmit)
max frames per aggregate: 31 (from /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/MAC_ADDRESS/rc_stats)

So throughput got halved when going to 14000m, and aggregation shows the same number of packets per aggregate and, obviously, half the overall aggregates, since throughput is halved. I do not have a feeling for how much decrease in throughput to expect when the coverage class goes up and the slot time is increased, but it looks like a result I would somehow expect.

Could you re-run your experiment with the slot-time adjustment patch reverted?
(I am short on time but could test this maybe in the next days)

Greetings from Berlin
Bluse

@openwrt-bot

xback:

Hi Bluse,

> 2,4 GHz HT20

I'm testing on 5GHz IBSS mode

> on Laptop A: iperf -s -u
> on Laptop B: iperf -c IP(Laptop_A) -u -l 1400 -b 100M -t 30

Could you try:
Laptop A: iperf -s
Laptop B: iperf -c IP -i 1 -t 30

> scenario 1: distance set to 14000
> udp throughput is ~20MBit/s
> aggregates ~ 2000 (from xmit)
> max frames per aggregate: 31 (from /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/MAC_ADDRESS/rc_stats)

Did you restart the iperf client on B after increasing the Cov Class?

In my case:

  • When I start on 50m i get ~55Mbit/s
  • When changing to 14000m on Device B while iperf is running, it drops to ~30Mbit/s (normal)
  • When i stop iperf client on B, and restart it again, I'm stuck at 6Mbit/s

When checking this part:

> aggregates ~ 2000 (from xmit)
> max frames per aggregate: 31 (from /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/MAC_ADDRESS/rc_stats)

In step1: Avg Aggregated frames: ~6.5
In step2: Avg Aggregated frames: ~2.5
In step3: Avg Aggregated frames: 1.0

step1 bw:

[ 4] 17.00-18.01 sec 8.57 MBytes 71.4 Mbits/sec 0 400 KBytes
[ 4] 18.01-19.01 sec 8.39 MBytes 70.5 Mbits/sec 0 400 KBytes
[ 4] 19.01-20.00 sec 7.93 MBytes 66.8 Mbits/sec 0 400 KBytes
[ 4] 20.00-21.00 sec 8.50 MBytes 71.1 Mbits/sec 0 400 KBytes

step2 bw:

[ 4] 13.00-14.00 sec 8.49 MBytes 71.3 Mbits/sec 0 387 KBytes
[ 4] 14.00-15.00 sec 8.59 MBytes 71.9 Mbits/sec 0 387 KBytes
[ 4] 15.00-16.00 sec 8.42 MBytes 70.7 Mbits/sec 0 387 KBytes
[ 4] 16.00-17.00 sec 3.47 MBytes 29.2 Mbits/sec 0 387 KBytes (iw phy0 set distance 14000)
[ 4] 17.00-18.00 sec 3.31 MBytes 27.7 Mbits/sec 0 387 KBytes
[ 4] 18.00-19.00 sec 3.05 MBytes 25.6 Mbits/sec 0 387 KBytes

Step3 bw:

[ 4] 215.00-216.00 sec 905 KBytes 7.41 Mbits/sec 0 31.1 KBytes
[ 4] 216.00-217.00 sec 970 KBytes 7.95 Mbits/sec 0 31.1 KBytes
[ 4] 217.00-218.00 sec 781 KBytes 6.39 Mbits/sec 0 31.1 KBytes
[ 4] 218.00-219.00 sec 823 KBytes 6.74 Mbits/sec 0 31.1 KBytes

IBSS wpa_supplicant config settings for both ends:

ctrl_interface=DIR=/tmp/run/wpa_supplicant GROUP=root
update_config=1
ap_scan=2
country=BE
beacon_int=100
network={
ssid="TESTCHAN136"
frequency=5680
bssid=F8:79:77:57:AA:7D
mode=1
disable_ht=0
disable_ht40=0
disable_sgi=1
proto=WPA2
key_mgmt=WPA-PSK
pairwise=CCMP
group=CCMP
psk="debugsessionpass"
}

Please let me know if you need more info,
Thanks again,

Koen

@openwrt-bot

Bluse:

Hi Koen,

In order to troubleshoot your performance decrease, I suggest disabling the control loops on the different network layers which try to adapt the throughput:

- use UDP and not TCP for your test setup, to saturate the IP packet transfer
- disable rate control in the MAC layer (by setting a fixed_rate_idx)

What do 50m and 14000m give in throughput / packets per aggregate?
How does it run with the patch in question reverted?

Greetings from Berlin
Bluse

@openwrt-bot

xback:

This is a bug in kernels starting from 3.19.0.
This commit is the reason:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=v3.19-rc1&id=605ad7f184b60cfaacbc038aa6c55ee68dee3c89

I've made a patch which fixes the issue; it is waiting for approval from the initial patch creator.

@openwrt-bot

xback:

The patch will not be accepted as it basically reverts this commit.

The main issue is that TCP does not ramp up, because mac80211 queueing introduces delay on the TCP ACKs, which is amplified when the coverage distance is increased.

Therefore TCP is stuck at this low rate.

I exchanged some mails with Eric Dumazet, and he points out that mac80211/ath9k should complete TX faster in order to ramp up TCP speed.

At this point, I lack the detailed knowledge to properly investigate/fix this.

Hopefully later on ..

@openwrt-bot

xback:

For archival reasons, I'll post my custom patch below, which fixes single-stream TCP performance issues at greater coverage distances. [1]
It enlarges the TCP TX buffer to compensate for completion delays.

Since the change in the TCP stack which causes this, an ACK has to be received within roughly 1ms in order to ramp up speed, which is nearly impossible to satisfy on this kind of setup.
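The `sk_pacing_rate >> 10` term in the affected code is where that ~1ms budget comes from: shifting a bytes-per-second rate right by 10 divides it by 1024, i.e. roughly one millisecond's worth of data. A quick sketch using the ~55Mbit/s rate from my earlier tests (numbers illustrative):

```shell
pacing_rate=$(( 55 * 1000 * 1000 / 8 ))  # ~55 Mbit/s expressed in bytes/s
tsq_budget=$(( pacing_rate >> 10 ))      # >> 10 ~= /1024 ~= one millisecond of data
echo "~${tsq_budget} bytes of un-ACKed data allowed in flight"
```

That is only about 6.7KB, a handful of frames, so once TX completion takes longer than that millisecond there is almost nothing left in the queue to aggregate.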

Also, a nice discussion can be found here: [2]

After this post, I'll request for closure as this is not a LEDE issue.

[1]

```diff
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2065,7 +2065,7 @@ static bool tcp_small_queue_check(struct
 	unsigned int limit;
 
 	limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
-	limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+	limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
 	limit <<= factor;
 
 	if (atomic_read(&sk->sk_wmem_alloc) > limit) {
```

[2]

https://patchwork.kernel.org/patch/5779661/
