FS#4098 - MESH-SAE-AUTH-BLOCKED #9082

Open
openwrt-bot opened this issue Oct 20, 2021 · 8 comments

@openwrt-bot

nemesisdev:

  • Device problem occurs on: reported by multiple users on different devices (see libremesh/lime-packages#837); I am using a MediaTek-based one (http://www.win-star.com/en_us/product/WS_WN552K1_WN552K2_WN552K3.html)

  • I am experiencing this on current master, revision r2857+4-9d994f35b4

  • Steps to reproduce: it occurs randomly when the root node of a mesh using plain 802.11s (mesh mode) with SAE/PSK2 authentication is rebooted (or loses power). To reproduce it, one would have to keep rebooting aggressively until it happens. Turning wifi off and on may reproduce it as well.

What happens?

Sometimes the devices in a mesh can't connect to each other after a power outage or a reboot of the root node (the node which is connected to the gateway and allows the rest of the mesh to connect to the internet).

Log lines:

Oct 20 13:04:40 OpenWrt wpa_supplicant[1335]: mesh0: MESH-SAE-AUTH-FAILURE addr=*0:3f:5d:::1a
Oct 20 13:04:47 OpenWrt wpa_supplicant[1335]: mesh1: MESH-SAE-AUTH-FAILURE addr=*0:3f:5d:::1b
Oct 20 13:04:59 OpenWrt wpa_supplicant[1335]: mesh0: MESH-SAE-AUTH-FAILURE addr=*0:3f:5d:::1a
Oct 20 13:05:01 OpenWrt wpa_supplicant[1335]: mesh1: MESH-SAE-AUTH-FAILURE addr=*0:3f:5d:::1b
Oct 20 13:05:11 OpenWrt wpa_supplicant[1335]: mesh0: MESH-SAE-AUTH-FAILURE addr=*0:3f:5d:::1a
Oct 20 13:05:12 OpenWrt wpa_supplicant[1335]: mesh1: MESH-SAE-AUTH-FAILURE addr=*0:3f:5d:::1b
Oct 20 13:05:24 OpenWrt wpa_supplicant[1335]: mesh0: MESH-SAE-AUTH-FAILURE addr=*0:3f:5d:::1a
Oct 20 13:05:24 OpenWrt wpa_supplicant[1335]: mesh0: MESH-SAE-AUTH-BLOCKED addr=*0:3f:5d:::1a duration=300
Oct 20 13:05:26 OpenWrt wpa_supplicant[1335]: mesh1: MESH-SAE-AUTH-FAILURE addr=*0:3f:5d:::1b
Oct 20 13:05:26 OpenWrt wpa_supplicant[1335]: mesh1: MESH-SAE-AUTH-BLOCKED addr=*0:3f:5d:::1b duration=300
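
To see at a glance how many failures precede each block, log lines like the ones above can be tallied per interface and peer. This is only a rough sketch: the function name summarize_sae_events is made up for illustration, and it assumes the line layout shown above ("<ifname>: <EVENT> addr=<peer>").

```sh
#!/bin/sh
# Summarize SAE auth events from wpa_supplicant syslog lines read on stdin.
# Prints one line per (interface, peer):
#   "<ifname> <addr> failures=<n> blocked=<n>"
# Assumes the event name is preceded by "<ifname>:" and followed by "addr=<peer>".
summarize_sae_events() {
    awk '
        /MESH-SAE-AUTH-(FAILURE|BLOCKED)/ {
            for (i = 2; i < NF; i++) {
                if ($i ~ /^MESH-SAE-AUTH-/) {
                    ifname = $(i - 1); sub(/:$/, "", ifname)
                    addr = $(i + 1); sub(/^addr=/, "", addr)
                    key = ifname " " addr
                    seen[key] = 1
                    if ($i == "MESH-SAE-AUTH-FAILURE") fail[key]++
                    else blocked[key]++
                }
            }
        }
        END {
            for (key in seen)
                printf "%s failures=%d blocked=%d\n", key, fail[key], blocked[key]
        }
    ' | sort
}
```

Usage would be something like "logread | summarize_sae_events" on the affected node.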

When this happens, the links show up in "iw mesh0 station dump" or "iw mesh1 station dump" but in BLOCKED state.
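
A small filter can pull the blocked peers out of the station dump. This is a sketch under assumptions: blocked_peers is a made-up helper name, and it assumes each entry starts with a "Station <mac> ..." line and carries a "mesh plink: <STATE>" line, as iw prints on the builds I have seen.

```sh
#!/bin/sh
# Print the MAC address of every station whose "mesh plink" state is BLOCKED.
# Reads the output of a station dump on stdin.
blocked_peers() {
    awk '
        $1 == "Station" { mac = $2 }
        $1 == "mesh" && $2 == "plink:" && $3 == "BLOCKED" { print mac }
    '
}
```

For example: iw mesh0 station dump | blocked_peers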

Rebooting the nodes which have their link blocked, all at the same time, fixes the issue. That seems to rule out interference as the cause, because a reboot could hardly fix an interference problem.

I also tried setting "cell_density '1'" in the configuration of the radios, but the problem keeps happening. It doesn't happen often, but when it does it can wreak havoc.

The mesh configuration is the following:

config wifi-device 'radio0'
option type 'mac80211'
option channel '11'
option hwmode '11g'
option path '1e140000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0'
option htmode 'HT20'
option disabled '0'
option log_level '0'
option legacy_rates '0'
option country 'US'
option cell_density '1'

config wifi-device 'radio1'
option type 'mac80211'
option hwmode '11a'
option path '1e140000.pcie/pci0000:00/0000:00:01.0/0000:02:00.0'
option htmode 'VHT80'
option disabled '0'
option log_level '0'
option channel '40'
option country 'US'
option cell_density '1'

config wifi-iface 'wifi_mesh0'
option device 'radio0'
option ifname 'mesh0'
option mode 'mesh'
option encryption 'psk2+ccmp'
option key ''
option mesh_id ''
option network 'lan'
option mesh_fwding '1'
option mesh_rssi_threshold '-80'

config wifi-iface 'wifi_mesh1'
option device 'radio1'
option ifname 'mesh1'
option mode 'mesh'
option encryption 'psk2+ccmp'
option key ''
option mesh_id ''
option network 'lan'
option mesh_fwding '1'
option mesh_rssi_threshold '-80'

config wifi-iface 'wifi_wlan0'
option device 'radio0'
option ifname 'wlan0'
option mode 'ap'
option encryption 'psk2'
option key ''
option ssid ''
option network 'lan'
option ieee80211r '1'
option ft_psk_generate_local '1'
option rsn_preauth '1'
option reassociation_deadline '20000'
option ft_over_ds '1'

config wifi-iface 'wifi_wlan1'
option device 'radio1'
option ifname 'wlan1'
option mode 'ap'
option encryption 'psk2'
option key ''
option ssid ''
option network 'lan'
option ieee80211r '1'
option ft_psk_generate_local '1'
option rsn_preauth '1'
option reassociation_deadline '20000'
option ft_over_ds '1'

@openwrt-bot
Author

nemesisdev:

The exact commit of my OpenWrt master build is ade56b8d9e.

@openwrt-bot
Author

Steve-Newcomb:

We have this problem too. It occurs in two of our three meshes. It is much more frequent lately. I do not know whether it is merely coincidental that we recently upgraded from 21.01 to 21.02.

My current solution is to maintain a pair of openssh tunnels between each dhcp server (in which gw_mode='server') and each client (in which gw_mode='client'). If a dhcp server finds itself with no clients that are (still?) in contact with it, it reboots. If a client finds itself with no dhcp server that is (still?) in contact with it, it reboots. It's a ridiculously heavy solution which is a lot of trouble to set up in a secure manner, but it has the advantage that each node can detect whether it is in contact with the node(s) with which it has one or more critical relationships.
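
The "reboot when isolated" decision itself can be sketched without the tunnels. To be clear, this is not the SSH-based setup described above, just a simplified illustration of the same idea: watchdog_step, the state file and the threshold are all hypothetical, and the peer count would have to come from whatever reachability check fits the network.

```sh
#!/bin/sh
# Track consecutive checks with zero reachable peers and decide when to reboot.
# Usage: watchdog_step <state_file> <peer_count> <threshold>
# Prints "reboot" once <threshold> consecutive checks saw no peers, else "ok".
watchdog_step() {
    state_file=$1
    peers=$2
    threshold=$3
    misses=0
    if [ -r "$state_file" ]; then
        misses=$(cat "$state_file")
    fi
    if [ "$peers" -gt 0 ]; then
        # Any reachable peer resets the counter.
        misses=0
    else
        misses=$((misses + 1))
    fi
    echo "$misses" > "$state_file"
    if [ "$misses" -ge "$threshold" ]; then
        echo reboot
    else
        echo ok
    fi
}
```

Run from cron, a caller would act only when the function prints "reboot", so a single missed check does not take the node down.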

I suspect this problem is actually a driver issue. These are all Archer [CA]7 v [245] routers (affordable!) with QCA "wave1" radios. I haven't been able to use the -CT (Candela Technologies) driver for those radios in a mesh; perhaps I haven't understood the advice I've received about that, or perhaps the advice just doesn't work. Therefore, I have to use the stock (QCA) driver's inherent 802.11s implementation, which has quirks. For example, it always fails, usually within hours or minutes, if I have tweaked the radio's built-in MAC address. Therefore, I suspect the QCA firmware may be insufficiently hardened against the depredations of real-world environments.

On the other hand, this could be a real OpenWRT bug. I have no explanation as to why it is suddenly so much more frequent. If anyone can suggest debugging instrumentation that I haven't already tried, I'll be grateful for the advice.

@openwrt-bot
Author

EelcoV:

Currently I am not using openwrt, but I have/had a similar issue. This had to do with "too" many clients trying to connect to the mesh peer at the same time. It then also got into the PLINK_BLOCKED state.

First of all, I removed setting the PLINK_BLOCKED state when authentication fails several times (couldn't find it in the ieee802.11 standard anyway...). Then I noticed a lot of "anti-clogging" messages (see also chapter 12.4.6 in ieee802.11 standard). This mechanism will start sending tokens along with frames to reduce the number of peers which are allowed to perform authentication at the same time. This then led to peers getting blocked because they were not allowed to authenticate.

Maybe you can check your logs for this kind of message. Also, when you try to reproduce the issue, make sure you have a lot of peers (I had to have more than 5 peers...)
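
A quick way to check for such messages is to count the log lines that mention anti-clogging. This is only a sketch: count_anti_clogging is a made-up helper, the exact wording of the messages is an assumption (hence the loose, case-insensitive match), and they may only show up at a more verbose log level.

```sh
#!/bin/sh
# Count log lines on stdin that mention anti-clogging, ignoring case.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the
# exit status clean while still printing 0.
count_anti_clogging() {
    grep -ci 'anti-clogging' || true
}
```

For example: logread | count_anti_clogging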

I have posted my original issue here, maybe this helps to get more insight into the issue. http://lists.infradead.org/pipermail/hostap/2021-December/040095.html

@yogo1212
Contributor

yogo1212 commented Oct 9, 2022

I'm getting the same messages since a recent rebase.
It was working in fbf6992f2b8960cbca36cd652bcdc71d69931076.

The setup is: mesh point with fwding=0,ttl=1 with batman for routing on various mt7621 devices (Cudy WR2100, Cudy M1800, Dual-Q M721), wpad-openssl.
A script generates the config from defconfig and selects only what's needed. That's mostly the same.
The wifi configuration is also generated and it's identical in both broken and working states.

There were two seemingly unrelated updates of openssl and a bunch more for mt76. Hostapd is the same.
Might be a timing issue?

@yogo1212
Contributor

yogo1212 commented Jan 6, 2023

In my test setup with two mt7621 devices, wpa_supplicant is constantly at 10% CPU - even if both devices have their peer in the 'blocked' state. The moment the other device is turned off or its radio disabled (wifi down), cpu usage goes down again.

 2040  2037 network  R     5500   5%  12% /usr/sbin/wpa_supplicant -n -s -g /var/run/wpa_supplicant/global
  690     1 ubus     S     1360   1%   4% /sbin/ubusd

I'm giving it another go. If only coz was easy to use on openwrt..

@yogo1212
Contributor

yogo1212 commented Jan 6, 2023

I went at it using printf-debugging in wpa_supplicant. Modifications were only made on one router.
All day, there were no successful handshakes at all.

I started from handle_auth_sae and iterated to sae_sm_step. I added this statement:

 
       sae_set_retransmit_timer(hapd, sta);
     } else {
+wpa_printf(MSG_ERROR, "sae_sm_step state confirmed, accepting client. send_confirm: %u", (unsigned) sta->sae->send_confirm);
       sta->sae->send_confirm = 0xffff;
       sae_accept_sta(hapd, sta);
     }
     break;

.. it didn't hit and I decided to give up for the day.
Just for the sake of it, I copied over the printf-ridden wpad binary to the other router as well - and TADA:

Fri Jan  6 15:50:22 2023 daemon.notice wpa_supplicant[24246]: mesh24_0: mesh plink with 00:0c:43:26:46:08 established
Fri Jan  6 15:50:22 2023 daemon.notice wpa_supplicant[24246]: mesh24_0: MESH-PEER-CONNECTED 00:0c:43:26:46:08

Reliable handshakes!

My only changes are printfs, nothing else. No compiler flags or configuration change 🤔 🤯
Revert back to the original binary on one router and it stops working immediately.

So atm, I'm thinking 'race condition'. I won't have time for this until next week, so I'm leaving this here for the moment.

@yogo1212
Contributor

yogo1212 commented Jan 9, 2023

The single statement that decides whether the handshake succeeds or not (provided functioning wifi, correct password etc.) is:

--- a/wpa_supplicant/mesh_rsn.c
+++ b/wpa_supplicant/mesh_rsn.c
@@ -357,6 +357,7 @@ int mesh_rsn_auth_sae_sta(struct wpa_sup
        struct rsn_pmksa_cache_entry *pmksa;
        unsigned int rnd;
        int ret;
+wpa_printf(MSG_ERROR, "mesh_rsn_auth_sae_sta\n");
 
        if (!ssid) {
                wpa_msg(wpa_s, MSG_DEBUG,

i'm replacing it with a sleep now and will try to move it 'forward', in the hope of getting closer to the step that is susceptible to the timing.

@yogo1212
Contributor

yogo1212 commented Jan 9, 2023

i'm giving up :-(

the printf only becomes 'effective' when there's a logread -f | grep wpa_suppl running (SSH over ethernet) and i haven't found a good duration for sleep (assuming something between 1 and 10 ms).
this probably requires more knowledge about the SAE handshake and engineering skill than i have.

should anyone want to pick this up:
i was "moving" a call to nanosleep forward through the code handling mesh auth requests.
my hope was to find a piece of code within hostapd after which the sleep would be ineffective (not altering the success of the handshake).

run ln -sf /tmp/wpad /usr/sbin/wpa_supplicant once on both routers.
then deploy the changes of each iteration to both routers like this:

#!/bin/sh -e

make package/hostapd/compile V=s

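copy() {
  # remove the old binary first; -f so a missing file doesn't abort the script under -e
  ssh "root@$1" rm -f /tmp/wpad
  scp -O build_dir/target-mipsel_24kc_musl/hostapd-wpad-full-openssl/hostapd-2022-07-29-b704dc72/ipkg-mipsel_24kc/wpad-openssl/usr/sbin/wpad "root@$1:/tmp/wpad"
  # restart wifi so the freshly copied binary is the one that runs
  ssh "root@$1" sh -c "'killall wpa_supplicant ; wifi'"
}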

copy router_a
copy router_b

# if you like;
tmux new-session -d -s debug_mesh_auth ssh router_a
tmux split-window -t debug_mesh_auth:0.0 -h ssh router_b
tmux attach -t debug_mesh_auth:0.0
