OpenWrt/LEDE Project

  • Status Unconfirmed
  • Percent Complete 0%
  • Task Type Bug Report
  • Category Kernel
  • Assigned To No-one
  • Operating System All
  • Severity Medium
  • Priority Very Low
  • Reported Version All
  • Due in Version Undecided
  • Due Date Undecided
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Polynomdivision - 07.10.2021

FS#4066 - ipq40xx: Switch (ar40xx) freezes

The ar40xx switch on ipq40xx devices freezes from time to time. The issue is hard to reproduce because it has been seen on multiple devices on very different time scales. We run a mesh network with olsr (IPv4) and babeld (IPv6) as routing daemons. The affected devices are mainly Fritz!Box 4040 and Fritz!Box 7530. Typically, the switch freezes on setups under heavy load. For example, a church that acts as a central point of our mesh network and as its connection to the internet experiences the freeze daily, so we can easily reproduce the issue and test solutions at that location. Devices with fewer clients and just one mesh connection also freeze, but it takes longer (about 30 days). Some devices with almost no traffic or clients do not freeze at all.

As a workaround, we wrote the naywatch daemon, which checks for IPv6 link-local connectivity. It can also collect debug output. With that, we can show a diff of the swconfig output taken while the switch is already frozen. As you can see, port 0 (the CPU port) no longer sends any frames to the CPU (no Tx counters appear in the diff), while the switch still receives frames.

Port 0:
        mib: Port 0 MIB counters
-RxBroad     : 472793
+RxBroad     : 472882
 RxPause     : 0
-RxMulti     : 36772
+RxMulti     : 36854
 RxFcsErr    : 0
 RxAlignErr  : 0
 RxRunt      : 0
 RxFragment  : 0
-Rx64Byte    : 2075611
-Rx128Byte   : 20964917
-Rx256Byte   : 16459560
-Rx512Byte   : 623700
-Rx1024Byte  : 907303
-Rx1518Byte  : 48003389
+Rx64Byte    : 2075618
+Rx128Byte   : 20964980
+Rx256Byte   : 16459583
+Rx512Byte   : 623752
+Rx1024Byte  : 907344
+Rx1518Byte  : 48003397
 RxMaxByte   : 21422694
 RxTooLong   : 0
-RxGoodByte  : 107718798589
+RxGoodByte  : 107718869396
 RxBadByte   : 0
 RxOverFlow  : 0
 Filtered    : 0
@@ -51,38 +51,38 @@
        link: port:0 link:up speed:1000baseT full-duplex txflow rxflow
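
The same observation (Rx counters on a port change while none of its Tx counters show up in the diff) can be checked mechanically. The following is a minimal sketch in Python, assuming the unified diff format of the `swconfig dev switch0 show` logs shown here. The function names `changed_counters` and `tx_stalled_ports` are hypothetical helpers for illustration; they are not part of swconfig or naywatch.

```python
import re
from collections import defaultdict

# " Port 0:" context line of the diff (also matches an unprefixed "Port 0:")
PORT_RE = re.compile(r"^[ +-]?\s*Port (\d+):")
# "+RxBroad     : 472882" -> counter name on an added line
COUNTER_RE = re.compile(r"^\+((?:Rx|Tx)\w+)\s*:\s*\d+")


def changed_counters(diff_text):
    """Map port number -> set of Rx*/Tx* counter names that changed.

    Counters in a hunk that has no "Port N:" context line cannot be
    attributed to a port and are collected under the key None.
    """
    changed = defaultdict(set)
    port = None
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            port = None  # new hunk: the port is unknown until a header appears
            continue
        m = PORT_RE.match(line)
        if m:
            port = int(m.group(1))
            continue
        m = COUNTER_RE.match(line)
        if m:
            changed[port].add(m.group(1))
    return dict(changed)


def tx_stalled_ports(diff_text):
    """Ports whose Rx counters changed while no Tx counter did."""
    return sorted(
        port
        for port, names in changed_counters(diff_text).items()
        if port is not None
        and any(n.startswith("Rx") for n in names)
        and not any(n.startswith("Tx") for n in names)
    )


# Shortened sample built from the counters in this report.
SAMPLE = "\n".join([
    "@@ -7,22 +7,22 @@",
    "        linkdown: ???",
    " Port 0:",
    "        mib: Port 0 MIB counters",
    "-RxBroad     : 472793",
    "+RxBroad     : 472882",
    " RxPause     : 0",
    "@@ -51,38 +51,38 @@",
    "        link: port:0 link:up speed:1000baseT full-duplex txflow rxflow",
    " Port 1:",
    "-RxBroad     : 107158",
    "+RxBroad     : 107267",
    "-TxBroad     : 856851",
    "+TxBroad     : 856940",
])

print(tx_stalled_ports(SAMPLE))
```

On this shortened sample only port 0 is reported. A port with no traffic at all is not reported either, since none of its counters change, so the check is only meaningful on ports that are known to carry traffic.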

I suspect one of the following:
- The switch hardware is broken.
- The switch is misconfigured.

However, there seems to be a DSA implementation of the ar40xx driver that should be released some day, so it may be better to switch to DSA before fixing this issue. I have already been in contact with blocktrron and, to my understanding, he was also able to observe a freeze.

Rest of diff:

root@emma-core:~# diff -u 1633480443-swconfig\ dev\ switch0\ show.log 1633480463-swconfig\ dev\ switch0\ show.log
--- "1633480443-swconfig dev switch0 show.log"  2021-10-06 02:34:03.000000000 +0200
+++ "1633480463-swconfig dev switch0 show.log"  2021-10-06 02:34:23.000000000 +0200
@@ -7,22 +7,22 @@
        linkdown: ???
 Port 0:
        mib: Port 0 MIB counters
-RxBroad     : 472793
+RxBroad     : 472882
 RxPause     : 0
-RxMulti     : 36772
+RxMulti     : 36854
 RxFcsErr    : 0
 RxAlignErr  : 0
 RxRunt      : 0
 RxFragment  : 0
-Rx64Byte    : 2075611
-Rx128Byte   : 20964917
-Rx256Byte   : 16459560
-Rx512Byte   : 623700
-Rx1024Byte  : 907303
-Rx1518Byte  : 48003389
+Rx64Byte    : 2075618
+Rx128Byte   : 20964980
+Rx256Byte   : 16459583
+Rx512Byte   : 623752
+Rx1024Byte  : 907344
+Rx1518Byte  : 48003397
 RxMaxByte   : 21422694
 RxTooLong   : 0
-RxGoodByte  : 107718798589
+RxGoodByte  : 107718869396
 RxBadByte   : 0
 RxOverFlow  : 0
 Filtered    : 0
@@ -51,38 +51,38 @@
        link: port:0 link:up speed:1000baseT full-duplex txflow rxflow
 Port 1:
        mib: Port 1 MIB counters
-RxBroad     : 107158
+RxBroad     : 107267
 RxPause     : 0
-RxMulti     : 1147
+RxMulti     : 1173
 RxFcsErr    : 0
 RxAlignErr  : 0
 RxRunt      : 0
 RxFragment  : 0
-Rx64Byte    : 1536
-Rx128Byte   : 68123
-Rx256Byte   : 20668
-Rx512Byte   : 4912
-Rx1024Byte  : 10327
-Rx1518Byte  : 90128
-RxMaxByte   : 9851
+Rx64Byte    : 1555
+Rx128Byte   : 68180
+Rx256Byte   : 20673
+Rx512Byte   : 4925
+Rx1024Byte  : 10332
+Rx1518Byte  : 90209
+RxMaxByte   : 9860
 RxTooLong   : 0
-RxGoodByte  : 166901240
+RxGoodByte  : 167051151
 RxBadByte   : 0
 RxOverFlow  : 0
-Filtered    : 118
-TxBroad     : 856851
-TxPause     : 1074
-TxMulti     : 89960
+Filtered    : 307
+TxBroad     : 856940
+TxPause     : 3442
+TxMulti     : 90042
 TxUnderRun  : 0
-Tx64Byte    : 38606
-Tx128Byte   : 206906
-Tx256Byte   : 56950
-Tx512Byte   : 27986
-Tx1024Byte  : 60718
-Tx1518Byte  : 677209
+Tx64Byte    : 40978
+Tx128Byte   : 206965
+Tx256Byte   : 56973
+Tx512Byte   : 28038
+Tx1024Byte  : 60759
+Tx1518Byte  : 677217
 TxMaxByte   : 74048
 TxOverSize  : 0
-TxByte      : 1186157213
+TxByte      : 1186378970
 TxCollision : 0
 TxAbortCol  : 0
 TxMultiCol  : 0
@@ -95,38 +95,38 @@
        link: port:1 link:up speed:1000baseT full-duplex txflow rxflow auto
 Port 2:
        mib: Port 2 MIB counters
-RxBroad     : 170588
+RxBroad     : 170832
 RxPause     : 0
-RxMulti     : 11717
+RxMulti     : 11767
 RxFcsErr    : 0
 RxAlignErr  : 0
 RxRunt      : 0
 RxFragment  : 0
-Rx64Byte    : 28452
-Rx128Byte   : 6337895
-Rx256Byte   : 408749
-Rx512Byte   : 54975
-Rx1024Byte  : 62053
-Rx1518Byte  : 343977
-RxMaxByte   : 130630
+Rx64Byte    : 28455
+Rx128Byte   : 6338150
+Rx256Byte   : 408825
+Rx512Byte   : 55001
+Rx1024Byte  : 62066
+Rx1518Byte  : 344149
+RxMaxByte   : 130646
 RxTooLong   : 0
-RxGoodByte  : 1345452080
+RxGoodByte  : 1345783813
 RxBadByte   : 0
 RxOverFlow  : 0
-Filtered    : 574
-TxBroad     : 793647
-TxPause     : 1151
-TxMulti     : 79418
+Filtered    : 1135
+TxBroad     : 793736
+TxPause     : 3519
+TxMulti     : 79500
 TxUnderRun  : 0
-Tx64Byte    : 97711
-Tx128Byte   : 603920
-Tx256Byte   : 173282
-Tx512Byte   : 76156
-Tx1024Byte  : 104686
-Tx1518Byte  : 3582480
+Tx64Byte    : 100079
+Tx128Byte   : 603969
+Tx256Byte   : 173305
+Tx512Byte   : 76208
+Tx1024Byte  : 104727
+Tx1518Byte  : 3582488
 TxMaxByte   : 7391371
 TxOverSize  : 0
-TxByte      : 16528781103
+TxByte      : 16529001614
 TxCollision : 0
 TxAbortCol  : 0
 TxMultiCol  : 0
@@ -139,38 +139,38 @@
        link: port:2 link:up speed:1000baseT full-duplex txflow rxflow auto
 Port 3:
        mib: Port 3 MIB counters
-RxBroad     : 159441
+RxBroad     : 159671
 RxPause     : 0
-RxMulti     : 18188
+RxMulti     : 18257
 RxFcsErr    : 0
 RxAlignErr  : 0
 RxRunt      : 0
 RxFragment  : 0
-Rx64Byte    : 9911
-Rx128Byte   : 3051611
-Rx256Byte   : 711251
-Rx512Byte   : 377985
-Rx1024Byte  : 459307
-Rx1518Byte  : 46523339
-RxMaxByte   : 21133903
+Rx64Byte    : 9932
+Rx128Byte   : 3052044
+Rx256Byte   : 711326
+Rx512Byte   : 378003
+Rx1024Byte  : 459361
+Rx1518Byte  : 46523507
+RxMaxByte   : 21133924
 RxTooLong   : 0
-RxGoodByte  : 100988636934
+RxGoodByte  : 100989010961
 RxBadByte   : 0
 RxOverFlow  : 0
-Filtered    : 36461
-TxBroad     : 804786
-TxPause     : 6016
-TxMulti     : 72948
+Filtered    : 37251
+TxBroad     : 804875
+TxPause     : 8384
+TxMulti     : 73030
 TxUnderRun  : 0
-Tx64Byte    : 1661972
-Tx128Byte   : 18816233
-Tx256Byte   : 15830078
-Tx512Byte   : 276436
-Tx1024Byte  : 524862
-Tx1518Byte  : 3198326
+Tx64Byte    : 1664343
+Tx128Byte   : 18816282
+Tx256Byte   : 15830101
+Tx512Byte   : 276488
+Tx1024Byte  : 524903
+Tx1518Byte  : 3198334
 TxMaxByte   : 426878
 TxOverSize  : 0
-TxByte      : 9485704082
+TxByte      : 9485924799
 TxCollision : 0
 TxAbortCol  : 0
 TxMultiCol  : 0
@@ -202,19 +202,19 @@
 RxBadByte   : 871047744
 RxOverFlow  : 0
 Filtered    : 99
-TxBroad     : 909305
-TxPause     : 4319
-TxMulti     : 67716
+TxBroad     : 909394
+TxPause     : 6687
+TxMulti     : 67798
 TxUnderRun  : 0
-Tx64Byte    : 335196
-Tx128Byte   : 1832541
-Tx256Byte   : 520952
-Tx512Byte   : 320853
-Tx1024Byte  : 412422
-Tx1518Byte  : 42645040
+Tx64Byte    : 337564
+Tx128Byte   : 1832588
+Tx256Byte   : 520975
+Tx512Byte   : 320905
+Tx1024Byte  : 412463
+Tx1518Byte  : 42645048
 TxMaxByte   : 13773617
 TxOverSize  : 0
-TxByte      : 84230298783
+TxByte      : 84230519096
 TxCollision : 0
 TxAbortCol  : 0
 TxMultiCol  : 0


Polynomdivision commented on 07.10.2021 14:28

To clear up any misunderstandings: the diff shows the state while the switch is frozen. At both timestamps the router could no longer be reached and could not reach anything itself, so the switch had already stopped working. The two snapshots are 20 seconds apart, going by the timestamps in the file names.

Polynomdivision commented on 18.10.2021 14:14

We may have found a workaround. It could be that U-Boot initializes the switch with some configuration that swconfig then overwrites.

Since changing the config to

config switch
	option name 'switch0'
	option reset '0'
	option enable_vlan '1'

we have not seen a freeze for 3 days.

Polynomdivision commented on 08.11.2021 22:45

The DSA driver could fix the issue:
https://github.com/openwrt/openwrt/pull/4721

Polynomdivision commented on 12.11.2021 22:03

Could the combination of an ipq40xx device with Ubiquiti products that use the Cisco Discovery Protocol be the problem?

Polynomdivision commented on 12.11.2021 22:17

A log message that may be relevant:
Fri Nov 12 21:19:20 2021 daemon.err babeld[2608]: netlink_read: recvmsg(): No buffer space available

Polynomdivision commented on 15.11.2021 17:52

Enabled offloads:

# ethtool -k lan1 | grep on
rx-checksumming: on [fixed]
tx-checksumming: on
	tx-checksum-ip-generic: on [fixed]
scatter-gather: on
	tx-scatter-gather: on [fixed]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on [fixed]
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: on [fixed]
	tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
tx-lockless: on [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
hw-tc-offload: on

Polynomdivision commented on 15.11.2021 17:54

Example diff

diff -u 1636876105-ethtool\ -S\ lan1.log 1636876125-ethtool\ -S\ lan1.log 
--- "1636876105-ethtool -S lan1.log"	2021-11-14 09:57:54.296129780 +0100
+++ "1636876125-ethtool -S lan1.log"	2021-11-14 09:57:54.762796277 +0100
@@ -1,40 +1,40 @@
 NIC statistics:
-     tx_packets: 37118865
-     tx_bytes: 39940635954
+     tx_packets: 37119048
+     tx_bytes: 39940687530
      rx_packets: 17896842
      rx_bytes: 2765004610
-     RxBroad: 618724
+     RxBroad: 618736
      RxPause: 0
-     RxMulti: 71642
+     RxMulti: 71666
      RxFcsErr: 0
      RxAlignErr: 0
      RxRunt: 0
      RxFragment: 0
-     Rx64Byte: 81401
-     Rx128Byte: 16481482
-     Rx256Byte: 266421
-     Rx512Byte: 74709
-     Rx1024Byte: 147338
-     Rx1518Byte: 919906
+     Rx64Byte: 81421
+     Rx128Byte: 16481568
+     Rx256Byte: 266434
+     Rx512Byte: 74718
+     Rx1024Byte: 147342
+     Rx1518Byte: 919914
      RxMaxByte: 87542
      RxTooLong: 0
-     RxGoodByte: 3173218205
+     RxGoodByte: 3173248165
      RxBadByte: 0
      RxOverFlow: 0
-     Filtered: 161894
-     TxBroad: 9615221
-     TxPause: 15710
-     TxMulti: 811563
+     Filtered: 162034
+     TxBroad: 9615315
+     TxPause: 18185
+     TxMulti: 811634
      TxUnderRun: 0
-     Tx64Byte: 948123
-     Tx128Byte: 14871136
-     Tx256Byte: 893717
-     Tx512Byte: 444792
-     Tx1024Byte: 907971
+     Tx64Byte: 950607
+     Tx128Byte: 14871198
+     Tx256Byte: 893747
+     Tx512Byte: 444849
+     Tx1024Byte: 907996
      Tx1518Byte: 10961510
      TxMaxByte: 14193799
      TxOverSize: 0
-     TxByte: 39901619727
+     TxByte: 39901830561
      TxCollision: 0
      TxAbortCol: 0
      TxMultiCol: 0
@@ -42,5 +42,5 @@
      TxExcDefer: 0
      TxDefer: 0
      TxLateCol: 0
-     RXUnicast: 17368433
-     TXunicast: 32778554
+     RXUnicast: 17368537
+     TXunicast: 32778572

The hardware Rx counters (RxBroad, RxGoodByte, ...) still increase, but the interface/host counters (rx_packets, rx_bytes) do not. I suspect some offload failure.
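
This comparison can also be automated on two saved `ethtool -S` logs. The sketch below is in Python and assumes the `name: value` layout of the output above. The choice of `RxGoodByte` as the hardware counter and `rx_packets` as the host counter follows the interpretation in this comment and is an assumption. The helper names `parse_ethtool_stats` and `host_rx_stalled` are hypothetical.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a dict of counter name -> int."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skips the "NIC statistics:" header and any non-numeric lines.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def host_rx_stalled(before, after):
    """True if the hardware Rx byte counter advanced but rx_packets did not."""
    hw_moved = after["RxGoodByte"] > before["RxGoodByte"]
    host_moved = after["rx_packets"] > before["rx_packets"]
    return hw_moved and not host_moved


# Values taken from the two snapshots in the diff above.
BEFORE = parse_ethtool_stats(
    "NIC statistics:\n"
    "     rx_packets: 17896842\n"
    "     RxGoodByte: 3173218205\n"
)
AFTER = parse_ethtool_stats(
    "NIC statistics:\n"
    "     rx_packets: 17896842\n"
    "     RxGoodByte: 3173248165\n"
)

print(host_rx_stalled(BEFORE, AFTER))
```

With the values from this report the check is true: the hardware counted about 30 kB of received data between the two snapshots while `rx_packets` stayed at 17896842.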
