OpenWrt/LEDE Project

  • Status Unconfirmed
  • Percent Complete 0%
  • Task Type Bug Report
  • Category Kernel
  • Assigned To No-one
  • Operating System All
  • Severity Critical
  • Priority Very Low
  • Reported Version openwrt-21.02
  • Due in Version Undecided
  • Due Date Undecided
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Michal Pomorski - 22.11.2021

FS#4146 - e1000e: Detected Hardware Unit Hang, Reset adapter unexpectedly

System is a Fujitsu Esprimo C5731 with Intel Core2Duo E7500 and 4 GB RAM.

The problem NIC:

00:19.0 Ethernet controller [0200]: Intel Corporation 82567LF-3 Gigabit Network Connection [8086:10df] (rev 02)

OpenWrt:
OpenWrt x86_64 21.02.1 r16325-88151b8303

The system is configured as a simple router, with the e1000e NIC as WAN and an skge NIC (Ethernet controller [0200]: D-Link System Inc Gigabit Ethernet Adapter [1186:4c00] (rev 11)) as LAN.
When running a speed test through the router (bredbandskollen.se), the hang occurs during the upload test, when the e1000e NIC sends data and the skge NIC receives data. The download test does not cause the error.

Similar (identical?) problems were reported previously:
https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang
https://serverfault.com/questions/193114/linux-e1000e-intel-networking-driver-problems-galore-where-do-i-start
https://web.archive.org/web/20160205153351/http://ehc.ac:80/p/e1000/bugs/378/

Turning TSO off is a workaround:

ethtool -K eth0 tso off

Booting with pcie_aspm=off, however, does not help.
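
The ethtool setting is lost whenever the interface is brought up again, so the workaround has to be reapplied. One possible way to do that on OpenWrt is an interface hotplug script. This is only a sketch under assumptions that are not part of the report: the file name, the function name disable_tso_on_ifup, and the logical interface name "wan" are hypothetical, and the ethtool package must be installed.

```shell
#!/bin/sh
# Hypothetical file: /etc/hotplug.d/iface/99-e1000e-tso-off
# Assumption: the e1000e NIC backs the logical interface "wan".
# ACTION, INTERFACE and DEVICE are set by the OpenWrt hotplug system.

disable_tso_on_ifup() {
    # Only act when the WAN interface has just come up.
    if [ "$ACTION" = "ifup" ] && [ "$INTERFACE" = "wan" ]; then
        # DEVICE holds the underlying netdev name (eth0 in this report).
        ethtool -K "$DEVICE" tso off
    fi
}

disable_tso_on_ifup
```

Running ethtool -k eth0 (lowercase k) afterwards lists the current offload settings, which should include a tcp-segmentation-offload line showing off.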

[49573.954931] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
[49573.954931]   TDH                  <2>
[49573.954931]   TDT                  <1a>
[49573.954931]   next_to_use          <1a>
[49573.954931]   next_to_clean        <ff>
[49573.954931] buffer_info[next_to_clean]:
[49573.954931]   time_stamp           <100bbf478>
[49573.954931]   next_to_watch        <2>
[49573.954931]   jiffies              <100bbf6f8>
[49573.954931]   next_to_watch.status <0>
[49573.954931] MAC Status             <80083>
[49573.954931] PHY Status             <796d>
[49573.954931] PHY 1000BASE-T Status  <3800>
[49573.954931] PHY Extended Status    <3000>
[49573.954931] PCI Status             <10>
[49575.970909] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
[49575.970909]   TDH                  <2>
[49575.970909]   TDT                  <1a>
[49575.970909]   next_to_use          <1a>
[49575.970909]   next_to_clean        <ff>
[49575.970909] buffer_info[next_to_clean]:
[49575.970909]   time_stamp           <100bbf478>
[49575.970909]   next_to_watch        <2>
[49575.970909]   jiffies              <100bbf8f0>
[49575.970909]   next_to_watch.status <0>
[49575.970909] MAC Status             <80083>
[49575.970909] PHY Status             <796d>
[49575.970909] PHY 1000BASE-T Status  <3800>
[49575.970909] PHY Extended Status    <3000>
[49575.970909] PCI Status             <10>
[49577.954909] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
[49577.954909]   TDH                  <2>
[49577.954909]   TDT                  <1a>
[49577.954909]   next_to_use          <1a>
[49577.954909]   next_to_clean        <ff>
[49577.954909] buffer_info[next_to_clean]:
[49577.954909]   time_stamp           <100bbf478>
[49577.954909]   next_to_watch        <2>
[49577.954909]   jiffies              <100bbfae0>
[49577.954909]   next_to_watch.status <0>
[49577.954909] MAC Status             <80083>
[49577.954909] PHY Status             <796d>
[49577.954909] PHY 1000BASE-T Status  <3800>
[49577.954909] PHY Extended Status    <3000>
[49577.954909] PCI Status             <10>
[49578.082559] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
[49578.254005] e1000e: eth0 NIC Link is Down
[49581.083429] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
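
The time_stamp and jiffies fields in the reports above indicate how long the buffer at next_to_clean has been waiting. The arithmetic can be sketched as follows. The tick rate of 250 Hz is an assumption, inferred from the log itself (jiffies advances by 504 between the first two reports, which are printed about 2.016 s apart), and stalled_ms is a hypothetical helper name.

```shell
#!/bin/sh
# stalled_ms TIME_STAMP JIFFIES HZ
# Prints how long, in milliseconds, the buffer at next_to_clean has been
# pending, given the two hex values from a hang report and the tick rate.
stalled_ms() {
    echo $(( ($2 - $1) * 1000 / $3 ))
}

# Values copied from the three reports above; HZ=250 is an assumption.
stalled_ms 0x100bbf478 0x100bbf6f8 250
stalled_ms 0x100bbf478 0x100bbf8f0 250
stalled_ms 0x100bbf478 0x100bbfae0 250
```

Under that assumption the three reports correspond to 2560 ms, 4576 ms and 6560 ms of waiting on the same descriptor, which matches the unchanged TDH, TDT and next_to_clean values: the transmit ring made no progress until the adapter was reset.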

equid0x commented on 19.12.2021 06:40

This is known as the "TX Unit Hang" issue, and it is allegedly a bug in silicon that can't be fixed. As far as I recall, Intel released updated microcode (included in the driver) for this series of chips that partially mitigates, but does not completely eliminate, the issue. This is a very, very old issue.

I believe the workaround is to turn off checksum offloading:

ethtool -K eth0 tx off rx off

The bug is probably reproducible if you use something like iPerf or Netcat to totally flood the affected interface with TX traffic for an extended period of time (several minutes).
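
A rough sketch of such a reproduction attempt, assuming iperf3 is available on a LAN client and on a host reachable through the WAN port. The server address 192.0.2.10 is a placeholder and count_hangs is a hypothetical helper name. Traffic has to flow from LAN to WAN so that the e1000e NIC is the one transmitting.

```shell
#!/bin/sh
# Count hang reports in kernel log text read from stdin.
count_hangs() {
    grep -c 'Detected Hardware Unit Hang'
}

# On a LAN client, push TCP traffic through the router for five minutes
# (192.0.2.10 is a placeholder for an iperf3 server on the WAN side):
#   iperf3 -c 192.0.2.10 -t 300
#
# On the router, during or after the run:
#   dmesg | count_hangs
```

A non-zero count after the run would suggest the hang was triggered. Repeating the run with TSO or checksum offloading disabled gives a comparison point.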

I did a cursory search on this out of curiosity. Interestingly, at least one user has reported that the issue does not seem to occur under kernel 5.11, so it's possible someone finally tracked down and fixed a long-standing bug in the driver source. This issue has been around since at least 2009 (!).

Michal Pomorski commented on 23.12.2021 06:38

The problem does not show up under OPNsense, at least not in an overly noticeable way. So even if it is a hardware problem, a workable workaround ostensibly exists.
