OpenWrt/LEDE Project

  • Status Unconfirmed
  • Percent Complete 0%
  • Task Type Bug Report
  • Category Kernel
  • Assigned To No-one
  • Operating System All
  • Severity Medium
  • Priority Very Low
  • Reported Version All
  • Due in Version Undecided
  • Due Date Undecided
Attached to Project: OpenWrt/LEDE Project
Opened by Texot - 29.03.2020

FS#2943 - Kernel panic with dev_net(skb->dev)==NULL in nf_xfrm_me_harder

Hi, I have some questions about debugging a random reboot problem caused by a kernel panic.

I’m getting random reboots when running software that inserts prerouting iptables rules to redirect traffic. My device is an x86_64 router. I compiled my OpenWrt build myself from a fork of the OpenWrt source at https://github.com/coolsnowwolf/lede . Its kernel version is 4.19.108.

The software causing this problem is called OpenClash. It acts as a transparent proxy: it inserts prerouting rules that redirect all TCP traffic from computers on the LAN to its own listening port, then sends that traffic through a proxy.

Whenever this software is running, I get random reboots 1-2 times per day. The saved log files showed nothing abnormal, because the crash happened in the kernel and the router rebooted too quickly for anything to be written. So I had to compile the NetConsole kernel module to capture the dmesg output at the moment of the crash. You can see the crash logs in crash_dmesg.txt.

The crash happens in the `nf_xfrm_me_harder` function. Decompiling the crashing code gives crash_code.png (the highlighted line is the crashing instruction). The crash log reports an illegal memory access at 000000000000113c, and the code shows that the kernel was accessing [rax+0x113c], so I think the problem is rax==0, which should not happen.

The source code that crashes actually comes from a patch, which is also included in trunk OpenWrt: https://github.com/openwrt/openwrt/blob/master/target/linux/generic/pending-4.19/616-net_optimize_xfrm_calls.patch

After the patch is applied, the function `nf_xfrm_me_harder` begins like this:

int nf_xfrm_me_harder(struct net *net, struct sk_buff *skb, unsigned int family)
{
        struct flowi fl;
        unsigned int hh_len;
        struct dst_entry *dst;
        struct sock *sk = skb->sk;
        int err;

        if (skb->dev && !dev_net(skb->dev)->xfrm.policy_count[XFRM_POLICY_OUT]) // <-------crash
                return 0;

        err = xfrm_decode_session(skb, &fl, family);
        if (err < 0)
                return err;

This means `dev_net(skb->dev)` is sometimes `NULL`.

I’m not familiar with the networking internals of the Linux kernel, so I’m not sure how to find out why it is NULL. Is this something we can safely work around by checking the pointer first, like this?

if (skb->dev && dev_net(skb->dev) && !dev_net(skb->dev)->xfrm.policy_count[XFRM_POLICY_OUT])

If not, can anyone give me some advice on how to debug this problem? I understand it may be difficult for you developers to figure out what’s happening just from my description, especially since I’m not using trunk OpenWrt, so I would be happy to dig into it myself.

You can see other information of my router in dmesg.txt.

Thanks a lot.

Texot commented on 29.03.2020 16:52

`openclash.png` shows iptables rules added by OpenClash.

Texot commented on 31.03.2020 03:27

Another crash, showing the same backtrace from `tcp_xmit_retransmit_queue` to `nf_xfrm_me_harder`.

Admin
Petr Štetiar commented on 31.03.2020 09:27
Texot commented on 31.03.2020 10:05

Hi,

Thank you for your help. I'll give it a try and report back with the result.

I have found another similar ticket, which was also caused by this patch (an earlier version of it) and also occurred when iptables redirect rules were used heavily:
https://dev.archive.openwrt.org/ticket/18462.html

And here is its fix:
https://dev.archive.openwrt.org/changeset/43567.html

This makes me curious: is it normal for `skb->dev` or `dev_net(skb->dev)` to be NULL? Or does this point to a bug somewhere deeper in the kernel?

Texot commented on 05.04.2020 04:19

Hi Petr,

I can confirm that it has fixed my problem. No performance degradation has been observed.

Thanks.
