06.06.2020 | 3154 | Kernel | Bug Report | Very Low | High | XFRM state insert failure with AES-GCM | openwrt-19.07 | Unconfirmed

Task Description

On x86_64, the kernel fails to insert XFRM states that use AES-GCM as the transform.
Reproducible with:
ip x s add proto esp dst src spi 0x07 mode transport reqid 0x07 replay-window 32 aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 sel src dst proto tcp

Result on x86_64 OpenWrt 19.07.3:
RTNETLINK answers: No such file or directory

On Arch (kernel 5.6.15-arch1-1) the same command works: it prints no output, and ip x s shows the state.
The insert also fails 100% of the time when it is done by an IKE keying daemon, e.g. strongSwan.

29.03.2020 | 2943 | Kernel | Bug Report | Very Low | Medium | Kernel panic with dev_net(skb->dev)==NULL in nf_xfrm_me... | All | Unconfirmed

Task Description

Hi, I have some questions about debugging a random reboot problem caused by a kernel panic.

I’m getting random reboots when running software that inserts prerouting iptables rules to redirect traffic. My device is an x86_64 router. I compiled my OpenWrt release myself from a forked OpenWrt source at . Its kernel version is 4.19.108.

The software causing this problem is called OpenClash. It acts as a transparent proxy: it inserts prerouting rules that redirect all TCP traffic from computers on the LAN to its own listening port, then sends that traffic through a proxy.

Whenever this software is running, I get random reboots 1-2 times per day. The saved log files showed nothing abnormal, because the crash happened in the kernel and triggered a reboot too quickly for anything to be written. So I had to compile the NetConsole kernel module to capture dmesg at the moment of the crash. You can see the crash logs in crash_dmesg.txt.

The crash happens in the `nf_xfrm_me_harder` function. Disassembling the crashing code gives crash_code.png (the highlighted line is the faulting instruction). The crash log reports an illegal memory access at 000000000000113c, and the disassembly shows that the kernel was accessing [rax+0x113c], so I think the problem is rax==0, which should not happen.
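
The reasoning above (a fault address equal to a small constant offset suggests a NULL base pointer) can be illustrated in user space. This is only a minimal sketch: `struct fake_net` and `member_address` are hypothetical names, and the padding is chosen purely so that one field lands at offset 0x113c as in the crash log. The real `struct net` layout depends on the kernel configuration and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel structure: the padding exists
 * only to place `field` at offset 0x113c, the offset seen in the log. */
struct fake_net {
    unsigned char pad[0x113c];
    unsigned int field;
};

/* Address that would be dereferenced for base->field, computed with
 * integer arithmetic so that nothing is actually dereferenced. */
static uintptr_t member_address(uintptr_t base)
{
    return base + offsetof(struct fake_net, field);
}
```

Calling `member_address(0)` yields 0x113c, which matches the faulting address when the base register holds zero.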

The source code that crashes is actually located in a patch, which is also included in trunk OpenWrt:

After patching, the function `nf_xfrm_me_harder` looks like this:

int nf_xfrm_me_harder(struct net *net, struct sk_buff *skb, unsigned int family)
{
        struct flowi fl;
        unsigned int hh_len;
        struct dst_entry *dst;
        struct sock *sk = skb->sk;
        int err;

        if (skb->dev && !dev_net(skb->dev)->xfrm.policy_count[XFRM_POLICY_OUT]) // <-------crash
                return 0;

        err = xfrm_decode_session(skb, &fl, family);
        if (err < 0)
                return err;

This means `dev_net(skb->dev)` is sometimes `NULL`.

I’m not familiar with the networking internals of the Linux kernel, so I’m not sure how to find out why it is NULL. Is this something we can safely work around by checking the pointer's validity like this?

if (skb->dev && dev_net(skb->dev) && !dev_net(skb->dev)->xfrm.policy_count[XFRM_POLICY_OUT])

If not, can anyone give me some advice on how to debug this problem? I understand it may be difficult for you developers to figure out what is happening just from my description, especially since I’m not using trunk OpenWrt. So I would be happy to dig into it myself.
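
To make the behaviour of the proposed condition concrete, here is a minimal user-space sketch. All `mock_*` types and the `takes_early_return` helper are hypothetical stand-ins for the kernel structures, not the real definitions. It only demonstrates the short-circuit evaluation of the guard. It does not explain why the namespace pointer is NULL or whether continuing is safe in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mocks of the kernel types involved in the check. */
enum { MOCK_POLICY_IN, MOCK_POLICY_OUT, MOCK_POLICY_FWD, MOCK_POLICY_MAX };

struct mock_net { unsigned int policy_count[MOCK_POLICY_MAX]; };
struct mock_dev { struct mock_net *net; };
struct mock_skb { struct mock_dev *dev; };

/* Mock of dev_net(): returns the namespace the device belongs to. */
static struct mock_net *mock_dev_net(const struct mock_dev *dev)
{
    return dev->net;
}

/* Returns 1 when the early "return 0" path would be taken: the device
 * is set, its namespace is non-NULL, and no outbound policy exists.
 * A NULL device or NULL namespace makes the condition false without
 * dereferencing the NULL pointer. */
static int takes_early_return(const struct mock_skb *skb)
{
    return skb->dev && mock_dev_net(skb->dev) &&
           !mock_dev_net(skb->dev)->policy_count[MOCK_POLICY_OUT];
}
```

Note that with a NULL namespace the guarded condition is false, so execution falls through to the rest of the function (starting with `xfrm_decode_session`) instead of returning early. The guard removes this particular dereference but leaves the underlying cause in place.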

You can see other information of my router in dmesg.txt.

Thanks a lot.

