OpenWrt/LEDE Project

  • Status Closed
  • Percent Complete 100%
  • Task Type Bug Report
  • Category Base system
  • Assigned To No-one
  • Operating System All
  • Severity High
  • Priority Very Low
  • Reported Version lede-17.01
  • Due in Version Undecided
  • Due Date Undecided
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Tony Ambardar - 07.05.2017
Last edited by Hans Dedecker - 09.05.2017

FS#766 - Intermittent SIGSEGV crash of dnsmasq-full

I’ve seen the following several times within the last day or so:

[1461327.495159] do_page_fault(): sending SIGSEGV to dnsmasq for invalid read access from 00000000
[1461327.504081] epc = 0040f28d in dnsmasq[400000+2c000]
[1461327.509252] ra  = 0040f273 in dnsmasq[400000+2c000]

I’m running the latest LEDE stable, with all updates applied as of 2017-05-05, and have been using DNSSEC for a while:

  • LEDE Reboot 17.01.1 r3316-7eb58cf109
  • D-Link DIR-835 rev. A1
  • dnsmasq-full - 2.76-6

The most recent upgrade in the same time frame was to odhcpd-2017-04-28-9268ca65-1.

After several restarts by procd and subsequent crashes, dnsmasq will be disabled, leaving me without name resolution until I notice and restart manually.

To get a little more info, I rebuilt the stable LEDE and dnsmasq-full with a “-g” CFLAG option. After installing this package, I captured the following crash details:

[1562749.817613] do_page_fault(): sending SIGSEGV to dnsmasq for invalid read access from 00000000
[1562749.826522] epc = 0040f295 in dnsmasq[400000+2c000]
[1562749.831681] ra  = 0040f27b in dnsmasq[400000+2c000]

Checking further with gdb yields:

(gdb) info line *0x0040f27b
Line 278 of "forward.c" starts at address 0x40f275 <forward_query+204>
   and ends at 0x40f281 <forward_query+216>.

(gdb) info line *0x0040f295
Line 281 of "forward.c" starts at address 0x40f295 <forward_query+236>
   and ends at 0x40f29b <forward_query+242>.
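
The address arithmetic behind these lookups can be sketched in C. The kernel reports the faulting address (`epc`, or the return address `ra`) together with the mapping it falls in, written as `name[base+size]`. The helper names below (`parse_fault_line`, `fault_offset`, `struct fault_addr`) are hypothetical and introduced only for illustration; the sketch assumes the log format shown above and is not part of dnsmasq or the kernel.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one kernel fault line. */
struct fault_addr {
    unsigned long addr;  /* faulting address (epc or ra) */
    unsigned long base;  /* start of the mapping */
    unsigned long size;  /* length of the mapping */
    char image[64];      /* name of the mapped image, e.g. "dnsmasq" */
};

/* Parses a line such as
 *   "[1562749.826522] epc = 0040f295 in dnsmasq[400000+2c000]"
 * Everything before the first '=' (timestamp, register name) is skipped.
 * Returns 1 on success, 0 if the line does not have that shape. */
int parse_fault_line(const char *line, struct fault_addr *out)
{
    const char *eq = strchr(line, '=');
    if (eq == NULL)
        return 0;
    return sscanf(eq, "= %lx in %63[^[][%lx+%lx]",
                  &out->addr, out->image, &out->base, &out->size) == 4;
}

/* Offset of the faulting address from the start of its mapping. */
unsigned long fault_offset(const struct fault_addr *f)
{
    return f->addr - f->base;
}
```

In the session above gdb was given the absolute address directly, since the dnsmasq mapping starts at 0x400000. The offset form is likely to matter more for images loaded at a varying base, such as a shared library.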

And the relevant source (forward.c) looks like:

275               blockdata_retrieve(forward->stash, forward->stash_len, (void *)header);
276               plen = forward->stash_len;
277
278               if (find_pseudoheader(header, plen, NULL, &pheader, &is_sign, NULL) && !is_sign)
279                 PUTSHORT(SAFE_PKTSZ, pheader);
280
281               if (forward->sentto->addr.sa.sa_family == AF_INET)
282                 log_query(F_NOEXTRA | F_DNSSEC | F_IPV4, "retry", (struct all_addr *)&forward->sentto->addr.in.sin_addr, "dnssec");
283     #ifdef HAVE_IPV6
284               else
285                 log_query(F_NOEXTRA | F_DNSSEC | F_IPV6, "retry", (struct all_addr 
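
A fault address of 00000000 at line 281 is consistent with `forward->sentto` being NULL when it is dereferenced, although the report itself does not confirm this. The following is a minimal sketch of that failure mode and of a defensive check. It uses simplified stand-in types (`struct server_stub`, `struct frec_stub`) and a hypothetical function `sentto_family`; none of these are dnsmasq's own definitions, and this is not the upstream fix.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Simplified stand-ins for illustration only; dnsmasq's real structures differ. */
struct server_stub {
    struct sockaddr sa;          /* address of the upstream server */
};

struct frec_stub {
    struct server_stub *sentto;  /* server the query was last sent to, may be NULL */
};

/* Returns the address family of the last-used server, or AF_UNSPEC when no
 * server is recorded. Reading forward->sentto->sa.sa_family without the NULL
 * check would read from address 0 (plus a small member offset) and raise
 * SIGSEGV, which matches the shape of the kernel message. */
int sentto_family(const struct frec_stub *forward)
{
    if (forward == NULL || forward->sentto == NULL)
        return AF_UNSPEC;
    return forward->sentto->sa.sa_family;
}
```

A caller could then skip the "retry" logging, or log without an address, whenever the result is AF_UNSPEC. Whether that is the right behaviour for dnsmasq is a question for upstream.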

Any similar reports from others? I’ll keep monitoring in the meantime but this is difficult to reproduce on demand. It *seems* to happen more with web browsing.

Closed by  Hans Dedecker
09.05.2017 06:43
Reason for closing:  Different project
Additional comments about closing:  

Issue has been reported upstream (http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2017q2/011465.html) and fixed in http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=09f3b2cd9c7b5b5e0e96ba41f666e69808862620

Tony Ambardar commented on 08.05.2017 05:57

After a little more investigation, this is definitely a bug, and it also exists in the latest lede/master, which uses dnsmasq-2.77test5. It is easily triggered via a common Mozilla DNS query, and appears related to using split DNS and DNSSEC.

A minimal, standalone dnsmasq.conf that is vulnerable:

listen-address=192.168.1.1
port=55553
bind-interfaces
no-daemon
no-hosts
no-resolv
log-queries=extra
server=8.8.8.8
server=/cloudfront.net/50.22.147.234
dnssec
dnssec-check-unsigned
trust-anchor=.,19036,8,2,49AAC11D7B6F6446702E54A1607371607A1A41855200FD2CE1CDDE32F24E8FB5
trust-anchor=.,20326,8,2,E06D44B80B8F1D39A95C0B0D7C65D08458E880409BBC683457104237C7F8EC8D

Removing either of these config lines results in no SIGSEGV:

server=/cloudfront.net/50.22.147.234
dnssec-check-unsigned

The bug can be triggered simply from a DNS client (e.g. a blank Firefox page!):

ubuntu$ nslookup -port=55553 tiles-cloudfront.cdn.mozilla.net 192.168.1.1
;; Question section mismatch: got cloudfront.net/DS/IN
;; connection timed out; no servers could be reached

I also captured a dnsmasq core file from my router and ran it through gdb:

ubuntu$ ./staging_dir/toolchain-mips_24kc_gcc-5.4.0_musl-1.1.16/bin/mips-openwrt-linux-gdb -d ./build_dir/target-mips_24kc_musl-1.1.16/dnsmasq-full/dnsmasq-2.77test5/src/ -n ./staging_dir/target-mips_24kc_musl-1.1.16/root-ar71xx/usr/sbin/dnsmasq dnsmasq.757.11.1494218146.core
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...
Reading symbols from ./staging_dir/target-mips_24kc_musl-1.1.16/root-ar71xx/usr/sbin/dnsmasq...done.
[New LWP 757]
...
Core was generated by `dnsmasq -C crash-dnsmasq.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  forward_query (udpfd=<optimized out>, udpaddr=udpaddr@entry=0x7fc1d930,
    dst_addr=<optimized out>, dst_iface=dst_iface@entry=0,
    header=header@entry=0x7c8010, plen=43, plen@entry=50,
    now=now@entry=1494218146, forward=0x77cabd90, ad_reqd=ad_reqd@entry=0,
    do_bit=do_bit@entry=0) at forward.c:281
281               if (forward->sentto->addr.sa.sa_family == AF_INET)
(gdb) bt
#0  forward_query (udpfd=<optimized out>, udpaddr=udpaddr@entry=0x7fc1d930,
    dst_addr=<optimized out>, dst_iface=dst_iface@entry=0,
    header=header@entry=0x7c8010, plen=43, plen@entry=50,
    now=now@entry=1494218146, forward=0x77cabd90, ad_reqd=ad_reqd@entry=0,
    do_bit=do_bit@entry=0) at forward.c:281
#1  0x00410275 in receive_query (listen=listen@entry=0x77cbffe0,
    now=now@entry=1494218146) at forward.c:1443
#2  0x00412825 in check_dns_listeners (now=now@entry=1494218146)
    at dnsmasq.c:1565
#3  0x004047db in main (argc=<optimized out>, argv=<optimized out>)
    at dnsmasq.c:1044
(gdb)

The dnsmasq config file, log file, and client log are attached. I'm not sure I can go any further, so would appreciate the dnsmasq package maintainer taking a look and advising.

Thanks!

Anonymous Submitter commented on 08.05.2017 12:36

I've forwarded your message, including the replication procedure, to the dnsmasq list. I was able to replicate the crash with ease by following your instructions; in fact, all I needed to do was add server=/cloudfront.net/50.22.147.234 to my existing config. This makes me think it's a particular type of server that's provoking the issue.

Let's see what happens

Kevin

Tony Ambardar commented on 09.05.2017 07:47

Hehe, very optimistic of you to close this...

I saw the update from Simon Kelley (thank you!) on the Dnsmasq-discuss mailing list and built an updated LEDE dnsmasq-2.77rc1 package to test. (see required patch attached)

The prior minimal test-case passed, but the original production config file now creates a horrible SIGSEGV crash-loop (log attached):

Mon May  8 22:59:46 2017 kern.info kernel: [1738736.539480] do_page_fault(): sending SIGSEGV to dnsmasq for invalid read access from 00000000
Mon May  8 22:59:46 2017 kern.info kernel: [1738736.548375] epc = 0040e79b in dnsmasq[400000+2d000]
Mon May  8 22:59:46 2017 kern.info kernel: [1738736.553564] ra  = 0040e773 in dnsmasq[400000+2d000]

Stack trace indicates something to do with logging:

(gdb) core-file dnsmasq.18906.11.1494309586.core
[New LWP 18906]
...
Core was generated by `dnsmasq -C /var/etc/dnsmasq.conf.cfg02411c --no-daemon'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0040e79b in search_servers (now=now@entry=1494309586,
    addrpp=addrpp@entry=0x0, qtype=qtype@entry=32768, qdomain=<optimized out>,
    type=type@entry=0x7fd02c74, domain=domain@entry=0x7fd02c78,
    norebind=norebind@entry=0x0) at forward.c:222
222           log_query(logflags | flags | F_CONFIG | F_FORWARD, qdomain, *addrpp, NULL);
(gdb) bt
#0  0x0040e79b in search_servers (now=now@entry=1494309586,
    addrpp=addrpp@entry=0x0, qtype=qtype@entry=32768, qdomain=<optimized out>,
    type=type@entry=0x7fd02c74, domain=domain@entry=0x7fd02c78,
    norebind=norebind@entry=0x0) at forward.c:222
#1  0x00410759 in reply_query (fd=<optimized out>, family=<optimized out>,
    now=now@entry=1494309586) at forward.c:938
#2  0x004127dd in check_dns_listeners (now=now@entry=1494309586)
    at dnsmasq.c:1560
#3  0x004047db in main (argc=<optimized out>, argv=<optimized out>)
    at dnsmasq.c:1044
(gdb) print logflags
$1 = 32800
(gdb) print flags
$2 = <optimized out>
(gdb) print *qdomain
value has been optimized out
(gdb) print addrpp
$3 = (struct all_addr **) 0x0
(gdb)
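
The gdb output shows `addrpp` is 0x0 in this call, while line 222 passes `*addrpp` to `log_query`, so the dereference itself reads from address 0. A minimal sketch of guarding such an optional out-parameter follows. `struct addr_stub` and `addr_or_null` are hypothetical names used only for illustration; this shows the general pattern and is not the patch applied upstream.

```c
#include <stddef.h>

/* Stand-in for dnsmasq's address type; illustration only. */
struct addr_stub {
    unsigned char bytes[16];
};

/* Safely reads through an optional pointer-to-pointer. A caller that does
 * not want an address back passes addrpp == NULL, so *addrpp must not be
 * evaluated unconditionally. */
const struct addr_stub *addr_or_null(struct addr_stub *const *addrpp)
{
    return addrpp != NULL ? *addrpp : NULL;
}
```

With a helper of this kind, a logging call would receive NULL instead of faulting. This assumes the logging function accepts a NULL address, which the report does not state.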

This turns out to be easy to reproduce. Simply add

domain-needed

to the prior standalone config file.
Then trigger the crash from a client with:

$ nslookup -port=55553 google.com 192.168.1.1
;; connection timed out; no servers could be reached

I attached all the relevant logs, configs and patches.

Project Manager
Hans Dedecker commented on 09.05.2017 07:53

The bug record has been closed as this is an upstream issue in the dnsmasq project; meaning the issue has to be reported on the dnsmasq mailing list and needs to be fixed by the dnsmasq maintainer. Therefore it makes no sense to keep this bug record open on the Lede project.

Tony Ambardar commented on 10.05.2017 09:02

@Kevin Darbyshire-Bryant:
I wanted to report that I tested the patch Simon created in response to my second bug report, and can confirm this now resolves the issue. Thank you for liaising with the DNSMASQ mailing list, and please pass on this note and my thanks also to Simon Kelley.

Since these crash-loop bugs are fairly serious and difficult to troubleshoot, do you expect someone would be able to back-port the fixes to LEDE-17.01 once Simon releases dnsmasq-2.77 in the near future?

Thanks,
Tony Ambardar


@Hans Dedecker:
I was asked to open this ticket by Matthias Schiffer, and rightly so. As a bug present in a base package of the LEDE stable release, it should clearly be reported here according to LEDE policy: LEDE Project - Reporting Bugs. And until it is fixed in a LEDE package the issue can impact others and should remain open.

Your suggestions that the issue is unrelated to LEDE and doesn't belong here are misleading, unhelpful, and serve to dissuade others from volunteering their time to improve LEDE. I'd like to think that is not your intention. Am I wrong to think so?

Anonymous Submitter commented on 10.05.2017 09:14

No problem. Thanks for doing the hard work with gdb! I've already got a pull request on standby for when Simon tags rc2. We'll see how quickly the release comes out, and I guess someone will then decide how to get it into LEDE 17.01.

Project Manager
Hans Dedecker commented on 10.05.2017 21:03

@Tony Ambardar:
It was by no means intended to discourage people from improving LEDE.
The major issue here is that the upstream dnsmasq maintainer won't read the Flyspray bug records; it is therefore more advisable to report the issue directly upstream, or someone has to find the time to act as a liaison, as Kevin Darbyshire-Bryant did.
At the same time, we don't have a clear policy on how to handle bugs in core packages that are not owned by LEDE. I agree it's not a bad idea to keep the bug open, as long as we can keep that maintainable.

Christian Kujau commented on 21.08.2017 10:16

For the record, this is still an issue with 17.01.2 and Dnsmasq version 2.77:

kernel: [ 2860.890789] 
kernel: [ 2860.890789] do_page_fault(): sending SIGSEGV to dnsmasq for invalid write access to 00552000
kernel: [ 2860.899402] epc = 77cd488c in libc.so[77c62000+92000]
kernel: [ 2860.904552] ra  = 00406c41 in dnsmasq[400000+21000]
kernel: [ 2860.909537] 

I came across this one while playing around with dnseval from the dnsdiag package. Simply calling dnseval foo was enough to make dnsmasq crash :-|

But, as this also crashes the latest git checkout of dnsmasq, I have reported this to the dnsmasq-discuss mailing list.

If somebody wants to take a stab at the MIPS core dump (attached), please do, as I don't have a LEDE build environment set up yet.
