OpenWrt/LEDE Project

  • Status Closed
  • Percent Complete 0%
  • Task Type Bug Report
  • Category Base system
  • Assigned To No-one
  • Operating System All
  • Severity Medium
  • Priority Very Low
  • Reported Version Trunk
  • Due in Version Undecided
  • Due Date Undecided
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Sven Schönhoff - 15.01.2017
Last edited by Ted Hess - 07.02.2017

FS#391 - dnsmasq stops working properly if the fastest upstream DNS server returns a server failure

- Device problem occurs on
Reproduced on TP-Link 1043nd v1 and TP-Link Archer C7 v2.

- Software versions of LEDE release, packages, etc.
Reboot (SNAPSHOT, r2961-5b089e4)
Dnsmasq version 2.76

- Steps to reproduce
Fastest upstream DNS server returns a server failure.

My provider is having some difficulties with its DNS servers this week. I noticed that if the fastest DNS server returns a server failure, dnsmasq stops working properly because it ignores the replies of the slower DNS servers.

In Google Chrome

ERR_NAME_RESOLUTION_FAILED

appears and nslookup returns

** server can't find google.com: SERVFAIL

I don’t use strict-order and it doesn’t matter if the faulty upstream DNS server is the first or the last entry in the config as long as it returns the fastest reply.
I had to delete the upstream DNS server which returns the server failure from my config to get dnsmasq working again.

I was able to create a tcpdump and syslog while the DNS server 83.169.185.162 returned a server failure today.
- You can see in syslog.txt that the reply messages are missing until I delete 83.169.185.162 from the config.
- The tcpdump wan.pcap shows that 83.169.185.162 returns the fastest reply with a server failure and that the other DNS servers work properly, but dnsmasq seems to ignore their replies.

Closed by  Ted Hess
07.02.2017 23:04
Reason for closing:  Fixed
Project Manager
Stijn Tintel commented on 16.01.2017 04:58

Something similar happens to me from time to time. I'm a Gentoo user, and one of the nameservers for the gentoo.org domain seems to be unreliable. When it is down, it's near impossible for me to resolve anything in said domain.

Someone in OpenWrt also had this problem where he was unable to resolve most records of a domain when connected to his OpenWrt router. The problem did not occur when he was directly connected.

While searching for possible solutions, I came across the --all-servers option:

By default, when dnsmasq has more than one upstream server available, it will send queries to just one server. Setting this flag forces dnsmasq to send all queries to all available servers. The reply from the server which answers first will be returned to the original requester.

Can you test if enabling it helps? Can be enabled in /etc/config/dhcp:

config dnsmasq
        option allservers '1'
        ...
Sven Schönhoff commented on 16.01.2017 17:29

Hi Stijn,

the problem still occurs with the allservers parameter.

My ISP's DNS servers are working fine right now, so to reproduce the DNS server failure I created an Ubuntu VM in VirtualBox, installed a bind9 DNS server, and manually assigned it the IP address 192.168.3.2 without a gateway address (which means it can't reach the internet to resolve requests).

sven@sven-VirtualBox:~$ nslookup facebook.com 192.168.3.2
Server:         192.168.3.2
Address:        192.168.3.2#53

** server can't find facebook.com: SERVFAIL

For the allservers parameter test, dnsmasq is now using the Google Public DNS servers and my faulty DNS server.

root@FlensNet:~# uci set dhcp.@dnsmasq[0].allservers=1
root@FlensNet:~# uci commit dhcp
root@FlensNet:~# /etc/init.d/dnsmasq restart
root@FlensNet:~# nslookup facebook.com
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'facebook.com': Try again
root@FlensNet:~# nslookup facebook.com
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'facebook.com': Try again
root@FlensNet:~# nslookup facebook.com
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'facebook.com': Try again

Syslog shows that the requests are forwarded to all DNS servers as expected but the reply messages are still missing:

Mon Jan 16 17:55:13 2017 daemon.info dnsmasq[1070]: exiting on receipt of SIGTERM
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: started, version 2.76 cachesize 500
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP no-DHCPv6 no-Lua TFTP no-conntrack no-ipset no-auth no-DNSSEC no-ID loop-detect inotify
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: DNS service limited to local subnets
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq-dhcp[1933]: DHCP, IP range 192.168.3.100 -- 192.168.3.249, lease time 12h
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using local addresses only for domain lan
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: reading /tmp/resolv.conf.auto
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using local addresses only for domain lan
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using nameserver 8.8.8.8#53
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using nameserver 8.8.4.4#53
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: using nameserver 192.168.3.2#53
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: read /etc/hosts - 4 addresses
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: read /tmp/hosts/odhcpd - 0 addresses
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq[1933]: read /tmp/hosts/dhcp.cfg02411c - 2 addresses
Mon Jan 16 17:55:17 2017 daemon.info dnsmasq-dhcp[1933]: read /etc/ethers - 0 addresses
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 1 127.0.0.1/46400 query[A] facebook.com from 127.0.0.1
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 1 127.0.0.1/46400 forwarded facebook.com to 8.8.8.8
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 1 127.0.0.1/46400 forwarded facebook.com to 8.8.4.4
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 1 127.0.0.1/46400 forwarded facebook.com to 192.168.3.2
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 2 127.0.0.1/46400 query[AAAA] facebook.com from 127.0.0.1
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 2 127.0.0.1/46400 forwarded facebook.com to 8.8.8.8
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 2 127.0.0.1/46400 forwarded facebook.com to 8.8.4.4
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 2 127.0.0.1/46400 forwarded facebook.com to 192.168.3.2
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 3 127.0.0.1/46400 query[A] facebook.com from 127.0.0.1
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 3 127.0.0.1/46400 forwarded facebook.com to 8.8.8.8
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 3 127.0.0.1/46400 forwarded facebook.com to 8.8.4.4
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 3 127.0.0.1/46400 forwarded facebook.com to 192.168.3.2
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 4 127.0.0.1/46400 query[AAAA] facebook.com from 127.0.0.1
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 4 127.0.0.1/46400 forwarded facebook.com to 8.8.8.8
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 4 127.0.0.1/46400 forwarded facebook.com to 8.8.4.4
Mon Jan 16 17:55:29 2017 daemon.info dnsmasq[1933]: 4 127.0.0.1/46400 forwarded facebook.com to 192.168.3.2

DNS starts to work again after I remove 192.168.3.2 from the DNS servers list.

Baptiste Jonglez commented on 16.01.2017 23:50

Did it start happening after the update to dnsmasq 2.76, in May 2016?

These commits look relevant:

http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=51967f9807665dae403f1497b827165c5fa1084b (introduced in dnsmasq 2.69)
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=4ace25c5d6c30949be9171ff1c524b2139b989d3 (introduced in dnsmasq 2.76)

So, the first commit introduced the issue you see as a "feature" (but there was a bug in the implementation, so it didn't work), while the second commit made the first commit actually work starting from dnsmasq 2.76.

I'm not sure what the right behaviour is, but it does sound strange to treat SERVFAIL as a valid response.
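
To make the behaviour under discussion concrete, here is a minimal C sketch of the two reply-selection policies. It is a simplified model and not dnsmasq code: the struct, the function name pick_reply, and the latency field are hypothetical and introduced only for illustration. With skip_servfail set to 0 it models the behaviour described above for 2.76 (only REFUSED is skipped, so a fast SERVFAIL wins); with skip_servfail set to 1 it models the alternative where SERVFAIL is skipped as well.

```c
#include <assert.h>
#include <stddef.h>

/* DNS response codes (values from RFC 1035). */
enum rcode { NOERROR = 0, SERVFAIL = 2, REFUSED = 5 };

/* One upstream reply. Hypothetical type, for illustration only. */
struct reply {
    const char *server;
    int latency_ms;      /* made-up arrival time */
    enum rcode rcode;
};

/* Return the reply that would be passed to the client: the fastest
 * reply whose rcode is not rejected, or, if every reply is rejected,
 * the slowest one (the last to arrive). Returns NULL when n == 0.
 * skip_servfail == 0: only REFUSED is rejected.
 * skip_servfail == 1: REFUSED and SERVFAIL are both rejected. */
const struct reply *pick_reply(const struct reply *r, size_t n,
                               int skip_servfail)
{
    const struct reply *best = NULL, *last = NULL;

    assert(r != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int rejected = r[i].rcode == REFUSED ||
                       (skip_servfail && r[i].rcode == SERVFAIL);

        if (!last || r[i].latency_ms > last->latency_ms)
            last = &r[i];
        if (!rejected && (!best || r[i].latency_ms < best->latency_ms))
            best = &r[i];
    }
    return best ? best : last;
}
```

Given a fast SERVFAIL from 192.168.3.2 and slower NOERROR replies from 8.8.8.8 and 8.8.4.4, the first policy hands the client the SERVFAIL, while the second hands it the 8.8.8.8 answer.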

Eric Luehrsen commented on 17.01.2017 03:45

Simon Kelley took note of this. It might be a necessary though annoying behavior for a stub resolver using DNSSEC.

https://www.mail-archive.com/dnsmasq-discuss@lists.thekelleys.org.uk/msg10901.html

Sven Schönhoff commented on 17.01.2017 19:53

@Baptiste: I can't reproduce the issue with OpenWrt Chaos Calmer 15.05.1 r49389 and Dnsmasq version 2.73. My internet outages started with LEDE and Dnsmasq 2.76.

@Eric: I don't use DNSSEC and thus treating SERVFAIL as a valid response sounds strange to me.

I spent some time adding a few logging messages and reverting the changes mentioned above. I created a patch file in package/network/services/dnsmasq/patches:

--- a/src/forward.c
+++ b/src/forward.c
@@ -821,9 +821,15 @@ void reply_query(int fd, int family, tim
     }   
    
   server = forward->sentto;
+
+  if (option_bool(OPT_LOG) && RCODE(header) == SERVFAIL)
+    my_syslog(LOG_INFO, _("received SERVFAIL"));
+  if (option_bool(OPT_LOG) && RCODE(header) == REFUSED)
+    my_syslog(LOG_INFO, _("received REFUSED"));
+
   if ((forward->sentto->flags & SERV_TYPE) == 0)
     {
-      if (RCODE(header) == REFUSED)
+      if (RCODE(header) == REFUSED || RCODE(header) == SERVFAIL)
 	server = NULL;
       else
 	{
@@ -853,7 +857,7 @@ void reply_query(int fd, int family, tim
      we get a good reply from another server. Kill it when we've
      had replies from all to avoid filling the forwarding table when
      everything is broken */
-  if (forward->forwardall == 0 || --forward->forwardall == 1 || RCODE(header) != REFUSED)
+  if (forward->forwardall == 0 || --forward->forwardall == 1 || (RCODE(header) != REFUSED && RCODE(header) != SERVFAIL))
     {
       int check_rebind = 0, no_cache_dnssec = 0, cache_secure = 0, bogusanswer = 0;
 

Now it works as I would expect: if the fastest DNS server returns SERVFAIL, the reply from the next DNS server that returns NOERROR is used as the response.

Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 query[AAAA] bugs.lede-project.org from 127.0.0.1
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 83.169.185.161
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 83.169.185.225
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 8.8.8.8
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 8.8.4.4
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 192.168.3.2
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: received SERVFAIL
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 24 127.0.0.1/54663 reply bugs.lede-project.org is 148.251.78.235
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 reply bugs.lede-project.org is 2a01:4f8:202:43ea::3
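
The second hunk of the patch can be read in isolation with the following minimal C sketch. It mirrors the patched condition, but the struct and the function name accept_reply are hypothetical stand-ins, and the assumption that the counter starts at the number of servers plus one is a reading of the condition, not something confirmed in this report.

```c
#include <assert.h>
#include <stddef.h>

/* DNS response codes (values from RFC 1035). */
enum rcode { NOERROR = 0, SERVFAIL = 2, REFUSED = 5 };

/* Hypothetical stand-in for the per-query forwarding record.
 * forwardall == 0 means the query was sent to a single server.
 * Otherwise it is ASSUMED to start at (number of servers + 1). */
struct fwd {
    int forwardall;
};

/* Return 1 if this reply is passed on to the client, 0 if it is
 * dropped while waiting for a better reply from another server.
 * The expression has the same shape as the patched condition. */
int accept_reply(struct fwd *f, enum rcode rc)
{
    assert(f != NULL);

    return f->forwardall == 0        /* not forwarded to all servers */
        || --f->forwardall == 1      /* last outstanding reply */
        || (rc != REFUSED && rc != SERVFAIL);
}
```

With three servers (counter starting at 4 under the stated assumption), a SERVFAIL that arrives first is dropped and a following NOERROR is accepted. If all three servers return SERVFAIL or REFUSED, the third reply is still passed through, which matches the code comment about killing the entry once every server has replied.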

@DEVs: Please feel free to use the patch.

Dave Täht commented on 04.02.2017 21:46

A patch much like this was folded into lede a day or three back.

Anonymous Submitter commented on 04.02.2017 22:09

For clarity:

A patch much like that and updating to dnsmasq 2.77test1 was pulled into jow's staging tree. It is not yet in master, much less backported to 17.01.*

Sven Schönhoff commented on 07.02.2017 17:33

I just tested a self-built image with https://git.lede-project.org/?p=source.git;a=commit;h=3bef96ef18a6fb20401313dfa6e88057d56b16ad and can't reproduce this issue anymore.

I would like to suggest cherry-picking this commit for LEDE v17.01.0, because it will prevent internet outages for users with unreliable DNS servers, like me.

PS: Simon Kelley included the fix in dnsmasq-2.77test2: http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=68f6312d4bae30b78daafcd6f51dc441b8685b1e

Thanks for your support.

Anonymous Submitter commented on 07.02.2017 17:43

I already have a pull request in for 2.77test2 https://github.com/lede-project/source/pull/794
