- Status Closed
- Percent Complete
- Task Type Bug Report
- Category Base system
- Assigned To No-one
- Operating System All
- Severity Medium
- Priority Very Low
- Reported Version Trunk
- Due in Version Undecided
- Due Date Undecided
- Private
Opened by Sven Schönhoff - 15.01.2017
Last edited by Ted Hess - 07.02.2017
FS#391 - dnsmasq stops working properly if the fastest upstream DNS server returns a server failure
- Device problem occurs on
Reproduced on TP-Link 1043nd v1 and TP-Link Archer C7 v2.
- Software versions of LEDE release, packages, etc.
Reboot (SNAPSHOT, r2961-5b089e4)
Dnsmasq version 2.76
- Steps to reproduce
Fastest upstream DNS server returns a server failure.
My provider is having some difficulties with its DNS servers this week. I noticed that if the fastest DNS server returns a server failure, dnsmasq stops working properly because it ignores the replies of the slower DNS servers.
In Google Chrome
ERR_NAME_RESOLUTION_FAILED
appears and nslookup returns
** server can't find google.com: SERVFAIL
I don’t use strict-order and it doesn’t matter if the faulty upstream DNS server is the first or the last entry in the config as long as it returns the fastest reply.
I had to delete the upstream DNS server which returns the server failure from my config to get dnsmasq working again.
I was able to create a tcpdump and syslog while the DNS server 83.169.185.162 returned a server failure today.
- You can see in syslog.txt that the reply messages are missing until I delete 83.169.185.162 from the config.
- The tcpdump wan.pcap shows that 83.169.185.162 returns the fastest reply with a server failure and that the other DNS servers work properly, but dnsmasq seems to ignore their replies.
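To see which upstream server is the one answering with SERVFAIL, each server can be queried directly, bypassing dnsmasq. The following is a minimal sketch using only the Python standard library; the helper names (build_query, rcode_of, probe) are made up for this illustration, and it only reads the RCODE from the reply header rather than parsing the full answer.

```python
import socket
import struct

# Common RCODE values from the DNS header (RFC 1035, RFC 2136 and later).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def rcode_of(response):
    """Return the RCODE name taken from the low 4 bits of the header flags."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def probe(server, name="google.com", timeout=2.0):
    """Send one UDP query to `server` and return its RCODE name or 'timeout'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        try:
            data, _ = sock.recvfrom(512)
        except socket.timeout:
            return "timeout"
    return rcode_of(data)
```

Calling probe("83.169.185.162") and probe() for each of the other configured servers shows per server whether it answers NOERROR, SERVFAIL, or not at all. This does not measure which reply arrives first; the tcpdump is still needed for that.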
Something similar happens to me from time to time. I'm a Gentoo user, and one of the nameservers for the gentoo.org domain seems to be unreliable. When it is down, it's near impossible for me to resolve anything in said domain.
Someone using OpenWrt also had this problem: he was unable to resolve most records of a domain when connected to his OpenWrt router. The problem did not occur when he was connected directly.
While searching for possible solutions, I came across the --all-servers option.
Can you test whether enabling it helps? It can be enabled in /etc/config/dhcp with the allservers option.
Hi Stijn,
the problem still occurs with the allservers parameter.
My ISP's DNS servers are working fine right now, so to reproduce the DNS server failure I created an Ubuntu VM in VirtualBox, installed a bind9 DNS server and assigned it the manual IP address 192.168.3.2 without a gateway address (which means it can't connect to the internet to resolve the request).
dnsmasq is now using the Google Public DNS servers and my faulty DNS server for the allservers parameter test.
Syslog shows that the requests are forwarded to all DNS servers as expected but the reply messages are still missing:
DNS starts to work again after I remove 192.168.3.2 from the DNS servers list.
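A lighter way to reproduce a faulty upstream, as an alternative to the bind9 VM described above, is a small UDP responder that answers every query with SERVFAIL. This is only a sketch under that assumption: the function names are made up for this illustration, it handles UDP only, and binding to port 53 normally requires root.

```python
import socket
import struct


def servfail_reply(query):
    """Turn a raw DNS query into a SERVFAIL response with the same ID."""
    query_id, flags = struct.unpack("!HH", query[:4])
    # Set QR (response) and RA, keep the opcode and RD bits, set RCODE=2.
    reply_flags = 0x8000 | 0x0080 | (flags & 0x7900) | 0x0002
    # Keep the section counts and the question exactly as they were sent.
    return struct.pack("!HH", query_id, reply_flags) + query[4:]


def serve(host="0.0.0.0", port=53):
    """Answer every incoming UDP query with SERVFAIL until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            query, client = sock.recvfrom(4096)
            if len(query) >= 12:  # ignore packets shorter than a DNS header
                sock.sendto(servfail_reply(query), client)
```

Running serve() on a LAN host and adding that host's address to the dnsmasq server list should give dnsmasq an upstream that always fails quickly, which matches the condition described in this report.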
Did it start happening after the update to dnsmasq 2.76, in May 2016?
These commits look relevant:
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=51967f9807665dae403f1497b827165c5fa1084b (introduced in dnsmasq 2.69)
http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=4ace25c5d6c30949be9171ff1c524b2139b989d3 (introduced in dnsmasq 2.76)
So, the first commit introduced the issue you see as a "feature" (but there was a bug in the implementation, so it didn't work), while the second commit made the first commit actually work starting from dnsmasq 2.76.
I'm not sure what the right behaviour is, but it does sound strange to treat SERVFAIL as a valid response.
Simon Kelley took note of this. It might be a necessary though annoying behavior for a stub resolver using DNSSEC.
https://www.mail-archive.com/dnsmasq-discuss@lists.thekelleys.org.uk/msg10901.html
@Baptiste: I can't reproduce the issue with OpenWrt Chaos Calmer 15.05.1 r49389 and Dnsmasq version 2.73. My internet outages started with LEDE and Dnsmasq 2.76.
@Eric: I don't use DNSSEC and thus treating SERVFAIL as a valid response sounds strange to me.
I've spent some time trying to add some logging messages and revert the changes mentioned above. I created a patchfile in package/network/services/dnsmasq/patches:
Now it is working as I would expect. If the fastest DNS server returns SERVFAIL, the next DNS server that returns NOERROR will be used for a valid response.
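The difference between the behaviour reported here and the behaviour expected after the patch can be modelled in a few lines. This is a simplified illustration, not dnsmasq's code (which is written in C and handles replies as they arrive): it assumes the replies are already collected in arrival order as (server, rcode) pairs, and both function names are made up.

```python
SERVFAIL = "SERVFAIL"


def first_reply_wins(replies):
    """Behaviour seen in this report: the fastest reply is used as-is."""
    return replies[0] if replies else None


def prefer_usable_reply(replies):
    """Expected behaviour: skip SERVFAIL while another server gives an answer."""
    for reply in replies:
        if reply[1] != SERVFAIL:
            return reply
    # Every server failed, so SERVFAIL is the only thing left to report.
    return replies[0] if replies else None


# Replies in arrival order; the faulty server answers first.
replies = [("83.169.185.162", "SERVFAIL"), ("8.8.8.8", "NOERROR")]
print(first_reply_wins(replies))
print(prefer_usable_reply(replies))
```

With the faulty server answering first, the first policy hands SERVFAIL to the client, while the second one falls through to the reply from the working server and returns SERVFAIL only when no server gives a usable answer.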
@DEVs: Please feel free to use the patch.
A patch much like this was folded into lede a day or three back.
For clarity:
A patch much like that and updating to dnsmasq 2.77test1 was pulled into jow's staging tree. It is not yet in master, much less backported to 17.01.*
I just tested a self-built image with https://git.lede-project.org/?p=source.git;a=commit;h=3bef96ef18a6fb20401313dfa6e88057d56b16ad and can't reproduce this issue anymore.
I would like to suggest cherry-picking this commit for LEDE v17.01.0 because it will prevent internet outages for users with unreliable DNS servers, like me.
PS: Simon Kelley included the fix in dnsmasq-2.77test2: http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=68f6312d4bae30b78daafcd6f51dc441b8685b1e
Thanks for your support.
I already have a pull request in for 2.77test2 https://github.com/lede-project/source/pull/794