FS#388 - odhcpd: A default route is present but there is no public prefix on br-lan thus we don't an #5491
Comments
dtaht: root@lorna-gw:~# ip -6 route I will be rebooting this box in a bit. In an earlier run, I WAS getting a /60 from upstream of me. Now I only have a /64 (and thus dhcpv6-pd should be refused), but I shouldn't lose all connectivity elsewhere while being refused.... |
dtaht: In dhcp, adding this
"fixed it". The uap gets a ula/64. I am no longer getting a /60 from comcast (40+ reboots today), but perhaps that will fix itself overnight, and I can re-re-resume testing. That said, I think the default route and available prefix check in odhcpd is busted. |
dtaht: And this morning I got my first ever dhcp-pd request inside this network to succeed on a real /60. I will go back and recomplexify, re-enabling the mcast bridge code, moving the pd requestor back to the other side of the bridge, and try to see if being out of IPs in the request pool was also a cause of the problem. This issue was originally triggered by my trying to put an edgerouter in place (which still crashes horribly) running their default OS. I'll put up another couple of lede boxes on the net instead, each requesting a subnet. I'm not in a position to hack directly on odhcpd in lede at the moment. |
EricLuehrsen: Confirm "me too." odhcpd will delegate for a few minutes, then withdraw. TP-LINK WDR3600 and Archer C7. I configured odhcpd+unbound (no dnsmasq) on the head (C7), and dnsmasq-full (no odhcpd) on the extender (WDR3600). I find that while dnsmasq doesn't have all the IPv6 features of odhcpd, the DHCPv6 / RA features it does have are more flexible and work more robustly, including with off-standard clients. "ra_default 1" will announce a default route even if WAN6 fails to get an address. Therefore failover will need to time out on each new web page, rather than assuming IPv4.
--Side note on Comcast DHCP to get you testing faster. The leases are for 7 days and the default delegation is a /64. So if you don't pre-configure a /60 request before you plug a box in, you're temporarily stuck, and you need to follow a few steps:
(1) unplug the cable modem.
(2) change your WAN MAC.
(3) set your prefix request to /60.
(4) plug in your cable modem.
Topology: Comcast Modem -> C7 (AP) -> Wifi -> (Client) WDR (AP) -> Wifi |
dedeckeh: Can you attach the complete output of logread in the described error situation ? |
dtaht: btw: how do you change something to confirmed, and up the priority? Being able to DDOS an entire ipv6 network with a single legit packet, emitted by default by a lede install, is not good. Updating this comment: the DDOS was essentially coming from the edgerouter on the network. Still hosing the whole net to this extent (netifd's state machine?), was kind of bad... and I haven't added back in all the stuff I ripped out yet. |
dtaht: I have a theory. In testing the edgerouter box, it (using wide, I think) goes crazy flooding odhcpd. This is, of course, bad in itself (I'll report to those guys), but somewhere in there the ra announcement on the local net goes away (not sure if this is before or after). Could odhcpd not be servicing packets from other sources in its select loop (e.g. always returning to the first file descriptor returned by select)? Or is the dhcpv6-pd section of the code staying within its "bit" and not letting other stuff odhcpd is supposed to handle be serviced? Yes, there are also lede issues. I think I have a couple, but need to test each one individually. |
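To make the starvation theory concrete, here is a minimal Python sketch of an event loop pass that services every readable socket once per `select()` wakeup, so a flooding peer cannot keep the loop pinned on one descriptor. The helper name `drain_ready` and its parameters are hypothetical; this is not odhcpd's event loop (which is written in C), only an illustration of the behaviour being asked about.

```python
import select
import socket


def drain_ready(sockets, handle, timeout=1.0, bufsize=4096):
    """Service every readable socket once per select() wakeup.

    Hypothetical helper for illustration only; not odhcpd code.
    Returns the number of sockets handled in this pass.
    """
    readable, _, _ = select.select(sockets, [], [], timeout)
    for sock in readable:
        # Read a single datagram per source and move on, so that one
        # chatty peer cannot monopolise the pass.
        handle(sock, sock.recv(bufsize))
    return len(readable)
```

A loop that instead kept re-reading the first ready descriptor until it was empty would never get to the other sources while the flood lasted, which would match the symptom of the local RA going away.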
dtaht: OK, I have narrowed it down still further, taking the local pd attempt out of the loop. Down to just the one comcast modem - lede router - one linux client.
As to why ipv6 was working at all, all this time, for anyone on comcast (or this particular modem): my linux box will see a retraction, and its expires timer will run to -30 seconds, then expire. And odhcpd almost always gets a new one out there in roughly a minute (except when getting flooded, or other problems that I was triggering that made it more obvious!). The "not announcing" thing we started with was a symptom... I think. Even with ra_default 1 I'm still watching my ra on the client appear and disappear.
Attached are two tcpdumps: the dhcpv6 one is what I'm seeing from comcast, the local-comcast one is what I'm sending internally.
tcpdump -i br-lan -w /tmp/local-comcast.cap icmp6 or udp port 546 or udp port 547
Packet 17 makes things go bye-bye, packet 38 brings things back to life.
Mon Jan 16 17:25:47 2017 daemon.debug odhcpd[1074]: Received 144 Bytes from fe80::201:5cff:fe63:e446%eth0
Something happened here. Also the route to the main ULA disappears:
lorna-gw: # fdaf:dc63:6de9:8::/64 via fe80::822a:a8ff:fe86:3417 dev br-lan proto static metric 1024 pref medium
vanishes and there is no fe80::822a:a8ff:fe86:3417? The switch?
Mon Jan 16 17:25:55 2017 daemon.info dnsmasq[4724]: read /etc/hosts - 4 addresses
and this sends that short ra packet, presumably retracting everything, and my expires timer starts counting down 0, -1, -2 ...
Mon Jan 16 17:25:57 2017 daemon.debug odhcpd[1074]: Received 144 Bytes from fe80::201:5cff:fe63:e446%eth0
And then we're back! |
dtaht: I've done a clean reboot. Got rid of everything (the above was after mucho hacking). Got rid of all firewalling for ipv6 icmp also. Andddddd... I have not seen it reload dns or odhcp in 10 minutes. Perhaps the netifd state got corrupted by the other stuff. So, now I have a nice, simple, 1 machine network, with dhcp-pd supplying up to a /60 to that lone machine. Not particularly useful, but it is kind of nice to see an RA last for 10 whole minutes! |
dtaht: and I got a single lede box up internally with dhcp-pd. :woot: Given all the hell I just went through rebooting the universe, I think it would be nice to be able to advertise even ULAs with a shorter lifetime than forever. Can't always reboot everything. It would also be nice to be able to override the default supplied leasetime (it doesn't seem to respect "leasetime" in the dhcp file)
|
dtaht: OK. I am going to leave it as is overnight, and finally run a few flent tests to stress out the devices I'd intended to stress out. I will go back to adding more lede routers tomorrow, after writing everything down and backing things up. |
dtaht: And I got tempted to deal with the simple config... and a couple reboots of the second dhcp-pd lede router... I'm back where ra's come and go again. At one level, I'm weirdly happy. I can reproduce this at leisure... ... but all I wanted to do originally was run a few ipv6 tests under flent overnight. So. Rebooting the universe again.... |
dtaht: recursively: https://plus.google.com/u/0/107942175615993706558/posts/N6qm9YrJBc4?sfc=true |
dtaht: I (accidentally) tested this again. It takes 3 reboots of an internal lede router requesting dhcp-pd for the ra disappearance to start showing up, with the typical pattern of this, repeating:
Thu Jan 19 11:03:21 2017 daemon.debug odhcpd[1080]: Received 136 Bytes from kernel%netlink
I do not know where the state machine is or what it should look like.. BUT, a CLUE! I see while it is down ifstatus shows the external ipv6 is gone, and ifstatus while it is up shows the external. In either case, I do have ULAs on the link and these show as gone, too. Let me put in some commented ifstatus messages. |
dtaht: this is where life is "normal" (meaning that I'm in a failure mode that is retracting and resubmitting ras), me doing an ifstatus lan, then a ifstatus wan ifstatus lan {
} This is wan address: {
} |
dtaht: And this is during the ra retraction. I will reboot in a bit to see what things are like while things are actually working. {
}
} |
dtaht: what had seemed to be significant is the wan interface not having any ipv6 addrs. Well, those are on wan6. Diffing the above yielded nothing; diffing the ifstatus wan6 results, nothing either. I will add in a known good one after a reboot..... |
dtaht: openwrt/odhcpd#79 has the same bug report essentially. And it was even bisected! Yea!!!! Big bisect. Booo! This totally explains why I'd not seen this (very often) before trying to deploy lede-head at more scale. |
dedeckeh: odhcpd has been patched to create extra syslog traces for troubleshooting (https://git.lede-project.org/?p=project/odhcpd.git;a=summary). |
dtaht: I will get on this friday at the latest. Thx |
dtaht: I have temporarily abandoned odhcpd for dnsmasq, in favor of first getting address/prefix acquisition from outside (6rd, odhcp6c) working well, and am giving up on PD. |
dtaht:
Supply the following if possible:
I do have dnsmasq-full installed and this:
config odhcpd 'odhcpd'
	option maindhcp '0'
	option leasefile '/tmp/hosts/odhcpd'
	option leasetrigger '/usr/sbin/odhcpd-update'
(I will back off dnsmasq-full now that I've learned that the flakiness was triggered by the ra going away, but I had long assumed dnsmasq won't do ras unless you tell it accept_ra. Should I try making odhcpd the main dhcpv4 server?)
Topology:
ComcastModem -> archerc7v2 -> uap-lite -> wifi
odhcpd says every ~minute "A default route is present but there is no public prefix on br-lan thus we don't announce a default route!", even though ip -6 addr show shows
24: br-lan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 fdaf:dc63:6de9:10::1/60 scope global noprefixroute
valid_lft forever preferred_lft forever
inet6 2601:646:4180:elided::1/64 scope global noprefixroute dynamic
valid_lft 332453sec preferred_lft 332453sec
inet6 fe80::32b5:c2ff:fe75:7faa/64 scope link
valid_lft forever preferred_lft forever
as fast as I can poll it.
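For reference, the kind of check the log message implies can be sketched in Python as follows. This is an assumption about the intent of the message, not odhcpd's actual logic, and the names `classify` and `has_public_prefix` are hypothetical. The global address in the output above is elided, so the example call uses the documentation prefix 2001:db8::/32 as a stand-in.

```python
import ipaddress

ULA = ipaddress.ip_network("fc00::/7")             # unique local addresses (RFC 4193)
GLOBAL_UNICAST = ipaddress.ip_network("2000::/3")  # global unicast space


def classify(addr_with_len):
    """Return 'link-local', 'ula', 'global' or 'other' for an 'addr/len' string."""
    addr = ipaddress.ip_interface(addr_with_len).ip
    if addr.is_link_local:
        return "link-local"
    if addr in ULA:
        return "ula"
    if addr in GLOBAL_UNICAST:
        return "global"
    return "other"


def has_public_prefix(addrs):
    """True if at least one address on the interface is global unicast."""
    return any(classify(a) == "global" for a in addrs)


# Addresses shaped like the br-lan output above; the 2001:db8:: entry is a
# placeholder for the elided Comcast address.
print(has_public_prefix([
    "fdaf:dc63:6de9:10::1/60",
    "2001:db8::1/64",
    "fe80::32b5:c2ff:fe75:7faa/64",
]))
```

With an interface that looks like the one above, a check of this shape would find a public prefix, which is why the repeated "no public prefix" message looks wrong.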
The router advertisement shows an advertised router lifetime of 0 alternating with 64k every 30 sec or so.
But for no reason I understand that is seemingly ok. My IPv6 connectivity seems to keep working.
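For anyone who wants to spot the retractions in a capture, here is a minimal Python sketch that pulls the Router Lifetime field out of a raw ICMPv6 Router Advertisement, using the fixed header layout from RFC 4861. The function names are hypothetical, and the input is assumed to start at the ICMPv6 header (not the IPv6 header). A lifetime of 0 means the sender is no longer offering itself as a default router.

```python
import struct

ND_ROUTER_ADVERT = 134  # ICMPv6 type for Router Advertisement (RFC 4861)


def router_lifetime(icmp6_payload):
    """Return the Router Lifetime (seconds) from a raw ICMPv6 RA message."""
    if len(icmp6_payload) < 16:
        raise ValueError("too short for a Router Advertisement")
    # type, code, checksum, cur hop limit, flags, router lifetime,
    # reachable time, retrans timer
    fields = struct.unpack("!BBHBBHII", icmp6_payload[:16])
    if fields[0] != ND_ROUTER_ADVERT:
        raise ValueError("not a Router Advertisement")
    return fields[5]


def is_default_route_retraction(icmp6_payload):
    """True if the RA withdraws the sender as a default router."""
    return router_lifetime(icmp6_payload) == 0
```

Feeding each captured RA through `is_default_route_retraction` would give a timeline of when the default route is withdrawn and when it comes back.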
BUT:
Add a client attempting to get a dhcpv6-pd address further downstream, and failing,
and all hell breaks loose.
Snipping from the log
At this point (a few seconds before I see the solicit in the log) the ra induced route is withdrawn and all hosts lose ipv6 connectivity for about 30 seconds. (the pd request also fails)
And it repeats about every 1-2 minutes. Believe me, having ipv6 working only 50-75% of the time is maddening!
http://www.taht.net/~d/dhcpv6bug/ipv6advert.png - the normal advert
http://www.taht.net/~d/dhcpv6bug/ipv6retract.png - then a short one with lifetime 0
There's a packet capture in the same dir.
...
While debugging and simplifying this today I also eliminated the multicast-unicast code as a proximate cause.
...
I've seen a few other bug reports like this around; perhaps I've made some progress. (I was originally triggering this chaos with the edgerouter with dhcp-pd requests; now it's lede-head throughout, and pure ethernet rather than a wifi bridge.) There have occasionally been prefixes available, but not at the moment, and the effect is the same with or without a prefix being offered.
So my guess is odhcpd is not successfully polling for the addresses on the interface (sometimes). Could be subtler.