FS#986 - ARV752DPW missing default switch config #5931

Open
openwrt-bot opened this issue Aug 27, 2017 · 9 comments

@openwrt-bot

Oliver:

Hi

On the ARV752DPW, with LEDE SNAPSHOT, r4723-4fce22e, and u-boot as linked to in the OpenWRT Wiki [http://www.galax.is/files/802/flash-uboot.bin] luci gives the warning message "Switch has unknown structure".
The structure should be WAN, 1, 2, 3, 4, CPU.

Also, swconfig shows that any configuration in /etc/config/network will be loaded on top of a default configuration that already exists at boot time. The switch does not appear to be properly initialized. Is this still supposed to happen in u-boot or should LEDE do this now?

e.g. /etc/config/network with:
...
config switch
	option name 'switch0'
	option reset '1'
	option enable_vlan '1'

config switch_vlan
	option device 'switch0'
	option vlan '1'
	option ports '1 2 3 4 5t'

config switch_vlan
	option device 'switch0'
	option vlan '2'
	option ports '0 5t'
...

leads to this after reboot:

VLAN 0:
vid: 0
ports: 0 5t
VLAN 1:
vid: 1
ports: 1 2 3 4 5t
VLAN 2:
vid: 2
ports: 0 5t
VLAN 3:
vid: 3
ports: 3 5t
VLAN 4:
vid: 4
ports: 4 5t
VLAN 5:
vid: 5
ports: 0 1 2 3 4
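
To compare the active VLANs with the configured ones at a glance, the output can be reduced to one line per VLAN. This is a minimal sketch, assuming the `VLAN n:` / `ports:` layout shown above; the function name `active_vlans` is made up for illustration.

```shell
# Print "vid<TAB>ports" for every VLAN block found on stdin.
# Assumes the swconfig output layout shown in this report.
active_vlans() {
	awk '
		/^[ \t]*VLAN [0-9]+:/ { vid = $2; sub(/:$/, "", vid) }
		/^[ \t]*ports:/ {
			if (vid == "") next        # ports line outside a VLAN block
			sub(/^[ \t]*ports:[ \t]*/, "")
			printf "%s\t%s\n", vid, $0
			vid = ""
		}
	'
}

# On the router: swconfig dev switch0 show | active_vlans
```

Any VLAN printed here that has no matching `switch_vlan` section in /etc/config/network comes from somewhere other than the user config.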

@openwrt-bot

mkresin:

... luci gives the warning message “Switch has unknown structure”.

Yeah, there is no default switch config bundled for the ARV752DPW in LEDE. Would you please give https://kresin.me/patches/add_ARV752DPW_switch.patch a try.
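
The contents of the linked patch are not reproduced in this thread. As a rough sketch of what a default switch entry in /etc/board.d/02_network generally looks like, assuming the port layout from the report (0 = WAN, 1 to 4 = LAN, 5 = CPU on eth0): the wrapper name `setup_arv752dpw_switch` is hypothetical, and `ucidef_add_switch` is the helper from /lib/functions/uci-defaults.sh.

```shell
# Sketch only, not the linked patch. The port roles are an assumption
# derived from the VLAN config in the report above.
# The hypothetical wrapper is meant to be called from the board's branch of
# the case statement, between board_config_update and board_config_flush.
setup_arv752dpw_switch() {
	ucidef_add_switch "switch0" \
		"0:wan" "1:lan" "2:lan" "3:lan" "4:lan" "5t@eth0"
}
```

The order in which the LAN ports map to the labels on the case is not known from this thread, so the plain `lan` role is used without label numbers.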

The switch does not appear to be properly initialized. Is this still supposed to happen in u-boot or should LEDE do this now?

The rtl8306 linux kernel driver (https://git.lede-project.org/?p=source.git;a=blob;f=target/linux/generic/files/drivers/net/phy/rtl8306.c;h=7c70109e633693b657f0c2ff61949cbb208b413e;hb=HEAD#l467) is supposed to reset any vlan settings done by the bootloader.

I have the ARV7506PW11 here, which uses the same rtl8306 switch chip. But till now I haven't checked if the switch driver really resets any switch config done by the bootloader. Might be that you hit a bug.

@openwrt-bot

Oliver:

Would you please give https://kresin.me/patches/add_ARV752DPW_switch.patch a try.

I would love to. Unfortunately I am abroad and do not have the resources to set up a build environment here.
So I tried to just change the file in the overlay filesystem. It did not work. I guess, these scripts are executed from ROM?

But till now I haven't checked if the switch driver really resets any switch config done by the bootloader.

At least in Oct 2013 it did not. It did not even initialize the switch properly.
So when OpenWRT was started by a broken bootloader, OpenWRT had no network connectivity.
This was the reason for the patch back then.
I thought it might have changed by now. But maybe it has not.

@openwrt-bot

mkresin:

I would love to. Unfortunately I am abroad and do not have the resources to set up a build environment here.
So I tried to just change the file in the overlay filesystem. It did not work. I guess, these scripts are executed from ROM?

root@LEDE:/# vim /etc/board.d/02_network
root@LEDE:/# rm /etc/board.json
root@LEDE:/# rm /etc/config/network
root@LEDE:/# reboot

Also, swconfig shows that any configuration in /etc/config/network will be loaded on top of a default configuration that already exists at boot time. The switch does not appear to be properly initialized

The switch is properly initialized. It is the rtl8306 linux driver which adds the funky vlan config. According to the comments in the switch driver, it is done to isolate the ports. That doesn't make much sense to me: the vlan functionality is disabled by default but a default vlan config is applied anyway, so it should never work. Regardless of what I've tried (enable vlans without user vlan config, disabled vlan config), packets are leaking perfectly fine between the switch ports.

As soon as you enable the vlan functionality and set up vlans, the default config isn't reset. The user config is only applied on top. It is strange that it works at all, but it works for me (or I've chosen a bad test case).

@openwrt-bot

mkresin:

So when OpenWRT was started by a broken bootloader, OpenWRT had no network connectivity.

This ticket isn't really the right place for tracking this issue but anyway.

I gave the ARV752DPW u-boot from the OpenWrt wiki a try on my ARV7506PW11 and it works. I was able to use tftp as well as the http recovery interface.

But I can confirm that there is an issue with u-boot and the switch chip. I managed to get the switch into a state where it can not be found by u-boot (and later on not by LEDE) two times. Unfortunately I'm not able to trigger/reproduce the bug.

The rtl8306 switch chip has a pin to hold the chip in reset. For the ARV7506PW11 and most likely for the ARV752DPW as well the pin is connected to GPIO 19 of the danube SoC. During boot, u-boot toggles the pin to reset the switch (vlan) config:

searching for rtl8306 switch ... found Reset Hard Done

If the rtl8306 can not be found by u-boot the (hard) reset isn't done any more.

Net: searching for rtl8306 switch ... failed

As soon as I set the direction of the GPIO connected to the reset pin to "out", the switch comes back to life:

root@LEDE:/# echo 499 > /sys/class/gpio/export
root@LEDE:/# echo out > /sys/class/gpio/gpio499/direction
root@LEDE:/# [ 193.245095] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 193.253203] br-lan: port 1(eth0.1) entered forwarding state
[ 193.257427] br-lan: port 1(eth0.1) entered forwarding state
[ 193.266290] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
[ 195.260755] br-lan: port 1(eth0.1) entered forwarding state

I've no idea why it fails nor why the direction change brings the switch back to life (GPIO value of 1 usually releases the reset on my board).

Judging by the output, the hard reset limbo is only done if the switch can be found/identified. It could be some kind of interaction with the LEDE driver where an important register has a bad value and locks the chip. It might be that the reset limbo fails (due to a race condition/missing wait) and the rtl8306 chip gets stuck in reset. For the latter case, though, I would expect that forcing the switch into reset from LEDE would trigger the bug as well, which it doesn't.

Nevertheless we should take care of this issue. As long as I'm not able to reproduce/trigger the issue I can not be really helpful.

One way to fix the switch state could be to add a device tree binding for the rtl8306 with a gpio-reset property. The GPIO referenced by the gpio-reset property can be pulled during early driver load to release a possible reset. The same property can be used to issue a hard reset on driver load instead of resetting the vlan config only.

Maybe you are able to trigger the bug reliably and check the gpio registers and so on, to get a clue what the real issue is here.
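
No such binding exists in this thread, but the same reset pulse can be approximated from userspace through the sysfs GPIO interface. This is a rough sketch, assuming an active-low reset line (a value of 1 releases the reset, as noted above) and the sysfs number 499 used earlier. The function name, the one second hold time and the optional sysfs root argument are assumptions for illustration.

```shell
# Pulse an active-low reset line through the sysfs GPIO interface.
# Usage: gpio_reset_pulse <gpio-number> [sysfs-root]
# sysfs-root defaults to /sys/class/gpio; the parameter exists only so the
# function can be tried against a scratch directory.
gpio_reset_pulse() {
	gpio="$1"
	root="${2:-/sys/class/gpio}"
	# Export the line if the kernel has not exposed it yet.
	[ -d "$root/gpio$gpio" ] || echo "$gpio" > "$root/export"
	# "low" makes the pin an output that drives 0, which asserts the reset.
	echo low > "$root/gpio$gpio/direction"
	sleep 1   # assumed hold time; no datasheet minimum is quoted in this thread
	# Drive 1 to release the reset again.
	echo 1 > "$root/gpio$gpio/value"
}

# On the router: gpio_reset_pulse 499
```

Expect the network to drop while the switch is held in reset, so this is best run from a serial console.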


Sidenote: The (hard) reset is missing in the lantiq u-boot for the ARV752DPW and will cause a non-working u-boot network if you set up vlans in LEDE and do a reboot. You might want to have a look at the ARV7506PW11 u-boot, which has the (unconditional) hard reset on load.

@openwrt-bot

Oliver:

root@LEDE:/# rm /etc/board.json
root@LEDE:/# rm /etc/config/network

Yes, this does the trick.

This ticket isn't really the right place for tracking this issue

We can take it off-line or into the forum if you want.

I was able to use tftp as well as the http recovery interface.

I am pretty sure that Wireshark did not see a single packet when I tried tftpboot. But I can try that again, when the router is not in use.

why the direction change brings the switch back to life (GPIO value of 1 usually releases the reset on my board).

Yes, according to the RTL8306M datasheet the reset (pin 40) is an active low input.
They might have added a pull-down resistor to keep the RTL8306M in a safe reset state until the danube drives pin 40 high. In this case making the danube pin an input would be just as good as switching it low.
Or they might have saved a few bucks and not wired any pull-down at all, so that pin 40 would be floating as long as the connected danube pin is an input (Hi-Z).
In any case the connected danube pin should be made an output early in the boot sequence and not be configured as input.

The (hard) reset is missing in the lantiq u-boot for the ARV752DPW and will cause a non-working u-boot network if you set up vlans in LEDE and do a reboot.

OK, this may explain why I did not have any network at all in u-boot. Of course the box had some vlan config on it, when I took it in order to replace a burnt-up EB 803.

On the ARV752DPW with running LEDE, pin 19 is an input and driven (or pulled) high. Weird. Why input? I will at least try to follow the trace and see where it goes, before I change the GPIO. I don't want to cause smoke signals because I make it an output low, while it is driven high by another chip. ;-)
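
For looking at the pin without touching its configuration, a read-only helper is enough. This is a minimal sketch, assuming the line has already been exported as in the earlier comment; the function name `gpio_show` and the optional sysfs root argument are made up for illustration.

```shell
# Print direction and level of an already exported GPIO without changing it.
# Usage: gpio_show <gpio-number> [sysfs-root]
gpio_show() {
	dir="${2:-/sys/class/gpio}/gpio$1"
	if [ ! -d "$dir" ]; then
		echo "gpio$1 is not exported" >&2
		return 1
	fi
	printf 'gpio%s direction=%s value=%s\n' \
		"$1" "$(cat "$dir/direction")" "$(cat "$dir/value")"
}

# On the router: gpio_show 499
```

The helper only reads the `direction` and `value` files, so it cannot drive the pin against another chip.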

@openwrt-bot

mkresin:

For now I will use the ticket for this bug as well.

I'm now able to trigger the bug.

The whole rtl8306 configuration is done via mii. The rtl8306 has paging implemented to access different values via the same mii register. The precompiled u-boot doesn't have support for paging and expects the first page to be selected/active.

After running "swconfig dev switch0 show && reboot" a different page is active, the bootloader fails to read the chip id via mii (the selected register returns 0x0000) and fails to bring up the switch.
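
The failure mode can be pictured with a toy model. The sketch below is not driver or u-boot code; the ID value `0xABCD` and all function names are placeholders. It only shows why a reader that ignores paging gets 0x0000 once another page has been left selected.

```shell
# Toy model of a paged register: the same MII address shows different
# contents depending on the selected page. All values are made up.
page=0

select_page() { page="$1"; }

# read_id: what a reader sees at the chip ID address.
read_id() {
	if [ "$page" = 0 ]; then
		echo 0xABCD   # placeholder ID, only visible on the first page
	else
		echo 0x0000   # other pages hold unrelated data at this address
	fi
}

# A bootloader without paging support simply reads and compares.
bootloader_probe() {
	if [ "$(read_id)" = 0xABCD ]; then echo found; else echo failed; fi
}

bootloader_probe   # first page selected
select_page 1      # something leaves another page selected before reboot
bootloader_probe   # the same read now misses the ID
```

In this model the probe only succeeds again after `select_page 0`, which corresponds to the bootloader's expectation that the first page is active.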

I have a commit (besides other rtl8306 fixes) in my staging tree (https://git.lede-project.org/?p=lede/mkresin/staging.git) to work around the issue. I do not intend to commit this workaround, since it is a bug/limitation of the precompiled u-boot.

I've still no real clue why the switch is in reset in case the precompiled u-boot fails to initialize the rtl8306 (and do the hard reset via GPIO). I'm not able to reproduce it with the current LEDE u-boot, even if I do not touch any GPIOs.

I can confirm that the danube SoC's reset value for GPIO#19 is input, which does not hold the rtl8306 in reset.

The GPIO#19 config with the current LEDE u-boot without any GPIO config (dumped during early kernel load):

DIR: input | PULLING: down | PULL: disabled | DRAIN: open drain | ALTSEL0: 0 | ALTSEL1: 0 | OUT: 0 | IN: 1

And the GPIO#19 config with precompiled u-boot and failed rtl8306 init:

DIR: input | PULLING: down | PULL: disabled | DRAIN: open drain | ALTSEL0: 0 | ALTSEL1: 0 | OUT: 0 | IN: 1

I'm able to get the switch (and GPIO#19 values) to the same state by doing the following:

echo "499" > /sys/class/gpio/export
# reset
echo 1 > /sys/class/gpio/gpio499/value
# switch back to input
echo in > /sys/class/gpio/gpio499/direction
# switch to output (default reg value releases reset)
echo out > /sys/class/gpio/gpio499/direction

I've managed to compile the old u-boot code (commit id 0dd64c5) and will give the binary a try.

Might be that this version already shows the issue. Of course, it could be that the person who provides the precompiled u-boot has applied some local changes. A perfect explanation would be GPIO#19 is set as CONFIG_SWITCH_PIN (output + high) and CONFIG_BUTTON_PIN (input).

@openwrt-bot

Oliver:

I dug in backups of the ancient past and found the attached patch. Issue here was that the CPU port was not enabled. Maybe the "old" u-boot in the Wiki does not have this?

Anyway, looking into the source gave me confidence that danube pin 19 controls the reset of the rtl8306, also on ARV752DPW. So I became brave and toggled the direction to output - and got immediately yelled at.
Network connection was gone. ;-)
I'll pull the router tomorrow and play with it.

@openwrt-bot

Oliver:

First experiments:

"swconfig dev switch0 show && reboot" did not trigger any issue.

Setting pin 19 to output resets the switch (output default value is '0'). After setting it to '1' the switch comes back with all 16 vlans configured (CPU port is not connected). Removing all vlans with swconfig and running "/etc/init.d/network restart" leads to the same switch configuration as a cold start: vlans 0 to 5 are configured, even though only 1 and 2 are configured in /etc/config/network.
See attached file.

@openwrt-bot

mkresin:

Removing all vlans with swconfig and running "/etc/init.d/network restart" leads to the same switch configuration as a cold start: vlans 0 to 5 are configured, even though only 1 and 2 are configured in /etc/config/network.

That is expected. You most likely have a reset '1' in /etc/config/network. If the rtl8306 linux driver init and/or reset function is called, the default "port isolation" switch config is applied. I have a fix for that in my staging tree.
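
Whether the reset option is set can be checked without reading the whole file. This is a minimal sketch, assuming the plain UCI layout from the original report; the function name `switch_reset_enabled` is made up for illustration.

```shell
# Succeeds if a "config switch" section in the given UCI file sets
# option reset to 1.
# Usage: switch_reset_enabled [file]   (default: /etc/config/network)
switch_reset_enabled() {
	awk -v q="'" '
		$1 == "config" { in_switch = ($2 == "switch") }
		in_switch && $1 == "option" && $2 == "reset" {
			gsub(q, "", $3)       # strip single quotes around the value
			gsub(/"/, "", $3)     # strip double quotes around the value
			if ($3 == "1") found = 1
		}
		END { exit !found }
	' "${1:-/etc/config/network}"
}

# On the router: switch_reset_enabled && echo "driver reset runs on network restart"
```

On the device itself, `uci -q get network.@switch[0].reset` should give the same answer for the first switch section.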

I'm able to reproduce the issue with the u-boot compiled from 0dd64c5.

The only difference between u-boot compiled from 0dd64c5 and the u-boot from the wiki is that the wiki one has working network in u-boot.

The switch in u-boot (0dd64c5) isn't completely uninitialized. I see ARP requests and after a few tries a single TFTP read request but that's it.

Long story short, I've no idea which code the u-boot linked in the wiki is based on or which patches are applied. But it isn't based on the last (old u-boot) vanilla version which I found in the repo.

Due to the fact that I can reproduce the switch hang with a u-boot I have code for, it's possible to find out why the switch is in reset. Albeit it is more or less interesting for educational purposes.

Seems to me my method of triggering the bug isn't that reliable. Occasionally I fail to trigger it as well.

Could be that running swconfig multiple times before doing the reboot does the trick. Might be that a user specific vlan config is required as well. Hard to say since I'm too lazy to check the code for register reads requiring a page switch. If you have the patch linked in my first comment applied, at least your vlan config should match the one I'm using.

Feel free to mail me if your experiments show something interesting. You can find my mail address in the SoB line of my commits.
