This article originally started out to be an exploration into how to make routers fail. The goal was to develop a router "torture test" that would reveal which routers would be more (or less) reliable over time. It turns out that this is harder to do than you might think.
Given the complexity of the code in today's routers and the wide range of internet services that they have to play nice with, revealing a particular router's failure modes is a tough job. It's easy to hammer a router with a traffic stream to see how fast it will move packets from WAN to LAN and back again. But doing that alone usually doesn't do anything other than produce a huge test file that shows that the router handled all the packets just fine. (I know, I've tried.)
It's much more difficult to devise a test scenario that causes a router to lock up or drop its WAN connection intermittently, behavior that many consumers experience all too frequently.
In the course of my research, I remembered the folks at QA Cafe. This small group of self-described "packet geeks" makes a line of products built around its flagship CDRouter router test solution. CDRouter is a Linux-based router test suite that currently includes over 400 tests in its base product. The test suite can be expanded with the optional modules shown below to add even more test cases.
CDRouter Product Family
I was going to use the free demo to see if CDRouter would do the job. But after an initial discussion with the QA Cafe guys, they offered to loan me one of their CDRouter-in-a-box appliances, the NTA1000. This turned out to be a good decision on their part, because it let me bypass the hassle of configuring a Linux system with enough Ethernet ports and go straight to learning how to configure CDRouter.
QA Cafe NTA1000
It's important to point out that CDRouter is not intended as a stress test tool. Its key focus is as a regression test tool for router designers. CDRouter's test suite can verify that a wide set of router functions work in the first place and also ensure that they are still working when firmware is changed. But given the extensive range of tests and the ability to repeat tests, loop the entire sequence of tests and even shuffle the order of tests, CDRouter can also provide some level of router stress testing.
Since I had to drop back from my original plan of pushing routers to their breaking point, I decided to take a different approach. As I've noted in the past, many users swear that alternative firmware like DD-WRT or Tomato is more reliable than what comes loaded on routers. If I only had a dime for every snarky comment I've seen about how bad router factory firmware is, I'd be on a beach somewhere instead of writing this!
There is no question that alternative router distros like DD-WRT, Tomato, etc. provide access to broader feature sets than factory firmware. But do undesirable features (AKA bugs) also ride along with the new tricks that these alternative firmwares enable? I decided to find out.
The object of my attention was the ever-popular ASUS RT-N66U "Dark Knight" router. I chose it because, yes, it's popular. But it also has a unique alternative firmware available for it that takes a very different approach. Asuswrt-Merlin (AKA Merlin) has bug fixing as its top priority and only adds "a few basic features and tweaks to the original firmware". (Check Scott's Merlin review for the details.) I thought Merlin was a good firmware to test to see whether it lives up to its claim of fixing, not making, bugs.
The test approach was:
- Determine the CDRouter test suite
- Flash the RT-N66U with each firmware
- Run the test suite
- Compare the results
This is admittedly a simple approach, but it would at least determine whether the alternative firmware tested takes a "first, do no harm" approach.
Step 1 was the most difficult. I had to choose a test suite that tested as much as it could, but that also didn't test things that the factory firmware wasn't designed to do in the first place. I also had to work within the confines of a fixed router setup. CDRouter doesn't have the ability to access the router admin GUI to change settings during testing. Since I wasn't up to the task of finding a scripting platform that could do both that and also run CDRouter, I had to use the same router setup for each firmware.
Fortunately, the guys at QA Cafe were very helpful. Given their experience with testing a lot of routers, they were able to tip me to common router weaknesses and help me determine whether the test failures I saw were real or a result of test setup misconfiguration.
The firmwares tested are listed below. To be exact, I'm showing the actual firmware files loaded.
- ASUS factory: RT-N66U_126.96.36.199_260.trx
- Asuswrt-merlin: RT-N66U_188.8.131.52_266.23b.trx
- Tomato (Shibby): tomato-K26USB-1.28.RT-N5x-MIPSR2-104-AIO-64K.trx
- DD-WRT: dd-wrt.v24-19342_NEWD-2_K2.6_mega.bin
Before each test, I made sure each router was configured as follows:
- Set router address to 192.168.1.1
- Dynamic DNS set for DynDNS service, test.domain.com Host name, qacafe username and qacafe123 password
- Set WAN to clone router MAC address
- Enabled LAN DHCP server, range 192.168.1.2 - 253
- WAN type set to DHCP client
After some initial runs using the factory firmware to debug my test plan, I reflashed the router with each firmware and ran the test suite once on each firmware. In cases where I suspected problems with test configuration, I reran the suite after correcting the router configuration.
The graphic below, taken from the CDRouter BuddyWeb console, shows all the available test modules, with the ones included in my test suite checked. Some tests are simple and take mere seconds to run, like the tests verifying VPN passthrough. Others, like the renum_dhcp group, which verifies that the router continues to function properly through a WAN DHCP lease renewal, take 5 minutes or longer due to the wait for DHCP lease renewal.
If you want to see more information about the tests in each group, use the online list. Although I didn't use all of the 400 available tests, I think you'll see that the tests I ran hit most of the key duties that any router must perform flawlessly.
The Test List
CDRouter unfortunately doesn't have an easy way to compare test results. So I used its export feature to get the test results into Excel, sorted the Passes and Fails and summarized only the failed tests into the table below. The table is arranged to provide a separate row for each failed test. If you see the test name in a row, it failed. If you don't, it didn't.
|sip_42||sip_42||[all sip tests]|
|sip_45||sip_45||sip_45||[all sip tests]|
|sip_60||sip_60||sip_60||[all sip tests]|
|sip_61||sip_61||sip_61||[all sip tests]|
|sip_62||sip_62||sip_62||[all sip tests]|
|sip_63||sip_63||[all sip tests]|
|sip_73||[all sip tests]|
CDRouter failed test summary
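The export-and-sort step described above can also be scripted. The following is a minimal Python sketch, not the method actually used for this article (that was Excel). It assumes the exported results have already been loaded into a dictionary keyed by firmware. The firmware and test names in the sample are placeholders, not real results.

```python
from collections import defaultdict


def failure_matrix(results):
    """Map each failed test name to the list of firmwares that failed it.

    `results` is assumed to look like {firmware: {test_name: "pass" | "fail"}}.
    """
    failed = defaultdict(list)
    for firmware, tests in results.items():
        for test_name, outcome in tests.items():
            if outcome.strip().lower() == "fail":
                failed[test_name].append(firmware)
    # Sort by test name so rows line up the same way on every run.
    return dict(sorted(failed.items()))


# Placeholder data for illustration only.
sample = {
    "firmware_a": {"test_1": "pass", "test_2": "fail", "test_3": "fail"},
    "firmware_b": {"test_1": "pass", "test_2": "fail", "test_3": "pass"},
}

firmwares = list(sample)
for test_name, failed_on in failure_matrix(sample).items():
    # One row per failed test; the test name appears only under firmwares that failed it.
    cells = [test_name if fw in failed_on else "" for fw in firmwares]
    print("|" + "|".join(cells) + "|")
```

Each printed row follows the same convention as the table: a test name in a column means that firmware failed the test, and an empty cell means it passed.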
It's clear that UPnP failures are present in all four firmwares, as are failures in the SIP ALG (Application Level Gateway). But some of the failures common to all four firmwares can be considered "features", as we'll see in the commentary below.
To start us off, I passed some of the failures back to QA Cafe for review. Their comments, edited for clarity, are below with my additional comments preceded by "SNB" and in italics.
dhcp_server_700: this looks like a legitimate failure if the ASUS claims that it supports the DHCPINFORM message.
SNB: I didn't confirm this with ASUS. But since the failure is common across all firmwares, the problem isn't unique to ASUS and could be considered a "feature"
firewall_2: this looks like a legitimate failure. The ASUS firewall allowed an inbound TCP connection from a host on the WAN to a client on the LAN at port 80. This test should normally pass, and the firewall appears to be enabled based on the other firewall test results.
SNB: More discussion on this below.
app_14: this test is failing because when CDRouter tries to verify that the FTP port has been closed, it finds that it is actually still open. To properly run this test you need to know what the ASUS uses for an FTP port timeout.
SNB: As noted, this is more a test configuration issue. But I left the test in to see if any of the other firmware made it go away. The test value for the timeout is 80 seconds. The actual timeout is longer than this.
ipsecpt_30: this looks like a simple configuration issue. It looks like the ASUS supports passthrough of unknown IP protocol packets.
SNB: Again, I left this test in to see if the other firmware distros exhibited the same behavior. They do since the test fails for all distros.
ipsecpt_110: this looks like a legitimate failure.
SNB: More discussion on this below.
upnp_36 and upnp_41: these look like legitimate failures, if ASUS claims support for creating port mappings using specific source IP addresses via UPnP.
SNB: More discussion below.
dyndns_10: this is a simple config mismatch. The ASUS is reporting a user agent string of ez-update-3.0.11b5 unknown  (by Angus Mackay). However, your config file does not match this - it is missing the square brackets. To fix this, just edit the testvar "dynDnsAgent" and set it to a value of "ez-update-3.0.11b5 unknown  (by Angus Mackay)".
SNB: The failure described above is accurate for the ASUS stock, Merlin and DD-WRT. But for Tomato, the test failed because the DynDNS client did not send a DynDNS update request.
dns_45 and dns_46: it looks like the ASUS does not fail over on certain error codes.
SNB: This error is common to all firmwares, but the failure codes varied. All firmwares had problems with error code 4. Tomato also failed on error code 8. DD-WRT was the worst, failing on codes 1, 3, 4, 6-15.
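To make the failover idea concrete, here is a rough Python sketch. It assumes the tests expect a DNS proxy to try its next configured server whenever the current one answers with the error code under test. The exact pass criteria are not spelled out above, so treat the function, its parameters and the fake servers as hypothetical.

```python
def resolve_with_failover(query, servers, failover_rcodes):
    """Try each upstream server in order and return the first acceptable reply.

    `servers` is a list of callables that take a query and return (rcode, answer).
    A reply whose rcode is in `failover_rcodes` triggers a move to the next server.
    If every server returns a failover rcode, the last reply is returned.
    Returns None when `servers` is empty.
    """
    last_reply = None
    for server in servers:
        rcode, answer = server(query)
        if rcode not in failover_rcodes:
            return rcode, answer
        last_reply = (rcode, answer)
    return last_reply


# Hypothetical upstream servers: the primary always answers with error code 4,
# the secondary answers successfully (rcode 0) with a documentation-range address.
def primary(query):
    return 4, None


def secondary(query):
    return 0, "192.0.2.1"


print(resolve_with_failover("host.example", [primary, secondary], failover_rcodes={4}))
```

With an empty `failover_rcodes` set, the same function returns the primary's error reply unchanged. That is the behavior the failing tests appear to have caught.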
Like the cdrouter_dhcp_server_700 test, the cdrouter_ipsecpt_110 test might also be considered a "feature". Or at least, a missing feature. The notes for this test indicate that the test is only supported by routers using SPI tracking features to build NAPT mappings for return IPSEC/ESP traffic. Some routers attempt to serialize the return connections and associate incoming SPIs with outgoing connections. While this does not work if packets are reordered or lost, some routers do use this technique to attempt to allow multiple LAN clients to use the same VPN server. Since this test failed for all firmwares, this technique is obviously not supported.
The firewall_2 failure is surprising and appears on both ASUS and Merlin firmware, but not Tomato or DD-WRT. The test sends a TCP SYN packet from a host on the WAN (184.108.40.206) to a client on the LAN (192.168.1.107). The packet has a source port of 10525 and destination port of 80. This packet was actually forwarded by the router to the LAN client, implying that port 80 for that client (at a minimum) is open to hosts on the WAN.
QA Cafe noted that a port scan probably won't find this issue, because the port scan targets the WAN IP of the router under test. The CDRouter test doesn't probe the WAN for open ports; it just assumes that port 80 is open to a LAN side IP address.
The likelihood of this causing a problem in real-life use is minimal, because private IP addresses are not routable on the public internet. So even though the RT-N66U does not drop packets addressed to private IPs on the WAN, all of the other devices on the public internet that the ASUS is connected to should. However, most devices have drop rules on their WAN interface(s) specifically to account for this, and the ASUS and Merlin firmwares don't. So this should be fixed.
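The kind of WAN-side drop rule referred to above can be sketched in a few lines of Python. This illustrates the logic only and is not how any of the tested firmwares implement their firewalls. The function name is hypothetical, and the LAN subnet is assumed from the 192.168.1.1 router address used in the test setup.

```python
import ipaddress

# Assumed LAN subnet, based on the router address used for testing.
LAN_NETWORK = ipaddress.ip_network("192.168.1.0/24")


def wan_inbound_verdict(dst_ip, lan_network=LAN_NETWORK):
    """Decide what to do with a packet that arrives on the WAN interface.

    Legitimate inbound traffic is addressed to the router's WAN IP and is then
    translated by NAT. A packet arriving on the WAN already addressed to a LAN
    host should not exist, so it is dropped.
    """
    if ipaddress.ip_address(dst_ip) in lan_network:
        return "drop"
    return "continue"  # hand off to the rest of the firewall and NAT rules


print(wan_inbound_verdict("192.168.1.107"))  # the LAN client targeted in firewall_2
print(wan_inbound_verdict("8.8.8.8"))        # an address outside the LAN subnet
```

A rule like this would reject the firewall_2 packet before it is ever forwarded, regardless of its destination port.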
The large number of SIP test failures mainly tells us that not all of the SIP functions that CDRouter tests are supported by the different firmwares. But DD-WRT's failure to pass any of the SIP tests (this is the MEGA version, which is supposed to include SIP support) caused me to go back to make sure that there wasn't a software switch that I hadn't thrown. I didn't find one, so I reran the SIP tests and they still didn't pass. I'm by no means a DD-WRT expert, so I'll be happy to retest if someone points out any error that I have made.
The UPnP failures have a story similar to the SIP failures, i.e. they mainly reflect differences in UPnP handling implementation. Tomato stands out with the fewest UPnP failures. But both Tomato and DD-WRT pass upnp_50 and upnp_202, which stock ASUS and Merlin firmware fail.
Of the two, upnp_50 is more interesting, since it tests that ports opened by UPnP are properly closed when their lease time expires. The test adds a UPnP port mapping with a lease time of 30 seconds, verifies that the mapping works, then waits 45 seconds and checks to see that the mapping is no longer there. The fact that the stock ASUS and Merlin firmware fail this test says that those firmwares don't pay proper attention to automatic UPnP port mappings.
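The behavior upnp_50 checks for can be modeled with a small Python sketch. This is a simplified simulation, not CDRouter's test code or any firmware's UPnP daemon. The class, its methods and the port number are hypothetical, and time is passed in explicitly so the example runs without actually waiting.

```python
class PortMappingTable:
    """Toy model of a UPnP port-mapping table that honors lease times."""

    def __init__(self):
        self._expiry_by_port = {}  # external port -> time (in seconds) the lease ends

    def add(self, external_port, lease_seconds, now):
        self._expiry_by_port[external_port] = now + lease_seconds

    def is_open(self, external_port, now):
        expiry = self._expiry_by_port.get(external_port)
        if expiry is None:
            return False
        if now >= expiry:
            # Lease has run out: remove the mapping so the port is closed again.
            del self._expiry_by_port[external_port]
            return False
        return True


table = PortMappingTable()
table.add(external_port=5000, lease_seconds=30, now=0)  # port 5000 is arbitrary

print(table.is_open(5000, now=1))   # shortly after the mapping is added
print(table.is_open(5000, now=45))  # after the 45-second wait used by the test
```

A firmware that fails upnp_50 behaves as if the expiry check were missing, so the second lookup would still report the port as open.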
Although it failed for all four firmwares, upnp_220 didn't fail the same way for all firmwares. This test checks that the maximum number of UPnP event subscriptions (10) can be created and then shut down. Stock ASUS and Merlin firmware were able to open ten subscriptions, but were only able to cancel subscriptions for the first three. Tomato was able to create all ten subscriptions and cancel eight of them. DD-WRT performed the best for UPnP, able to subscribe and unsubscribe all ten events. The test failed only because one UPnP NOTIFY event was not received within a 120-second window.
A big failure for DD-WRT is the renumber_2 test. This test checks that a TCP connection can be reestablished after a WAN DHCP renumbering. Failure of this test could indicate that a router could require a reboot to get going again after a WAN DHCP cycle that results in a new IP address.
The multiple scale test failures for Tomato indicate trouble with its LAN DHCP server. I set the DHCP server range as wide as possible (2-253), so you might want to avoid this distro if you are going to have > 200 DHCP clients.
Tomato is also unique in its dyndns_1 and dyndns_2 failures. These indicate basic failures in the dynamic DNS function, i.e. the ability to send an update request when the router's WAN IP address changes and the ability to not send a request when the WAN IP doesn't change. In the first case, Tomato made the update request, but sent the wrong WAN IP address. In the second case, Tomato tried to update DynDNS when it shouldn't have.
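The two behaviors these tests look for boil down to a simple decision, sketched below in Python. The function name is hypothetical and the addresses are documentation-range examples. A real dynamic DNS client also has to deal with authentication, retries and service-specific rules, none of which are modeled here.

```python
def dyndns_update_for(last_registered_ip, current_wan_ip):
    """Return the IP to send in an update request, or None if no update is needed."""
    if current_wan_ip == last_registered_ip:
        return None  # WAN IP unchanged: stay quiet
    return current_wan_ip  # WAN IP changed: report the new address, not the old one


# WAN IP changed, so an update carrying the new address is expected.
print(dyndns_update_for("198.51.100.10", "198.51.100.20"))

# WAN IP unchanged, so no update should be sent.
print(dyndns_update_for("198.51.100.10", "198.51.100.10"))
```

In these terms, the first Tomato failure corresponds to returning the wrong address in the changed case, and the second to returning an address in the unchanged case.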
Finally, the failures of the dns_70 and dns_120 tests in both Tomato and DD-WRT could be considered esoteric issues by some. But the fact that they failed in those firmwares while they did not in the stock ASUS and Merlin once again indicates less complete implementations.
The dns_60 failure unique to DD-WRT is a bigger deal, however. The test logfile showed that when the WAN link was dropped and reestablished, DD-WRT properly registered the new DNS servers. But when a LAN client sent a DNS request to the router's DNS proxy, it didn't respond within the allotted 5 seconds. If that were your LAN client, your browser would have hung or returned a "can't reach the internet" message. Not a good thing.
While I have focused on failures in this story, it's important to step back and look at the big picture. All firmwares passed tests for NAT, packet forwarding and DoS protection with flying colors. These are all things that are essential to router function. On the other hand, I didn't test port forwarding, WAN connections other than DHCP, DMZ, URL filtering, triggered port forwarding, multicast and a bunch of other stuff. What demons lie undiscovered there will have to wait for another time.
But while the number of actual failures (vs. "features") is relatively small compared to the 200 tests in the suite, the tests that did fail tend to show that this experiment was worthwhile.
As I was grinding through the analysis of the tests that failed, I sometimes thought "Who cares?" about some of the issues. But, in the end, that's not for me, or router makers, to decide. As I said at the top of this piece, there is a wide range of internet services out there. And often what is a "who cares?" issue for one ISP or user can be a critical issue to another.
What this simple experiment has shown is that the answer to the question "Does Alternative Firmware Break Your Router?" is yes, yes it can. But whether what it breaks makes a difference to you is something only you can decide.
So go ahead and thump your chest about how great your router is now that you've gotten rid of the horrible, crappy, [insert favorite pejorative term here] factory firmware and replaced it with your wonderful favorite alternative. But just remember that along with the zippy new features you got, you just might have gained some unwanted stowaways, too.
Special thanks go to QA Cafe for the loan of their NTA1000, their help in getting me up to speed on it and interpreting its test results and especially for their patience in how long it took me to get this article done!