Today was really fun. Ferdie, the local technical lead, had observed that registration of XOs succeeded at two schools and failed at two other schools. He also observed that “idmgr list-registration” displayed the serial numbers of XOs that had never been successfully registered.
I examined the Python code being executed at both ends of the registration process (documented here yesterday) and put a print statement into the code running on the XS end, just after the return from the attempt to register an XO. I verified that the XS was sending a response of SUCCESS back to the XO. Yet the XO reported failure during the same transaction.
My current hypothesis is that xmlrpclib (the XML remote procedure call library) opened a bare socket to communicate with the XS. If my guess is correct, there would be no TCP connection to keep open the path between the XS and the XO. Ferdie had been using the XS-AU distribution, which had DNS and DHCPD stripped out (in Australia, they were adding the XS to a pre-existing network which already had these services). As a workaround, Ferdie had configured the access point as a router, letting it do NAT and DHCP. My unproven guess is that the router keeps the socket open for a short period and then closes it if there is no response.
Once the connection is closed, any response from the XS arriving at the WAN port of the AP has no place to go. The NAT mechanism can no longer figure out which hardware address should receive the response packet from the XS. My initial thought was to turn on DHCP in the XS, turn off DHCP in the APs, and rewire the network so that the NAT/routing function is disabled. But there was no internet access at Catalaan, so it was very difficult to download the required DHCPD.
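To make the NAT guess concrete, here is a toy model in Python of a mapping table with an idle timeout. It is only a sketch of the idea, not how any particular access point works; the class name, the port number, the host name "xo-1", and the 30-second timeout are all made up for illustration.

```python
class NatTable:
    """Toy NAT mapping table with an idle timeout (all values hypothetical)."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.mappings = {}  # wan_port -> (lan_host, time of last outbound packet)

    def outbound(self, lan_host, wan_port, now):
        # An outgoing packet creates or refreshes the mapping.
        self.mappings[wan_port] = (lan_host, now)

    def inbound(self, wan_port, now):
        # A reply is forwarded only if a fresh mapping still exists.
        entry = self.mappings.get(wan_port)
        if entry is None:
            return None
        lan_host, last_seen = entry
        if now - last_seen > self.idle_timeout:
            # Mapping expired: the router no longer knows who asked.
            del self.mappings[wan_port]
            return None
        return lan_host


nat = NatTable(idle_timeout=30)
nat.outbound("xo-1", wan_port=40001, now=0)
fast_reply = nat.inbound(40001, now=5)    # reply arrives while the mapping is alive
slow_reply = nat.inbound(40001, now=60)   # reply arrives after the mapping expired
print(fast_reply, slow_reply)
```

In this model a prompt reply finds its way back to the XO, while a late one is dropped at the AP even though the XS did its part.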
So we abandoned XS-AU, installed Daniel Drake’s XS-0.7, and turned off DHCP in the APs. Everything worked, and registration was reliable.
Later, after a good night’s sleep: if the AP had acted strictly as a router, registration should have worked. So it must have been that we had not found the AP configuration setting that turns off NAT.
Still later, after installing XS-0.7 at another school, but before we got internet access going, we had similar failures (in the sense that the XS had the registration in its database, but the XO reported failure). When we got the long-distance modem configured, the failure messages went away. The new hypothesis is that named somehow gets called and delays the answer to the XO, which causes the XO to give up and declare failure.
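The symptom is easy to reproduce in miniature. The sketch below is not the real registration code: the register function, the serial number, and the delays are invented, and it uses Python 3's xmlrpc.client (the successor to xmlrpclib). The server records the laptop, then stalls (standing in for a slow name lookup) before answering, and the client gives up first.

```python
import socket
import threading
import time
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

registered = []  # stands in for the server's registration database


def register(serial):
    registered.append(serial)  # the server records the laptop first
    time.sleep(0.6)            # hypothetical stall, e.g. a slow name lookup
    return "SUCCESS"


class TimeoutTransport(xmlrpc.client.Transport):
    """Transport whose HTTP connection gives up after `timeout` seconds."""

    def __init__(self, timeout):
        super().__init__()
        self._timeout = timeout

    def make_connection(self, host):
        conn = super().make_connection(host)
        conn.timeout = self._timeout
        return conn


# Port 0 asks the OS for any free port on the loopback interface.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(register)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

client = xmlrpc.client.ServerProxy(
    "http://127.0.0.1:%d/" % port, transport=TimeoutTransport(0.2)
)
try:
    client_result = client.register("SERIAL-0001")
except socket.timeout:
    client_result = "FAILURE (timed out)"

server.shutdown()       # waits for the in-flight request to finish
server.server_close()
print("client saw:", client_result)
print("server recorded:", registered)
```

The client reports failure while the server's list already holds the serial number, which matches what we saw: the XS had the registration in its database, but the XO declared failure.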