#?.info

In order for the Fediverse ActivityPub-based microblogging network usually known as Mastodon to succeed as a general purpose social network that can challenge Threads, Bluesky, or X, I sincerely believe the following changes are needed, even if they may upset many people who would prefer a smaller network with zero discoverability for anti-abuse reasons. (Those people would, of course, still have the option of that, by posting on instances that don't adopt these improvements.)

1. Improved group communications

1.1 See all replies

Mastodon needs to recognize that replies to a comment constitute a group conversation around that comment. At the very minimum, clicking on a comment should show the part of the discussion it pertains to: all visible posts higher in the reply chain, if there are any (the direct parent, the parent of the parent, and so on, until the comment that started the thread is shown, to provide context), plus all replies to the comment itself that are visible.

(Visible means posts that are public, unlisted, or from people the viewer follows. Whether unlisted posts should be included in this definition is a matter for discussion; contextual unlisted posts certainly should be, but there's an argument that unlisted replies could be hidden under certain circumstances.)

Having 15 people all post the same reply to a comment helps nobody. Ultimately the only way, usually, to determine whether it's even worth replying is to open the comment on its author's server, which is clunky.
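
For what it's worth, the data needed to build this view is already exposed by the origin server through Mastodon's client API. A minimal sketch, assuming a public post on a public instance (the instance name and status ID here are made up):

    # Fetch the thread around a status: "ancestors" are the posts above it
    # in the reply chain, "descendants" are the replies below it.
    curl -s "https://mastodon.example/api/v1/statuses/109876543210/context" \
      | jq '{ancestors: [.ancestors[].url], descendants: [.descendants[].url]}'

The gap is in the UI and in federation (a remote instance only knows about the replies it has already seen), not in the API itself.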

1.2 Allow moderation of replies

This is a controversial suggestion, originally made by famed programmer and nightclub owner jwz: if you think a comment replying to your post is inappropriate, you should be able to remove it – at least from the view of those reading the thread itself. But the principle that you should have some control over who replies to your comments is already in Mastodon. Block someone, and they can't reply to your comments. Extending that to blocking specific replies does not pose any moral or free speech issues, and it allows everything from abuse to simple off-topic craziness to be moderated without having to get instance admins involved, or the author having to block anyone.

To cover the most common misconception: removing a reply does not imply removing the post itself from Mastodon, and someone proud of their removed post can always boost it in their own timeline. It would just be parentless and, perhaps, marked “Removed from original thread”.

1.3 Quote Posts

This is a feature permanently stuck in limbo because those opposing it are convinced it's disproportionately abused. But there's no evidence it is. Most of the time I see a quote post it's not to dunk on someone, but to post a comment related to content on someone's timeline – for example, boosting an AP News post to say how depressed you are about the news reported. There's also plenty of scope for opt-in/opt-out functionality to be associated with quote posts.

And honestly, saying “It could be abused” casts a wide net. Most of the arguments against being able to edit your own posts, for example, are of the form “It could be abused”, but it generally isn't. Abusers will find ways to abuse people; quote posts don't make a lot of difference in that respect.

Quote posts should only be allowed for posts marked public. Ideally the author of a post that's been quoted should have the right to remove it from quoted posts or block people from quoting it to begin with, to address the rare cases of abuse that exist.

2. Discoverability

Mastodon's architecture promotes instances as the key way to find and build communities around one another, but in reality this only works well if people are only interested in one or two specific topics. Most users want to find out what people outside their servers are interested in. Mastodon hasn't focussed on discoverability in part because of the technical difficulties, and in part for fear its original user base will be targeted for abuse.

2.1 Group communication fixes (see above)

Fixing group communication issues (see above) is one way to help with that. Over time people will start to read comments from those with similar interests naturally, as they follow the same threads.

2.2 “Relay lite” (in the Bluesky/AT Proto sense, not the Mastodon “relay” sense)

Bluesky has a “relay” concept: all posts are stored and archived on a single server, where they can be searched.

A straight duplication of this is overkill and has privacy issues. But it would certainly help if instances could participate in third party equivalents that store up to a week's worth of public posts (only public, not unlisted) and expose a simple interface, so that those servers can be searched without the user leaving their own instance. It might also help with concepts like “Trending topics”, something currently effectively missing from small Mastodon instances. Such servers could also be used to search for people, as long as their profiles are marked as searchable.

3. Adjustable firehose

Mastodon, rightly, eschews the “Algorithm”, the idea of rating posts from people the user isn't following and inserting them into their feed, often prioritizing third party posts over the user's actual interests.

That doesn't mean, however, that simple chronological order is the only option for presenting posts to users. Some accounts, especially automated accounts, post huge amounts of material (often repeated) compared to others, leading to low-volume posters being drowned out simply because they post less.

Recognizing that users often pause while doomscrolling, it might make sense to prioritize what they see when they come back.

Certainly the aim should be to ensure users feel it's worth checking Mastodon from time to time without bombarding them with the same stuff over and over again. We don't have to maximize “engagement”; however, we don't want users to miss posts they've explicitly expressed an interest in. A simple chronological view has flaws.

4. Responding to content on other instances

Mastodon makes it all too easy to end up off your instance when trying to read content from other instances, and once you do, interacting with that content becomes clumsy, typically involving searching for URIs on your own instance, or telling the remote instance which server your account lives on when trying to boost, like, or reply to a comment.

The suggestions in section 1 above will help, but Mastodon also needs to verify whether links in posts are to off-server posts or to non-ActivityPub content, and attempt to keep the user within the instance as much as possible. Links to off-server posts should be changed to internal links.

It's worth noting that not turning external post links into internal links isn't merely a clunky end user experience, it's also a privacy issue. If I'm blocked by the author of the post in question, and the post is on a different instance, then that block is pointless: it adds no friction at all when someone posts a link to the comment from off my instance. If Mastodon can verify a link is to a post by someone who has blocked me, it can hide the link completely, requiring me to jump through hoops to find it.
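
Mastodon's client API can already resolve a remote post's URL into a local copy, which is the building block this kind of link rewriting would need. A rough sketch, assuming you have an access token for your own instance (the instance names, token, and remote URL below are placeholders):

    # Ask your own instance to fetch and resolve a remote post's URL.
    # If it resolves, the response contains a local status object you can
    # link to (and interact with) without ever leaving your instance.
    curl -s -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
      "https://your.instance/api/v2/search?q=https://other.instance/@someone/112233445566&resolve=true&type=statuses"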

5. Spam/Abuse

Many people would argue that Mastodon does not have a spam or abuse problem. Spammers are typically taken care of quickly, and need to know what they're doing to reach a suitable level of discoverability. Abusers are typically dealt with by regular blocks, and instances that do not handle abuse well are often simply blocked by other instances. But this isn't very scalable, and it would be tough to block, say, mastodon.social if it started to tolerate abusive users.

Other federated networks that, despite federation, were even more centralized than Mastodon, were spam free too until they weren't. Usenet and email provide important lessons in getting ahead of the issue before it becomes a problem.

If Mastodon gains traction, it will be spammed.

This is the one area I don't have specific proposals for except generic handwavy stuff like “Shared blocklists”. But that right there is potentially the way to go, with independent groups sharing information about bad instances.

This is the direction email started to take, with disastrous results. However, Mastodon has several things going for it: a block list in Mastodon can take immediate and retroactive effect, so done quickly enough, it doesn't just prevent a spammer or abuser from abusing more victims, it also removes the content they already posted from the servers, so people who haven't seen it yet never will.

Right now there's very little in Mastodon that can help an instance administrator prevent their customers from being abused that doesn't require a significant amount of constant vigilance. It's not an issue yet. But it will become one.

I just compared AT and ActivityPub, and you can take a look at my thoughts in my prior post. But I thought it might be worth asking the question “How do we know a federated protocol can survive?”

Experience isn't good on this one. Since the Internet became popular, there have been multiple federated systems that were introduced, became popular, and then died for whatever reason. They include:

  • Usenet – was only ever semi-open. All but dead.
  • Email – Went from open to semi-open. Crippled.
  • IRC – All but dead.
  • XMPP – All but dead.
  • RSS – All but dead.
  • “The Web” – under threat and becoming more centralized.

In fact, other than email and the web, virtually every single once-popular federated system has ended up being killed.

So what killed each?

Usenet bulletin board services

Usenet first: Usenet was actually not federated enough for most people. It became easier to set up a BBS on a website than it was to set up a newsgroup moderated the way you wanted it. Sure, there was alt., but a lot of ISPs intentionally blocked some or all of alt. because it was considered an unmoderated wasteground. Larger ISPs also had problems keeping their NNTP servers running – at the time scalability was a new art in the Internet age that few people were familiar with. As if to add insult to injury, spammers had no problems – despite the semi-open nature of Usenet – posting crap all over newsgroups.

Usenet's federated successor is arguably Lemmy, but Lemmy is overshadowed by the proprietary website Reddit, which served as Lemmy's use model.

Conclusion: Usenet was too restricted, forcing people to use alternatives, and it was unable to cope with spam.

Email

Email didn't die but it is crippled. An unlikely but possible scenario right now is for Google and Microsoft to team up and close off all federation except to each other, in the name of killing spam. It'd work too! Google and Microsoft together own most email through their public email offerings, and services like Office 365. The major issue would be that some Microsoft customers still manage their own email – but that's slowly being migrated to Office 365 in the name of ease of management, and a concerted push by Microsoft would push everyone over the edge and force it to happen.

Like I said, unlikely, but only because regulators would take an interest. The question you should ask though is not “Why are you an idiot proposing Microsoft and Google would do this?” but “How did we get to a point where this is possible?”

Answer: Spammers and idiot responses to spam. Spammers overwhelmed email servers early on in the public Internet. System administrators started creating more and more crazy rules, from blocking IPs used by long gone spammers, to blocking ISP customers because supposedly we'd only ever want to send emails via the ISP relays. ISPs started blocking port 25, making it impractical to be part of the federated email system, and, one thing led to another, and now two companies “own” most email. That's not healthy.

Conclusion: Federated email is threatened by poor technical choices that made spam hard to manage, followed by further poor technical choices that led to a de facto centralization of email in order to deal with the first set. Spam is bad, but dealing with spam must be done carefully.

Internet Relay Chat (IRC) Real time group and personal messaging

IRC largely failed due to overloaded servers and a protocol that just wasn't scalable. It broke up early on into multiple networks, and by all accounts those networks have continued to split. Proprietary chat systems started to move into IRC's space, and people generally preferred them because they were less clunky. Today IRC's federated successor is supposed to be Matrix, though for a variety of reasons it hasn't taken off. Despite Matrix nodes storing a history of every single conversation that occurs on their part of the network, Matrix is considered far more stable and secure than IRC was. But it's far from clear it has the feature set people want from a chat system in 2025, and it's too slow.

IRC, like Usenet, was only semi-federated: servers in each network have to be approved by others in that same network. But this doesn't appear to have caused any damage here.

Conclusion: IRC was built upon poor technical choices and access to it was limited to clients that weren't always user friendly.

XMPP Instant Messaging

XMPP started as an attempt to define a non-proprietary standard for instant messaging. The system was initially successful with Google building Google Talk on XMPP, and some existing instant message services such as AOL's AIM and Yahoo's YIM creating XMPP gateways.

And then they decided they didn't need it. Each company closed off XMPP federation, and what little XMPP support was left merely allowed non-proprietary chat clients to connect to accounts. But you couldn't IM your friend on Google from AOL any more.

Why they disconnected from XMPP looks mostly to be down to misjudgements by the companies involved. Most thought they could “win” an “instant messenger war” by not federating, but that's not how it works, and instead that generation of instant messaging as a whole became less popular than IRC and Usenet, with services like Slack and Discord reinvigorating the concept years later. Another possibility is that XMPP was helping keep the concept alive, just not helping enough, with very few people actually communicating between different services.

Conclusion: An application that was probably ahead of its time, and it's unclear federation was ever easy to use, given how little use end users appear to have made of it even when it was available.

RSS Feeds

RSS was, actually, successful with a sizable amount of the blogosphere. The basic concept was every blog had a “feed” and you could use clients to subscribe to that feed. You could usually read the full blog entries via your RSS client, and if you couldn't, or you wanted to see if you could leave a comment, you could just visit the page directly on the publisher's website.

RSS feeds didn't stop being published. Instead, a more insidious thing happened. The most popular RSS reader on the Internet was a service called Google Reader. It worked from your web browser, was clean and easy to use, and everyone loved it.

Unfortunately blogs were considered competition for social media, and Google killed Google Reader in an attempt to shore up Google Plus, their exciting new social network service. Google actually did more than kill Reader: they broke virtually every part of their system trying to graft support for Google Plus onto it, which meant there was a certain amount of schadenfreude felt across the Internet when Google Plus died because nobody wanted it. Unfortunately killing Google Plus didn't kill Google's attitudes, and Google has never been the same since. And no, they didn't bring back Google Reader.

Conclusion: RSS “died” because almost everyone relied upon one proprietary client to use it, from a fair-weather friend.

“The Web”

For now, the web remains open despite constant efforts from ISPs to restrict what people can use their Internet connections for, and from mobile “apps” which attempt to replace websites. However, large corporate entities do have an oversized influence on the web right now.

Conclusion: None, yet.

Conclusions

I would suggest testing the strength of federation with the following measures:

  1. Is it genuinely open? Can anyone just join in? Or is it ultimately controlled by a small group of server administrators who won't allow people in without permission?
  2. Does it have the tools to manage and restrict abuse, especially spam?
  3. Are multiple clients available? Are they easy to find?
  4. Is it a mature concept? Do we know it's actually what users want, or is it a clone of something simple that was successful but nobody's quite sure if that's the right way of doing things?
  5. Are the majority of users using federation? Or could a node stop federating without most users being affected?

Using the conclusions

These are just notes at this point, points for you to think about. Looking in particular at ActivityPub and AT Protocol – let's call the ActivityPub-based microblogging network “Mastodon” and the AT Protocol-based one “Bluesky” for now, although both names are misleading:

Test 1: Openness

Mastodon is completely 100% open. There are no gatekeepers in terms of creating new nodes and adding users to those nodes.

Bluesky is de jure open but not de facto open. The critical problem is relays, which are expensive to manage and maintain, and are critical to ATP working. In fairness, as long as one entity runs an open ATP relay the entire network remains “open”. But as of right now only one company manages one (Bluesky). Additionally, those configuring relays currently decide which PDSes to pull from. Just because you stand up a PDS doesn't mean Bluesky's relay (even now, when it is “open”) will automatically start reading it, and there's no protocol as yet to advertise the existence of PDSes for relay operators to automatically use.

Test 2: Does it have the tools to manage and restrict abuse, especially spam?

Mastodon has several tools to help with spam, and mostly uses a “follow” model for normal content. However, it does allow messages to be sent to specific users. In theory, concerted attempts by spammers to send spam to specific users would succeed and make the network far less usable. There's a high risk administrators would start whitelisting nodes rather than blacklisting abusive nodes. Mastodon administrators arguably need to look for easier ways to federate abuse notifications and blocks.

Unlike email, once an administrator has been notified that an instance is sending spam, they have easy means to prevent that spam from reaching users who haven't read it yet. So the situation isn't as bad as email or Usenet, but it's still bad.

Bluesky does have the ability to deal with spam although there are ideological reasons why they might not do so in the beginning. Specifically Bluesky's relays are able to avoid PDSes known to send spam, and Bluesky's wider architecture (not just the AT Protocol specified part) allows AppViews to similarly run “algorithms” that, among other things, can filter spam. This is good, but it remains theoretical that it can work, and ideologically Bluesky's designers do not like filtering and blocking and moderation in general.

Test 3: Are multiple clients available? Are they easy to find?

Mastodon has a plethora of clients and servers available and access is not restricted in the slightest. Clients include web clients, mobile apps, and whatever is built into the server. The Mastodon implementation of, uh, Mastodon (heh) even has a published client protocol allowing third parties to easily make custom user interfaces for generic Mastodon servers.

Bluesky likewise has a plethora of App Views and PDSes available. As App Views and PDSes form the basis of the interface Bluesky users have to the Bluesky network, this covers the bases.

Test 4: Is this a mature concept?

Arguably the concept itself is. Twitter was created in 2006, and continued to grow and remained popular until its closure by Elon Musk about two years ago when it was replaced by the neo-nazi microblogging network X. It has spawned multiple successful clones.

Mastodon's implementation, however, is considered lacking. It lacks a global search, and tools to highlight posts and comment upon them (called “Retweet with comment” on Twitter), for example, have been intentionally avoided because of concerns about abuse and promoting dunking on people.

Bluesky has a fuller, richer, implementation of the basic concepts that fits what Twitter looked like two years ago. Some of these are arguably harmful, such as black-box algorithmic views vs chronological.

Bluesky is definitely an implementation of a mature concept. Mastodon's is close but not necessarily what people who want “Twitter but federated” want.

Test 5: Are the majority of users using federation?

In Mastodon's case, yes. I don't think there's a way to use Mastodon where you don't end up following at least one person on another server. Most users are following users on multiple servers. They're federating. Yay.

Right now almost all Bluesky users are using Bluesky (the corporation's) implementation – their servers, their clients, etc. So the chances are very few Bluesky users are actually federating at all.

Final conclusions

Mastodon's main weakness is in not providing the product people are looking for. It federates well and is well protected against corporate takeovers, but the lack of discoverability harms it. It also needs to address spam and abuse before it reaches a large enough size that spammers seriously target it.

Bluesky's main weakness is that almost its entire userbase is using infrastructure provided by just one company. Long term if that changes, it stands a very good chance of becoming strongly federated, but right now the federation is weak.

The biggest source of strife right now in the fediverse is not moderation, blocking policies, or anything like that, it's Bluesky starting up with a rival protocol to ActivityPub (the basis of the microblogging network often called Mastodon after its most popular software) called AT, ATproto, or Authenticated Transfer Protocol. Given the hyperpartisan nature of some of the comments about it, that often descend into outright lies, I figured I'd write down a brief summary of what's happening and then explain – as best as I can – the issues with both.

ActivityPub

ActivityPub is the protocol Mastodon is built upon. It's also the basis of Facebook-alternative Friendica, Instagram alternative Pixelfed, and YouTube alternative PeerTube. It's also used for non-microblogging applications such as Lemmy, a Reddit alternative, and as an alternative to RSS by regular blogging platforms like WordPress. I've even seen it used to manage comments on blogs.

ActivityPub has no centralization at all. Users post to an “instance” (to use the Mastodon term), and the instance then forwards notifications of new content to other instances that subscribe to those users' accounts. Other instances can also pull content from the source instances. Instances can be single user, multi-user, public, private, etc. There are minimal security modes allowing servers that have pulled messages to restrict who views them, though technically the modes are voluntary – once you pass a message to a third party, you're always at that third party's whim as far as privacy goes, and the administrator of an instance that hosts a subscriber's account is one of the people you pass that message to.
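
As an illustration of the pull side, Mastodon (like most ActivityPub servers) will hand over a machine-readable copy of a public post if you ask for it with the right content type. A rough sketch (the account and status URL are made up):

    # Fetch the ActivityPub (JSON-LD) representation of a public post.
    # Mastodon content-negotiates on the Accept header for status URLs.
    curl -s -H "Accept: application/activity+json" \
      "https://mastodon.example/@alice/112233445566"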

Mastodon's implementation has some methods for migrating accounts between instances though they're limited to ensuring accounts do not lose followers. Moving to another server means refollowing friends, and your own posts don't move from the older server. In every real sense the account itself does not move, in much the same way that if you move house, you can forward your mail, but you're not going to move the physical building.

ActivityPub is an open standard managed by the W3C that builds upon previous attempts in the same space, including OStatus and even RSS. It rose to prominence during the GamerGate controversy of the 2010s, as marginalized groups, particularly LGBT people, fled harassment on Twitter and sought safer spaces. Mastodon was built in that context, and many of the social norms in the larger parts of the Mastodon network reflect a desire to ensure the network remains a safe place. Mastodon itself customizes some protocol features, such as implementing a content-warning tag allowing posts to warn potential readers about, for example, descriptions of transphobia. (Trans people disproportionately have gender dysphoria, a condition characterized, among other things, by severe suicidal depression, so this is a feature desperately wanted in a place intended to be safe. Being told you're hated over and over again is likely to kill you if you have that condition.)

AT Protocol

The AT protocol is a newer protocol designed by Bluesky. Bluesky originally came out of Twitter as a response to the fall-out of Gamergate but in a rather different way. Jack Dorsey, Bluesky's founder, was concerned that Twitter was having to be heavily involved in moderation, and believed social media would be improved if that responsibility was taken away. To that end, he proposed a post-Twitter federated social media platform where such control was nearly impossible. While Bluesky, for the most part, fulfills that vision, Dorsey considers it a failure because Bluesky itself decided to introduce moderation anyway. Turns out trolls and people whose political views and behavior are indistinguishable from trolls actually damage social networks, who knew?

AT breaks up the core network into three types of component, all of which can be supported independently.

  • Personal Data Servers store the core information about a user – from their password to what they've posted. Despite the word “Personal”, most PDSes actually manage data for several users, the largest being Bluesky's own, which probably manages nearly 30 million users' data.
  • App Views are essentially the front ends to the network. They provide a view of the network for each user, and allow the user to make new posts which they send to the user's PDS. They can optionally support a variety of features that aren't standardized such as custom algorithms (ie what content gets presented to the user.)
  • Relays glue these components together, sending the data from the PDSes to the App Views. They crawl the PDSes and store everything they can get, providing that information via queries to App Views.

Private implementations have been made of both PDSes and App Views. At this stage, though, only one full Relay exists: Bluesky's own. A group called Free Our Feeds intends to change this and create a second, independently run relay. Relays require a huge amount of resources: they need to store all posts made by all 30M AT accounts, they need to keep themselves updated by querying PDSes, and they need to provide timely responses to queries made by App Views.

End users can migrate their accounts from one PDS to another without any interruption in service or any leaving behind of personal data.

The AT Protocol is intended to be an open standard but is currently managed by Bluesky. They've made noises suggesting they would like an independent foundation to take over the AT Protocol.

Bluesky itself is a benefit corporation – while it's for-profit it's not obliged to maximize shareholder value.

Comparison

Bluesky and Mastodon have different goals in mind, which significantly affects their approaches and leads to Mastodon considering ActivityPub adequate while Bluesky considered it inadequate.

Bluesky sees the network it creates as being essentially a clone of Twitter but without “censorship” – at least, without the ability to ban people from the service. This means it needs to have good discoverability (good search features, etc), and needs to off-load decisions about moderation to end users. To avoid deviating from these needs it wants the network to not be owned by it – though this reflects its original ideological purpose, not necessarily its current corporate structure. Good companies can become bad ones.

Mastodon sees itself as a social network, a network of people who want to talk to one another and would rather third parties stay out of it. Distributing their content across multiple servers provides resiliency and ensures that rules set by one instance manager are always possible to escape from, but that doesn't mean they want unfriendly people to easily find potential targets of harassment.

Many of Mastodon's limitations indeed reflect that concern about harassment. Discoverability is an oft-cited complaint about Mastodon, but it's also limited in terms of engagement. For context about how concerned Mastodon is about harassment: the “likes” (“starred”) feature, for example, isn't used for anything other than giving the author of a piece a heads-up that someone appreciated their work; counts are generally not displayed to anyone but the author. Boosts (the Mastodon equivalent of retweets) are OK, but only a limited number of servers in the Mastodon network support the equivalent of quote tweeting. Rightly or wrongly, Mastodon's governing establishment is concerned quote tweeting may be used to dunk on people and target them for harassment.

Preventing a single host from owning the network is an issue for both. Mastodon's defenses largely consist of keeping its main instances run by non-profits, encouraging personal server management, and a certain amount of community suspicion of large social network businesses joining the ActivityPub universe, such as Threads or Tumblr.

Bluesky doesn't have the same incentives as Mastodon to avoid a single party controlling the AT Protocol network, given that right now it would be that single party, but due to its original ideological founding basis it has nonetheless tried. By breaking up their system into three independent components, in theory anyone can create a front end to the AT Protocol network that works the way they would want it to work. And this has happened: there are third party front ends to the AT Protocol network currently being used. All, however, are dependent upon Bluesky's own relay, because the incentives to run a relay are low and the cost is extremely high and getting higher by the minute. Technically, a relay can be created that only indexes a subset of accounts, but doing so will break things if it's intended to serve general audiences.

Is Bluesky going to take over their network?

We've just endured a decade in which we've seen virtually every service that wasn't awful already (Facebook?) become awful, including:

  • Tumblr “banning porn” because it couldn't deal with its child porn issue
  • Twitter being bought by Elon Musk and replaced by X, a neo-nazi social network
  • Reddit banning third party clients, breaking APIs, and selling its content to an AI company.
  • Google Plus being introduced, Google breaking its entire system to support it and forcing real names everywhere, only to then shut down Google Plus, because of course.
  • Google's search engine permanently broken
  • Half-assed “AI” introduced everywhere presented as a solution for things it cannot do, and ultimately causing more problems than it would have solved even if it worked.
  • Ads being added to subscription-funded content.
  • Every single major tech company overtly supporting the neo-Nazi MAGA movement

Not to mention the numerous open source projects that suddenly stopped being open source.

So it's pretty difficult at this point to trust the corporate world that anything good right now will remain decent in the near future. And that means most criticisms of AT protocol revolve around the fact Bluesky is the dominant provider of AT protocol services, that it's corporate (albeit a benefit corporation), and that in theory it could, tomorrow, just restrict access to the PDS and relay under its control and as a result virtually remove all federation.

This would be difficult, but not impossible, to do in a clean way that doesn't majorly affect Bluesky's own users, but the experience of Reddit, X, et al is that simply abusing your own users doesn't lose them. Of course, Bluesky's user base is disproportionately made up of people who did exactly that with X: they left.

An issue rarely mentioned – though I've raised it in the past addressing concerns about Threads involvement in the Fediverse – is that there's not as big an incentive to close off federation as people think. Generally those concerned about it point at XMPP, the instant messaging system that once joined Google Talk, Microsoft Messenger, and AIM, together in a single network, as an example of where federation was offered and then mysteriously taken away. And that's a great point but where are any of those services today? Defederating didn't make Google or AOL more powerful, it killed interest in chat systems altogether.

In Bluesky's case, removing federation would immediately force many of their users to look for alternatives to Bluesky. And it wouldn't bring in any new users – those cut off by Bluesky's actions would be angry at Bluesky; they'd either stick with the remaining network, if it's still viable, or leave. There's also a good reason not to do it: currently, if someone just doesn't like Bluesky, they can leave but still remain in contact with their friends by moving to a compatible service provider that they do like. If, on the other hand, Bluesky defederates, and someone leaves for reasons unrelated to the defederation, they can't remain in touch, and their friends have good incentives to follow them.

That does not mean Bluesky wouldn't do it. The experience of the last ten years isn't just that corporations do not mind abusing their own customers, employees, and so on, if they think it'll make more money, but that they're comfortable tanking their own companies if their ideology prevents them from seeing that it'll cause mass defections.

All in all the fears of defederation by Bluesky are based mostly on the fact corporate America sucks right now, and you can't even predict that a company will not do something that's definitely going to harm it.

And to be fair, I can't argue against that. I can argue it's not in Bluesky's best interests, and I can argue the protocol itself makes it harder, which it does, but not impossible. But until we have multiple relays, and multiple large front ends to the AT Protocol network, it's going to be impossible to argue that Bluesky can't do it; the best we can argue is that it would be absolute suicide to do it.

(As an aside, the obvious thing, and something Bluesky can do because of its status as a public benefit corporation, would be to break itself up into three identical companies, each with a third of the user base and their own relay, plus a fourth non-profit foundation to oversee the protocol's development. If it really believes in what it's doing, this is the most obvious way in which it can reassure the community it's serious.)

Is Mastodon also susceptible to takeovers?

Mastodon's developers, also responsible for the mastodon.social instance, recently reorganized themselves as a non-profit, to make sure development going forward remains in line with Mastodon's goals. This doesn't mean things can't change but Mastodon's own developers at this point are aligned with the federation vision, as can be proven by the fact that Mastodon is fully federated.

The major threats that are oft cited are outsiders. Both Tumblr's (vaporware) announcement of ActivityPub federation, and Threads' actual federation, have been sources of controversy within the Mastodon community. Both would introduce a massive number of people to the Mastodon community, a group they could later try to pull back onto their own platforms.

The counter to this is that they're very unlikely to persuade many users to move from their existing platforms to their own before turning off federation, and just as many users are likely to use federation as an off-ramp to a friendlier environment if they don't like the bland corporate spoon-fed environment of, for example, Threads.

In the end, Mastodon's ability to be taken over requires a hostile actor to attract a huge proportion of the user base to its own instance – not just introducing new users to Mastodon, but taking users away from other instances. It seems improbable, but stranger things have happened.

Takeovers are not a good idea

I said this in the Bluesky section but I'll say it again. Defederating a network harms the company defederating it. XMPP didn't merely die when it was defederated, the platforms that implemented instant messaging using it died as a direct result.

Defederating does not mean you attract the users who were using other servers before you defederated. They don't like you. They hate you. They're not coming.

But the people on your service that were communicating with them? They're likely to leave too.

Again, I raise this not because this means it won't happen, but because if cooler, saner, heads prevail at a company that's considering this, they'll avoid doing it.

I live in Florida. Florida has bad weather. You may have seen it on the news. Our home lost Internet (we get it from Comcast) due to Hurricane Milton, which isn't the first time. I've been looking for a backup Internet solution for some years now, especially as I work from home. Finally, T-Mobile has stepped up with a $20/month thing called T-Mobile Home Internet Backup. It's capped at 130G a month, but is otherwise identical to their regular 5G Home Internet service.

So I'm trying it.

And... I'm in two minds.

Let's start with the positives: it's fast. Very fast. I get around 400Mbps down, and 40Mbps up. That's close to Comcast for the downloads, and double (!) for the uploads. Latency is similarly comparable.

The main negative is that it's crude, uncustomizable, and isn't user friendly for its intended purpose.

The basic system is supplied as a combined 5G modem and router with no options. No, really. You can set the “admin” password (but can't use it for anything), and choose the SSID and password of the Wifi network it creates. But you cannot:

  • Turn off the Wifi (there are two Ethernet ports so it's not like Wifi is necessary)
  • Turn off the DHCP server
  • Set up reserved IPs
  • Port forward (well, it's probably CGNAT anyway)

If you plug in your own router, it will do nothing to accommodate you. It won't turn off the Wifi. It won't give your router any IPv6 features (DHCP-PD is not provided, I'm not sure any IPv6 functionality is provided even if you connect directly.) You'll certainly have issues with certain types of application.

The unit does have undocumented features. You can download an app called HINT Control which will allow you to turn off the Wifi or restrict it to a single frequency band. If you're feeling more adventurous there's a web app you can install, if you have the environment to run it, that'll make it feel a little more like a regular router, called KVD Admin. The options, though, are the same as the above app. Advanced features such as port forwarding or IPv6 support are out of the question, however.

It's marketed as a backup but it doesn't act like one.

I am a nerd, so I know my needs are more substantial than Joe Smartphone, but it's not clear to me that even users with simpler needs will find it anything but a jarring experience when used as a backup.

The ideal implementation for something like this would sit between someone's router and their home network. Unfortunately that's not generally practical: most people use a combined Wifi/router/gateway, which means you can't just slip something in that reroutes packets when the main Internet is down.

So T-Mobile's solution to this is essentially to give up. Instead of putting any effort at all into allowing end users to integrate the T-Mobile system into their existing home network, the apparent assumption is that you'll simply go through every single device you have – your laptops, your tablets, your smart TV, or any other “smart” devices (heaven help us) you rely on, your Alexa hub, your security camera, etc – and reconfigure them to use the T-Mobile wireless access point. And when the main Internet comes back up, you'll somehow notice and reverse that process, reconfiguring every one of your devices back again.

How it should work

An end user, whether nerd me or Joe Sixpack, actually wants this to act more like a switch, where flipping one way means our Wifi router is routing Internet access via the regular route, and flipping the other way routes it via T-Mobile instead.

In an age when every ISP insists on sending customers preconfigured all-in-one boxes with limited customizability, it's tough to offer that in a user friendly way. But at the same time, you feel T-Mobile could have at least tried by providing a solution that meets the majority of configurations. One obvious way would be for T-Mobile's Wifi to be configurable as a pass through. People would connect to it by default, but instead of providing DHCP and routing to the Internet, it would normally pass packets on to the customer's gateway. When the switch is flipped however, it would intercept packets for the gateway and route them via T-Mobile instead.

And obviously, if you're like me, and want to use your own Wifi router, you could plug your router (and the rest of your network) into the T-Mobile box.

How I configured it

For now, my options are limited. I put an old router in front of the T-Mobile box to prevent its DHCP server from touching my network directly, and my ISC DHCP server (I don't use my router's) can be configured, if needed, to change the default gateway to that router should I lose access to the Internet via Comcast. I can also do it manually, or on a per-machine basis.
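
A rough sketch of what that looks like in dhcpd.conf (the addresses are made up, with 192.168.1.1 standing in for the Comcast gateway and 192.168.1.2 for the old router in front of the T-Mobile box):

    # /etc/dhcp/dhcpd.conf (excerpt)
    subnet 192.168.1.0 netmask 255.255.255.0 {
      range 192.168.1.100 192.168.1.200;
      option domain-name-servers 192.168.1.1;
      # Normal operation: default route via the Comcast gateway.
      option routers 192.168.1.1;
      # During an outage, swap the line above for the router fronting
      # the T-Mobile box, then restart dhcpd:
      # option routers 192.168.1.2;
    }

Then restart dhcpd (on Debian, the isc-dhcp-server service) and force clients to renew their leases, which is exactly the clunky step described below.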

This is, to be honest, not ideal. It means every device has to be forced to renew its DHCP lease when there's an Internet issue. Some devices do this easily; I can get my laptop to pick up the new settings just by reconnecting it to the Wifi.

A possible option would be to have an intermediary act as the default router, but I haven't yet figured out the best approach to doing that. It seems extremely inefficient.

Final thoughts

The system continues the industry's desire to control how we use the Internet and ensure it fits into the bizarrely limited world view of those who market ISP services. Which is a shame because the same desire for a “simple Internet” that “just works” also cripples T-Mobile's intended customer base for this specific service. You can't beat the price though, and if you're prepared to duct tape a bunch of kludges together, you can make it work.

I made a strange decision I didn't think I would when building two new servers. Instead of arranging the SATA SSDs I bought as a raidset, I just set them up individually. Different VMs are on different drives. There's a reason for it, and I'm not entirely sure it was the right decision, but I didn't get any feedback suggesting I was fundamentally wrong.

When are you supposed to use RAID?

“RAID is not a back-up”

It's one of those glib phrases that's used in discussions about RAID when people are asking whether they should use it. It's also, very obviously, wrong (except in the case of RAID 0, of course.) RAID is primarily about preventing data loss, which it does by duplicating data. That's literally a back-up. And for many people, that's the same improvement they'd get from copying all their data to another drive on a regular basis, except that RAID results in less data loss than that strategy because it's automatic and immediate.

There are situations RAID doesn't handle. But there are also situations “copying your disk” doesn't handle either. Both react badly to the entire building catching fire, unless you're in the habit of mailing your back-up disks to Siberia each time you make a back-up, or are ploughing through your monthly 1TB Comcast quota shoving the data, expensively, into the cloud. On the other hand, RAID is pretty bad at recovering from someone typing 'rm -rf /', though that happens less frequently than people think it does. And arguably, a good file system lets you recover from that too. And a good back-up system should somehow recognize when you intentionally remove a file.

“RAID is about uptime and high availability”

...which RAID does by automatically backing everything up in real time. Also that's 100% true of RAID on an enterprise grade server. But if you shoved five disks into your whitebox PC (keyword: into) and configured them as a RAID 6 set (because someone told you RAID 5 is bad, we'll get to that in a moment) then it's more about predictable up time in the sense you'll be able to say “Well, I'm going to need to turn off the computer to fix this bad disk but I can wait until things are quiet before doing it.”

Anyway it's kind of about uptime and, more importantly, high availability. And it's actually high availability that's the thing you're probably after. And RAID only has a limited impact on high availability generally.

“RAID makes some things faster, and other things slower”

This bit's awkward because while it's been technically true in the past that RAID's spreading of data across multiple disks has helped with speed, assuming a fast controller that can talk to multiple disks at once, it's always been a six of one, half a dozen of the other, kind of situation. You probably read files more than you write them, and most RAID configurations speed up the former at the expense of the latter, but over all it's unclear what kind of advantage you'd gain from using it. RAID 1 or “RAID 10” were once probably best if you're just looking for ways to speed up a slow server. But technologies change. We have SSDs now. It is very unlikely RAIDing SSDs makes any difference to read speeds whatsoever.

Conclusion: RAID is ultimately a way to make one component of your system, albeit a fairly important one, more reliable and less prone to data loss. But it's considered inadequate even when it works, and requires augmentation by other systems to prevent data loss.

How are technology changes affecting RAID?

Disk sizes, error rates, and matrixing

Until the early 2010s, the major technology change concerning storage was the exponential increase in capacity regular hard disks were seeing. Around 2010 or so, experts started to warn computer users that continuing to use RAID 5 might be a problem, because RAID controllers would mark entire disks as bad if they saw as little as one unrecoverable failure. Disk capacities had grown faster than their reliability per bit, so the chances of there being a bad sector on your disk grew with its capacity.

The scenario RAID 5 opponents worried over was essentially that one disk would fail, and then when recovering the data to add to another disk, the RAID controller would notice a problem with one or other of the other drives. This might make the data in question unrecoverable assuming the original failed drive was completely offline, which it might be if the entire disk failed, or in some silly circumstances where the RAID controller wants to be snotty about disks it found errors on, or if there are only three hotswap bays. Regardless, most (all?) RAID controllers will refuse to continue at that point because there's no way to recover without losing data, possibly a lot of data because of the matrixing algorithms used in RAID 5.

RAID 6 requires two redundant drives instead of one, and so, in 2010, was considered better than RAID 5. But recently we've been hearing similar concerns expressed about RAID 6. At the time of writing there is no standard RAID 7, with three redundant drives, but several unofficial implementations. A potential option is just to use RAID 1 with more than two drives, with a controller that can tolerate occasional, non-overlapping, errors.
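
For Linux software RAID, that last option looks something like the sketch below (the device names are illustrative; mdadm happily builds a three-way mirror):

    # Three-way RAID 1 mirror: any single drive (and its bad sectors) can be
    # lost while two full copies of the data remain.
    mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
    mkfs.ext4 /dev/md0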

These concerns are rarely expressed for RAID 1/10, but these have the same potential issue. A disk fails, you replace it, the controller tries to replicate the drive and finds a bad sector. But in theory, less data would be lost under that scenario because of the lack of matrixing.

SMR/Shingled drives

Post-2010, the major innovation in hard disk technologies is a technology called “shingling”. As disk capacities increase, the size of magnetic tracks on those disks inevitably decreases, until it becomes difficult to even make robust read/write heads capable of writing tracks small enough. Shingling is the great hope: instead of making the heads smaller, you keep them the same size, but you write and rewrite several tracks in succession as a single operation, writing each track so it overlaps the track next to it. This means your big ass head might be wide enough to write three tracks at once, but you can still get multiple tracks in the same space.

For example, suppose you group seven tracks together; you can write them like this:

First pass:

111??????

Second pass:

1222?????

Third (and subsequent) passes:

12333????
123444???
1234555??
12345666?
123456777

Now if you hadn't used shingling, then in that same space you'd have only fit three tracks. Compare:

123456777   (shingled: seven tracks)
111222333   (unshingled: three tracks in the same space)

But, due to magic, you have seven! Hooray!

Let's not kid ourselves, this is a really good, smart, idea. It almost certainly improves reliability (you don't want to shrink disk heads too much or massively increase the number of heads and platters – the latter adds cost as well as increasing the number of things that can fail.) But it does come at the expense of write speeds. Because every time you write to a drive like this, the drive has to (in the above case) read 7 tracks into memory, patch your changes into that image, and then rewrite the entire thing. It can't just locate the track and sector and write that one sector, because doing so would overwrite data on the two neighbouring, overlapped tracks.

A less controversial technology that has a similar, albeit nowhere near as bad, effect is Advanced Format, often known as 4K. This increases the drive's native sector size from 512 bytes to 4K, essentially eliminating seven inter-sector gaps per 4K block, space which can be used to store data instead. Because operating systems have assumed 512 byte sectors since the mid-1980s, most drives emulate 512 byte sectors by... reading an entire 4K sector into memory, patching it with the 512 bytes just written, and writing it out as needed. This is less of a problem than shingling because (1) operating systems can just natively use 4K sectors if they're available and (2) usually consecutive sectors contain consecutive data, so normally, during a copy, the drive can wait a while before updating a sector and will see an entire 4K sector's worth of data given to it, so it doesn't have to read and patch anything.

Shingled drives however don't have anything that would make their approach more efficient. They end up being very slow, and during a RAID recovery process, many RAID controllers will simply give up and assume a fault with a drive if it's shingled.

Shingled drives are rapidly becoming default for hard disks. Despite needing more on-board memory they end up being cheaper per terabyte, for obvious reasons.

Solid State Drives (SSD)

Probably the single biggest innovation that's occurred in the last 20 years has been the transition to SSDs as the primary storage medium. SSDs have very few disadvantages compared to HDDs. They're faster (so fast that the best way to use an SSD is to wire it directly to the computer's PCIe bus, which is essentially what “M.2” NVMe drives do), they use less power, and they're far more reliable. Their negatives are that they're more expensive per terabyte (though costs are coming down, and are comparable with 2.5” HDDs at the time of writing) and they have a “write limit”. The write limits are becoming less and less of a problem, though much of that is because of operating system vendors being more sensitive to how SSDs should be written to.

SSDs are RAIDable but there's a price. RAID usually astronomically increases the number of writes. Naively you'd expect at least a doubling of the number of writes, purely because of the need for redundancy. This isn't an issue with RAID 1 as you have double the disks, but for RAID 5 you have at most 1.5x as many disks, and RAID 6 isn't much better.

But, aside from RAID 1, it isn't just “double” the writes. Each write in a RAID 5 or RAID 6 environment also requires updating parity data: a small write typically means reading the old data block and the old parity block, then writing both the new data and the new parity (and RAID 6 keeps two parity blocks). Some of that can be buffered and done as a single operation, but it still substantially increases the number of writes per underlying write.

For HDDs this isn't much of a problem. HDD life is mostly affected by the general environment; drives will literally live for decades if cared for properly, no matter how much you write to them. But for SSDs each write is a reduction in the SSD's lifetime. And given most RAID users would initially start with a similar set of SSDs, that also increases the chance of more than one SSD failing in a short space of time, reducing the chance of recovering a RAID set if one drive fails.

Ironically, SSDs should be easier to recover from than HDDs in this set of circumstances because a typical SSD, upon noticing its write count is almost up, will fall back to a read-only mode. So your data will be there, it's just your RAID controller may or may not be able to recover it because it was built for an earlier time.

Preempting a predicted reply about the above: many argue SSDs don't have write limit problems in practice because manufacturers are getting better at increasing write limits and they haven't had a problem in the whole five years they had one in their PC. Leaving aside issues with anecdotal data, remember that most operating systems have been modified to be more efficient and avoid unnecessary writes, and remember that RAID is inherently not efficient and makes many redundant writes (that's the 'R' in RAID). The performance of an SSD in a RAID 5 or RAID 6 set will not be similar to its performance in your desktop PC.

Alternative approaches to high availability

A common misunderstanding about high availability is that it's a component level concern, rather than an application level concern. But ultimately as a user you're only interested in the following:

  • The application needs to be up when I want to use it
  • I don't want to lose my work when using it.

RAID only secures one part of your application stack, in that it makes it less likely your application will fail due to a disk problem. But in reality, any part of the computer the application runs on could fail. It might become unpowered due to a UPS failure or a prolonged power outage, or because the PSU develops a fault. The system's CPU might overheat and burn out, perhaps because of a fan failure. While enterprise grade servers contain some mitigations to reduce the chance of these types of failure happening, the reality is that there are many failure modes, and RAID is designed to deal with just one of them, and increasingly it's bad at it.

But let's look at the structure of an average application: It serves web pages. All those web pages are generated from content stored in a database. For images and other semi-static assets, you can put them in a Minio bucket. You'd have nginx in front of all of this doing reverse proxying.

For the application and nginx, you are looking at virtually no file system modifications except when deploying updates. So you could easily create two nginx servers and two application servers (assuming the application server tolerates the idea of multiple servers; most will, and are designed like that to begin with). So you don't actually need RAID for either server: if one suffers a disk crash, already unlikely, you can just switch over to the back-up server and clone a new one while you wait.
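
As a rough illustration of the “switch over” part, assuming nginx is the reverse proxy (the IP addresses and port are placeholders), a backup upstream can even make the failover automatic:

    # nginx.conf (excerpt): send traffic to the primary application server,
    # fall back to the clone only when the primary is unreachable.
    upstream app {
        server 10.0.0.11:8080;
        server 10.0.0.12:8080 backup;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://app;
        }
    }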

For the database, both MySQL/MariaDB, and PostgreSQL, the two major open source databases, support replication. This means you can stand up a second server of the same type, and with a suitable configuration the two will automatically sync with one another. Should one server suffer a horrific disk crash, you can switch over to the back up, and stand up a new server to replace the back-up.
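
For the curious, a bare-bones sketch of what that looks like for MariaDB (hostnames, credentials, and the log file/position are placeholders; PostgreSQL's streaming replication is configured differently but achieves the same thing):

    # On the primary, in my.cnf: give it an ID and enable the binary log.
    #   [mysqld]
    #   server_id = 1
    #   log_bin   = /var/log/mysql/mysql-bin.log
    #
    # On the replica (server_id = 2), point it at the primary and start it:
    CHANGE MASTER TO
      MASTER_HOST='10.0.0.21',
      MASTER_USER='repl',
      MASTER_PASSWORD='replica-password',
      MASTER_LOG_FILE='mysql-bin.000001',
      MASTER_LOG_POS=4;
    START SLAVE;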

For your assets server, you can also do replication just as you can with the database.

One huge advantage of this approach over trying to make your server hardware bullet proof is that you don't have to have everything in the same place. Your Minio server can be a $1/month VPS somewhere. Your PostgreSQL replication server can be two servers on different PCs in your house. Your clones of your nginx and application servers can be on Azure and Amazon respectively. You could even have your primary servers running from unRAIDed SSDs, and secondary servers on a big slow box running classic HDDs in a RAIDed configuration.

This is how things should work. But there's a problem: some applications just don't play well in this environment. There are very few email servers, for example, that are happy storing emails in databases or blob storage. Wordpress needs a lot of work to get it to support blob storage for media assets, though it's not impossible to configure it to do so if you know what you're doing.

But if you do this properly, RAID isn't just unnecessary, it actively complicates your system and makes it more expensive.

Final thoughts

This is a summary of the above. You can CTRL-F to find the justification for each point if you skimmed the above and do not understand why I would come to the conclusions I have come to, but I came to them, so suck it up:

  1. RAID is viable right now, but within the next 5-10 years it seems likely to become mostly obsolete, and a danger to anyone who blindly uses it rather than limiting themselves to those specific instances where it's necessary and favoring large numbers of small drives in a RAID 1 configuration for those instances.
  2. RAID is still useful if you're using it to create redundant high availability storage from relatively small disks. I've seen numbers as low as 2TB quoted as the sensible ceiling for RAID 5, but I'm sure you can go a little bigger than that.
  3. Other than RAID 1, RAID with SSDs is probably an unwise approach.
  4. Going forward, you should be choosing applications that rarely access the file system, preferring instead to use storage servers like databases, blob storage APIs, etc, that in turn support replication. If you have to run some legacy application that doesn't understand the need to do this, then use RAID, but limit what you use it for. (If it helps, a Raspberry Pi 5 with 4GB of RAM and a suitably large amount of storage can probably manage MariaDB, PostgreSQL, and Minio replication servers. None of these servers are heavy on RAM.)
  5. Moving away from RAID, forced or not, means you must, again, look at your back-up strategy. From a home user's point of view: consider adding a hotswap drive slot to your server and an automated process that mounts that drive, copies everything important to it, and unmounts it, every night (a minimal sketch follows this list). You probably have a ton of unused SATA HDDs lying around anyway, right?
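
Something like the following is the kind of nightly job I mean, run from cron on the server. The drive label, mount point, and source path are placeholders, and the rsync flags are a matter of taste:

#!/bin/bash
# Nightly copy to a hotswap back-up drive; bail out if any step fails.
set -e
mount /dev/disk/by-label/BACKUP /mnt/backup
rsync -aHAX --delete /srv/important/ /mnt/backup/important/
umount /mnt/backup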

These instructions refer to Xen 4.17 with a Debian “Bookworm” 12 dom0 and the bundled version of PyGrub. It's entirely possible that by the time you read this PyGrub will have gained btrfs support. But if you tried it and got errors, read on!

The problem

While some of Xen, notably xen-create-image, kinda sorta supports btrfs, that's not true of PyGrub, the Xen bootloader for PVs and PVHs. This means you're left with fairly ugly options if you want to use btrfs with Xen, including resorting to HVMs. And if you have to use HVMs then why use Xen at all? There are virtualization platforms with much better support than Xen that do things the HVM way.

I said “kinda sorta” for xen-create-image because when I created a Debian 12 image with it (--dist bookworm) it was unbootable (well, it booted read only) because the fstab contained a non-btrfs option in the entry for “/”. We'll get to that in a moment.

So, before you begin, I must warn you that the solution I'm about to propose, while modifiable for whatever the devil it is you plan to do, requires eschewing xen-create-image for the heavy lifting and doing all the steps it does manually. But to make things easier, we'll at least create a template using xen-create-image.

So what I want you to do first is create an image somewhere, using EXT4 as the file system, using xen-create-image, with the same name as the domU you intend to create. Go into it, kick the tires, make sure it works and has network connectivity, make any modifications you need to get it into that state, and then come back here. Oh, if it's a bookworm image, be prepared to fix networking: /etc/network/interfaces assumes an 'eth0' device. The easiest fix is to change all references to it to 'enX0' (assuming that's the Xen Ethernet device you booted with.)
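
If you want the lazy version of that fix, something like this from a root shell inside the guest usually does it; check what the device is actually called first, it may not be enX0 on your set-up:

# ip link
# sed -i 's/eth0/enX0/g' /etc/network/interfaces
# ifup enX0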

Other notes:

  • We're essentially doing a set of commands that all require root. Rather than sudo everything, I'm assuming you 'sudo -s' to get a root shell. It's much easier.
  • I'm assuming the use of volume groups, specifically using a group called 'vg-example', and am using 'testbox' as the hostname. The general flow should translate to other storage backends, but will require different commands. You can change everything as needed as you go, testbox to whatever you call the machine, etc.

Back up the image

OK, shutdown your VM, and then mount the image somewhere. Typically you'll see lines in your /etc/xen/testbox.cfg that read something like:

disk        = [
                  'phy:/dev/vg-example/testbox-disk,xvda2,w',
              ]

There may be two lines if you created a swap partition. Take the path for the root file system anyway and mount it somewhere, say, /mnt. (If you mount somewhere else, change /mnt to your mount point accordingly when following these instructions. Likewise vg-example and testbox-disk should be changed to whatever you see in the disk line. If you're not using volume groups, adjust accordingly; you can mount disk images using the loopback device, eg mount -o loop /path/to/disk.img /mnt.):

# mount /dev/vg-example/testbox-disk /mnt

Now do the following:

Edit /mnt/etc/fstab and change the line that mounts root to assume btrfs and put in some good btrfs options, for example:

/dev/xvda2 / btrfs defaults,noatime,compress=lzo 0 1

cd to /mnt/boot, and type this:

# ln -s . boot

That bit of magic will help pygrub find everything: ultimately we're going to create a separate boot partition, and pygrub will mount it and think it's root, so it'll get confused that there's no “boot/grub/grub.cfg” file unless you put that softlink there.

Finally, cd to /mnt and type:

# tar zcf /root/testbox-image.tgz -S .

(The -S is something I type out of habit; it just makes sure sparse files are archived and extracted properly. If you're paranoid you might want to look up tar's other options, such as those handling extended attributes, or use a different archiver like cpio. But the above is working for me.)
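
If you do want extended attributes and ACLs preserved, GNU tar can do it. Something along these lines should work for the archive step and the later extraction, though I haven't needed it for a stock Debian image, so treat it as a starting point:

# tar zcpf /root/testbox-image.tgz -S --xattrs --acls --numeric-owner .

and later, when unpacking:

# tar zxpf /root/testbox-image.tgz --xattrs --acls --numeric-owner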

Finally, umount the image, and then delete it. eg:

# umount /mnt
# lvremove vg-example/testbox-disk

(If you're not using volume groups, skip the lvremove: just use rm if it's a disk image, or some other method if it's not a disk.)

Creating the first domU

OK, so we have a generic image we can use for both this specific domU and new domUs in future (if you plan to create a whole bunch of Debian 12 images, there's no need to duplicate it. I'll explain later how to do this.)

First, let's create the two partitions we need, for root and boot. I'm going to assume 'vg-example' is the volume group for this, but nothing requires you to use this group, or even volume groups at all. If you're not using volume groups, do the equivalent with whatever system you have. You can create disk images using dd, eg # dd if=/dev/zero of=image.img iflag=fullblock bs=1M count=100 && sync for a 100M file, and use losetup (eg # losetup /dev/loop1 image.img) to create a virtual device so you can mkfs it and mount it (a fuller sketch follows).
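
Spelled out, the disk image alternative looks something like this; the path and size are examples I've picked for illustration, so adjust to taste (truncate -s can create a sparse file if you'd rather not write all those zeros):

# dd if=/dev/zero of=/var/lib/xen/images/testbox-root.img iflag=fullblock bs=1M count=102400 && sync
# losetup /dev/loop1 /var/lib/xen/images/testbox-root.img
# mkfs.btrfs /dev/loop1
# mount /dev/loop1 /mnt

...and when you're done:

# umount /mnt
# losetup -d /dev/loop1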

# lvcreate -L 512M -n testbox-boot vg-example
# lvcreate -L 512G -n testbox-root vg-example

This creates our two new partitions, root and boot. Note they don't have to be part of the same volume group. You could even, probably, make one a disk image file and the other a volume group partition.

Now create the file systems on both:

# mkfs.btrfs /dev/vg-example/testbox-root
# mkfs.ext4 /dev/vg-example/testbox-boot

Now mount the main root:

# mount /dev/vg-example/testbox-root /mnt

Create the mount point for boot inside the root (this is important):

# mkdir /mnt/boot

Mount the boot partition

# mount /dev/vg-example/testbox-boot /mnt/boot

Finally create the image:

# cd /mnt
# tar zxf /root/testbox-image.tgz

After you're done (if you need to do anything at all), just umount boot and root in that order:

# cd ; umount /mnt/boot ; umount /mnt

Finally, modify your .cfg file, so load it into your favorite editor.

# vi /etc/xen/testbox.cfg

Modify the disk = [ ] section to look more like this:

disk        = [
                  'phy:/dev/vg-example/testbox-boot,xvda3,w',
                  'phy:/dev/vg-example/testbox-root,xvda2,w'
              ]

If your original had a swap partition, leave the entry there.

OK, moment of truth: boot using

# xl create /etc/xen/testbox.cfg -c

It should come up with the Grub menu. And then it should boot into your VM. And when you log into your VM, everything should be working as it was before you made your modifications.

Creating clones from the original archive

This is easy too. Keep that archive around. For this guide we'll keep most things the same but call the new domU you're creating clonebox.

Before you begin, copy across /etc/xen/testbox.cfg to /etc/xen/clonebox.cfg. You can edit it as you go, but at least begin by changing all references of testbox to clonebox, and change the MAC address. You can easily get one from https://dnschecker.org/mac-address-generator.php: use 00163E as the prefix (I have no connection with the makers of that tool, it seems to work OK, just want to save you a search engine result.) Finally allocate a new IP address, assuming you're not using DHCP.
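
If you'd rather not use a website at all, the following should spit out a suitable random MAC in the Xen-reserved 00:16:3E range from a bash root shell:

# printf '00:16:3E:%02X:%02X:%02X\n' $((RANDOM%256)) $((RANDOM%256)) $((RANDOM%256))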

Now, create the two partitions we need, for root and boot. You can create a swap partition too if your original had one.

As earlier, I'm going to assume 'vg-example' is the volume group and we're using volume groups.

# lvcreate -L 512M -n clonebox-boot vg-example
# lvcreate -L 512G -n clonebox-root vg-example
# mkfs.btrfs /dev/vg-example/clonebox-root
# mkfs.ext4 /dev/vg-example/clonebox-boot

(Again, adjust the -L for the root volume to whatever you need it to be. A useful alternative to know is “-l '100%FREE'” – yes, lowercase L – which gives the remainder of the volume group to that logical volume.)
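
For example, if you want clonebox-root to simply take whatever space is left in the volume group, something like:

# lvcreate -l '100%FREE' -n clonebox-root vg-example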

If you created partitions whose device names do not match what's in /etc/xen/clonebox.cfg, modify the disk = [ ] section in that file to use the new device paths (eg /dev/vg-example2/clonebox-root if you used a different volume group).

Mount it as before, adding the boot mountpoint, and unarchive the original archive:

# mount /dev/vg-example/clonebox-root /mnt
# mkdir /mnt/boot
# mount /dev/vg-example/clonebox-boot /mnt/boot
# cd /mnt
# tar zxf /root/testbox-image.tgz

Now before you unmount things, you'll need to adjust the image.

The default Debian image is borked and creates an /etc/network/interfaces file with the wrong Ethernet device. When you were fixing it earlier, you either fixed /etc/network/interfaces, or you created a file called /etc/systemd/network/10-eth0.link that maps eth0 to the device with the right MAC address. The former is probably easier, but if you did the latter, update /mnt/etc/systemd/network/10-eth0.link with the new MAC address.
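
For reference, such a link file looks something like the following; the MAC address here is a placeholder, and note that systemd expects the file name to end in .link:

[Match]
MACAddress=00:16:3e:aa:bb:cc

[Link]
Name=eth0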

If you're using static IPs, you'll also need to change /mnt/etc/network/interfaces and include the IP for this domU.
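
The relevant stanza in /mnt/etc/network/interfaces will look something like this; the addresses are placeholders, and your device may not be called enX0:

auto enX0
iface enX0 inet static
        address 192.0.2.10/24
        gateway 192.0.2.1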

Regardless of everything else, you'll also definitely need to modify these files:

  • /mnt/etc/hostname
  • /mnt/etc/hosts
  • /mnt/etc/mailname

Something else you probably want to do is regenerate the SSH host keys on the system. To do this, chroot into the file system and run ssh-keygen -A:

# chroot /mnt /bin/bash
# cd /etc/ssh
# rm ssh_host*key*
# ssh-keygen -A
# exit

Once you're done, unmount:

# cd ; umount /mnt/boot ; umount /mnt

Save your new .cfg file, and then:

# xl create /etc/xen/clonebox.cfg -c

If it comes up, log in, poke around, make sure networking's working etc.

Convert Xen PVs to PVHs

After you get a PV domU working, it's time to see if it'll work as a PVH. PVHs are like PVs but use a more efficient memory management technique. They're still under development and considered experimental technology after approximately a decade of development, largely because, well, Kevin's out sick, and everyone else has been hired by RedHat to work on KVM. So use at your own risk. But if you've ever heard someone tell you PVs are old hat and the hotness is HVMs, and thrown up a little in the back of your mouth because why use Xen then, well, PVHs are the things that are more efficient than both, using some of the lessons learned developing HVMs while keeping the efficiencies of having the operating system work with the hypervisor as it does with PV.

So, edit your /etc/xen/testbox.cfg or /etc/xen/clonebox.cfg or whatever, and add the line “type = 'pvh'” somewhere. Save. Do xl create /etc/xen/whatever.cfg -c and check it comes up. (In PV mode I find the kernel writes messages to the console, while in PVH it doesn't, so wait a minute or two before declaring it broken.)

How to get out of the console

Control-]. Both the xl create commands have a -c to attach the console so you can see what's happening, but if you're not used to that, well, CTRL-] is the thing to use to escape from that.

OK, so this one's tough to describe. I had an issue where, because I wanted to have a VM use SSD storage, I wanted to use the btrfs file system. Turns out Xen's PV/PVH system is not btrfs friendly. xen-create-image will happily do it, but pygrub won't boot from a btrfs file system.

There are obvious workarounds. One is to boot the kernel directly, but then you have to pull the kernel and initrd.img files out of the guest's file system. Every time you do an update, you'll have to do this again. If the guest has the same OS as the dom0 then, I guess, you can use that kernel and just make sure you update both at the same time. But it's not really best practices, is it?

Another workaround you can do is create a separate boot partition. This isn't as easy as it sounds, as you'll need to point pygrub at it and somehow convince it the files in /boot are there, because it's looking for /boot/grub/grub.cfg, not [root of device]/grub/grub.cfg. You can do that with a softlink of course (cd /boot ; ln -s . boot) which is a bit hacky, but it should work. But using xen-create-image is going to be a little more complicated if you go down this route. If you're interested in going down this route, I have instructions here.

Finally there's the approach I'm experimenting with, which may also solve other problems you have. And that's really why I'm experimenting with it: it does solve other problems.

What if... the domU's file system was actually native to the dom0 rather than just a block device there?

Well, you can do this with NFS. You can actually tell Linux to boot from an NFS share served from the dom0's NFS server, and have that NFS share be part of the dom0's file system. There's all kinds of reasons why this is Good, Actually™:

  • You can pick your own file system, hence the description above
  • You can have two or more domUs share storage rather than having to shut them down and manually resize their partitions every time they get too big.
  • Backing up becomes much easier as the dom0 has access to everything and is likely the VM any attached back-up media will mount on.
  • Very easy to clone VMs, there's this command called “cp” you can use.
  • A domU crashing will not destroy its own file system (probably)

There are, of course, downsides to this approach:

  • It's probably slightly slower.
  • There's some set up involved.
  • You better make sure your dom0 is locked down because it has all the data open to view in its own file system. Of course, technically it already does have the data available, just not visible in the file system.
  • Most importantly, some stuff just doesn't like NFS.

To clarify on the last point, Debian (at least) can only boot using NFS version 3 or lower. Several modern file system features, notably extended attributes, require version 4.2 (in theory a modified, non-default, initrd.img can be built that supports later NFSes, if I figure it out I'll update the instructions here.) The lack of extended attributes meant I couldn't run LXC on the host and install more recent Ubuntus, though older Ubuntus worked. Despite my best efforts I found certain tools, including a bind/Samba AD combination, just failed in this environment, becoming unresponsive.

The technique I'm going to describe is probably useless to HVM-only users. But you people should probably migrate to KVM anyway, so who cares.

So what do you need to do?

These instructions assume Debian “Bookworm” 12 is your dom0. If you're using something like XenServer or XCP-ng or whatever it calls itself today this won't be very useful, but PVs are deprecated on those platforms anyway, and PVHs didn't work at all from what I remember. They also assume use of a Linux kernel based domU.

Setting up the environment

So on your dom0, do this as root:

# apt-get install nfs-kernel-server rpcbind

Now there's a little set up involved: go edit /etc/default/nfs-kernel-server and change RPCMOUNTDOPTS="--manage-gids" to RPCMOUNTDOPTS="--manage-gids=no"

Now also edit /etc/nfs.conf, look for the line manage-gids=y and change it to manage-gids=n. There are at least two manage-gids=xxx lines in there; change any that aren't commented out.

Both are necessary because with that option enabled NFS uses the /etc/passwd and /etc/group files of the machine it's running on to figure out permissions, and this breaks group ownership unless the client and server have exactly the same passwd and group files. (Off topic, but why is this the default? It'll break NFS shares in 99% of cases.)

# systemctl restart rpcbind nfs-kernel-server

Next up, create somewhere you can put all these NFS shares. You can change the below, just make sure to change it everywhere else when I give commands that refer to it later:

# mkdir /fs4xen

Final thing we're going to do is create an internal network for the domUs to access the NFS server over. We're keeping this separate from whatever network you've created to let the domUs talk to the outside world, so that the NFS server can't be reached from outside.

Edit your /etc/network/interfaces file and add this:

auto fsnet
iface fsnet inet static
        bridge_ports none
        address 172.16.20.1/16

(These are instructions for Debian but I'm aware many Ubuntu users will use Debian instructions as a starting point: If you're using Ubuntu, you'll want to do the Netplan equivalent. You should already have an example bridge in your current /etc/netplan/xxx file, the one used for allowing your domUs to access the outside world. This is more or less the same, except you don't need to bridge to an external port.)
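
For the curious, the Netplan equivalent is roughly the following; I haven't tested this myself, so treat it as a sketch of a bridge with no member interfaces and a static address:

network:
  version: 2
  renderer: networkd
  bridges:
    fsnet:
      interfaces: []
      addresses: [172.16.20.1/16]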

That's the initial set-up out of the way. How do we create an image that uses it?

Creating domUs that boot from NFS

First create your PV (we'll upgrade it to a PVH later). Do it however you normally do; xen-create-image will usually do most of the work for you. But for the sake of argument, let's pretend we're left with a configuration file for the domU that looks a bit like this when stripped of comments:

bootloader = 'pygrub'
vcpus       = '4'
memory      = '16384'
root        = '/dev/xvda2 ro'
disk        = [
                  'phy:/dev/vg-example/testbox-root,xvda2,w',
              ]
name        = 'testbox'
vif         = [ 'ip=10.0.2.1 ,mac=00:01:02:03:04:05' ]
on_poweroff = 'destroy'
on_reboot   = 'restart'
on_crash    = 'restart'

If this configuration is supported by your current version of Xen (for example, you used a regular file system), then you might want to test it works, just to make sure. But shut it down immediately afterwards before continuing.

Verify it isn't running with xl list.

Allocate a new IP address on the internal NFS network, I'm going to use 172.16.20.2 for this example. (I mean, manually allocate it, just look for an IP that isn't in use yet.)

Now, add the domU's root file system to your dom0's fstab, and exports:

# mkdir /fs4xen/testbox
# echo '/dev/vg-example/testbox-root  /fs4xen/testbox     btrfs   noatime,compress=lzo    0       0' >> /etc/fstab
# mount /fs4xen/testbox
# echo '/fs4xen/testbox     172.16.20.2(rw,no_root_squash,sec=sys)' >> /etc/exports
# exportfs -a

If you get any errors when you enter the mount command, check /etc/fstab has the right parameters for your file system. I've assumed btrfs here. You should know roughly what the right parameters are anyway!

You may also get warnings when you type 'exportfs -a' about something called 'subtree_check', you can ignore those. Anything else you should go back and fix.

Finally, we rewrite the Xen .cfg file. This is what the new file would look like based upon everything else here. Use your favorite editor to make the changes and back up the original in case something goes wrong:

vcpus       = '4'
memory      = '16384'
kernel='/fs4xen/testbox/vmlinuz'
root='/dev/nfs'
extra=' rw elevator=noop nfsroot=172.16.20.1:/fs4xen/testbox,vers=3 ip=172.16.20.2:172.16.20.1::255.255.0.0::enX1:off:::'
ramdisk='/fs4xen/testbox/initrd.img'
name        = 'testbox'
vif         = [ 'ip=10.0.2.1,mac=00:01:02:03:04:05',
                'ip=172.16.20.2,mac=00:01:02:03:04:06,bridge=fsnet' ]
on_poweroff = 'destroy'
on_reboot   = 'restart'
on_crash    = 'restart'

Now just boot it with:

# xl create /etc/xen/testbox.cfg -c

and verify it boots and you can log in. Assuming you can, go in and create a file somewhere you have permissions (not /tmp, as that's rarely exported), then switch to another session on your Xen dom0 and verify it exists in the same place under /fs4xen/testbox.

If everything's fine, you can exit the console using Control-] in the usual way.

Turn a PV into a PVH

PVs are the default domU type created by xen-create-image, not least because they're supported on even the worst hardware. But they're considered obsolete, largely because AMD64 introduced better ways to sandbox virtualized operating systems. The new hotness is PVH which uses the newer processor features, but takes advantage of the aspects of PVs that made Xen wonderful and efficient to begin with.

Here's how to turn a PV into a PVH:

  1. Edit the .cfg file in your favorite editor
  2. Add the line “type = 'pvh'” somewhere in the file. Remove any other “type=” lines.
  3. Save
  4. Start the domU, using 'xl create /etc/xen/testbox.cfg -c', and verify it boots to a login prompt.

If it doesn't boot, it may be a hardware issue, or the guest operating system itself might have an issue with PVHs. The most common “hardware” issue is that either your CPU doesn't support virtualization, or it does but it's been disabled in the BIOS. So check your BIOS for virtualization settings.
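
A couple of quick checks that may help here: the first tells you whether the CPU advertises VT-x/AMD-V at all, the second whether Xen itself picked it up (look for 'hvm' in virt_caps):

# grep -Ec '(vmx|svm)' /proc/cpuinfo
# xl info | grep virt_caps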

Yes, another showerthought inspired TL;DR blog. But “DR” is in the name of the blog so it's not exactly as if I'm bothering you with this.

Anyway. Microkernels. Good idea? Or bad idea? Torvalds was right... or wrong?

Well, both. Let's have a talk about what they are first.

In the 1980s, Microkernels were considered THE FUTURE. Virtually every respected academic pointed out that shoving lots of complicated code into one computer program that your entire computer relied upon was insanely stupid. They argued that it should be relatively easy to split out your kernel into “servers”, single purpose programs that managed specific parts of the system. For example, you might have a Disk Operating System that applications use to access files. That Disk Operating System might talk to Handler processes that manage the file systems on each device. And those in turn might talk to the Device Drivers. Each part of what in modern terms would be a kernel talking to programs that can, at worst, crash without taking down the rest of the system. The servers – DOS, the File System Handlers, the Device Drivers, would all be processes running under a very simple kernel (the microkernel! Geddit?!) that would schedule processes, manage memory, and provide ways for the servers to communicate with one another.

(Microkernels should not be confused with hypervisors, which are thin kernels intended to run multiple operating systems. Though much of the early hype about microkernels overlapped, with advocates pointing out that in theory you could create “personalities”. For example, in the above example, in addition to having the Disk Operating System, you could have a completely different API server, with that server providing a view of the world that looked like Unix. And that server could talk to its own set of handlers or the same set.)

Academics generally agreed that microkernels were the only acceptable design for modern operating systems. Against this were more traditional operating systems, Unix being an obvious example, which had a single monolithic kernel with all the file systems and device drivers compiled into it. (Systems like CP/M and MS DOS weren't really advanced enough to have a discussion about.)

Microkernels enter the real world

Academia brought us MINIX and Mach during the 1980s. Mach was the basis of several commercial projects such as MkLinux and, more successfully, XNU (the kernel of NEXTSTEP and Mac OS X), but those commercial projects weren't microkernels; they were always hybrid kernels – kernels where most of the servers were integrated into a single address space where they could freely communicate with one another at the cost of security.

The commercial world in turn tried to implement the concept but inevitably failed. Many readers will have read my description of how microkernels work above, mentioning “DOS” and “handlers” and “device drivers”, and immediately thought of AmigaOS, which was structured like a microkernel-based system but wasn't really one. At first sight it's easy to see why: the Amiga had no memory management chip, so it literally wasn't possible to sandbox the different components. But in reality the problems were deeper than that. AmigaOS demonstrated that you could get good performance out of a microkernel operating system if the different components could quickly and easily communicate with one another. In AmigaOS, a device driver could talk to a handler just by sending it, via the kernel, the address of, say, where it had just loaded some data from disk. Suddenly that handler had 512 bytes of data available to do with whatever it needed to do. But that's not compatible with how memory management is done in modern CPUs. Modern CPUs are about sandboxing processes; sending 512 bytes from one process to another means rather more than simply sending it a four byte address, it involves either reconfiguring the memory map of both processes to see the same 512 byte block of RAM, or asking the kernel to copy that data byte by byte. These are expensive operations. AmigaOS only worked because there was no memory management as we know it, just a giant shared block of memory everything used. And because memory was shared, a crash by one device driver could, actually, take the entire system down, rather than just affect access to the device involved.

This expense in the end crippled a series of other commercial projects that almost certainly looked like a good idea at the time: elegant, modular, exactly the type of thing every programmer starts to develop only to realize will never work once they start coding. A big question for me in the 1980s was why Acorn lumbered the amazing ARM-based computers they created with a crappy, third-rate operating system descended from Acorn's 8-bit BBC OS, “MOS”. The answer is... they did try to create a modern microkernel-based OS for it, called ARX, and immediately got stuck. Despite running on one of the world's fastest microcomputer environments, the system had performance issues that the creators couldn't get around. The moment the elegant design hit reality, it failed, and Arthur/MOS's “good enough” environment was expanded into RISC OS, which used cooperative multitasking and other kludges to make something useful out of a woefully underpowered base.

On the other side of the Atlantic, other companies enthusiastically writing next generation operating systems also had the same issues. Apple started, then co-funded, then walked away from, Taligent. DEC was keeping Dave Cutler busy with MICA, which didn't go anywhere. Finally Microsoft, which was working on a more traditional system with IBM (OS/2), for various reasons hired Dave Cutler away from DEC after MICA's cancellation to develop Windows NT.

The latter was the nearest a commercial microkernel-based operating system came to achieving some level of success. In practice though, Microsoft didn't feel comfortable making Windows NT its primary operating system, despite high levels of compatibility from NT 4 onwards, until the early 2000s, at which point the system was no longer a classic microkernel system, with many essential services, including the graphics drivers (!), integrated into the main kernel.

So why the failures?

At first sight, it's easy to blame the failure of microkernels on performance issues. But that's not actually what happened. There are two bigger issues: the first was that most commercial microkernel projects were part of much bigger projects that attempted to build elegant, well designed operating systems from scratch. The microkernel was only one component.

But the second was modern memory management. At some point in the 1980s, the major makers of microcomputer CPUs started to release advanced, secure, memory management systems for their existing CPUs. Motorola and Intel both grafted virtual memory systems onto their existing systems by allowing operating systems to rearrange the addressable memory as needed. This was all that was needed for Unix to work, and Unix was considered the most advanced operating system a personal computer user would want to run.

And yes, Unix somehow managed to be both a very big deal and an irrelevance in the personal computing world. Microsoft, acknowledging MS DOS 1.0 was little more than a CP/M like program loader, saw DOS's future as converging with Xenix, its Unix fork. The press described anything with multitasking, from AmigaOS to OS-9, as “Unix-like”, no matter how unlike Unix it was, because Unix was seen as The Future.

So from the point of view of the big CPU makers, a simple memory remapping system was “good enough” for the most advanced operating systems envisaged as running on their chips. There was another factor behind both Intel and Motorola designing MMUs this way: Motorola had designed a very successful 32-bit ISA for its CPUs that programmers adored. Intel's segmented approach had proven to be a failure, propped up only by IBM's decision to include the 8088 in its PC. Intel was focusing on making a pure 32 bit ISA for its next generation of processors, while Motorola saw no need to change its ISA, and saw MMUs as something that could be bolted on to the architecture of a 68000-based system. By the time it became important, neither saw any value in taking a risk and introducing architectures that would integrate memory management with their ISAs.

Why is this important? Well, go back to the AmigaOS description earlier. In the Amiga, the pseudo-microkernel was fast because servers only needed to send each other addresses to transmit large amounts of data between them. On the 68000 ISA there is no way to graft security onto this system – you can't validate a pointer or the memory it points to, but in the mid-1960s and early 1970s, hardware memory management systems were devised that allowed exactly this kind of thing. The system is called Capability Addressing. Capabilities are pointers to blocks of memory, typically with permissions associated with them (like a file.) Creating new capabilities is a privileged operation, you can't just use some pointer arithmetic to create one. Storing a capability in memory requires the CPU have some way to flag that value as being a capability, typically an extra bit for every word of memory. This way programs can load and store capabilities in memory without risking reading normal data as a pointer or vice versa.

A capability architecture would be perfect for an operating system like AmigaOS. It would, with relatively small modifications, be secure. The different servers would be able to communicate by passing capabilities instead of pointers. If one crashes, it wouldn't be able to write to memory not allocated to it because it wouldn't have any capabilities in its memory space that point at that data.

The problem, of course, is that no popular CPUs support capabilities, and most of those that did were also considered failures. Intel tried to produce such a system in the early 1980s, the iAPX 432, which was not part of their 80x86 family. It was chronically slow. And the 1980s were not a time to produce such a chip: the extra bit required for each 32-bit (at minimum) pointer would have been considered cost prohibitive at a time when computers came with hundreds of kilobytes of RAM.

It would be remiss of me not to mention that there was also another theoretical possibility: managed code. In managed code, programs are compiled to an intermediate language, which can be proven “secure” – that is, unable to access resources it hasn't been given direct access to. The two most famous examples are the Java Virtual Machine and .NET. Both systems have problems however: their garbage collectors require the memory of the machines they're running on be locked for indeterminate amounts of time while they account for what's in use (a process called “marking”), though it's worth mentioning that Rust's alternative approach to memory management suggests a VM could be built with better real time behavior. Another problem was that during the 1980s C became the standard applications development language, with personal computers not being taken seriously unless they were capable of running it: but the high level approach of a VM is at serious odds with C's low level memory management, making it impossible to create an efficient C compiler for such an environment.

So, TL;DR, it wasn't that microkernels were wrong, it's that the technology choices of the 1980s and 1990s, the time when it was most important, made microkernels inefficient and difficult to implement. By the time memory prices had fallen to a point that a CPU architecture optimized for microkernels would have been viable, the world had standardized on operating systems and system architectures that weren't compatible with the concept.

The failures of the 1980s were mostly because developers were being overly ambitious and didn't have the right architectures to work with in the first place.

All of which is a shame. I'd love my primary OS to be like AmigaOS is/was, but with security.

I have no raw figures, I have tried to get them from the Internet. All I have are memories. And the biggest I had was a colleague distributing Ubuntu CDs at work during the mid-2000s. Ubuntu was the next big thing, it was said. And when I tried it, I had to admit, I couldn't blame anyone for saying so.

Ubuntu in the 2000s was a first class OS. It had the following massive features:

  • It ran on pretty much anything powerful enough. The installer was first rate, booted in the most trying conditions, and installed an image with your wireless, sound and accelerated video all ready to use. Well, OK, ATI cards needed that fglrx thing to get acceleration, and I can't remember exactly how you installed it, but I do know it wasn't hard.
  • It ran GNOME 2. For those who are wondering why that was a good thing, GNOME 2 was basically an intuitive user interface that was somewhat more obvious and consistent than Windows, and maybe a small step back from Mac OS X. It was customizable but...
  • ...it had sane defaults everywhere. The default Ubuntu desktop at that time was easy to understand.

Did you have to drop into the command line to do anything? That depended. You sometimes did in the same way you sometimes have to with Windows or Mac OS X. You have an obscure set of technical conditions, and you need to debug something or configure something equally obscure, and just like Mac OS X and Windows you'd have to use the “nerd” user interface. But an “average” user who just wanted a web browser and office suite would not ever need to do that.

So it wasn't surprising that, anecdotally (like I said, it seems to be rough getting any concrete figures: Statcounter claims a 0.65% market share for “Linux” in 2009, but I don't trust them as far as I can throw them, and more importantly they have no pre-2009 figures online, making it hard to show growth during that period; it's also contradicted by other information I'm finding on the web), Ubuntu started to drive installs of GNU/Linux. People really seemed to like it. I even heard of major figures in the Mac world at the time switching. Ubuntu was the OS everyone wanted, it “just worked”.

So what happened in the 2010s to halt this progress? Everything changed? Yes.

And by everything, I mean Ubuntu.

Ubuntu decided to change its user interface from GNOME 2 to Unity. In part this was driven by the GNOME team themselves who, for whatever reason, decided the GNOME 2 user interface was obsolete and they should do something different.

I'm not necessarily opposed to this thinking, except for the “obsolete” part, but neither party (Canonical, authors of Ubuntu and the Unity user interface, and the GNOME team) went about it with an understanding of the impact on existing users. Namely:

  • The user interfaces they proposed were in most cases radically different from GNOME 2. So existing users wanting to upgrade would find they would literally have to learn how to use their computers again.
  • The user interfaces proposed only partially used the paradigms that everyone had gotten used to and trained on during the 1990s. GNOME 3 in particular switched to a search model for almost everything. Unity was a little more standard, but launching infrequently used applications in both environments was confusing. These user interfaces were only slightly closer to what had become standard in the 1990s than the new mobile touchscreen UIs that doubtless had influenced their authors.

To understand how massive a problem this was, look at Apple and Microsoft's experience with user interface refreshes.

Apple does it right

Let's start with Apple, because Apple didn't fail:

In the 1990s and early 2000s, Apple switched from their 1980s MacOS operating system to the NEXTSTEP derived Mac OS X. NEXTSTEP and MacOS were nothing alike from a user interface point of view, making shipping NEXTSTEP with new Macs a non-starter. So Apple took pains to rewrite the entire NEXTSTEP user interface system to make it look and feel as close as possible to contemporary MacOS.

The result was Rhapsody. Rhapsody had some “feel” issues in the sense of buttons not quite responding the same way they did in MacOS, some things were in a different place, and running old MacOS applications felt clumsy, but a MacOS user could easily switch to Rhapsody and while they would be aware they were running a new operating system, they knew how to use it out of the box.

Rhapsody was well received by those who got it (it was released in beta form to developers, and sold for a while as Mac OS X Server 1.0), but from Apple's point of view, they still had time to do better. So they gave the operating system's theme an overhaul, creating Aqua. But the overhaul was far more conservative than people give Apple credit for:

  • If something was recognizably a button in Rhapsody/MacOS, it was recognizably a button in Aqua.
  • If something was a window in Rhapsody/MacOS, it was recognizably a window in Aqua.
  • If you did something by dragging it or clicking it or poking your tongue out at it in Rhapsody/MacOS, you'd do the same thing in Aqua.
  • If it was in the top left corner in Rhapsody/MacOS, it was in the top left corner in Aqua. Positions generally stayed the same.

...and so on. The only major new user interface element they added was a dock. Which could even be hidden if the user didn't like it.

So the result, when Apple finally rolled this out, was an entirely new operating system with a modern user interface that looked fantastic that was completely 100% usable by people used to the old one.

Microsoft “pulls a Ubuntu/GNOME” but understands how to recover

In some ways saying Apple did it right and Microsoft didn't is unfair, because Microsoft has done operating system upgrades correctly more times than you might imagine. And they even once managed a complete GNOME-style UI overhaul that actually succeeded: replacing Windows 3.x's UI with Windows 95's UI. They were successful that time, though, for a variety of reasons:

  • Windows 3.x was really hard to use. Nobody liked it.
  • The new Windows 95 user interface was a composite UI based upon Mac OS, Amiga OS, GEM, Windows 1.x, OS/2, and so on. It was instantly familiar to most people who had used graphical mouse-driven user interfaces before.
  • In 1995, there were still people using DOS. Windows 3.x was gaining acceptance but wasn't universally used.

Since then, from 1995 to 2012, Microsoft managed to avoid making any serious mistakes with the user interface. They migrated NT to the 95 UI with Windows NT 4. They gave it a, in my view ugly, refresh with Windows XP which was a purely visual clean up similar to, though not as radical as, the Rhapsody to Aqua user interface changes I noted above. But like Rhapsody to Aqua, no serious changes in the user interface paradigm were made.

They did the same thing with Vista/7 creating a clean, composited, UI that was really quite beautiful, yet, again, kept the same essential paradigms so a Windows 95 user could easily switch to Windows 7 without having to relearn anything.

Then Microsoft screwed up. Convinced, as many in the industry were at the time, the future was touch user interfaces and tablets, they released Windows 8, which completely revamped the user interface and changed how the user interacted with the computer. They moved elements around, they made things full screen, they made things invisible.

Despite actually being very nice on a tablet, and despite PC manufacturers pushing 2 in 1 devices hard on the back of Windows 8's excellent touch screen support, users revolted and refused to have anything to do with it.

Windows 8 generated substantial panic at Microsoft, resulting in virtually all the user interface changes being taken out of Windows 10, its major successor. Windows 10 itself was rushed out, with early versions being buggy and unresponsive. But compared to Windows 7, the user interface changes were far less radical. It retained the Windows 7 task bar, the start menu, and buttons were where you'd expect them. A revised preferences system was introduced that... would have been controversial if it wasn't for the fact earlier versions of Windows had a fragmented system of half written preferences systems anyway. A notifications bar was introduced, but it wasn't particularly radical.

But windows, buttons, etc, all operated the same way they did in Windows 7 and its predecessors.

What is NOT the reason Ubuntu ceased to be the solution in the 2010s.

Amazingly, I've heard the argument Ubuntu failed because the underlying operating system is “too nerdy”. It isn't. It's no more nerdy than Mac OS X, which was based on a similar operating system.

Mac OS X is based on a kernel called XNU, which in turn is based on a kernel called Mach, that's been heavily modified, and a userland that's a combination of – let's call it user interface code – and BSD. There are some other small differences like the system to manage daemons (in old school BSD this would have been bsdinit), but nothing that you'd immediately notice as an end user.

All versions of GNU/Linux, including Ubuntu, are based on a kernel called Linux, and a userland that's a combination of the GNU project and some other projects like X11 (which maintains the core windowing system) and some GNU projects like GNOME (which does the rest of the UI.) There are multiple distribution specific changes to things like, well, the system to manage daemons.

So both are XNU or Linux, BSD or GNU, and then some other stuff that was bolted on.

XNU and Linux are OS kernels designed as direct replacements for the Unix kernel. They're open source, and they exist for slightly different reasons, XNU's Mach underpinnings being an academic research project, and Linux being Linus Torvalds' effort to get MINIX and GNU working on his 386 computer.

BSD and GNU are similar projects that ultimately did the same things as each other but for very different reasons. They're both rewrites of Unix's userland, that started as enhancements, and ultimately became replacements. In BSD's case it's just a project to enhance Unix that grew into a replacement because of frustration at AT&T's inability to get Unix out to a wider audience. In GNU's case, it was always the plan to have it replace Unix, but it started as an enhancement because it's easier to build a replacement if you don't have to do the whole thing at once.

So... that's all nerd stuff right? Sure. But dig into both OSes and you'll find they're pretty much built the same way: a nice friendly user interface bolted onto Unix-like underpinnings that'll never be friendly to non-nerds. So saying Ubuntu failed because it's too nerdy is silly. Mac OS X would have failed for the same reason if that were true. The different origins of the two do not change the fact they're similar implementations of the same underlying concept.

So what did Ubuntu do wrong and what should it have done?

The entire computer industry at this point seems to be obsessed with changing things for the sake of doing so, to make it appear they're making progress. In reality, changes should be small, and cosmetic changes are better for both users and (for want of a better term) marketing reasons than major paradigm changes. The latter is bad for users, and doesn't necessarily help “marketing” as much as marketing people think it helps them.

Ubuntu failed to make GNU/Linux take off because it clumsily changed its entire user interface in the early 2010s for no good reason. This might have been justifiable if:

  • The changes were cosmetic as they were for the user interfaces in Windows 95 vs XP vs Vista/7 vs 10/11, and Rhapsody vs Aqua. They weren't.
  • The older user interface it was replacing was considered user unfriendly (like the replacement of Windows 3.1's with 95.) It was, in fact, very popular and easy to use.
  • The older user interface prevented progress in some way. If this is the reason, the apparent progress GNOME 3+ and Unity enabled has yet to be identified.
  • The older user interface was harder for users migrating from other systems to get used to than its replacements. This is laughably untrue.

Radically changing a user interface is a bad idea. It makes existing users leave unless forced to stay. And unless it's sufficiently closer to the other user interfaces people are using, it won't attract new users. It was a colossal misstep on GNOME and Canonical's part.

GNOME 3/Unity should, to put it bluntly, have had the same fundamental paradigm as GNOME 2. Maybe with an optional dock, but not the dock-and-search focused system they put in instead.

Where both teams should have put their focus is simple modernization of the look, with larger changes reserved for less frequently used parts of the system or for internals needed to attract developers. I'm not particularly pro-Flatpak (and Snap can die a thousand deaths), but making it easier to install third party applications (applications not in repositories) would have also addressed some of the few holes in Ubuntu that other operating systems did better. There's a range of ways of doing this that do not involve sandboxing things and forcing developers to ship and maintain all the dependencies of their applications, such as:

  • Identifying a core subset of packages that will only ever be replaced by backward compatible versions in the foreseeable future and will always be installed by default, and encouraging static linking for libraries outside of those packages, even making static linking the default. (glibc and the GTK libraries are obvious examples of the former, libraries that should be fully supported going forward with complete backward compatibility, while more obscure libraries and those that have alternatives, image file parsers being an example, should be statically linked by default.)
  • Supporting signed .DEBs
  • Making it easy to add a third party repository while sandboxing it (to ensure only relevant packages are ever loaded from it) and authenticating the identity of the maintainer at the time it's added. (Canonical's PPA system is a step in the right direction but it does force the repos to be maintained by them.)
  • Submitting Kernel patches that allow for more userland device drivers (giving them a stable ABI)

Wait! This is all “nerd stuff”. But non-nerds don't need to know it, from their perspective they just need to know that if they download an application from a website, it'll “just work”, and continue to when they install GreatDistro 2048.1 in 24 years.

What is NOT the solution?

The solution is not an entirely different operating system, because any operating system that gets the same level of support as GNU/Linux will find itself making the same mistakes. To take an example off the top of my head, with no particular reason to select this one except that it's a well regarded open source OS that's not GNU/Linux... ooh, Haiku, the OS inspired by BeOS?

Imagine Haiku becoming popular. Imagine who will be in charge of it. Will these people be any different to those responsible for GNOME and Canonical's mistakes?

No.

Had Haiku been the basis of Ubuntu in the 2000s, it's equally possible that Haiku would have suffered an unnecessary user interface replacement “inspired” by the sudden touch screen device craze. Why wouldn't it? It happened to GNOME and Ubuntu. It happened to Windows for crying out loud. Haiku didn't go there not because it's inherently superior but because it was driven by BeOS loving purists in the time period in question. If Haiku became popular, it wouldn't be driven by BeOS loving purists any more.

Frankly, I don't want Haiku to become popular for that reason, it'd ruin it. I'd love, however, for using fringe platforms to be more practical...

Been using this today:

https://cambridgez88.jira.com/wiki/spaces/OZVM/overview

The Z88 was the last computer released by Sinclair Research (using the name Cambridge Computer, as Amstrad by then had bought the rights to the Sinclair name.) The Z88 was an A4-paper (that's “Like Letter-size but sane” to 'murricans) sized slab-style laptop computer. By slab-style I mean the screen and keyboard were built into a single rectangular slab, it didn't fold like a modern laptop. It was Z80 based, had solid state memory, and a 640x64 monochrome (supertwist LCD) display which looked gorgeous. There was 32k of battery backed RAM, but my understanding is functionality was very limited unless you put in a RAM expansion – needing a RAM pack was, the Spectrum aside, something of a Sinclair trademark. In classic Sinclair style it had a rubber “dead flesh” keyboard, though there was a justification given, that the keyboard was “quiet”, and that was probably legitimately a selling point.

Sir Clive had a dream dating back to the early 1980s that everyone should have a portable computer that was their “main” computer. The idea took shape during the development of the ZX81, and was originally the intended use of the technologies that went into the QL. Some of the weirder specifications of the QL, such as its 512x256 screen being much wider than the viewable area of most TVs, came from Sinclair's original intention to use a custom CRT with a Fresnel lens set up as the main display for the machine. Early on it was found that the battery life of the portable computer designed around the ZX83 chips was measured in minutes, and the idea was discarded. (I believe, from Tebby's account, that the ZX83 chips remained unchanged because they started to have difficulty getting new ULA designs tested.)

So... after selling up to Amstrad, Sinclair tried one last time and made a Z80-based machine. He discarded both Microdrives (which weren't energy efficient, and I suspect belonged to Amstrad at this point) and his cherished flat screen CRT technologies (which were widely criticized) and finally adopted LCDs. And at that point it looks like everything came together. There were still issues – the machine needed energy efficient static RAM which did (and does) cost a small fortune, so the Z88 had limited storage in its base form. Flash was not a thing in 1988, EEPROMs were expensive and limited, but more conventional EPROMs (which used UV light to reset them) were affordable storage options.

So, with a combination wordprocessor/spreadsheet (Pipedream), BASIC, Calendar/clock, and file management tools, the computer was definitely finally useful.

I never got a Z88 as I was still a teenager at the time and the cost was still out of my league. When I got my QL it was 80GBP (on clearance at Dixons) which I just had enough savings for. Added a 25GBP monitor a few months later. But that gives you some idea of the budget I was on during the height of the original computer boom.

Anywho, IIRC the Z88 ended up being around 200GBP and the media was even more expensive, which would have been a hell of a gamble for me at the time given that, despite Sir Clive's intentions, it was far from a desktop replacement. It had limited programmability – it came with BBC BASIC (not SuperBASIC, as Amstrad now had the rights to that) but otherwise development was expensive. And a 32K Z80 based computer in 1988 was fairly limited.

But I really would have gotten one had I had the money. I really loved the concept.

The emulator above comes as a Java package that requires an older version of Java to run. It wouldn't start under OpenJDK 17 (as comes with Debian 12), but I was able to download JDK 6 from Oracle's archive (https://www.oracle.com/java/technologies/javase-java-archive-javase6-downloads.html), which ran fine from the directory I installed it into without having to mess with environment variables.
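
In case it saves someone a minute, the invocation ends up being something along these lines, where the JDK path and jar name are whatever you ended up with on your system (both are hypothetical here):

/opt/jdk1.6.0_45/bin/java -jar OZvm.jar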

Anyway, a little glimpse into what portable computing looked like in the 1980s, pre-smartphones and clamshell laptops.

See also:

There's also the ill-fated Commodore LCD, a 6502 KERNAL based system designed by Bill Herd. It wasn't a slab, having a fold out screen, but was similar in concept. It was killed by an idiotic Commodore Manager who asked Radio Shack if they should enter the market with a cheap laptop, and who believed the Radio Shack executive he spoke to when said exec told him there wasn't a market. Radio Shack was, of course, selling the TRS-80 Model 100 at the time, and making money hand over fist.

Final comment: these types of slab computer weren't the only “portable computers” in the 1980s. Excluding luggables (which weren't true portables in the sense they couldn't run without a mains power connection), and a few early attempts at clamshell laptops, there were also “pocket computers”. Made mostly by Casio and Sharp, these were miracles of miniaturization, usually with only a few kilobytes of memory at most and a one or two line alphanumeric LCD display. I had a Casio PB-80 which had about 500 bytes of usable memory. (IIRC they called bytes “steps”, reflecting the fact these things were designed by their manufacturer's programmable calculator divisions) They did have full versions of BASIC, and arguably their modern successors are graphing calculators. These devices were nice, but their lack of any communications system or any way to load/save to external media made them limited for anything beyond really simple games and stock calculator functions.