The first site of this homelab needed a real network design before anything else could land on it: which VLANs, which subnets, and — since “no single point of failure” is a founding principle — how DHCP and DNS keep working when a box dies. This post is the whole design, as ratified, with the reasoning that survived the arguments.
The gear at this site is modest and deliberate: a UniFi Cloud Gateway, one 8-port 2.5G PoE switch, one Wi-Fi 7 AP, a small ARM board running Incus, a 2-node Proxmox cluster, and three retired Intel Macs earning their keep as VM hosts.
The VLAN map, and why VLAN 1 is deliberately empty
| Network | VLAN | Purpose |
|---|---|---|
| Adoption | 1 | quarantine — factory-fresh gear only |
| Management | 10 | switch/IPMI/hypervisor management, tagged-only |
| IoT | 51 | things; external egress blocked |
| Guest | 52 | internet only, isolated |
| Servers | 64 | hypervisors and services |
| Clients | 84 | trusted family devices |
| DNS/DHCP infra | 94 | the network’s own services |
The interesting one is VLAN 1. UniFi devices are born talking untagged — a factory-reset AP always comes up on the default network. Most setups either live with clients sharing that space or fight the vendor default. We did neither: VLAN 1 holds nothing at all. New gear lands there, gets adopted, and its management interface is immediately moved to tagged VLAN 10. From then on VLAN 1 is an empty room with a doorbell.
Two properties fall out of this for free:
- A stray device on a forgotten port joins a network containing nothing — not clients, not management. Port discipline is still good hygiene, but it’s no longer load-bearing for security.
- If a switch’s VLAN-10 config ever breaks, it falls back to untagged VLAN 1 — still reachable, still adoptable. The quarantine doubles as the break-glass path.
Addresses you can read
Multi-site changes the addressing calculus: every subnet must be unique across all sites, because they’ll be meshed over a VPN overlay. The rule we ratified:
10.<site>.<vlan>.<host>/24
Second octet = site, third = VLAN ID, and suddenly every address is self-describing:
10.1.51.x is site-1 IoT, 10.2.64.x is site-2 servers. No lookup table, ever.
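As a minimal sketch of how mechanical the rule is, the snippet below builds and reads such addresses with Python's standard `ipaddress` module. The function names and the short VLAN labels are made up for this illustration; only the VLAN IDs and the `10.<site>.<vlan>.<host>/24` layout come from the design.

```python
import ipaddress

# VLAN IDs from the table above; the short labels are illustrative only.
VLANS = {
    1: "adoption", 10: "management", 51: "iot", 52: "guest",
    64: "servers", 84: "clients", 94: "infra",
}


def subnet_for(site: int, vlan: int) -> ipaddress.IPv4Network:
    """Return the /24 for a given site and VLAN: 10.<site>.<vlan>.0/24."""
    if not 0 <= site <= 255:
        raise ValueError(f"site {site} does not fit in one octet")
    if vlan not in VLANS:
        raise ValueError(f"unknown VLAN {vlan}")
    return ipaddress.ip_network(f"10.{site}.{vlan}.0/24")


def describe(address: str) -> str:
    """Read site, VLAN and host straight out of the octets."""
    first, site, vlan, host = ipaddress.IPv4Address(address).packed
    if first != 10 or vlan not in VLANS:
        raise ValueError(f"{address} does not follow the 10.<site>.<vlan> rule")
    return f"site-{site} {VLANS[vlan]}, host {host}"


print(subnet_for(2, 64))       # 10.2.64.0/24
print(describe("10.1.51.23"))  # site-1 iot, host 23
```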
We stress-tested the obvious alternatives and both failed on arithmetic, not taste:
- 192.168/16 has a single free octet, which would have to encode site and VLAN together. With our VLAN IDs, every packing either overflows past the second site or collides with itself. It’s also what every hotel, phone hotspot, and friend’s router uses — a standing invitation for VPN routing conflicts.
- 172.16/12 actually has enough bits — but Docker squats on exactly that range (the default bridge takes 172.17/16, and every compose network eats the next /16). A site numbered 172.18.x would be shadowed by a local bridge route on every Docker host we run. Colliding with your own infrastructure by default is worse than any aesthetic argument.
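Both failures are easy to reproduce. The sketch below checks one readable packing for 192.168/16 (third octet = site × 100 + VLAN, chosen here purely as an example) and tests candidate subnets against Docker's default address pools. The pool list is an assumption about stock Docker defaults (172.17.0.0/16 through 172.31.0.0/16); a host with a customised `default-address-pools` setting would differ.

```python
import ipaddress

VLAN_IDS = [1, 10, 51, 52, 64, 84, 94]


def overflowing_vlans(site: int, stride: int = 100) -> list[int]:
    """VLANs whose packed octet (site * stride + vlan) no longer fits in 0-255."""
    return [vlan for vlan in VLAN_IDS if site * stride + vlan > 255]


# Assumed stock Docker pools: the default bridge plus the /16s that
# user-defined and compose networks are carved from.
DOCKER_DEFAULT_POOLS = [
    ipaddress.ip_network(f"172.{n}.0.0/16") for n in range(17, 32)
]


def shadowed_by_docker(subnet: str) -> bool:
    """True if the subnet overlaps any of the assumed Docker pools."""
    candidate = ipaddress.ip_network(subnet)
    return any(candidate.overlaps(pool) for pool in DOCKER_DEFAULT_POOLS)


print(overflowing_vlans(1))                 # []
print(overflowing_vlans(2))                 # [64, 84, 94]
print(shadowed_by_docker("172.18.64.0/24")) # True
print(shadowed_by_docker("10.1.64.0/24"))   # False
```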
Within each /24, the host part has a reading rule too: .1–.9 reserved, .10–.99 DHCP pool, .100–.253 manual statics, gateway at .254. See a last octet from 10 to 99? It’s a dynamic client. 100 or above? Static infrastructure. We also decided against fixed-lease DHCP reservations entirely: servers get configured statics, and the DHCP server stays a service for clients, not an inventory database.
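The reading rule translates directly into a lookup on the last octet. This is only a sketch; the role names returned by the hypothetical `host_role` helper are labels picked for the example.

```python
import ipaddress


def host_role(address: str) -> str:
    """Classify an address by its last octet, following the per-/24 rule."""
    host = ipaddress.IPv4Address(address).packed[3]
    if host in (0, 255):
        return "network-or-broadcast"
    if host <= 9:
        return "reserved"
    if host <= 99:
        return "dhcp-pool"
    if host <= 253:
        return "static"
    return "gateway"  # only .254 is left at this point


print(host_role("10.1.84.42"))   # dhcp-pool
print(host_role("10.1.64.135"))  # static
print(host_role("10.1.64.254"))  # gateway
```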
DHCP that survives a dead server
The gateway’s built-in DHCP is convenient and a single point of failure married to the router. We replaced it with ISC Kea in high-availability hot-standby: two instances, one in an Incus container on the ARM board, one on the Proxmox cluster — different hardware, synchronized leases, automatic failover, JSON configs living in git.
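For a feel of what "JSON configs living in git" can look like, here is a hedged sketch that renders the hot-standby part of a Kea DHCPv4 config for each peer. It is not our production config: the server names, peer URLs, hook library paths, and timer values are all placeholders, and a real config also needs interface, lease-database, and control settings that are omitted here. The key names follow Kea's high-availability hook as documented upstream.

```python
import json

# Placeholder peers: names and URLs are hypothetical.
PEERS = [
    {"name": "kea-arm", "url": "http://10.1.94.101:8000/",
     "role": "primary", "auto-failover": True},
    {"name": "kea-pve", "url": "http://10.1.94.102:8000/",
     "role": "standby", "auto-failover": True},
]


def kea_dhcp4_fragment(this_server: str, hooks_dir: str = "/usr/lib/kea/hooks") -> dict:
    """Build a partial Dhcp4 config; only this-server-name differs per peer."""
    ha = {
        "this-server-name": this_server,
        "mode": "hot-standby",
        "heartbeat-delay": 10000,      # milliseconds, illustrative value
        "max-response-delay": 60000,   # milliseconds, illustrative value
        "peers": PEERS,
    }
    return {
        "Dhcp4": {
            "hooks-libraries": [
                # The HA hook relies on the lease commands hook for lease sync.
                {"library": f"{hooks_dir}/libdhcp_lease_cmds.so"},
                {"library": f"{hooks_dir}/libdhcp_ha.so",
                 "parameters": {"high-availability": [ha]}},
            ],
            "subnet4": [
                # Clients VLAN at site 1; the pool follows the .10-.99 rule.
                {"id": 84, "subnet": "10.1.84.0/24",
                 "pools": [{"pool": "10.1.84.10 - 10.1.84.99"}]},
            ],
        }
    }


for peer in PEERS:
    print(json.dumps(kea_dhcp4_fragment(peer["name"]), indent=2))
```

Generating both files from one template keeps the two peers from drifting apart, since hot-standby expects them to agree on everything except who they are.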
Two design choices worth stealing:
Trunk, don’t relay. Instead of DHCP relay on the gateway, both Kea containers get a tagged NIC in every VLAN they serve. This dodges a real UniFi limitation (there is no DHCPv6 relay at all), removes the gateway from the DHCP path entirely, and — since the containers touch only tagged VLANs — changing what rides untagged on a trunk can never break address assignment.
One deliberate exception. VLAN 1, the adoption quarantine, keeps the gateway’s DHCP. Why? Circular dependency: device adoption is what you need most while rebuilding the network — exactly when the Kea containers, which live behind switches that may themselves need re-adoption, might be down. The break-glass network depends on nothing but the gateway.
DNS in two layers
DNS got split into two services that are usually (wrongly) one:
- Recursive resolvers — dnsdist in front of Unbound, three VMs, one virtual IP via keepalived. This is the only DNS address clients ever see.
- Authoritative servers — three more VMs behind a second VIP, serving the internal zone (a delegated subdomain of the public domain). No recursion, no forwarding, never handed to clients.
Both layers live in VLAN 94 with the DHCP servers, which keeps the firewall policy for the whole “network infrastructure” class down to one sentence: every VLAN may reach the resolver VIP on port 53, and nothing else. Splitting authoritative from recursive means a cache-poisoning bug and a zone-transfer bug are different blast radii — and you can upgrade one layer without touching the other.
An honest note: all six DNS VMs run on the retired Intel Macs, which our own tier policy classifies as expendable hardware. That’s a documented, accepted risk — three instances and a VIP per layer buy enough redundancy for a family — with a planned exit once proper always-on hardware lands at this site.
Time, and the IoT lie
NTP came next, with two lessons:
Run time servers on hosts, never in containers. An unprivileged container shares the host’s kernel clock and can’t discipline it, so chrony inside would serve time it has no power to correct. Chrony therefore runs on five hosts (the ARM board, a Proxmox node, and the three Macs), all with a leg in VLAN 94, syncing upstream over NTS. No virtual IP needed here: NTP expects clients to juggle multiple servers, so redundancy is built into the protocol and we just hand out all five addresses via DHCP.
IoT firmware lies about NTP. Cheap devices hardcode pool.ntp.org and ignore
whatever DHCP tells them — and our IoT VLAN blocks external egress on principle. The
fix is a one-line gateway DNAT: any UDP/123 packet leaving the IoT VLAN for the
internet gets its destination rewritten to an internal time server. The device asks
“the pool”, the ARM board answers, nobody’s feelings get hurt.
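The rule itself lives in the gateway's NAT configuration, which is not reproduced here. As a stand-in, this small Python model shows the matching logic the rule has to express. The internal time server address is hypothetical, and "the internet" is approximated as "anything outside 10.0.0.0/8", which is an assumption that only holds because all sites are numbered inside that range.

```python
import ipaddress
from dataclasses import dataclass, replace

IOT_NET = ipaddress.ip_network("10.1.51.0/24")       # site-1 IoT VLAN
INTERNAL = ipaddress.ip_network("10.0.0.0/8")        # all sites live in here
TIME_SERVER = ipaddress.ip_address("10.1.94.101")    # hypothetical address


@dataclass(frozen=True)
class Packet:
    src: ipaddress.IPv4Address
    dst: ipaddress.IPv4Address
    proto: str
    dport: int


def dnat_ntp(pkt: Packet) -> Packet:
    """Rewrite the destination of IoT NTP queries that are headed off-site."""
    leaving_iot = pkt.src in IOT_NET and pkt.dst not in INTERNAL
    if leaving_iot and pkt.proto == "udp" and pkt.dport == 123:
        return replace(pkt, dst=TIME_SERVER)
    return pkt


query = Packet(ipaddress.ip_address("10.1.51.23"),
               ipaddress.ip_address("162.159.200.1"), "udp", 123)
print(dnat_ntp(query).dst)  # 10.1.94.101
```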
IPv6, honestly
The v6 story is deliberately unglamorous. Clients get global addresses via SLAAC from router advertisements. The ISP hands us a small set of fixed /64s rather than a real delegation — not enough for every VLAN, and here the earlier decisions paid off: the IoT VLAN has no business with global addresses (its egress is blocked anyway), and neither do the quarantine or management VLANs. Internet-facing VLANs get the prefixes; the rest get none, which structurally closes the “IoT phones home over v6” hole — no NAT66 acrobatics required.
For stable internal addressing there’s a ratified-but-deferred ULA plan: one random
/48 per site (per RFC 4193), with the subnet ID carrying the VLAN digits verbatim —
fdxx:xxxx:xxxx:94::/64 maps to 10.<site>.94.0/24, and static hosts mirror their v4
last octet as the v6 suffix (::135 ↔ .135). One address family reads exactly like
the other. It’s deferred because the current gateway can’t advertise a delegated
prefix and a static ULA on the same network — a limitation to revisit, not design
debt.
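To make the mirroring concrete, here is a sketch of the ULA arithmetic. "Digits verbatim" is read as: take the decimal digits of the VLAN ID or last octet and reuse them as hex digits. The example prefix `fd12:3456:789a::/48` is invented for the demonstration, and the random prefix generator simply draws 40 random bits, which is a simplification of the generation procedure suggested in RFC 4193.

```python
import ipaddress
import secrets


def random_ula_site_prefix() -> ipaddress.IPv6Network:
    """Draw a random fdxx:xxxx:xxxx::/48 (fd + 40-bit global ID)."""
    global_id = secrets.randbits(40)
    return ipaddress.IPv6Network(((0xFD << 120) | (global_id << 80), 48))


def vlan_subnet(site_prefix: ipaddress.IPv6Network, vlan: int) -> ipaddress.IPv6Network:
    """Place the VLAN's decimal digits, unchanged, into the 16-bit subnet ID."""
    subnet_id = int(str(vlan), 16)  # VLAN 94 becomes the hex group "94"
    base = int(site_prefix.network_address)
    return ipaddress.IPv6Network((base | (subnet_id << 64), 64))


def mirror_host(subnet: ipaddress.IPv6Network, v4_address: str) -> ipaddress.IPv6Address:
    """Reuse the IPv4 last octet's digits as the IPv6 suffix (.135 -> ::135)."""
    last_octet = ipaddress.IPv4Address(v4_address).packed[3]
    return subnet[int(str(last_octet), 16)]


site = ipaddress.IPv6Network("fd12:3456:789a::/48")  # made-up example prefix
infra = vlan_subnet(site, 94)
print(infra)                              # fd12:3456:789a:94::/64
print(mirror_host(infra, "10.1.94.135"))  # fd12:3456:789a:94::135
```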
And one rule that costs nothing now and everything later if skipped: every IPv4 firewall policy has a v6 twin. There’s no NAT in v6 to paper over a missing rule.
What I’d tell past-me
- An empty VLAN 1 turns a vendor default from a liability into a break-glass feature.
- Encode meaning into addresses (site, VLAN, static-vs-dynamic) — future-you reads IPs at 1 a.m. more often than documentation.
- Check what Docker, Kubernetes, and your own tools claim by default before picking address space.
- Redundancy mechanisms differ per protocol: DHCP needed an HA pair, DNS needed VIPs, NTP needed nothing but a longer server list. Don’t pay for HA machinery a protocol already gives you.
- Write down the exceptions (the VLAN-1 DHCP carve-out) with their why — they’re the first thing a future rebuild will otherwise “fix” back into a circular dependency.
Next up: actually cabling it — port profiles first, then the Kea cutover, one VLAN at a time.