== Table of Contents ==
* Executive Summary
* How Did We Get Here?
* Reliability Is Not Availability
* Enter Scaleway
* Hard Numbers
* Conclusion
* References
== Executive Summary ==
This is a formal proposal to retire the dedicated server we have with
Integricloud and replace it with a set of virtual servers from Scaleway.
We originally chose Integricloud's dedicated server offering primarily
for reliability and security. While it has proven secure, and the
hardware itself is reliable, its availability leaves something to be
desired.
Scaleway offers a similar level of reliability and, judging by our
current account with them, better availability. They additionally
offer servers that are not based on the x86 architecture, so we would
still be protected from the numerous issues that plague x86.
This will also reduce our hosting costs by almost 90%, and should reduce
downtime by nearly 100%.
== How Did We Get Here? ==
In early January 2019, we were notified that both of our dedicated
servers at Rack911 were being retired, with very little notice. For
additional information, see the adelie-devel@ post with message ID
<ba35ebd3-54b4-f18f-b65f-d327e9d0af80@adelielinux.org> (archived at [1]).
After our sponsorship was pulled in October 2018, we had done a bit of
investigation into replacement hosting providers in the event that this
would happen. Our requirements at the time were:
* non-x86 based (due to the plethora of x86 bugs being discovered)
* at least 8 GB RAM
* dedicated hardware preferred
* at least 3 IPv4 addresses
We evaluated Packet.net for ARM64-based systems[2] and Integricloud for
PPC64-based systems[3]. We found Integricloud to be approximately 60%
of the cost of Packet.net[4]. Additionally, we had a professional
working relationship with their parent company, Raptor Engineering, who
make the Talos and Blackbird families of computers. In fact, the
Integricloud system we were offered was to be a rack-mounted Talos II.
Since we already had a Talos II in use as a build server, we felt this
would be close to ideal, as any hardware oddities had already been
worked out.
We chose their 4-core (16-thread) PowerPC system with 8 GB RAM and 2 x 1
TB NVMe disk storage. One 1 TB NVMe disk is dedicated to
mirrormaster.adelielinux.org. The other 1 TB NVMe disk is an LVM group,
shared between the various KVM-based virtual servers run on it.
== Reliability Is Not Availability ==
The Integricloud dedicated server, chloe.adelielinux.org, has had no
hardware issues in over eight months of service. The hardware itself
has been fast, stable, and very reliable. However, there have been
multiple issues regarding availability.
Integricloud has a single-homed fibre infrastructure; per a public
looking glass, it is run via Mediacom[5]. This has caused an unforeseen
and recurring availability problem:
Down                Up                  Duration
2019-04-16 13:17    2019-04-16 22:24    9 hours, 7 minutes
2019-04-17 00:10    2019-04-17 12:29    12 hours, 19 minutes
2019-07-09 06:25    2019-07-09 20:01    13 hours, 37 minutes
2019-07-10 15:14    2019-07-10 15:39    25 minutes
2019-07-12 16:35    2019-07-12 16:43    8 minutes
This has resulted in 97% uptime for April and 98% uptime for July, and
we are only 13 days into July, so the July number could go down further.
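The uptime figures can be reproduced from the outage log above. The following is a minimal sketch, assuming each outage is counted in the month in which it began and that the percentage is taken over the whole calendar month unless told otherwise; the function names are illustrative only.

```python
import calendar
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# (down, up) timestamps copied from the outage log above.
OUTAGES = [
    ("2019-04-16 13:17", "2019-04-16 22:24"),
    ("2019-04-17 00:10", "2019-04-17 12:29"),
    ("2019-07-09 06:25", "2019-07-09 20:01"),
    ("2019-07-10 15:14", "2019-07-10 15:39"),
    ("2019-07-12 16:35", "2019-07-12 16:43"),
]


def downtime_minutes(outages, year, month):
    """Total whole minutes of downtime for outages that began in the given month."""
    total = 0
    for down, up in outages:
        start = datetime.strptime(down, FMT)
        end = datetime.strptime(up, FMT)
        if (start.year, start.month) == (year, month):
            total += int((end - start).total_seconds() // 60)
    return total


def uptime_percent(outages, year, month, days=None):
    """Uptime over `days` days of the month (default: the whole month)."""
    if days is None:
        days = calendar.monthrange(year, month)[1]
    minutes_in_period = days * 24 * 60
    return 100 * (1 - downtime_minutes(outages, year, month) / minutes_in_period)


if __name__ == "__main__":
    print(f"April 2019: {uptime_percent(OUTAGES, 2019, 4):.1f}%")
    print(f"July 2019, whole month: {uptime_percent(OUTAGES, 2019, 7):.1f}%")
    print(f"July 2019, first 13 days: {uptime_percent(OUTAGES, 2019, 7, days=13):.1f}%")
```

Under these assumptions the sketch gives about 97.0% for April and 98.1% for July taken as a whole month. Measured over only the 13 days elapsed so far, the July figure is closer to 95.5%. The timestamps are minute-granular, so a computed duration can differ by a minute from the one listed in the log.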
Additionally, many ISPs are not accepting Mediacom's IPv6 route
announcements. This has caused mirrormaster to be inaccessible to many
of our users, and even one of the members of our own Infra Team[6].
Finally, while yours truly was trying to show an Adélie Web page to
someone while on public Wi-Fi at a well-known place in Broken Arrow, OK,
I was greeted with an error page[7]:
Sonicwall Network Security Appliance
This site has been blocked by the network administrator.
Block reason: Gateway GEO-IP Filter Alert
IP address: 23.155.224.64
Connection initiated towards country: Unknown
If a car dealership's firewall is blocking us, who knows what other
firewalls are blocking us. How many people are unable to discover us, and
how many corporate sponsors are we missing out on, because they can't
even connect to our Web site? And why can they not connect to our Web
site? It could be the IPv6 peering issue, or a firewall blocking our
IPv4 space, or because Mediacom has suffered another "fibre cut".
== Enter Scaleway ==
We have had a working relationship with Scaleway for almost a year and a
half. We launched our 32-bit ARM builder on the Scaleway ARM cloud in
March 2018, and have had no downtime in that time:
awilcox on erin [pts/0 Sat 13 9:33] ~: uptime
09:33:02 up 489 days, 5:59, load average: 0.00, 0.00, 0.00
The network has never suffered any outages, either. Since the Scaleway
cloud features ARM servers, we would additionally still be able to avoid
the x86 architecture and all of its failings.
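As a quick sanity check, the uptime output above can be turned back into a boot date. This sketch assumes the prompt's "Sat 13 9:33" refers to Saturday, 13 July 2019, which is inferred from the outage dates elsewhere in this proposal rather than stated in the output itself.

```python
from datetime import datetime, timedelta


def boot_time(observed, days, hours, minutes):
    """Subtract the reported uptime from the moment `uptime` was run."""
    return observed - timedelta(days=days, hours=hours, minutes=minutes)


# Assumption: the year and month are inferred, not shown by the prompt.
observed = datetime(2019, 7, 13, 9, 33)

# "up 489 days, 5:59" from the uptime output.
booted = boot_time(observed, days=489, hours=5, minutes=59)

print(booted.strftime("%Y-%m-%d %H:%M"))
```

With that assumed date, the last boot falls on 2018-03-11, which is consistent with a launch in March 2018.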
We have continually been limited by our lack of IPv4 space at
Integricloud. Currently, we "proxy" every server via athdheise, a
virtual server on our Integricloud dedicated system that has both an
IPv4 and IPv6 address. All of our main systems are IPv6-only (wiki,
bts, next, etc.), and when an IPv4-only client attempts to connect to
any of these services, the connection has to be proxied via athdheise.
If we use Scaleway virtual servers, every system gets its own dedicated
IPv4 address, which drastically simplifies our administration.
Additionally, we would receive a lot more RAM per virtual server.
Currently, athdheise - the aforementioned Web server and proxy - has 256
MB RAM, of which only 34 MB is available. When documentation changes are
made and the Git hook runs to make athdheise rebuild the documentation
site (at help.adelielinux.org), the process sometimes runs out of
memory. This means one of us has to log in, stop the web server, run the
make process, and then restart the web server. The minimum RAM at
Scaleway is 2 GB per virtual server. This is an enormous amount of
headroom, and would even allow us to play with memcached (or other
caching solutions) to reduce latency across our infrastructure.
Finally, we would save a dramatic amount of money. We currently pay
225$/mo pre-tax for Integricloud.
== Hard Numbers ==
The current systems we run on Integricloud are:
System                           RAM        Disk
enfys (postgresql)               768 MB     30 GB
rarity (these mailing lists)     1536 MB    30 GB
mirrormaster                     256 MB     1 TB
bts (Bugzilla issue tracking)    512 MB     8 GB
athdheise (Web server/proxy)     256 MB     4 GB
wiki                             512 MB     8 GB
annwyn (Nextcloud)               512 MB     100 GB
chatterbox (Quassel IRC)         512 MB     40 GB
Since Scaleway tops out at 500 GB disk, we will need to consider
alternate hosting for mirrormaster. I believe we can run this on the
Hetzner dedicated server that is being sponsored by Alyx at Leuhta Labs.
And this is what we could pay per virtual system on Scaleway:
4 ARM CPUs, 2 GB RAM, 50 GB disk - 2.99€/mo
6 ARM CPUs, 4 GB RAM, 100 GB disk - 5.99€/mo
8 ARM CPUs, 8 GB RAM, 200 GB disk - 11.99€/mo
By my approximation, we would be able to put every single system except
annwyn on the smallest server, and annwyn on the second-smallest.
6× 2.99€ = 17.94€ per month
1× 4.99€ + 17.94€ = 22.93€ per month total cost, or approximately
25.81$. This is a savings of nearly 90% after tax.
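The arithmetic above can be sketched as follows. The EUR-to-USD rate is not stated in this proposal, so the sketch uses the rate implied by the 22.93€ and 25.81$ figures, and it compares against the 225$/mo pre-tax figure quoted earlier. Note that the price list gives 5.99€ for the second-smallest server while the total above uses 4.99€, so both are computed.

```python
SMALL_CENTS = 299          # smallest server, EUR cents per month
MEDIUM_CENTS_LISTED = 599  # second-smallest server as shown in the price list
MEDIUM_CENTS_USED = 499    # figure used in the total above
CURRENT_USD = 225.00       # current Integricloud cost per month, pre-tax

# Assumption: exchange rate implied by "22.93 EUR ~= 25.81 USD" above.
USD_PER_EUR = 25.81 / 22.93


def monthly_total_eur(medium_cents, n_small=6, n_medium=1):
    """Monthly cost in EUR; summed in integer cents to avoid rounding drift."""
    return (n_small * SMALL_CENTS + n_medium * medium_cents) / 100


def savings_percent(total_eur):
    """Savings relative to the current monthly cost, in percent."""
    return 100 * (1 - total_eur * USD_PER_EUR / CURRENT_USD)


for label, cents in (("4.99", MEDIUM_CENTS_USED), ("5.99", MEDIUM_CENTS_LISTED)):
    total = monthly_total_eur(cents)
    print(f"medium at {label} EUR: {total:.2f} EUR/mo, "
          f"savings {savings_percent(total):.1f}%")
```

With either price for the second-smallest server, the savings come out between 88% and 89% under the assumed exchange rate, so the "nearly 90%" conclusion does not depend on which figure is correct.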
== Conclusion ==
I believe that retiring our Integricloud dedicated server and replacing
it with Scaleway virtual ARM servers makes business sense. It will
allow us to spend less time down, dramatically improve the architecture
of our infrastructure, and reach more people. That, in turn, will allow
us to grow into a larger, healthier Linux distribution that can
genuinely improve the world.
I do not want to leave this proposal without a separate smaller proposal
for how this could be effected easily. I believe that we can simply
start by migrating the wiki server, since it is the least used service.
We can feel out Scaleway's ARM offering for a while, and make sure that
it will genuinely work for our needs. After we are satisfied, we can
change the DNS for the wiki and begin work on another server. Assuming
all goes well, we will eventually be able to quietly power off the
Integricloud dedicated system with zero further downtime.
Thank you so much for reading this proposal. I welcome any comments or
questions you may have. You may respond here or poke me on IRC. I'll
post a summary email in response with any important notes from IRC.
Best,
--arw
== References ==
[1]: https://lists.adelielinux.org/hyperkitty/list/adelie-devel@lists.adelieli...
[2]: https://www.packet.com/cloud/servers/c1-large-arm/
[3]: https://www.integricloud.com/
[4]: The Packet.net ARM box runs at 360$/mo. Integricloud is 220$/mo.
[5]: https://bgp.he.net/AS46246
[6]:
<aranea> awilfox: Looks like my routing issues are Mediacom's (that's
Raptor's only upstream) fault. I doubt I'll have any success contacting
them; this needs to come from a customer. I'll try contacting tpearson
again with more details; if he doesn't respond, I may have to ask you to
file an outage report or sth.
<aranea> Short version: Mediacom doesn't follow some standard industry
practices, and thus many of their peers aren't accepting the routes they
announce on behalf of their customers (and guess what, Raport is their
only IPv6 customer.)
[7]: https://i.imgur.com/khmebJ5.png
--
A. Wilcox (awilfox)
Project Lead, Adélie Linux
https://www.adelielinux.org