Ceph and NixOS

Recently, along the continuing chain of my entanglement, the media server died. We'd been ignoring the symptoms for a while: randomly crashing with hard drive failures, refusing to start up, and occasionally making a high-pitched noise. All signs that something terrible was going to happen, but we were in the middle of other problems and just pushed it off.

Then it died properly and wouldn't get back up.

The Plans

I've been looking at Ceph ever since DreamHost did a blog post about it. It seemed perfect given my previous experiences with crashing RAIDs and trying to find enough disk space to cram in “just one more database” needed to get an analysis done. These same problems leaked into my media collection and keeping track of Partner's photo shoots.

When the drives crashed, I figured I'd sit down and do a Ceph cluster instead and see if it would be easier to bring more drives online as I run out of space, without having to tear down servers and replace drives or find another spot to fit the server. Plus, the whole idea of being able to have duplicates appealed to me.

After a bit of hemming and hawing, I picked up five 6 TB drives and waited half a month for them to show up (because I'm avoiding Amazon as much as possible, and NewEgg is great, except that I cannot run their application on my phone).

Initial Set Up

After putting in the first drive, I promptly set up the other four for Ceph. Because of various bugs, frustrations, and learning curve difficulties, it took me almost a week. I was expecting NixOS to be able to set up drives, but it couldn't. It has options for things, but there are a lot of little fiddly bits and dials that have to be done manually and then turned into the Nixian way after the fact.

OSDs and NixOS

Setting up an OSD on Nix seems pretty simple:

services.ceph.osd = {
  enable = true;
  daemons = ["0"];
};

But that was not meant to be. NixOS doesn't really set things up for you, so you have to do it manually. Also, Ceph plays well with systemd but it does not play with Nix's version of systemd.

ceph-volume lvm create --data /dev/sdb --no-systemd

The above command will create the proper entries in /var/lib/ceph/osd/osd-0 and things will appear to be working fine. But I found this is a lie. When the system restarts, it will helpfully wipe out the entire contents of /var/lib/ceph/osd/osd-0.

To fix that, after I call ceph-volume above, I had to do this to get the state to survive a restart.

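cd /var/lib/ceph/osd
# Back up the directory that ceph-volume populated.
tar -cjf ~/osd-0.tar.bz2 osd-0
# Restart once, then restore the saved contents and fix ownership.
systemctl restart ceph-osd-0.service
tar -xjf ~/osd-0.tar.bz2
chown -R ceph:disk osd-0
# Restart again so the OSD comes up with the restored state.
systemctl restart ceph-osd-0.service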

After that, systemctl restart ceph-osd.target didn't blow away all my files. The wipe seems to come from some tmpfiles rules in the systemd configuration for OSDs that don't seem to be in the mon, mds, or mgr entries.
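
One way to look for such rules is to dump the merged tmpfiles configuration and keep only the lines that touch the Ceph state directory. This is a minimal sketch, not something taken from the NixOS module: tmpfiles_rules_for is a hypothetical helper, and it assumes rule lines follow the usual tmpfiles.d layout where the path is the second field.

```shell
# Keep only tmpfiles.d rule lines whose path starts with the given prefix.
# Rule lines look like: TYPE PATH MODE USER GROUP AGE ARGUMENT
tmpfiles_rules_for() {
  awk -v prefix="$1" '
    /^[ \t]*(#|$)/ { next }           # skip comments and blank lines
    index($2, prefix) == 1 { print }  # the path is the second field
  '
}

# On a live host (assumes systemd-tmpfiles is on PATH):
#   systemd-tmpfiles --cat-config | tmpfiles_rules_for /var/lib/ceph
```

Comparing the lines for osd against those for mon, mds, and mgr should show whether the OSD directory really is treated differently.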

Corrupted Drives

It took another week to copy as much as I could off the existing corrupted drives. Thankfully, I could mount them with ntfs-3g, which meant rsync could grind through them, spending about an hour on every unrecoverable file before moving on.

Over the years, I've had to recover bad drives thrice now. One time it took me almost two weeks to recover what I could off of Partner's laptop drives. This was one of the big reasons why I decided to go with Ceph, to handle the cases when we have photoshoots and large files and then lose the hard drive they are stored on.

Mistake One

I managed to get it up, only to find out my first mistake: Ceph needs roughly 1 GB of RAM for every 1 TB of disk, and I had 24 TB on a 16 GB machine. I also only had an 8 GB swap partition for the machine, which meant everything was just grinding away.
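
The arithmetic behind that mistake can be sketched in a few lines. The function name ceph_ram_shortfall_gb is made up for illustration, and the 1 GB per 1 TB ratio is only the rule of thumb quoted above, not a hard limit.

```shell
# Rule of thumb from the text: about 1 GB of RAM per 1 TB of OSD disk.
# Prints the shortfall in GB (0 if there is enough RAM).
# Usage: ceph_ram_shortfall_gb DISK_TB RAM_GB
ceph_ram_shortfall_gb() {
  disk_tb=$1
  ram_gb=$2
  needed_gb=$disk_tb   # 1 GB per TB
  if [ "$needed_gb" -gt "$ram_gb" ]; then
    echo $((needed_gb - ram_gb))
  else
    echo 0
  fi
}
```

Calling ceph_ram_shortfall_gb 24 16 prints 8, so by this estimate the gap was as large as the entire swap partition, before counting anything else running on the machine.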

To compensate, I could have removed the drives, but I had just spent a week copying files over and basically the old drives were toast. So I spent a few hundred dollars and picked up two cheap Dell business towers from NewEgg instead. When they showed up, I started the process of moving one of the drives, which involves marking the OSD (the disk) “out” so that Ceph gracefully moves the files off that drive and it can be safely moved.
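
The drain step can be sketched as a small helper. This is an illustration rather than the exact commands used here: drain_osd and its polling interval are hypothetical, while ceph osd out and ceph osd safe-to-destroy are standard Ceph commands.

```shell
# Mark an OSD "out" and wait until Ceph reports that it is safe to remove.
# Usage: drain_osd OSD_ID [POLL_SECONDS]
drain_osd() {
  osd_id=$1
  poll=${2:-60}
  ceph osd out "$osd_id" || return 1
  # safe-to-destroy exits non-zero while data still depends on this OSD.
  until ceph osd safe-to-destroy "osd.$osd_id"; do
    sleep "$poll"
  done
  echo "osd.$osd_id is drained"
}
```

For example, drain_osd 0 marks OSD 0 out and returns once the cluster no longer needs it. With drives this size, expect the wait to be long.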

Mistake Two

The second mistake was a minor one: the new computers didn't have drive rails, so I needed to order a few more parts while I waited for the first disk to be cleared off so I could move it to the second machine. I decided to put the 3.5" drives into the 5.25" bay because I didn't need the DVD drive and it was easier to run the wires.

Mistake Three

The third mistake was probably a big one. I picked the wrong drive to “out” and moved a live drive into the second machine. I also learned that Ceph is very tolerant of moving said drives, but it gets very cranky about it. Since the old PC (my first Ceph server) was struggling and the rubber for the mounting screws had cracked, I decided to go the slow approach: “in” the drive I thought I was going to remove, “out” the one I was actually moving, and use a Sharpie to identify said drives so I don't make that mistake again.
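
Before pulling a drive, it may help to ask Ceph which host and device an OSD lives on and compare that with the label on the disk. The sketch below uses ceph osd metadata, which is a real command. The helper name osd_location is made up, and the JSON keys it filters for (hostname, devices, device_ids) are an assumption about what recent releases report.

```shell
# Print the host and device details Ceph has recorded for one OSD.
# Usage: osd_location OSD_ID
osd_location() {
  ceph osd metadata "$1" --format json-pretty |
    grep -E '"(hostname|devices|device_ids)"'
}
```

Running osd_location 2 on any node with admin access should show where OSD 2 is, which is a useful cross-check before marking anything “out”.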

Adding Monitors

When I had three systems, I decided I needed to bring up a monitor on all three to get some balancing. Not to mention that, with a single monitor, having it go down means the entire system crashes. However, this took me a few tries because of how NixOS handles systems.

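export MID=$(hostname)
# "host" prints "NAME has address IP"; the fourth field is the IP.
export MIP=$(host "$MID.local" | cut -f 4 -d ' ')

cd /var/lib/ceph/mon
mkdir "ceph-$MID"
cd "ceph-$MID"
mkdir /tmp/add-ceph-mon
# Fetch the monitor keyring and the current monitor map from the cluster.
ceph auth get mon. -o /tmp/add-ceph-mon/keyring
ceph mon getmap -o /tmp/add-ceph-mon/map
# Initialize the new monitor's data directory, then start it by hand.
ceph-mon -i "$MID" --mkfs --monmap /tmp/add-ceph-mon/map --keyring /tmp/add-ceph-mon/keyring
ceph-mon -i "$MID" --public-addr "$MIP"
rm -rf /tmp/add-ceph-mon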

Starting up the new monitor is fine as long as you do not touch the NixOS configuration files. This first part uses hostname and host to make the monitor ID equal to the server's hostname and to look up its address. When it comes up, it will be running in daemon mode, which means NixOS will forget it on restart:

root        1792  1.0  0.3 575324 59284 ?        Ssl  17:00   0:00 ceph-mon -i notil --public-addr

In this case, I kill that process, either with kill 1792 or killall ceph-mon.

Then, I configure NixOS to start up the monitor and use colmena to push out the changes.

services.ceph = {
  mon.enable = true;
  mon.daemons = ["paruk"];
};

Once I make sure everything is finally up, I add the new monitors to the initial hosts.

services.ceph = {
  global = {
    monInitialMembers = "muliq,notil,paruk";
    monHost = "muliq.local,notil.local,paruk.local";
  };
};
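
To confirm that every monitor actually joined, the quorum can be checked from any node. This is a rough sketch: mon_in_quorum is a hypothetical helper, and it assumes the JSON from ceph quorum_status contains a quorum_names array, as it does in the releases I am aware of.

```shell
# Succeeds if the named monitor is listed in the current quorum.
# Usage: mon_in_quorum MON_ID
mon_in_quorum() {
  quorum_names=$(ceph quorum_status --format json |
    tr -d ' \n' |
    sed -n 's/.*"quorum_names":\[\([^]]*\)\].*/\1/p')
  # quorum_names now looks like: "muliq","notil","paruk"
  case ",$quorum_names," in
    *",\"$1\","*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A loop such as for m in muliq notil paruk; do mon_in_quorum "$m" || echo "$m is not in quorum"; done gives a quick check after each colmena push.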

Next Steps

Overall, I'm still happy with this. There is a huge learning curve when it comes to Ceph and NixOS, mainly in how they interact with each other. I assume there will be more difficulties but I seem to be heading in the right direction.