﻿<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text" xml:lang="en">SeaweedFS</title>
  <link type="application/atom+xml" href="https://d.moonfire.us/tags/seaweedfs/atom.xml" rel="self" />
  <link type="text/html" href="https://d.moonfire.us/tags/seaweedfs/" rel="alternate" />
  <updated>2026-03-09T17:42:47Z</updated>
  <id>https://d.moonfire.us/tags/seaweedfs/</id>
  <author>
    <name>D. Moonfire</name>
  </author>
  <rights>Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International</rights>
  <entry>
    <title>Why SeaweedFS?</title>
    <link rel="alternate" href="https://d.moonfire.us/blog/2024/11/08/why-seaweedfs/" />
    <updated>2024-11-08T06:00:00Z</updated>
    <id>https://d.moonfire.us/blog/2024/11/08/why-seaweedfs/</id>
    <category term="development" scheme="https://d.moonfire.us/categories/" label="Development" />
    <category term="ceph" scheme="https://d.moonfire.us/tags/" label="Ceph" />
    <category term="seaweedfs" scheme="https://d.moonfire.us/tags/" label="SeaweedFS" />
    <category term="nixos" scheme="https://d.moonfire.us/tags/" label="NixOS" />
    <category term="sand-and-bone" scheme="https://d.moonfire.us/tags/" label="Sand and Bone" />
    <summary type="html">Thoughts on why I take the effort to use a distributed network drive for my home.
</summary>
    <content type="html">&lt;p&gt;I got an email last week asking if I could explain why I set up a Seaweed server. It's been a while since I talked about it, mostly in hints in &lt;a href="/blog/2022/12/10/ceph-and-nixos/"&gt;this post from 2022&lt;/a&gt; and &lt;a href="/blog/2024/03/21/switching-ceph-to-seaweedfs/"&gt;this post from 2024&lt;/a&gt;. The request gives me an opportunity to expand on it, and to remind myself why I do it.&lt;/p&gt;
&lt;h2&gt;Stories&lt;/h2&gt;
&lt;p&gt;In some regards, there is a bit of trauma in me when it comes to losing out on things. This includes hearing stories that my father told me and realizing that when he died, they were gone forever. These same thoughts came up in my novel &lt;a href="/tags/sand-and-bone/"&gt;Sand and Bone&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The little girl's confessions echoed in Rutejìmo's head in a quiet symphony as he wrote in his book. The hand-bound collections of pages creaked under his hand, the leather thong strained to hold the almost fifty pages of tightly-spaced writing. Over the years, he had added a dozen pages to the collection. It wouldn't be too long before the binding couldn't handle the additional pages, but he thought he had a few more years left before that happened.&lt;/p&gt;
&lt;p&gt;Even with his additional pages, he didn't have room to write down all of the stories he had heard over the years. He wanted to detail the joys of the little girl's death, such as the choked story about how she had stolen her brother's toy when he wasn't looking. He had also wanted to write the horrors, like one man's confession for killing his sister. Each one was precious and important. Time would erase their stories and a part of Rutejìmo died every time he forgot one.&lt;/p&gt;
&lt;p&gt;&amp;mdash; &lt;a href="//fedran.com/sand-and-bone/chapter-008/"&gt;Sand and Bone 8: Alone&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This fear goes beyond one person losing the stories of things that happened decades ago, as told by an old man. It includes my own stories, which will also be lost when I die. There is little chance I will be &amp;ldquo;done&amp;rdquo; with &lt;a href="/tags/fedran/"&gt;Fedran&lt;/a&gt;. There is little chance I will get all of the stories out of my head and onto paper. There isn't enough time, I don't have the self-esteem to do it, and I don't have the drive. When I'm gone, so are those stories. But even beyond that, there are so many stories of horror and joy that we are losing with the passage of time. Happy stories of people falling in love, horror stories of &lt;a href="https://www.auschwitz.org/en/"&gt;Auschwitz&lt;/a&gt;, and everything in between. It is that breadth of stories, lessons, and happiness that we lose.&lt;/p&gt;
&lt;p&gt;But the big question is, will anyone care? Does anyone want to know about the first time I got caught shoplifting? Or the first time I finally said &amp;ldquo;I love you&amp;rdquo; to Partner?&lt;/p&gt;
&lt;p&gt;Probably not.&lt;/p&gt;
&lt;p&gt;But, you never know.&lt;/p&gt;
&lt;p&gt;I'm never going to record them all, but when I was given my father's artwork and decades of effort, I didn't want to lose them yet. I didn't want to just chuck them aside. So, I need to keep them.&lt;/p&gt;
&lt;h2&gt;Hoarding&lt;/h2&gt;
&lt;p&gt;I come from a long line of hoarders. I realize that when I look at my DVD collection or the boxes of books I have. My LEGO collection isn't huge, but it does fill five or six boxes. I once got rid of it and felt guilty for years, which is why I built it up again. Sometimes, I just play.&lt;/p&gt;
&lt;p&gt;My mother did it. She had a massive display case of dragons and cats, dozens of sets of china in the basement, an entire room with the bones of a bankrupt lumber mill, and a massive library (that I inherited).&lt;/p&gt;
&lt;p&gt;My father did it. I spent days going through pieces of paper where he kept every calendar, shred of a note, and personal letter he sent to his children. I saw the research he did when I asked for help getting through college, mainly to prove that I was going down the wrong path and really should just drive two hours one way to go to the college he thought was a better choice for me. I ended up saying &amp;ldquo;screw you&amp;rdquo; and went into debt for a couple of decades instead.&lt;/p&gt;
&lt;p&gt;My father has terabytes of images that he drew. Decades of him struggling to be a &amp;ldquo;good&amp;rdquo; artist, full of self-doubt and pain. But I'm so happy to have watched him do it. He also did the artwork for the nuclear reactor project he was on, and for the particle accelerator. And birthday cards for all of his children and grandchildren.&lt;/p&gt;
&lt;p&gt;For me, there is a good amount of digital hoarding going on. I mean, a couple thousand ebooks is one thing, but I have PDFs for game systems that go back years. The original PDFs of HERO System 5 and 6. Every scan of the &lt;em&gt;Dragon&lt;/em&gt; magazine. GURPS. Legal documents, funny stories, and the like. And then there are my DVDs. Back when I didn't have children, I was buying 3-5 a week. When Suncoast Video went out of business, I had a huge refund from the IRS. Walked into the store and said &amp;ldquo;I'm gunna buy everything I can.&amp;rdquo; Now there are DVDs I can't buy anymore and can't find, and they are slowly rotting in my file cabinet (did you know they only have a 20-30 year shelf life?). I don't want to lose them, even if I don't use them every day.&lt;/p&gt;
&lt;p&gt;And then there is Partner's photography business. They do large photo shoots, but they also need to keep them around for years &amp;ldquo;just in case&amp;rdquo; someone loses their wedding photos. They don't &amp;ldquo;have to&amp;rdquo;, but I don't want to ever have to say &amp;ldquo;sorry, I don't have them&amp;rdquo; for senior photos or someone's puppies, or when family members are lost.&lt;/p&gt;
&lt;p&gt;So I want to keep them.&lt;/p&gt;
&lt;h2&gt;Storage&lt;/h2&gt;
&lt;p&gt;All this takes space. The nice thing about digital space is that I can make backups. I can take them with me with a few kilograms of hard drives or upload them into the cloud. I can make copies to make it harder to lose everything (like when I was hit with ransomware that took out my media server). The goal is the &lt;a href="https://www.backblaze.com/blog/the-3-2-1-backup-strategy/"&gt;3-2-1 Backup Strategy&lt;/a&gt;: three copies, on two different media, with at least one off-site.&lt;/p&gt;
&lt;p&gt;Now, dealing with storage is something I've done for quite a long time. I remember being pulled out of sixth grade for a day to help my mother recover her RAID 5 box. At the time, drives were in the low hundreds of megabytes, but I had to learn a lot about how RAID worked to figure it out. Then watching the slow recovery that took an entire day&amp;hellip; then having another drive fail about a day after we recovered from the first and doing it all again. Over the years, I used hardware and software RAID controllers and watched them fail.&lt;/p&gt;
&lt;p&gt;At the time, the data wasn't static. We were processing millions of records to do analysis. The drives would get corrupted by poor power, heavy usage, and the eventual failure of mechanical drives. We had customers lose data and watched our data centers blow disks. Like Partner's photos, we had to keep our analysis for years afterward because of re-evaluations or the occasional lawsuit.&lt;/p&gt;
&lt;p&gt;Over time, it became apparent that there was a maximum size where RAID was useful. After a while, it wasn't a matter of &amp;ldquo;if&amp;rdquo; a drive goes bad but &amp;ldquo;when&amp;rdquo;. Backblaze (where I keep my backups, I recommend them) has a &lt;a href="https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2024/"&gt;quarterly blog post&lt;/a&gt; about drive stats, and their numbers track with my own experiences. Yeah, a 1.7% annualized failure rate doesn't seem like much, but when I'm dealing with stuff that needs to be kept for years, it hits a lot faster than you'd think. And offline backups fail too.&lt;/p&gt;
&lt;p&gt;When I left my mother's company, I scaled back a little and just threw some drives into a big machine. I didn't bother with RAID because I knew it would fail; I was looking at a decade of hard drive use, so I started with manual mirroring between a couple of Windows partitions. But as the DVDs got ripped, music got purchased from Amazon, and photos kept being taken, I started to run out. Soon, it wasn't just &lt;code&gt;Q:\&lt;/code&gt;, &lt;code&gt;R:\&lt;/code&gt;, and &lt;code&gt;S:\&lt;/code&gt; being copies but each one holding a portion of everything.&lt;/p&gt;
&lt;p&gt;When I got hit with the ransomware attack, I lost my DVD collection (years of ripping), though most of Partner's photos and the other things survived. I started to recover the rips, but one of the drives didn't make it.&lt;/p&gt;
&lt;p&gt;Then I lost another drive a few months later.&lt;/p&gt;
&lt;p&gt;Then we didn't have money to handle the failure indicators, so I was just watching the SMART notices popping up knowing that there was nothing to do but watch the slow-moving train accident.&lt;/p&gt;
&lt;p&gt;Then the media server decided to go down for a week.&lt;/p&gt;
&lt;p&gt;Not having access to anything was a stark reminder that I had to do &amp;ldquo;something&amp;rdquo; if I didn't want to lose everything. And it wasn't just tossing in another drive every once in a while; I had hit the threshold where I needed to spread the data across more than just drives, I needed to spread it across servers.&lt;/p&gt;
&lt;h3&gt;Ceph&lt;/h3&gt;
&lt;p&gt;Enter &lt;a href="/tags/ceph/"&gt;Ceph&lt;/a&gt;. I've read about Ceph for many years before that point and it always appealed to me. It seemed to solve a lot of the problems I've experienced over the years. And it didn't have to require same-sized disks to pull of a RAID. And I could add machines if I needed to add more storage.&lt;/p&gt;
&lt;p&gt;The biggest thing with Ceph is that it balanced data across multiple servers to reduce the impact of failure, but it also let you mark files as needing one, two, or more copies. Partner's photo library? Two copies. Videos? One is probably good enough. The servers didn't have to be that powerful either, so I could keep my older machines holding disks when I had to upgrade to a newer one. In theory, I could even use a Raspberry Pi as a cluster node.&lt;/p&gt;
&lt;p&gt;The tipping point was when the media server died again. I had saved some money from my commissions (earmarked to get a book edited) and used it to buy fresh hard drives. If anything, replacing the 10+ year old drives with something new and bigger would give me some breathing room. I also took the opportunity to switch to &lt;a href="/tags/nixos/"&gt;NixOS&lt;/a&gt; instead of Windows, which I dislike anyway. I mean, NixOS had options for Ceph, how hard could it be?&lt;/p&gt;
&lt;p&gt;Apparently, it really wasn't ready for prime time. Eventually, the wiki even popped up a note saying as much.&lt;/p&gt;
&lt;p&gt;It took me a week to figure it out. There was a lot missing (and still is) from the core, but I managed to write up some notes for myself on how to make it work. After days of trial and error, of &amp;ldquo;almost got it&amp;rdquo; joys only to watch it crash, I finally got something working. And it was glorious. At least until I realized the machine was swapping like mad because Ceph really wants about 1 GB of RAM per TB of storage. A few frantic purchases later, I had a couple more cheap Dell machines (one of which is the one that died this week) and ended up with a fairly balanced Ceph cluster.&lt;/p&gt;
&lt;h3&gt;SeaweedFS&lt;/h3&gt;
&lt;p&gt;I liked Ceph, but I'm not on it anymore. Partially it was for selfish reasons: I wanted to fix the struggles I had bringing a new drive into the cluster, but the community decided to go a different way, and no one was able to do it that different way while I was down. To test a fix myself, I had to build for twelve hours just to see if something worked (I also work with 10+ year old computers a lot). I was on unstable, so I went weeks being unable to build because I couldn't downgrade to stable. I also learned that &lt;code&gt;nixpkgs&lt;/code&gt; had some patterns that were really hard to get into, like redefining &lt;code&gt;lua&lt;/code&gt; at the package level to mean a specific version of &lt;code&gt;lua&lt;/code&gt;, and no one on Matrix or the Discourse forum was able to tell me that until I happened to find one person who explained it and said I should have just &amp;ldquo;known&amp;rdquo; about that mapping.&lt;/p&gt;
&lt;p&gt;And then, the same day I realized my efforts to get Ceph working were linked on the &lt;a href="https://nixos.wiki/wiki/Ceph"&gt;NixOS wiki for Ceph&lt;/a&gt;, I also saw a little line:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Another distributed filesystem alternative you may evaluate is SeaweedFS.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I vaguely remembered looking at SeaweedFS before, but I was focused on getting Ceph working. And in that moment, when Ceph was not building and I didn't have the resources to fix the problem myself, I decided to try it out.&lt;/p&gt;
&lt;p&gt;SeaweedFS does 80% of Ceph. It didn't have all the fancy features, but it had the features I wanted:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Distributed across multiple machines&lt;/li&gt;
&lt;li&gt;Ability to add or remove storage on the fly&lt;/li&gt;
&lt;li&gt;Variable replication copies&lt;/li&gt;
&lt;li&gt;Currently builds and can be run&lt;/li&gt;
&lt;li&gt;Can be mounted on Linux&lt;/li&gt;
&lt;li&gt;It uses 30 GB volumes on standard &lt;code&gt;ext4&lt;/code&gt; partitions instead of a custom filesystem I cannot debug&lt;/li&gt;
&lt;li&gt;Single Go executable (I don't code Go, but it only takes twenty minutes to build, not twelve hours)&lt;/li&gt;
&lt;li&gt;Could create S3/cloud tier backup (Ceph couldn't do that)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Yeah, it didn't have NixOS options, but Google and GitHub gave me a starting point to puzzle it out on my own. I learned a new library and started with a little partition to see if it worked. Like Ceph, I had a lot of fits and starts, trials and puzzling through it, but eventually got it working.&lt;/p&gt;
&lt;p&gt;The information was more scattered, but it worked.&lt;/p&gt;
&lt;p&gt;It didn't do Ceph's &amp;ldquo;deep cleaning&amp;rdquo; to detect bit rot on failing drives.&lt;/p&gt;
&lt;p&gt;It blew up when you tried to create too many erasure coding shards without the space for them.&lt;/p&gt;
&lt;p&gt;It blew up when you tried to replicate without having enough nodes.&lt;/p&gt;
&lt;p&gt;Around that time, Ceph started building on NixOS again, but I was already enamored with SeaweedFS. It was &amp;ldquo;good enough&amp;rdquo; for me. It took me about a week to migrate hunks of the Ceph data over to SeaweedFS, decommission a Ceph drive, and make it a SeaweedFS drive. Then repeat until everything was moved over.&lt;/p&gt;
&lt;p&gt;A few weeks ago, I tried to consolidate my dad's hard drives into the cluster and ran out of space. I ended up buying a new minicomputer and throwing 11 TB worth of drives into it to give me room (two copies of everything, even the media files). This week, I lost one of those old Dells, so I had to shuffle the volumes around with a few commands and it &amp;ldquo;just worked&amp;rdquo;. I'm going to replace the dead computer and I'm confident that it will &amp;ldquo;just work&amp;rdquo; then too. And, more importantly, I don't have to spend a week figuring out the commands to make it happen since I can just copy/paste a bunch of Nix code and redeploy my servers.&lt;/p&gt;
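&lt;p&gt;For the curious, the shuffling is just the handful of maintenance commands from my recovery post, run from the weed shell. A minimal sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;$ weed-shell
&amp;gt; lock
&amp;gt; volume.fix.replication   # re-create the copies that lived on the dead node
&amp;gt; volume.balance -force    # spread the volumes across the remaining servers
&amp;gt; unlock
&lt;/code&gt;&lt;/pre&gt;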
&lt;h2&gt;Thoughts&lt;/h2&gt;
&lt;p&gt;I could hoard less. It takes time and energy to keep the computers running but it makes sure Partner has their &lt;em&gt;Golden Girls&lt;/em&gt;, the kids have their videos, and I have my dad's artwork. Also, there is something peaceful about looking at a 29.9 TB partition and having it working smoothly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+---------------------------------------------------------------------------------------------------------------+
| 1 fuse device                                                                                                 |
+-------------+-------+-------+-------+-------------------------------+----------------+------------------------+
| MOUNTED ON  |  SIZE |  USED | AVAIL |              USE%             | TYPE           | FILESYSTEM             |
+-------------+-------+-------+-------+-------------------------------+----------------+------------------------+
| /mnt/home   | 29.9T | 20.9T |  9.0T | [#############.......]  69.8% | fuse.seaweedfs | fs.home:8888:/         |
+-------------+-------+-------+-------+-------------------------------+----------------+------------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Eventually this is all going to go away. When I die, my family isn't going to be able to keep it going. Like Rutejìmo's stories and my dad's artwork, all this will fade. But I'm going to keep it going as long as I can. And try to find more pages for my book.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Recovering SeaweedFS</title>
    <link rel="alternate" href="https://d.moonfire.us/blog/2024/10/25/recovering-seaweedfs/" />
    <updated>2024-10-25T05:00:00Z</updated>
    <id>https://d.moonfire.us/blog/2024/10/25/recovering-seaweedfs/</id>
    <category term="development" scheme="https://d.moonfire.us/categories/" label="Development" />
    <category term="ceph" scheme="https://d.moonfire.us/tags/" label="Ceph" />
    <category term="seaweedfs" scheme="https://d.moonfire.us/tags/" label="SeaweedFS" />
    <category term="plex" scheme="https://d.moonfire.us/tags/" label="Plex" />
    <category term="tailscale" scheme="https://d.moonfire.us/tags/" label="Tailscale" />
    <category term="nixos" scheme="https://d.moonfire.us/tags/" label="NixOS" />
    <category term="backblaze" scheme="https://d.moonfire.us/tags/" label="Backblaze" />
    <summary type="html">I accidentally overfilled my SeaweedFS, here is how I recovered my cluster.
</summary>
    <content type="html">&lt;p&gt;Lately, I've been quite fond of &lt;a href="/tags/seaweedfs/"&gt;SeaweedFS&lt;/a&gt;. It isn't as powerful as &lt;a href="/tags/ceph/"&gt;Ceph&lt;/a&gt; but it considerably easier to maintain and manage. There are some tradeoffs, such as finding bit rotting (when the disks start to fail), but I find it not quite as &amp;ldquo;fragile&amp;rdquo; when it comes to using a random collection of Linux machines.&lt;/p&gt;
&lt;p&gt;One of the features I want to play with in SeaweedFS is the ability to upload a directory transparently to an S3 bucket (not AWS though, they are too big). I'm thinking about that for later, when I want to make an extra, off-site backup of critical files, including Partner's photo shoots.&lt;/p&gt;
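&lt;p&gt;From my skim of the docs, that syncing is driven from the weed shell. A sketch of what I believe it looks like, with a made-up remote name, bucket, and keys (double-check the flags before trusting me):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;$ weed-shell
&amp;gt; remote.configure -name=cloud1 -type=s3 -s3.access_key=KEY -s3.secret_key=SECRET
&amp;gt; remote.mount -dir=/buckets/photo-shoots -remote=cloud1/my-backup-bucket
&lt;/code&gt;&lt;/pre&gt;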
&lt;h2&gt;Overfilling&lt;/h2&gt;
&lt;p&gt;Last week, I worked on one of the tasks I've been stalling on: archiving my dad's artwork. He had a lot of copies of nearly identical files and I didn't have the working storage on my laptop. I figured that since I had this huge (22 TB, though mostly full) cluster, I could use that.&lt;/p&gt;
&lt;p&gt;Yeah&amp;hellip; not the best of ideas.&lt;/p&gt;
&lt;p&gt;I didn't realize I had made a mistake until everything started to fail because all of the nodes were 98% or more full and the system couldn't replicate even the replication logs. And I didn't notice that until Partner said &lt;a href="/tags/plex/"&gt;Plex&lt;/a&gt; was down.&lt;/p&gt;
&lt;p&gt;Well, with replication down, I couldn't even use the weed shell to remove a file. When I tried, it just hung for hours.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;$ weed-shell
&amp;gt; rm -rf in/dad-pictures
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Nix Shell Scripts&lt;/h2&gt;
&lt;p&gt;Above, I use &lt;code&gt;weed-shell&lt;/code&gt;. This is a custom script I generate with &lt;a href="/tags/nixos/"&gt;NixOS&lt;/a&gt; that is installed on any server that can talk to my SeaweedFS.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-nix"&gt;inputs:
let
  shellScript = (
    pkgs.writeShellScriptBin &amp;quot;weed-shell&amp;quot; ''
      weed shell -filer fs.local:8888 -master fs.local:9333 &amp;quot;$@&amp;quot;
    ''
  );
in
{
  environment.systemPackages = [ shellScript ];
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This lets me handle common functions I use when maintaining things. In this case, I don't have to enter the common parameters needed to talk to my SeaweedFS cluster.&lt;/p&gt;
&lt;h2&gt;Cleaning Up&lt;/h2&gt;
&lt;p&gt;I tried a bunch of things, such as forcing a more aggressive vacuum (cleaning up deleted files):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;&amp;gt; volume.vacuum --help
Usage of volume.vacuum:
  -collection string
    	vacuum this collection
  -garbageThreshold float
    	vacuum when garbage is more than this limit (default 0.3)
  -volumeId uint
    	the volume id
&amp;gt; volume.vacuum -garbageThreshold 0.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This didn't help as much as I hoped, but it did allow some replication and some commands to go through. I needed to clear up a lot more space so I could remove files properly, do a wholesale &lt;code&gt;rm -rf&lt;/code&gt; to blow away my father's files, and try again later once I had some more space.&lt;/p&gt;
&lt;h2&gt;Replication&lt;/h2&gt;
&lt;p&gt;I have my volumes set to &lt;code&gt;010&lt;/code&gt; replication. The three digits are the number of extra copies at the data center, rack, and host levels.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data Center: Make X copies to replicate across multiple data centers. I don't have multiple data centers, so this is always &lt;code&gt;0&lt;/code&gt; for me.&lt;/li&gt;
&lt;li&gt;Rack: This is to replicate across multiple racks. My setup treats each computer as a &amp;ldquo;rack&amp;rdquo;, so my &lt;code&gt;1&lt;/code&gt; means make an extra copy on a different machine. I use the rack level because I also have a DeskPi Super6C, which is six Raspberry Pi CM4s (compute modules) in a single case; I treat all six as one &amp;ldquo;rack&amp;rdquo; with separate hosts.&lt;/li&gt;
&lt;li&gt;Host: The same machine. I don't have much use for two copies on the same machine, so I always set this to &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If I ever had a friend willing to host a local server for me, I would consider setting up a second &amp;ldquo;data center&amp;rdquo; to have an off-site backup. That would probably require &lt;a href="/tags/tailscale/"&gt;Tailscale&lt;/a&gt;, but that's beyond my current scope.&lt;/p&gt;
&lt;h2&gt;Volumes&lt;/h2&gt;
&lt;p&gt;SeaweedFS basically creates multiple 30 GB blobs, each acting as a container with many files inside it. That way, the problems that come with thousands of small files aren't an issue, since everything is done on these 30 GB files called &amp;ldquo;volumes&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Replication is done at the volume level, which means I was able to turn off replication for a series of volumes.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;&amp;gt; lock
&amp;gt; volume.configure.replication -replication 000 -volumeId 1
&amp;gt; volume.configure.replication -replication 000 -volumeId 2
&amp;gt; volume.fix.replication
&amp;gt; volume.balance -force
&amp;gt; unlock
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;lock&lt;/code&gt; and &lt;code&gt;unlock&lt;/code&gt; are important when making changes like this; they prevent some critical operations from corrupting the cluster. The commands will tell you when a lock is needed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;volume.configure.replication&lt;/code&gt; basically changed those volumes to no replication (risky). Once that is done, &lt;code&gt;volume.fix.replication&lt;/code&gt; and &lt;code&gt;volume.balance -force&lt;/code&gt; delete the excess copies and shuffle things around, giving me some breathing room to get replication running again so I can mass delete files.&lt;/p&gt;
&lt;p&gt;When I'm done, I just go and change all the volumes back to &lt;code&gt;-replication 010&lt;/code&gt; to get my second copy back.&lt;/p&gt;
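&lt;p&gt;That restore is just the mirror of the commands above; a sketch using the same hypothetical volume ids:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;&amp;gt; lock
&amp;gt; volume.configure.replication -replication 010 -volumeId 1
&amp;gt; volume.configure.replication -replication 010 -volumeId 2
&amp;gt; volume.fix.replication   # creates the missing second copies
&amp;gt; volume.balance -force
&amp;gt; unlock
&lt;/code&gt;&lt;/pre&gt;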
&lt;h2&gt;Data Hoarding&lt;/h2&gt;
&lt;p&gt;The ultimate problem is data hoarding. My father and I both have multiple copies of files running around. It isn't great, but when you don't have time to clean out a dying laptop, it is sometimes easier to &lt;code&gt;rsync&lt;/code&gt; the entire thing into a directory on the new machine and then move on.&lt;/p&gt;
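&lt;p&gt;That kind of wholesale dump is nothing fancy; a sketch with made-up paths:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# -a preserves permissions and timestamps; -P shows progress and lets a
# flaky machine resume partial transfers.
rsync -aP old-laptop:/home/user/ /mnt/seaweed/archives/old-laptop/
&lt;/code&gt;&lt;/pre&gt;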
&lt;p&gt;In this case, I needed to do some trimming of the duplicates from his files. The script is based on the one from a &lt;a href="https://stackoverflow.com/a/19552048"&gt;StackOverflow answer&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;find . -not -empty -type f -printf &amp;quot;%s\n&amp;quot; \
    | sort -rn \
    | uniq -d \
    | xargs -I{} -n1 find . -type f -size {}c -print0 \
    | xargs -0 sha256sum \
    | sort \
    | uniq -w32 --all-repeated=separate
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output is pretty simple because it only lists duplicates and the paths to find them.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;$ echo one &amp;gt; a.txt
$ echo one &amp;gt; b.txt
$ echo two &amp;gt; c.txt
$ echo two &amp;gt; d.txt
$ echo three &amp;gt; e.txt
$ find . -not -empty -type f -printf &amp;quot;%s\n&amp;quot; \
    | sort -rn \
    | uniq -d \
    | xargs -I{} -n1 find . -type f -size {}c -print0 \
    | xargs -0 sha256sum \
    | sort \
    | uniq -w32 --all-repeated=separate
27dd8ed44a83ff94d557f9fd0412ed5a8cbca69ea04922d88c01184a07300a5a  ./c.txt
27dd8ed44a83ff94d557f9fd0412ed5a8cbca69ea04922d88c01184a07300a5a  ./d.txt

2c8b08da5ce60398e1f19af0e5dccc744df274b826abe585eaba68c525434806  ./a.txt
2c8b08da5ce60398e1f19af0e5dccc744df274b826abe585eaba68c525434806  ./b.txt
$
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this, I can find the duplicates in my own system and delete them to clear out a few terabytes worth of data. It takes time, but I hadn't done it before, so I pointed it at &lt;code&gt;/mnt/seaweed&lt;/code&gt; and let it run.&lt;/p&gt;
&lt;p&gt;Once that is done, I can turn replication back on, fix replication, rebalance, and I should be good to go.&lt;/p&gt;
&lt;h2&gt;Forward Steps&lt;/h2&gt;
&lt;p&gt;I had known I was running out of storage for a while, so I blew my monthly budget and ordered the fourth server. This one has three 4 TB NVMe sticks (about 11 TB that will be added to the cluster) and should give me enough room to get my dad's files collected and deduplicated, and then to look into uploading them to a cheap S3 storage (&lt;a href="/tags/backblaze/"&gt;Backblaze&lt;/a&gt;) for later.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Switching Ceph to SeaweedFS on NixOS</title>
    <link rel="alternate" href="https://d.moonfire.us/blog/2024/03/21/switching-ceph-to-seaweedfs/" />
    <updated>2024-03-21T05:00:00Z</updated>
    <id>https://d.moonfire.us/blog/2024/03/21/switching-ceph-to-seaweedfs/</id>
    <category term="development" scheme="https://d.moonfire.us/categories/" label="Development" />
    <category term="ceph" scheme="https://d.moonfire.us/tags/" label="Ceph" />
    <category term="seaweedfs" scheme="https://d.moonfire.us/tags/" label="SeaweedFS" />
    <category term="raspberry-pi" scheme="https://d.moonfire.us/tags/" label="Raspberry Pi" />
    <category term="colmena" scheme="https://d.moonfire.us/tags/" label="Colmena" />
    <category term="nixos" scheme="https://d.moonfire.us/tags/" label="NixOS" />
    <category term="restic" scheme="https://d.moonfire.us/tags/" label="Restic" />
    <summary type="html">Over the new year, I decided to get SeaweedFS working on my home lab and eventually took down my Ceph cluster to move everything over.
</summary>
    <content type="html">&lt;p&gt;At the end of 2023, I realized that I was running out of space on my home &lt;a href="/tags/ceph/"&gt;Ceph&lt;/a&gt; cluster and it was time to add another node to it. While I had space for one more 3.5&amp;quot; drive in one of my servers, I was feeling a little adventurous and decided to get a &lt;a href="https://deskpi.com/collections/deskpi-super6c"&gt;DeskPi Super6C&lt;/a&gt;, a &lt;a href="/tags/raspberry-pi/"&gt;Raspberry Pi&lt;/a&gt; CM4, a large NVMe drive, and try to create a new node that way.&lt;/p&gt;
&lt;p&gt;Well, over the following few months, a lot of mistakes were made that are worthy of a dedicated post. But when most of those problems were resolved, I encountered another series of &amp;ldquo;adventures&amp;rdquo; which led me to switch out my home's Ceph cluster for a &lt;a href="/tags/seaweedfs/"&gt;SeaweedFS&lt;/a&gt; one.&lt;/p&gt;
&lt;h2&gt;Unable to Build Ceph&lt;/h2&gt;
&lt;p&gt;Around the time I was working on the Pi setup, my &lt;a href="/tags/nixos/"&gt;NixOS&lt;/a&gt; flake was unable to build the &lt;code&gt;ceph&lt;/code&gt; packages. Part of this is because I was working off unstable, so a few weeks of being unable to build meant I couldn't get Ceph working on the new hardware. I even tried compiling it myself, which takes about six hours on my laptop and longer on the Pi, since I had to build remotely on the Pi itself because I have yet to figure out how to get &lt;a href="/tags/colmena/"&gt;Colmena&lt;/a&gt; to build &lt;code&gt;aarch64&lt;/code&gt; on my laptop.&lt;/p&gt;
&lt;p&gt;Also, I was dreading setting up Ceph since I remember how many manual steps I had to do to get the OSDs working on my machines. While researching it, I was surprised to see my &lt;a href="/blog/2022/12/10/ceph-and-nixos/"&gt;blog post on it&lt;/a&gt; was on the &lt;a href="https://nixos.wiki/wiki/Ceph"&gt;wiki page&lt;/a&gt;, which is kind of cool and a nice egoboo.&lt;/p&gt;
&lt;p&gt;There was &lt;a href="https://github.com/NixOS/nixpkgs/pull/281924"&gt;a PR&lt;/a&gt; on Github for using the Ceph-provided OSD setup that would have hopefully alleviated it. That looked promising, so I was watching that PR with interest because I was right at the point of needing it.&lt;/p&gt;
&lt;p&gt;Sadly, that PR ended up being abandoned for a &amp;ldquo;better&amp;rdquo; approach. Given that it takes me six hours to build Ceph, I couldn't really help with that approach, which meant I was stuck waiting unless I was willing to dedicate a month or so to figuring it all out. Given that the last time I tried to do that, my PR was abandoned for a different reason, I was preparing to keep my Ceph pinned until the next release and just let my Raspberry Pi setup sit there idle.&lt;/p&gt;
&lt;p&gt;I was also being impatient and there was something new to try out.&lt;/p&gt;
&lt;h2&gt;SeaweedFS&lt;/h2&gt;
&lt;p&gt;Then I noticed a little thing on top of the NixOS wiki for Ceph:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Another distributed filesystem alternative you may evaluate is SeaweedFS.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I vaguely remember looking at it when I first set up my 22 TB Ceph cluster, but I'd been dreaming about having a Ceph cluster for so long that I dismissed it; I really wanted to do Ceph.&lt;/p&gt;
&lt;p&gt;Now the need to stick with Ceph was weaker, so I thought I would give it a try. If anything, I still had a running Ceph cluster and I could run them side-by-side.&lt;/p&gt;
&lt;p&gt;A big difference I noticed is that SeaweedFS has a single executable that provides everything. You can run it as an all-in-one process, but the three big services can also be run independently: the master (coordinates everything), the volume servers (where things are stored), and the filer (makes it look like a filesystem).&lt;/p&gt;
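&lt;p&gt;As a rough sketch of what that separation looks like on the command line (these normally run as services; the paths and hostnames are placeholders matching the configuration below):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# The same binary runs each role.
weed master -mdir=/var/lib/seaweedfs/master -port=9333
weed volume -dir=/mnt/fs-001 -max=0 -mserver=fs.home:9333 -port=9334
weed filer -master=fs.home:9333 -port=8888
&lt;/code&gt;&lt;/pre&gt;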
&lt;p&gt;Also, Ceph likes to work at the block level whereas SeaweedFS wants to be pointed to plain directories. So the plan was to take the 1 TB drive for my Raspberry Pi and turn it into a little cluster to try it out.&lt;/p&gt;
&lt;h2&gt;SeaweedFS and NixOS&lt;/h2&gt;
&lt;p&gt;The first thing was that SeaweedFS doesn't have any NixOS options. I couldn't find any flakes for it either. My attempt to create one took me three days with little success. Instead, I ended up cheating: I just grabbed the &lt;a href="https://hg.sr.ht/%7Edermetfan/seaweedfs-nixos/browse/seaweedfs.nix?rev=tip"&gt;best-looking one I could find&lt;/a&gt; and dumped it directly into my flake. It isn't even an override.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Yeah I would love to have a flake for this but I'm not skilled enough to create it myself.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Masters&lt;/h2&gt;
&lt;p&gt;With that, a little fumbling got a master† server up and running. You only need one of these, so pick a stable server and set it up.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-nix"&gt;# src/nodes/node-0.nix
inputs @ { config
, pkgs
, flakes
, ...
}: {
  imports = [
    ../../services/seaweedfs.nix # the file from dermetfan
  ];

  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    masters.main = {
      openFirewall = true;
      ip = &amp;quot;fs.home&amp;quot;; # This is what shows up in the links
      mdir = &amp;quot;/var/lib/seaweedfs/master/main&amp;quot;;
      volumePreallocate = true;

      defaultReplication = {
        dataCenter = 0;
        rack = 0;
        server = 0;
      };
    };
  };
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a really basic setup that doesn't really do anything. The master server is pretty much a coordinator. But what is nice is that it starts up a web server at &lt;code&gt;fs.home:9333&lt;/code&gt; that lets you see that it is up and running (sadly, no dark mode). This site will also let you get to all the other servers through web links.&lt;/p&gt;
&lt;p&gt;Another important part is the &lt;code&gt;defaultReplication&lt;/code&gt;. I made it explicit, but when messing around, setting all three to &lt;code&gt;0&lt;/code&gt; means that you don't get hung up the first time you try to write a file and it tries to replicate to a second node that isn't set up. All zeros is basically &amp;ldquo;treat the cluster as a single large disk.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Later on, you can change that easily. I ended up setting &lt;code&gt;rack = 1;&lt;/code&gt; in the above example because I treat each node as a &amp;ldquo;rack&amp;rdquo; since I don't really have a server rack.&lt;/p&gt;
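&lt;p&gt;As far as I can tell, those three numbers collapse into the same three-digit replication string used by the CLI and the shell commands, so my &lt;code&gt;rack = 1;&lt;/code&gt; setup should be roughly equivalent to this sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# 010 = no extra data-center copy, one extra copy on another rack, no
# extra copy on the same server.
weed master -mdir=/var/lib/seaweedfs/master/main -defaultReplication=010
&lt;/code&gt;&lt;/pre&gt;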
&lt;p&gt;&lt;em&gt;† I don't like using &amp;ldquo;master&amp;rdquo; and prefer main, but that is the terminology that SeaweedFS uses.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Volumes&lt;/h2&gt;
&lt;p&gt;Next up was configuring a volume server. I ended up doing one per server (I have four nodes in the cluster now) even though three of them had multiple partitions/directories on different physical drives. In each case, I created an &lt;code&gt;ext4&lt;/code&gt; partition and mounted it at a directory like &lt;code&gt;/mnt/fs-001&lt;/code&gt;. I could have used ZFS, but I know and trust &lt;code&gt;ext4&lt;/code&gt; and had trouble with ZFS years ago. But it doesn't matter, just make a drive.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-nix"&gt;# src/nodes/node-0.nix
inputs @ { config
, pkgs
, flakes
, ...
}: {
  imports = [
    ../../services/seaweedfs.nix # the file from dermetfan
  ];

  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    volumes.${config.networking.hostName} = {
      openFirewall = true;
      dataCenter = &amp;quot;home&amp;quot;;
      rack = config.networking.hostName;
      ip = &amp;quot;${config.networking.hostName}.home&amp;quot;;
      dir = [ &amp;quot;/mnt/fs-001&amp;quot; ];
      disk = [ &amp;quot;hdd&amp;quot; ]; # Replication gets screwy if these don't match
      max = [ 0 ];
      port = 9334;

      mserver = [
        {
          ip = &amp;quot;fs.home&amp;quot;;
          port = 9333;
        }
      ];
    };
  };
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once started up, this runs a service on &lt;code&gt;http://node-0.home:9334&lt;/code&gt;, connects to the master (which will then show a link on its page), and basically says there is plenty of space.&lt;/p&gt;
&lt;p&gt;The key parts I found are the &lt;code&gt;disk&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Replication is based on &lt;code&gt;dataCenter&lt;/code&gt;, rack, and server, but only between volumes whose disk types agree. So &lt;code&gt;hdd&lt;/code&gt; will only sync to other &lt;code&gt;hdd&lt;/code&gt;, even if half of them are actually SSD or NVMe drives. Because I have a mix of NVMe and HDD, I labeled them all &lt;code&gt;hdd&lt;/code&gt; because it works and I don't really care.&lt;/p&gt;
&lt;p&gt;The value of &lt;code&gt;0&lt;/code&gt; for &lt;code&gt;max&lt;/code&gt; means use all the available space. Otherwise, it only grabs a small number of 30 GB blocks and stops. Since I was dedicating the entire drive over to the cluster, I wanted to use everything.&lt;/p&gt;
&lt;h2&gt;Filers&lt;/h2&gt;
&lt;p&gt;The final service needed is a filer. This is basically the POSIX layer that lets you mount the drive in Linux and start to do fun things with it. Like the others, it just gets put on a server. I only set up one filer and it seems to work; others set up multiples, but I don't really understand why.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-nix"&gt;# src/nodes/node-0.nix
inputs @ { config
, pkgs
, flakes
, ...
}: {
  imports = [
    ../../services/seaweedfs.nix # the file from dermetfan
  ];

  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    filers.main = {
      openFirewall = true;
      dataCenter = &amp;quot;home&amp;quot;;
      encryptVolumeData = false;
      ip = &amp;quot;fs.home&amp;quot;;
      peers = [ ];

      master = [ # this is actually in cluster.masters that I import in the real file
        {
          ip = &amp;quot;fs.home&amp;quot;;
          port = 9333;
        }
      ];
    };
  };
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Like the others, this starts up a web service at &lt;code&gt;fs.home:8888&lt;/code&gt; that lets you browse the file system, upload files, and do fun things. Once this is all deployed (by your system of choice, mine is Colmena), it should be up and running, which means you should be able to upload a folder through the port 8888 site.&lt;/p&gt;
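&lt;p&gt;The filer also speaks plain HTTP, so a quick smoke test doesn't need the web page at all. This follows the upload pattern from the SeaweedFS docs; the directory and file names are made up:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;# Upload a file into the filer namespace, then read it back.
curl -F file=@photo.jpg &amp;quot;http://fs.home:8888/test/&amp;quot;
curl -o photo-copy.jpg &amp;quot;http://fs.home:8888/test/photo.jpg&amp;quot;
&lt;/code&gt;&lt;/pre&gt;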
&lt;h2&gt;Debugging&lt;/h2&gt;
&lt;p&gt;I found the error messages a little confusing at times, but they weren't too much trouble to find. I just had to tail &lt;code&gt;journalctl&lt;/code&gt; and then try to figure things out.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;journalctl -f | grep seaweed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you have multiple servers, debugging requires doing this to all of them.&lt;/p&gt;
&lt;h2&gt;Secondary Volumes&lt;/h2&gt;
&lt;p&gt;Adding more volumes is pretty easy. I just add a Nix expression to each node to include its drives.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-nix"&gt;  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    volumes.main = {
      openFirewall = true;
      dataCenter = &amp;quot;main&amp;quot;;
      rack = config.networking.hostName;
      mserver = cluster.masters; # I have this expanded out above
      ip = &amp;quot;${config.networking.hostName}.home&amp;quot;;
      dir = [ &amp;quot;/mnt/fs-002&amp;quot; &amp;quot;/mnt/fs-007&amp;quot; ]; # These are two 6 TB red drives
      disk = [ &amp;quot;hdd&amp;quot; &amp;quot;hdd&amp;quot; ]; # Replication gets screwy if these don't match
      max = [ 0 0 ];
      port = 9334;
    };
  };
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As soon as they deploy, they hook up automatically and increase the size of the cluster.&lt;/p&gt;
&lt;h2&gt;Mounting&lt;/h2&gt;
&lt;p&gt;Mounting&amp;hellip; this gave me a lot of trouble. NixOS does not play well with auto-mounting SeaweedFS, so I had to jump through a few hoops. In the end, I created a &lt;code&gt;mount.nix&lt;/code&gt; file that I include on any node that has to mount the cluster, which always goes into &lt;code&gt;/mnt/cluster&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-nix"&gt;inputs @ { config
, pkgs
, ...
}:
let
  mountDir = &amp;quot;/mnt/cluster&amp;quot;;

  # A script to go directly to the shell.
  shellScript = (pkgs.writeShellScriptBin
    &amp;quot;weed-shell&amp;quot;
    ''
      weed shell -filer fs.home:8888 -master fs.home:9333 &amp;quot;$@&amp;quot;
    '');

  # A script to list the volumes.
  volumeListScript = (pkgs.writeShellScriptBin
    &amp;quot;weed-volume-list&amp;quot;
    ''
      echo &amp;quot;volume.list&amp;quot; | weed-shell
    '');

  # A script to allow the file system to be mounted using Nix services.
  mountScript = (pkgs.writeShellScriptBin
    &amp;quot;mount.seaweedfs&amp;quot;
    ''
      if ${pkgs.gnugrep}/bin/grep -q ${mountDir} /proc/self/mountinfo
      then
        echo &amp;quot;already mounted, nothing to do&amp;quot;
        exit 0
      fi

      echo &amp;quot;mounting weed: ${pkgs.seaweedfs}/bin/weed&amp;quot; &amp;quot;$@&amp;quot;
      ${pkgs.seaweedfs}/bin/weed &amp;quot;$@&amp;quot;
      status=$?

      for i in 1 1 2 3 4 8 16
      do
        echo &amp;quot;checking if mounted yet: $i&amp;quot;
        if ${pkgs.gnugrep}/bin/grep -q ${mountDir} /proc/self/mountinfo
        then
          echo &amp;quot;mounted&amp;quot;
          exit 0
        fi

        ${pkgs.coreutils-full}/bin/sleep $i
      done

      echo &amp;quot;gave up: status=$status&amp;quot;
      exit $status
    '');
in
{
  imports = [
    ../../seaweedfs.nix
  ];

  # The `weed fuse` returns too fast and systemd doesn't think it has succeeded
  # so we have a little delay put in here to give the file system a chance to
  # finish mounting and populate /proc/self/mountinfo before returning.
  environment.systemPackages = [
    pkgs.seaweedfs
    shellScript
    volumeListScript
    mountScript
  ];

  systemd.mounts = [
    {
      type = &amp;quot;seaweedfs&amp;quot;;
      what = &amp;quot;fuse&amp;quot;;
      where = &amp;quot;${mountDir}&amp;quot;;
      mountConfig = {
        Options = &amp;quot;filer=fs.home:8888&amp;quot;;
      };
    }
  ];
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, let me break this into its parts. SeaweedFS has a nice little interactive shell where you can query status, change replication, and do lots of little things. However, it requires a few parameters, so the first thing I do is create a shell script called &lt;code&gt;weed-shell&lt;/code&gt; that provides those parameters so I don't have to type them.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;$ weed-shell
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second thing I wanted while doing this was to see a list of all the volumes. SeaweedFS creates 30 GB blobs for storage instead of thousands of little files. This makes things more efficient in a lot of ways (replication is done on volume blocks).&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;$ weed-volume-list | head
.&amp;gt; Topology volumeSizeLimit:30000 MB hdd(volume:810/1046 active:808 free:236 remote:0)
  DataCenter main hdd(volume:810/1046 active:808 free:236 remote:0)
    Rack node-0 hdd(volume:276/371 active:275 free:95 remote:0)
      DataNode node-0.home:9334 hdd(volume:276/371 active:275 free:95 remote:0)
        Disk hdd(volume:276/371 active:275 free:95 remote:0)
          volume id:77618  size:31474091232  file_count:16345  replica_placement:10  version:3  modified_at_second:1708137673 
          volume id:77620  size:31501725624  file_count:16342  delete_count:4  deleted_byte_count:7990733  replica_placement:10  version:3  modified_at_second:1708268248 
          volume id:77591  size:31470805832  file_count:15095  replica_placement:10  version:3  modified_at_second:1708104961 
          volume id:77439  size:31489572176  file_count:15067  replica_placement:10  version:3  modified_at_second:1708027468 
          volume id:77480  size:31528095736  file_count:15118  delete_count:1  deleted_byte_count:1133  replica_placement:10  version:3  modified_at_second:1708093312 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When doing things manually, that was all I needed to see things working and get the warm and fuzzy feeling that everything was in order.&lt;/p&gt;
&lt;p&gt;The catch with getting it to automatically mount (or even &lt;code&gt;systemctl start mnt-cluster.mount&lt;/code&gt;) is that the command to do so is &lt;code&gt;weed fuse /mnt/cluster -o &amp;quot;filer=fs.home:8888&amp;quot;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;NixOS doesn't like that.&lt;/p&gt;
&lt;p&gt;So my answer was to write a shell script that fakes a &lt;code&gt;mount.seaweedfs&lt;/code&gt; and calls the right thing. Unfortunately, it rarely worked, and it took me a few days to figure out why. While &lt;code&gt;weed fuse&lt;/code&gt; returns right away, I'm guessing network latency means that &lt;code&gt;/proc/self/mountinfo&lt;/code&gt; doesn't update until a few seconds later. But &lt;code&gt;systemd&lt;/code&gt; had already queried the &lt;code&gt;mountinfo&lt;/code&gt; file, seen that it wasn't mounted, and declared the mount failed.&lt;/p&gt;
&lt;p&gt;But, by the time I (as a slow human) looked at it, the &lt;code&gt;mountinfo&lt;/code&gt; showed success.&lt;/p&gt;
&lt;p&gt;The answer was to delay returning from &lt;code&gt;mount.seaweedfs&lt;/code&gt; until we give SeaweedFS a chance to finish, so &lt;code&gt;systemd&lt;/code&gt; could see it was mounted and wouldn't fail the unit. Hence the loop, grep, and sleeping inside &lt;code&gt;mount.seaweedfs&lt;/code&gt;. That took a lot of reading code and puzzling through things, so hopefully it will help someone else.&lt;/p&gt;
&lt;p&gt;After I did that, though, it has been working pretty smoothly, including recovering on reboot.&lt;/p&gt;
&lt;h2&gt;Changing Replication&lt;/h2&gt;
&lt;p&gt;As I mentioned above, once I was able to migrate the Ceph cluster, I changed replication to &lt;code&gt;rack = 1;&lt;/code&gt; to create one extra copy across all four nodes. However, SeaweedFS doesn't automatically rebalance like Ceph does. Instead, you have to go into the shell and run some commands.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-shell"&gt;$ weed-shell
lock
volume.deleteEmpty -quietFor=24h -force
volume.balance -force
volume.fix.replication
unlock
exit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also set it up to rebalance automatically, but I'm not entirely sure I've done that correctly, so I'm not going to show my attempt.&lt;/p&gt;
&lt;h2&gt;Observations&lt;/h2&gt;
&lt;p&gt;One of the biggest things I noticed is that Ceph does proactive maintenance on drives. It doesn't sound like much, but it made me more comfortable that Ceph would detect errors. It also means that the hard drives are always running in my basement; just the slow grind of physical hardware as Ceph scrubs and shuffles things around.&lt;/p&gt;
&lt;p&gt;SeaweedFS is more passive in that regard. I don't trust it to catch a failing hard drive as fast, but it still avoids the failure modes of RAID and lets me spread data across multiple servers and locations. There is also a feature for uploading to an S3 server if I wanted it, though I use a &lt;a href="/tags/restic/"&gt;Restic&lt;/a&gt; service for my S3 uploads.&lt;/p&gt;
&lt;p&gt;That passivity also means it hasn't been grinding my drives as much and I don't have to worry about the SSDs burning out too quickly.&lt;/p&gt;
&lt;p&gt;Another minor thing is that while there are a lot fewer options with SeaweedFS, it took me about a third of the time to get the cluster up and running. There were a few error messages that threw me, but for the most part, I understood the errors and what SeaweedFS was looking for. That was not always the case with Ceph, where I had a few year-long warnings that I never figured out how to fix and was content to leave as-is.&lt;/p&gt;
&lt;p&gt;I do not like the lack of dark mode on SeaweedFS's websites.&lt;/p&gt;
&lt;h2&gt;Opinions&lt;/h2&gt;
&lt;p&gt;I continue to like Ceph, but I also like SeaweedFS. I would use either, depending on the expected load. If I were running Docker images or doing coding on the cluster, I would use a Ceph cluster. But in my case, I'm using it for long-term storage, video files, assets, and photo shoots. Not to mention my dad's backups. So I don't need the interactivity of Ceph along with its higher level of maintenance.&lt;/p&gt;
&lt;p&gt;Also, it is a relatively simple Go project, doesn't take six hours to build, and uses concepts I already understand (&lt;code&gt;mkfs.ext4&lt;/code&gt;), so I'm more comfortable with it.&lt;/p&gt;
&lt;p&gt;It was also available at the point I wanted to play (though Ceph is building on NixOS unstable again, so that is a moot point; I was just being impatient and wanted to learn something new).&lt;/p&gt;
&lt;p&gt;At the moment, SeaweedFS works out nicely for my use case, so I decided to switch my entire Ceph cluster over. I don't feel as safe with SeaweedFS, but I feel Safe Enough™.&lt;/p&gt;
</content>
  </entry>
</feed>
