Switching Ceph to SeaweedFS on NixOS
At the end of 2023, I realized that I was running out of space on my home Ceph cluster and it was time to add another node to it. While I had space for one more 3.5" drive in one of my servers, I was feeling a little adventurous and decided to get a DeskPi Super6C, a Raspberry Pi CM4, a large NVMe drive, and try to create a new node that way.
Well, over the following few months, a lot of mistakes were made that are worthy of a dedicated post. But, when most of those problems were resolved, I encountered another series of “adventures” which led me to switch out my home's Ceph cluster for a SeaweedFS one.
Unable to Build Ceph
Around the time I was working on the Pi setup, my NixOS flake was unable to build the ceph packages. Part of this is because I was working off unstable, so a few weeks of being unable to build meant I couldn't get Ceph working on the new hardware. I even tried compiling it myself, which takes about six hours on my laptop and even longer because I had to remote build on the Pi itself, since I have yet to figure out how to get Colmena to build aarch64 on my laptop.
Also, I was dreading setting up Ceph since I remembered how many manual steps I had to do to get the OSDs working on my machines. While researching it, I was surprised to see that my blog post on it was on the wiki page, which is kind of cool and a nice egoboo.
There was a PR on GitHub for using the Ceph-provided OSD setup that would hopefully have alleviated that. It looked promising, so I was watching it with interest because I was right at the point of needing it.
Sadly, that PR ended up being abandoned for a “better” approach. Given that it takes me six hours to build Ceph, I couldn't really help with that approach, which meant I was stuck waiting unless I was willing to dedicate a month or so to figuring it all out. Given that the last time I tried to do that, my PR was abandoned for a different reason, I was preparing to keep my Ceph pinned until the next release and just have my Raspberry Pi setup sit there idle.
I was also being impatient and there was something new to try out.
SeaweedFS
Then I noticed a little thing at the top of the NixOS wiki page for Ceph:
Another distributed filesystem alternative you may evaluate is SeaweedFS.
I vaguely remember looking at it when I first set up my 22 TB Ceph cluster, but I had been dreaming about having a Ceph cluster for so long that I dismissed it; I really wanted to do Ceph.
Now that the need was less pressing, I thought I would give it a try. If anything, I still had a running Ceph cluster, so I could run them side by side.
A big difference I noticed is that SeaweedFS has a single executable that provides everything. You can run it as an all-in-one process, but the three big services can also be run independently: the master (coordinates everything), the volume servers (where things are stored), and the filer (makes it look like a filesystem).
Also, Ceph likes to work at the block level whereas SeaweedFS wants to be pointed to plain directories. So the plan was to take the 1 TB drive for my Raspberry Pi and turn it into a little cluster to try it out.
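To get a feel for the pieces before wiring anything into NixOS, here is a rough sketch of running them by hand. This is not my setup, just illustrative defaults (temporary directories, default ports):

# All-in-one process, handy for kicking the tires
weed server -dir=/tmp/seaweedfs

# Or the three services run separately
weed master -mdir=/tmp/seaweedfs/master
weed volume -dir=/tmp/seaweedfs/volume -mserver=localhost:9333
weed filer -master=localhost:9333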
SeaweedFS and NixOS
The first thing was that SeaweedFS doesn't have any NixOS options. I couldn't find any flakes for it either. My attempt to create one took me three days with little success. Instead, I ended up cheating: I just grabbed the best-looking module I could find and dumped it directly into my flake. It isn't even an override.
Yeah, I would love to have a flake for this, but I'm not skilled enough to create it myself.
Masters
With that, a little fumbling got a master† server up and running. You only need one of these, so pick a stable server and set it up.
# src/nodes/node-0.nix
inputs @ { config
, pkgs
, flakes
, ...
}: {
  imports = [
    ../../services/seaweedfs.nix # the file from dermetfan
  ];

  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    masters.main = {
      openFirewall = true;
      ip = "fs.home"; # This is what shows up in the links
      mdir = "/var/lib/seaweedfs/master/main";
      volumePreallocate = true;

      defaultReplication = {
        dataCenter = 0;
        rack = 0;
        server = 0;
      };
    };
  };
}
This is a really basic setup that doesn't do much on its own. The master server is pretty much a coordinator. But what is nice is that it starts up a web server at fs.home:9333 that lets you see that it is up and running (sadly, no dark mode). This site will also let you get to all the other servers through web links.
Another important part is the defaultReplication. I made it explicit, but when messing around, setting all three to 0 means that you don't get hung up the first time you try to write a file and it tries to replicate to a second node that isn't set up. All zeros is basically “treat the cluster as a single large disk.”
Later on, you can change that easily. I ended up setting rack = 1; in the above example because I treat each node as a “rack” since I don't really have a server rack.
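For reference, that change is just one number in the master config above. As far as I can tell, the three values map onto SeaweedFS's three-digit replication setting, so this works out to “010”:

defaultReplication = {
  dataCenter = 0; # extra copies in other data centers
  rack = 1;       # one extra copy in another rack (a node, in my setup)
  server = 0;     # extra copies on other servers in the same rack
};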
† I don't like using “master” and prefer main, but that is the terminology that SeaweedFS uses.
Volumes
Next up was configuring a volume server. I ended up doing one per server (I have four nodes in the cluster now), even though three of them had multiple partitions/directories on different physical drives. In each of these cases, I created an ext4 partition and mounted it at a directory named /mnt/fs-001. I could have used ZFS, but I know and trust ext4, and I had trouble with ZFS years ago. It doesn't really matter, though; just make a drive available.
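For completeness, mounting that partition is the usual NixOS fileSystems entry, something like this (the label is a made-up example; use whatever device or label you actually formatted):

fileSystems."/mnt/fs-001" = {
  device = "/dev/disk/by-label/fs-001"; # hypothetical label
  fsType = "ext4";
};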
# src/nodes/node-0.nix
inputs @ { config
, pkgs
, flakes
, ...
}: {
  imports = [
    ../../services/seaweedfs.nix # the file from dermetfan
  ];

  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    volumes.${config.networking.hostName} = {
      openFirewall = true;
      dataCenter = "home";
      rack = config.networking.hostName;
      ip = "${config.networking.hostName}.home";
      dir = [ "/mnt/fs-001" ];
      disk = [ "hdd" ]; # Replication gets screwy if these don't match
      max = [ 0 ];
      port = 9334;
      mserver = [
        {
          ip = "fs.home";
          port = 9333;
        }
      ];
    };
  };
}
Once started up, this starts a service on http://node-001.home:9334, connects to the master (which will then show a link to it on that page), and basically says there is plenty of space.
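If you would rather check from a terminal than a browser, both servers also answer plain HTTP status requests, something like the following (paths from memory, so verify against your version):

# Ask the master about the whole cluster
curl "http://fs.home:9333/cluster/status?pretty=y"

# Ask a volume server about its own disks
curl "http://node-001.home:9334/status?pretty=y"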
The key parts I found are the disk and max settings.
Replication is based on dataCenter, rack, and server, but only applies when the disk types agree. So hdd will only sync to other hdd volumes, even if half of them are really ssd or nvme. Because I have a mix of NVMe and HDD, I marked them all hdd because it works and I don't really care.
The value of 0 for max means use all the available space. Otherwise, it only grabs a small number of 30 GB volumes and stops. Since I was dedicating the entire drive to the cluster, I wanted to use everything.
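If you would rather cap the usage instead of handing over a whole drive, my understanding is that a non-zero max limits the number of roughly 30 GB volumes. A made-up example in the volume config above:

dir = [ "/mnt/fs-001" ];
disk = [ "hdd" ];
max = [ 100 ]; # ~100 volumes × 30 GB ≈ 3 TB instead of the whole drive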
Filers
The final service needed is a filer. This is basically the POSIX layer that lets you mount the drive in Linux and start to do fun things with it. Like the others, it just gets put on a server. I only set up one filer and it seems to work, but others set up multiple ones; I just don't really understand why.
# src/nodes/node-0.nix
inputs @ { config
, pkgs
, flakes
, ...
}: {
  imports = [
    ../../services/seaweedfs.nix # the file from dermetfan
  ];

  services.seaweedfs.clusters.default = {
    package = pkgs.seaweedfs;

    filers.main = {
      openFirewall = true;
      dataCenter = "home";
      encryptVolumeData = false;
      ip = "fs.home";
      peers = [ ];
      master = [ # this is actually in cluster.masters that I import in the real file
        {
          ip = "fs.home";
          port = 9333;
        }
      ];
    };
  };
}
Like the others, this starts up a web service at fs.home:8888 that lets you browse the file system, upload files, and do fun things. Once this is all deployed (by your system of choice; mine is Colmena), it should be up and running, which means you should be able to upload a folder through the site on port 8888.
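You can also talk to the filer with plain HTTP instead of the web page. A quick sketch with made-up paths (double-check the API against your SeaweedFS version):

# Upload a file into a directory (created on demand)
curl -F file=@photo.jpg "http://fs.home:8888/photos/"

# Read it back
curl "http://fs.home:8888/photos/photo.jpg"

# List a directory as JSON
curl -H "Accept: application/json" "http://fs.home:8888/photos/?pretty=y"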
Debugging
I found the error messages a little confusing at times, but they weren't too much trouble to track down. I just had to tail journalctl and then try to figure things out.
journalctl -f | grep seaweed
If you have multiple servers, debugging requires doing this on all of them.
Secondary Volumes
Adding more volumes is pretty easy. I just add a Nix expression to each node to include the additional drives.
services.seaweedfs.clusters.default = {
  package = pkgs.seaweedfs;

  volumes.main = {
    openFirewall = true;
    dataCenter = "main";
    rack = config.networking.hostName;
    mserver = cluster.masters; # I have this expanded out above
    ip = "${config.networking.hostName}.home";
    dir = [ "/mnt/fs-002" "/mnt/fs-007" ]; # These are two 6 TB red drives
    disk = [ "hdd" "hdd" ]; # Replication gets screwy if these don't match
    max = [ 0 0 ];
    port = 9334;
  };
};
As soon as they deploy, they hook up automatically and increase the size of the cluster.
Mounting
Mounting… this gave me a lot of trouble. NixOS does not play well with auto-mounting SeaweedFS, so I had to jump through a few hoops. In the end, I created a mount.nix file that I include on any node that has to mount the cluster, which always goes into /mnt/cluster.
inputs @ { config
, pkgs
, ...
}:
let
  mountDir = "/mnt/cluster";

  # A script to go directly to the shell.
  shellScript = (pkgs.writeShellScriptBin
    "weed-shell"
    ''
      weed shell -filer fs.home:8888 -master fs.home:9333 "$@"
    '');

  # A script to list the volumes.
  volumeListScript = (pkgs.writeShellScriptBin
    "weed-volume-list"
    ''
      echo "volume.list" | weed-shell
    '');

  # A script to allow the file system to be mounted using Nix services.
  # The `weed fuse` command returns too fast and systemd doesn't think it has
  # succeeded, so we have a little delay put in here to give the file system a
  # chance to finish mounting and populate /proc/self/mountinfo before
  # returning.
  mountScript = (pkgs.writeShellScriptBin
    "mount.seaweedfs"
    ''
      if ${pkgs.gnugrep}/bin/grep -q ${mountDir} /proc/self/mountinfo
      then
        echo "already mounted, nothing to do"
        exit 0
      fi

      echo "mounting weed: ${pkgs.seaweedfs}/bin/weed" "$@"
      ${pkgs.seaweedfs}/bin/weed "$@"
      status=$?

      for i in 1 1 2 3 4 8 16
      do
        echo "checking if mounted yet: $i"
        if ${pkgs.gnugrep}/bin/grep -q ${mountDir} /proc/self/mountinfo
        then
          echo "mounted"
          exit 0
        fi
        ${pkgs.coreutils-full}/bin/sleep $i
      done

      echo "gave up: status=$status"
      exit $status
    '');
in
{
  imports = [
    ../../seaweedfs.nix
  ];

  environment.systemPackages = [
    pkgs.seaweedfs
    shellScript
    volumeListScript
    mountScript
  ];

  systemd.mounts = [
    {
      type = "seaweedfs";
      what = "fuse";
      where = "${mountDir}";
      mountConfig = {
        Options = "filer=fs.home:8888";
      };
    }
  ];
}
So, let me break this into parts. SeaweedFS has a nice little interactive shell where you can query status, change replication, and do lots of little things. However, it requires a few parameters, so the first thing I do is create a shell script called weed-shell that provides those parameters so I don't have to type them.
$ weed-shell
The second thing I wanted while doing this was to see a list of all the volumes. SeaweedFS creates 30 GB blobs for storage instead of thousands of little files. This makes things more efficient in a lot of ways (replication is done on volume blocks).
$ weed-volume-list | head
.> Topology volumeSizeLimit:30000 MB hdd(volume:810/1046 active:808 free:236 remote:0)
  DataCenter main hdd(volume:810/1046 active:808 free:236 remote:0)
    Rack node-0 hdd(volume:276/371 active:275 free:95 remote:0)
      DataNode node-0.home:9334 hdd(volume:276/371 active:275 free:95 remote:0)
        Disk hdd(volume:276/371 active:275 free:95 remote:0)
          volume id:77618 size:31474091232 file_count:16345 replica_placement:10 version:3 modified_at_second:1708137673
          volume id:77620 size:31501725624 file_count:16342 delete_count:4 deleted_byte_count:7990733 replica_placement:10 version:3 modified_at_second:1708268248
          volume id:77591 size:31470805832 file_count:15095 replica_placement:10 version:3 modified_at_second:1708104961
          volume id:77439 size:31489572176 file_count:15067 replica_placement:10 version:3 modified_at_second:1708027468
          volume id:77480 size:31528095736 file_count:15118 delete_count:1 deleted_byte_count:1133 replica_placement:10 version:3 modified_at_second:1708093312
When doing things manually, that was all I needed to see things working and get the warm and fuzzy feeling that it worked.
The problem with getting it to automatically mount (or even with systemctl start mnt-cluster.mount) is that the command to do so is weed fuse /mnt/cluster -o "filer=fs.home:8888".
NixOS doesn't like that.
So my answer was to write a shell script that fakes a mount.seaweedfs and calls the right thing. Unfortunately, it rarely worked, and it took me a few days to figure out why. While weed fuse returns right away, I'm guessing network latency means that /proc/self/mountinfo doesn't update until a few seconds later. By then, systemd had already queried the mountinfo file, saw that it wasn't mounted, and declared the mount failed.
But, by the time I (as a slow human) looked at it, the mountinfo showed success.
The answer was to delay returning from mount.seaweedfs until SeaweedFS had a chance to finish, so systemd could see it was mounted and wouldn't fail the unit. Hence the loop, grep, and sleeping inside mount.seaweedfs. Figuring that out required a lot of reading code and puzzling things through, so hopefully this will help someone else.
After I did that, though, it has been working pretty smoothly, including recovering on reboot.
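To double-check it after a deploy, a couple of quick commands are enough (assuming the same /mnt/cluster mount point):

# Ask systemd about the generated mount unit
systemctl status mnt-cluster.mount

# Or just confirm the FUSE mount is present
findmnt /mnt/cluster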
Changing Replication
As I mentioned above, once I was able to migrate the Ceph cluster, I changed replication to rack = 1; to create one extra copy across all four nodes. However, SeaweedFS doesn't automatically rebalance like Ceph does. Instead, you have to go into the shell and run some commands.
$ weed-shell
lock
volume.deleteEmpty -quietFor=24h -force
volume.balance -force
volume.fix.replication
unlock
exit
You can also set it up to run automatically, but I'm not entirely sure I've done that correctly, so I'm not going to show my attempt.
Observations
One of the biggest things I noticed is that Ceph does proactive maintenance on drives. It doesn't sound like much, but I feel more comfortable that Ceph would detect errors. It also means that the hard drives are always running in my basement; just the slow grind of physical hardware as Ceph scrubs and shuffles things around.
SeaweedFS is more passive in that regard. I don't trust that it will catch a failing hard drive as fast, but it still doesn't have the failure modes of RAID and lets me spread data across multiple servers and locations. There is also a feature for uploading to an S3 server if I wanted it; I use a Restic service for my S3 uploads.
That passivity also means it hasn't been grinding my drives as much and I don't have to worry about the SSDs burning out too quickly.
Another minor thing: while there are a lot fewer options with SeaweedFS, it took me about a third of the time to get the cluster up and running. There were a few error messages that threw me, but for the most part, I understood the errors and what SeaweedFS was looking for. That was not always the case with Ceph, where I had a few year-long warnings that I never figured out how to fix and was content to leave as-is.
I do not like the lack of dark mode on SeaweedFS's websites.
Opinions
I continue to like Ceph, but I also like SeaweedFS. I would use either, depending on the expected load. If I were running Docker images or doing coding on the cluster, I would use a Ceph cluster. But, in my case, I'm using it for long-term storage: video files, assets, and photo shoots, not to mention my dad's backups. So I don't need the interactivity of Ceph along with its higher level of maintenance.
Also, it is a relatively simple Go project, doesn't take six hours to build, and uses concepts that I understand (mkfs.ext4), so I'm more comfortable with it.
It was also available at the point I wanted to play (though Ceph is building on NixOS unstable again, so that is a moot problem; I was just being impatient and wanted to learn something new).
At the moment, SeaweedFS works out nicely for my use case, so I decided to switch my entire Ceph cluster over. I don't feel as safe with SeaweedFS, but I feel Safe Enough™.