Recovering SeaweedFS
Lately, I've been quite fond of SeaweedFS. It isn't as powerful as Ceph, but it is considerably easier to maintain and manage. There are some tradeoffs, such as detecting bit rot (when disks start to fail), but I find it not quite as “fragile” when running on a random collection of Linux machines.
One of the features I want to play with in SeaweedFS is the ability to upload a directory transparently to an S3 bucket (not AWS though, they are too big). I'm thinking about that for later, when I want to make an extra, off-site backup of critical files, including Partner's photo shoots.
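From what I remember of the SeaweedFS remote storage feature, the setup is a pair of weed shell commands plus a sync process. Treat this as a sketch only: the provider name cloud1, the bucket, the credentials, and even the exact flag spellings are assumptions to double-check against weed shell's own help before trusting any of it.
> remote.configure -name=cloud1 -type=s3 -s3.access_key=KEY -s3.secret_key=SECRET
> remote.mount -dir=/archives -remote=cloud1/my-backup-bucket
# then, outside the shell, keep local writes flowing up to the remote
$ weed filer.remote.sync -dir=/archives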
Overfilling
Last week, I worked on one of the tasks I've been stalling on: archiving my dad's artwork. He had a lot of copies of nearly identical files, and I didn't have the working storage on my laptop. I figured that since I had this huge (22 TB, though mostly full) cluster, I could use that.
Yeah… not the best of ideas.
I didn't realize I had made a mistake until everything started to fail because all of the nodes were 98% or more full and the system couldn't even replicate its own replication logs. I didn't notice any of this until Partner said Plex was down.
Well, with replication down, I couldn't even use the weed shell to remove a file. When I tried, it just hung for hours.
$ weed-shell
> rm -rf in/dad-pictures
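In hindsight, a quick look at how full each volume server's disk was would have warned me before things seized up. A sketch, where the host names and the data directory are placeholders for my actual machines:
# how full is each volume server's data disk?
$ for host in fs1.local fs2.local fs3.local; do ssh "$host" df -h /var/lib/seaweedfs; done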
Nix Shell Scripts
Above, I use weed-shell. This is a custom script I generate with NixOS that is installed on any server that can talk to my SeaweedFS cluster.
{ pkgs, ... }:
let
  # wrap `weed shell` so the filer and master addresses are always filled in
  shellScript = pkgs.writeShellScriptBin "weed-shell" ''
    weed shell -filer fs.local:8888 -master fs.local:9333 "$@"
  '';
in
{
  environment.systemPackages = [ shellScript ];
}
This lets me handle the common functions I use when maintaining things. In this case, I don't have to type out the parameters needed to talk to my SeaweedFS cluster every time.
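With the wrapper in place, I can pipe one-off commands through it instead of opening an interactive session. A small sketch (volume.list just prints the cluster topology):
$ echo "volume.list" | weed-shell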
Cleaning Up
I tried a bunch of things, such as forcing a more aggressive vacuum (reclaiming space from deleted files):
> volume.vacuum --help
Usage of volume.vacuum:
  -collection string
        vacuum this collection
  -garbageThreshold float
        vacuum when garbage is more than this limit (default 0.3)
  -volumeId uint
        the volume id
> volume.vacuum -garbageThreshold 0.1
This didn't help as much as I hoped, but it did let some replication and some commands go through (dropping the threshold to 0.1 vacuums volumes that are even 10% garbage instead of the default 30%). I still needed to clear a lot more space so I could remove files properly and do a wholesale rm -rf to blow away my father's files and try again later once I have more room.
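For what it's worth, the filer's HTTP API can also delete a directory recursively, which could be a fallback when the shell hangs. This is only a sketch, assuming the directory lives at /in/dad-pictures on the filer from my wrapper script, and it still needs the cluster healthy enough to accept the deletes:
$ curl -X DELETE "http://fs.local:8888/in/dad-pictures?recursive=true"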
Replication
I have my volumes set to 010 replication. The three digits represent data center, rack, and host.
- Data Center: Make X copies replicated across multiple data centers. I don't have multiple data centers, so this is always 0 for me.
- Rack: This replicates across multiple racks. In my setup, each computer is a “rack”, so my 1 means an extra copy goes on a different machine. I treat it this way because I also have a DeskPi Super6C, which is six Raspberry Pi CM4s (compute modules) in a single case, so I treat all six as one “rack” but with separate hosts.
- Host: Copies on the same machine. I don't have much use for two copies on the same machine, so I always set this to 0.
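For reference, that placement comes from how the daemons are started. This is only a sketch with made-up names (home and super6c), not how I actually launch things:
# the master hands out 010 replication unless a volume is configured otherwise
$ weed master -defaultReplication=010 -mdir=/var/lib/seaweedfs/master
# each volume server announces which data center and "rack" it lives in
$ weed volume -mserver=fs.local:9333 -dir=/var/lib/seaweedfs/volume \
    -dataCenter=home -rack=super6c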
If I ever had a friend who could host a local server, I would consider setting up a second “data center” to get an off-site backup. That would probably require Tailscale, but that's beyond my current scope.
Volumes
SeaweedFS basically stores everything in 30 GB files, each acting as a blob with many files packed inside it. That way, the usual problems with thousands of small files don't come up, since everything operates on those 30 GB files, called “volumes”.
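As far as I know, the 30 GB figure comes from the master's volume size limit, which defaults to 30000 MB. A sketch of where that knob lives:
$ weed master -volumeSizeLimitMB=30000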
Replication is done at the volume level, which means I was able to turn off replication for a series of volumes.
> lock
> volume.configure.replication -replication 000 -volumeId 1
> volume.configure.replication -replication 000 -volumeId 2
> volume.fix.replication
> volume.balance -force
> unlock
The lock and unlock are important when making changes like this; they prevent some critical operations from corrupting the cluster. The commands will tell you when a lock is needed.
volume.configure.replication basically changed those volumes to no replication (risky). Once that is done, volume.fix.replication and volume.balance -force delete the excess copies and shuffle things around, giving me some breathing room to get replication running again so I can mass delete files.
When I'm done, I just go and change all the volumes back to -replication 010 to give me the second backup.
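Since weed shell reads commands from standard input, that switch back can be scripted instead of typed. A sketch using the same volume ids as above:
$ printf '%s\n' \
    "lock" \
    "volume.configure.replication -replication 010 -volumeId 1" \
    "volume.configure.replication -replication 010 -volumeId 2" \
    "volume.fix.replication" \
    "volume.balance -force" \
    "unlock" | weed-shell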
Data Hoarding
The ultimate problem is data hoarding. Both my father and I have multiple copies of files floating around. It isn't great, but when you don't have time to clean out a dying laptop, it is sometimes easier to rsync the entire thing into a directory on the new machine and then move on.
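That move usually looks something like this sketch; the host name and paths are placeholders:
$ rsync -aP old-laptop:/home/dad/ /mnt/seaweed/archives/old-laptop/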
In this case, I needed to trim the duplicates from his files. The script is based on one from a StackOverflow answer:
# group files by size, hash the candidates, and report identical hashes (uniq -w64 compares just the hash)
find . -not -empty -type f -printf "%s\n" \
| sort -rn \
| uniq -d \
| xargs -I{} -n1 find . -type f -size {}c -print0 \
| xargs -0 sha256sum \
| sort \
| uniq -w64 --all-repeated=separate
The output is pretty simple because it only lists duplicates and the paths to find them.
$ echo one > a.txt
$ echo one > b.txt
$ echo two > c.txt
$ echo two > d.txt
$ echo three > e.txt
$ find . -not -empty -type f -printf "%s\n" \
| sort -rn \
| uniq -d \
| xargs -I{} -n1 find . -type f -size {}c -print0 \
| xargs -0 sha256sum \
| sort \
| uniq -w64 --all-repeated=separate
27dd8ed44a83ff94d557f9fd0412ed5a8cbca69ea04922d88c01184a07300a5a  ./c.txt
27dd8ed44a83ff94d557f9fd0412ed5a8cbca69ea04922d88c01184a07300a5a  ./d.txt

2c8b08da5ce60398e1f19af0e5dccc744df274b826abe585eaba68c525434806  ./a.txt
2c8b08da5ce60398e1f19af0e5dccc744df274b826abe585eaba68c525434806  ./b.txt
$
With this, I can find the duplicates on my own system and delete them to clear out a few terabytes' worth of data. It takes time, but since I hadn't done it before, I pointed it at /mnt/seaweed and let it run.
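The listing still needs a human to decide which copy survives, but a sketch like this pulls out everything except the first path in each group, assuming the output was saved to a file I'm calling dupes.txt (a made-up name):
awk 'NF == 0 { n = 0; next }                  # a blank line separates duplicate groups
     { n++ }                                  # count lines within the current group
     n > 1 { sub(/^[^ ]+ +/, ""); print }     # print every path except the first in the group
' dupes.txt
I still review that list by hand before piping anything into rm.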
Once that is done, I can turn replication back on, fix replication, rebalance, and I should be good to go.
Forward Steps
I've known I was running out of storage for a while now, so I blew my monthly budget and ordered the fourth server. This one has three 4 TB NVMe sticks (about 11 TB that will be added to the cluster) and should give me enough room to get my dad's files collected and deduplicated, and then to look into uploading them to cheap S3 storage (Backblaze) for later.