by noirscape 3 days ago

I can understand in theory why they wouldn't want to back up .git folders as-is. Git has a serious object-count bloat problem in any repository with a decent amount of commit history, which adds a lot of overhead just to scanning the folder for files.

I don't quite understand why it's still like this; it's probably the biggest reason why git tends to play poorly with a lot of filesystem tools (not just backups). If it'd been something like an SQLite database instead (just an example really), you wouldn't get so much unnecessary inode bloat.

At the same time Backblaze is a backup solution. The need to back up everything is sort of baked in there. They promise to be the third backup solution in a three layer strategy (backup directly connected, backup in home, backup external), and that third one is probably the single most important one of them all since it's the one you're going to be touching the least in an ideal scenario. They really can't be excluding any files whatsoever.

The cloud service exclusion is similar, only worse. Imagine getting hit by a cryptoworm. Your cloud storage tool will dutifully sync everything encrypted, junking up your storage across devices, and because restoring old versions is both painful and near impossible at scale, you need an actual backup solution for exactly that situation. Backblaze excluding files in those folders feels like a complete misunderstanding of what their purpose should be.

stebalien 3 days ago

I've actually spent some time debugging why git causes so many issues with the backup software I use (restic).

Ironically, I believe you have it backwards: pack files, git's solution to the "too many tiny files" problem, are the issue here; not the tiny files themselves.

In my experience, incremental backup software works best with many small files that never change. Scanning is usually just a matter of checking modification times and moving on. This isn't fast, but it's fast enough for backups and can be optimized by monitoring for file changes in a long-running daemon.
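The modification-time scan described above can be sketched in shell (all paths here are invented for the demo): only files whose mtime is newer than a timestamp file recording the last backup get picked up.

```shell
# Illustrative sketch of mtime-based incremental scanning; paths are
# throwaway demo paths, not anything a real backup tool uses.
rm -rf /tmp/demo_src /tmp/demo_stamp
mkdir -p /tmp/demo_src
echo "v1" > /tmp/demo_src/a.txt
touch /tmp/demo_stamp              # pretend a backup just finished here
sleep 1
echo "v2" > /tmp/demo_src/b.txt    # one file changes afterwards
# Only the file modified after the stamp is a re-upload candidate:
find /tmp/demo_src -type f -newer /tmp/demo_stamp
```

A long-running daemon watching for change events would replace the `find` walk entirely, which is the optimization mentioned above.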

However, lots of mostly identical files ARE an issue for filesystems as they tend to waste a lot of space. Git solves this issue by packing these small objects into larger pack files, then compressing them.

Unfortunately, it's those pack files that cause issues for backup software: any time git "garbage collects" and creates new pack files, it ends up deleting and creating a bunch of large files filled with what looks like random data (due to compression). Constantly creating/deleting large files filled with random data wreaks havoc on incremental/deduplicating backup systems.

adithyassekhar 3 days ago

I don’t think this is the right way to see this.

Why should a file backup solution adapt to work with git? Or any application? It should not try to understand what a git object is.

I’m paying to copy files from a folder to their servers; just do that, no matter what the file is. Stay at the filesystem level, not the application level.

noirscape 3 days ago

I'm not saying Backblaze should adapt to git; the issue isn't application related (beyond git being badly configured by default; git gc is the solution, it's just that git gc basically never runs).

It's that to back up a folder on a filesystem, you need to traverse that folder and check every file in that folder to see if it's changed. Most filesystem tools usually assume a fairly low file count for these operations.

Git, rather unusually, tends to produce a lot of files in regular use; before packing, every commit/object/branch is simply stored as a file on the filesystem (branches only as pointers). Packing fixes that by compressing commit and object files together, but it's not done by default (only after an initial clone or when the garbage collector runs). Iterating over a .git folder can take a lot of time in a place that's typically not very well optimized (since most "normal" people don't have thousands of tiny files in their folders that contain sprawled out application state.)
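To see the loose-object behavior described above concretely, here's a sketch on a throwaway repo (assumes `git` is on PATH; the repo path and identity are made up for the demo):

```shell
# Even one commit in a fresh repo creates several loose object files
# (blob + tree + commit), each its own inode under .git/objects.
rm -rf /tmp/demo_repo
git init -q /tmp/demo_repo
git -C /tmp/demo_repo config user.email demo@example.com
git -C /tmp/demo_repo config user.name demo
echo hello > /tmp/demo_repo/file.txt
git -C /tmp/demo_repo add file.txt
git -C /tmp/demo_repo commit -qm "initial"
find /tmp/demo_repo/.git/objects -type f | wc -l   # at least 3 loose objects
git -C /tmp/demo_repo count-objects -v             # "count:" is the loose-object figure
```

Multiply that by thousands of commits that never get packed and you get the traversal problem.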

The correct solution here is either for git to change, or for Backblaze to implement better iteration logic (which will probably require special handling for git..., so it'd be more "correct" to fix up git, since Backblaze's tools aren't the only ones with this problem.)

masfuerte 3 days ago

7za (the compression app) does blazingly fast iteration over any kind of folder. This doesn't require special code for git. Backblaze's backup app could do the same but rather than fix their code they excluded .git folders.

When I back up my computer, the .git folders are among the most important things on there. Most of my personal projects aren't pushed to github or anywhere else.

Fortunately I don't use Backblaze. I guess the moral is don't use a backup solution where the vendor has an incentive to exclude things.

toast0 3 days ago

IMHO, you can't do blazingly fast iteration over folders with small files in Windows, because every open is hooked by the anti-virus, and there goes your performance.

noirscape 3 days ago

Not just antivirus, there's also file locking.

Windows has a much stricter approach to file locking than Linux, and backup software like Backblaze absolutely should be making use of it (lest it back up files that are being modified mid-backup), but that also means the software effectively has to ask the OS to lock each file, then release the lock when it's done. With a large number of files, that stacks up.

Linux file locking is, to put it mildly, deficient. Most software doesn't even bother acquiring locks in the first place. Piling onto that, basically nobody uses POSIX locks because the API has some very heavy footguns (most notably, every lock on a file is released whenever any close() on that file is called, even if another part of the same process still holds its own lock on it). Most Linux file locks instead work on the honor system: you create a file called filename.lock in the same directory as the file you're working on, and any software that sees filename.lock exists should stop touching the file.
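The honor-system scheme can be sketched in shell; `set -C` (noclobber) makes the `.lock` creation fail atomically if the file already exists. The data file and lock name are illustrative.

```shell
# Honor-system locking sketch; filenames are invented for the demo.
lockfile=/tmp/demo_data.txt.lock
rm -f "$lockfile"                       # clean slate for the demo
if ( set -C; echo $$ > "$lockfile" ) 2>/dev/null; then
  echo "lock acquired"
  # ... safely read/write /tmp/demo_data.txt here ...
  rm -f "$lockfile"                     # release
else
  echo "lock held by someone else"
fi
```

The whole thing is advisory: any program that ignores the `.lock` file can still write to the data file, which is exactly the deficiency being described.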

Nobody using file locks is probably the bigger reason why Linux chokes less on fast iteration than Windows, given that Windows is slow with loads of files even when you aren't running a virus scanner.

buzer 2 days ago

I have never personally used it, but aren't Windows' Shadow Copies supposed to be the answer to file locking/modification issues?

jcgl 2 days ago

> Linux file locking is to put it mildly, deficient.

Since the introduction of flock on Linux, how bad is it really though? I don't see why one would need kludges like filename.lock. Though of course flock is still an "honor system" as you put it.

Tor3 2 days ago

Same - on one of my computers (Linux, btw) the only directories in the list of directories to back up are .git directories. That's what I'm concerned with, so that's what I back up. And it works just fine, with my provider.

NetMageSCW 3 days ago

Actually once the initial backup is done there is no reason to scan for changes. They can just use a Windows service that tells them when any file is modified or created and add that file to their backup list.

DarkUranium 2 days ago

To an extent. WinAPI's file watching has a race condition in it, and there's no simple workaround (just complex & error-prone ones).

Well, for backups the workaround is a bit easier (as they strictly only ever read files), but still.

Saris 3 days ago

Backblaze offers 'unlimited' backup space, so they have to do this kind of thing as a result of that poor marketing choice.

conductr 3 days ago

No they don’t. They just have to price the product to reflect changing user patterns. When Backblaze started, it was simply “we back up all the files on your drive”; they didn’t even have a restore feature, that was your job when you needed it. Over time user behavior changed: cloud drives were a huge data source they hadn’t priced in, git gave them problems they hadn’t factored in, etc. The issue is that their answer is to exclude things, which makes them a half-baked solution for many of their users. They should have just changed the pricing and supported the backup solution people need today.

adithyassekhar 3 days ago

If they must scam, shouldn’t they be deduplicating on the server rather than the client?

Ajedi32 3 days ago

FWIW some other people in this thread are saying the article is wrong about .git folders not being backed up: https://news.ycombinator.com/item?id=47765788

That's a really important fact that's getting buried so I'd like to highlight it here.

Ajedi32 2 days ago

Well, I checked and it looks like none of my .git repos are backed up. All attempts to restore only restore the working copy. -_- I'm not sure why it was working for the person in the comment I linked.

rmccue 3 days ago

I think it's understandable for both Backblaze and most users, but surely the solution is to add `.git` to their default exclusion list which the user can manage.

maalhamdan 3 days ago

I think they shouldn't back up git objects individually because git handles the versioning information. Just compress the .git folder itself and back it up as a single unit.
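One way to do what's suggested here, sketched with tar (every path below is invented for the demo):

```shell
# Bundle the .git directory into a single archive so the backup tool
# sees one file instead of thousands of loose objects.
rm -rf /tmp/demo_proj /tmp/demo_proj_git.tar.gz
mkdir -p /tmp/demo_proj/.git/objects
echo "dummy object" > /tmp/demo_proj/.git/objects/ab1234
tar -czf /tmp/demo_proj_git.tar.gz -C /tmp/demo_proj .git
tar -tzf /tmp/demo_proj_git.tar.gz      # one unit; contents still listable
```

The trade-off is that any commit changes the archive, so a deduplicating backup tool re-uploads the whole thing instead of just the changed objects.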

willis936 3 days ago

Better yet, include deduplication, incremental versioning, verification, and encryption. Wait, that's borg / restic.

This is a joke, but honestly anyone here shouldn't be directly backing up their filesystems and should instead be using the right tool for the job. You'll make the world a more efficient place, have more robust and quicker to recover backups, and save some money along the way.

pkaeding 3 days ago

This is a good point, but you might still expect them to back up untracked and modified files, along with everything else on your filesystem.

pixl97 3 days ago

Eh, you really shouldn't do that for any kind of file that acts like an (impromptu) database. This is how you get corruption, especially when change information can be split across more than one file.

pkaeding 3 days ago

Sorry, what are you saying shouldn't be done? Backing up untracked/modified files in a git repo? Or compressing the .git folder and backing it up as a unit?

pixl97 3 days ago

> Backing up untracked/modified files in a git repo?

This. It's best to do it as an atomic operation, such as a VSS-style snapshot that's consistent, taken with operations on the files stopped or paused. Something like a zip is generally better because it ties up the filesystem for less time than the upload typically takes.

pkaeding a day ago

I see what you mean, but isn't this an issue with any filesystem backup tool? Or is there something about untracked files in a git workspace that is different, that I'm not seeing?

rcxdude 3 days ago

It's probably primarily because Linus is a kernel and filesystem nerd, not a database nerd, so he preferred to just use the filesystem which he understood the performance characteristics of well (at least on linux).

ciupicri 3 days ago

> If it'd been something like an SQLite database instead (just an example really)

See Fossil (https://fossil-scm.org/)

P.S. There's also (https://www.sourcegear.com/vault/)

> SourceGear Vault Pro is a version control and bug tracking solution for professional development teams. Vault Standard is for those who only want version control. Vault is based on a client / server architecture using technologies such as Microsoft SQL Server and IIS Web Services for increased performance, scalability, and security.

yangm97 3 days ago

You don’t see ZFS/BTRFS block-based snapshot replication choking on git or any other sort of dataset. Use the right tool for the job, or something.

grumbelbart2 3 days ago

Git packs objects into pack-files on a regular basis. If it doesn't, check your configuration, or do it manually with 'git repack'.

noirscape 3 days ago

I decided to look into this (git gc should also be doing this), and I think I figured out why it's such a consistent issue with git in particular. Running git gc does properly pack objects together and reduce inode count to something much more manageable.

It's the same reason the postgres autovacuum daemon tends to be borderline useless unless you retune it[0]: the defaults are barmy. git gc only auto-runs once there are 6700 loose (unpacked) objects[1], while most typical filesystem tools start balking at traversing ~1000 files in a structure (it depends a bit on the filesystem/OS as well; Windows tends to slow down a good bit earlier than Linux).

To fix it, running

> git config --global gc.auto 1000

should retune it, and any subsequent commit to your repos will trigger garbage collection once there are around 1000 loose files. Pack file management seems to be properly tuned by default: at more than 50 packs, gc will repack into a larger pack.
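If you'd rather not change your global config while experimenting, the same knob can be set per-repo. A sketch on a throwaway repo (assumes `git` is installed; the path is made up):

```shell
# Per-repo variant of the gc.auto tuning above, on a disposable repo.
rm -rf /tmp/demo_gc
git init -q /tmp/demo_gc
git -C /tmp/demo_gc config gc.auto 1000
git -C /tmp/demo_gc config gc.auto                    # prints the threshold back
git -C /tmp/demo_gc count-objects -v | grep '^count:' # current loose-object count
```

Comparing the `count:` line against your `gc.auto` value tells you whether the next commit should trigger an automatic gc.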

[0]: For anyone curious, the default postgres autovacuum setting runs only when 10% of the table consists of dead tuples (roughly: deleted+every revision of an updated row). If you're working with a beefy table, you're never hitting 10%. Either tune it down or create an external cronjob to run vacuum analyze more frequently on the tables you need to keep speedy. I'm pretty sure the defaults are tuned solely to ensure that Postgres' internal tables are fast, since those seem to only have active rows to a point where it'd warrant autovacuum.

[1]: https://git-scm.com/docs/git-gc

LetTheSmokeOut 3 days ago

I needed to use

> git config --global gc.auto 1000

with the long option name, and no `=`.

Dylan16807 3 days ago

A few thousand files shouldn't be a problem for a program designed to scan entire drives of files. Even in a single folder, and even considering sloppy programs, I wouldn't worry just yet; and git's not putting them in a single folder anyway.

bombcar 3 days ago

I love nothing more than running strange git commands found in HN comments.

Let's ride the lightning and see if it does anything.
