An obsession with data (Git version)

While thinking about writing a post on my (somewhat depressing) writing career, I got curious about how much I actually write. The question came up at work, so I ran a quick word count (`wc -w`) on my repositories.

$ find . -name '*.txt' | xargs wc -w
4288655 total

There is no way I've written 4.3 million words. So I got curious about how many words are actually in there, and also about when I wrote them. Now, as much as I'm fond of data, I haven't actually tracked when I finished each story.

I realized that, since I'm obsessed with source control systems, I could get a rough estimate of when I posted something. In general, I don't take more than 1-2 weeks to finish anything I've started, so I can say that I finished a story about the time I started it (i.e., when I first checked it into source control).

Currently, I use git to track my files. I also previously used Subversion, so I'll get to that in the next few days.

Git can list all the commits inside the repository:

$ git log -1
commit 4168500550f4c27ed2be6e1fa97d846df366e483
Author: Dylan R. E. Moonfire
Date: Sun Apr 22 20:53:42 2012 -0500

Working on edits.

Two things. One, I am terrible at check-in comments when I write. I use this to make checkpoints on my story and I don't really go back. So, usually I just have a one-line comment like "Worked on chapter 3." Two, I'm using "-1" to limit this to one entry for illustration purposes.

The above output is really verbose. We don't care about the commit hash, who did it (it is only me), or the commit message. Fortunately, Git can control the output with the "--pretty" option. In this case, we are using "%ai", which gives us a pretty little ISO-style timestamp of the author date.

$ git log -1 --pretty=%ai
2012-04-22 20:53:42 -0500
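
As an aside, if only the calendar date matters, Git can trim the timestamp itself. This is a minimal sketch, assuming it is run inside a Git working tree; `short_dates` is a made-up helper name, while `--date=short` and `%ad` are standard `git log` options.

```shell
# Print one YYYY-MM-DD author date per commit, newest first.
# Assumption: the current directory is inside a Git working tree.
# short_dates is a hypothetical helper name; extra arguments (such as -1)
# are passed straight through to git log.
short_dates() {
	git log --date=short --pretty=%ad "$@"
}
```

Calling `short_dates -1` would print just the date of the most recent commit.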

Pretty good, except that it doesn't really show the files that changed. Well, we can fix that with the "--name-only" option.

$ git log -1 --pretty=%ai --name-only
2012-04-22 20:53:42 -0500


There we go. We have the date we made a change and which files we changed (or added or deleted). Of course, not really in a useful form. I'm fond of Perl programming for one-off programs, so I banged this up:


# Parses the output of:
#
#     git log --name-only --pretty=%ai

use strict;
use warnings;

#
# Directory Parsing
#

# Go through all the directories in the command-line arguments.
while (@ARGV)
{
	# If it isn't a directory, we don't care.
	my $dir = shift @ARGV;
	$dir =~ s@/$@@sg;

	if (! -d $dir)
	{
		print STDERR "Ignoring $dir (not a directory)\n";
		next;
	}

	my $git_dir = "$dir/.git";

	if (! -d $git_dir)
	{
		print STDERR "Ignoring $dir (no .git inside)\n";
		next;
	}

	# We're processing this directory.
	print STDERR "Processing $dir\n";

	# Set GIT_DIR so we don't have to move into that directory.
	$ENV{GIT_DIR} = $git_dir;

	# We want to build up a log of the entire repository. We want to
	# know each of the files and date that they were checked in.
	my $last_timestamp;
	my %files = ();

	open GIT, "git log --name-only --pretty=%ai |"
		or die "Cannot open Git for $dir ($!)";

	while (<GIT>)
	{
		# Clean up the line and ignore blanks.
		chomp;
		next if /^\s*$/;

		# In general, we will have two types of lines. One is a
		# timestamp and the other is the name of the file.
		if (/^(\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+)/)
		{
			# We only care about the date of the check-in.
			$last_timestamp = "$1-$2-$3";
		}
		else
		{
			# For everything else, we get a filename. There are a few
			# things that we frequently ignore, such as hidden files
			# (start with a period).
			next if /^\./;
			next if m@/\.@;

			# We add the file and the current timestamp to the
			# hash. Since we replace with each one, and `git log` goes
			# backwards in time, the last time we see the file is the
			# point it was first added to the repository.
			$files{$_} = $last_timestamp;
		}
	}

	close GIT;

	# Now that we are done parsing, we output the merged results.
	open REPORT, ">$dir.files" or die "Cannot write $dir.files ($!)";

	foreach my $file (sort(keys(%files)))
	{
		# Pull out the date.
		my $date = $files{$file};

		# Keep track if this file exists.
		my $exists = 0;
		$exists = 1 if -f "$dir/$file";

		#print "$file\t$date\n";
		#print "$date\t$exists\t$file\n";
		print REPORT "$file\t$exists\t$date\n";
	}

	close REPORT;
}
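
A possible shortcut, which is not what the script above does: `git log` can be limited to the commits that added a file with `--diff-filter=A`, so the parsing only has to pair each file name with the date line before it. This is a rough sketch under a few assumptions: it runs inside a Git working tree, `first_added` is a hypothetical name, and no file name looks exactly like a date. Unlike the Perl script, it does not skip hidden files or check whether the file still exists.

```shell
# Print "file<TAB>date" for the commit that first added each file.
# Assumption: the current directory is inside a Git working tree.
# first_added is a hypothetical helper name.
first_added() {
	git log --diff-filter=A --name-only --date=short --pretty=%ad "$@" |
	awk '
		# A bare YYYY-MM-DD line starts a new commit.
		/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { date = $0; next }
		# Any other non-blank line is a file name. git log runs newest
		# to oldest, so the last assignment is the earliest addition.
		NF { first[$0] = date }
		END { for (file in first) print file "\t" first[file] }
	' | LC_ALL=C sort
}
```

A file that was deleted and later re-added shows up in more than one commit; because later lines overwrite earlier ones, the sketch keeps the oldest date.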


The script writes its results to a ".files" file named after the directory.

$ perl moonfire
Processing moonfire
$ head moonfire.files
Makefile 1 2010-06-26
another-werewolfs-tail.odt 0 2010-06-26
another-werewolfs-tail.txt 0 2012-04-22
best-enemies.txt 0 2010-08-25
best-laid-plans.odt 0 2010-06-26
best-of-enemies.odt 0 2010-06-26
brickpunk.odt 0 2010-06-26
change-of-honor.odt 0 2010-06-26

Now, this catches all the various versions of the file as I (constantly) renamed files, changed formats, and basically mucked around. I'm not afraid of shifting things around, so it reflects that. I put in the exists column (0 or 1) so I know which file is actually there versus the (constant) renames.

In the above example, I would combine "best-enemies.txt" and "best-of-enemies.odt" together and take the earliest date. A bit of manual work isn't too bad for this project, but there you go: the first time each file shows up in a Git repository.
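
The manual combining step could be partly scripted. The sketch below assumes a hand-written, tab-separated alias file that maps each old name to the name it should be filed under; that file and the `merge_aliases` helper are hypothetical, not something the Perl script produces. It reads the three-column report (file, exists, date) and keeps the earliest date per combined name, relying on the fact that YYYY-MM-DD dates sort correctly as plain strings.

```shell
# merge_aliases ALIASES REPORT
#   ALIASES: hypothetical tab-separated "old-name<TAB>canonical-name" lines
#   REPORT:  tab-separated "file<TAB>exists<TAB>date" lines
# Prints "canonical-name<TAB>earliest-date", sorted by name.
merge_aliases() {
	awk -F'\t' '
		# First file: remember the canonical name for each alias.
		FILENAME == ARGV[1] { canon[$1] = $2; next }
		# Second file: keep the earliest date seen for each name.
		{
			name = ($1 in canon) ? canon[$1] : $1
			if (!(name in first) || $3 < first[name])
				first[name] = $3
		}
		END { for (name in first) print name "\t" first[name] }
	' "$1" "$2" | LC_ALL=C sort
}
```

With an alias line mapping "best-of-enemies.odt" to "best-enemies.txt", the two rows from the sample output collapse into a single "best-enemies.txt" entry dated 2010-06-26.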