An obsession with data (Subversion edition)
It took me a number of days to clean up the output from the Git repository parse. I could have written a parser that handled the renames through Git but it was easier just to juggle data and handle the format changes which were never a Git rename (for example when I went from an XML-based system to Open Document and then again to a Creole-based system).
However, there was an interesting problem: the day 2010-06-26 showed up a lot.
$ wc -l git.csv
606 git.csv
$ grep 2010-06-26 git.csv | wc -l
421
$
Specifically, 421 of the 606 entries had the same date. It is pretty obvious what happened on that day: I imported my old Subversion repository into Git because I had fallen in love with Git. To track down the real dates of those files, I had to go back to the Subversion repository. Fortunately, I still have that repository on my web host, so I could do much the same thing.
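A spike like this is easier to spot with a per-date tally than by grepping for one date at a time. The following is a minimal sketch, assuming only that each line of the file contains one YYYY-MM-DD date; the function name `date_histogram` and the sample rows are invented for illustration and are not part of the original data.

```shell
#!/bin/sh
# Tally entries per date, busiest date first.
# Assumes every line carries one YYYY-MM-DD date; other columns are ignored.
date_histogram() {
    grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' "$1" | sort | uniq -c | sort -rn
}

# Invented sample rows, only so the sketch runs on its own.
sample=$(mktemp)
printf '%s\n' \
    'chapter-01.txt,2010-06-26' \
    'chapter-02.txt,2010-06-26' \
    'chapter-03.txt,2010-07-10' > "$sample"

date_histogram "$sample" | head -n 5
rm -f "$sample"
```

Against the real data this would be `date_histogram git.csv | head -n 5`, which should put the import date at the top of the list.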
Subversion's log output is pretty simple:
$ svn log -v
------------------------------------------------------------------------
r1072 | dmoonfire | 2010-06-20 14:39:48 -0500 (Sun, 20 Jun 2010) | 2 lines
Changed paths:
   M /moonfire/stories/flight-of-the-scions/fots-15.txt

Tweaking chapter 15.
Now, I was getting a bit tired of parsing, so I didn't try to optimize it like I did the Git one. However, I wanted the same output, so I pulled out the Perl program I used for Git last time and tweaked it to handle both the different date and time line and the fact that file lines start with "M", "A", or "D".
#!/usr/bin/perl

#
# Setup
#

# Directives
use strict;
use warnings;

# Root of the working copy used for the "does this file still exist" check.
# (Assumed value; the listing never shows where $dir is set.)
my $dir = ".";

#
# Input Process
#

# Since this was all in one Subversion tree, we just pipe "svn log -v" to
# the input.
my $last_timestamp = "undef";
my %files = ();

while (<STDIN>)
{
    # Clean up the line and ignore blanks.
    chomp;
    next if /^\s*$/;

    # In general, we will have two types of lines. One is a
    # timestamp and the other is the name of the file.
    # r\d+ | user | 2010-07-10 21:45:57 -0500 (Sat, 10 Jul 2010) | 1 line
    if (/(\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+)/)
    {
        # We only care about the date of the check-in.
        $last_timestamp = "$1-$2-$3";
    }
    else
    {
        # For everything else, we get a filename. Strip the leading
        # whitespace and skip anything that is not a changed-path line.
        s/^\s+//;

        unless (/(M|A|D)\s+(.*?)$/)
        {
            # An unknown line, so skip it.
            next;
        }

        # We add the file and the current timestamp to the
        # hash. Since we replace with each one, and `svn log` goes
        # backwards in time, the last time we see the file is the
        # point it was first added to the repository.
        my $file = $2;
        $file =~ s/\s*\(.*?$//;
        $files{$file} = $last_timestamp;
    }
}

# Now that we are done parsing, we output the merged results.
open(my $report, ">", "svn.files") or die "Cannot write svn.files ($!)";

foreach my $file (sort(keys(%files)))
{
    # Pull out the date.
    my $date = $files{$file};

    # Keep track if this file exists. They don't exist in this case.
    my $exists = 0;
    $exists = 1 if -f "$dir/$file";

    #print "$file\t$date\n";
    #print "$date\t$exists\t$file\n";
    print $report "$file\t$exists\t$date\n";
}

close $report;
The output of the above program is identical to the Git version:
$ cat svn.files
moonfire/trial-by-steam.txt 0 2007-09-03
moonfire/wind-bear-moon.txt 0 2007-03-08
$
Now, Subversion wasn't my first source control system, but I don't have anything older than that. Like the Git initial check-in, there is an epoch where all files start, in this case 2006-10-15. This is quite a few years after I started writing, but before that there was a CVS repository, and before that plain file copying. Fortunately, there aren't too many chapters before the Subversion epoch, which is good since I would have to do some serious work to figure out the earlier dates.
$ grep 2006-10-15 svn.files | wc -l
83
$ wc -l svn.files
702 svn.files
$
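Rather than guessing the epoch date and grepping for it, the earliest date and its count can be pulled out in one pass. This is a rough sketch, assuming the date is the last whitespace-separated field on each line, as in the svn.files output above; the function name `epoch_summary` and the sample file names are made up for the example.

```shell
#!/bin/sh
# Print the earliest date in a listing and how many lines end with it.
# ISO dates sort chronologically, so the first line after sorting is the epoch.
epoch_summary() {
    awk '{ print $NF }' "$1" | sort | uniq -c | head -n 1
}

# Invented sample rows in the file<TAB>exists<TAB>date layout.
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
    'moonfire/a.txt' 0 2006-10-15 \
    'moonfire/b.txt' 0 2006-10-15 \
    'moonfire/c.txt' 0 2007-03-08 > "$sample"

epoch_summary "$sample"
rm -f "$sample"
```

Run as `epoch_summary svn.files`, this should report the same count as the grep above, next to the epoch date.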
The cleanup tasks for the Git and Subversion data are the same: mainly figuring out which filenames are actually the same file across the various renames and file format changes. The next post will cover a few techniques I used to clean up the data while figuring all of this out.