An obsession with data (normalization)

I had someone ask me why I'm writing these blog posts. It started with just a question I wanted answered (how much have I written?), but it led into something more. Over the years, I've ended up parsing a lot of data. For twenty years, it was my job to take customers' files from any operating system, any database, and just about any format, and create normalized and consolidated reports. I wrote programs that took normalized data and produced more data, which fed into the rest of our business processes.

Parsing data isn't really rocket science. For me, most of it means taking different inputs (say, the Git and Subversion logs), normalizing them into a common format, and then using that common format to process the data.

This post is about how I took the output from the last two posts and came up with the earliest date for any significant file in my current repository. Having them in a common format (I threw the output into a file called combined.csv) means I don't have to care whether a given line came from Subversion or Git; it's just a file.

$ cat combined.csv
flight-of-the-scions/fots-01.odt 2010-01-22
flight-of-the-scions/fots-01.txt 2010-12-30
flight-of-the-scions/chapters/fots-01.txt 2011-08-19
flight-of-the-scions/chapters/chapter-01.txt 2011-11-03
$

In the above example, all of these files are effectively the same thing. The only differences are where I put the file in the file system and which format I used. I decided not to write the Git and Subversion parsers to handle renames, mainly because I knew I had to do some of the work manually anyway (and writing the rename parsing would take far more effort).

You might notice that I removed the "exists" column in the middle. I ended up not needing it because I wrote a little program that took the input and told me if a file was missing.

#!/usr/bin/perl
# Usage: perl test-files.pl combined.csv

# Go through all the files in the combined input and pull out the
# first column (the filename).
while (<>)
{
	# Clean up the line.
	chomp;

	# Pull out the first column.
	my @p = split(/\t/);
	my $f = $p[0];

	# Test to see if the file exists.
	if (! -f $f)
	{
		# The file doesn't exist, so complain.
		print STDERR "Can't find $f\n";
	}
}

This was a one-off program and sadly I did a horrible job of naming variables and writing comments, but basically it just takes the combined.csv file and tells me which files don't exist. I used that to figure out which files still had to be renamed, found, or merged.

As an interesting side note, I actually found six stories (mostly flash fiction) that I had lost from my current repository. Fortunately, I could go back to Subversion and pull them out to add to my list.

Using the test-files.pl program above, I could easily rename and move files around. During the process, I quickly ran into cases where I had lots of duplicate entries for the same file with different dates.

$ cat combined.csv
flight-of-the-scions/chapters/chapter-01.txt 2010-01-22
flight-of-the-scions/chapters/chapter-01.txt 2010-12-30
flight-of-the-scions/chapters/chapter-01.txt 2011-08-19
flight-of-the-scions/chapters/chapter-01.txt 2011-11-03
$

To cut down on the information, I wrote another quick program to remove all but the earliest date. Because I always use ISO dates (yyyy-mm-dd), the file can be sorted alphabetically and everything after the first entry for each file removed.
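The sort trick works because ISO dates compare chronologically even as plain strings. A quick shell check (with made-up dates, not ones from my repository) shows the earliest date surfacing first:

```shell
# ISO dates (yyyy-mm-dd) sort chronologically as plain strings,
# so after sorting, the first entry is the earliest date.
printf '%s\n' '2011-11-03' '2010-01-22' '2010-12-30' | sort | head -n 1
# prints 2010-01-22, the earliest of the three
```

This is the whole reason the dedupe program below only has to keep the first line it sees for each file.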

#!/usr/bin/perl

# Keep track of files we've already seen.
%seen = ();

# Go through the input.
while (<>)
{
	# Clean up the input line.
	chomp;

	# Split out the columns.
	my @p = split(/\t/);

	# If we already saw it, tell the user we skipped it. If not, then
	# print it out and add it to the hash so we don't print it again.
	if (exists $seen{$p[0]})
	{
		print STDERR "Skipping $p[0]\n";
	}
	else
	{
		print "$_\n";
		$seen{$p[0]} = 1;
	}
}

$ clear;sort combined.csv | perl remove-duplicates.pl > a && mv a combined.csv
Skipping flight-of-the-scions/chapters/chapter-01.txt
Skipping flight-of-the-scions/chapters/chapter-01.txt
Skipping flight-of-the-scions/chapters/chapter-01.txt
$

(Note: I pipe to "a" and then move it over to combined.csv because most shells, Bash and PowerShell included, blow away the input file if it is also the target of a redirected output. In other words, if I don't, the shell erases combined.csv and I have to start over... which happened twice.)
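For the curious, the clobbering is easy to reproduce with a throwaway file (demo.txt here is just an example name, not one of my files):

```shell
# The shell truncates the redirection target *before* running the
# command, so sorting a file onto itself destroys the input.
printf 'b\na\n' > demo.txt
sort demo.txt > demo.txt   # demo.txt is emptied before sort reads it
size=$(wc -c < demo.txt)
echo "$size"               # prints 0: the file is now empty
rm demo.txt
```

The intermediate file (or `mv` at the end) is the simplest way around it.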

Using these two programs together, I got into a nice tight development cycle that let me make changes and reduce the data I was working with. I used an editor with auto-revert-mode (Emacs in this case), but I could have used Notepad++ or any other editor that refreshes the buffer on a file change. Once I made a change, I re-ran the above programs and let the buffer refresh. Then I made another set of changes and repeated.

The final program tags my files with the actual dates. This is a bit more complicated: my text-based files use a bullet list to include metadata.

$ head flight-of-the-scions/chapters/chapter-01.txt -n 3
= The Water Screw

> In Miwāfu, they call those who cannot use magic miw: bachirāma. Translated into Lorban, it means “cursed to be forever mundane.” – Awakened Magic, Dastor Malink
$

I want to put a "* Date: 2010-01-22" right after the title line. The corresponding line in the combined file is:

$ grep flight-of-the-scions/chapters/chapter-01.txt combined.csv
flight-of-the-scions/chapters/chapter-01.txt 2010-01-22
$

So, I wrote a little Perl program that takes the combined.csv file, picks out all the files from the first column, and adds a date line to each file where it is missing. This way, a single run means that every file listed in combined.csv will have the date I figured out from parsing the Git and Subversion logs.

#!/usr/bin/perl

# Directives.
use strict;
use warnings;

# Parameters: the first is the directory, the second the combined file.
my $dir = $ARGV[0];
die "USAGE: dir input" unless -d $dir;

my $input = $ARGV[1];
die "USAGE: dir input" unless -f $input;

# Slurp up the contents of the file.
my %files = ();

open FILES, "<$input" or die "Cannot open input $input ($!)";

while (<FILES>)
{
	chomp;
	my @p = split(/\t/);
	$files{$p[0]} = $p[1];
}

close FILES;

# Get a list of all the files in the directory.
open PIPE, "find '$dir' -type f |" or die "Cannot open pipe ($!)";

while (<PIPE>)
{
	# Clean up the line.
	chomp;

	# Ignore . files and anything that isn't a text file.
	next if m@/\.@;
	next unless m@\.txt$@;

	# Trim off the leading characters.
	s@^\./@@;

	# See if the file exists.
	my $file = $_;

	if (exists $files{$file})
	{
		# We found the file.
		my $date = $files{$file};
		print "HIT  $date $file\n";

		# Pull out the entries so we can report what was missing after
		# we're done processing.
		delete $files{$file};

		# Open up the file and read in the metadata section, looking
		# for an already existing date.
		my $found_date = 0;

		open FILE, "<$file" or die "Cannot open $file ($!)";

		while (<FILE>)
		{
			# Clean up the line.
			chomp;

			if (m@^\* Date:\s*(.*?)$@)
			{
				my $old_date = $1;

				$found_date = 1;

				if ($date eq $old_date)
				{
					# Nothing to do, we're good.
					last;
				}

				print "       $1\n";
			}
		}

		close FILE;

		# If we didn't find the date, we need to add it into the file.
		unless ($found_date)
		{
			my $need_date = 1;

			print "       Date: $date\n";

			open IN, "<$file" or die "Cannot open $file ($!)";
			open OUTPUT, ">tmp" or die "Cannot open tmp ($!)";

			while (<IN>)
			{
				print OUTPUT $_;

				if ($need_date && $_ =~ /^= /)
				{
					print OUTPUT "* Date: $date\n";
					$need_date = 0;
				}
			}

			close IN;
			close OUTPUT;

			rename($file, "$file.bak");
			rename("tmp", $file);
		}
	}
	else
	{
		# Print the file (debugging).
		#print "SKIP $_\n";
	}
}

close PIPE;

# Write out any remaining files.
foreach my $file (sort(keys(%files)))
{
	print "MISS $file\n";
}

You'll notice that the longer the program, the more I comment. I also had some debugging code commented out, which I used to verify everything worked properly. Once I finished running this, every single story and chapter I've written was given a date line. I spot-checked them (there are quite a few in total) and checked the changes into the Git repository.

This also means that my next post can finally answer the question "how many words have I written?" (with certain caveats). And because I dated the files with a day, month, and year, I will also be able to graph my rough writing output over time.
