Perl

Emacs and Multiple Dictionaries

2015-04-01T05:00:00Z

For the last four years, I've been trying to write a program called Author Intrusion. There were a number of reasons for this, but one of the biggest was that I couldn't find any program that handled dictionaries (really word lists, but a lot of people use the wrong name).

This morning, when I woke up, I ended up doing a random search that took me through a long winding journey that finally gave me an interim solution that is pretty solid until I can get Author Intrusion finished (which may be another four… decades or so).

The problem

As with any long-term writing project, I've created a large number of characters, groups, and locations. Most of them are based on a conlang while others just sounded cool. However, when I'm spell-checking my chapters, I need to have those names in the dictionary; otherwise, they'll continually show up as typos.

One common solution is to add those names to the program's dictionary. This works out pretty well, until the end of the project. Then, the hundreds of names are no longer relevant for the next series but still show up in suggestions for every future project.

My preferred novel-writing editor, Emacs, has the ability to have per-file word lists. This is called “LocalWords”, and it means that I can identify a list of valid words without adding them to my permanent dictionary. Of course, this means I have to keep copying that per-file list into each new chapter, which then gets the new words for the characters I've introduced in that chapter. And when I create the chapter after that, the list keeps moving and growing.

Because I just finished the draft of Sand and Bone, I have built up a three-book collection of proper names. This list is at the top of every file, which means I have to scroll down a little to even see the title of the chapter.

Rutejìmo Chimípu Pidòhu Shimusògo Tateshyúso Pabinkúe Jìmo Mípu Dòhu Pidòhu's Desòchu Sòchu Mapábyo Kechikìma Hyonèku Opōgyo Chimípu's Gemènyo Mènyo Pábyo Kìma Mapábyo's Zotetsūchi Rutejìmo's Hyonèku's Gemènyo's Ryayusúki Wamifuko Nèku Hána Zúchi Mépu Nenemépu Shimusògo's Desòchu's Myunédo Shimusogo Karawàbi Wàbi Tsubàyo Bàyo Tsubàyo's Tejíko Palasaid Markon Tejíko's Mifuníko Yutsupazéso Yutsupazéso's Karawàbi's Nibonyāchu Jyotekàbi Yunujyoraze Byomími nibonyāchu ranuchyahāhi Mípu's shimusogo dépa alchemical dépa's Mifukiga Chobāni Rabedájyo Badenfumi Shigáto Porlin Kamanen Kakasaba Mioshigàma Pabinkue Mikáryo Mikáryo's tazágu Palarin Mistan rikunámi Ryachuikùo Tateshyúso's Nedorómi Chidomifu Kapōra Káryo Chyábi Ganifúma Ralador Markin Kidorīsi Mifúno Mafimára pyābi Mifuno Faríhyo mizonekima chyòre Rolan Madranir Kiríshi Som figaki tòra chyóre's shikāfu Tachìra's Chobìre's Wh Tachìra Monafuma Gidon Kormar Nigímo wabōryo Faríhyo's avian's Ríhyo Gímo ryodifūne Tsudakìmo Myobùshi Funikogo Ganósho Myobùshi's Gidorámi Pyatose myofūne Pyatòse Gichyòbi Higoryo Ríshi Jacin Torabin Kishifín's Makohūni's Tsu Rojikinomi Fimúchi Rojikinòmi Rapinbun Finol Pokīmu Waryōni Nyochizoma clanless Chizoki Miyóna Kyōti Tijikóse Chyobizo Nichikōse Tifukòmi Talsir Shifáni Milifor Krum Opōgyo's banyosiōu kojinōmi kojinōmi's Nyobichóhi Mifúno's helmed Kitópi Piròma Tópi Bakóki Bakóki's Nifùni Byochína Chobìre Midoshina Kafūma Korechyoki Baroshìko Tedoku Nuchikomu Machikimu Garènu Piròma's Kitópi's Nana dépas Kosobyo Kosòbyo nocked Fidochìma Foteramàsu Foteramasu chima Tsupòbi Dimóryo Fùni petabiryōchi Chína Techyomása Mioráshi Kosòbyo's Kidóri Atefómu's Kidóri's ambushers Tikói Menodàka Tateshyuso Kos Ràchyo Záji Gichyòbi's

That's a lot of names, including a couple that were removed for pacing. Almost every single one of them isn't in the final chapter of Sand and Bone, but they were in one of the hundred or so chapters before it.

There is also no easy way of pulling out the Miwāfu names and passing them into the next story, even though those are pretty common across any story I have set in the desert.

As far as I could tell, there were only two ways of handling all those names: put it in the permanent dictionary or shovel it along the chapters as I went.

Vim

About a year ago, I found out that Vim had a setting that allowed multiple dictionaries, but I didn't want to grok a new writing environment when I had (high) hopes for getting Author Intrusion done.

The idea

This morning, I found a random link that led to another. Eventually, I came across Wcheck. It looked like it had potential for resolving my dictionary problem, so I spent an hour or so trying it out.

In the end, I couldn't get it to work. But the process of trying gave me a little epiphany on what could work. Instead of changing the library, I decided to write a wrapper around aspell that intercepted the word checks and substituted my own lookups instead.

The results fell into place pretty easily. With a local.words file in the same directory as the chapters, my newly created caspell program loads it into memory. When Emacs asks it to check a word, it looks to see if it knows about the word already and reports it as correct even if the base dictionary doesn't know about it.

Likewise, adding a word adds it to the local.words file, not the aspell personal dictionary.

But wait, there's more

The basic format of the file is pretty simple.

word    nibonyāchu
word    dépa

I originally went with “&” as the suggestion used in the pipe, but then I realized I could use readable words without too much of a problem. So, it became “word” and made things a lot easier to process.

Getting the basic lookup was a nice little rush, but then I realized that I could return suggestions. That led into writing code that gave suggestions for “incorrect” words that I want to expand into real ones.

suggest Shimu = Shimusogo, Shimusògo
suggest shimu = Shimusogo

There is a certain mindset when things are working. It is easy to move on to the next bit of code, even though each result takes longer to develop. In this case, I decided to allow one file to include another. This pulls in the words and suggestions from other files but doesn't merge them together.

command include "../../sand-and-blood/chapters/local.words"

And then I had it. Dictionaries per file, per project, per world, and any other combination that I need. I'm planning on creating them over the next couple files, but I think it will let me chain dictionaries so book two will include book one's words. And book three will add book two's, which also includes book one's. And then Raging Alone includes all three books.

And then one more

There was one more thing I ended up doing before I stopped. I used Emacs's abbrev-mode to do auto-corrections while writing. That way, I can type “Rute” and have it expand into “Rutejìmo” complete with accents. Same with various greetings, names, and locations.

As you can guess, I added that feature into the file too.

replace GS = Great Shimusogo
replace GT = Great Tateshyuso

This feature isn't built in, so I wrote a special mode for the program that takes a local.words file and creates an abbrev.el file for abbrev-mode.

$ ls
local.words
$ caspell --emacs -p .
$ ls
abbrev.el local.words

Full example

A larger example for the local.words for Raging Alone:

command include "../../sand-and-blood/chapters/local.words"

suggest Shimu = Shimusogo, Shimusògo
suggest shimu = Shimusogo

replace GS = Great Shimusogo
replace GT = Great Tateshyuso

word    Badenfumi
word    Basamiku

The entire thing is rewritten whenever I add a word to the dictionary. Each section (except for commands) is sorted so it always produces a consistent order. This makes source control easier to work with (always sort output for that reason, it saves a lot of time later).
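Since the format is one entry per line, reading it is mostly a matter of matching the leading keyword. The following is only a minimal sketch of how such a file could be parsed, not the actual caspell code; the function name parse_local_words and the shape of the returned hash are made up for illustration, and it only records include paths instead of following them.

```perl
use strict;
use warnings;

# Hypothetical parser for the local.words format shown above. It takes the
# file contents as one string and returns a hash reference with the four
# kinds of lines. (Not caspell's real code; the names are illustrative.)
sub parse_local_words
{
    my ($text) = @_;
    my %data = (words => {}, suggest => {}, replace => {}, include => []);

    foreach my $line (split(/\n/, $text))
    {
        # Skip blank lines.
        next if $line =~ /^\s*$/;

        if ($line =~ /^word\s+(\S+)\s*$/)
        {
            # A known word.
            $data{words}{$1} = 1;
        }
        elsif ($line =~ /^suggest\s+(\S+)\s*=\s*(.+?)\s*$/)
        {
            # Suggestions are a comma-separated list.
            $data{suggest}{$1} = [split(/\s*,\s*/, $2)];
        }
        elsif ($line =~ /^replace\s+(\S+)\s*=\s*(.+?)\s*$/)
        {
            # An abbreviation and its expansion.
            $data{replace}{$1} = $2;
        }
        elsif ($line =~ /^command\s+include\s+"([^"]+)"\s*$/)
        {
            # Only remember the path; a real tool would load it too.
            push @{$data{include}}, $1;
        }
    }

    return \%data;
}

my $data = parse_local_words(join("\n",
    'command include "../../sand-and-blood/chapters/local.words"',
    'suggest shimu = Shimusogo',
    'replace GS = Great Shimusogo',
    'word    Badenfumi'));

print "Badenfumi is a known word\n" if $data->{words}{Badenfumi};
```

Keeping each kind of line in its own hash also makes the sorted rewrite described above straightforward, since each section can be written out with a sort over its keys.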

Tying it all together

Once all the files are created and populated, I had to tell Emacs about the new program and how to hook up the abbreviations. This is done in the .emacs file. I have a hook for text mode that automatically configures what I need.

(defun my-text-hook ()
  (setq fill-column 99999)

  (setq
   abbrev-file-name
   (concat (file-name-directory (buffer-file-name)) "abbrev_defs.el"))
  (quietly-read-abbrev-file
   (concat (file-name-directory (buffer-file-name)) "abbrev.el"))
  (setq save-abbrevs nil)
  (abbrev-mode 1)

  (setq ispell-program-name "caspell")
  (setq ispell-personal-dictionary (file-name-directory (buffer-file-name)))

  (flyspell-mode 1)
  (visual-line-mode)
)
(add-hook 'text-mode-hook 'my-text-hook)
(add-hook 'markdown-mode-hook 'my-text-hook)

The key parts are the “ispell” lines for hooking up caspell. The “personal dictionary” uses the name of the text file ((buffer-file-name)), figures out the directory, and then passes it into caspell via the -p parameter.

The other bit is the “abbrev” lines, which look for abbrev.el in the same directory as the text file and load it. It seems to work and I'm pretty happy with the results so far.

GitHub

Like almost everything else I write, I threw it up on GitHub along with a few other programs I've been using. I'll document them eventually, but caspell is pretty functional as-is.

Git tips: Getting the first commit date of a file

2014-04-20T05:00:00Z

A few years ago (2012), I wrote a post which counted up how many words I wrote in eleven years and also broke it down by month.

While I haven't done a post on that again (I figured that most people don't care), I have been maintaining the metadata on the top of each chapter in case I ever do.

However, when I'm occasionally obsessing about writing, I don't put in the headers and I have to do it after the fact. When that happens, I have to go back and figure out when I actually started a chapter (which is my definition of the date header).

When there are only one or two files, it isn't too hard, but when it is thirty chapters of a commission, I usually try to find a program to help me figure out the dates.

I frequently use Perl for my one-off programs. The following Perl script takes one or more files and gives the date of each file's first commit (the last entry in its Git log) in a semi-useful manner.

#!/usr/bin/perl

#
# USAGE: git-first-commit-date [--bare|-b] file...
#

#
# Setup
#

# Directives
use strict;
use warnings;

# Modules
use Getopt::Long;

#
# Options
#

# --bare means don't put the filename in the line. Otherwise it will
# put the filename, followed by a colon and a space.
my $bare = 0;

GetOptions(
    "b|bare!" => \$bare,
);

#
# Go through the input files.
#

while (@ARGV)
{
    # Pull out the filename.
    my $filename = shift @ARGV;
    my $reason = "<missing>";

    if (-f $filename)
    {
        # Get the date for the file. We tell Git to only give us the
        # ISO date (https://xkcd.com/1179/) for the files using
        # --pretty=format:%ad --date=short. We use --follow to handle
        # renames. Finally, we get the last one (the earliest
        # date). --reverse didn't seem to work, so we skip that.
        $reason = `git log --follow --pretty=format:%ad --date=short "$filename" | tail -n 1`;
        chomp $reason;

        # If we have a date, use it. Otherwise say it is untracked.
        $reason = "<untracked>" if $reason =~ /^\s*$/s;
    }

    # Write out the results.
    if (!$bare)
    {
        print "$filename: ";
    }

    # Print out the reason which will be <untracked>, <missing>, or a
    # date.
    print $reason, "\n";
}

When it runs, you get something like this:

$ git-first-commit-date untracked-file missing-file chapter-00.markdown 
untracked-file: <untracked>
missing-file: <missing>
chapter-00.markdown: 2014-02-20
$

It's a little one-off program, but it solves a very specific problem for me.
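One weak spot in the script is that the filename goes through the shell inside the backticks, so a name with a quote or a dollar sign in it could break the command. A rough sketch of the same idea without the shell is below. The two function names are made up for this example, and it assumes a Unix-like system where Perl's list form of a piped open is available.

```perl
use strict;
use warnings;

# Given the text printed by "git log --pretty=format:%ad --date=short",
# return the last non-blank line, which is the earliest commit date, or
# undef when there is no output at all (an untracked file).
sub first_date_from_log
{
    my ($log) = @_;
    my @dates = grep { /\S/ } split(/\n/, $log);
    return @dates ? $dates[-1] : undef;
}

# Run Git for one file without going through the shell. The "--" keeps a
# filename that starts with a dash from being read as an option.
sub first_commit_date
{
    my ($filename) = @_;

    open(my $git, '-|', 'git', 'log', '--follow', '--pretty=format:%ad',
        '--date=short', '--', $filename)
        or return undef;
    my $log = do { local $/; <$git> };
    close($git);

    return first_date_from_log(defined $log ? $log : '');
}

# Usage: my $date = first_commit_date("chapter-00.markdown");
```

Splitting the date picking from the Git call also makes it possible to check the logic with a canned log instead of a real repository.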

Miwāfu glyphs

2012-10-29T05:00:00Z

One of the inspirations for the cover for BAM and FOTS is da Vinci. Specifically, the semi-technical writing with the careful notes written in the whitespace. They used it in Dungeons and Dragons 3.5 in much the same manner and I think it would be perfect for the theme I'm setting for both books.

In BAM, the main language is Miwāfu, though it is notionally written in English. But I decided the cover could still follow that idea of filling in the space with notes. Naturally, to do this, I want to use Miwāfu directly instead of writing in English. That means I have to create the glyphs for Miwāfu along with enough of the language that it is reasonably accurate.

Introduction to Miwāfu

Glyph Inspiration

I find Tolkien's Elvish and Sanskrit-based languages to be beautiful. Sadly, the attempt to make it closer to Sanskrit failed miserably, but after a few weeks of fumbling, I came up with something I'm pretty happy with.

This does take inspiration from a Tengwar version of Lojban. Mostly, it uses a fair amount of diacritics for the vowels: the smaller marks on the top are the six vowels, a, e, i, o, u, and y.

I also tried to think about how it would be written. In this case, I wanted something flowing as if someone was tracing in the sand or rocks. In my case, I used the steamy door after my shower.

This is a fairly thick font. I could make a narrower version, but until I see it on the cover, I won't know how thick I need it to be. I also wanted something bold to reinforce the idea of drawing in the sand with a finger.

Initial Glyphs

Below are the initial consonants I came up with for Miwāfu:

I did decide that the voiced consonants are opposites of their unvoiced versions. And these are the vowels (including accented versions) for the "w" consonants.

Drawing These

When I started working on these, I used a graph pad and a pencil. It took a few weeks of just playing around with symbols, trying to get the right combination of appearance without too much duplication.

One notable aspect of written language is being able to identify characters easily. I needed to make sure that characters didn't turn into each other if you drew the beginning slightly curved or had a little flare at the end. There were quite a few times when I said "great, I'm done" and then realized two characters were pretty much identical if written in a hurry.

Once I had the symbols, I used Inkscape to draw them out as vectors. Each vowel, accent, and consonant was put on a different layer. This made it easy to layer them on top of each other, to make sure they were roughly balanced with each other.

Trying it out

Now, manually setting out the characters could be a royal pain. Fortunately for me, I'm perfectly willing to spend an hour writing a program to save myself two. So I banged up a little Perl program that took the SVG, split out each of the layers into a separate PNG image, and threw it into a directory.

A second Perl program took a text file with example text and creates an image using ImageMagick of the various glyphs. That way, I can see how the language looks when written out and make changes.

Making a Font

Since I'm intending to put these glyphs on the cover of BAM, I want to make it a proper font. This will let me do a few things that the Perl program can't easily do, specifically kerning.

Kerning is the adjustment of space between characters. In the above example, "akope" (which means "additional object of sentence, negated"), the "a" and "ko" could be brought closer so the "ko" is in the curve of the "a". This is kerning, and it makes the spacing look better.

Creating the font is also fun and one of my other interests. I want to create a few fonts for this fantasy world I'm building; this just happens to be the first one I'm doing (because of BAM).

Having the font also means I can create it as a web font. This will let me put it on the wiki-like site for the language to help document it. This is getting rather important since I'm quickly reaching six different works that have this naming language.

Finding sentences that start off the same with Perl

2012-06-01T05:00:00Z

So, even though I said I'm done editing FOTS, I'm not. I feel like I should be, but every time I go to read it, I see more things wrong with it. Plus, there is the entire word count problem (I aimed for 100k and got 124k). So, over my lunch breaks I'm editing and working on reducing the word counts.

Before my vacation, I was working on chapters 16-12 (I work backwards a lot). The words were going down, but something didn't feel right. That night (last Thursday), I realized one thing I was doing wrong. The sentence structure was a bit too consistent in places. In most paragraphs, I had a nice mix of sentences but occasionally I would get a "She did. She did. She did."

I needed to review them, but in the walls of text, I couldn't see them easily. Which means one thing... write a program. This is a simple program, so it doesn't have to be fancy. All I wanted to do was go through every paragraph, split it into sentences, then print out the first five words of each along with the sentence length.

At the end of each paragraph, I added a sentence count, median length, and the std. deviation. I use the std dev to see if I have a consistent sentence length.

For example, take this paragraph:

She continued until she felt the sharp edge of her anger ebb away. She sniffed and wiped the tears from her face. "miw: I hate him," she sobbed. Her eyes scanned the empty forest, half expecting to see Pahim chasing after her. She knew that she needed him, at least to get back to the Boar Hunt Inn. Maybe Falkin would put her up for the night.

In the middle of the document, this one was hard to pick up at first. When run through the program, I get this output:

74: She continued until she felt (13)
74: She sniffed and wiped the (9)
74: "miw: I hate him," she (6)
74: Her eyes scanned the empty (14)
74: She knew that she needed (16)
74: Maybe Falkin would put her (9)
-- 6 sentences, 11.2 median, 3.8 stddev

From here, I can see that I had three sentences that all start with "She" and a past-tense verb. It makes it a lot easier to notice a pattern, at least with the beginning of sentences.

Likewise, in a different chapter, I have this:

126: Kanéko shook her head and (6)
126: "No, don't let him sing (5)
126: Whatever you do, don't sing." (5)
-- 3 sentences, 5.3 median, 0.6 stddev

There are three sentences in the paragraph and all of them have nearly the same length. This is shown by the small standard deviation. I'm okay with this specific case, but it at least points out that there might be something to pay attention to.
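One thing worth knowing when reading these summaries: the column labelled "median" is calculated in the program as a plain average (total divided by count), which is why the first example shows 11.2. A true median would be the middle value of the sorted counts. The sketch below shows both for the first example's sentence lengths; the names mean and true_median are only for illustration and are not part of the program.

```perl
use strict;
use warnings;

# Arithmetic mean: what the summary line's "median" column actually holds.
sub mean
{
    my $total = 0;
    $total += $_ foreach @_;
    return $total / scalar(@_);
}

# True median: the middle value of the sorted list, or the average of the
# two middle values when the list has an even number of entries.
sub true_median
{
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int(@sorted / 2);
    return @sorted % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Sentence lengths from the first example paragraph.
my @counts = (13, 9, 6, 14, 16, 9);
printf("mean %.1f, median %.1f\n", mean(@counts), true_median(@counts));
# prints "mean 11.2, median 11.0"
```

For spotting monotonous paragraphs, either number works about as well; the standard deviation is the figure doing the real work.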

Not sure if it will help in the long run, but at least it gives me an idea of where to focus on this part of the editing.

And here is the Perl program I used to generate this. I just pass the chapter into the program and get the output.

$ ../bin/sentence-prefix chapter-11.txt  | less

#!/usr/bin/perl

#
# Setup
#

# Directives
use strict;
use warnings;

#
# Parse the input
#

while (<>)
{
    # Ignore commented lines.
    next if /^#/;

    # If the line is blank, just output a line break.
    if (/^\s*$/)
    {
        print "\n";
        next;
    }

    # Clean up the line and get rid of the newlines.
    s/^\s*//sg;
    s/\s*$//sg;

    # First split the line on the periods. It isn't perfect, but
    # should give a rough sentence start.
    my @sentences = split(/\.\s+/);
    my @counts = ();

    foreach my $sentence (@sentences)
    {
        # For each sentence, we split apart the spaces to get the words.
        my @words = split(/\s+/, $sentence);

        # Figure out how many words are in the sentences.
        my $count = scalar(@words);
        push @counts, $count;

        # We only care about the first five words in the sentence.
        my @prefix = splice(@words, 0, 5);

        # Print out the sentence and its length. We also include the
        # line number so we can easily jump to that line.
        print "$.: ", join(" ", @prefix), " ($count)\n";
    }

    # Calculate the std. dev. for the counts. If there is a small
    # deviation, then we have too much of a consistent sentence length.

    # Put in a summary for the paragraph.
    my $stddev = stddev(@counts);

    if ($stddev > 0)
    {
        printf(
            "-- %d sentences, %.1f median, %.1f stddev\n",
            scalar(@counts),
            median(@counts),
            $stddev);
    }
}

sub median
{
    # Get a total of all the counts.
    my $total = 0;

    foreach (@_)
    {
        $total += $_;
    }

    # Average it out by dividing by the number of elements. Despite
    # the name, this is the arithmetic mean.
    my $average = $total / scalar(@_);
    return $average;
}

sub stddev
{
    # StdDev doesn't mean anything with fewer than two elements, so
    # just return 0.
    return 0 if @_ <= 1;

    # Figure out the average value.
    my $average = median(@_);

    # Figure out the total of the squared differences from the average.
    my $total = 0;

    foreach (@_)
    {
        $total += ($average - $_) ** 2;
    }

    # Calculate the StdDev and return it.
    my $std = ($total / (@_ - 1)) ** 0.5;
    return $std;
}

So, if anyone is having trouble with unvarying sentence structure, this might help you focus on the paragraphs that need the most work.

An obsession with data (a.k.a. "writers write")

2012-05-01T05:00:00Z

Well, that took a bit longer than I expected, but I've managed to parse the Git and Subversion logs and turn them into a nice intermediate (I said "normalized" too much last post) format and then wrote another tiny little program to tag all my stories.

All that work just to figure out the answer:

How many words have I written?

Now, this answer isn't exact nor entirely accurate. It doesn't include the four complete rewrites of Flight of the Scions (a.k.a. Wind, Bear, and Moon). It also doesn't include the 100k words I pulled out of Flight for KK. Or the re-writes, struggles, and everything else. It also doesn't include the two novels or anything else I wrote in high school including my two books of poetry.

What it does take is the "final" version of every story, chapter, and commission (I actually kept good records of that) I've written since 2001, and it gives me an idea of how much I've written.

1,784,085 words from a total of 195 stories and 228 chapters in 7 novels.

That seems like a lot. I figured out that number with the following Perl program.

#!/usr/bin/perl

#
# Setup
#

# Directives
use strict;
use warnings;

#
# Go through all the files in the directory from the first argument.
#

my %months = ();
my $total = 0;

open FIND, "find '$ARGV[0]' -type f -name '*.txt' |"
    or die "Cannot open find ($!)";

while (<FIND>)
{
    chomp;
    my $file = $_;

    # Make sure the file has a date field. We slurp the entire file
    # by temporarily clearing the record separator.
    my $sep = $/;
    $/ = undef;
    open FILE, "<$file" or die "Cannot open $file ($!)";
    $_ = <FILE>;
    close FILE;
    $/ = $sep;

    next unless ($_ =~ m@\* Date: (\d+)-(\d+)-\d+@);

    my $year = $1;
    my $month = $2;

    # Figure out how many words.
    my $word_output = `wc -w "$file"`;

    next unless $word_output =~ m@^\s*(\d+)\s+@s;

    my $words = $1;

    # Print the file.
    print STDERR "Processing $file ($year-$month) [$words]\n";

    my $key = "$year-$month";

    $months{$key} += $words;
    $total += $words;
}

close FIND;

#
# Add in the zeros.
#

foreach my $y (qw(2007 2008 2009 2010 2011))
{
    foreach my $m (qw(01 02 03 04 05 06 07 08 09 10 11 12))
    {
        $months{"$y-$m"} += 0;
    }
}

#
# Write out the months and dates.
#

foreach my $mkey (sort(keys(%months)))
{
    my $words = $months{$mkey};
    print "$mkey\t$words\n";
}

print "\nTotal\t$total\n";

I took the output of that program and threw it into Google Docs so I could chart it over time.

I did fudge the epoch date for Subversion since I had 403,201 words when I converted over to Subversion. In the above chart, I broke it into four months of 100k and added it there. I also had two dates ahead of then because I could figure out a rough date for those from the contracts I got when I sold them.

As you can tell from the chart, I've had a couple months of writing 100k+ words. Those were the good writing months. The highest was March 2007 when I wrote 158,497 words in a single month. I also noticed that around July is my major writing month, year after year.

It's kind of cool, if only to see where I had "bad" months (there were a number of zeros) and good months for writing. But, more importantly, the red line shows the total words over time. Writing isn't about belting out a 50k word novel in a single month or (roughly) three of them. It isn't about getting out a single piece and being done with it. For me, writing is about just continuing to do it. Writing whenever I can, whatever I can. Like compound interest, the individual stories and chapters pale under the slow accumulation of writing.

And one could hope that becoming a better writer is part of that running total of all words.

Now, what other conclusions can I take from that chart?

Nothing.

I'm actually serious about that. A million words doesn't make me an expert. I can't tell you if I did the mythical 10,000 hours of writing because at 60 words per minute (half my maximum), that's only 500 hours and I'm very sure I've written for more than 500 hours. Belting out words doesn't make me a great, or even a good writer. It just means I've written since 2001 and I apparently enjoy the process enough to keep doing it.

An obsession with data (normalization)

2012-04-29T05:00:00Z

I had someone ask me why I'm writing these blog posts. It started with just a question I wanted answered (how much have I written), but it led into something more. Over the years, I've ended up parsing a lot of data. It was my job for twenty years to take customers' files for any operating system, any database, and just about any format, and create normalized and consolidated reports. I wrote programs to take normalized data and produce more data which led into the rest of our business processes.

Parsing data isn't really rocket science. For me, most of it means taking different input (say the Git and Subversion logs) and normalizing them into a common format. And then using that common format to process data.

This post is how I took the output from the last two posts and came up with the earliest date for any significant file in my current repository. Having them in a common format (I threw the output into a file called combined.files) means I don't have to care if they are from Subversion or Git; it is just a file.

$ cat combined.files
flight-of-the-scions/fots-01.odt 2010-01-22
flight-of-the-scions/fots-01.txt 2010-12-30
flight-of-the-scions/chapters/fots-01.txt 2011-08-19
flight-of-the-scions/chapters/chapter-01.txt 2011-11-03
$

In the above example, all of these files are effectively the same thing. The only difference is where I put it in the file system and which format I used. I decided not to write the Git and Subversion parsers to handle renames, mainly because I knew I had to do some of it manually anyway (and writing the rename parsing would take far more effort).

You might notice that I removed the "exists" column in the middle. I ended up not needing it because I wrote a little program that took the input and told me if a file was missing.

#!/usr/bin/perl

# Usage: perl test-files.pl combined.csv

# Go through all the files in the combined and pull out the first
# column (the filename).
while (<>)
{
    # Clean up the line.
    chomp;

    # Pull out the first column.
    my @p = split(/\t/);
    my $f = $p[0];

    # Test to see if the file exists.
    if (! -f $f)
    {
        # The file doesn't exist, so complain.
        print STDERR "Can't find $f\n";
    }
}

This was a one-off program and sadly I did a horrible job of naming variables and writing comments, but basically it just takes the combined.csv file and tells me which files don't exist. That tells me whether each listed file exists, and I can use it to figure out which ones still have to be renamed, found, or merged.

As an interesting side note, I actually found six stories (mostly flash) that I had lost in my current repositories. Fortunately, I had Subversion to go to and pull them out to add to my list.

Using the test-files.pl program above, I could easily rename and move files around. During the process, I quickly got into the case where I had lots of duplicates of the same file with different dates.

$ cat combined.files
flight-of-the-scions/chapters/chapter-01.txt 2010-01-22
flight-of-the-scions/chapters/chapter-01.txt 2010-12-30
flight-of-the-scions/chapters/chapter-01.txt 2011-08-19
flight-of-the-scions/chapters/chapter-01.txt 2011-11-03
$

To cut down on the information, I wrote another quick program to remove all but the earliest date. Because I always use ISO dates (yyyy-mm-dd), the file can be sorted alphabetically and only the first line for each file kept.

#!/usr/bin/perl

# Keep track of files we've already seen.
my %seen = ();

# Go through the input.
while (<>)
{
    # Clean up the input line.
    chomp;

    # Split out the columns.
    my @p = split(/\t/);

    # If we already saw it, tell the user we skipped it. If not, then
    # print it out and add it to the hash so we don't print it again.
    if (exists $seen{$p[0]})
    {
        print STDERR "Skipping $p[0]\n";
    }
    else
    {
        print "$_\n";
        $seen{$p[0]} = 1;
    }
}

$ clear;sort combined.csv | perl remove-duplicates.pl > a && mv a combined.csv
Skipping flight-of-the-scions/chapters/chapter-01.txt
Skipping flight-of-the-scions/chapters/chapter-01.txt
Skipping flight-of-the-scions/chapters/chapter-01.txt
$

(Note: I pipe to "a" and then move it over to combined.csv because most shells, including Bash and PowerShell, blow away the input if it is also the target of the redirected output. In other words, if I don't, it erases combined.csv and I have to start over... which I did, twice.)
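The sort-then-skip approach depends on the input being sorted first. A variation that doesn't need the sort is to remember the smallest date for each file as the lines go by. This is only a sketch under the same assumption as above (tab-separated filename and ISO date on each line), and earliest_dates is a name made up for the example.

```perl
use strict;
use warnings;

# Take already-chomped lines of "filename<TAB>date" in any order and
# return one line per filename with its earliest date, sorted by filename.
sub earliest_dates
{
    my %earliest = ();

    foreach my $line (@_)
    {
        my ($file, $date) = split(/\t/, $line);
        next unless defined $date;

        # ISO dates (yyyy-mm-dd) compare correctly as plain strings.
        if (!exists $earliest{$file} || $date lt $earliest{$file})
        {
            $earliest{$file} = $date;
        }
    }

    my @results = ();

    foreach my $file (sort(keys(%earliest)))
    {
        push @results, $file . "\t" . $earliest{$file};
    }

    return @results;
}

# Usage:
#   chomp(my @lines = <>);
#   print "$_\n" foreach earliest_dates(@lines);
```

Because everything is held in memory until the end, the output can be written to a temporary file and renamed afterwards, which sidesteps the redirect problem in the same way as the "a" file.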

Using these two programs together, I got into a nice tight development cycle that let me make changes and reduce the data I was working with. I used an editor with auto-revert-mode (Emacs in this case) but I could have used Notepad++ or any other editor that would refresh the buffer on a file change. Once I made a change, I re-ran the above program and let it refresh in the buffer. And then I made another set of changes and repeated.

The final program I used was to tag my files with the actual dates. This is a bit more complicated, but my text-based file system uses a bullet list to include metadata.

$ head flight-of-the-scions/chapters/chapter-01.txt -n 3
= The Water Screw

> In Miwāfu, they call those who cannot use magic miw: bachir?ma. Translated into Lorban, it means “cursed to be forever mundane.” – Awakened Magic, Dastor Malink
$

I want to put a "* Date: 2010-01-22" right after the title line. The corresponding line in the combined file is:

$ grep flight-of-the-scions/chapters/chapter-01.txt combined.csv
flight-of-the-scions/chapters/chapter-01.txt 2010-01-22
$

So, I wrote a little Perl program that takes the combined.csv file, picks out all the files from the first column, and adds a date line into each file if it is missing. This way, a single run means that every file listed in the combined.csv will have the date I figured out from parsing the Git and Subversion logs.

#!/usr/bin/perl

#
# Setup
#

# Directives
use strict;
use warnings;

#
# Parameters
#

# The first parameter is the directory.
my $dir = $ARGV[0];
die "USAGE: dir input" unless -d $dir;

# The second parameter is the combined file.
my $input = $ARGV[1];
die "USAGE: dir input" unless -f $input;

#
# Slurp up the contents of the file.
#

my %files = ();

open FILES, "<$input" or die "Cannot open input $input ($!)";

while (<FILES>)
{
    chomp;
    my @p = split(/\t/);
    $files{$p[0]} = $p[1];
}

close FILES;

#
# Get a list of all the files in the directory.
#

open PIPE, "find '$dir' -type f |" or die "Cannot open pipe ($!)";

while (<PIPE>)
{
    # Clean up the line.
    chomp;

    # Ignore . files.
    next if m@/\.@;
    next unless m@\.txt$@;

    # Trim off the leading characters.
    s@^\./@@;

    # See if the file exists.
    my $file = $_;

    if (exists $files{$file})
    {
        # We found the file.
        my $date = $files{$file};
        print "HIT  $date $file\n";

        # Pull out the entries so we can report what was missing after
        # we're done processing.
        delete $files{$file};

        # Open up the file and read in the metadata section, looking
        # for an already existing date.
        my $found_date = 0;

        open FILE, "<$file" or die "Cannot open $file ($!)";

        while (<FILE>)
        {
            # Clean up the line.
            chomp;

            if (m@^\* Date:\s*(.*?)$@)
            {
                my $old_date = $1;

                $found_date = 1;

                if ($date eq $old_date)
                {
                    # Nothing to do, we're good.
                    last;
                }

                print "       $1\n";
            }
        }

        close FILE;

        # If we didn't find the date, we need to add it into the file.
        unless ($found_date)
        {
            my $need_date = 1;

            print "       Date: $date\n";

            open IN, "<$file" or die "Cannot open $file ($!)";
            open OUTPUT, ">tmp" or die "Cannot open tmp ($!)";

            while (<IN>)
            {
                print OUTPUT $_;

                # Put the date right after the title line.
                if ($need_date && $_ =~ /^= /)
                {
                    print OUTPUT "* Date: $date\n";
                    $need_date = 0;
                }
            }

            close IN;
            close OUTPUT;

            rename($file, "$file.bak");
            rename("tmp", $file);
        }
    }
    else
    {
        # Print the file.
        #print "SKIP $_\n";
    }
}

close PIPE;

# Write out any remaining files.
foreach my $file (sort(keys(%files)))
{
    print "MISS $file\n";
}

You'll notice that the longer my program, the more I comment. I also had some debugging code commented out which I used to verify everything worked properly. Once I finished running this, every single story and chapter I've written was given a date line. I spot-checked them (there are quite a few in total) and checked the changes into the Git repository.
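
The date-insertion step is easier to reason about as a function that works on a string instead of on files. What follows is only a minimal sketch under assumptions: the name `add_date_line` is hypothetical and not part of the program I ran, and it assumes the same layout the program looks for (a title line starting with "= " and metadata lines such as "* Date:").

```perl
use strict;
use warnings;

# Hypothetical helper: return $text with a "* Date:" line placed right
# after the first title line (a line starting with "= "). If the text
# already has a "* Date:" line, it is returned unchanged.
sub add_date_line
{
	my ($text, $date) = @_;

	# Leave files alone that already carry a date line.
	return $text if $text =~ /^\* Date:/m;

	my $need_date = 1;
	my @out = ();

	# Splitting on /^/ keeps the newline on the end of every line.
	foreach my $line (split(/^/m, $text))
	{
		push @out, $line;

		if ($need_date && $line =~ /^= /)
		{
			# Make sure the title ends with a newline before adding.
			$out[-1] .= "\n" unless $out[-1] =~ /\n\z/;
			push @out, "* Date: $date\n";
			$need_date = 0;
		}
	}

	return join("", @out);
}

my $chapter = "= Chapter 15\n* Author: Me\n\nText.\n";
print add_date_line($chapter, "2010-06-20");
```

Keeping the file handling out of the function means the temporary file and the ".bak" rename stay in one place, and the insertion rule can be tried against sample text without touching a real chapter.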

This also means that my next post can finally answer the question "how many words have I written?" (with certain caveats). Because I also dated everything with a day, month, and year, I will also be able to graph my rough writing output over time.

An obsession with data (Subversion edition)

2012-04-28T05:00:00Z

It took me a number of days to clean up the output from the Git repository parse. I could have written a parser that handled the renames through Git but it was easier just to juggle data and handle the format changes which were never a Git rename (for example when I went from an XML-based system to Open Document and then again to a Creole-based system).

However, there was an interesting problem: the day 2010-06-26 showed up a lot.

$ wc -l git.csv
606 git.csv
$ grep 2010-06-26 git.csv | wc -l
421
$

Specifically, 421 of the 606 entries had the same date. It is pretty obvious what happened on that day: I imported my old Subversion repository into Git because I fell in love with Git. However, in the quest to track down the dates of the files, I had to go back into the Subversion repository. Fortunately, I still have the Subversion repository on my web host, so I could do much the same thing.
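
Rather than guessing which date to grep for, the same check can be run for every date at once. This is a minimal sketch with assumptions: `count_by_date` is a hypothetical name, and it assumes each line is tab-separated with the date in the last column, which may not match the exact layout of git.csv.

```perl
use strict;
use warnings;

# Hypothetical helper: tally how many report lines carry each date.
# Assumes tab-separated lines with the date in the last column.
sub count_by_date
{
	my @lines = @_;
	my %count = ();

	foreach my $line (@lines)
	{
		chomp $line;
		next if $line =~ /^\s*$/;

		my @p = split(/\t/, $line);
		$count{$p[-1]}++;
	}

	return %count;
}

my @report = (
	"Makefile\t1\t2010-06-26\n",
	"best-enemies.txt\t0\t2010-08-25\n",
	"brickpunk.odt\t0\t2010-06-26\n",
);
my %count = count_by_date(@report);

# List the busiest dates first; a large spike suggests a bulk import.
foreach my $date (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count)
{
	print "$count{$date}\t$date\n";
}
```

A date that accounts for most of the lines, like the import day here, stands out at the top of the list.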

Subversion's log output is pretty simple:

$ svn log -v
------------------------------------------------------------------------
r1072 | dmoonfire | 2010-06-20 14:39:48 -0500 (Sun, 20 Jun 2010) | 2 lines
Changed paths:
M /moonfire/stories/flight-of-the-scions/fots-15.txt

Tweaking chapter 15.

Now, I was getting a bit tired of parsing, so I didn't try to optimize it like I did with the Git version. However, I wanted the same output, so I pulled out the Git Perl program I used last time and just tweaked it to handle both the different date-time line and the fact that file lines start with "M", "A", or "D".

#!/usr/bin/perl

#
# Setup
#

# Directives
use strict;
use warnings;

#
# Input Process
#

# Since this was all in one Subversion tree, we just pipe "svn log" to
# the input.
my $last_timestamp = "undef";
my %files = ();

while (<STDIN>)
{
	# Clean up the line and ignore blanks.
	chomp;
	next if /^\s*$/;

	# In general, we will have two types of lines. One is in a
	# timestamp and the other is the name of the file.
	# r\d+ | user | 2010-07-10 21:45:57 -0500 (Sat, 10 Jul 2010) | 1 line
	if (/(\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+)/)
	{
		# We only care about the date of the check-in.
		$last_timestamp = "$1-$2-$3";
	}
	else
	{
		# For everything else, we get a filename. There are a few
		# things that we frequently ignore, such as hidden files
		# (start with a period).
		s/^\s+//sg;

		unless (/^(M|A|D)\s+(.*?)$/)
		{
			# An unknown line, so skip it.
			next;
		}

		# We add the file and the current timestamp to the
		# hash. Since we replace with each one, and `svn log` goes
		# backwards in time, the last time we see the file is the
		# point it was first added to the repository.

		# Strip off any trailing "(from ...)" copy information.
		my $file = $2;
		$file =~ s/\s*\(.*?$//;
		$files{$file} = $last_timestamp;
	}
}

# Now that we are done parsing, we output the merged results.
open REPORT, ">svn.files" or die "Cannot write svn.files ($!)";

foreach my $file (sort(keys(%files)))
{
	# Pull out the date.
	my $date = $files{$file};

	# Keep track if this file exists.
	my $exists = 0;
	# They don't exist in this case
	#$exists = 1 if -f "$dir/$file";

	#print "$file\t$date\n";
	#print "$date\t$exists\t$file\n";
	print REPORT "$file\t$exists\t$date\n";
}

close REPORT;

The output of the above program is in the same format as the Git version:

$ cat svn.files
moonfire/trial-by-steam.txt 0 2007-09-03
moonfire/wind-bear-moon.txt 0 2007-03-08
$
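
The line matching at the heart of the Subversion parser can also be pulled out into a small function, which makes it easier to try against sample lines. This is a minimal sketch and not the program above: the name `parse_svn_line` is hypothetical, it assumes verbose log output (`svn log -v`) so that changed paths are listed, and it accepts all four action letters Subversion uses (A, D, M, and R) rather than only the three my program checks.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one line of "svn log -v" output.
# Returns ("date", "YYYY-MM-DD") for a revision header,
# ("path", $action, $path) for a changed-path line, and an empty
# list for everything else (separators, "Changed paths:", messages).
sub parse_svn_line
{
	my ($line) = @_;
	chomp $line;

	# r1072 | user | 2010-06-20 14:39:48 -0500 (Sun, 20 Jun 2010) | 2 lines
	if ($line =~ /^r\d+ \| .*? \| (\d{4}-\d\d-\d\d) \d\d:\d\d:\d\d/)
	{
		return ("date", $1);
	}

	# "   M /path/to/file" or "   A /new (from /old:12)"
	if ($line =~ /^\s*([MADR])\s+(\/.*?)\s*(?:\(from .*\))?$/)
	{
		return ("path", $1, $2);
	}

	return ();
}

my @sample = (
	"r1072 | dmoonfire | 2010-06-20 14:39:48 -0500 (Sun, 20 Jun 2010) | 2 lines\n",
	"Changed paths:\n",
	"   M /moonfire/stories/flight-of-the-scions/fots-15.txt\n",
	"\n",
	"Tweaking chapter 15.\n",
);

foreach my $line (@sample)
{
	my @parts = parse_svn_line($line);
	print join("\t", @parts), "\n" if @parts;
}
```

Anchoring the header pattern on the leading revision number is a bit stricter than matching any date-like text, so a commit message that happens to contain a timestamp is less likely to be mistaken for a header.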

Now, Subversion wasn't my first source control system, but I don't have any history from before it. Like the Git initial check-in, there is an epoch where all files start; in this case, 2006-10-15. This is quite a few years after I started writing, but before that was a CVS repository and, before that, file copying. Fortunately, there aren't too many chapters before the Subversion epoch, which is good since I would have to do some serious work to figure out the earlier dates.

$ grep 2006-10-15 svn.files | wc -l
83
$ wc -l svn.files
702 svn.files
$

The cleanup tasks for both Git and Subversion are the same: mainly figuring out which filenames are actually the same file through the various renames and file format changes. The next post will cover a few techniques I used to clean up the data while figuring all of this out.
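
Format changes are the easy half of that cleanup, because the filename only differs by its extension. A minimal sketch of grouping those candidates follows; `group_by_stem` is a hypothetical name, and real renames (where the stem itself changed) would still need to be matched by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: group filenames that differ only by their final
# extension, e.g. "brickpunk.odt" and "brickpunk.txt".
sub group_by_stem
{
	my @names = @_;
	my %groups = ();

	foreach my $name (@names)
	{
		# Strip the final extension, if any, to get the stem.
		(my $stem = $name) =~ s/\.[^.\/]+$//;
		push @{ $groups{$stem} }, $name;
	}

	return %groups;
}

my %groups = group_by_stem(
	"another-werewolfs-tail.odt",
	"another-werewolfs-tail.txt",
	"brickpunk.odt",
);

# Only stems with more than one file are candidates for merging.
foreach my $stem (sort keys %groups)
{
	my @members = @{ $groups{$stem} };
	print "$stem: @members\n" if @members > 1;
}
```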

An obsession with data (Git version)

2012-04-24T05:00:00Z

In the process of thinking about doing a post about my (somewhat depressing) writing career, I got curious about how much I write. At work, this came up and so I did a quick word count (`wc -w`) on my repositories.

$ find . -name '*.txt' | xargs wc -w
4288655 total
$

There is no way I've written 4.2 million words. So, I got curious about how many words are actually in there, and also when I wrote them. Now, as much as I'm fond of data, I actually haven't tracked when I finished a story.

I realized since I'm obsessed with source control systems, I could get a rough estimate of when I posted something. In general, I don't take more than 1-2 weeks to finish anything I've started, so I can say that I finished a story about the time I started it (e.g., checked it into source control).

Currently, I use Git to track my files. I also previously used Subversion, so I'll get to that in the next few days.

Git can list all the commits inside the repository:

$ git log -1
commit 4168500550f4c27ed2be6e1fa97d846df366e483
Author: Dylan R. E. Moonfire
Date: Sun Apr 22 20:53:42 2012 -0500

Working on edits.
$

Two things. One, I am terrible at check-in comments when I write. I use this to make checkpoints on my story and I don't really go back. So, usually I just have a one-line comment like "Worked on chapter 3." Two, I'm using "-1" to limit this to one entry for illustration purposes.

The above output is really verbose. We don't care about the commit hash, who did it (it is only me), or the commit message. Fortunately, Git can control the output with the "--pretty" option. In this case, we are using "%ai", which gives us a pretty little ISO-style timestamp.

$ git log -1 --pretty=%ai
2012-04-22 20:53:42 -0500
$

Pretty good, except that it doesn't really show the files that changed. Well, we can fix that with the "--name-only" option.

$ git log -1 --pretty=%ai --name-only
2012-04-22 20:53:42 -0500

high/friend-guard.txt
nr-guard.txt
$

There we go. We have the date we made a change and which files we changed (or added or deleted). Of course, not really in a useful form. I'm fond of Perl programming for one-off programs, so I banged this up:

#!/usr/bin/perl

# git log --name-only --pretty=%ai

#
# Setup
#

# Directives
use strict;
use warnings;

#
# Directory Parsing
#

# Go through all the files in the command-line arguments.
while (@ARGV)
{
	# If it isn't a directory, we don't care.
	my $dir = shift @ARGV;
	$dir =~ s@/$@@sg;

	if (! -d $dir)
	{
		print STDERR "Ignoring $dir (not a directory)\n";
		next;
	}

	my $git_dir = "$dir/.git";

	if (! -d $git_dir)
	{
		print STDERR "Ignoring $dir (no .git inside)\n";
		next;
	}

	# We're processing this directory.
	print STDERR "Processing $dir\n";

	# Set the GIT_DIR and GIT_WORK_TREE so we don't have to move into
	# that directory.
	$ENV{GIT_WORK_TREE} = $dir;
	$ENV{GIT_DIR} = $git_dir;

	# We want to build up a log of the entire repository. We want to
	# know each of the files and date that they were checked in.
	my $last_timestamp;
	my %files = ();

	open GIT, "git log --name-only --pretty=%ai |"
		or die "Cannot open Git for $dir ($!)";

	while (<GIT>)
	{
		# Clean up the line and ignore blanks.
		chomp;
		next if /^\s*$/;

		# In general, we will have two types of lines. One is in a
		# timestamp and the other is the name of the file.
		if (/^(\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+)/)
		{
			# We only care about the date of the check-in.
			$last_timestamp = "$1-$2-$3";
		}
		else
		{
			# For everything else, we get a filename. There are a few
			# things that we frequently ignore, such as hidden files
			# (start with a period).
			next if /^\./;
			next if m@/\.@;

			# We add the file and the current timestamp to the
			# hash. Since we replace with each one, and `git log` goes
			# backwards in time, the last time we see the file is the
			# point it was first added to the repository.
			$files{$_} = $last_timestamp;
		}
	}

	close GIT;

	# Now that we are done parsing, we output the merged results.
	open REPORT, ">$dir.files" or die "Cannot write $dir.files ($!)";

	foreach my $file (sort(keys(%files)))
	{
		# Pull out the date.
		my $date = $files{$file};

		# Keep track if this file exists.
		my $exists = 0;
		$exists = 1 if -f "$dir/$file";

		#print "$file\t$date\n";
		#print "$date\t$exists\t$file\n";
		print REPORT "$file\t$exists\t$date\n";
	}

	close REPORT;
}

The output of this program is put into a ".files" file named after the directory.

$ perl git-create.pl moonfire
Processing moonfire
$ head moonfire.files
Makefile 1 2010-06-26
another-werewolfs-tail.odt 0 2010-06-26
another-werewolfs-tail.txt 0 2012-04-22
best-enemies.txt 0 2010-08-25
best-laid-plans.odt 0 2010-06-26
best-of-enemies.odt 0 2010-06-26
brickpunk.odt 0 2010-06-26
change-of-honor.odt 0 2010-06-26
$

Now, this catches all the various versions of a file as I (constantly) renamed files, changed formats, and basically mucked around. I'm not afraid of shifting things around, and the report reflects that. I put in the exists column (0 or 1) so I know which file is actually there versus the (constant) renames.

In the above example, I would combine "best-enemies.txt" and "best-of-enemies.odt" and take the earliest date. A bit of manual work isn't too bad for this project. But there you go: the first time each file shows up in a Git repository.
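
Once I have decided by hand which names belong together, taking the earliest date is mechanical. This is a minimal sketch under assumptions: `earliest_date` is a hypothetical helper, and it relies on the dates being in YYYY-MM-DD form so that they sort correctly as plain strings.

```perl
use strict;
use warnings;

# Hypothetical helper: given a reference to a hash of
# filename => "YYYY-MM-DD" and a list of names known to be the same
# story, return the earliest date (or undef if none are in the hash).
sub earliest_date
{
	my ($dates, @names) = @_;

	# Look up each name and drop the ones we have no date for.
	my @found = grep { defined $_ } map { $dates->{$_} } @names;

	# ISO dates sort chronologically as plain strings.
	@found = sort @found;

	return $found[0];
}

my %dates = (
	"best-enemies.txt"    => "2010-08-25",
	"best-of-enemies.odt" => "2010-06-26",
);

my $first = earliest_date(\%dates, "best-enemies.txt", "best-of-enemies.odt");
print "best-enemies\t$first\n";
```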
