Finding sentences that start off the same with Perl

So, even though I said I'm done editing FOTS, I'm not. I feel like I should be, but every time I go to read it, I see more things wrong with it. Plus, there is the whole word count problem (I aimed for 100k and got 124k). So, over my lunch breaks I'm editing and working on reducing the word count.

Before my vacation, I was working on chapters 16-12 (I work backwards a lot). The word count was going down, but something didn't feel right. That night (last Thursday), I realized one thing I was doing wrong: the sentence structure was a bit too consistent in places. In most paragraphs, I had a nice mix of sentences, but occasionally I would get a "She did. She did. She did."

I needed to review them, but in the walls of text, I couldn't see them easily. Which means one thing... write a program. This is a simple program, so it doesn't have to be fancy. All I wanted to do was go through every paragraph, split it into sentences, then print out the first five words of each sentence along with its length in words.

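At the end of each paragraph, I added a sentence count, the average sentence length (the output labels it "median", but it is really the mean), and the standard deviation. I use the standard deviation to see whether the sentence lengths in a paragraph are too consistent.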

For example, take this paragraph:

She continued until she felt the sharp edge of her anger ebb away. She sniffed and wiped the tears from her face. "miw: I hate him," she sobbed. Her eyes scanned the empty forest, half expecting to see Pahim chasing after her. She knew that she needed him, at least to get back to the Boar Hunt Inn. Maybe Falkin would put her up for the night.

In the middle of the document, this one was hard to pick out at first. When I run it through the program, I get this output:

74: She continued until she felt (13)
74: She sniffed and wiped the (9)
74: "miw: I hate him," she (6)
74: Her eyes scanned the empty (14)
74: She knew that she needed (16)
74: Maybe Falkin would put her (9)
-- 6 sentences, 11.2 median, 3.8 stddev

From here, I can see that I had three sentences that all start with "She" and a past-tense verb. It makes it a lot easier to notice a pattern, at least with the beginning of sentences.
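The same check can be automated a little further. The following is only a minimal sketch, not part of the program in this post: `first_word_counts` is a hypothetical helper that uses the same rough period-based split and tallies how many sentences in a paragraph open with the same word.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): count how often
# each opening word appears among the sentences of one paragraph.
sub first_word_counts {
    my ($paragraph) = @_;
    my %counts;

    # Same rough split as the main program: a period followed by whitespace.
    foreach my $sentence (split /\.\s+/, $paragraph) {
        # Take the first run of non-space characters as the opening word.
        my ($first) = $sentence =~ /^\s*(\S+)/;
        next unless defined $first;
        $counts{$first}++;
    }

    return %counts;
}

my $paragraph = 'She continued until she felt the sharp edge of her anger '
    . 'ebb away. She sniffed and wiped the tears from her face. "miw: I hate '
    . 'him," she sobbed. Her eyes scanned the empty forest, half expecting to '
    . 'see Pahim chasing after her. She knew that she needed him, at least to '
    . 'get back to the Boar Hunt Inn. Maybe Falkin would put her up for the '
    . 'night.';

my %counts = first_word_counts($paragraph);

# Report only the openers that occur more than once.
foreach my $word (sort keys %counts) {
    print "$word: $counts{$word}\n" if $counts{$word} > 1;
}
```

For the sample paragraph this reports "She" three times. The comparison is case-sensitive and keeps leading punctuation, so a sentence that opens with a quotation mark is counted separately.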

Likewise, in a different chapter, I have this:

126: Kanéko shook her head and (6)
126: "No, don't let him sing (5)
126: Whatever you do, don't sing." (5)
-- 3 sentences, 5.3 median, 0.6 stddev

There are three sentences in the paragraph and all of them have nearly the same length (six, five, and five words). This is shown by the small standard deviation. I'm okay with this specific case, but it at least points out that there might be something to pay attention to.
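To double-check the summary lines, the numbers can be recomputed from the word counts in the two sample outputs. This is a small sketch that assumes the average is the arithmetic mean and the deviation is the sample standard deviation (dividing by n - 1). The names `mean` and `sample_stddev` are my own for this example.

```perl
use strict;
use warnings;

# Arithmetic mean of a list of numbers.
sub mean {
    my $total = 0;
    $total += $_ foreach @_;
    return $total / scalar(@_);
}

# Sample standard deviation (divides by n - 1); 0 for fewer than two values.
sub sample_stddev {
    return 0 if @_ <= 1;
    my $mean  = mean(@_);
    my $total = 0;
    $total += ($_ - $mean) ** 2 foreach @_;
    return sqrt($total / (@_ - 1));
}

# Word counts taken from the two sample outputs above.
my @first  = (13, 9, 6, 14, 16, 9);
my @second = (6, 5, 5);

printf "%.1f mean, %.1f stddev\n", mean(@first),  sample_stddev(@first);
printf "%.1f mean, %.1f stddev\n", mean(@second), sample_stddev(@second);
```

This prints 11.2 and 3.8 for the first paragraph and 5.3 and 0.6 for the second, matching the summary lines in the output.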

Not sure if it will help in the long run, but at least it gives me an idea of where to focus during this part of the editing.

And here is the Perl program I used to generate this. I just pass the chapter file to the program and get the output.

$ ../bin/sentence-prefix chapter-11.txt  | less



#!/usr/bin/perl

use strict;
use warnings;

# Parse the input.
while (<>)
{
	# Ignore commented lines.
	next if /^#/;

	# If the line is blank, just output a line break.
	if (/^\s*$/)
	{
		print "\n";
		next;
	}

	# Clean up the line and get rid of the newlines.
	chomp;

	# First split the line on the periods. It isn't perfect, but
	# should give a rough sentence start.
	my @sentances = split(/\.\s+/);
	my @counts = ();

	foreach my $sentance (@sentances)
	{
		# For each sentence, we split apart the spaces to get the words.
		my @words = split(/\s+/, $sentance);

		# Figure out how many words are in the sentence.
		my $count = scalar(@words);
		push @counts, $count;

		# We only care about the first five words in the sentence.
		my @prefix = splice(@words, 0, 5);

		# Print out the sentence and its length. We also include the
		# line number so we can easily jump to that line.
		print "$.: ", join(" ", @prefix), " ($count)\n";
	}

	# Calculate the std. dev. for the counts. If there is a small
	# deviation, then we have too much of a consistent sentence length.
	my $stddev = stddev(@counts);

	# Put in a summary for the paragraph.
	if ($stddev > 0)
	{
		printf(
			"-- %d sentences, %.1f median, %.1f stddev\n",
			scalar(@counts), median(@counts), $stddev);
	}
}


sub median
{
	# Despite the name, this returns the arithmetic mean. Start by
	# getting a total of all the counts.
	my $total = 0;

	foreach (@_)
	{
		$total += $_;
	}

	# Average it out by dividing by the number of elements.
	my $average = $total / scalar(@_);
	return $average;
}


sub stddev
{
	# StdDev doesn't mean anything with fewer than two elements, so
	# just return 0.
	return 0 if @_ <= 1;

	# Figure out the average value (median() actually returns the mean).
	my $average = median(@_);

	# Figure out the total of the squared differences from the average.
	my $total = 0;

	foreach (@_)
	{
		$total += ($average - $_) ** 2;
	}

	# Calculate the sample StdDev (dividing by n - 1) and return it.
	my $std = ($total / (@_ - 1)) ** 0.5;
	return $std;
}


So, if anyone is having trouble with unvarying sentence structure, this might help you focus on the paragraphs that need the most work.