Author Intrusion - I've Been Here Before

When it comes to projects that I probably will never finish, Author Intrusion is right up there. Most of the time, that is because I keep restarting after learning significant lessons. This last week was one of those cases.

Author Intrusion is also one of those projects that I keep coming back to because I know it can help me, but it requires a lot of effort to turn what I think I need into something that is useful. One might say that is obvious from the sheer number of times I restarted development, but it has been a moving target from the beginning.

There are a lot of reasons for the various restarts:

  • Can't handle my 25k word penultimate chapter on a project:
    • ... in a reasonable time.
    • ... or with reasonable memory concerns.
    • ... or without grinding my entire system to a halt.
  • Can't handle my 600k word epic:
    • ... in a reasonable time, without overloading my memory or stopping my machine.
  • Couldn't figure out how to determine parts of speech correctly.
  • I tried to:
    • ... write an entire text editor.
    • ... okay, just the text control.
    • ... okay, just load single chapters.
    • ... okay, just a CLI to parse the files.

This is over eight years of struggling with this tool.

Expectations

Part of the problem is I'm thinking too grandly. I have always had that problem when I can't code; silence is probably the worst thing for my plots and my projects. I think I've managed to winnow down the list to a more attainable set of goals:

  • Work with YAML + Markdown, my preferred format.
  • Analyze the chapters of the book to:
    • Identify problem words.
    • Identify clustered echo words.
  • Get me a good count of words for the chapters:
    • ... for the entire book.
    • ... for a range of chapters.
  • Identify sentences:
    • ... that start with the same pattern.
    • ... that have the same number of words.
  • Identify consecutive paragraphs that start with the same word.
  • Identify overuses of gerunds (a category of -ing words).
  • Identify overuses/clusters of “to be” verbs.
  • Allow me to insert a chapter easily.
  • Identify present tense outside of dialog.
  • Integrate with Atom as a language service.
  • Work on Windows and Linux.
  • Handle custom spelling.

Okay… that isn't that much of a shorter list. Even so, it will take a while to even get close to it despite these features being 90% of what I use tools to do today.

The Last Failure

The last version used XPath and XSLT. I figured it was a good way of framing the ideas; however, I started getting bogged down in making it flexible because I hate the idea of opinionated systems and didn't want to force others to work the way I work (blame James White's Fast Trip for that).

Also, it struggled with the large files and projects.

When I encounter a problem, sometimes it is good to think of it in a different way. This might require retooling it, starting over (I did that with RPG games a lot as a kid too), or trying to come at the problem from a different direction.

Rust Language

I decided to spend a week trying another iteration of Author Intrusion. This time, I used it as a learning experience for an entirely different (and new to me) language: Rust. Rust is a pretty low-level language but has much of the packaging and niceties of C# (my primary language). Since I knew the initial versions of Author Intrusion so well (I have restarted over ten times), it was old territory that would make it easier to map the language into something I already knew.

Overall, Rust is fairly pleasant to use but there are places where I struggled a lot with it conceptually. I've worked in object-oriented languages for the last twenty years. Rust has some OO-like support, but I had to change how I was working to use it. Also, I still haven't grokked the “borrow” concepts enough, so sometimes I just revert to randomly adding and removing & and * in hopes of figuring it out.

In other words, I beat on it like a black obelisk.

The tooling with Rust is considerably poorer than what C# has. Not being able to refactor easily definitely ate up significant time in the last week. Also, my laptop hates working with Rust: it gets burning hot after only an hour of coding… which was good because I put it down until everything cooled off.

That said, I've been pretty happy with the results so far so I'm hoping to keep it going with this variant instead of the previous ones, at least to see if I can get closer to my end goals.

Running the Tool

So, if you are interested in checking it out with this post, head over to the GitLab repository and clone it. I'm using rustup for my installation with the nightly builds, so the following should work:

$ git clone https://gitlab.com/author-intrusion/author-intrusion-rust
$ cd author-intrusion-rust
$ ./run-000 chapter list

On Linux, I create an alias to run it from anywhere:

$ alias author-intrusion="cargo run --quiet --"

Unless noted, I'll also run all of these examples from the ./examples/000-simple-project directory.

Project Files

Like before, everything hangs off an author-intrusion.aipry project file. This tells the system where the root of the project is located (I love being able to use npm and gulp from sub-directories) and how the project is organized.

The project file is a YAML file. There are two examples in the ./examples/ folder in the source code.

name: First Example Project
content:
  pattern: chapters/chapter-??.md

The name is just there because I like to have something.

The content section says how the project is organized for the actual content. Now, in my case, I almost universally use chapter for my files but I've been switching from chapter-01.markdown to chapter-01.md. Someone might prefer a different approach (like the src directory or eventually a chapter/scene approach).
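To illustrate how a pattern like chapters/chapter-??.md could pick out files, here is a minimal sketch. It assumes the ? follows the usual glob convention of matching exactly one character. The function name matches_pattern is made up for this example and is not taken from the Author Intrusion source, which may well use a glob library instead.

```rust
/// Returns true when `path` matches `pattern`, where `?` in the pattern
/// stands for exactly one character and everything else matches literally.
/// Hypothetical helper: only `?` is supported, not `*` or `[...]`.
fn matches_pattern(pattern: &str, path: &str) -> bool {
    let mut pattern_chars = pattern.chars();
    let mut path_chars = path.chars();
    loop {
        match (pattern_chars.next(), path_chars.next()) {
            // Both ran out at the same time: full match.
            (None, None) => return true,
            // `?` consumes any single character.
            (Some('?'), Some(_)) => continue,
            // Literal characters must be identical.
            (Some(p), Some(c)) if p == c => continue,
            // Length mismatch or a differing literal.
            _ => return false,
        }
    }
}

fn main() {
    let pattern = "chapters/chapter-??.md";
    for path in ["chapters/chapter-01.md", "chapters/chapter-1.md", "chapters/notes.md"] {
        println!("{} -> {}", path, matches_pattern(pattern, path));
    }
}
```

With two ? placeholders, the pattern only accepts names with exactly two characters in that spot, which is why a switch from .md to .markdown means changing the pattern in the project file.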

File List

With that, we have the basic ability to list chapters.

$ author-intrusion chapter list
chapters/chapter-01.md 1 Chapter One  9
chapters/chapter-02.md 2 Chapter Two 11
$

This shows off some of the cool things it can do already. The first column is the relative path in the project. This is the same as adding -f file.rel_path to the command.

The second column is the chapter number which is parsed from the file (chapter-01.md -> 01 -> 1). This is the same as -f file.num.
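The chapter-01.md -> 01 -> 1 step can be sketched with nothing but the standard library. This is an illustration under assumptions, not the tool's actual code: the name chapter_number is hypothetical, and it assumes the number is the run of digits at the end of the file name before the extension.

```rust
use std::path::Path;

/// Pulls the chapter number out of a file name such as `chapter-01.md`.
/// Hypothetical helper: uses the run of ASCII digits at the end of the stem.
fn chapter_number(rel_path: &str) -> Option<u32> {
    // "chapters/chapter-01.md" -> "chapter-01"
    let stem = Path::new(rel_path).file_stem()?.to_str()?;
    // Walk backwards over the trailing digits; the last one visited is the
    // leftmost digit, so its index is where the number starts.
    let digits_start = stem
        .char_indices()
        .rev()
        .take_while(|(_, c)| c.is_ascii_digit())
        .last()
        .map(|(index, _)| index)?;
    // "01" -> 1; parsing drops the leading zero.
    stem[digits_start..].parse().ok()
}

fn main() {
    for path in ["chapters/chapter-01.md", "chapters/chapter-12.markdown", "chapters/notes.md"] {
        println!("{} -> {:?}", path, chapter_number(path));
    }
}
```

Returning an Option keeps files without a number (such as a notes file) from being treated as chapter zero.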

The third is pulled from the file itself in the front matter (the section between the --- lines). This uses JMESPath for the queries, which lets you do some really cool things. In this case, it is just a simple -f title which pulls the title: line from the files.

Finally, the last column is the number of words inside the file or -f count.words.

The reason I have these individual fields is because maybe I only want the word count in each chapter.

$ author-intrusion chapter list -f count.words
 9
11
$

Using JMESPath

The JMESPath also lets me pull out more detailed information from the query. As I've shown in sample chapters, I put a lot of information in the front matter to help me keep track of the book.

A simple case is chapter 1 of the example file:

---
title: Chapter One
locations:
  - Location A
  - Location B
---

This is chapter 1 for Anton.
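For a file shaped like the one above, separating the front matter from the prose might look like the following sketch. The name split_front_matter is hypothetical, and the rule it uses (the first line must be ---, and the next --- line closes the block) is an assumption about the format rather than the tool's real parser. Parsing the YAML itself and running JMESPath over it would need external crates, so that part is left out.

```rust
/// Splits a chapter into its YAML front matter and its Markdown body.
/// Hypothetical helper: the front matter is `None` when the file does not
/// open with a `---` line or the closing `---` line is missing.
fn split_front_matter(text: &str) -> (Option<&str>, &str) {
    let mut offset = 0;
    let mut yaml_start = None;
    // `split_inclusive` keeps the newline on each line, so the running
    // `offset` is always the byte position where the current line begins.
    for line in text.split_inclusive('\n') {
        let is_fence = line.trim_end() == "---";
        match yaml_start {
            // The very first line must be the opening fence.
            None if offset == 0 && is_fence => yaml_start = Some(line.len()),
            // Anything else on the first line means no front matter.
            None => break,
            // The next fence closes the front matter.
            Some(start) if is_fence => {
                return (Some(&text[start..offset]), &text[offset + line.len()..]);
            }
            // A regular line inside the front matter.
            Some(_) => {}
        }
        offset += line.len();
    }
    (None, text)
}

fn main() {
    let chapter = "---\ntitle: Chapter One\nlocations:\n  - Location A\n  - Location B\n---\n\nThis is chapter 1 for Anton.\n";
    let (front_matter, body) = split_front_matter(chapter);
    println!("front matter: {:?}", front_matter);
    println!("body: {:?}", body);
}
```

Both returned slices borrow from the original text, so nothing is copied even for a very large chapter.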

So, if I want to pull out the locations with the tool as a comma separated list, I can do the following.

$ author-intrusion chapter list -f file.rel_path -f "locations[] | sort(@) | join(', ', @)"
chapters/chapter-01.md Location A, Location B
chapters/chapter-02.md
$

It looks more impressive with the second example, which is the first nineteen chapters of Sand and Blood. I can use that to list all the characters who are present in each chapter besides the main character:

$ author-intrusion chapter list -f file.rel_path -f "characters.secondary[] | sort(@) | join(', ', @)" | head -n 3
chapters/chapter-01.markdown Hyonèku
chapters/chapter-02.markdown Gemènyo, Somiryòki, Tejíko
chapters/chapter-03.markdown Desòchu, Gemènyo, Hyonèku, Mapábyo, Opōgyo, Panédo
$

This is also Unicode-friendly, which is important to me since I'm fond of constructed languages and have a number of them for my world.

Defining Field Aliases

Now, that can get really wordy, so I added the ability to save those fields in the project file:

name: Sand and Blood by D. Moonfire
content:
  pattern: chapters/chapter-??.markdown
query:
  fields:
    characters.secondary:
      jmes: characters.secondary[] | sort(@) | join(', ', @)

That way, they can be easily used as an aliased field:

$ author-intrusion chapter list -f file.rel_path -f characters.secondary
chapters/chapter-01.markdown Hyonèku
chapters/chapter-02.markdown Gemènyo, Somiryòki, Tejíko
chapters/chapter-03.markdown Desòchu, Gemènyo, Hyonèku, Mapábyo, Opōgyo, Panédo
$

Analysis

The other part I managed to get done this week is the basic version of checking. Using Sand and Blood as an example, I have a problem with an overuse of “sigh” and “sighed”. I can use the tool to highlight when I start to use it too much.

To do that, I first define a check in the project file:

name: Sand and Blood by D. Moonfire
content:
  pattern: chapters/chapter-??.markdown
checks:
  - overused_pattern:
      # The `(?i)` makes this case insensitive.
      pattern: "^(?i)(sigh|sighed)$"
      err: 0.002
      warn: 0.001

The err and warn values are ratios of matching words to total words in the file. So, an err of 0.002 means that if more than 0.2% of the chapter is sighing, the matches get flagged as errors.

$ author-intrusion check
chapters/chapter-01.markdown:100:17: warning: found overused word "sighed"
chapters/chapter-01.markdown:140:11: warning: found overused word "sighed"
chapters/chapter-02.markdown:92:230: warning: found overused word "sighed"
chapters/chapter-02.markdown:120:48: warning: found overused word "sigh"
chapters/chapter-02.markdown:214:8: warning: found overused word "sigh"
$

The message is formatted so most tools like Atom or Emacs will move you to the line and first character of the word being reported. And they can highlight the lines as yellow or red depending on the warning or error.
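The path:line:column: severity: message shape in the output above can be produced from a byte offset into the chapter text. The sketch below assumes 1-based lines and columns and counts columns in characters rather than bytes; both are guesses about the tool's behavior, and format_finding is a hypothetical name.

```rust
/// Formats one finding in the `path:line:column: severity: message` shape.
/// Hypothetical helper: `byte_offset` is where the word starts in `text`
/// and must fall on a character boundary, otherwise the slice panics.
fn format_finding(path: &str, text: &str, byte_offset: usize, severity: &str, word: &str) -> String {
    let before = &text[..byte_offset];
    // Lines are 1-based: one more than the number of newlines seen so far.
    let line = before.matches('\n').count() + 1;
    // The current line starts right after the previous newline, if any.
    let line_start = before.rfind('\n').map_or(0, |index| index + 1);
    // Columns are 1-based and counted in characters, so accented letters
    // earlier on the line only count once each.
    let column = before[line_start..].chars().count() + 1;
    format!("{}:{}:{}: {}: found overused word \"{}\"", path, line, column, severity, word)
}

fn main() {
    let text = "He nodded.\nShe sighed.\n";
    if let Some(offset) = text.find("sighed") {
        println!("{}", format_finding("chapters/chapter-01.md", text, offset, "warning", "sighed"));
    }
}
```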

Additional Problems

The “sighing” problem was also mirrored by a “nodding” one. So, once I identify that word as a problem, I can add it to the file:

checks:
  - overused_pattern:
      # The `(?i)` makes this case insensitive.
      pattern: "^(?i)(sigh|sighed)$"
      err: 0.002
      warn: 0.001
  - overused_pattern:
      pattern: "^(?i)(nod|nodded)$"
      err: 0.002
      warn: 0.001

Then it highlights each match as either a warning (above the 0.1% warn ratio but under 0.2%) or an error (0.2% or higher).
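The threshold logic can be sketched as a small function. This is a simplified stand-in, not the real check: it compares against a plain word list instead of a regular expression (so it needs no extra crates), it treats a ratio exactly equal to a threshold as having reached it, and the names Severity and classify_overuse are invented for the example.

```rust
#[derive(Debug, PartialEq)]
enum Severity {
    Ok,
    Warning,
    Error,
}

/// Classifies how overused a set of words is within one chapter.
/// Hypothetical helper: `words` must be lowercase; `warn` and `err` are
/// ratios of matching words to total words, as in the project file.
fn classify_overuse(text: &str, words: &[&str], warn: f64, err: f64) -> Severity {
    let mut total = 0usize;
    let mut hits = 0usize;
    for token in text.split_whitespace() {
        // Strip surrounding punctuation so "sighed." counts as "sighed",
        // and lowercase it to get the same effect as the `(?i)` flag.
        let word = token.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase();
        if word.is_empty() {
            continue;
        }
        total += 1;
        if words.contains(&word.as_str()) {
            hits += 1;
        }
    }
    if total == 0 {
        return Severity::Ok;
    }
    let ratio = hits as f64 / total as f64;
    // Assumption: reaching a threshold exactly counts as crossing it.
    if ratio >= err {
        Severity::Error
    } else if ratio >= warn {
        Severity::Warning
    } else {
        Severity::Ok
    }
}

fn main() {
    let text = "She sighed. He sighed again and then sighed once more.";
    let severity = classify_overuse(text, &["sigh", "sighed"], 0.001, 0.002);
    println!("{:?}", severity);
}
```

Working with ratios instead of raw counts means the same thresholds apply to a short chapter and a 25k word one.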

Side note, I'm probably going to rename it to overused_word.

Speed

The biggest test is how fast this thing is. I don't have a lot done, but I had it process almost three million words in all the chapters I've written in the last twenty years in about twelve seconds. Doing actual English checking (overused words, patterns) will take longer, but twelve seconds is a promising start.

Obviously, it will get slower once I put the various checks in, but this is better than the previous iteration, which took an hour to handle two million words.

Conclusion

So that is how much I got going in a week with a language I just started learning a week ago Saturday. I'm pretty happy with the results and in the process, I found a lot of places that I was over-engineering the code. While I don't know if I'll ever be “done” with this, I think this is currently my best viable approach and it already has something I can use for this coming week when I work on Raging Alone.
