Author Intrusion status for May 1, 2017

After a rough start to the previous week, I got a chance to really focus on Author Intrusion (AI). Despite a rather significant number of check-ins and a lot of coding, I wasn't quite able to get to a really good “show off” point. Instead, this week ended up being a black triangle (significant progress but nothing visible).

Introduction

I haven't really talked about AI for a while. The entire idea started in 2009 or so when I realized I had some major blind spots in my writing. At the time, it was an overuse of gerunds that was hanging over me, but as my skill improves, the problem areas also shift around. My writing group has been a great help in tracking these down, but I felt a lot of them were things that could be detected ahead of time; basically, I wanted something that would let me file off the rough edges before someone spent the effort to correct them.

To my surprise, the current offering of grammar checkers doesn't actually look for the same thing. They seem to review things sentence-by-sentence but I wanted something that looked at the entire document and looked for trouble areas. For example, using the same word repeatedly in a few paragraphs (echo words).

At the same time, I'm a programmer. I use a wonderful program called ReSharper which has a lot of refactoring and analysis. There are times when I wanted to see every time a character shows up (name outside of dialog) or is talked about. I also rename characters (frequently) and suffer from the occasional search-and-replace bugs.

That all led me to want to create an IDE for writers, basically a Visual Studio and ReSharper geared toward authors.

One of my major influences is a short story by James White called Fast Trip. I love that story because it comes down to “change the environment to succeed.” Well, that meant a couple of things for AI: it had to be flexible enough to work the way the author wants, and it needed to use the paradigm that I was comfortable with.

Iterations

I've worked on it off and on for about eight years. I've tried a lot of iterations, gotten to a point, and then hit some conceptual problem that threw everything into disarray. Some of them were realizations on my part (you can't reformat text while writing), some were getting caught working on the wrong thing (I should reinvent the word processor), and there were a ton of other dead ends.

This means that it isn't done until it is stable. So if you are not interested in alpha software, this might not be what you are looking for at this point.

The current design goal is to create a separate, independent program that can be called by text editors. Inspired by OmniSharp, I figured it was better to create a program that did one thing well (analyze and refactor novels) and then create hooks that can talk to other editors like Emacs and Atom.

I'm also drawing on my own experiences with Grunt and NPM. Some of the earlier implementations of AI were self-contained, with all the plugins installed there. However, as I've learned while working on my JavaScript publishing framework, versioning is very important. I can't break the analysis of novels I wrote ten years ago just because the code has improved to handle the novels of today. NPM has a great way of handling that with the package.json file, which lists specific packages and versions that can be installed and used. So AI is done the same way: the system uses specific versions of packages to ensure that it always produces the same data. If you upgrade the underlying packages, then you can deal with the changes and check in the results knowing it won't change tomorrow.

I've written AI in a number of languages (JavaScript, TypeScript, C#, Python) along with different UIs. I've gotten lost in a lot of them. At the moment, my skills in C# are considerably more advanced than the others, but more importantly, the tools (ReSharper) make me a lot more effective. I don't want this to be a learning experience at this point because I have a lot of novels I need to write and I have to focus on either making fun tools (AI) or finishing books.

I don't want this to be an argument of language. I like C# and I'm good at it, so I'm using it.

Overall Status

It isn't there yet. I'm working toward an end-to-end run, which means setting up the files and being able to run a program (aicli) and have it produce output that shows echo words (the first analysis I want to write and my current blind spot). What does work requires a relatively specific setup (not that bad, everything is contained inside the repository) and a bit of hand-holding.

This is still at the “black triangle” point: the technical pieces of the framework are in place, but I'm still combining them to produce something that looks cool.

All of the examples are based on a subset of Sand and Blood used for testing. It can be found here.

Configuration

Like Grunt, there is a single file that controls how AI works with a project. I took inspiration from Visual Studio Code and Atom in that every file isn't explicitly added to the project. Instead, AI assumes that all files underneath the directory containing the author-intrusion.yaml file are part of the project. If you add a chapter, the YAML file won't change but the system will pick it up.

I used YAML because it is easy to use, doesn't require a lot of noise, and pretty much handles the relatively simplistic data. Also, unlike JSON, YAML can let you copy a section from one place to another which I consider to be pretty useful.

I am trying to avoid changing a lot of files whenever you add a chapter; that's just noise in most cases. Related to that, AI will create a .ai directory for its internal caches, which shouldn't be checked into source control at all.

Eventually, aicli will search for author-intrusion.yaml. That way, like Grunt, it can be called anywhere inside the directory tree and it will “do the right thing” to produce consistent results. This is the same thing git does among other CLI implementations.
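As a rough illustration of that upward search, here is a minimal sketch in Python (not the C# that AI is written in). The function name find_project_file is made up for this example, and the real aicli lookup may behave differently.

```python
from pathlib import Path


def find_project_file(start, name="author-intrusion.yaml"):
    """Walk from `start` up to the filesystem root looking for the project file.

    Hypothetical helper: returns the path of the first match, or None.
    """
    current = Path(start).resolve()
    # Check the starting location first, then each parent directory in turn.
    for directory in [current, *current.parents]:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Calling it from a chapters subdirectory would return the author-intrusion.yaml in the project root, which is the same trick git uses to find its .git directory.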

The basic author-intrusion.yaml file is pretty simple:

file:
    - plugin: AddClassFromPath
      match: chapters/*.markdown
      class: chapter
data:
    - plugin: AddClassFromData
      select: file.chapter
      class: pov-{pointOfView}
layout:
    - plugin: SplitLines
    - plugin: SplitParagraphs
    - plugin: OpenNlpSplitTokens
      select: para
    - plugin: WordTokenClassifier
analysis:
    - plugin: Echoes
      select: file.chapter token.word
      scope: token.word:within(10)
      threshold: { error: 5, warning: 2 }

In the example file from the link above, there is a lot more in the file either for documentation purposes or just to work out some concept or idea.

There are four sections: file, data, layout, and analysis. These are various operations that are performed on each file to classify (file and data), organize (layout), and analyze (analysis) the files.

Plugins

Most of the processes are based on the idea of a “plugin”. A plugin is a discrete class (typically in its own NuGet package) that is versioned and specific to the current project. They are identified by the plugin property inside the list. A plugin can be used more than once.

analysis:
    - plugin: Echoes
      id: Words Within 10
      select: file.chapter token.word
      scope: token.word:within(10)
      threshold: { error: 5, warning: 2 }
    - plugin: Echoes
      id: Sounds Alike
      select: file.chapter token.word
      scope: token.word:within(5)
      compare: :root:attr(soundex)
      threshold: { error: 5, warning: 2 }

In each case, the plugin property of an entry identifies which plugin to use. This is usually the base class, but it means I'll have to have a registry to list which plugins do what. The reason for doing this is that if someone else writes a plugin (extends AI), they can push it up to NuGet (more about that later) and everyone can use it. It doesn't require a release of the main code (assuming AuthorIntrusion.Contracts remains the same).

The rest of the properties are specific to that plugin. There is a bit of interesting complexity in making this work (i.e., it's a hack, but I wrote it), but everything is type safe when it comes to coding.
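To make the registry idea concrete, here is a small Python sketch. It is an illustration only, not the actual C# code: the REGISTRY dictionary, the create_plugins function, and the stand-in Echoes class are all assumptions made for this example.

```python
class Echoes:
    """Stand-in for an analysis plugin; it only stores its settings."""

    def __init__(self, select, scope, id=None, compare=None, threshold=None):
        self.select = select
        self.scope = scope
        self.id = id
        self.compare = compare
        self.threshold = threshold


# Hypothetical registry: maps the `plugin` property to the class that handles it.
REGISTRY = {"Echoes": Echoes}


def create_plugins(entries):
    """Turn parsed configuration entries into configured plugin instances."""
    plugins = []
    for entry in entries:
        settings = dict(entry)          # copy so the caller's data is untouched
        name = settings.pop("plugin")   # the remaining keys belong to the plugin
        if name not in REGISTRY:
            raise KeyError(f"unknown plugin: {name}")
        plugins.append(REGISTRY[name](**settings))
    return plugins
```

Because each entry produces its own instance, listing Echoes twice with different settings gives two independent analyses.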

Selectors

In the above examples, select and scope are based on CSS selectors. I couldn't find a library that did it generically, so I wrote one that just creates an abstract syntax tree of CSS selectors which is used by this library. Like most of the plugins, I'll break it into a separate project once AI gets stable.

The advantage of using CSS, such as file.chapter token.word or :root, is that many developers understand how CSS selectors work. Apart from the specifics (pseudo-classes, for example), how they chain together and how you combine them are well known enough that I don't have to force someone to learn something new.

I'm setting up the layout of a file (using the layout plugins) to look like HTML. Those plugins create the tag/element structure that CSS uses.

Right now, the following are implemented:

  • :first-child
  • :last-child
  • :nth-child
  • :root is the top-level item being selected (usually the file but can be something else)

There are also scoped pseudo-classes. These are used to get elements based on their relationship to another element. In every case so far, the scope property is evaluated relative to each element found by the select of the plugin.

  • :before(x) finds the elements before the scope. It can be used like token.word:before(5) to get the five words before the current one.
  • :after(x)
  • :within(x) which is basically :before(x) combined with :after(x).
  • :self- versions of the three above.
  • :parent matches an item that is the parent of the scope.
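As an illustration of the range-based pseudo-classes in the list above, here is a minimal Python sketch over a flat list of sibling elements. The function names and the list-plus-index representation are assumptions for this example; the real implementation works on the element tree.

```python
def before(items, index, count):
    """Up to `count` items immediately before position `index`."""
    return items[max(0, index - count):index]


def after(items, index, count):
    """Up to `count` items immediately after position `index`."""
    return items[index + 1:index + 1 + count]


def within(items, index, count):
    """Items on both sides of `index`, excluding the item itself."""
    return before(items, index, count) + after(items, index, count)
```

The ranges simply shrink near the start or end of the list, so the first word of a file has nothing before it.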

The scoped pseudo-classes let us control what we are looking for. For example, the Echoes plugin may look to see if the same word has been used somewhere within ten words of the current one:

analysis:
    - plugin: Echoes
      select: token.word
      scope: token.word:within(10)

It can also be used to make sure the same word isn't used at the beginning of the surrounding paragraphs:

analysis:
    - plugin: Echoes
      select: para token.word:first-child
      scope: para:within(1) token.word:first-child
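The first example above (the same word within ten words) can be sketched in Python roughly like this. It is a simplified stand-in, not the plugin itself: find_echoes is a made-up name, it works on a plain list of words, and it only looks backward so each pair is reported once.

```python
def find_echoes(words, window=10):
    """Return (earlier_index, index, word) for each repeat within `window` words.

    Looking only at the preceding words finds the same pairs that a
    before-and-after scope would, without reporting each pair twice.
    """
    echoes = []
    for index, word in enumerate(words):
        start = max(0, index - window)
        for earlier in range(start, index):
            # Case-insensitive comparison so "The" and "the" count as an echo.
            if words[earlier].lower() == word.lower():
                echoes.append((earlier, index, word))
    return echoes
```

For instance, find_echoes("the cat saw the dog and the bird".split(), window=3) returns [(0, 3, 'the'), (3, 6, 'the')], while a window of 2 returns an empty list.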

There is a third category used for comparisons (used by the Echoes plugin). This is where I hack the CSS system, but these basically let me define operations.

  • :lower gets a lowercase version of the text.
  • :attr(attribute-name) uses the value of the attribute instead of the text.

This lets us analyze for something more than just the text. For example, words that sound the same:

analysis:
    - plugin: Echoes
      select: token.word
      scope: token.word:within(10)
      compare: :root:attr(soundex) # :attr() isn't done yet

Or ones that have the same base word:

analysis:
    - plugin: Echoes
      select: token.word
      scope: token.word:within(10)
      compare: :root:attr(stem)

A third example is if we are looking for too many sentences that start with the same pattern of parts of speech. This is to find where “bob did this”, “bob did that”, “mary did something”.

layout:
    - plugin: ClassifyElementRange # Not done yet
      select: sent token.word:first-child
      scope: sent:parent token.word:self-or-after(2)
      tag: sent-leading
analysis:
    - plugin: Echoes
      select: sent-leading
      scope: sent-leading:within(5)
      compare: :root:attr(pos)
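One way to think about the compare property in the examples above is as a key function applied to each element before checking for equality. The Python sketch below shows that idea with a compact Soundex. It is an assumption-laden illustration: the names soundex and same_key are invented here, and how AI will actually compute or store the soundex attribute is not decided by this example.

```python
# Standard Soundex digit groups; vowels and other letters have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word):
    """Return a four-character Soundex key such as 'R163' for 'Robert'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not break up a run of equal codes
        digit = _CODES.get(letter, "")
        if digit and digit != previous:
            result += digit
        previous = digit  # a vowel resets the run, so repeats are allowed again
    return (result + "000")[:4]


def same_key(first, second, key=str.lower):
    """Compare two words after transforming both with `key`."""
    return key(first) == key(second)
```

With key=str.lower this behaves like :lower, and with key=soundex the words “their” and “there” compare as equal even though the text differs.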

File Plugins

Now, on to how all those topics are used. The first set of plugins is based simply on the structure of the project. Every file has one automatic element, file, which plays the role that the html element plays in a web page. We can add classes and identifiers to it to limit how the rest of the code works.

The biggest one is AddClassFromPath. You can use that to add the chapter class to files in the chapters directory. That way, the later selectors can use file.chapter or .chapter to limit processing (don't need to do grammar on your notes).

file:
    - plugin: AddClassFromPath
      match: chapters/*.markdown
      class: chapter
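A rough Python sketch of what AddClassFromPath does with the match and class properties is shown below. The function name classes_for_path is hypothetical, and fnmatch is used only as a convenient stand-in: its * also matches across directory separators, which may not be how the real plugin treats globs.

```python
from fnmatch import fnmatch


def classes_for_path(relative_path, rules):
    """Return the classes whose glob pattern matches the project-relative path.

    `rules` is a list of (pattern, class_name) pairs, one per configured plugin.
    """
    classes = []
    for pattern, class_name in rules:
        if fnmatch(relative_path, pattern):
            classes.append(class_name)
    return classes
```

With the rule ("chapters/*.markdown", "chapter"), a file at chapters/chapter-01.markdown gets the chapter class, while notes/ideas.markdown gets none and is skipped by any selector that starts with file.chapter.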

Data Plugins

I write using Markdown with a YAML header. My chapters look like this:

---
availability: public
when: 1471/3/28 MTR
duration: 25 gm
date: 2012-02-18
title: Rutejìmo
locations:
  primary:
    - Shimusogo Valley
characters:
  primary:
    - Shimusogo Rutejìmo
  secondary:
    - Shimusogo Hyonèku
  referenced:
    - Funikogo Ganósho
    - Shimusogo Gemènyo
    - Shimusogo Chimípu
    - Shimusogo Yutsupazéso
concepts:
  referenced:
    - The Wait in the Valleys
purpose:
  - Introduce Rutejìmo
  - Introduce Hyonèku
  - Introduce naming conventions
  - Introduce formality rules
  - Introduce the basic rules of politeness
summary: >
  Rutejìmo was on top of the clan's shrine roof trying to sneak in and steal his grandfather's ashes. It was a teenage game, but also one to prove that he was capable of becoming an adult. He ended up falling off the roof.

  The shrine guard, Hyonèku, caught him before he hurt himself. After a few humiliating comments, he gave Rutejìmo a choice: tell the clan elder or tell his grandmother. Neither choice was good, but Rutejìmo decided to tell his grandmother.
---

> When a child is waiting to become an adult, they are subtly encouraged to prove themselves ready for the rites of passage. In public, however, they are to remain patient and respectful. --- Funikogo Ganóshyo, *The Wait in the Valleys*

Rutejìmo's heart slammed against his ribs as he held himself still. The cool desert wind blew across his face, teasing his short, dark hair. In the night, his brown skin was lost to the shadows, but he would be exposed if anyone shone a lantern toward the top of the small building. Fortunately, the shrine house was at the southern end of the Shimusogo Valley, the clan's ancestral home, and very few of the clan went there except for meetings and prayers.

Given that, I would use a plugin to let me add identifiers and classes to files based on the YAML header (the bit between the --- lines). For example, the following would add a pov-ShimusogoRutejìmo class to the file.

data:
    - plugin: AddClassFromData # Not done
      select: file.chapter
      class: pov-{characters.primary}

There will be ways of manipulating it, but basically it lets you tag the chapter with “scene” or “sequel” if you follow the Techniques of the Selling Writer or want to identify if a chapter is combat or talking. It doesn't matter how you want to tag it, you can use any element in the data to filter or make decisions later. This will also be used for the querying to let you say “what chapters are from Rutejìmo's point of view” or “what scenes happen at night”. Eventually, I'll tie it into my culture library so you can also say “show me the chapters in chronological order”.
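Since AddClassFromData isn't written yet, the following Python sketch is only a guess at how a template like pov-{characters.primary} could be expanded against an already-parsed YAML header. The names lookup and expand_class are invented, and producing one class per list entry with whitespace removed is an assumption chosen to match the pov-ShimusogoRutejìmo example.

```python
import re


def lookup(data, dotted_path):
    """Follow a dotted path such as 'characters.primary' through nested dicts."""
    value = data
    for key in dotted_path.split("."):
        value = value[key]
    return value


def expand_class(template, data):
    """Expand the first {path} placeholder; a list yields one class per entry."""
    match = re.search(r"\{([^}]+)\}", template)
    if match is None:
        return [template]
    value = lookup(data, match.group(1))
    values = value if isinstance(value, list) else [value]
    # Strip all whitespace so the result is usable as a single class name.
    return [template.replace(match.group(0), "".join(str(v).split()))
            for v in values]
```

Given the header shown earlier, expand_class("pov-{characters.primary}", header) yields a single class for Shimusogo Rutejìmo, and a chapter with two primary characters would get two pov- classes.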

Layout Plugins

The bulk of my effort in the last two weeks has been in the layout. This is what carves up the contents of the file into something that can be selected via the CSS system. The default is only to have the file element which contains everything.

layout:
    - plugin: SplitLines
      tag: line # implied

The above splits everything into <line> elements. We need that for reporting line number errors. Below, we have a plugin that splits things into paragraphs based on Markdown rules (a blank line separates paragraphs). Using the above examples, this is what lets us use para .word:first-child for selectors.

layout:
    - plugin: SplitParagraphs
      tag: para # implied

It gets complicated when we start adding tokens. A token is a word or a punctuation mark. I'm using OpenNLP to break apart the words at the moment. This splits up the contents of the paragraph (because of the select: para) into tokens.

layout:
    - plugin: OpenNlpSplitTokens
      select: para

Once I have tokens, I can add the word class to the actual words.

layout:
    - plugin: WordTokenClassifier
      class: word # implied

Eventually, there will be an OpenNlpSplitSentences plugin where you would split the paragraph into sentences and then split the sentences into tokens. That isn't done but we don't need it for the base. I will also eventually create a ParagraphSentenceWord plugin that does most of this and makes it easier to add it as a single line.

The end result is that we have an abstract tree that represents the file. Eventually there may be a lot more such as dialog identification, blockquotes and epigraphs, and whatever else makes sense.

The main reason for the layout plugins is to make the selectors work for the analysis. It is also a complex bit of code that has to run in order unlike analysis which can be multi-threaded for a given file.
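To give a feel for the tree the layout steps produce, here is a simplified Python sketch. It is not how AI does it: a regular expression stands in for OpenNLP, the line split is skipped, and the dictionary shape with tag, classes, and children keys is an assumption for this example.

```python
import re

# Words (allowing internal apostrophes) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def layout(text):
    """Build file -> para -> token elements, tagging word tokens with a class."""
    paragraphs = []
    # Like SplitParagraphs: a blank line separates paragraphs.
    for block in re.split(r"\n\s*\n", text.strip()):
        tokens = []
        # Like OpenNlpSplitTokens followed by WordTokenClassifier.
        for raw in TOKEN_PATTERN.findall(block):
            classes = ["word"] if raw[0].isalnum() else []
            tokens.append({"tag": "token", "classes": classes, "text": raw})
        paragraphs.append({"tag": "para", "children": tokens})
    return {"tag": "file", "children": paragraphs}
```

Each step only adds structure beneath what the previous one created, which is why the order of the layout plugins matters.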

Analysis Plugins

Finally, the last bit. I'm not done with anything else, but the analysis plugin is the bit that finds errors and warnings. It is the part that “does” something. In other words, the entire reason I'm writing this.

analysis:
    - plugin: Echoes
      select: file.chapter token.word
      scope: token.word:within(10)
      threshold: { error: 5, warning: 2 }

I'm working on this next, but I hope to have aicli check produce output like gcc or another compiler, giving a specific file and line number (via the SplitLines plugin) for each error. It will be the same thing Atom uses to highlight problems.

Since analysis plugins only add/remove errors, they don't change the structure. This means they can use all the CPUs using worker processes to make it as efficient as possible. I'm also going to eventually have some optimizations put in (most elements are hashed so they can be cached).
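The compiler-style output mentioned above might look something like the Python sketch below. The Notice type, the format_notices function, and the exact path:line: severity: message layout are assumptions for illustration, since the reporting code for aicli check does not exist yet.

```python
from typing import NamedTuple


class Notice(NamedTuple):
    """Hypothetical record for one warning or error attached to a line."""
    path: str
    line: int
    severity: str
    message: str


def format_notices(notices):
    """Render notices as 'path:line: severity: message', sorted by file and line."""
    ordered = sorted(notices, key=lambda notice: (notice.path, notice.line))
    return [f"{n.path}:{n.line}: {n.severity}: {n.message}" for n in ordered]
```

Because plugins running in parallel would finish in no particular order, sorting before printing keeps the output stable from one run to the next.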

Status and Plans

This last week, I've gotten the following done:

  • CSS selectors are working.
  • CSS pseudo-classes working for scope to get ranges of words.
  • CSS for doing transformations to make text lowercase before comparison.
  • The basic layout plugins needed to break a file into lines, paragraphs, tokens, and words.
  • A search path for plugins can find assemblies in a different directory.
  • The starting code for using NuGet for packages is there.
  • Echoes can find the words, just not report them.

The goal is to get a simple echo words analysis done, basically looking for duplicates within ten words.

  • Get TextQuery to compare text between the select and the scope.
  • Add warnings and errors (notices) to elements.
  • Report the notices to the console for the aicli check command.

I really want to get this to the point where it can be used. I think it will be beneficial even with the basics for finding problem spots in my personal projects, and it can benefit others. I also think that once I get the end-to-end working, additional functionality will be easy to add for other needs.

Development

Development is currently on the drem-0.0.0 branch on GitLab. I'm adding issues for secondary items and tagging them with their expected complexity. While I don't expect anyone to contribute, if someone does get inspired, the simple and trivial items are good starting points.

I plan on adding notes for contributing “soon”.

So far, everything is MIT licensed, but there are already other licenses involved. OpenNLP is Apache licensed, so I'll need to figure out how that interacts with MIT. Once I split the packages into individual files, it will probably be easier, but for the time being, life is faster if I keep it as a single repository.
