Author Intrusion Tokens

Given that my ideas for Author Intrusion (AI) are so large, there is always a danger that I'll get discouraged or “lost in the forest” along the way. To help with that, I'm thinking about writing the occasional post about what I'm trying and why.

This entire thing started with an idle thought a month ago. Even though I haven't touched AI in months, I still think about it as I work through the problems that caused me to stop or lose track of my efforts. Yeah, I was working on a novel, but AI's development still stalled.

I was trying to figure out how to load my 600k word weekly serial into the editor. This is one thing that AI is supposed to handle. I've not had a lot of luck with single documents over ten thousand words, much less something that large, and IDEs in general can't handle a single document of that size either.

I'm not planning on loading it all into memory either, but I'm going to have processes that look across the entire piece, which means I'm going to load and unload parts of it in the background as I go.

Background Processing

One of the biggest features of AI is going to be the background processing. In programs like Microsoft Word, you might occasionally notice that spell-checking happens a few seconds after you type the word. In Visual Studio, it can take a while for ReSharper or NCrunch to finish what it was doing. I'm planning on doing the same thing, but it will be more complicated than spell-checking and probably on par with the code analysis that ReSharper does.

The problem that I had was updating the UI. The previous C# implementation of AI tried to use locking to coordinate the updates, but it didn't work well. Actually, it crumbled under the weight of even lightweight locking. With this version, I'm going to calculate the results in a Task and then use the idle threads to apply them to the UI.

That will handle the UI updating issue, but it introduces a different problem. Every time the writer changes a line, everything the background processes calculated for it is invalidated, even things like highlighting character names or spell-checking. So, to avoid redoing things, I jumped through a lot of hoops to update the string in place, which… ended up being more difficult than I imagined.

Tokens

My idea came down to breaking the lines into tokens, individual words or components. So, the line “You are a cheese head.” would become ten tokens: “You”, " ", “are”, " ", “a”, " ", “cheese”, " ", “head”, ".".
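As a rough illustration of that split, here is a short Python sketch. It is not Author Intrusion's actual C# code; the tokenize name and the regular expression are assumptions made for this example. A word, a run of whitespace, or a single punctuation mark each becomes one token:

```python
import re

# Hypothetical rule: a word, a run of whitespace, or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(line):
    """Split a line into tokens that concatenate back to the original line."""
    return TOKEN_PATTERN.findall(line)


tokens = tokenize("You are a cheese head.")
print(len(tokens), tokens)
```

Because nothing is thrown away, joining the tokens back together reproduces the original line exactly.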

With tokens, I can have the background process say “update line 23, token 4” and it would be easier to update from a background process using an idle thread.
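A minimal sketch of that hand-off, written in Python rather than the C# Task the real implementation would use. The queue, the on_idle function, and the toy spell-checker are all hypothetical names for illustration. The background worker only posts (line, token, note) results, and the UI applies them when it is idle, so no locks are needed around the UI state:

```python
import queue
import threading

# Hypothetical shapes: a document is a list of lines, each line a list of
# tokens, and annotations are keyed by (line index, token index).
document = [["You", " ", "are", " ", "a", " ", "cheese", " ", "head", "."]]
annotations = {}
updates = queue.Queue()


def background_spell_check(lines):
    """Runs off the UI thread; it only posts results and never touches the UI."""
    known = {"you", "are", "a", "head"}  # toy dictionary for the example
    for line_index, line_tokens in enumerate(lines):
        for token_index, token in enumerate(line_tokens):
            if token.isalpha() and token.lower() not in known:
                updates.put((line_index, token_index, "misspelled"))


def on_idle():
    """Called by the UI loop when idle: apply whatever results have arrived."""
    while True:
        try:
            line_index, token_index, note = updates.get_nowait()
        except queue.Empty:
            return
        annotations[(line_index, token_index)] = note


worker = threading.Thread(target=background_spell_check, args=(document,))
worker.start()
worker.join()  # a real UI would keep running instead of waiting here
on_idle()
print(annotations)
```

A real version would also need to discard results for lines that changed after the worker read them, which this sketch ignores.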

Memory

The problem with breaking lines apart into tokens is memory pressure.

Assuming a 32-bit platform, a pointer to a string is 4 bytes. According to this page, the size of the string is 20 plus double the length.

“You are a cheese head.” requires 68 bytes (4 for the pointer, 20 for the overhead, and 44 for the length).

If I broke apart the line, the token version requires 284 bytes (30, 26, 30, 26, 26, 26, 36, 26, 32, and 26 bytes respectively, with each figure including the 4-byte pointer to that token).

Fortunately, the text can be shared (interned) so I don't have to allocate four copies of " ", which lowers the cost of the tokenized version to 218 bytes (since I don't need three of the 22 byte allocations).
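The arithmetic above can be checked with a few lines of Python. This is only a calculator for the figures quoted here (4-byte pointers, 20 bytes of overhead per string, 2 bytes per character); it does not measure real .NET allocations, and the constant and function names are made up for the example:

```python
POINTER_BYTES = 4     # 32-bit pointer, as assumed above
STRING_OVERHEAD = 20  # per-string overhead quoted above
BYTES_PER_CHAR = 2    # "double the length"


def string_bytes(text):
    """Estimated size of one string, excluding the pointer to it."""
    return STRING_OVERHEAD + BYTES_PER_CHAR * len(text)


line = "You are a cheese head."
tokens = ["You", " ", "are", " ", "a", " ", "cheese", " ", "head", "."]

# One pointer plus one string for the whole line.
whole_line = POINTER_BYTES + string_bytes(line)

# One pointer plus one string per token, nothing shared.
separate = sum(POINTER_BYTES + string_bytes(token) for token in tokens)

# One pointer per token, but only one string per distinct token.
interned = POINTER_BYTES * len(tokens) + sum(string_bytes(t) for t in set(tokens))

print(whole_line, separate, interned)  # 68 284 218
```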

While 218 bytes is much larger than 68, I think it will still work. If I use my weekly serial as an example, in 600k words, the most common tokens are:

  • and: 18786
  • she: 19143
  • her: 25504
  • ,: 29788
  • the: 31865
  • .: 50501

This means that allocating a separate string for each of the 50k “.” tokens would take about 1,282 kB, but interning them would take about 197 kB. And that's for a single character. Obviously, I'm not planning on loading the entire thing in memory, but I'm going to have to load and unload chapters while processing and this would reduce the memory pressure.
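The same estimate can be run over the whole frequency list. This sketch reuses the sizes from the Memory section (4-byte pointers, 20 bytes of overhead plus 2 bytes per character) and treats a kB as 1,024 bytes; the function names are hypothetical:

```python
POINTER_BYTES = 4
STRING_OVERHEAD = 20
BYTES_PER_CHAR = 2

# Token counts from the list above.
counts = {"and": 18786, "she": 19143, "her": 25504, ",": 29788, "the": 31865, ".": 50501}


def separate_bytes(token, count):
    """Every occurrence gets its own pointer and its own string."""
    return count * (POINTER_BYTES + STRING_OVERHEAD + BYTES_PER_CHAR * len(token))


def interned_bytes(token, count):
    """Every occurrence gets a pointer, but all of them share one string."""
    return count * POINTER_BYTES + STRING_OVERHEAD + BYTES_PER_CHAR * len(token)


for token, count in counts.items():
    before = separate_bytes(token, count) // 1024
    after = interned_bytes(token, count) // 1024
    print(f"{token!r}: {before} kB separate, {after} kB interned")
```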

Editing

The other complexity of tokens is editing. It gets more difficult to insert and delete text in the middle of a token, but it is possible. Most of the early effort on the UI will go into handling that, but I think I can work through it using unit tests and a proper MVC approach.
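One possible way to handle an insert, sketched in Python under the assumption of a simple regex tokenizer (words, whitespace runs, single punctuation marks). The insert_text name and the approach are illustrative, not the design AI will necessarily use. The idea is to apply the edit to the affected token and then re-tokenize it together with its immediate neighbours, since an edit can split one token or merge it into the token next to it:

```python
import re

# Hypothetical tokenizer: a word, a run of whitespace, or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    return TOKEN_PATTERN.findall(text)


def insert_text(tokens, token_index, offset, text):
    """Insert text at a character offset inside one token.

    Returns a new token list; only the edited token and its immediate
    neighbours are re-tokenized, the rest of the line is reused as-is.
    """
    start = max(token_index - 1, 0)
    end = min(token_index + 2, len(tokens))
    token = tokens[token_index]
    edited = token[:offset] + text + token[offset:]
    window = (
        "".join(tokens[start:token_index])
        + edited
        + "".join(tokens[token_index + 1:end])
    )
    return tokens[:start] + tokenize(window) + tokens[end:]


tokens = tokenize("You are a cheese head.")
split = insert_text(tokens, 6, 3, " ")   # a space inside "cheese" splits it
merged = insert_text(tokens, 7, 1, "s")  # "s" after the last space joins "head"
print(split)
print(merged)
```

Deleting text could follow the same pattern: change the affected token, then re-tokenize the small window around it.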

Complexity

This may not work. Writing an IDE or a text editor is not a simple problem and I can find very few discussions on how to write one. Also, I haven't found discussions on working with very large documents, since most of the editors that provide code and discussion load the entire thing into memory. I could be wrong, and I'm fumbling through this as I go, but it seems to be working out pretty well at this point.
