Messing with Markdown

I have a lot of projects in my head, which is one reason why I try to avoid starting new things while working on others. It got too easy to start a project and then never finish it when I got bored or just moved on.

At the same time, many of my ideas don't “stick.” They seem like they will, but then it either doesn't work out, the ideas don't really gel, or I simply lose the passion. Some of them, like my old Exalted webcomic, haunt me for years with a nagging voice of “finish me,” but I keep not doing it. Others end up being bigger projects but suffer through constant revisions as I try to figure them out.

I do have some major projects that I have continued to update and maintain quite a few years after I started them. One of the biggest is MfGames Writing Python, which I started in 2010 when my father said I'd love Python. Even though I've long since decided that I don't like Python, I've still been writing updates to the tools in support of my own writing efforts.

As with Glorious Saber, I've been thinking about converting my Python writing tools over to C#, but never really got to it. I had a couple of stabs at it, but nothing really “stuck.”

The Side Project

After ICON, something Jim C. Hines and Scott Lynch said during their “Beyond SF 101” panel was still echoing in my head.

Write more words and don't be a dick.

Now, Sand and Ash is still stuck in limbo, so I couldn't really do anything with that. Instead, I had this idea for a writing project that would be complementary to my Fedran world and let me get different ideas out. It involved smaller pieces (short stories, essays, and lessons), so I was trying to figure out how to pull it all together.

For some reason, my Python tools just choked. They are optimized toward writing novels and single-file DocBook 5 files, but not a multitude of smaller DocBook XML files, which is what these individual pieces would be.

I started to look into my C# version, which had a different “gather” utility, and once again realized that I probably wasn't that far off from getting the C# version working, flaws and all, and maybe moving away from the Python.

One of my major difficulties with Python is that it doesn't handle UTF-8 characters natively. I seem to have a lot of non-ASCII characters in my desert world (macrons are a killer) and it kept choking on them.

MfGames Writing CIL

So I shifted from my side project to MfGames Writing CIL, which is the C# version of the Python tools “plus” additional functionality. The biggest difference is that I wrote the Python tools around Creole, whereas I've since migrated my writing to Markdown.

While I was writing up a Markdown conversion utility for the tools, I came upon CommonMark, which is an attempt at a well-documented specification of Markdown. I figured I could use that to help guide my effort on the conversion utility.

It didn't take long before I realized that I was duplicating my work with Author Intrusion for handling Markdown. Usually when that happens, I figure I should start up a new project to handle the common logic and write it once.

MfGames Text Markup CIL

And then I moved from MfGames Writing CIL to MfGames Text Markup CIL. This is an attempt to create a single, centralized reader (and eventually writer) of markup languages in general and Markdown in particular.

There were a couple of reasons I went with this project instead of using another library:

  • Most Markdown libraries only convert to HTML, which meant I was parsing HTML to get into Author Intrusion or DocBook format. I wanted something that had an intermediate output that was ideal for converting to other formats.
  • Again, most libraries seem to load the entire Markdown file into memory at once (mainly because of the deferred links) and then write it out. I haven't tested this completely, but I already know that I have a 640k-word series that has to be parsed, and I do not want to have all of that loaded into memory. This implies a callback interface (SAX versus DOM).

I decided to write this in a similar style to C#'s XmlReader. Instead of loading everything into memory and then writing out the results, it just translates the Markdown file into element types.

// Loop through the Markdown and process each element as it is read.
while (markdown.Read())
{
    switch (markdown.ElementType)
    {
        case MarkupElementType.BeginDocument:
            this.WriteBeginDocument(xml);
            break;

        case MarkupElementType.EndDocument:
            xml.WriteEndElement();
            xml.WriteEndDocument();
            break;

        case MarkupElementType.BeginMetadata:
        case MarkupElementType.EndMetadata:
        case MarkupElementType.BeginContent:
        case MarkupElementType.EndContent:
            break;

        case MarkupElementType.BeginCodeSpan:
            this.WriteForeignPhrase(markdown, xml);
            break;

        // (Remaining cases omitted from this excerpt.)
    }
}

The system seemed to work out pretty well for my test cases, but when I started to throw “real” chapters at it, it started to crumble. Not from the foundation, but simply because I didn't write a good enough Markdown parser to translate them.
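
To make the XmlReader-style idea concrete, here is a minimal sketch of a pull-style reader. It is not the actual MfGames Text Markup CIL code: the names SketchMarkupReader and SketchElementType are made up for illustration, and the "parsing" is deliberately naive (a line starting with "# " is a heading, every other non-blank line is a paragraph). The only point is that each Read() call advances by one element while holding just the current line in memory.

```csharp
using System;
using System.IO;

// Hypothetical element types for this sketch; not the real MfGames API.
public enum SketchElementType
{
    None,
    BeginDocument,
    Heading,
    Paragraph,
    EndDocument
}

// A pull-style reader: every Read() call moves to the next element, and
// only the current line of input is held in memory at any time.
public sealed class SketchMarkupReader
{
    private readonly TextReader input;
    private bool started;
    private bool finished;

    public SketchMarkupReader(TextReader input)
    {
        this.input = input;
    }

    public SketchElementType ElementType { get; private set; }

    public string Text { get; private set; } = "";

    // Returns true while there is another element to process.
    public bool Read()
    {
        if (finished)
        {
            return false;
        }

        if (!started)
        {
            started = true;
            Set(SketchElementType.BeginDocument, "");
            return true;
        }

        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (line.Trim().Length == 0)
            {
                continue; // Blank lines carry no element in this toy version.
            }

            if (line.StartsWith("# ", StringComparison.Ordinal))
            {
                Set(SketchElementType.Heading, line.Substring(2).Trim());
            }
            else
            {
                Set(SketchElementType.Paragraph, line.Trim());
            }

            return true;
        }

        // Input is exhausted: report the end once, then stop.
        finished = true;
        Set(SketchElementType.EndDocument, "");
        return true;
    }

    private void Set(SketchElementType type, string text)
    {
        ElementType = type;
        Text = text;
    }
}

public static class SketchDemo
{
    public static void Main()
    {
        var reader = new SketchMarkupReader(
            new StringReader("# Title\n\nSome text.\n"));

        while (reader.Read())
        {
            Console.WriteLine(reader.ElementType + ": " + reader.Text);
        }
    }
}
```

Because the caller drives the loop, the consumer can write output as it goes and never needs the whole document in memory, which is the property that matters for very long manuscripts. Real Markdown (deferred links in particular) needs more bookkeeping than this toy version does.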

CommonMark

And here is where I started down the rabbit hole. The callback system worked great, but I needed to get my parser to be competent enough to handle what I wrote. Once I get that, converting to DocBook is trivial (as the above example probably shows).

A few days ago, I noticed that the CommonMark spec was a Markdown file with some magic for handling the input/output examples. Well, I could write a bunch of unit tests or… I could write a program that converted the 500+ examples into unit tests for me.
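
As a rough illustration of that kind of extraction (not the program described here), the sketch below pulls input/output pairs out of a spec-like text. It assumes the layout used by recent versions of the CommonMark spec: an example opens with a line of 32 backticks followed by " example", a line containing only "." separates the Markdown from the expected HTML, and a line of 32 backticks closes it, with "→" standing in for a tab. The version of the spec available when this was written may have used different delimiters. SpecExample and SpecExampleExtractor are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// One input/output pair pulled from the spec. Hypothetical type for this sketch.
public sealed class SpecExample
{
    public int Number;
    public string Markdown = "";
    public string Html = "";
}

public static class SpecExampleExtractor
{
    // Assumed delimiter: a run of 32 backticks (an assumption about the spec layout).
    private static readonly string Fence = new string('`', 32);

    public static List<SpecExample> Extract(TextReader spec)
    {
        var examples = new List<SpecExample>();
        SpecExample current = null;
        bool inHtml = false;
        string line;

        while ((line = spec.ReadLine()) != null)
        {
            if (current == null)
            {
                // Outside an example: ignore prose until an opening fence.
                if (line == Fence + " example")
                {
                    current = new SpecExample { Number = examples.Count + 1 };
                    inHtml = false;
                }
            }
            else if (line == Fence)
            {
                // Closing fence: the example is complete.
                examples.Add(current);
                current = null;
            }
            else if (line == "." && !inHtml)
            {
                // The lone dot switches from Markdown input to expected HTML.
                inHtml = true;
            }
            else
            {
                // The spec shows tabs as a visible arrow; turn them back into tabs.
                string text = line.Replace('\u2192', '\t') + "\n";
                if (inHtml)
                {
                    current.Html += text;
                }
                else
                {
                    current.Markdown += text;
                }
            }
        }

        return examples;
    }

    public static void Main()
    {
        string fence = new string('`', 32);
        string sample =
            "Some prose.\n\n" +
            fence + " example\n# Hi\n.\n<h1>Hi</h1>\n" + fence + "\n";

        foreach (SpecExample example in Extract(new StringReader(sample)))
        {
            Console.WriteLine("Example " + example.Number);
            Console.WriteLine("  in:  " + example.Markdown.Trim());
            Console.WriteLine("  out: " + example.Html.Trim());
        }
    }
}
```

From a list like this, emitting one test method per example is mostly string formatting: the example number goes into the test name, and the two strings become the input and the expected output.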

Why write things out by hand when I can write a program to do it for me?

Last night, I finished converting most of the examples over. Now I just have to get 508 unit tests passing, or at least 300 of them.

Rabbit Holes

I'm pretty sure this is going to be overwhelming, but I think it will still further my goal of finishing up my writing tools and getting back to my side project. We'll see how it ends up, but I still have a mountain to climb.

This is also the reason I haven't really posted for a few weeks. I was lost in a rabbit hole.
