Thoughts on the Word Spec in Rust

Tritium lives in the Word spec because to deliver great legal tech, we think we need to own the word processor.

The Word spec is giant.

It provides that a valid docx file may contain something like the below XML:

<body>
    <tbl>
        ...
        <p>
            ...
            <tbl>...</tbl>
        </p>
    </tbl>
</body>

In other words, it supports essentially unbounded nesting of paragraphs and tables.

And since Word was written in C/C++, which happily allows multiple mutable references to the same data, such deeply nested structures are no problem there.

But they're hard to do right in Rust.
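To make the difficulty concrete, here is a minimal sketch of one common way to hold such a tree in Rust: each node owns its children outright, and a nested node is reached by a path of child indices instead of a second pointer. The `Block`, `Paragraph`, and `Table` types and the `descend_mut` helper are hypothetical names made up for illustration, not Tritium's or docx_rs's actual model.

```rust
// Hypothetical types for illustration only; not Tritium's or docx_rs's model.
#[derive(Debug)]
enum Block {
    Paragraph(Paragraph),
    Table(Table),
}

#[derive(Debug, Default)]
struct Paragraph {
    text: String,
    // Mirrors the simplified pseudo-XML above, where a <p> holds a nested <tbl>.
    children: Vec<Block>,
}

#[derive(Debug, Default)]
struct Table {
    children: Vec<Block>,
}

impl Block {
    fn children(&self) -> &[Block] {
        match self {
            Block::Paragraph(p) => &p.children,
            Block::Table(t) => &t.children,
        }
    }

    fn children_mut(&mut self) -> &mut Vec<Block> {
        match self {
            Block::Paragraph(p) => &mut p.children,
            Block::Table(t) => &mut t.children,
        }
    }

    /// Nesting depth, counting this block itself.
    fn depth(&self) -> usize {
        1 + self.children().iter().map(Block::depth).max().unwrap_or(0)
    }

    /// Follows a path of child indices down the tree and hands back a single
    /// mutable borrow; returns None if the path leaves the tree.
    fn descend_mut(&mut self, path: &[usize]) -> Option<&mut Block> {
        let mut node = self;
        for &i in path {
            node = node.children_mut().get_mut(i)?;
        }
        Some(node)
    }
}

fn main() {
    // tbl > p > tbl, as in the pseudo-XML above.
    let inner = Block::Table(Table::default());
    let p = Block::Paragraph(Paragraph {
        text: "cell".into(),
        children: vec![inner],
    });
    let mut root = Block::Table(Table { children: vec![p] });

    // Edit the nested paragraph through a path rather than a shared pointer.
    if let Some(Block::Paragraph(p)) = root.descend_mut(&[0]) {
        p.text.push_str(" (edited)");
    }
    println!("depth = {}", root.depth());
}
```

The cost of this design is that only one part of the tree can be borrowed mutably at a time, which is exactly the constraint a C/C++ implementation never has to think about.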

So, where to start?

An excellent first place was the docx_rs crate maintained by bokuweb.

bokuweb's work seems to follow along the lines of python-docx in creating an excellent API for generating Word documents.

From the repo:

use docx_rs::*;

pub fn hello() -> Result<(), DocxError> {
    let path = std::path::Path::new("./hello.docx");
    let file = std::fs::File::create(path).unwrap();
    Docx::new()
        .add_paragraph(Paragraph::new().add_run(Run::new().add_text("Hello")))
        .build()
        .pack(file)?;
    Ok(())
}

It also supports reading. Ingesting a Word file with docx_rs, using a libtritium helper to load the bytes, would look something like the below.

pub fn main() {
    let bytes = libtritium::fs::slurp_path("./hello_world.docx").unwrap();
    let docx = docx_rs::read_docx(&bytes).unwrap();
    // The first body child is expected to be a paragraph.
    let Some(docx_rs::DocumentChild::Paragraph(paragraph)) = docx.document.children.first()
    else {
        panic!("Expected a paragraph.");
    };
    println!("{}", paragraph.raw_text());
}

// Hello, World!

As a great Rust crate, it compiles to WASM and can be run on Web front ends. Amazing.

It was instrumental in getting Tritium's first alpha versions off the ground.

But today, Tritium runs a custom docx module, written from scratch.

Why?

As with many other endeavours, if it's your core product, you need to own the stack or at least have control over its destiny.

Tritium's core offering is making surgical edits to legacy legal documents.

While it doesn't have to implement the entire Word spec to be useful, Tritium needs to pass the below round-trip test in all cases to be usable at all.

#[test]
fn deserialize_serialize_round_trip() {
    let src = libtritium::fs::slurp_path("/src.docx").unwrap();
    let docx = libtritium::docx::Docx::from_bytes(&src).unwrap();
    docx.save_as("/dst.docx").unwrap();
    let dst = libtritium::fs::slurp_path("/dst.docx").unwrap();
    assert_eq!(*src, *dst);
}

Surprising the user by dropping data on save would be fatal.

Tritium outgrew docx_rs because docx_rs is designed for constructing, not consuming, Word files.

Let's go back to our pseudo-docx block.

<body>
    <tbl>
        ...
        <p>
            ...
            <tbl>...</tbl>
        </p>
    </tbl>
</body>

One smart thing that docx_rs does is represent this structure as a nested AST of enum variants.

At the time of writing, for example, here's the ParagraphChild definition.

This describes the types of children that a <p> element can have.

#[derive(Debug, Clone, PartialEq)]
pub enum ParagraphChild {
    Run(Box<Run>),
    Insert(Insert),
    Delete(Delete),
    BookmarkStart(BookmarkStart),
    Hyperlink(Hyperlink),
    BookmarkEnd(BookmarkEnd),
    CommentStart(Box<CommentRangeStart>),
    CommentEnd(CommentRangeEnd),
    StructuredDataTag(Box<StructuredDataTag>),
    PageNum(Box<PageNum>),
    NumPages(Box<NumPages>),
}

Why an enum?

Well, for starters, a single Word document may have many thousands, tens of thousands or even hundreds of thousands of these XML entities.

Retaining each XML tag as a String or raw bytes would not only hog a tremendous amount of memory but, on certain platforms, result in memory fragmentation in a manner that would be indistinguishable from a giant memory leak.

So docx_rs uses quick_xml to read each XML entity into an enum using an evented parser.

That process converts the already-allocated set of XML tag bytes into a tiny enum discriminant.
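As a rough illustration of the saving, the sketch below maps tag-name bytes to a fieldless enum and compares its size with that of an owned String. The `Tag` enum and its `from_name` function are hypothetical stand-ins, not docx_rs's actual definitions, and the exact sizes depend on the target.

```rust
use std::mem::size_of;

// Hypothetical tag enum for illustration; the real crate covers far more elements.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tag {
    Body,
    Paragraph,
    Table,
    Run,
    Unsupported,
}

impl Tag {
    /// Maps raw tag-name bytes to a variant, so the bytes need not be retained.
    fn from_name(name: &[u8]) -> Tag {
        match name {
            b"w:body" => Tag::Body,
            b"w:p" => Tag::Paragraph,
            b"w:tbl" => Tag::Table,
            b"w:r" => Tag::Run,
            _ => Tag::Unsupported,
        }
    }
}

fn main() {
    let names: [&[u8]; 4] = [b"w:body", b"w:tbl", b"w:p", b"w:r"];
    let tags: Vec<Tag> = names.iter().copied().map(Tag::from_name).collect();
    println!("{:?}", tags);

    // A String carries a pointer, length, and capacity inline, plus a separate
    // heap allocation for its contents; the enum carries only its discriminant.
    println!(
        "Tag: {} byte(s) inline, String: {} bytes inline plus heap contents",
        size_of::<Tag>(),
        size_of::<String>()
    );
}
```

Multiplied across hundreds of thousands of elements, avoiding one small heap allocation per tag is where both the memory and the fragmentation savings would come from.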

But what if that's not all of the possible children of the p tag?

And what if all of the possible children of the body tag aren't supported for reading yet either?

match e {
    XMLElement::Paragraph => {
        let p = Paragraph::read(&mut parser, &attributes)?;
        doc = doc.add_paragraph(p);
        continue;
    }
    // ... other supported elements ...
    _ => {} // whoops: unsupported elements are silently dropped
}

Such a novel tag is just ignored, and we fail our round-trip test.
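To show what is at stake, here is a minimal sketch contrasting a reader that drops unsupported elements with one that keeps them verbatim in a catch-all variant. This is only one possible approach, not a description of Tritium's module; `BodyChild`, `Element`, `read_lossy`, `read_preserving`, and `write` are hypothetical names, and the pre-tokenized `Element` list stands in for a real evented parser.

```rust
// Hypothetical model for illustration; not Tritium's or docx_rs's actual types.
#[derive(Debug, Clone, PartialEq)]
enum BodyChild {
    Paragraph(String),
    // Catch-all: keeps the raw XML of any element the reader does not understand.
    Unknown(String),
}

/// A pre-tokenized element: its tag name plus its raw XML. This is an
/// assumption that stands in for the output of a real evented parser.
struct Element<'a> {
    name: &'a str,
    raw: &'a str,
}

/// Mirrors the `_ => {}` arm above: unsupported elements vanish.
fn read_lossy(elements: &[Element<'_>]) -> Vec<BodyChild> {
    elements
        .iter()
        .filter_map(|e| match e.name {
            "w:p" => Some(BodyChild::Paragraph(e.raw.to_string())),
            _ => None,
        })
        .collect()
}

/// Keeps unsupported elements verbatim so they can be written back out.
fn read_preserving(elements: &[Element<'_>]) -> Vec<BodyChild> {
    elements
        .iter()
        .map(|e| match e.name {
            "w:p" => BodyChild::Paragraph(e.raw.to_string()),
            _ => BodyChild::Unknown(e.raw.to_string()),
        })
        .collect()
}

/// Serializes the children back to XML in document order.
fn write(children: &[BodyChild]) -> String {
    children
        .iter()
        .map(|c| match c {
            BodyChild::Paragraph(raw) | BodyChild::Unknown(raw) => raw.as_str(),
        })
        .collect()
}

/// Two sample elements: one supported, one the reader does not know.
fn sample() -> Vec<Element<'static>> {
    vec![
        Element { name: "w:p", raw: "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>" },
        Element { name: "w:customXml", raw: "<w:customXml w:element=\"note\"/>" },
    ]
}

fn main() {
    let elements = sample();
    let source: String = elements.iter().map(|e| e.raw).collect();
    println!("lossy round trip ok: {}", write(&read_lossy(&elements)) == source);
    println!("preserving round trip ok: {}", write(&read_preserving(&elements)) == source);
}
```

The trade-off is the one raised earlier: every unknown element retained as raw text costs memory, so a catch-all like this only stays cheap if most elements end up in typed variants.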

So while most open source implementations of the Word spec are incomplete, and docx_rs is a great library for building Word files in native Rust or in the browser, it's directionally incorrect for Tritium.

If it’s a core business function — do it yourself, no matter what.
- Joel Spolsky, In Defense of Not Invented Here Syndrome

So we build.