Tritium lives in the Word spec because to deliver great legal tech, we think we need to own the word processor.
The Word spec is giant.
It provides that a valid docx file may contain something like the below XML:
<body>
  <tbl>
    ...
    <p>
      ...
      <tbl>...</tbl>
    </p>
  </tbl>
</body>
In other words, it supports essentially unlimited nesting of paragraphs and tables.
And since Word was written in C/C++, which happily allows multiple mutable references to the same data, these deeply nested structures are no problem there.
But they're hard to do right in Rust.
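To make the shape concrete, here is a minimal sketch of how the simplified nesting above could be modeled in Rust with single ownership. The `Block` type, its methods and `sample` are hypothetical names for illustration, not docx_rs or Tritium code. Each node owns its children, so there are no parent or sibling pointers, and those are the part an editor tends to want and that takes more care in Rust.

```rust
// Simplified model of the pseudo-docx above: a paragraph may hold tables and
// a table may hold paragraphs. These names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub enum Block {
    Paragraph(Vec<Block>),
    Table(Vec<Block>),
    Text(String),
}

impl Block {
    /// Depth of the deepest chain of nested containers; text leaves count as 0.
    pub fn depth(&self) -> usize {
        match self {
            Block::Text(_) => 0,
            Block::Paragraph(children) | Block::Table(children) => {
                1 + children.iter().map(Block::depth).max().unwrap_or(0)
            }
        }
    }

    /// Mutation needs `&mut` access threaded down from the root: each node has
    /// exactly one owner, so there is no parent pointer to follow back up.
    pub fn uppercase_text(&mut self) {
        match self {
            Block::Text(text) => *text = text.to_uppercase(),
            Block::Paragraph(children) | Block::Table(children) => {
                for child in children.iter_mut() {
                    child.uppercase_text();
                }
            }
        }
    }
}

/// Builds a table holding a paragraph that holds text and another table,
/// mirroring the nesting in the XML above.
pub fn sample() -> Block {
    Block::Table(vec![Block::Paragraph(vec![
        Block::Text("hello".to_string()),
        Block::Table(vec![]),
    ])])
}
```

Walking down the tree is straightforward in this form. Anything that needs to reach upward or sideways, such as "the table containing this paragraph", has to be passed in from above or modeled with indices or reference-counted handles.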
So, where to start?
An excellent first place was the docx_rs crate maintained by bokuweb.
bokuweb's work seems to follow along the lines of python-docx in creating an excellent API for generating Word documents.
From the repo:
use docx_rs::*;

pub fn hello() -> Result<(), DocxError> {
    let path = std::path::Path::new("./hello.docx");
    let file = std::fs::File::create(path).unwrap();
    Docx::new()
        .add_paragraph(Paragraph::new().add_run(Run::new().add_text("Hello")))
        .build()
        .pack(file)?;
    Ok(())
}
It also supports reading. Ingesting a Word file with libtritium would look something like the below.
pub fn main() {
    // Read the whole file into memory, then let docx_rs parse it.
    let bytes = libtritium::fs::slurp_path("./hello_world.docx").unwrap();
    let docx = docx_rs::read_docx(&bytes).unwrap();
    // The body's children live on `docx.document`; expect a paragraph first.
    let Some(docx_rs::DocumentChild::Paragraph(paragraph)) =
        docx.document.children.first() else {
        panic!("Expected a paragraph.");
    };
    println!("{}", paragraph.raw_text());
}
// Hello, World!
As a great Rust crate, it compiles to WASM and can be run on Web front ends. Amazing.
It was instrumental in getting Tritium's first alpha versions off the ground.
But today, Tritium runs a custom docx module, written from scratch.
Why?
As with many other endeavours, if it's your core product, you need to own the stack or at least have control over its destiny.
Tritium's core offering is making surgical edits to legacy legal documents.
While it doesn't have to implement the entire Word spec to be useful, Tritium has to pass the round-trip test below in all cases to be usable at all.
#[test]
fn deserialize_serialize_round_trip() {
    let src = libtritium::fs::slurp_path("/src.docx").unwrap();
    let docx = libtritium::docx::Docx::from_bytes(&src).unwrap();
    docx.save_as("/dst.docx").unwrap();
    let dst = libtritium::fs::slurp_path("/dst.docx").unwrap();
    assert_eq!(*src, *dst);
}
Surprising the user by dropping data on save would be fatal.
Tritium outgrew docx_rs because it's designed for construction, not consumption, of Word files.
Let's go back to our pseudo-docx block.
<body>
  <tbl>
    ...
    <p>
      ...
      <tbl>...</tbl>
    </p>
  </tbl>
</body>
One smart thing that docx_rs does is represent this structure as a nested AST of enum variants.
At the time of writing, for example, here's the ParagraphChild definition.
This describes the types of children that a <p> element can have.
#[derive(Debug, Clone, PartialEq)]
pub enum ParagraphChild {
    Run(Box<Run>),
    Insert(Insert),
    Delete(Delete),
    BookmarkStart(BookmarkStart),
    Hyperlink(Hyperlink),
    BookmarkEnd(BookmarkEnd),
    CommentStart(Box<CommentRangeStart>),
    CommentEnd(CommentRangeEnd),
    StructuredDataTag(Box<StructuredDataTag>),
    PageNum(Box<PageNum>),
    NumPages(Box<NumPages>),
}
Why an enum?
Well, for starters, a single Word document may have many thousands, tens of thousands or even hundreds of thousands of these XML elements.
Retaining each XML tag as a String or raw bytes would not only hog a tremendous amount of memory but, on certain platforms, result in memory fragmentation in a manner that would be indistinguishable from a giant memory leak.
So docx_rs uses quick_xml to read each XML element into an enum using an evented parser.
That process converts the already-allocated set of XML tag bytes into a tiny discriminant value of the enum.
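As a rough illustration of the memory argument, the sketch below maps tag names onto a fieldless enum and compares its size with that of a `String`. The `Tag` enum, `classify` and `sizes` are hypothetical names, not the docx_rs definitions, and exact sizes depend on the target platform.

```rust
use std::mem::size_of;

// Hypothetical tag enum for illustration; not the docx_rs definition.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Tag {
    Body,
    Paragraph,
    Table,
}

/// Maps a tag name seen by the parser to a fieldless enum variant.
/// Unrecognized names yield `None`.
pub fn classify(name: &str) -> Option<Tag> {
    match name {
        "body" => Some(Tag::Body),
        "p" => Some(Tag::Paragraph),
        "tbl" => Some(Tag::Table),
        _ => None,
    }
}

/// Returns (bytes per `Tag`, bytes per `String` handle). The `String` figure
/// excludes its separately allocated heap buffer, so the real gap is larger.
pub fn sizes() -> (usize, usize) {
    (size_of::<Tag>(), size_of::<String>())
}
```

A fieldless enum like this typically occupies a single byte and needs no heap allocation of its own, whereas every retained `String` carries a pointer, a length, a capacity and a separate buffer.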
But what if that's not all of the possible children of the p tag?
And what if all of the possible children of the body tag aren't supported for reading yet either?
match e {
    XMLElement::Paragraph => {
        let p = Paragraph::read(&mut parser, &attributes)?;
        doc = doc.add_paragraph(p);
        continue;
    }
    ..
    _ => {} // woops
}
Such a novel tag is just ignored, and we fail our round-trip test.
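The failure mode can be reproduced in a toy model. In the sketch below a "document" is just a flat list of tag names, and the `Node` enum and the reader and writer functions are hypothetical. It is not how docx_rs or Tritium is implemented. It only contrasts a reader that drops unrecognized tags with one that carries them along verbatim.

```rust
// Toy model: a "document" is a flat list of tag names. All names here are
// hypothetical and exist only to illustrate the round-trip failure.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Paragraph,
    Table,
    /// Fallback that keeps the original text of anything unrecognized.
    Unknown(String),
}

/// Lossy reader: unrecognized tags fall into `_ => {}` and vanish.
pub fn read_lossy(tags: &[&str]) -> Vec<Node> {
    let mut out = Vec::new();
    for tag in tags {
        match *tag {
            "p" => out.push(Node::Paragraph),
            "tbl" => out.push(Node::Table),
            _ => {} // silently dropped
        }
    }
    out
}

/// Preserving reader: unrecognized tags are carried along verbatim.
pub fn read_preserving(tags: &[&str]) -> Vec<Node> {
    tags.iter()
        .map(|tag| match *tag {
            "p" => Node::Paragraph,
            "tbl" => Node::Table,
            other => Node::Unknown(other.to_string()),
        })
        .collect()
}

/// Writes nodes back out as tag names.
pub fn write(nodes: &[Node]) -> Vec<String> {
    nodes
        .iter()
        .map(|node| match node {
            Node::Paragraph => "p".to_string(),
            Node::Table => "tbl".to_string(),
            Node::Unknown(raw) => raw.clone(),
        })
        .collect()
}
```

With an input such as `["p", "customXml", "tbl"]`, the lossy path writes back only `p` and `tbl`, while the preserving path reproduces the input. The cost is that unrecognized content is held as a `String` again, so this trades some of the enum's memory savings for fidelity.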
So while most open source implementations of the Word spec are incomplete, and docx_rs is a great library for building Word files in native Rust or in the browser, it's directionally incorrect for Tritium.
If it’s a core business function — do it yourself, no matter what.
- Joel Spolsky, In Defense of Not-Invented-Here Syndrome
So we build.