Extract Markdown from the binary .doc format

Tue, 01 Jul 2025 08:56:30 GMT – Dynamics 365 Customer Insights - Data – New

Today, there is no satisfactory way for a data scientist to extract text from old Word files (the binary .doc format, not .docx).
This is a big problem: we have millions of documents in this format, and extracting their text is very difficult.

- The Word API only works on Windows.
- OpenOffice doesn't handle parallelism very well.
- Antiword supports very few formats.
- Apache Tika is a Java implementation that works well but remains limited.

However, the specification of the binary .doc format has been openly published:
https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22

The funny thing is that even tools like Microsoft's MarkItDown don't support the .doc format; you have to fall back on OpenOffice.

What we want is a command-line program, written in a compiled language (C++ or Rust), that extracts plain text from a .doc file:

extract-doc file.doc > file.txt
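
As a rough illustration of the very first step such a tool would need, here is a minimal sketch in Python (chosen only for brevity; the actual tool is requested in C++ or Rust). It tells legacy compound-file containers, which is what binary .doc files are stored in, apart from ZIP-based .docx files by looking at the first bytes. The function names and the "cfb" / "zip" / "unknown" labels are made up for this example. It does not extract any text, and a "cfb" result only means the file is an OLE2 compound file, which could also be an old .xls or .ppt.

```python
import sys

# Signature at the start of OLE2 / Compound File Binary containers,
# the container format used by legacy .doc files.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local-file-header signature of ZIP archives, the container used by .docx.
ZIP_MAGIC = b"PK\x03\x04"


def classify_header(header: bytes) -> str:
    """Return 'cfb', 'zip' or 'unknown' (hypothetical labels) for the leading bytes."""
    if header.startswith(CFB_MAGIC):
        return "cfb"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def classify_file(path: str) -> str:
    """Read only the first 8 bytes of the file and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(8))


if __name__ == "__main__":
    # Print one "<label>\t<path>" line per file given on the command line.
    for path in sys.argv[1:]:
        print(f"{classify_file(path)}\t{path}")
```

Because only 8 bytes are read per file, a check like this is cheap enough to run over millions of documents to count how many really need a binary .doc extractor.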

If you can't do it, we can do it ourselves in collaboration with Microsoft. But it has to be done. This is an essential need in the age of LLMs.