Extract Markdown from the binary .doc format

Today, there is no satisfactory way for a data scientist to extract text from old Word files (binary .doc, not .docx).

This is a big problem. We have millions of documents in this format, and it's very difficult to extract text from them.

- The Word API only works on Windows.

- OpenOffice doesn't handle parallelism very well.

- Antiword supports very few formats.

- Tika is a Java implementation that works well but remains limited.

However, the specification of the binary .doc format has been published:

https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22

The funny thing is that even tools like microsoft/markitdown don't support the .doc format. You have to go through OpenOffice.

What we want is a command-line program, written in a compiled language (C++ / Rust), that extracts the text from a .doc file:

extract-doc file.doc > file.txt
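
To make the request concrete, here is a rough sketch of the very first step such a tool would take. It is not a working extractor. It assumes the input is a Compound File Binary (OLE2) container, which is how binary .doc files are stored, and it only validates the container header. The names `CfbHeader` and `parse_cfb_header` are hypothetical, and reading the WordDocument stream described in the specification is left out entirely.

```rust
use std::{env, fs, process};

/// Signature that opens every Compound File Binary (OLE2) container.
const CFB_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

/// The header fields a parser needs before it can walk the sectors.
#[derive(Debug, PartialEq)]
struct CfbHeader {
    major_version: u16,
    sector_size: usize,
}

/// Hypothetical helper: returns the header fields if `bytes` starts with a
/// compound-file header, or `None` if the file is something else.
fn parse_cfb_header(bytes: &[u8]) -> Option<CfbHeader> {
    if bytes.len() < 32 || !bytes.starts_with(&CFB_MAGIC) {
        return None;
    }
    // Little-endian fields: major version at offset 26, sector shift at 30.
    let major_version = u16::from_le_bytes([bytes[26], bytes[27]]);
    let sector_shift = u16::from_le_bytes([bytes[30], bytes[31]]);
    if sector_shift > 16 {
        return None; // implausible value, treat the file as corrupt
    }
    Some(CfbHeader {
        major_version,
        sector_size: 1usize << sector_shift,
    })
}

fn main() {
    let path = match env::args().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("usage: extract-doc file.doc");
            process::exit(2)
        }
    };
    let bytes = fs::read(&path).unwrap_or_else(|e| {
        eprintln!("{path}: {e}");
        process::exit(1)
    });
    match parse_cfb_header(&bytes) {
        // A real tool would now locate the WordDocument stream and print
        // the extracted text on stdout; this sketch only reports the header.
        Some(h) => eprintln!(
            "{path}: compound file, version {}, {}-byte sectors",
            h.major_version, h.sector_size
        ),
        None => {
            eprintln!("{path}: not a binary .doc (no compound-file header)");
            process::exit(1)
        }
    }
}
```

Diagnostics go to stderr so that stdout stays free for the extracted text, which keeps the `extract-doc file.doc > file.txt` usage intact once the actual extraction is implemented.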

If you can't do it, we can do it ourselves in collaboration with Microsoft. But it has to be done. This is an essential need in the age of LLMs.