Extract Markdown from the binary .doc format
Today, there is no satisfactory way for a data scientist to extract text from old Word files (binary .doc, not .docx).
This is a big problem: we have millions of documents in this format, and extracting text from them is very difficult.
- The Word API only works under Windows.
- OpenOffice doesn't handle parallelism very well.
- Antiword supports very few formats.
- Apache Tika is a Java implementation that works well but remains limited.
However, the specification of the binary .doc format has been published:
https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22
The funny thing is that even tools like microsoft/markitdown don't support the .doc format; you have to go through OpenOffice.
What we want is a command-line program, written in a compiled language (C++ or Rust), that extracts plain text from a .doc file:
extract-doc file.doc > file.txt
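To make the request more concrete, here is a minimal sketch in Rust of the very first step such a tool might perform: checking that the input really is a binary .doc container before parsing anything. Binary .doc files are stored as OLE compound files, which begin with a fixed 8-byte signature, whereas .docx files are ZIP archives starting with "PK". The function names below are hypothetical and only for illustration; this is not a text extractor, just a format check under those assumptions.

```rust
/// Signature that opens every OLE compound file, the container
/// used by binary .doc files.
const CFB_SIGNATURE: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

/// Hypothetical helper: true when `bytes` starts with the compound file signature.
fn is_compound_file(bytes: &[u8]) -> bool {
    bytes.starts_with(&CFB_SIGNATURE)
}

/// Hypothetical helper: reads the sector size from the compound file header.
/// Returns None if the input is not a compound file, the header is truncated,
/// or the sector shift is not one of the two values used in practice.
fn sector_size(bytes: &[u8]) -> Option<usize> {
    if !is_compound_file(bytes) {
        return None;
    }
    // The sector shift is a little-endian u16 at byte offset 30 of the header.
    let raw = bytes.get(30..32)?;
    let shift = u16::from_le_bytes([raw[0], raw[1]]);
    match shift {
        9 | 12 => Some(1usize << shift), // 512-byte or 4096-byte sectors
        _ => None,
    }
}
```

A real extract-doc would call something like this on the bytes returned by `std::fs::read`, then walk the container to locate the document streams described in the specification. That part is where the actual work lies.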
If you can't do it, we can do it ourselves in collaboration with Microsoft. But it has to be done: this is an essential need in the age of LLMs.