Suggested by Joseph-Sacha SCHUTZ

Today, data scientists have no satisfactory way to extract text from old Word files (binary .doc, not .docx).

This is a big problem. We have millions of documents in this format, and it's very difficult to extract text from them.
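
To make the .doc versus .docx distinction concrete, here is a minimal Python sketch for triaging a large corpus. It relies only on the container signatures: binary .doc files live in an OLE2 compound file, while .docx files are ZIP archives. The function names are hypothetical, and an OLE2 signature only marks a candidate, since old .xls and .ppt files use the same container.

```python
# Container signatures: OLE2 compound file (used by binary .doc)
# and ZIP local file header (used by .docx).
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_container(header: bytes) -> str:
    """Classify a file from its first bytes (hypothetical helper).

    Returns "ole2" for a legacy compound file (candidate binary .doc),
    "zip" for a ZIP archive (candidate .docx), "unknown" otherwise.
    """
    if header.startswith(CFB_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path) -> str:
    """Read only the first 8 bytes of a file and classify it."""
    with open(path, "rb") as f:
        return sniff_word_container(f.read(8))
```

Running sniff_file over a directory tree gives a rough count of how many files would actually need a binary .doc extractor, regardless of their file extension.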


- The Word API only works on Windows.

- OpenOffice doesn't handle parallelism very well.

- Antiword supports very few formats.

- Apache Tika is a Java implementation that works well but remains limited.


However, the specification of the binary .doc format has been published in open access:

https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22
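
As an illustration of what the published specification makes possible, here is a minimal Python sketch that reads a few fields from the start of the WordDocument stream (the FibBase structure). The offsets and bit masks follow my reading of the specification and should be checked against it; the function name is hypothetical. Getting the WordDocument stream out of the compound file is assumed to happen elsewhere (for example with a third-party OLE2 reader) and is not shown.

```python
import struct

WORD_IDENT = 0xA5EC  # wIdent: marks the stream as a Word binary file


def parse_fib_base(word_document_stream: bytes) -> dict:
    """Read a few FibBase fields (hypothetical helper, not a full parser)."""
    if len(word_document_stream) < 12:
        raise ValueError("stream too short to contain a FibBase header")
    # Offset 0: wIdent (uint16), offset 2: nFib (uint16), little-endian.
    w_ident, n_fib = struct.unpack_from("<HH", word_document_stream, 0)
    if w_ident != WORD_IDENT:
        raise ValueError("not a Word binary document (bad wIdent)")
    # Offset 10: 16 bits of flags.
    (flags,) = struct.unpack_from("<H", word_document_stream, 10)
    return {
        "nFib": n_fib,
        "fComplex": bool(flags & 0x0004),
        "fEncrypted": bool(flags & 0x0100),
        # fWhichTblStm selects which table stream holds the text layout.
        "table_stream": "1Table" if flags & 0x0200 else "0Table",
    }
```

A real extractor would go on from here to the table stream and the piece table to locate the text itself; this sketch stops at the header.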


The funny thing is that even tools like Microsoft's MarkItDown don't support the .doc format. You have to go through OpenOffice.


What we want is a command-line program, written in a compiled language (C++ / Rust), that extracts plain text from a .doc file.


extract-doc file.doc > file.txt 



If you can't do it, we can do it ourselves in collaboration with Microsoft. But it has to be done. This is an essential need in the age of LLMs.