  • extract markdown from binary doc format

    Today, there is no satisfactory way for a data scientist to extract text from old Word files (the binary .doc format, not .docx).

    This is a big problem. We have millions of documents in this format, and it's very difficult to extract text from them.


    - The Word API only works on Windows.

    - OpenOffice doesn't handle parallelism very well.

    - Antiword supports very few formats.

    - Apache Tika is a Java implementation that works well but remains limited.


    However, the specification of the binary .doc format ([MS-DOC]) has been published openly:

    https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22


    The funny thing is that even tools like Microsoft's MarkItDown don't support the .doc format. You have to use OpenOffice.


    What we want is a command-line program, written in a compiled language (C++ or Rust), that extracts the text of a .doc file into a text file:


    extract-doc file.doc > file.txt 
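
As a minimal sketch of the very first step such a tool would need, the Rust code below tells a binary .doc container apart from a .docx by its leading bytes. It relies on two facts: binary Office files are stored in a Compound File Binary (OLE2) container whose header starts with the bytes D0 CF 11 E0 A1 B1 1A E1, and .docx files are ZIP archives that normally start with "PK\x03\x04". The names `Container`, `detect_container` and `detect_file` are hypothetical and chosen only for illustration. The sketch does not extract any text: that would additionally require reading the container's streams and the structures described in the specification.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Container formats distinguished by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Container {
    /// Compound File Binary (OLE2): the container used by binary .doc files.
    CompoundFile,
    /// ZIP archive: the container used by .docx files.
    Zip,
    /// Anything else (RTF, plain text, empty or truncated file, ...).
    Unknown,
}

/// First 8 bytes of a Compound File Binary header.
const CFB_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];
/// Local file header signature at the start of a typical ZIP archive.
const ZIP_MAGIC: [u8; 4] = [0x50, 0x4B, 0x03, 0x04];

/// Classify a file from its leading bytes.
pub fn detect_container(header: &[u8]) -> Container {
    if header.starts_with(&CFB_MAGIC) {
        Container::CompoundFile
    } else if header.starts_with(&ZIP_MAGIC) {
        Container::Zip
    } else {
        Container::Unknown
    }
}

/// Read at most the first 8 bytes of `path` and classify them.
pub fn detect_file(path: &Path) -> io::Result<Container> {
    let mut header = Vec::with_capacity(8);
    File::open(path)?.take(8).read_to_end(&mut header)?;
    Ok(detect_container(&header))
}
```

A `main` function for the requested command line could call `detect_file` on its first argument and refuse anything that is not `Container::CompoundFile`. Note that the same container is also used by other binary Office formats such as .xls and .ppt, so this check is necessary but not sufficient to identify a .doc file.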


    If you can't do it, we can do it ourselves in collaboration with Microsoft. But it has to be done. This is an essential need in the age of LLMs.