Scan and automatically OCR receipts, bills, letters, etc. without turning on your computer

I was looking for a way to digitally archive incoming paper bills, receipts, letters, etc. and make them easily searchable via OCR. Ideally, this solution would be as simple as putting a document on the scanner and pressing a button. Done: the digitized document would be stored on a network drive. I really don’t like having to turn on my computer, open the scanner software, start the OCR, save the file, etc.

Existing Solutions

There are a few interesting scanning solutions out there, but I found them either too expensive or missing some functionality. Here are some examples that might work for you:

  • Doxie Go: looks promising (“scan anywhere”), but I couldn’t figure out whether it can scan directly to a network drive; also, it is sort of pricey (> 200 EUR for the Wi-Fi version)
  • Fujitsu ScanSnap iX500: even more expensive (close to 400 EUR)

Solution Based on NAS + All-In-One Scanner-Printer

In the end, I decided to use my existing NAS (a Synology DS213) to both OCR and store documents. I bought an HP Officejet Pro 8160 as the scanner because it can scan directly to network drives. Priced at about 130 EUR, it is also cheaper than the other scanners (while being a printer and fax at the same time).

Usage

The workflow goes like this:

  • Put a document on the scanner glass (or use the scanner’s document feeder)
  • Using the scanner’s touch screen, select “scan to network” and the destination drive (pre-configured via the web interface)

What happens in the background:

  • The scanner saves the document as JPG on a shared network folder on the NAS
  • A script on the NAS is watching this shared folder:
    • Once a new scan appears, it runs OCR and saves the scan as searchable PDF
    • It also tries to extract a date (e.g., from an invoice) to append to the file name
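The date extraction step could look roughly like the sketch below. It is not the original script: the function names `extract_german_date` and `dated_name` are hypothetical, the pattern only covers numeric German dates of the form DD.MM.YYYY, and rewriting the date as YYYY-MM-DD (so that files sort chronologically) is an assumption.

```shell
#!/bin/sh
# Sketch only: the function names and the YYYY-MM-DD output format are assumptions.

# extract_german_date: read OCR text on stdin and print the first date of the
# form DD.MM.YYYY (years 19xx/20xx) as YYYY-MM-DD. Prints nothing if none is found.
extract_german_date() {
    grep -Eo '[0-3]?[0-9]\.[01]?[0-9]\.(19|20)[0-9]{2}' |
        head -n 1 |
        awk -F'[.]' '{ printf "%04d-%02d-%02d\n", $3, $2, $1 }'
}

# dated_name BASE DATE: build the PDF file name, appending the date if there is one.
dated_name() {
    if [ -n "$2" ]; then
        echo "${1}_${2}.pdf"
    else
        echo "${1}.pdf"
    fi
}
```

With the OCR text of a scan in `scan0001.txt`, `dated_name scan0001 "$(extract_german_date < scan0001.txt)"` would yield the new file name. Taking the first date on the page is a simple heuristic and will pick the wrong one on some documents.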

Installation of OCR Package on the NAS

Setting up OCR was not as easy as I had hoped for, but worked out in the end. Here is a how-to:

  • log in to the DiskStation via SSH
  • install the IPKG package manager (instructions)
  • install tesseract-OCR from source (to get the latest version)
    • install GCC via package manager
      ipkg install gcc
    • install leptonica, autoconf, automake, and libtool from source (for libtool, use version 2.4.5; 2.4.6 did not work)
    • install tesseract from source
      • run
        ./autogen.sh
      • dirty workaround to make it compile: in ccutil/helpers.h, put smaller numbers in lines 64 and 65 (e.g., 63641362 and 14426950). I did not investigate further, but using different numbers for the random generator does not seem to be critical for an OCR application.
      • fix pthread issue (original post)
        • backup the pthread libraries found in /opt/arm-none-linux-gnueabi/lib/
          mkdir /opt/arm-none-linux-gnueabi/lib_disabled
          mv /opt/arm-none-linux-gnueabi/lib/libpthread* \
             /opt/arm-none-linux-gnueabi/lib_disabled
        • copy the system pthread library found in /lib
          cp /lib/libpthread.so.0 /opt/arm-none-linux-gnueabi/lib/
          cd /opt/arm-none-linux-gnueabi/lib/
          ln -s libpthread.so.0 libpthread.so
          ln -s libpthread.so.0 libpthread-2.5.so
      • run
        ./configure
        make
        make install
      • download tessdata to /usr/local/share/tessdata
      • edit /etc/profile and add
        export TESSDATA_PREFIX=/usr/local/share/tessdata/
  • configure tesseract to also create text files (used for date extraction below) by editing /usr/local/share/tessdata/configs/pdf as follows
    tessedit_create_txt 1
    tessedit_create_pdf 1
    tessedit_pageseg_mode 1
  • test with sample JPG file (this example uses a German dictionary)
    tesseract scan.jpg outfile -l deu /usr/local/share/tessdata/configs/pdf
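A quick way to confirm that the configuration above took effect is to check that the test run produced both output files. The helper below is a minimal sketch; `check_ocr_outputs` is a hypothetical name, and it assumes tesseract writes `outfile.pdf` and `outfile.txt` next to the given output base.

```shell
#!/bin/sh
# Sketch only: check_ocr_outputs is a hypothetical helper, not part of tesseract.

# check_ocr_outputs BASE: succeed only if BASE.pdf and BASE.txt both exist
# and are non-empty; report each file that is missing.
check_ocr_outputs() {
    base=$1
    status=0
    for ext in pdf txt; do
        if [ -s "$base.$ext" ]; then
            echo "ok: $base.$ext"
        else
            echo "missing or empty: $base.$ext" >&2
            status=1
        fi
    done
    return "$status"
}
```

After the test command, `check_ocr_outputs outfile` should report both files as ok. If the text file is missing, the `tessedit_create_txt 1` line most likely did not end up in the config file.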

Setting up Folder Watch

I’m using inotifywait to detect when new files are added to the watched folder. I created a shell script to start and stop watching. Note that everything is configured for German documents and German date format.
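The core of such a watch script might look like the sketch below. It is a reconstruction of the idea, not the original script: the share path `/volume1/scans` and all function and variable names are hypothetical, and start/stop handling (e.g., via a PID file) is left out. It is written for plain `sh`, since that is what the NAS provides.

```shell
#!/bin/sh
# Sketch only: paths, variable names, and function names are assumptions.
WATCH_DIR=${WATCH_DIR:-/volume1/scans}   # hypothetical shared folder
PDF_CONFIG=${PDF_CONFIG:-/usr/local/share/tessdata/configs/pdf}
OCR_LANG=${OCR_LANG:-deu}                # German documents

# is_scan_file NAME: true for *.jpg / *.jpeg in any letter case.
is_scan_file() {
    case $1 in
        *.[jJ][pP][gG] | *.[jJ][pP][eE][gG]) return 0 ;;
        *) return 1 ;;
    esac
}

# ocr_scan FILE: write a searchable PDF and a text file next to the scan,
# using the scan's name without its extension as the output base.
ocr_scan() {
    tesseract "$1" "${1%.*}" -l "$OCR_LANG" "$PDF_CONFIG"
}

# watch_scans: block and OCR every scan once it has been completely written.
watch_scans() {
    inotifywait -m -q -e close_write -e moved_to --format '%f' "$WATCH_DIR" |
    while IFS= read -r name; do
        if is_scan_file "$name"; then
            ocr_scan "$WATCH_DIR/$name" || echo "OCR failed: $name" >&2
        fi
    done
}

# Only start watching when asked to, so the file can also be sourced.
if [ "${1:-}" = "watch" ]; then
    watch_scans
fi
```

Waiting for `close_write` rather than `create` avoids running OCR on a file the scanner is still writing. The extension filter matters because the PDF and text output land in the same watched folder and would otherwise trigger the loop again.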

Additional Configuration Steps

  • Configuring the scanner to scan to a specific shared network folder is straightforward using the scanner’s web interface.
  • Likewise, sharing a folder from the NAS is straightforward using its excellent web interface.

2 thoughts on “Scan and automatically OCR receipts, bills, letters, etc. without turning on your computer”

  1. I get an error trying to ./configure Tesseract, saying

    checking for g++... g++
    checking whether the C++ compiler works... no
    configure: error: in `/volume1/NAS/Download/Compleet/tesseract-3.04.01':
    configure: error: C++ compiler cannot create executables

    It refers to config.log. If you would be able to take a look at it, I’d be happy to send it.
