

You might also find the pdf toolkit of use.

A full list of PDF software is here on Wikipedia.

Edit: Since you do need OCR capabilities, I think you'll have to try a different tack (i.e. I couldn't find a Linux pdf2text converter that does OCR).

gs: The command below should convert a multipage PDF to individual TIFF files.

    gs -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=letter -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- filename
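The Ghostscript page-to-TIFF step can be wrapped in a small helper so that the resolution is not hard-coded. This is only a sketch: the function name `pdf_to_tiffs`, the `DRY_RUN` variable, the `page-%04d.tiff` naming and the 300 dpi default are assumptions made for illustration, not part of the original answer.

```shell
# Sketch: render each page of a PDF to a bilevel (G4) TIFF with Ghostscript.
# pdf_to_tiffs and DRY_RUN are illustrative names, not part of any tool.
# The body runs in a subshell so its variables do not leak to the caller.
pdf_to_tiffs() (
    input=$1
    outdir=$2
    dpi=${3:-300}   # assumed default resolution

    # Build the command as positional parameters so paths with spaces survive.
    set -- gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 \
        "-r${dpi}x${dpi}" \
        "-sOutputFile=${outdir}/page-%04d.tiff" \
        "$input"

    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
)
```

Setting `DRY_RUN=1` before calling, for example `pdf_to_tiffs scan.pdf /tmp/pages 600`, prints the Ghostscript command without running it, which is handy for checking the resolution and output pattern first.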
If pdftotext is not on your machine, you'll have to install the poppler-utils package, which provides it:

    sudo apt-get install poppler-utils

I have had success with the BSD-licensed Linux port of the Cuneiform OCR system. No binary packages seem to be available, so you need to build it from source. Be sure to have the ImageMagick C++ libraries installed to have support for essentially any input image format (otherwise it will only accept BMP). While it appears to be essentially undocumented apart from a brief README file, I've found the OCR results quite good. The nice thing about it is that it can output position information for the OCR text in hOCR format, so that it becomes possible to put the text back in the correct position in a hidden layer of a PDF file. This way you can create "searchable" PDFs from which you can copy text.

I have used hocr2pdf to recreate PDFs out of the original image-only PDFs and OCR results. Sadly, the program does not appear to support creating multi-page PDFs, so you might have to create a script to handle them:

    #!/bin/bash
    # Run OCR on a multi-page PDF file and create a new pdf with the
    # OCR text in a hidden layer.
    input="$1"              # assumed: first argument is the source PDF
    output="$2"             # assumed: second argument is the PDF to create
    tmpdir="$(mktemp -d)"   # assumed: scratch directory for per-page files

    # extract images of the pages (note: resolution hard-coded)
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"

    # OCR each page individually and convert into PDF
    for page in "$tmpdir"/page-*.tiff; do
        base="${page%.tiff}"
        cuneiform -f hocr -o "$base.html" "$page"
        hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
    done

    # combine the per-page PDFs into one file
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf

Please note that the above script is very rudimentary. For example, it does not retain any PDF metadata.

Yes, you are wrong in believing that POI will do that. Apache POI works with Microsoft Office file formats, which PDF isn't. You'll either want to use Apache PDFBox directly, or use Apache Tika, which will do both Microsoft Office and PDF file formats (amongst many others).
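For the Apache Tika route, the command-line application can be tried before writing any Java. The following is a minimal sketch, assuming you have downloaded the tika-app jar yourself: the `TIKA_JAR` variable, its default file name, and the helper names `txt_name` and `extract_text` are hypothetical and exist only for illustration.

```shell
# Sketch: extract plain text from a PDF with the Apache Tika command-line app.
# TIKA_JAR is a hypothetical location; point it at the tika-app jar you downloaded.
TIKA_JAR=${TIKA_JAR:-tika-app.jar}

# Derive the output name: report.pdf -> report.txt
txt_name() {
    printf '%s.txt\n' "${1%.pdf}"
}

# Run Tika on one file; --text asks tika-app for plain-text output on stdout.
extract_text() {
    java -jar "$TIKA_JAR" --text "$1" > "$(txt_name "$1")"
}
```

Calling `extract_text report.pdf` would then write `report.txt` next to the input, provided Java and the jar are available.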
