Friday, February 19, 2016

ExecuteScript - Extract text & metadata from PDF

This post is about using Apache NiFi, its ExecuteScript processor, and Apache PDFBox to extract text and metadata from PDF files. It is similar to a previous post of mine, using Module Path to include JARs. But this is a good use case as well, so I thought I'd write a bit about it. This one will be short and sweet, but the aforementioned post has more details :)

In my example, I'm using the GetFile processor to find all PDFs in a directory. They are sent to an ExecuteScript processor, which uses PDFBox's PDFTextStripper (and other classes) to write the extracted text into the flow file content and add the document metadata as attributes. The resulting script is here:

// Written against PDFBox 1.8.x; in PDFBox 2.x, PDFTextStripper lives in org.apache.pdfbox.text
import org.apache.nifi.processor.io.StreamCallback  // harmless if the script engine already provides it
import org.apache.pdfbox.pdmodel.*
import org.apache.pdfbox.util.*

def flowFile = session.get()
if (!flowFile) return

def s = new PDFTextStripper()
def attrs = [:]

// Replace the PDF content with its extracted text and collect the metadata along the way
flowFile = session.write(flowFile, { inputStream, outputStream ->
    def doc = PDDocument.load(inputStream)
    try {
        def info = doc.getDocumentInformation()
        // Read the metadata while the document is still open
        attrs['pdf.page.count'] = "${doc.getNumberOfPages()}"
        attrs['pdf.title'] = "${info.getTitle()}"
        attrs['pdf.author'] = "${info.getAuthor()}"
        attrs['pdf.subject'] = "${info.getSubject()}"
        attrs['pdf.keywords'] = "${info.getKeywords()}"
        attrs['pdf.creator'] = "${info.getCreator()}"
        attrs['pdf.producer'] = "${info.getProducer()}"
        attrs['pdf.date.creation'] = "${info.getCreationDate()}"
        attrs['pdf.date.modified'] = "${info.getModificationDate()}"
        attrs['pdf.trapped'] = "${info.getTrapped()}"
        def writer = new OutputStreamWriter(outputStream)
        s.writeText(doc, writer)
        writer.flush()  // make sure buffered text reaches the flow file content
    } finally {
        doc.close()  // release the resources held by the parsed document
    }
} as StreamCallback)

// Attribute values must be plain Strings, so convert each GString explicitly
attrs.each { name, value -> flowFile = session.putAttribute(flowFile, name, value.toString()) }
session.transfer(flowFile, REL_SUCCESS)
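
One thing to watch for: the script above builds every attribute value with a GString, so a missing metadata field ends up as the literal text "null", and the Calendar values returned by getCreationDate() and getModificationDate() come out in their verbose toString() form. Here is a minimal sketch of one way to tidy that up. The helper name toAttributes and the sample values are hypothetical (they are not part of my original script), and it uses plain Groovy only so you can try it outside NiFi.

```groovy
// Hypothetical helper (an assumption for illustration): turns raw metadata
// values into a map of attribute name -> String, skipping missing fields.
Map<String, String> toAttributes(Map<String, Object> raw) {
    // Drop fields the PDF does not define, instead of storing "null"
    def present = raw.findAll { key, value -> value != null }
    return present.collectEntries { key, value ->
        // Calendar.toString() is very verbose, so emit ISO-8601 instead
        def text = (value instanceof Calendar) ? value.toInstant().toString() : value.toString()
        [(key): text]
    }
}

// Stand-in values; in the real script these would come from doc and info
def created = Calendar.getInstance(TimeZone.getTimeZone('UTC'))
created.clear()
created.set(2016, Calendar.FEBRUARY, 19, 0, 0, 0)

def attrs = toAttributes([
    'pdf.page.count'   : 3,
    'pdf.title'        : null,
    'pdf.author'       : 'Jane Doe',
    'pdf.date.creation': created
])
println attrs
```

In the script, a map like this could then be applied in one call with session.putAllAttributes(flowFile, attrs) rather than one putAttribute call per field.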


I then put the file's text contents out (using PutFile) and also logged the metadata attributes. The flow looks like this:

The template is available as a Gist (here). Please let me know if you find it useful, and/or if you have comments, questions, or suggestions. Cheers!

11 comments:

  1. Hi Matt,

    I got this error:


    18:37:20 CEST

    ERROR

    d78bb216-0fc9-47ea-bf81-39225d626ec1

    ExecuteScript[id=d78bb216-0fc9-47ea-bf81-39225d626ec1] ExecuteScript[id=d78bb216-0fc9-47ea-bf81-39225d626ec1] failed to process session due to java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()Ljava/util/List;: java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()Ljava/util/List;

    Which PDFBox JARs do I need?

    Hans

  2. I used version 1.8.11; for that I needed the pdfbox, jempbox, and fontbox JARs (all of the same version).

  3. How can I extract text from a plain text file and use its content as attributes? Say I have a configuration file with key : value pairs; I need to extract the content of this file and use each key as the attribute name and its value as the attribute value.

  4. Why is this not working for a simple PDF that consists of this?

    Name : Saurabh Bidwai

    Age : 22

    Education : Post Graduate

    Address : Pune

    Replies
    1. What's not working? Is it not outputting the text, or is it giving an error, or both? If it seems to be "working" but not generating text, then it might have something to do with the content. If PDFBox can't find it with PDFTextStripper, then you may need a different class or a different library.

    2. Yes, it's not outputting the text.

      I'm not familiar with Groovy; can you help me with it?

      Thanks in advance.

  5. Hi Matt,
    Can you please help with how to process the metadata (text file) through NiFi?

    Thanks
    Swetha

  6. We have recently been working on an application that needs to recognize and extract text from PDF documents. What we are deciding now is whether to use a PDF text extractor & converter or an OCR SDK.

  7. Replies
    1. Hi AdeleB,

      Can you help with the Groovy script you used, and does it also extract images and their content from the PDF?

      Thanks
      Deepak

  8. Hi Matt,

    Do you have an example of extracting images, and the text within images, from a PDF?

    Also, I faced an error similar to the one mentioned above with this code: for any PDF that doesn't have metadata, the text is not extracted. For example, a Word document converted to PDF returns an error and the text never gets extracted.

    Thanks
    Deepak
