Friday, February 19, 2016

ExecuteScript - Extract text & metadata from PDF

This post is about using Apache NiFi, its ExecuteScript processor, and Apache PDFBox to extract text and metadata from PDF files. It is similar to a previous post of mine on using Module Path to include JARs, but this is a good use case as well, so I thought I'd write a bit about it. This one will be short and sweet; the aforementioned post has more details :)

In my example, I'm using the GetFile processor to find all PDFs in a directory. They are sent to an ExecuteScript processor, which uses PDFBox and PDFTextStripper (and other classes) to extract the text into the flowfile content, and adds metadata as attributes. The resulting script is here:

// PDFBox 1.8.x package layout: PDFTextStripper lives in org.apache.pdfbox.util
import org.apache.pdfbox.pdmodel.*
import org.apache.pdfbox.util.*
import org.apache.nifi.processor.io.StreamCallback

def flowFile = session.get()
if (!flowFile) return

def s = new PDFTextStripper()
def doc, info

// Replace the flowfile content (the PDF bytes) with the extracted text
flowFile = session.write(flowFile, { inputStream, outputStream ->
    doc = PDDocument.load(inputStream)
    info = doc.getDocumentInformation()
    def writer = new OutputStreamWriter(outputStream)
    s.writeText(doc, writer)
    writer.flush()  // push any buffered text through to the flowfile content
} as StreamCallback)

// Add the document metadata as flowfile attributes
flowFile = session.putAttribute(flowFile, 'pdf.page.count', "${doc.getNumberOfPages()}")
flowFile = session.putAttribute(flowFile, 'pdf.title', "${info.getTitle()}")
flowFile = session.putAttribute(flowFile, 'pdf.author', "${info.getAuthor()}")
flowFile = session.putAttribute(flowFile, 'pdf.subject', "${info.getSubject()}")
flowFile = session.putAttribute(flowFile, 'pdf.keywords', "${info.getKeywords()}")
flowFile = session.putAttribute(flowFile, 'pdf.creator', "${info.getCreator()}")
flowFile = session.putAttribute(flowFile, 'pdf.producer', "${info.getProducer()}")
flowFile = session.putAttribute(flowFile, 'pdf.date.creation', "${info.getCreationDate()}")
flowFile = session.putAttribute(flowFile, 'pdf.date.modified', "${info.getModificationDate()}")
flowFile = session.putAttribute(flowFile, 'pdf.trapped', "${info.getTrapped()}")

doc.close()  // release the resources held by the loaded document
session.transfer(flowFile, REL_SUCCESS)
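
One thing to be aware of in the script above: a metadata field that the PDF does not define comes back as null, and "${info.getTitle()}" turns that into the literal string "null". The dates are Calendar objects, so their attribute values are the long Calendar toString() text. Here is a minimal sketch of one way to tidy that up. The helper name toAttributes and the date pattern are my own assumptions for illustration, not part of the original flow, and the sample values stand in for what PDFBox would return:

```groovy
import java.text.SimpleDateFormat

// Hypothetical helper (not part of the original script): converts raw metadata
// values into string attributes, skipping any field that is null.
Map<String, String> toAttributes(Map<String, Object> raw) {
    def result = [:]
    raw.each { name, value ->
        if (value == null) return               // field missing in the PDF: skip it
        if (value instanceof Calendar) {
            // Format dates as ISO 8601 in the calendar's own time zone
            def fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX", Locale.ROOT)
            fmt.setTimeZone(value.getTimeZone())
            result[name] = fmt.format(value.getTime())
        } else {
            result[name] = value.toString()
        }
    }
    return result
}

// Sample values standing in for info.getTitle(), info.getAuthor(), and so on
def created = new GregorianCalendar(TimeZone.getTimeZone('UTC'))
created.clear()
created.set(2016, Calendar.FEBRUARY, 19, 12, 0, 0)

def attrs = toAttributes([
    'pdf.title'        : 'Example',
    'pdf.author'       : null,
    'pdf.date.creation': created
])
println attrs
```

In the ExecuteScript script, a map built this way could be handed to session.putAllAttributes(flowFile, attrs) in place of the individual putAttribute calls, so missing fields simply produce no attribute.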


I then wrote the extracted text out with PutFile and logged the metadata attributes. The flow looks like this:

[Image: flow from GetFile to ExecuteScript to PutFile, with the metadata attributes logged]

The template is available as a Gist (here). Please let me know if you find it useful, and/or if you have comments, questions, or suggestions. Cheers!

3 comments:

  1. Hi Matt,

    I got this error:


    18:37:20 CEST ERROR d78bb216-0fc9-47ea-bf81-39225d626ec1

    ExecuteScript[id=d78bb216-0fc9-47ea-bf81-39225d626ec1] ExecuteScript[id=d78bb216-0fc9-47ea-bf81-39225d626ec1] failed to process session due to java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()Ljava/util/List;: java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()Ljava/util/List;

    Which PDFBox JARs do I need?

    Hans

  2. I used version 1.8.11; for that I needed the pdfbox, jempbox, and fontbox JARs (all of the same version).

  3. How can I extract text from a normal text file and use its content as attributes? Let's say I have a configuration file with key : value pairs. I need to extract the content of this file and use each key as an attribute name and each value as the attribute value.

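
For the key : value question in the last comment, one possible approach is sketched below in plain Groovy. It only covers the parsing step. The helper name parseKeyValues, the sample text, and the choice to split on the first colon are assumptions made for illustration:

```groovy
// Hypothetical helper (not from the original post): parses "key : value" lines
// into a map of attribute names to attribute values.
Map<String, String> parseKeyValues(String text) {
    def attrs = [:]
    text.eachLine { line ->
        def idx = line.indexOf(':')
        if (idx < 1) return                     // no separator or nothing before it: skip
        def key = line.substring(0, idx).trim()
        def value = line.substring(idx + 1).trim()
        if (key) attrs[key] = value             // split on the first colon only
    }
    return attrs
}

// Made-up sample content standing in for the configuration file
def sample = '''host : example.org
port : 8080

not a pair
url : http://example.org:8080/path
'''

def parsed = parseKeyValues(sample)
println parsed
```

Inside ExecuteScript, the text could be read from the flowfile content with session.read and an InputStreamCallback, and the resulting map applied with session.putAllAttributes(flowFile, parsed). Splitting on the first colon keeps values such as URLs intact.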