Fun with Apache NiFi: ExecuteScript - Extract text & metadata from PDF

Friday, February 19, 2016

This post is about using Apache NiFi, its ExecuteScript processor, and Apache PDFBox to extract text and metadata from PDF files. It is similar to a previous post of mine on using Module Path to include JARs, but this is a good use case as well, so I thought I'd write a bit about it. This one will be short and sweet; the aforementioned post has more details :)

In my example, I'm using the GetFile processor to find all PDFs in a directory. They are sent to an ExecuteScript processor, which uses PDFBox and PDFTextStripper (and other classes) to extract the text into the flowfile content, and adds metadata as attributes. The resulting script is here:

import org.apache.nifi.processor.io.StreamCallback
import org.apache.pdfbox.pdmodel.*
import org.apache.pdfbox.util.*   // PDFTextStripper is in this package in PDFBox 1.8.x

def flowFile = session.get()
if (!flowFile) return

def s = new PDFTextStripper()
def doc, info

try {
    // Replace the flowfile content (the PDF) with its extracted text
    flowFile = session.write(flowFile, { inputStream, outputStream ->
        doc = PDDocument.load(inputStream)
        info = doc.getDocumentInformation()
        def writer = new OutputStreamWriter(outputStream, 'UTF-8')
        s.writeText(doc, writer)
        // Flush so that buffered text is not lost, which matters most for short documents
        writer.flush()
    } as StreamCallback)

    // Copy the document metadata into flowfile attributes
    flowFile = session.putAttribute(flowFile, 'pdf.page.count', "${doc.getNumberOfPages()}")
    flowFile = session.putAttribute(flowFile, 'pdf.title', "${info.getTitle()}")
    flowFile = session.putAttribute(flowFile, 'pdf.author', "${info.getAuthor()}")
    flowFile = session.putAttribute(flowFile, 'pdf.subject', "${info.getSubject()}")
    flowFile = session.putAttribute(flowFile, 'pdf.keywords', "${info.getKeywords()}")
    flowFile = session.putAttribute(flowFile, 'pdf.creator', "${info.getCreator()}")
    flowFile = session.putAttribute(flowFile, 'pdf.producer', "${info.getProducer()}")
    flowFile = session.putAttribute(flowFile, 'pdf.date.creation', "${info.getCreationDate()}")
    flowFile = session.putAttribute(flowFile, 'pdf.date.modified', "${info.getModificationDate()}")
    flowFile = session.putAttribute(flowFile, 'pdf.trapped', "${info.getTrapped()}")
    session.transfer(flowFile, REL_SUCCESS)
} finally {
    // Release the resources held by the PDF document
    doc?.close()
}
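
One thing to be aware of with the script above: a metadata field that is missing from the PDF ends up as the literal attribute value "null", and the date fields come out in Calendar's verbose toString() form. The following is only a sketch of one way to tidy that up, not part of the original template. The helper name toAttributes, the sample values, and the choice of an ISO-8601 date format are all assumptions made for illustration. It uses nothing but the JDK and Groovy, so it can be tried outside NiFi.

```groovy
import java.text.SimpleDateFormat

// Hypothetical helper (not part of the original script): turns raw metadata
// values into attribute strings, skipping anything null or blank.
Map<String, String> toAttributes(Map<String, Object> raw) {
    // Assumed output format: ISO-8601 timestamp in UTC
    def iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
    iso.setTimeZone(TimeZone.getTimeZone('UTC'))

    Map<String, String> attrs = [:]
    raw.each { key, value ->
        if (value == null) return   // field not present in the PDF
        String text = (value instanceof Calendar) ? iso.format(value.time) : value.toString()
        if (text.trim()) attrs[key] = text   // skip blank strings too
    }
    return attrs
}

// Stand-in values; in the script these would come from info.getTitle() and friends
def created = Calendar.getInstance(TimeZone.getTimeZone('UTC'))
created.clear()
created.set(2016, Calendar.FEBRUARY, 19, 10, 30, 0)

def attrs = toAttributes([
    'pdf.title'        : 'Quarterly report',
    'pdf.author'       : null,
    'pdf.subject'      : '   ',
    'pdf.date.creation': created,
    'pdf.page.count'   : 3
])
println attrs
```

Inside ExecuteScript, a map built this way could be applied in a single call with session.putAllAttributes(flowFile, attrs) instead of one putAttribute call per field.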

I then wrote the extracted text out to a file (using PutFile) and also logged the metadata attributes. The flow looks like this:

The template is available as a Gist (here). Please let me know if you find it useful, and/or if you have comments, questions, or suggestions. Cheers!

14 comments:

Unknown, August 24, 2016 at 9:49 AM
Hi Matt,

I got this error:

18:37:20 CEST ERROR d78bb216-0fc9-47ea-bf81-39225d626ec1
ExecuteScript[id=d78bb216-0fc9-47ea-bf81-39225d626ec1] ExecuteScript[id=d78bb216-0fc9-47ea-bf81-39225d626ec1] failed to process session due to java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()Ljava/util/List;: java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()Ljava/util/List;

Which PDFBox JARs do I need?

Hans
MattyB, August 24, 2016 at 1:39 PM
I used version 1.8.11; for that I needed the pdfbox, jempbox, and fontbox JARs (all of the same version).
Unknown, September 22, 2016 at 6:00 AM
How can I extract text from a normal text file and use its content as attributes? Let's say I have a configuration file with key : value pairs; I need to extract the content of this file and use each key as the attribute name and each value as the attribute value.
Unknown, December 1, 2017 at 3:58 AM
Why is this not working for a simple PDF that consists of the following?

Name : Saurabh Bidwai

Age : 22

Education : Post Graduate

Address : Pune
Unknown, March 8, 2018 at 2:25 AM
Hi Matt,
Can you please help with how to process the metadata (text file) through NiFi?

Thanks
Swetha

Anonymous, May 6, 2018 at 9:05 PM
We have recently been involved in developing an application that needs to recognize and extract text from PDF documents. What we are deciding now is whether to use a PDF text extractor & converter or an OCR SDK.
AdeleB, September 11, 2018 at 1:18 AM
I tried XsPDF SDK. It's easy to extract text & metadata from PDF.
Deepak Mishra, September 23, 2018 at 5:50 AM
Hi Matt,

Do you have an example of extracting images, and the text in images, from a PDF as well?

Also, I faced an error similar to the one mentioned above with this code: for any PDF that doesn't have metadata, the text is not extracted. For example, a Word document converted to PDF returns an error and the text never gets extracted.

Thanks
Deepak
Anonymous, January 28, 2019 at 10:47 PM
Very nice article; it's very informative and I have learned a lot. Thanks, and keep sharing such informative articles. metadata extractor
Unknown, October 11, 2019 at 12:42 AM
I tried to follow your code, but I get an error: unable to resolve class: PDFTextStripper at line: def s = new PDFTextStripper(). I put the JAR files jempbox-1.8.12.jar, fontbox-1.8.12.jar, and commons-logging-1.1.1.jar in a folder and pointed to it through the module directory of the ExecuteScript processor.