In my example, I'm using the GetFile processor to find all PDFs in a directory. They are sent to an ExecuteScript processor, which uses PDFBox and PDFTextStripper (and other classes) to extract the text into the flowfile content, and adds metadata as attributes. The resulting script is here:
import org.apache.pdfbox.pdmodel.*
import org.apache.pdfbox.util.*   // PDFTextStripper lives here in PDFBox 1.8.x

def flowFile = session.get()
if (!flowFile) return

def s = new PDFTextStripper()
def doc, info

// Replace the flowfile content with the text extracted from the PDF
flowFile = session.write(flowFile, { inputStream, outputStream ->
    doc = PDDocument.load(inputStream)
    info = doc.getDocumentInformation()
    def writer = new OutputStreamWriter(outputStream)
    s.writeText(doc, writer)
    writer.flush()   // push any buffered text into the flowfile content
} as StreamCallback)

// Add the document metadata as flowfile attributes
flowFile = session.putAttribute(flowFile, 'pdf.page.count', "${doc.getNumberOfPages()}")
flowFile = session.putAttribute(flowFile, 'pdf.title', "${info.getTitle()}")
flowFile = session.putAttribute(flowFile, 'pdf.author', "${info.getAuthor()}")
flowFile = session.putAttribute(flowFile, 'pdf.subject', "${info.getSubject()}")
flowFile = session.putAttribute(flowFile, 'pdf.keywords', "${info.getKeywords()}")
flowFile = session.putAttribute(flowFile, 'pdf.creator', "${info.getCreator()}")
flowFile = session.putAttribute(flowFile, 'pdf.producer', "${info.getProducer()}")
flowFile = session.putAttribute(flowFile, 'pdf.date.creation', "${info.getCreationDate()}")
flowFile = session.putAttribute(flowFile, 'pdf.date.modified', "${info.getModificationDate()}")
flowFile = session.putAttribute(flowFile, 'pdf.trapped', "${info.getTrapped()}")

doc.close()   // release the PDF once the metadata has been read
session.transfer(flowFile, REL_SUCCESS)
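The script above passes every metadata field through a GString, so a field the PDF does not set ends up as the literal text "null". Here is a minimal sketch of one alternative, assuming you would rather leave missing fields out. The helper name toAttributes and its prefix parameter are hypothetical and not part of the original script.

```groovy
// Hypothetical helper (not in the original script): turn a map of metadata
// values into flowfile attribute names and values, dropping any entry whose
// value is null or blank.
Map<String, String> toAttributes(Map<String, Object> metadata, String prefix = 'pdf.') {
    metadata.findAll { key, value -> value != null && value.toString().trim() }
            .collectEntries { key, value -> [(prefix + key): value.toString()] }
}

// Only the field that actually has a value survives
def attrs = toAttributes([title: 'Quarterly Report', author: null, subject: ''])
assert attrs == ['pdf.title': 'Quarterly Report']
```

In the ExecuteScript body, the resulting map could be applied in one call with session.putAllAttributes(flowFile, toAttributes([title: info.getTitle(), author: info.getAuthor()])) instead of one putAttribute call per field.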
I then put the file's text contents out (using PutFile) and also logged the metadata attributes. The flow looks like this:
The template is available as a Gist (here). Please let me know if you find it useful, and/or if you have comments, questions, or suggestions. Cheers!
Hi Matt,
I got this error:
18:37:20 CEST
ERROR
d78bb216-0fc9-47ea-bf81-39225d626ec1
ExecuteScript[id=d78bb216-0fc9-47ea-bf81-39225d626ec1] ExecuteScript[id=d78bb216-0fc9-47ea-bf81-39225d626ec1] failed to process session due to java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()Ljava/util/List;: java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()Ljava/util/List;
Which PDFBox JARs do I need?
Hans
This is actually the kind of information I have been trying to find. Thank you for writing this information. http://pdftotext.org
I used version 1.8.11; for that I needed the pdfbox, jempbox, and fontbox JARs (all of the same version).
How can I extract text from a plain text file and use its content as attributes? Let's say I have a configuration file with key : value pairs. I need to extract the content of this file and use each key as the attribute name and its value as the attribute value.
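As a rough illustration of the key : value question above, the parsing step could be sketched in Groovy as follows. The function name parseKeyValues is hypothetical, and the sketch assumes one pair per line, split at the first colon.

```groovy
// Hypothetical helper: parse "key : value" lines into a map.
// Assumes one pair per line; everything after the first ':' is the value.
Map<String, String> parseKeyValues(String text) {
    Map<String, String> result = [:]
    text.eachLine { line ->
        int idx = line.indexOf(':')
        if (idx > 0) {                               // skip lines with no key
            String key = line.substring(0, idx).trim()
            String value = line.substring(idx + 1).trim()
            if (key) result[key] = value
        }
    }
    result
}

def parsed = parseKeyValues("host : localhost\nport : 8080\nnot a pair")
assert parsed == [host: 'localhost', port: '8080']
```

Inside ExecuteScript, the flowfile content would first have to be read as text (for example with session.read and an InputStreamCallback), and the resulting map could then be passed to session.putAllAttributes.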
Why is this not working for a simple PDF that consists of this?
Name : Saurabh Bidwai
Age : 22
Education : Post Graduate
Address : Pune
What's not working? Is it not outputting the text, or is it giving an error, or both? If it seems to be "working" but not generating text, then it might have something to do with the content. If PDFBox can't find it with PDFTextStripper, then you may need a different class or a different library.
Yeah!! It's not outputting the text.
I'm not familiar with Groovy, can you help me with it?
Thanks in advance.
Hi Matt,
Can you please help with how to process the metadata (text file) through NiFi?
Thanks
Swetha
We have recently been working on an application that needs to recognize and extract text from PDF documents. What we are deciding now is whether to use a PDF text extractor and converter or an OCR SDK.
I tried XsPDF SDK. It's easy to extract text and metadata from PDF.
Hi AdeleB,
Can you help with the Groovy script you used? Does it also extract images and their content from the PDF?
Thanks
Deepak
Hi Matt,
Do you have an example of extracting images, and the text in images, from a PDF as well?
Also, I faced an error similar to the one mentioned above with this code: for any PDF that doesn't have metadata, the text is not extracted. For example, a Word document converted to PDF returns an error and the text never gets extracted.
Thanks
Deepak
Very nice article, it's very informative and I have learned a lot. Thanks, keep sharing such informative articles. metadata extractor
I tried to follow your code, but I get an error: unable to resolve class: PDFTextStripper at line: def s = new PDFTextStripper(). I put the JAR files jempbox-1.8.12.jar, fontbox-1.8.12.jar, and commons-logging-1.1.1.jar in a folder and referenced its path in the Module Directory property of the ExecuteScript processor.