Wednesday, May 25, 2016

Validating JSON in NiFi with ExecuteScript

My last post alluded to a Groovy script for ExecuteScript that would use JSON Schema Validator to validate incoming flow files (in JSON format) against a JSON Schema. The purpose of that post was to show how to use Groovy Grape to get the JSON Schema Validator dependencies loaded dynamically by the script (versus downloading the JARs and adding them to the Module Directory property).  For this post I want to show the actual JSON validation script (as I presented it on the Apache NiFi "users" mailing list).

I'll use the schema as it was presented to me on the mailing list:
{
  "type": "object",
  "required": ["name", "tags", "timestamp", "fields"],
  "properties": {
    "name": {"type": "string"},
    "timestamp": {"type": "integer"},
    "tags": {"type": "object", "items": {"type": "string"}},
    "fields": { "type": "object"}
  }
}
This shows that the incoming flow file should contain a JSON object, that it needs to have certain fields (the "required" values), and the types of the values it may/must contain (the "properties" entries).  For this script I'll hard-code this schema, but I'll talk a bit at the end about how this can be done dynamically for a better user experience.
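To make the schema concrete, here is a sample flow file that would pass validation (the field names and values are just made-up metric data, not from the mailing list thread). A document missing any of the "required" keys, or with a non-integer "timestamp", would fail:

```json
{
  "name": "cpu",
  "timestamp": 1464220800,
  "tags": { "host": "server01" },
  "fields": { "usage_idle": 97.2 }
}
```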

Since the schema itself is JSON, we use org.json.JSONObject and friends to read it in, then use org.everit.json.schema.loader.SchemaLoader to load it into a Schema object. We read in the flow file with session.read(), passing in a closure cast to InputStreamCallback (see my previous post for details), and call schema.validate() on the content. If the JSON is not valid, validate() throws a ValidationException; the script catches it, logs the error, and sets a "valid" flag to false, then routes the flow file to REL_SUCCESS or REL_FAILURE depending on whether it validated against the schema.  The original script is as follows:
import org.everit.json.schema.Schema
import org.everit.json.schema.loader.SchemaLoader
import org.json.JSONObject
import org.json.JSONTokener

flowFile = session.get()
if(!flowFile) return

jsonSchema = """
{
  "type": "object",
  "required": ["name", "tags", "timestamp", "fields"],
  "properties": {
    "name": {"type": "string"},
    "timestamp": {"type": "integer"},
    "tags": {"type": "object", "items": {"type": "string"}},
    "fields": { "type": "object"}
  }
}
"""

boolean valid = true
session.read(flowFile, { inputStream ->
    jsonInput = org.apache.commons.io.IOUtils.toString(inputStream, java.nio.charset.StandardCharsets.UTF_8)
    JSONObject rawSchema = new JSONObject(new JSONTokener(new ByteArrayInputStream(jsonSchema.bytes)))
    Schema schema = SchemaLoader.load(rawSchema)
    try {
        schema.validate(new JSONObject(jsonInput))
    } catch(ve) {
        log.error("Doesn't adhere to schema", ve)
        valid = false
    }
} as InputStreamCallback)

session.transfer(flowFile, valid ? REL_SUCCESS : REL_FAILURE)

This is a pretty basic script; there are several things we could do to improve the capability:
  • Move the schema load out of the session.read() method, since it doesn't require the input
  • Allow the user to specify the schema via a dynamic property
  • Do better exception handling and error message reporting
A worthwhile improvement (that would include all of these) is to turn the script into a proper Processor and put it in an InvokeScriptedProcessor. That way you could have a custom set of relationships and properties, making it easy for the user to configure and use.
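As a smaller step, here is a sketch of the first two improvements applied to the ExecuteScript version: the schema comes from a user-defined dynamic property (I'm assuming one named "jsonSchema" here) rather than being hard-coded, and it is loaded once outside session.read(). ExecuteScript binds each dynamic property as a variable of the same name, holding a PropertyValue:

```groovy
import org.everit.json.schema.Schema
import org.everit.json.schema.loader.SchemaLoader
import org.json.JSONObject
import org.json.JSONTokener

flowFile = session.get()
if(!flowFile) return

// "jsonSchema" is an assumed dynamic property on the processor; evaluate any
// Expression Language against the flow file, then grab the raw schema text
def schemaText = jsonSchema.evaluateAttributeExpressions(flowFile).value

// Load the schema outside session.read(), since it doesn't require the input
Schema schema = SchemaLoader.load(new JSONObject(new JSONTokener(schemaText)))

boolean valid = true
session.read(flowFile, { inputStream ->
    def jsonInput = org.apache.commons.io.IOUtils.toString(inputStream, java.nio.charset.StandardCharsets.UTF_8)
    try {
        schema.validate(new JSONObject(jsonInput))
    } catch(ve) {
        log.error("Flow file doesn't adhere to the supplied schema", ve)
        valid = false
    }
} as InputStreamCallback)

session.transfer(flowFile, valid ? REL_SUCCESS : REL_FAILURE)
```

This is just a sketch, not a drop-in replacement; better exception handling and error reporting (the third bullet) are left as an exercise.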

Of course, the best solution is probably to implement it in Java and contribute it to Apache NiFi under the Jira case NIFI-1893 :)

Cheers!
