Thursday, June 1, 2023

Generating Random Schemas for GenerateRecord

As of NiFi 1.20.0 (via NIFI-10585) there is a GenerateRecord processor that can either take user-defined properties to define fields with values (populated by java-faker), or use an Avro schema via the Schema Text property to fill in the records with random values based on the datatypes in the schema. The Record Writer can be specified, so even though the schema is an Avro schema, the records will be written out in whatever format the writer supports.
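
For example, a minimal Schema Text value might look like the following (the record name and fields here are just illustrative):

{
  "type" : "record",
  "name" : "Example_Record",
  "namespace" : "any.data",
  "fields" : [
    { "name": "id", "type": "int" },
    { "name": "description", "type": "string" }
  ]
}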

For a small number of fields or for a predefined schema, GenerateRecord works well to randomly generate data and takes little time to configure. But if you're using GenerateRecord to try to generate random records with a large number of fields for testing purposes, you either have to add that many user-defined properties or design your own Avro schema with that many fields. If you just care about generating random fields to test something downstream, you can use the following Groovy script to generate an Avro schema file (.avsc) with the specified number of fields, each with a randomly chosen datatype:

// Number of fields to generate and the schema file to write them to
def numFields = 10
File file = new File("columns_${numFields}.avsc")

// Avro primitive types to choose from at random
def primitiveDataTypes = ['string', 'int', 'float', 'double']

// Write the header of the Avro record schema
file.write '''{
  "type" : "record",
  "name" : "NiFi_Record",
  "namespace" : "any.data",
  "fields" : ['''

// Build one field entry per column, each with a randomly chosen datatype
def fields = []
def r = new Random()
(1..numFields).each {
  fields << "\t{ \"name\": \"col_$it\", \"type\": \"${primitiveDataTypes[r.nextInt(primitiveDataTypes.size())]}\" }"
}
file << fields.join(',\n')

// Close the fields array and the record definition
file << ''' ]
}
'''

The script above writes a valid Avro schema to the file referenced by the file variable, and numFields specifies how many fields each record will have. Each field's datatype is chosen at random from primitiveDataTypes. One possible output (for the above script as-is):

{
  "type" : "record",
  "name" : "NiFi_Record",
  "namespace" : "any.data",
  "fields" : [ { "name": "col_1", "type": "float" },
{ "name": "col_2", "type": "float" },
{ "name": "col_3", "type": "int" },
{ "name": "col_4", "type": "string" },
{ "name": "col_5", "type": "int" },
{ "name": "col_6", "type": "double" },
{ "name": "col_7", "type": "string" },
{ "name": "col_8", "type": "string" },
{ "name": "col_9", "type": "string" },
{ "name": "col_10", "type": "string" } ]
}
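
If you want to double-check that the generated file is a valid Avro schema before pasting it into GenerateRecord, a quick sanity check like this sketch works (assuming the Avro library is available to Grape; the version shown is just an example):

// Parse the generated file with the Avro library to verify it's a valid schema
@Grab('org.apache.avro:avro:1.11.1')
import org.apache.avro.Schema

def schema = new Schema.Parser().parse(new File('columns_10.avsc'))
println "Parsed schema '${schema.name}' with ${schema.fields.size()} fields"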

For 10 fields, this probably isn't the way you want to go, as you can easily add 10 user-defined properties or come up with your own Avro schema. But let's say you want to test your target database with 5000 columns using GenerateRecord -> UpdateDatabaseTable -> PutDatabaseRecord. You can set numFields to 5000, run the script, then use the contents of the generated columns_5000.avsc file as the Schema Text property in GenerateRecord. You'll find that you get the number of records specified by GenerateRecord, each with random values corresponding to the datatypes generated by the Groovy script. UpdateDatabaseTable can create the table if it doesn't exist, and PutDatabaseRecord downstream will insert the records generated by GenerateRecord into the newly-created table.
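
To avoid editing the script for each run, one small tweak (just a sketch, assuming you save the script as generate_schema.groovy) is to take the field count from the command line instead of hardcoding it:

// Read the field count from the first command-line argument, defaulting to 10
// Usage: groovy generate_schema.groovy 5000
def numFields = args ? args[0] as int : 10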

I admit the use case for this is perhaps esoteric, but it's a fun use of NiFi and that's what this blog is all about :)

As always, I welcome all comments, questions, and suggestions. Cheers!