Friday, April 15, 2016

Inspecting your NiFi DistributedMapCacheServer with Groovy

The DistributedMapCacheServer in Apache NiFi is a Controller Service that allows you to store key/value pairs for use by clients (such as NiFi's DistributedMapCacheClient service), see the official documentation for more details.

To create one, go to Controller Services (on the management toolbar on the right), create a new Controller Service, and select type DistributedMapCacheServer. Once named and saved, you can click the edit button (pencil icon) and set up the hostname and port and such:


Once the properties are saved, you can click the Enable button (lightning-bolt icon) and it will be ready for use by some NiFi Processors.

The DistributedMapCacheServer is mainly used for lookups and for keeping user-defined state. The PutDistributedMapCache and FetchDistributedMapCache processors are good for the latter. Other processors make use of the server to keep track of things like which files they have already processed. DetectDuplicate, ListFile, ListFileTransfer, GetHBase, and others make use of the cache server(s) too.

So it is certainly possible to have a NiFi dataflow with FetchDistributedMapCache scheduled for some period of time (say 10 seconds) connected to a LogAttribute processor or something to list the current contents of the desired keys in the cache. For this post I wanted to show how to inspect the cache from outside NiFi for two reasons. The first is to illustrate the use case of working with the cache without using NiFi components (good for automated population or monitoring of the cache from the outside), and also to show the very straightforward protocol used to get values from the cache. Putting data in is equally as simple, perhaps I'll add a follow-on post for that.

The DistributedMapCacheServer opens a port at the configured value (see dialog above), it expects a TCP connection and then various commands serialized in specific ways.  The first task, once connected to the server, is to negotiate the communications protocol version to be used for all future client-server operations.  To do this, we need the following:

  • Client sends the string "NiFi" as bytes to the server
  • Client sends the protocol version as an integer (4-bytes)


If you are using an output stream to write these values, make sure you flush the stream after these steps, to ensure they are sent to the server, so that the server can respond. The server will respond with one of three codes:

  • RESOURCE_OK (20): The server accepted the client's proposal of protocol name/version
  • DIFFERENT_RESOURCE_VERSION (21): The server accepted the client's proposal of protocol name but not the version
  • ABORT (255): The server aborted the connection


Once we get a RESOURCE_OK, we may continue on with our communications. If instead we get DIFFERENT_RESOURCE_VERSION, then the client needs to read in an integer containing the server's preferred protocol version. If the client can proceed using this version (or another version lower than the server's preference), it should re-negotiate the version by sending the new client-preferred version as an integer (note you do not need to send the "NiFi" again, the name has already been accepted).
If the client and server cannot agree on the protocol version, the client should disconnect from the server. If some error occurs on the server and it aborts the connection, the ABORT status code will be returned, and a message error can be obtained by the client (before disconnect) by reading in a string of UTF-8 bytes.

So let's see all this working in a simple example written in Groovy. Here is the script I used:
def protocolVersion = 1
def keys = ['entry', 'filename']

s = new Socket('localhost', 4557)

s.withStreams { input, output ->
  def dos = new DataOutputStream(output)
  def dis = new DataInputStream(input)
  
  // Negotiate handshake/version
  dos.write('NiFi'.bytes)
  dos.writeInt(protocolVersion)
  dos.flush()  
 
  status = dis.read()
  while(status == 21) {
     protocolVersion = dis.readInt()
     dos.writeInt(protocolVersion)
     dos.flush()
     status = dis.read()
  }
  
  // Get entries
  keys.each {
      key = it.getBytes('UTF-8')
      dos.writeUTF('get')
      def baos = new ByteArrayOutputStream()
      baos.write(key)
      dos.writeInt(baos.size())  
      baos.writeTo(dos)
      dos.flush()
      def length = dis.readInt()
      def bytes = new byte[length]
      dis.readFully(bytes)
      println "$it = ${new String(bytes)}"
  }
  
  // Close 
  dos.writeUTF("close");
  dos.flush();
}

I have set the protocol version to 1, which at present is the only accepted version. But you can set it higher to see the protocol negotiation work.

Also I have the variable "keys" with a list of the keys to look up in the cache. There is no mechanism at present for retrieving all the keys in the cache. This is probably for simplicity and to avoid denial-of-service type stuff if there are tons and tons of keys.  For our example, it will fetch the value for each key and print out the key/value pair.

Next you can see the creation of the socket, using the same port as was configured for the server (4557). Then Groovy has some nice additions to the Socket class, to offer an InputStream and OutputStream for that socket to your closure. Since we'll be dealing with bytes, strings, and integers, I thought a DataInputStream and DataOutputStream would be easiest (also this is how the DistributedMapCacheClient works).

The next two sections perform the protocol version negotiation as described above. Then for each key we write the string "get" followed by the key name as bytes. That is the entirety of the "get" operation :)

The server responds with the length of the key's value (in bytes). We read in the length as an integer, then read in a byte array containing the value. For my case I know the keys are strings, I simply create a String from the bytes and print out the key value pair. To my knowledge the only kinds of serialized values used by the NiFi DistributedMapCacheClient are a byte array, String, and CacheValue (used exclusively by the DetectDuplicate processor).

Once we're done reading key/value pairs, we write and flush the string "close" to tell the server our transaction is complete. I did not expressly close the socket connection, that is done by withStreams() which closes the streams when the closure is finished.

That's all there is to it! This might not be a very common use case, but it was fun to learn about some of the intra-NiFi protocols, and being able to get some information out of the system using different methods :)

Cheers!


4 comments:

  1. hi,
    thanks for this post, it's interesting and i would like to know how it is possible for using javascript instead of groovy, thank you

    ReplyDelete
  2. Hi there. I have an urgent issue that needs to be solved . I have a process group that splits data into batch and batch into records. I have a wait processor which needs to wait for each batch to be completed and than notify other service. I have a notify processor that keep tracks of records processed and notifies wait processor. both have ReleaseSignalIdentfier set to fragment.identifier. but my wait processore is failing some batches to terminate state that is expired state. I cannt understand why because none of the process is taking longer than few seconds to processdata. and my wait time is 5 mins.

    ReplyDelete
  3. I am facing an issue in distributed fetch cache. i.e, the fetchdistributed cache doesn't taking a value from putDistributedcache which has cached attribute value is true..It's always going in not found flow. Actually I was working earlier. I don't know what's the problem is. Could anyone please help me to sort out this. Its very urgent.

    ReplyDelete
  4. I am little improved your work: https://hub.docker.com/r/uudecode/nifi-distributed-map-cache-server-browser

    ReplyDelete