Friday, January 6, 2017

Inspecting the NAR classloading hierarchy

I've noticed on the NiFi mailing lists and in various places that users sometimes attempt to modify their NiFi installations by adding JARs to the lib/ folder, adding various custom and/or external NARs that don't come with the NiFi distribution, etc.  This can sometimes lead to issues with classloading, which is often difficult for a user to debug. If the same changes are not made across a NiFi cluster, more trouble can ensue.

For this reason, it might be helpful to understand the way NARs are loaded in NiFi. When a NAR is loaded by NiFi, a NarClassLoader is created for it. A NarClassLoader is a URLClassLoader that contains all the JAR dependencies needed by that NAR, such as third-party libraries, NiFi utilities, etc.  If the NAR definition includes a parent NAR, then the NarClassLoader's parent is the NarClassLoader for that parent NAR.  This gives all NARs with the same parent access to the same classes, which alleviates certain classloading issues when NARs and utilities need to talk to each other. One pervasive example is the specification of an "API NAR" such as "nifi-standard-services-api-nar", which lets the child NARs use the same API classes/interfaces.
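The parent relationship can be sketched with plain URLClassLoaders (this is an illustrative stand-in, not NiFi's actual NarClassLoader code): two sibling loaders delegating to the same parent resolve a shared class to the exact same Class object, which is what lets sibling NARs exchange instances of shared API types.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class SharedParentDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for an "API NAR" classloader; in NiFi the URL array
        // would hold the NAR's bundled JARs.
        URLClassLoader apiNar = new URLClassLoader(new URL[0],
                ClassLoader.getSystemClassLoader());

        // Two sibling "child NAR" loaders that both delegate to the API NAR.
        URLClassLoader childA = new URLClassLoader(new URL[0], apiNar);
        URLClassLoader childB = new URLClassLoader(new URL[0], apiNar);

        // Parent-first delegation means both children resolve a shared class
        // through the same parent chain, yielding the identical Class object.
        Class<?> a = childA.loadClass("java.util.ArrayList");
        Class<?> b = childB.loadClass("java.util.ArrayList");
        System.out.println(a == b); // true

        childA.close();
        childB.close();
        apiNar.close();
    }
}
```

If the two children had unrelated parents instead, each would get its own copy of the class, and passing objects between them would fail with ClassCastException even though the class names match.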

All NARs (and all child ClassLoaders in Java) have the following class loaders in their parent chain (listed from top to bottom):
  1. Bootstrap class loader
  2. Extensions class loader
  3. System class loader

You can consult the Wikipedia page on the Java ClassLoader for more information on these class loaders, but in the NiFi context just know that the System class loader (also known as the Application ClassLoader) includes all the JARs from the lib/ folder (but not the lib/bootstrap folder) under the NiFi distribution directory.
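As a quick illustration (plain Java, nothing NiFi-specific), you can walk any classloader chain yourself by following getParent() until it returns null, which is how the JVM represents the bootstrap class loader:

```java
import java.util.ArrayList;
import java.util.List;

public class ClassLoaderChainDemo {

    // Collect a classloader and all of its ancestors, child first.
    // The bootstrap loader is represented as null, so it never appears.
    static List<ClassLoader> chain(ClassLoader start) {
        List<ClassLoader> result = new ArrayList<>();
        for (ClassLoader cl = start; cl != null; cl = cl.getParent()) {
            result.add(cl);
        }
        return result;
    }

    public static void main(String[] args) {
        // Core classes such as String come from the bootstrap loader,
        // which getClassLoader() reports as null.
        System.out.println(String.class.getClassLoader()); // null

        // An application class starts at the system (application) loader;
        // on Java 8 its parent is the extensions loader.
        for (ClassLoader cl : chain(ClassLoaderChainDemo.class.getClassLoader())) {
            System.out.println(cl);
        }
    }
}
```

(On Java 9 and later the extensions class loader was replaced by the platform class loader, but the walk is the same.)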

To help in debugging classloader issues, either on a standalone node or a cluster, I wrote a simple flow using ExecuteScript with Groovy to send out one flow file per NAR, whose contents list the classloader chain (showing which JARs belong to which URLClassLoader) in the following form:
<classloader_object>
     <path_to_jar_file>
     <path_to_jar_file>
     <path_to_jar_file>
     ...
<classloader_object>
     <path_to_jar_file>
     <path_to_jar_file>
     <path_to_jar_file>
     ...

The classloaders are listed from top to bottom, so the first will always be the extensions classloader (the bootstrap class loader is represented as null and is therefore not printed), followed by the system classloader, etc.  The NarClassLoader for the given NAR will be at the bottom.

The script is as follows:

import java.net.URLClassLoader
import org.apache.nifi.nar.NarClassLoaders
import org.apache.nifi.processor.io.OutputStreamCallback

NarClassLoaders.instance.extensionClassLoaders.each { c ->

    def chain = []
    while (c) {
        chain << c
        c = c.parent
    }

    def flowFile = session.create()
    flowFile = session.write(flowFile, { outputStream ->
        chain.reverseEach { cl ->
            outputStream.write("${cl.toString()}\n".bytes)
            if (cl instanceof URLClassLoader) {
                cl.getURLs().each {
                    outputStream.write("\t${it.toString()}\n".bytes)
                }
            }
        }
    } as OutputStreamCallback)
    session.transfer(flowFile, REL_SUCCESS)
}

The script iterates over all the "Extension Class Loaders" (aka the classloader for each NAR), builds a chain of classloaders starting with the child and adding all the parents, then iterates the list in reverse, printing the classloader object name followed by a tab-indented list of any URLs (JARs, for example) included in the classloader.

This can be used in a NiFi flow, perhaps using LogAttribute or PutFile to display the results of each NAR's classloader hierarchy.

Note that these are the classloaders that correspond to a NAR, not the classloaders that belong to instances of processors packaged in the NAR.  Runtime information about the classloader chain associated with a processor instance is something I will tackle in another blog post :)

Please let me know if you find this useful. As always, suggestions, questions, and improvements are welcome.  Cheers!

6 comments:

  1. Matt - appreciate this write up. I'm very new to NiFi & Groovy and other folks I work with do the real work - I'm more of an admin. We're having issues in NiFi that I think come down to "cache bloat". Succinctly, we graph in Grafana the NiFi API call /nifi-api/flow/status, with the count of flow files queued and bytes. We are constantly fighting an ever-increasing backlog (queue) - we have plenty of resources (controller configured to 24 - 32 threads for Timer/Event driven based on host config). We make extensive use of groovy files and scripts (strings). Mostly files, which I understand are cached by GroovyShell.

    Here's what I do - I jstack {nifipid} and look at the Timer-Driven threads - invariably I see half to nearly all of the configured threads (20, 24 or 32) waiting to lock the NarClassLoader. One of them is runnable, the rest are waiting on monitor entry. Java 1.8.0_112, CentOS 6.9.

    at java.lang.ClassLoader.loadClass()
    - waiting to lock <0x...> (org.apache.nifi.nar.NarClassLoader)
    - locked 0x... (GroovyClassLoader)
    at groovy.lang.GroovyClassLoader.loadClass:677
    huge pile of compile/visitor/transform/resolve/findClass here
    - locked 0x... (HashMap)
    at groovy.lang.GroovyShell.parseClass():694
    at groovy.lang.GroovyShell.parse()
    at groovy.text.TemplateEngine.createTemplate()
    at InsertJsonRecords.insertMultiple()

    I was blaming NiFi, but looking at the Groovy class loader, I see:

    protected Class getClassCacheEntry(String name) {
        if (name == null) return null;
        synchronized (classCache) {
            return classCache.get(name);
        }
    }

    and I wonder, based on (what little I understand): if we're not doing our classloading right (this NAR inheritance model), are we simply, over time, bloating this HashMap cache with unique copies of each of the scripts we run? We do HEAVY ETL, millions of records a day, extensive use of Groovy in complex pipelines; we even fork/exec groovy scripts and bash scripts (kind of a test system). For example, our LogWriter that is included in nearly every step on dozens of pipelines is a .groovy script. We log A TON.

    Is there any advice you can give to help me understand what's happening? I'm intrigued by this post because (I think!) what you're saying is we can try this code and view our classloading chain.

    TIA

    ReplyDelete
  2. Meant to add - when this backlog is growing and clearly the rate of work being done is down to a trickle, we simply restart NiFi and the backlog is processed very quickly - sometimes 60G, 55k files will process in the newly started NiFi in < 15m. So, it's not a code/resource issue up/downstream of NiFi (like CPU or a database sink). I think it's contention in the JVM with classloading due to our use of groovy files. We're trying NiFi 1.2 over the next few weeks, but we didn't see anything about this directly mentioned.

    ReplyDelete
    Replies
    1. It looks like you have a custom processor (InsertJsonRecords)? I'm not sure what your scripts are doing, but if they need to load classes often or in large numbers, NiFi uses the default strategy of parent-first classloading, so if/when your script loads a class, it will try all the parents first, which is pretty inefficient. If you have control over how your classes will be loaded (and can guarantee that you won't load classes that are already in the parent classloading chain), then you could create a self-first (aka child-first) classloader to load all of your script-specific classes. This should increase performance as it will not look up to the NarClassLoader for classes that are only available and germane to your script(s).
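      To make the self-first idea concrete, here is a minimal sketch of a child-first URLClassLoader (a common pattern, not NiFi code): it tries its own URLs before delegating, and falls back to normal parent-first delegation when it can't find the class itself.

      ```java
      import java.net.URL;
      import java.net.URLClassLoader;

      public class ChildFirstClassLoader extends URLClassLoader {

          public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
              super(urls, parent);
          }

          @Override
          protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
              synchronized (getClassLoadingLock(name)) {
                  // Already defined by this loader?
                  Class<?> c = findLoadedClass(name);
                  if (c == null) {
                      try {
                          // Self-first: look in our own URLs before asking the parent.
                          c = findClass(name);
                      } catch (ClassNotFoundException e) {
                          // Not ours -- fall back to normal parent-first delegation.
                          c = super.loadClass(name, false);
                      }
                  }
                  if (resolve) {
                      resolveClass(c);
                  }
                  return c;
              }
          }
      }
      ```

      Note that JDK classes (and anything else not on this loader's own URLs) still resolve through the parent chain, since findClass() fails for them; the self-first behavior only applies to classes you actually bundle with the loader.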

      Here is a list of other options (if possible in your environment):
      1) Compile the Groovy scripts into JAR(s) and execute them from your custom processor or wherever.
      2) If you must call out to GroovyShell, consider moving the Groovy stuff off-JVM, perhaps using Jexler (https://www.jexler.net/) or Groovyserv (https://kobo.github.io/groovyserv/) or something
      3) Try InvokeScriptedProcessor using your Groovy scripts. I'm not sure how the Invocable interface is implemented under the hood, but since the script is only eval()'ed once, I would hope it is significantly faster (see my latest post: http://funnifi.blogspot.com/2017/06/invokescriptedprocessor-template-faster.html)

      Delete
  3. Wanted to circle back and thank you. After much ado, we've discovered that:
    1. using a template for a SQL statement always called:
    groovy.text.TemplateEngine.createTemplate()
    which _somehow_ (probably, possibly) was saved/cached internally to either GroovyShell or the JVM class loader, and over time it kept taking longer to search the cache. We use Influx/Grafana to store metrics from NiFi (like live threads, flow/bytes queued, etc.) and can see the queue slowly backing up.
    2. We bounce NiFi and the queue nearly immediately drains to 0. Same code, processors, pipelines. Just an empty cache (my theory).
    3. We've installed 1.2 and are moving away from using the template engine in groovy, or even using custom groovy processors and have NOT experienced this problem again.

    FWIW, we did look at your other options. I really would like to thank you for replying and hope this bottleneck due to our nuanced workflow is behind us.

    ReplyDelete
  4. A FlowFile is a message or event data or user data, which is pushed or created in NiFi. A FlowFile has mainly two things attached to it: its content (the actual payload: a stream of bytes) and attributes. Attributes are key/value pairs attached to the content (you can say metadata for the content).

    ReplyDelete