Fine-tuning Network Automation

NA 2022.11 on a Linux server

The Linux server has around 24 GB of RAM. We have onboarded around 600 devices, and our target is 1500 devices.

We restarted the NA server last Friday, and now we see only 5 GB free when we run free -g.

Moreover, the NA application also reports low memory, as seen in the output below:

Used Memory (Total-Free): 11640 MB
Free Memory: 327 MB
Total Memory: 11968 MB
Maximum Memory: 11968 MB
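A quick sanity check on those figures (just arithmetic on the values reported above) shows how little heap headroom is left:

```shell
# Free JVM heap as a percentage of total, using the figures from the NA output above
free_mb=327
total_mb=11968
awk -v f="$free_mb" -v t="$total_mb" 'BEGIN { printf "%.1f%% free\n", f / t * 100 }'
# prints: 2.7% free
```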

Could you advise whether there are any fine-tuning parameters we can apply, similar to NNM's JVM heap memory allocation and garbage collection settings?

  • 0  

    Hi Ramesh,

So, you mention your NA Core has 24 GB of RAM. Did you or someone else do the install?  I ask because there are some published performance-tuning steps, and I'm curious what might have been done already...

    Performance tuning - Network Automation (microfocus.com)

    Is your NA instance single core or multiple cores?  

When this happened, what tasks were running?  Anything stuck (running long)?  Do you have any custom tasks (change plans or diagnostics)?  Has this happened more than just this one time?  For example, every Monday night you see this happen on Core 3?

    Has anyone changed the default task values?  Max Tasks / Max Concurrent Tasks?  Do you have an external DB or is it the embedded one?  

    Also, are the ~600 devices "typical" devices (switches, routers, load balancers, firewalls) or do you have anything that might be more complex (ACI / APIC devices)?  

Have you looked at the appserver_wrapper.log file?  There may be some useful information that'll point you to a problem.  Perhaps old driver(s) or something else, but it's quite possible you can find the beginning of this bad behavior.

    Lastly, and this is just from my history:

1) It's tempting to think that if some memory is good, tons of memory is better, and to throw almost all your memory at the JVM - not really a good idea.  Same with increasing tasks.

2) Like life, there is a balance here.  You can increase your task numbers, but if you do, you need to make sure the JVM is sized to handle it, and that you have enough DB connections too.

    3) Small steps and use caution.  Make changes slowly, document what you had and are changing and then test carefully.  You always want to be able to get to prior steps.

    Good luck!

    -Chris

  • 0 in reply to   

    Chris

Yes, the NA server has 24 GB; only NA and the Operations Agent are running on it.

The NA instance is running as a single core.

There were no tasks running when the issue happened, there are no stuck tasks, we do not have custom tasks, and this has happened twice.

Max tasks has been changed from 20 to 30; other than that, no changes have been made. We have an external DB.

We have onboarded only Cisco switches so far.

I also checked the file /opt/NA/server/ext/wrapper/conf/appserver_wrapper.conf, where we can configure the initial and maximum JVM memory, but I am not seeing any option for garbage collection like in NNM.
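If that file follows the common Java Service Wrapper conventions, extra JVM flags (including GC options) can usually be passed via numbered wrapper.java.additional properties. A sketch with illustrative values only - the heap property names are standard wrapper conventions, and the GC/logging flags are assumptions, not NA-documented settings:

```properties
# Initial and maximum JVM heap in MB (standard Java Service Wrapper
# properties - verify the exact names against your appserver_wrapper.conf)
wrapper.java.initmemory=4096
wrapper.java.maxmemory=12288

# Extra JVM flags go through numbered "additional" properties; pick indexes
# that don't clash with existing entries in the file. Illustrative examples:
wrapper.java.additional.10=-XX:+UseG1GC
wrapper.java.additional.11=-Xlog:gc*:file=/opt/NA/server/log/gc.log
```

Note that -Xlog:gc* applies to Java 9 and later JVMs; older Java 8 JVMs use -verbose:gc / -Xloggc: instead, so check which JVM your NA version bundles before adding either.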

  • 0 in reply to 

    Chris

    Currently we have only 3 tasks running.

Resolve FQDN / Diagnostics / Snapshot

  • 0   in reply to 

    Are your drivers current?  

    OK, how about this - the last time this happened, can you go back and see what task(s) was / were running?  Anything out of the ordinary?

    If you go to /opt/NA/server/log and do this:

     grep -i gc appserver_wrapper.log*

    For example, should look similar to this:

    appserver_wrapper.log:timestamp goes here INFO [stdout] {system/scheduler} [SubTaskExecutorThread] 75 PausableThreadPoolExecutor: Memory low, explict request for GC. -

What do you get back?  Look at ls -ltr /opt/NA/server/log/appserver_wrapper.log* - what's the oldest date?
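To pin down when the message first appeared, something like this works across the rotated logs (the sample log lines below are made up just to show the idea - run the real grep in /opt/NA/server/log):

```shell
# Sketch: find the earliest "Memory low" line across the rotated logs.
# Sample files stand in for /opt/NA/server/log/appserver_wrapper.log*.
dir=$(mktemp -d)
printf '2024-01-09 03:00:02 INFO Memory low, explict request for GC.\n' > "$dir/appserver_wrapper.log"
printf '2024-01-08 02:14:55 INFO Memory low, explict request for GC.\n' > "$dir/appserver_wrapper.log.1"

# The lines sort chronologically because they start with a timestamp:
grep -h "Memory low" "$dir"/appserver_wrapper.log* | sort | head -n 1
# prints the 2024-01-08 line, i.e. the earliest occurrence
```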

You say you have three tasks running - do you mean three tasks are scheduled, three were running when the problem happened, or three just happened to be running when you replied?

As for the low-memory issue, normally there's some reason memory gets run down.  So you may want to look at when this condition happened, see what was running (what task(s) on what device(s)), and go from there.  There should be some detail within appserver_wrapper.log to get you started.

  • 0 in reply to   

    Chris

Yes, we are seeing this line often in appserver_wrapper.log:

     [stdout] {system/scheduler} [SubTaskExecutorThread] 75 PausableThreadPoolExecutor: Memory low, explict request for GC. -

  • 0   in reply to 

    OK, two options and both might be good.

    1) Look at the log and try to see when this message first started showing up.  You can check and see (task report) what task / tasks were running at that time.  

    2) Again, looking at the log, you should see taskIDs before / during these messages - you can take a look at those IDs and again, see what was running.  
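As a sketch of option 2 - the log format below is invented for illustration, but grep's -B option pulls the lines just before each GC message so you can spot the task IDs:

```shell
# Sketch: show the lines preceding each GC warning, then pull anything that
# looks like a task ID. The log lines below are fabricated examples.
dir=$(mktemp -d)
cat > "$dir/appserver_wrapper.log" <<'EOF'
2024-01-09 02:59:50 INFO Task 12345 (Snapshot) started on device sw-edge-01
2024-01-09 03:00:02 INFO Memory low, explict request for GC.
2024-01-09 03:05:10 INFO Task 12346 (Diagnostics) started on device sw-edge-02
2024-01-09 03:05:30 INFO Memory low, explict request for GC.
EOF

# Two lines of context before each GC message, then extract "Task <id>":
grep -B 2 "Memory low" "$dir/appserver_wrapper.log" | grep -o 'Task [0-9]*'
# prints:
# Task 12345
# Task 12346
```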

But at this point, you can hopefully determine what task was running when this happens.  If I recall, you've said this GC error has happened multiple times, so repeat the same task search for each occurrence; whether it turns out to be different tasks or the same task each time, that might provide the detail needed to resolve things.

But once you have the task information, it might be worth running that task separately with some additional logging to see if the problem continues.

Another suggestion, though not necessarily easy: there is a script you can get from support that will let you do a thread dump.  If you can recreate the problem and run the thread dump while it is happening, you can get more detail.  If this was at one time a JBoss system (2020.02, I think, was when they switched), you might have the "old" script - just make sure to swap in the current one.

    Good luck.

  • 0 in reply to   

    Chris

I checked when the issue was first reported; the Snapshot task was running on all the devices.

On most devices the Snapshot task succeeded, on some devices the task is failing, and on some the status shows as skipped.

  • 0   in reply to 

OK, so I'd look at the ones showing as skipped - why are they skipped?  You can re-run the task, turn on session logging, and watch how the task runs.  Check whether it was skipped because a similar task ran recently - if so, try to determine why multiple tasks are running back to back.  It may be nothing, but it's something to look into.

    Do these device tasks always get skipped?  

    But that would be my suggested next step.  

You have a single core, correct?  If so, another option would be to use the thread dump script (from support) and kick it off, say, 5 minutes before the parent task that seems to be running when the problem happens.  That way, you can configure the thread dump script to run longer than the parent task normally does, and you'll get additional details.