Huge DNS traffic from ArcSight
We are having an issue where firewall CPU utilization is running high.
On log analysis we found that ArcSight-related devices (ESM, Logger and Connector servers) are sending a huge volume of DNS requests (UDP 53) to the Domain Controller.
Any idea what the possible cause could be?
Thanks in advance.
Connectors are designed to perform DNS lookups on events and maintain an internal DNS cache. Depending on the number of events and the number of different hosts coming in, it is possible to have excess traffic related to DNS lookups. If it is acceptable, you can disable Name Resolution at the connector level.
The DNS traffic you are seeing is related to certain ArcSight components attempting lookups and reverse DNS resolution.
For example: if you have a connector collecting logs from your firewall, the SmartConnector will attempt to resolve the IPs it sees in logs to host names. As the large majority of firewalls log IPs rather than host names, this could be a big contributor.
This setting is enabled by default but can be changed if need be.
I would highly recommend you do an analysis of what actual device is contributing to the DNS load (most likely a SmartConnector host server).
Once identified you could edit the DNS settings on the SmartConnector appropriately.
Edit SmartConnector>Default Tab>Content Tab>Network Section
You will see settings such as "Enable Name Resolution", "Don't Resolve Host Names Matching", etc
Editing these settings can help cut back on the DNS traffic generated by the connectors. Be aware that changing some settings could affect your ability to see hostnames in ESM, which may hamper log analysis.
This is 1) a great observation and 2) a longer-term, interesting administrative topic (issue?).
It is often interesting to me that such realizations are driven by firewall performance rather than by DNS hosting, e.g. the Domain Controller. In the larger sense (only, of course, if you consider this an infrastructure topic), it is worth quantifying and understanding why DNS has commonly been recognized to impact intermediate systems such as firewalls. So, a great observation.
Connector resolution is driven by Destination, and Connectors will have one or more Destinations. If you are oriented towards using the agent setup, e.g. on unix-like systems executing './runagentsetup -I console' in the Connector /bin path, this is a per-Destination configuration maintenance option, specifically option #5 and the 2nd selection, where the options are true/false (0/1). The more 'confident' administrator may instead edit the XML associated with the agent in the Connector agent.properties: per desired Destination, find the HostNameRes property and set it true/false. When using agentsetup the results are applied during that execution, whereas for (more confident) direct XML editing the Connector will need to be stopped and restarted.
Again, thinking about resolution, one must decide the relative benefit of each Destination receiving a possible event enhancement where IP and hostname are both populated. As an example, suppose your Connector feeds one or more ESMs, one or more Loggers, and also acts as a pump across some network connection (UDP/TCP) sending CEF to a 'third party' (think BigData solution). You may decide that in your ESM(s) and Logger(s) you are interested in DNS enhancement, whereas for that 'third party' (BigData) you will not worry about such enhancement and will defer that action to the downstream consumer. In this example you would enable DNS for the ESM(s) and Logger(s) and disable it for the 'third party'.
You may also wish to adjust the nameResolutionTTL, the 'hold time' for which the Connector caches a translation (should one exist) and uses its value before refreshing the last retrieved value. Consider whether a longer TTL is more appropriate for you, e.g. multi-hour, daily, etc.
Also note that a failure to resolve produces no cache entry, so subsequent events processed with translation enabled (true) will continue to trigger lookups: resolution checks the cache first, and only if an entry is found are its values used instead of a new external resolution.
I find this last behavior terrible: if resolution has failed, it can lead to more resolution attempts than necessary, which increases overall traffic volume. I have written an Engineering change request for a resolution-failure TTL that holds down (caches) the failed result, similar to a success, in order to stop resolution failures from driving increased resolution attempt volume. You may monitor your Connector agent.log for name resolution failures.
Lastly, in your Connector agent path, ../user/agent/, you will notice a hosts.txt file containing the results of successful resolutions. It is periodically dumped by the Connector and is read in during Connector startup to populate the cache with its last known state.
Ok, so I probably gave much more information than you may wish, but alternatively you may enjoy the details. I believe all of the above is 'accurate' and encourage your continued discovery and research, and, if you feel it appropriate, augmentation of the collective knowledgebase.
PS. Apologies for syntax/spelling - I haven't done any proofreading.....
To add: we usually run a local caching nameserver on connector hosts, which does negative DNS caching. dnsmasq can do this.
# Negative replies from upstream servers normally contain time-to-live
# information in SOA records which dnsmasq uses for caching. If the
# replies from upstream servers omit this information, dnsmasq does not
# cache the reply. This option gives a default value for time-to-live
# (in seconds) which dnsmasq uses to cache negative replies even in the
# absence of an SOA record.
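A minimal dnsmasq configuration along these lines might look like the following sketch; the upstream Domain Controller address and the cache sizes are placeholders, not recommendations:

```
# /etc/dnsmasq.conf -- minimal local caching resolver for a connector host
# Forward queries to the Domain Controller (example address):
server=10.0.0.10
# Serve only this connector host:
listen-address=127.0.0.1
# Size of the positive cache:
cache-size=10000
# Cache negative replies for 1 hour when the upstream SOA omits a TTL:
neg-ttl=3600
```

The connector host's resolver (or the connector itself, via the property mentioned elsewhere in this thread) would then be pointed at 127.0.0.1.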
If all your ArcSight components are in the same zone, I would highly suggest installing a secondary or read-only DNS server in that zone and just doing zone transfers. Each incoming event can generate many DNS lookups, and since they are UDP this puts very high stress on your firewall.
A couple of good comments on improving resolution performance, thank you.
If you go that route then you may find it helpful to add the property 'name.resolver.dns.server=aaa.bbb.ccc.ddd' (where a.b... is some actual IPv4 address) to your per-Connector .../current/user/agent/agent.properties file and make use of that local resolution resource.
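As a sketch, the entry might look like this (the address is a placeholder for your local caching resolver; note that .properties files do not allow comments after a value on the same line):

```
# .../current/user/agent/agent.properties
# Point the resolver at the local caching nameserver (example address):
name.resolver.dns.server=127.0.0.1
```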
For those referring to local resolution: it would be wonderful if the knowledgebase were augmented with at least a minimal example of the install/configuration details, preferably from a known working setup, as a way to short-circuit and accelerate adoption.
The reference to '...in the same zone...' is a valuable comment in my estimation. The subtle point I would make is that when your environment has overlapping address ranges and you have begun to implement a NetworkModel, once you update the per-Connector customerUri information you realize that the Connector must be per-customerUri, otherwise the tagging may misrepresent the true event source. One approach is to address this 'issue' with Connectors dedicated per Uri, tag each Uri appropriately, and point your local resolution to an appropriate host (property above).
You have at the very least started reading and researching the NetworkModel by now, haven't you? The topic can become complex, and maintenance is added workload, but including it in your workflow, even if only as something for the future, puts you on a larger and more stable footing with regard to accurate attribution of the portion of your event flow that is 'protected'. The 'protected' reference relates to the NetworkModel: addresses tagged 'protected' indicate 'you', i.e. the IPs you administer, rather than 'them' in the remainder of the world wide wasteland. The guides and some Protect724 user conference materials will provide insight should you choose to pursue this. I certainly don't want to hijack the base topic, just to amplify that everything is connected and you do need to figure out how far down the configuration rabbit hole you go.
I must be going....
It is normal connector behavior to collect this information.
The ways to reduce the DNS traffic are:
a) Increase the hosts.txt cache by adding/changing the following entry in agent.properties:
name.resolver.ttl= (default 3600000, value in milliseconds)
These entries can also be managed from ESM in the Connector Parameters.
b) add host entries to /etc/hosts or c:\windows\system32\drivers\etc\hosts
c) completely disable DNS resolution
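As a sketch, the agent.properties tuning from option (a) might look like this. The values here are illustrative, not recommendations; the defaults quoted in this thread are 3600000 ms for the TTL and 50000 entries for the cache. Comments go on their own lines, since .properties files treat everything after the '=' as the value:

```
# .../current/user/agent/agent.properties
# Hold resolutions for 4 hours instead of the default 1 (milliseconds):
name.resolver.ttl=14400000
# Double the default 50000-entry cache:
name.resolver.cachesize=100000
```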
Additionally, for an explanation/source of the DNS resolution on the SmartConnectors:
"- The name resolver component (AgentNameResolver) looks up IP addresses for host names and vice versa. The name resolution can be enabled, disabled or set to Source/Dest Only. If it is set to No, then nothing more is done with events. If the Source/Dest Only choice is made, the device address and name are not resolved, which is especially useful in environments where frequent IP address changes would otherwise cause the device-side table to grow too large.
Next, name resolution or reverse resolution is done for the source, the destination, and (unless the Source/Dest Only mode is set) the device. The order is:
1. The host name is looked up in the cache.
2. If the host name was not found, it is looked up in the negative cache, if one is configured:
- If it is found in the negative cache and the entry is older than the TTL, it is removed from the negative cache (in this case the algorithm continues as if there were no negative cache).
- If it is found in the negative cache and is younger than the TTL, then the corresponding IP address is not set (in this case the algorithm completes).
3. If it is found, decide if it is stale (older than the TTL if Wait For Resolution is set, or twice that otherwise).
4. If the host name was not found or was stale, and Wait For Resolution is enabled, then look it up and (if successful) update the cache.
5. If the host name was not found or was stale, and Wait For Resolution is disabled, then add the host name to a queue of names that need to be resolved by a separate set of threads.
6. If either the cache or the Wait For Resolution lookup was successful, set the corresponding IP address event field(s).
Please note the following:
If Wait For Resolution is disabled, name resolution will never affect the event that triggered the lookup. This means that even in the best case, the first event with a given host name will not have the IP address event field set, regardless of how fast the resolution occurs. Note also that name resolution is not done if the host name is a valid IP address, and also that if the host name matches the regular expression specified in the Don't Resolve Host Names Matching parameter then it is not added to the queue of names to be resolved.
Reverse resolution is done if the IP address event field(s) (IPv4 and/or IPv6, depending on the configuration) is/are set and the host name event field is not set. Similarly to name resolution, this starts with a cache lookup, and if the IP address(es) is/are already known and the cache entry is not stale (defined the same as above), or a Wait For Resolution lookup is successful, then the host name event field is set (as well as the corresponding DNS domain event field if the Name Resolution Host Name Only parameter is set to No).
By default there are queues of up to 100,000 host names that need to be resolved, and up to 100,000 IP addresses that need to be reverse resolved. If either or both of these queues are full, nothing more will be added.
The name resolution uses the options made available by the platform OS itself – the connector doesn't attempt name resolution 'on its own'.
Name resolution will only be attempted at all if one or the other of Hostname or IP is missing from the parsed event. Obviously, if both are missing, nothing can be done. If both are already populated then nothing will be attempted either.
Given the nature of DNS, it is common to find that events are processed before a valid DNS response is received for any embedded device Addresses/Hostname and, by default, the connector is configured not to wait for name-resolution when it processes events.
Once any given IP/hostname is resolved (through the underlying DNS mechanism), any subsequent events needing that IP/hostname we expect to be fully resolved during event processing and populated as such (and therefore recorded in any future status commands). Of course, DNS responses are cached according to a TTL, so it is also possible that a previously resolved IP/hostname could once again become unavailable later on – until re-resolved. "
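The resolution order quoted above can be sketched in code. This is a minimal illustration of the cache / negative-cache / Wait For Resolution logic, not ArcSight's implementation; all class and parameter names here are mine:

```python
import time

class NameResolver:
    """Sketch of the documented lookup order: negative cache, positive cache
    with staleness check, then synchronous or queued resolution."""

    def __init__(self, ttl_s=3600, wait_for_resolution=False):
        self.ttl = ttl_s
        self.wait = wait_for_resolution
        self.cache = {}           # host name -> (ip, inserted_at)
        self.negative_cache = {}  # host name -> failure_time
        self.pending = []         # queue drained by separate resolver threads

    def resolve(self, hostname, now=None, do_lookup=None):
        now = time.time() if now is None else now
        # Step 2: negative cache first (when one is configured).
        if hostname in self.negative_cache:
            if now - self.negative_cache[hostname] < self.ttl:
                return None                       # still held down: give up
            del self.negative_cache[hostname]     # expired: fall through and retry
        # Steps 1 and 3: positive cache, with staleness check.
        if hostname in self.cache:
            ip, inserted = self.cache[hostname]
            max_age = self.ttl if self.wait else 2 * self.ttl
            if now - inserted < max_age:
                return ip                         # fresh hit: step 6
        # Steps 4/5: miss or stale entry.
        if self.wait and do_lookup is not None:
            ip = do_lookup(hostname)              # synchronous DNS lookup
            if ip is not None:
                self.cache[hostname] = (ip, now)
                return ip
            self.negative_cache[hostname] = now   # hold the failure down
            return None
        self.pending.append(hostname)             # async: event ships unresolved
        return None
```

Note how, with Wait For Resolution disabled, the first event for a given host name ships without the resolved field set, matching the behavior described in the quote.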
I hope this clears up your problem.
I saw this myself at my last employer but I don't have the actual files handy because I didn't take them with me when I left. I know someone who is still there and I will have him post what we did to resolve this. Give me a couple of days.
A few additional items to note regarding the hosts.txt file:
- The default name.resolver.cachesize is 50000 entries
I apologize if someone has already covered any of these. One item that has never really been clear to me is what actually happens when an ip/host entry has changed and the cache is completely full (i.e. all 50000 entries are used). My assumption is that it simply updates (changes) the existing hostname/IP entry to reflect the new hostname/IP, but I've never been given a solid "yes, that's how it works" or "no".
Couple of great points.
Bullet 1 - this category of 'tuning' has come up several times herein. To know how many entries you have, i.e. to drive tuning, stop the Connector to flush the hosts.txt file and count its lines, e.g. on unix-like systems with 'wc -l', and adjust if necessary.
Bullet 3 - yup, that is my observation. There are probably a number of reasons for periodically restarting services, including Connectors, and this would be one, if only a minor reason. Relates to Bullet 1, now doesn't it.....
Bullet last-four (approximately) - yup, basically we have some reason to do resolution, so we might wish to think about all the things that have been exposed in this seemingly simple thread. If your 'cache' is growing large you may wish to 'prune' it. Unfortunately, (by now) you have noticed that the timestamps of the hosts.txt entries are all the same - that of the last orderly update - so you don't really have any clue about which hosts haven't been seen recently in the event stream vs. those that are new. The note about flooding on restart is a valid observation and is what happens based on TTL.
On the comment about deleting: I actually like to rename out of the way, e.g. 'mv hosts.txt hosts.txt.saved', rather than delete, but such behavior is individual style and no one style is more right than another, so follow your own. The point here is that over time it is highly likely that the event stream represents a different set of hosts than it did a 'long' time ago, so, given how this thread has evolved, maybe it is desirable to 'clear the cache' periodically, i.e. rename/delete it (while the Connector is down). I am also taking a different approach here than the opined 'update the server hosts file', because this discussion, behavior and accumulated resolution cache are per-Connector and unrelated to the server. I also think this somewhat reinforces deploying Connectors 'near' (within the resolution domains of) the event sources, e.g. your company BUs (treating them in an MSSP-like way), and steering your resolution host via the agent.properties property previously noted.
Clearing Entries - some of this is above. By now it should also be relatively clear that one could actually 'seed' the hosts.txt with entries. An example: there are known resolution failures (see a previous note in this thread) driving continuous resolution attempts that fail in a vicious cycle, and my JIRA to 'hold down' failures in their own cache continues to go unimplemented, but you do have a way to identify those hosts that are internal but not resolving. Create a separate hosts file matching the syntax of the Connector-maintained hosts.txt file and append it to that file while the Connector is shut down and, voilà, those lookup failures are gone.
For the last paragraph: I seem to remember that, during execution, with a full cache and new entries available, the mechanism is to randomly delete an existing entry, primarily because an entry has no 'age' other than its TTL, which by default is 3600s (3600000ms). I think I got this out of a very old tech memo many years ago; I don't have access to the systems where it was stored way back then, so I'll try to find time to scour about for any collateral I might have lying around (but let's not hold our breath).
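The recalled full-cache behavior can be sketched as a bounded map that evicts an arbitrary entry when there is no age to rank by. This is an illustration of the behavior as remembered above, not the connector's actual internals; the class name and sizes are mine (defaults per this thread: 50000 entries):

```python
import random

class BoundedHostCache:
    """Sketch: fixed-capacity ip -> hostname map with random eviction,
    since entries carry no per-entry age beyond the global TTL."""

    def __init__(self, max_entries=50000):
        self.max_entries = max_entries
        self.entries = {}  # ip -> hostname

    def put(self, ip, hostname):
        if ip in self.entries:
            self.entries[ip] = hostname           # known IP: just update the mapping
            return
        if len(self.entries) >= self.max_entries:
            # No age to rank by, so evict an arbitrary existing entry.
            victim = random.choice(list(self.entries))
            del self.entries[victim]
        self.entries[ip] = hostname
```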
A seemingly simple topic, as so often happens within the 'SIEM space', actually has significant depth and breadth.
Caching of unresolvable entries can be enabled in the connector with the configuration option:
Remove Unresolvable Names/IPs from Cache = Yes (with negative cache)
With this parameter you get an additional cache file of unresolvable names/IPs, and the DNS server will not be asked over and over.
My 2 cents regarding DNS floods from the connector:
- increase the name resolution TTL
- check if the cache is full and increase the number of cache entries
- also cache the unresolvable DNS queries