jeremy.kelley@h

Streaming ArcSight to Hadoop

This is a discussion of a straightforward implementation for streaming data from a connector into Hadoop (HDFS) using Flume.  Flume is an open source project for handling ingress streams from external sources and sanely placing that data into HDFS.

Before diving into the specifics, below are a few reasons I like this approach:

  • The import is not batch oriented.  I could give quite a few reasons why this alone is a win, maybe in a future blog post.
  • By streaming from the connector, you don't increase load on your ESM instance, Logger instance, or your database.
  • Obvious separation of data taps.  In other words, you can pick and choose what data you'll be putting into HDFS just by controlling which connectors forward.
  • It can scale by running multiple Flume instances if ingestion rates become an issue on the Hadoop side.

Before we get started

For the purposes of this post, I'll be assuming you have a Hadoop instance running and that you can access HDFS at hdfs://master:54310.

Go and download the Apache Flume binary package from the Apache Flume site.  Once downloaded, extract the tarball someplace sane (I put it in $HOME/lib/flume-1.4.0-bin and then symlinked it to $HOME/lib/flume for convenience, your mileage may vary).
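If it helps, the extract-and-symlink step might look roughly like this (a sketch only; the exact tarball and extracted directory names depend on the release you grab):

cd $HOME/lib
# extract the binary tarball downloaded from the Apache Flume site
tar xzf apache-flume-1.4.0-bin.tar.gz
# symlink for convenience so the rest of this post can just refer to $HOME/lib/flume
ln -s apache-flume-1.4.0-bin flume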

Configuring Flume

Now that you've got flume installed, download the attached flume_syslog.conf file available on this page.  Place it in your flume/conf directory.

A note about the flume_syslog.conf file: I have tuned it a bit to handle around 5k EPS, depending on your hardware.  Further performance tuning could definitely be done.

Edit your flume_syslog.conf file and make sure the following lines match your hdfs settings.

syslogagent.sources.Syslog.port = 5140

This line indicates which port to bind to.  You shouldn't need to change it, but if you have something on this port, move flume to a different port by changing this line.

syslogagent.sinks.loggerSink.hdfs.path = hdfs://master:54310/cef-raw/%y-%m-%d/%H

In the line above, you can see hdfs://master:54310.  This portion of the URL should correspond to your HDFS instance.  The portion stating cef-raw/%y-%m-%d/%H says "put all incoming data in directories under the root of cef-raw", and under that root directory the files are organized by YYYY-MM-DD/HOUR.  This makes it really easy to search by time range later on in Hadoop jobs by globbing on the right directory range.  Feel free to modify this to suit your environment.

Everything else in the file should just work (right?!!?).  This file doesn't assume a massive box to run flume on, but if you start seeing errors from flume about resources, post in the comments and we can help get it working.
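For orientation, here's a minimal sketch of what a flume_syslog.conf along these lines might look like. The attached file is the one to actually use; the channel capacities and HDFS roll settings below are illustrative guesses, not the tuned values in the attachment:

# one agent named "syslogagent": a syslog UDP source, a memory channel, and an HDFS sink
syslogagent.sources = Syslog
syslogagent.channels = MemChannel
syslogagent.sinks = loggerSink

# listen for CEF-over-syslog from the connector on UDP 5140
syslogagent.sources.Syslog.type = syslogudp
syslogagent.sources.Syslog.host = 0.0.0.0
syslogagent.sources.Syslog.port = 5140
syslogagent.sources.Syslog.channels = MemChannel

# in-memory channel; sizes here are placeholders, tune them for your EPS
syslogagent.channels.MemChannel.type = memory
syslogagent.channels.MemChannel.capacity = 100000
syslogagent.channels.MemChannel.transactionCapacity = 1000

# write raw text events into hour-bucketed directories under /cef-raw
syslogagent.sinks.loggerSink.type = hdfs
syslogagent.sinks.loggerSink.channel = MemChannel
syslogagent.sinks.loggerSink.hdfs.path = hdfs://master:54310/cef-raw/%y-%m-%d/%H
syslogagent.sinks.loggerSink.hdfs.fileType = DataStream
syslogagent.sinks.loggerSink.hdfs.writeFormat = Text
syslogagent.sinks.loggerSink.hdfs.batchSize = 1000
# use the agent's clock for the %y-%m-%d/%H escapes
syslogagent.sinks.loggerSink.hdfs.useLocalTimeStamp = true
# roll files every 5 minutes rather than by size or event count
syslogagent.sinks.loggerSink.hdfs.rollInterval = 300
syslogagent.sinks.loggerSink.hdfs.rollSize = 0
syslogagent.sinks.loggerSink.hdfs.rollCount = 0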

Starting Flume

Now that you've got flume configured, starting it is simple.  In a shell, from $HOME/lib/flume,  just run the following:

bin/flume-ng agent --conf-file conf/flume_syslog.conf --name syslogagent -Dflume.root.logger=INFO,console

This will start the flume process, and if it's configured properly, you'll have a listening service ready to accept incoming syslog events over UDP.

Connecting the Connector

For simplicity's sake, I'll be tweaking a replay connector, but the setup's the same elsewhere.  Just add a new "CEF Syslog" destination to your connector, choose UDP, and set the host and port that Flume is running on (5140 above).  Make sure the connector is running.

On the flume side you should start seeing log entries related to importing your data.  That's all there is to it.

To check your data import, and assuming you're using the path I setup above, do the following:

hdfs dfs -ls /cef-raw

You should see a listing of time based directories being created.  If everything is working properly, you'll have files created and divided into directories by hour.  These files (with my conf here) are uncompressed raw CEF records.
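For illustration, the listing and a time-range glob might look something like this (the dates here are made up):

hdfs dfs -ls /cef-raw                    # one directory per day (yy-mm-dd)
hdfs dfs -ls /cef-raw/14-03-17           # one subdirectory per hour (00-23)
hdfs dfs -ls '/cef-raw/14-03-17/0[89]'   # glob a two-hour window, e.g. as input to a Hadoop job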

Post comments if you run into issues and I can try to help you.

I'll be posting a follow-up on Monday with a library for working on your newly imported CEF data.

12 Replies
marcellolino1

Re: Streaming ArcSight to Hadoop

hi Jeremy,

I am interested if anyone else is using this. I ran into a few issues with a similar setup. The SmartConnector does not send the syslog header information when forwarding the message, and Flume complains about it for every event: "Event created from Invalid Syslog data". In addition, a large CEF event (>1024 bytes) is truncated on HDFS when using syslogUDP.

I tested your configuration file and it produced the same WARN when receiving Firewall data.

I got around the WARN by using the Flume netcat source, setting the event length to 4096, and using Raw TCP on the connector side. I also disabled the per-event ACK on the Flume side.
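For anyone who wants to try the same workaround, a netcat source section along those lines might look roughly like this (a sketch with assumed names and port, not Marcello's actual config):

# netcat TCP source instead of syslogudp; no syslog header parsing, so no per-event WARN
syslogagent.sources.Netcat.type = netcat
syslogagent.sources.Netcat.bind = 0.0.0.0
syslogagent.sources.Netcat.port = 5140
# allow CEF events longer than the 512-byte default line length
syslogagent.sources.Netcat.max-line-length = 4096
# don't write an "OK" back to the connector for every event
syslogagent.sources.Netcat.ack-every-event = false
syslogagent.sources.Netcat.channels = MemChannel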

Cheers,
Marcello

jeremy.kelley@h

Re: Streaming ArcSight to Hadoop

I haven't seen that warning.  Odd...

Just to make sure, you're on flume 1.4.0 and you're using just plain old "CEF Syslog" (not CEF Encrypted Syslog)?

jeremy.kelley@h

Re: Streaming ArcSight to Hadoop

Marcello,

I was wrong.  I am seeing the WARN on "Invalid Syslog data" now when I just recreated my fwd'ing destination.  I'm not sure what I've changed.  Let me investigate a bit.

-j

marcellolino1

Re: Streaming ArcSight to Hadoop

If you tcpdump the record you can spot the missing header. I asked support yesterday whether there is a way to keep the header by tweaking the agent.properties. The agent default includes the syslog header information, but the header is removed so that the remaining part of the message can be processed by the subagents.
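For anyone wanting to reproduce this, something like the following will dump the payload so the missing header is visible (interface and port are assumptions; adjust for your setup):

tcpdump -A -n -i any udp port 5140   # -A prints the packet payload as ASCII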

Also, have you tried events > 1024 bytes?

Marcello

marcellolino1

Re: Streaming ArcSight to Hadoop

And BTW, I sent you the entire config I am using with NetCat in case you want to try it out.

jeremy.kelley@h

Re: Streaming ArcSight to Hadoop

Have you heard back from support?  Can you email me details of your support tix and maybe I can poke around from inside the fortress of arctitude?

I checked some of my older logs and that WARN was there, I'd just ignored it and had completely forgotten about this.  http://mail-archives.apache.org/mod_mbox/incubator-flume-dev/201206.mbox/%3C1124033401.2434.1339409263531.JavaMail.jiratomcat@issues-vm%3E

marcellolino1

Re: Streaming ArcSight to Hadoop

Nothing really useful from support besides being told to engage professional services. I replied back yesterday and have not heard back yet (#4646330088).

jeremy.kelley@h

Re: Streaming ArcSight to Hadoop

Alright.  I've got some contacts over in connector dev land.  Let me ping them.  No promises, but we'll try.

jeremy.kelley@h

Re: Streaming ArcSight to Hadoop

I decided to move this conversation here from the original post, as I wanted the configuration available to all.

I have found a configuration option in the source to flume-ng that allows you to explicitly set how much gets read per syslog record.

  public static final String CONFIG_READBUF_SIZE = "readBufferBytes";

  public static final int DEFAULT_READBUF_SIZE = 1024;

Flume's default buffer size per event is 32K, but the syslog source is tuned by default to something smaller than that.  So setting syslogagent.sources.Syslog.readBufferBytes to something larger than 1024 should help you.
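In the conf used above, that would be a single added line, for example (4096 is just an example value; whether it takes effect may depend on the source type, as the replies below show):

syslogagent.sources.Syslog.readBufferBytes = 4096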

Note - I haven't tested this yet. I will try it later this afternoon.

marcellolino1

Re: Streaming ArcSight to Hadoop

readBufferBytes is documented in the Flume user guide for the syslog TCP source --> http://flume.apache.org/FlumeUserGuide.html

Just in case, I tested both UDP and TCP.

UDP does not work. All records are truncated at 769 bytes.

TCP does work. I extended readBufferBytes to 4096 and compared it to the netcat source. The records look fine, but we still have the WARN due to the missing header.

BTW, no news from support on the ticket for the header. A lot of back and forth and nothing useful.

I have not seen any trouble using the netcat source in Flume to receive CEF syslog TCP from the SmartConnectors.  I wish I could use UDP rather than TCP for some data sources.  Does anyone see any issue with Flume NetCat as a destination for CEF syslog TCP?

Cheers

jeremy.kelley@h

Re: Streaming ArcSight to Hadoop

I don't see any issue with using netcat.

Concerning patching the UDP Syslog, I have isolated the code that creates the warning, but it turns out adding a configuration option to flume and then passing that through to the different providers is proving ... onerous.

tbarella1

Re: Streaming ArcSight to Hadoop

I built on Jeremy's work with Flume and posted it here: 
