TechOpsGuys.com Diggin' technology every day

August 4, 2009

1 Billion events in Splunk

Filed under: Monitoring — Nate @ 10:43 pm

I was on a conference call with Splunk about a month or so ago; we recently bought it after using it off and on for a while. One thing that stuck out to me on that call was the engineer's excitement at being able to show off a system that had a billion events in it. I started a fresh Splunk database in early June 2009, I think it was, and we recently passed 1 billion events ourselves. The index/DB (whatever you want to call it) just reached about 100GB (the screenshot below is a week or two old). The system is still pretty quick too, running on a simple dual-Xeon system with 8GB of memory and a software iSCSI connection to the SAN.

We have something like 400 hosts logging to it (we just retired about 100 additional ones a month ago, and we're going to retire another 80-100 in the coming weeks as we upgrade hardware). It's still not fully deployed; right now about 99% of the data is from syslog.
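Since nearly all of that volume arrives over plain syslog, for reference here is a minimal sketch of what a syslog listener stanza looks like in Splunk's inputs.conf (the port and index name are assumptions for illustration, not our actual config):

# Minimal syslog listener sketch for Splunk's inputs.conf
# (port 514 and "index = main" are assumptions)
[udp://514]
sourcetype = syslog
index = main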

We upgraded to Splunk v4 the day it came out. It has some nice improvements, and I filed a bug the day it came out too (well, a few). The most annoying one is that I can't log in to v4 with Mozilla browsers (nobody in my company can), only with IE. We suspect it's some behavioral issue between our really basic Apache reverse proxy and Splunk; the support guys are still looking at it. On top of that, both their Cisco and F5 apps show no data, despite our index having millions of log events from both Cisco and F5 devices. They are looking into that too.
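For context, the proxy in front of Splunk really is basic, a plain Apache mod_proxy passthrough roughly along these lines (the hostname is a placeholder, though port 8000 is Splunk Web's default):

# Pass everything through to Splunk Web and rewrite
# redirect headers on the way back out
ProxyPreserveHost On
ProxyPass        / http://splunk-host:8000/
ProxyPassReverse / http://splunk-host:8000/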

1 billion logged events

Will it hold?

Filed under: Monitoring, Storage — Nate @ 10:21 pm

I went through a pretty massive storage refresh earlier this year which cut our floorspace in half, power in half, disks in half, etc., and improved performance at the same time. It has exceeded my expectations. More recently, though, I have gotten worried about how far the cache and disks will scale before they run out of gas. I have plans to increase the disk count by 50% (from 200 to 300 disks) at the end of the year, but will we last until then? My past (admittedly limited) storage experience says we should already be having lots of problems, but we are not. The system's architecture and large caches are absorbing the hit; performance remains high and very responsive to the servers. How long will that hold up, though?

There are thousands of metrics available to me, but the one metric that is not available is cache utilization. I can get hit ratios on tons of things, but no info on how full the cache is at any particular period of time (for either NAS or SAN).

To illustrate my point, here is a graphic from my in-house monitoring showing sustained spindle response times over 60 milliseconds:

Physical Disk response time

And yet on the front end, response times are typically 2 milliseconds:

Fiber channel response time to NAS cluster

There are spikes, of course. There is a known batch job that kicks off tons of parallel writes and blows out the cache on occasion, a big gripe I have with the developers of the app and their inability (so far) to throttle its behavior. I do hold my breath on occasion when I personally witness the caches getting blown out (if you add up both NAS and SAN caches, it's about 70GB of mirrored memory). But as you can see, on both the read and especially the write side, the advanced controllers are absorbing a huge hit. And the trend over the past few months has been a pretty steep climb upwards as more things run on the system. My hope is that things level off soon; that hasn't happened yet.
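To put rough numbers on what "absorbing the hit" means: write cache buys you time equal to its size divided by how much faster writes arrive than the disks can destage them. A quick Ruby sketch, where every rate is a made-up illustration and only the ~70GB combined cache figure comes from above:

#!/usr/bin/env ruby
# Back-of-the-envelope: how long can the cache absorb a write burst?
cache_gb   = 70.0   # combined NAS+SAN mirrored cache (from the post)
ingest_mbs = 900.0  # hypothetical burst write rate, MB/s
drain_mbs  = 400.0  # hypothetical destage rate to the spindles, MB/s

fill_rate = ingest_mbs - drain_mbs          # net cache growth, MB/s
seconds   = (cache_gb * 1024) / fill_rate   # time until the cache is full
printf("Cache absorbs the burst for ~%.1f minutes\n", seconds / 60)
# => Cache absorbs the burst for ~2.4 minutes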

The previous arrays I have used would not have been able to sustain this, by any stretch.

Will it hold?

Making RRD output readable

Filed under: Monitoring — @ 8:19 pm

I have been doing a lot of work lately creating new data points to monitor with Cacti, and when troubleshooting why a new data point is not working I have been running into a bit of an issue. I can see what my script is handing to the Cacti poller, and I can see what Cacti is putting in the RRD file (with increased logging), but I can't easily see what RRD has done with that data before handing it off to Cacti. By default, RRD stores timestamps in epoch time (seconds since midnight UTC on January 1st, 1970) and data in scientific notation. Now, I don't know about you, but I can't read either of those without some help.
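To make the conversions concrete, the helper only needs two transformations: epoch seconds into a date string, and scientific notation into a fixed-point number. In Ruby those are just Time.at and to_f. For example (the timestamp is back-derived from the sample output further down, assuming the box keeps US Pacific time):

Time.at(1249330680).strftime("%Y-%m-%d %H:%M:%S")  # => "2009-08-03 13:18:00"
"2.1329510400e+08".to_f                            # => 213295104.0

With that in mind, here is my little Ruby helper script: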

#!/usr/bin/env ruby
# Author: W. David Nash III
# Version 0.1
# August 3, 2009
#
# Reads "rrdtool fetch" output on stdin and reprints it with
# human-readable timestamps and fixed-point numbers.

count = 0
STDIN.each do |l|

        count += 1

        printf("%-3i | ", count)

        if !l.match(/^\d+/)
                # Header (or blank) line: remember the column names
                header = l.split
        else
                # Data line: "<epoch>: <value> <value> ..."
                (td, data) = l.split(/:/, 2)
                time = Time.at(td.to_i)
                printf("%s:", time.strftime("%Y-%m-%d %H:%M:%S"))

                data.split.each do |d|
                        d = "0.00" if d == "nan"
                        printf(" | %20.2f", d.to_f)
                end
        end

        # The first line of fetch output carries the column names
        if count == 1
                printf("%20s", "Time")
                header.each do |h|
                        printf(" | %20s", h)
                end
        end

        puts
end

and you use it like so:

rrdtool fetch rra/<filename>.rrd AVERAGE -s -1h -r 60 | ./readRRD.rb

and here is some sample output:

1   |                 Time |   Heap_Mem_Committed |         Heap_Mem_Max |        Heap_Mem_Used | Non_Heap_Mem_Commit |    Non_Heap_Mem_Init |     Non_Heap_Mem_Max |    Non_Heap_Mem_Used |             CPU_TIME |            User_Time |         Thread_Count |    Peak_Thread_Count |        Heap_Mem_Init
2   |
3   | 2009-08-03 13:18:00: |         213295104.00 |         532742144.00 |         130720632.67 |          36405248.00 |          12746752.00 |         100663296.00 |          36383328.00 |         623333333.33 |         531666666.67 |               111.33 |               184.00 |                 0.00
4   | 2009-08-03 13:19:00: |         213295104.00 |         532742144.00 |         132090801.60 |          36405248.00 |          12746752.00 |         100663296.00 |          36383328.00 |        1818000000.00 |        1704000000.00 |               111.80 |               184.00 |                 0.00
5   | 2009-08-03 13:20:00: |         213295104.00 |         532742144.00 |         122721880.67 |          36405248.00 |          12746752.00 |         100663296.00 |          36383328.00 |        2186666666.70 |        2057500000.00 |               112.92 |               184.00 |                 0.00
