About Segmentation
Event segmentation is key to how Splunk processes your data, both as it is indexed and as it is searched. At index time, the segmentation configuration determines what rules Splunk uses to extract segments (or tokens) from the raw event and store them as entries in the lexicon. Understanding what’s in your lexicon, and the part segmentation plays in it, can help you make your Splunk installation use less disk space, and possibly even run a little faster.
Peering into a tsidx file
Tsidx files are a central part of how Splunk stores your data in a fashion that makes it easily searchable. Each bucket within an index has one or more tsidx files. Every tsidx file has two main components – the values (?) list and the lexicon. The values list is a list of pointers (seek locations) into the bucket’s rawdata, one for every event. The lexicon is a list (tree?) containing all of the segments found at index time, and for each segment a “posting list” of the values-list entries you could follow to find the rawdata of events containing that segment.
Splunk includes a not-very-well-documented utility called walklex. Based on some comments in the docs page, it should be in the list of Command line tools for use with Support, but it’s not there yet. Keep an eye on that topic for more official details – I’ll bet they fix that soon. There’s not a whole lot to walklex – you run it, feeding it a tsidx file name and a single term to search for, and it dumps the matching lexicon terms from the tsidx file, along with a count of the rawdata postings that contain each term.
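The basic invocation looks like this (judging from the examples below, passing an empty string as the term dumps every entry in the lexicon):

$ splunk cmd walklex <tsidx file> "<term>"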
Segmentation example
I have a sample event from a Cisco ASA, indexed into an entirely empty index. Let’s look at how the event is segmented by Splunk’s default segmentation rules. Here is the raw event, followed by the output of walklex for the bucket in question.
2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 9986454 for outside:101.123.123.111/443 (101.123.123.111/443) to vlan9:192.168.120.72/57625 (172.16.1.2/64974)
$ splunk cmd walklex 1399698005-1399698005-17952229929964206551.tsidx ""
my needle:
0 1 host::firewall.example.com
1 1 source::/home/dwaddle/tmp/splunk/cisco_asa/firewall.example.com.2014-05-10.log
2 1 sourcetype::cisco_asa
3 1 %asa-6-302013:
4 1 00
5 1 00:00:05.700433
6 1 05
7 1 1
8 1 10
9 1 101
10 1 101.123.123.111/443
11 1 111
12 1 120
13 1 123
14 1 16
15 1 168
16 1 172
17 1 172.16.1.2/64974
18 1 192
19 1 2
20 1 2014
21 1 2014-05-10
22 1 302013
23 1 443
24 1 57625
25 1 6
26 1 64974
27 1 700433
28 1 72
29 1 9986454
30 1 _indextime::1399829196
31 1 _subsecond::.700433
32 1 asa
33 1 built
34 1 connection
35 1 date_hour::0
36 1 date_mday::10
37 1 date_minute::0
38 1 date_month::may
39 1 date_second::5
40 1 date_wday::saturday
41 1 date_year::2014
42 1 date_zone::local
43 1 for
44 1 host::firewall.example.com
45 1 linecount::1
46 1 outbound
47 1 outside
48 1 outside:101.123.123.111/443
49 1 punct::--_::._%--:_______:.../_(.../)__:.../_(.../)
50 1 source::/home/dwaddle/tmp/splunk/cisco_asa/firewall.example.com.2014-05-10.log
51 1 sourcetype::cisco_asa
52 1 tcp
53 1 timeendpos::26
54 1 timestartpos::0
55 1 to
56 1 vlan9
57 1 vlan9:192.168.120.72/57625
Some things stick out immediately — all uppercase text has been folded to lowercase, indexed fields (host, source, sourcetype, punct, linecount, etc.) are of the form name::value, and some tokens, like IP addresses, are stored both in pieces and whole. But let’s look at a larger example.
I’ve indexed a whole day’s worth of the above firewall log – 5,707,878 events. The original unindexed file is about 782MB, and the resulting Splunk bucket is 694MB. Within the bucket, the rawdata is 156MB and the tsidx file is 538MB.
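If you want to check these numbers on your own system, something like this should work from inside the bucket directory (if I remember the layout right, bucket directories are named db_<latest>_<earliest>_<id>, so your names will differ):

$ cd $SPLUNK_HOME/var/lib/splunk/<index>/db/db_1399784399_1399698000_0
$ du -sh rawdata *.tsidx
156M    rawdata
538M    1399784399-1399698000-17952400407545127995.tsidx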
Cardinality and distribution within the tsidx lexicon
When we look at the lexicon for this tsidx file, we can see that its cardinality (number of unique keywords) is about 11.8 million. The average lexicon keyword occurs in 26 events.
$ splunk cmd walklex 1399784399-1399698000-17952400407545127995.tsidx "" | egrep -v "^my needle" | wc -l
11801764
$ splunk cmd walklex 1399784399-1399698000-17952400407545127995.tsidx "" | egrep -v "^my needle" | awk ' BEGIN { X=0; } { X=X+$2; } END { print X, NR, X/NR } '
309097860 11801764 26.1908
Almost 60% of the lexicon entries (7,047,286) occur in only a single event — and of those, 5,707,878 are the textual versions of timestamps.
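The single-occurrence count is a small variation on the same pipeline; this is where the 7,047,286 figure comes from:

$ splunk cmd walklex 1399784399-1399698000-17952400407545127995.tsidx "" | egrep -v "^my needle" | awk '$2==1' | wc -l
7047286

And of those, the ones shaped like timestamps: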
$ splunk cmd walklex 1399784399-1399698000-17952400407545127995.tsidx "" | egrep -v "^my needle" | awk '$2==1 { print $0 }' | grep -P "\d\d:\d\d:\d\d\.\d{6}" | wc -l
5707878
Do we need to search on textual versions of timestamps?
Probably not. Remember that within Splunk, the time (_time) is stored as a first-class dimension of the data. Every event has a value for _time, and this value is used at search time to decide which buckets are interesting. It would be rare (if ever) that you would search for the string “20:35:54.271819”. Instead, you would set your search time range to cover “20:35:54”. The textual representation of timestamps might be something you can trade off for smaller tsidx files.
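For example, rather than searching for the literal string, you would scope the time range instead (a sketch: the index name fw is made up, and the earliest/latest modifiers use Splunk’s %m/%d/%Y:%H:%M:%S time format):

$ splunk search 'index=fw earliest="05/10/2014:20:35:54" latest="05/10/2014:20:35:55"'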
Configuring segmenters.conf to filter timestamps from being added to the lexicon
I created a $SPLUNK_HOME/etc/system/local/segmenters.conf as follows. The FILTER regex means only the portion of the event captured by the group gets segmented, so the leading timestamp never makes it into the lexicon:
[ciscoasa]
INTERMEDIATE_MAJORS = false
FILTER = ^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d.\d{6} (.*)$
Then I added to $SPLUNK_HOME/etc/system/local/props.conf a reference to this segmenter configuration:
[cisco_asa]
BREAK_ONLY_BEFORE_DATE = true
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%6N
MAX_TIMESTAMP_LOOKAHEAD = 26
SEGMENTATION = ciscoasa
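(Once data is re-indexed with this configuration, a quick way to confirm the FILTER took effect is to look for timestamp-shaped terms in the new bucket’s lexicon; if it worked, this should print 0. The tsidx file name here is a placeholder.)

$ splunk cmd walklex <new tsidx file> "" | egrep -v "^my needle" | grep -cP "\d\d:\d\d:\d\d\.\d{6}"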
Starting with a clean index, I indexed the same file over again. Now, the same set of events requires 494MB of space in the bucket – 156MB of compressed rawdata, and 339MB of tsidx files, saving me 200MB of tsidx space for the same data. The lexicon now has 5,115,535 entries (down from 11,800,000) – and of those, 1,332,323 are entries that occur only once in the raw data. As I look at the items occurring once, a large fraction (1,095,570) are of the form 123.123.124.124/12345 – that is, an IPv4 address and a port number. Some of the same IP addresses occur with many different values of port number – can we do anything to improve this?
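The 1,095,570 figure came from a pipeline along these lines (again, the tsidx file name is a placeholder for the new bucket’s file):

$ splunk cmd walklex <new tsidx file> "" | egrep -v "^my needle" | awk '$2==1 { print $3 }' | grep -cP "^\d+\.\d+\.\d+\.\d+/\d+$"
1095570

Again, back to segmenters.conf: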
[ciscoasa]
INTERMEDIATE_MAJORS = false
FILTER = ^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d.\d{6} (.*)$
MAJOR = / [ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
MINOR = : = @ . - $ # % \\ _
This changes from the default so that “/” becomes a major segmenter. Now, each IP address and port number will be stored in the lexicon as separate entries instead of there being an entry for each combination of IP and port. My lexicon (for the same data) now has 2,767,084 entries – 23% of the original cardinality. The average lexicon entry now occurs in 94 events. My tsidx file size is down to 277MB – just a little over half of its original size.
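One trade-off worth calling out here (see also the comments below): with “/” as a major segmenter, exact-term searches behave differently. A sketch, using a made-up index name fw:

$ splunk search 'index=fw TERM(101.123.123.111)'       # matches only after the change
$ splunk search 'index=fw TERM(101.123.123.111/443)'   # matched before the change, not after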
Conclusions
What have I gained? What have I lost? I’ve lost the ability to search specifically for a textual timestamp. I’ve gained a reduction in disk space used for the same data indexed. I’ve slightly reduced the amount of work required to index this data. I’ve made the job of splunk-optimize easier.
The improvement in disk space usage is significant and easily measured. The other effects are probably not as easy to measure. Any data going into Splunk that exhibits high cardinality in the lexicon has a chance of making your tsidx files as large as (if not larger than) the original data. As Splunk admins, we don’t expect this, because it is atypical for IT data. By knowing how to measure (and possibly affect) the cardinality of the lexicon within your Splunk index buckets, you will be better equipped to deal with atypical data and the demands it places on your Splunk installation.
Awesome post, Duane. Any chance you can post the results of the same walklex command to show the segmentation of the index after the change?
Certainly. I’ll try to get that done this weekend.
!duckfez++
Awesome. Did you see any effect on search performance as well?
I have not tested enough yet to know for sure. I have a theory I’m going to work on to see if I can measure the O(n) for a single-term search as a function of the size of the lexicon. If Splunk implements the lexicon as a tree it should be close to logarithmic.
!duckfez++
Just found this post. Very nice!
Just a quick comment on the ip/port combo. You also lose the ability to search TERM(123.123.124.124/12345), but assuming you don’t care to do so, that’s not a big loss. On the flip side, you now CAN search for TERM(123.123.124.124), which wouldn’t have matched that event before. (Overall I think this is probably a win.)
It’s been a while since I’ve played around with custom segmentation values; historically the problem I’ve run into is click-through support in the UI. If the search-time segmentation doesn’t line up with index-time segmentation you can get unexpected results, which can screw up newbies!