Index ICU: Assertion `_sourceMetaData != __null' failed, part 1
| Topics: | tech |
|---|---|
| Tags: | crash, index, repair |
There you were, merrily going along, and Boom! Somebody kicks the power switch, your filesystem goes off the deep end, and something Very Bad happens. You start to understand why fsck is a four-letter word. After using some additional four-letter words, you get things up and running. But what’s with Splunk? It won’t start!? You only get some cryptic error and “Splunkd appears to be down.” Welcome to the world of WordData. You had a backup, right? Yeah, thought so.
Buried deep in the index are a bunch of *.data files:
www.feorlen.org[feorlen]:/Applications/splunk/var/lib/splunk/defaultdb/db$ ls -lr *.data
-rw-r--r-- 1 root admin 10276 Sep 3 07:41 Sources.data
-rw-r--r-- 1 root admin 5085 Sep 3 07:41 SourceTypes.data
-rw-r--r-- 1 root admin 252 Sep 3 07:41 Hosts.data
-rw-r--r-- 1 root admin 21 Jul 26 19:19 EventTypes.data
You will find them in every bucket. They contain event counts for sources, sourcetypes, hosts and event types, along with some time range info. During indexing, these are constantly being updated. They are supposed to look something like this (note my timestamping oops there for host::grumpy):
$ more Hosts.data
0 0 2147483647 0 0
1 host::grumpy 11194556 900458000 1231448496 1220453014
2 host::www 1953184 1194131619 1220452994 1220452994
3 host::www.feorlen.org 2350 1207761050 1216665145 1216665145
4 host::localhost 7482 1203904810 1217973661 1217973661
Except when they look like this:
$ more Hosts.data
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@
Hosts.data (END)
That isn’t very good. splunkd doesn’t much like it when somebody messes with its *.data files. Each bucket is also supposed to contain, at minimum, Sources.data, SourceTypes.data, and Hosts.data. (EventTypes.data may legitimately not be there in some cases.) Your crash log will likely contain something like this:
Backtrace:
[0x00002B51C8EEFB6E] abort + 270 (/lib/libc.so.6)
[0x00002B51C8EE8266] __assert_fail + 246 (/lib/libc.so.6)
[0x000000000066661D] ? (splunkd)
[0x0000000000697BA6] _ZN23DatabasePartitionPolicy20getSourceWordForCodeEmmR3Str + 182 (splunkd)
and here is the real smoking gun in splunkd_stderr.log:
splunkd: /opt/splunk/p4/splunk/branches/3.2/src/pipeline/indexer/TimeInvertedIndex.cpp:974: void TimeInvertedIndex::getSourceWordForCode(long unsigned int, Str&): Assertion `_sourceMetaData != __null' failed.
Ok, so you’ve got a horked *.data file. Where? Well, based on frequency of writes, it’s going to be in a db-hot directory because that is where active indexing is going on. And the most active indexes are usually fishbucket, _internal and defaultdb. Start by looking for *.data files that are binary. Here’s one way you can find which files are binary, a big clue on where the problem is:
$ cd /opt/splunk/var/lib/splunk
$ find . -name '*.data' | xargs grep "." | grep Binary
Binary file ./_internaldb/db/db-hot/Hosts.data matches
Binary file ./_internaldb/db/db-hot/Sources.data matches
Binary file ./_internaldb/db/db-hot/SourceTypes.data matches
Binary file ./fishbucket/db/db-hot/Sources.data matches
The file command will also do it, but beware of false positives:
$ for i in $(find . -name '*.data'); do file "$i" | grep -v text; done
./_internaldb/db/db-hot/Hosts.data: data
./_internaldb/db/db-hot/Sources.data: data
./_internaldb/db/db-hot/SourceTypes.data: data
./defaultdb/db/db_1214955936_1210836930_38/Hosts.data: Bio-Rad .PIC Image File 2352 x 12297, 14601 images in file
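Since the damaged file shown earlier is full of NUL bytes (the ^@ characters), another option is to look for NUL bytes directly instead of relying on file's guesses. This is only a minimal sketch: the function names are made up for illustration, and it assumes the corruption shows up as NULs, which may not hold for every kind of damage.

```shell
#!/bin/sh
# has_nul_bytes FILE
# Hypothetical helper (not part of Splunk): succeeds if FILE contains
# at least one NUL byte.
has_nul_bytes() {
    # Compare the byte count before and after stripping NULs.
    # $(( ... )) normalizes the padded output some wc versions print.
    total=$(( $(wc -c < "$1") ))
    stripped=$(( $(tr -d '\000' < "$1" | wc -c) ))
    [ "$total" -ne "$stripped" ]
}

# list_nul_data_files DIR
# Print every *.data file under DIR that contains NUL bytes.
list_nul_data_files() {
    find "$1" -name '*.data' -type f | while IFS= read -r f; do
        if has_nul_bytes "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Example: list_nul_data_files /opt/splunk/var/lib/splunk
```

A healthy *.data file is plain text, so any hit here deserves a closer look.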
Another check is to see if the line numbers in the file are in ascending order. If they aren’t, then something is seriously wrong:
$ for i in $(find . -name '*.data'); do sort -nc "$i"; done
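If you want to know which line is out of place, the same idea can be sketched with awk. The function name is hypothetical, and the check assumes the first field must strictly increase from line to line, as it does in the healthy Hosts.data sample earlier. I have not confirmed that as a rule of the format.

```shell
#!/bin/sh
# check_line_numbers FILE
# Hypothetical helper: fails (and reports the offending line) if the first
# field of FILE is not strictly increasing. Assumes a plain-text file.
check_line_numbers() {
    awk '
        BEGIN { bad = 0 }
        # From the second line on, the leading number must exceed the previous one.
        NR > 1 && ($1 + 0) <= prev {
            printf "%s: line %d out of order\n", FILENAME, NR
            bad = 1
            exit
        }
        { prev = $1 + 0 }
        END { exit bad }
    ' "$1"
}

# Example: check_line_numbers ./defaultdb/db/db-hot/Hosts.data
```

Run it only on files that are still text; awk implementations differ in how they treat NUL bytes.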
Have a look at these files and see what’s in them. If they are only partially corrupted, you may be able to edit out the garbage. If they are totally full of junk, you will need to find replacements. For _internaldb and fishbucket, you may not care if your event counts are exactly correct so you can lift some files from another bucket. If the problem were in defaultdb or another index containing your real indexed data, you’ll need to pay more attention to the contents.
In the simple case, if the files in db-hot are trashed, see if there is a warm bucket next to it that you can copy replacements from. Warm buckets are in the same directory as db-hot and look something like db_1218802821_1218658318_17. Copy the *.data files from there into db-hot and try to restart Splunk. If it starts, then you are good to go. If not, that means there is more damage to repair. If there are other binary *.data files, make sure you deal with all of them.
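The copy step can be sketched as a small shell function. This is an illustration, not an official procedure: the function name and the separate backup directory are my own assumptions, and it should only be run while Splunk is stopped.

```shell
#!/bin/sh
# replace_data_files HOT_DIR WARM_DIR BACKUP_DIR
# Hypothetical helper: save the damaged *.data files from HOT_DIR into
# BACKUP_DIR (which must already exist), then copy the *.data files from
# WARM_DIR over them.
replace_data_files() {
    hot=$1
    warm=$2
    backup=$3

    # Keep the damaged originals around in case they are needed later.
    for f in "$hot"/*.data; do
        [ -e "$f" ] || continue
        cp -p "$f" "$backup/" || return 1
    done

    # Copy the replacements from the warm bucket.
    for f in "$warm"/*.data; do
        [ -e "$f" ] || continue
        cp -p "$f" "$hot/" || return 1
    done
}

# Example (paths are illustrative):
# replace_data_files ./_internaldb/db/db-hot \
#     ./_internaldb/db/db_1218802821_1218658318_17 /tmp/data-backup
```

Keeping the backup outside the index directory avoids leaving stray files where splunkd might trip over them.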
This should handle the most common types of problems. I’ll go into more detailed debugging and reconstruction in another post.

February 9th, 2009 at 4:12 pm
Another possibility is the complete absence of a .data file.
Each bucket should have a Hosts.data, Sources.data and SourceTypes.data.
for dir in $(find /opt/splunk/var/lib/splunk -name "db_*" -type d); do
echo -n $dir; ls $dir/*.data | wc -l;
done
You should see 3 for each line.
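A variation that names the missing files instead of counting them might look like the sketch below. The function name is hypothetical, and including db-hot in the search is my own addition, since the loop above only matches db_* directories.

```shell
#!/bin/sh
# report_missing_data_files INDEX_ROOT
# Hypothetical helper: for every bucket directory (db_* or db-hot) under
# INDEX_ROOT, print a line for each required *.data file that is absent.
report_missing_data_files() {
    find "$1" -type d \( -name 'db_*' -o -name 'db-hot' \) |
    while IFS= read -r dir; do
        for name in Hosts.data Sources.data SourceTypes.data; do
            [ -f "$dir/$name" ] || printf 'missing: %s/%s\n' "$dir" "$name"
        done
    done
}

# Example: report_missing_data_files /opt/splunk/var/lib/splunk
```

No output means every bucket found has all three files.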
June 4th, 2009 at 1:20 pm
FmyI:
for dir in $(find /opt/splunk/var/lib/splunk -name "db_*" -type d); do echo -n $dir; ls $dir/*.data | wc -l; done
July 1st, 2009 at 10:37 am
I’ve had success using the following command to rebuild the .data files from the index information:
recover-metadata
I’m not sure of all of the side effects of this, but here are two considerations: (1) This will completely replace your existing .data files (assuming the index itself isn’t corrupt, in which case you can end up with less data than you had in the first place), so you should make a backup of your .data files first. And (2), all of your hosts, sources, and sourcetypes will be in all lowercase. I wrote a small Python script to guess what the original case was based on other .data files in the top-level db directory. You can see/get a copy here: http://pastebin.ca/1481049