Splunking a Microsoft Word document for metadata and content analysis

Big Data is nowadays often summarized with ‘V’s: the 3Vs of Big Data, the 4Vs of Big Data, even the 5Vs of Big Data! However many ‘V’s are used, two are always dedicated to Volume and Variety.

Recent news provides particularly rich examples, one being the Panama Papers. As explained by Wikipedia:

The Panama Papers are a leaked set of 11.5 million confidential documents that provide detailed information about more than 214,000 offshore companies listed by the Panamanian corporate service provider Mossack Fonseca. The documents […] totaled 2.6 terabytes of data.

This leak illustrates the following pretty well:

  • The need to process a huge volume of data (2.6 TB in that particular case)
  • The need to process different kinds of data (emails, database dumps, PDF documents, Word documents, etc.).

So, let’s see what we could do to Splunk a Word document!

 

A Word document is a Zip file!

As illustrated by the results of the Linux file command, a Word document is a Zip archive.


# file document.docx
document.docx: Zip archive data, at least v2.0 to extract
#
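
The same check can be sketched in Python. This is only an illustration under a few assumptions: the helper names looks_like_zip and make_sample_package are made up for this example, and the sample package is a tiny stand-in built in memory, not a valid Word document.

```python
import io
import zipfile

# Signature found at the start of a non-empty Zip archive (local file header).
ZIP_LOCAL_HEADER = b"PK\x03\x04"


def looks_like_zip(data: bytes) -> bool:
    """Return True if the bytes start with the Zip signature and parse as an archive."""
    return data.startswith(ZIP_LOCAL_HEADER) and zipfile.is_zipfile(io.BytesIO(data))


def make_sample_package() -> bytes:
    """Build a tiny stand-in package in memory (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


sample = make_sample_package()
print(looks_like_zip(sample))
print(looks_like_zip(b"plain text"))
```

With a real file, the bytes would come from open("document.docx", "rb").read() instead of the in-memory sample.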

Since Splunk is able to uncompress Zip files to read the logs they contain, let’s see what happens if we try to Splunk a Word document as is.

MS Word - 001

Pretty ugly. Unfortunately, as illustrated by the above screenshot, Splunk 6.4 only provides illegible results because it cannot index a Word document without prior preprocessing.

 

Word document format

Microsoft introduced an XML representation of Word documents with Word 2003, and it has since evolved into a multiple-file representation (aggregated under the now familiar .docx extension). Because no functionality was lost in moving from a binary to an XML representation, the resulting XML files can be intimidating: they contain a lot of information that relates not to the actual content of the file, but to the presentation of that content.

A Microsoft Word 2007 file consists of a compressed ZIP file, called a package, which contains three major components:

  • Part items, the actual files
  • Content type items, the description of each part item (e.g., file XYZ is an image/png)
  • Relationship items, which describe how everything fits together.

Readers expecting a complete and precise description of the format of a Word 2007 document are invited to go through the Walkthrough of Word 2007 XML Format from Microsoft.

 

Uncompress & Index

After using the regular unzip command to extract the files from the docx package into a directory named “document”, the listing of the files is as follows:


# find document/ -type f | sort
document/[Content_Types].xml
document/customXml/item1.xml
document/customXml/itemProps1.xml
document/customXml/_rels/item1.xml.rels
document/docProps/app.xml
document/docProps/core.xml
document/docProps/thumbnail.jpeg
document/_rels/.rels
document/word/document.xml
document/word/fontTable.xml
document/word/media/image1.emf
document/word/media/image2.emf
document/word/media/image3.emf
document/word/media/image4.png
document/word/media/image5.png
document/word/media/image6.png
document/word/media/image7.png
document/word/numbering.xml
document/word/_rels/document.xml.rels
document/word/settings.xml
document/word/stylesWithEffects.xml
document/word/styles.xml
document/word/theme/theme1.xml
document/word/webSettings.xml
#
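
For readers who prefer scripting this step, here is a minimal Python sketch of the extraction and listing. It assumes a small in-memory package with made-up contents; extract_package is a hypothetical helper name, and only the standard zipfile, tempfile and pathlib modules are used.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_package(data: bytes, target: Path) -> list:
    """Extract a package into `target` and return the sorted relative file paths."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(target)
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )


# Hypothetical stand-in for document.docx, built in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("docProps/core.xml", "<coreProperties/>")
    zf.writestr("word/document.xml", "<document/>")
sample = buf.getvalue()

with tempfile.TemporaryDirectory() as tmp:
    files = extract_package(sample, Path(tmp) / "document")

for name in files:
    print(name)
```

The sorted listing plays the role of the find | sort pipeline above.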

As we can see, many of the files are XML, that is, plain text files that Splunk can ingest. To ingest that directory, a custom sourcetype has been created with the TRUNCATE property set to 0, which disables line truncation (props.conf):

TRUNCATE = 0

The TRUNCATE option is required to make sure Splunk completely indexes all the files (except the binary ones like images; see the NO_BINARY_CHECK option for that).

After ingesting the whole directory, here is how one event looks in Splunk:

MS Word - 002

The resulting events are more user friendly, but not really operationally exploitable yet.

 

Content Types

At the root of our document directory, the file [Content_Types].xml contains the content type specifications. As this is a flat XML file, we can parse it with the Splunk spath command to visualize what kind of content we have in our Word document, as illustrated by the following screenshot. In this example, we have two kinds of data: XML files and images.

 

The MSDN walkthrough details the construction of that file:

  • A typical content type begins with the word application and is followed by the vendor name.
  • The word vendor is abbreviated to vnd.
  • All content types that are specific to Word begin with application/vnd.ms-word.
  • If a content type is an XML file, then the URI ends with +xml. Other non-XML content types, such as images, do not have this addition.
  • etc…

 

So, using regular Splunk-fu, we can parse our content type file to get more usable fields:

MS Word - 003

The search is detailed hereafter:

source="*[Content_Types].xml"
| spath input=_raw
| rename Types.Override{@ContentType} AS ContentType Types.Override{@PartName} AS PartName
| fields PartName ContentType
| eval data = mvzip(ContentType, PartName)
| mvexpand data
| eval tmp = split(data, ",")
| eval ContentType = mvindex(tmp, 0)
| eval PartName = mvindex(tmp, 1)
| eval tmp=split(ContentType, "/")
| eval family_type=mvindex(tmp,0)
| eval part2=substr(ContentType,len(family_type)+2)
| rex field=part2 "vnd\.(?<vendor>[^.$]+)"
| eval part3=substr(part2, len(vendor)+6)
| eval isXML = if(match(part3, "\+xml$"),"Yes", "No")
| eval filetype = if(match(part3, "\+xml$"),substr(part3, 1, len(part3)-4), part3)
| table PartName family_type vendor isXML filetype ContentType
| sort PartName
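
Outside of Splunk, the same breakdown can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration rather than the method used above: the sample XML is a shortened, hypothetical [Content_Types].xml, and describe and parse_content_types are names invented for this example. Like the search, it only looks at Override entries.

```python
import xml.etree.ElementTree as ET

# Namespace of the content types part in an Open Packaging Conventions package.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Shortened, hypothetical sample of a [Content_Types].xml file.
SAMPLE = """\
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="png" ContentType="image/png"/>
  <Override PartName="/word/document.xml"
            ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
  <Override PartName="/docProps/core.xml"
            ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>
"""


def describe(content_type: str) -> dict:
    """Split a content type into family, vendor, XML flag and file type."""
    family, _, rest = content_type.partition("/")
    vendor = None
    if rest.startswith("vnd."):
        # The vendor name runs up to the next dot after "vnd.".
        vendor, _, rest = rest[4:].partition(".")
    is_xml = rest.endswith("+xml")
    filetype = rest[:-4] if is_xml else rest
    return {"family_type": family, "vendor": vendor, "isXML": is_xml, "filetype": filetype}


def parse_content_types(xml_text: str) -> list:
    """Return one row per Override element, sorted by part name."""
    root = ET.fromstring(xml_text)
    rows = []
    for override in root.iter(CT_NS + "Override"):
        row = {"PartName": override.get("PartName")}
        row.update(describe(override.get("ContentType")))
        rows.append(row)
    return sorted(rows, key=lambda r: r["PartName"])


rows = parse_content_types(SAMPLE)
for row in rows:
    print(row)
```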

 

Document Properties (Word metadata)

Two very interesting files exist within a Word 2007 package: core.xml and app.xml, both in the docProps directory. Simple parsing with the Splunk spath command can give us insights into the author of the document, the creation time, the modification time, the number of pages in the document, the system on which the document was created, the number of characters, etc.

core.xml

MS Word - 004

app.xml

MS Word - 005
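
As a rough Python counterpart to the spath parsing, the sketch below reads a few fields from core.xml. The sample XML and every value in it (names, dates) are made up for illustration, and read_core_properties is a hypothetical helper; a real core.xml may carry more or fewer elements.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the core properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical sample; all values are invented.
SAMPLE_CORE_XML = """\
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:creator>Jane Doe</dc:creator>
  <cp:lastModifiedBy>John Roe</cp:lastModifiedBy>
  <dcterms:created xsi:type="dcterms:W3CDTF">2016-06-01T09:00:00Z</dcterms:created>
  <dcterms:modified xsi:type="dcterms:W3CDTF">2016-06-02T10:30:00Z</dcterms:modified>
</cp:coreProperties>
"""


def read_core_properties(xml_text: str) -> dict:
    """Return author and timestamp fields; missing elements come back as None."""
    root = ET.fromstring(xml_text)
    return {
        "creator": root.findtext("dc:creator", namespaces=NS),
        "lastModifiedBy": root.findtext("cp:lastModifiedBy", namespaces=NS),
        "created": root.findtext("dcterms:created", namespaces=NS),
        "modified": root.findtext("dcterms:modified", namespaces=NS),
    }


props = read_core_properties(SAMPLE_CORE_XML)
print(props)
```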

 

Revision IDs (RSID)

To dive deeper into the actual content of such a file, one key mechanism to understand about Word documents is revision identifiers (rsids). It’s very well explained here:

Every time a document is opened and edited a unique ID is generated, and any edits that are made get labeled with that ID. This doesn’t track who made the edits, or what date they were made, but it does allow you to see what was done in a unique session. The list of RSIDS is stored at the top of the document, and then every piece of text is labeled with the RSID from the session that text was entered.

 

Practically speaking, this leads to something like this:

MS Word - 006

Note that the sentence in the analyzed Word document was “When a notable event is raised, a security analyst needs […] or identities. This manual task […]”.

Clearly, a lot of noise surrounds the real content of the document (this “noise” is there on purpose, but that level of detail isn’t appropriate in our case because we just want access to the words composing the document).
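
To make the rsid mechanism more concrete, here is a small Python sketch that groups the text of each run by editing session. It rests on several assumptions: the XML fragment and its rsid values are invented, text_by_session is a hypothetical helper, and a run without its own w:rsidR attribute is attributed to the rsid of its paragraph, which is a simplification.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# WordprocessingML main namespace, in ElementTree's {uri}name notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Invented fragment: the rsid values are hypothetical.
SAMPLE_DOCUMENT_XML = """\
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p w:rsidR="00A1B2C3" w:rsidRDefault="00A1B2C3">
      <w:r><w:t>When a notable event is raised, </w:t></w:r>
      <w:r w:rsidR="00D4E5F6"><w:t>a security analyst needs</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
"""


def text_by_session(xml_text: str) -> dict:
    """Map each rsid to the list of text pieces labeled with it."""
    root = ET.fromstring(xml_text)
    sessions = defaultdict(list)
    for para in root.iter(W + "p"):
        para_rsid = para.get(W + "rsidR")
        for run in para.iter(W + "r"):
            # Assumption: fall back to the paragraph's rsid when the run has none.
            rsid = run.get(W + "rsidR", para_rsid)
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if text:
                sessions[rsid].append(text)
    return dict(sessions)


sessions = text_by_session(SAMPLE_DOCUMENT_XML)
for rsid, pieces in sessions.items():
    print(rsid, pieces)
```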

 

Accessing the content of the Word document

As the content is actually XML, it could be parsed the same way as the previous files with the Splunk spath command.

MS Word - 007

 

This method has two problems: some words or sentences are cut in the middle, and we need to know the exact path in the XML tree (here, <w:p><w:r><w:t> under the root <w:document><w:body>).

However, we know for sure that the actual content of the file lies between the <w:body> and </w:body> tags. The idea is then to extract the content within those boundaries and remove the XML tags.

MS Word - 008

 

The Splunk search is presented hereafter. The result is one field containing the whole content of the file as illustrated above.

source="*/document.xml"
| rex field=_raw "\<w:body\>(?<wbody>.+)\</w:body\>"
| fields wbody
| rex field=wbody mode=sed "s/\<[^>]+\>/ /g"
| table wbody
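
The two rex steps translate almost directly to Python's re module. The sketch below is illustrative only: body_text is a hypothetical helper, the sample XML is a made-up fragment, and the final whitespace collapsing is an extra step that the search above does not perform.

```python
import re

# Made-up fragment standing in for word/document.xml.
SAMPLE_DOCUMENT_XML = (
    "<w:document><w:body><w:p>"
    "<w:r><w:t>When a notable event is raised,</w:t></w:r>"
    "<w:r><w:t>a security analyst needs</w:t></w:r>"
    "</w:p></w:body></w:document>"
)


def body_text(document_xml: str) -> str:
    """Extract what lies between <w:body> and </w:body>, then drop the XML tags."""
    match = re.search(r"<w:body>(.+)</w:body>", document_xml, flags=re.DOTALL)
    if match is None:
        return ""
    # Same substitution as the sed expression: every tag becomes a space.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    # Extra step (not in the search): collapse runs of whitespace.
    return " ".join(text.split())


wbody = body_text(SAMPLE_DOCUMENT_XML)
print(wbody)
```

Because every tag is replaced by a space, a word that Word happened to split across two runs would come out as two pieces with this approach.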

That’s more practical, but what about searching for a term within that document, which is now contained in one single field?

One trick is to split the field into multiple values based on punctuation. The output is similar to the first approach with the spath command, the big difference being that words are not cut in the middle!

MS Word - 009

source="*/document.xml"
| rex field=_raw "\<w:body\>(?<wbody>.+)\</w:body\>"
| fields wbody
| rex field=wbody mode=sed "s/\<[^>]+\>/ /g"
| rex field=wbody mode=sed "s/[[:punct:]]/#/g"
| eval wb = split(wbody, "#")
| mvexpand wb
| table wb

From there, we can easily search for simple terms by appending the following to the above search:

| search wb = "*notable*"

In this example, the word “notable” will be searched across the entire document.
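
The split-and-search idea can also be sketched in Python. The assumptions are spelled out here: the sample sentence is invented, split_on_punctuation and search are hypothetical helper names, and string.punctuation is used as a stand-in for the [[:punct:]] class (it covers ASCII punctuation only).

```python
import re
import string


def split_on_punctuation(text: str) -> list:
    """Split text on ASCII punctuation and drop empty chunks."""
    pattern = "[" + re.escape(string.punctuation) + "]"
    return [chunk.strip() for chunk in re.split(pattern, text) if chunk.strip()]


def search(chunks: list, term: str) -> list:
    """Case-insensitive substring match, similar in spirit to wb="*term*"."""
    term = term.lower()
    return [chunk for chunk in chunks if term in chunk.lower()]


# Invented sample text, not taken from the analyzed document.
wbody = "Splunk indexes logs. A notable event is raised, then reviewed; analysts close it."
chunks = split_on_punctuation(wbody)
hits = search(chunks, "notable")
print(chunks)
print(hits)
```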

 

 

Conclusion

This article only scratches the surface of the Microsoft Word 2007 format (now a worldwide standard under the references ECMA-376 and ISO/IEC 29500), and does not cover core components such as relationship items. While it is technically possible to Splunk Word documents, it is not an easy task, and it is operationally limited, as illustrated above.

Now, one question remains: what are your use cases around such a feature? (-:

 


From Splunk Blogs: http://blogs.splunk.com/2016/06/30/splunking-a-microsoft-word-document-for-metadata-and-content-anal… (Cedric Le Roux)