Splunking a Microsoft Word document for metadata and content analysis

The Big Data ecosystem is nowadays often abbreviated with ‘V’s. The 3Vs of Big Data, or the 4Vs of Big Data, even the 5Vs of Big Data! However many ‘V’s are used, two are always dedicated to Volume and Variety.

Recent news provides particularly rich examples with one being the Panama Papers. As explained by Wikipedia:

The Panama Papers are a leaked set of 11.5 million confidential documents that provide detailed information about more than 214,000 offshore companies listed by the Panamanian corporate service provider Mossack Fonseca. The documents […] totaled 2.6 terabytes of data.

This leak illustrates the following pretty well:

  • The need to process huge volume of data (2.6 TB of data in that particular case)
  • The need to
» Continue reading