Andrew Chabak

And Now for Something Different


By Andrew Chabak, Chief Data Officer, Rembrandt Group LLC

Since this is my first time writing for this publication, I wanted to be totally different in approach and subject from what others are writing about Big Data. There are countless articles on Hadoop, ETL, SQL, NoSQL and other topics.


So, let’s start with the data.  Most of the data for Big Data applications is on the Internet or in purchased databases and libraries.  How do you get it from where it is to where it needs to be: the Big Data platform you just spent millions to create?


  1. Analyze the problem you wish to solve
  2. Determine the data you need: both type and location
  3. Access the source and retrieve the data

This sounds easy at a high level, but in practice it is difficult, time-consuming and costly.

Internet and social media information sourcing is extremely complicated. Your teams will need to find the right data, extract it, process it and move data and information at various levels of detail to the Big Data staging or collection area. Internal data is easy to acquire via ETL or ELT processing with tools such as Informatica, Ab Initio or DataStage. However, these tools work poorly on the internet, if at all. Data teams in business groups and IT generally acquire this information through detailed, time-consuming manual searches. They then extract knowledge nuggets and report on them in spreadsheets, or they write SQL to merge the results with internal sources.


New tools are emerging for mining internet and social media information from free and paid external sources:


  1. Human Intelligence Automation (HIA)
  2. Knowledge Management Facilitation (KMF)
  3. Semantic Searching Analysis and the Semantic Web, which I call Super Semantic Searching (S3)


Benefits include the following:

HIA automates the tedious process of repeatedly searching web sites. These technologies deliver a high ROI by mechanizing the work of analysts and the processing of their findings. In one case I recently measured, HIA reduced 40 hours of web searching to fewer than two hours, a saving of more than 95 percent, and mechanized the effort to feed a daily dashboard for senior management. This competitive and vendor analysis is now scheduled nightly.
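As a rough illustration of the kind of repeated searching HIA mechanizes, the Python sketch below counts mentions of a watch list of terms on a set of pages and produces one summary row per source for a dashboard. The watch terms, function names and row layout are assumptions made for this example, not part of any HIA product.

```python
import re
from urllib.request import urlopen

# Hypothetical watch list: the terms an analyst would otherwise search for by hand.
WATCH_TERMS = ["price cut", "new product", "recall"]


def fetch_text(url, timeout=10):
    """Download one page and return its body as text (no JavaScript rendering)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def count_mentions(text, terms=WATCH_TERMS):
    """Count case-insensitive occurrences of each watch term in one page."""
    lowered = text.lower()
    return {term: len(re.findall(re.escape(term.lower()), lowered)) for term in terms}


def build_dashboard_rows(pages, terms=WATCH_TERMS):
    """Turn {source_name: page_text} into one summary row per source."""
    rows = []
    for source, text in sorted(pages.items()):
        counts = count_mentions(text, terms)
        rows.append({"source": source, "total": sum(counts.values()), **counts})
    return rows
```

In a nightly run, a scheduler such as cron could call `fetch_text` for each tracked site and pass the results to `build_dashboard_rows`; real sites may also need login handling, rate limiting and respect for their terms of use.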


KMF selectively mines various information sources on the internet, including searching for files such as spreadsheets, PDFs and DOC files. This selective mining prevents the accumulation of unwanted data on internal systems and storage, thereby greatly reducing the costs of storing, processing and weeding out that data: separating the wheat from the chaff, so to speak. KMF, which I call “little Big Data,” can greatly reduce the costs of unregulated data mining. It searches the internet forest, finds the right trees and returns to your platform only the valuable nuggets of information. Find the information, process it and return only what is required, at the right level. The ROI is the reduction in time, human effort, loading and disk space on the Big Data platform. Other use cases include fraud detection and security breaches.
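The selective part of this mining can be sketched in a few lines of Python: scan a page for links and keep only those that point to the file types you care about, so that nothing else reaches the platform. The extension list and the names `LinkCollector` and `select_documents` are hypothetical choices for this example.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Assumed set of wanted document types; adjust to the problem being solved.
WANTED_EXTENSIONS = {".xls", ".xlsx", ".csv", ".pdf", ".doc", ".docx"}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def select_documents(html, base_url, wanted=WANTED_EXTENSIONS):
    """Return absolute URLs of linked files whose extension is in `wanted`."""
    collector = LinkCollector()
    collector.feed(html)
    selected = []
    for href in collector.links:
        absolute = urljoin(base_url, href)
        # Look only at the URL path so query strings do not hide the extension.
        suffix = PurePosixPath(urlparse(absolute).path).suffix.lower()
        if suffix in wanted and absolute not in selected:
            selected.append(absolute)
    return selected
```

Filtering before download is the point of the design: only the selected URLs would be fetched and stored, which is where the savings in loading and disk space come from.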


Semantic Searching (S3) is probably the hottest topic right now in the data world. It means searching the internet and other sources using human language, for example: How are my customers thinking about XYZ concept or product? This may require searching various information stacks, including Data Warehouse-OLAP, Big Data and the Internet, simultaneously and returning the appropriate answer.
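The idea of posing one question to several information stacks at once can be sketched as a simple federated search. The Python example below is only a toy: it ranks documents by plain keyword overlap, which is a crude stand-in for real semantic matching, and the source names and function names are assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercase word set; a crude stand-in for real language understanding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_all(question, sources, top_n=3):
    """Score every document in every source against the question; return the best matches."""
    wanted = tokens(question)
    hits = []
    for source_name, documents in sources.items():
        for doc in documents:
            overlap = len(wanted & tokens(doc))
            if overlap:
                hits.append((overlap, source_name, doc))
    # Highest overlap first; ties broken by source name, then text, for stable output.
    hits.sort(key=lambda hit: (-hit[0], hit[1], hit[2]))
    return [(source, doc) for _, source, doc in hits[:top_n]]
```

A production system would replace `tokens` and the overlap score with genuine language analysis, but the shape is the same: one question goes in, every stack is consulted, and only the best answers come back.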

The ROI comes from eliminating the need for experts who know SQL, Pig, Hive and the other animals in the Big Data zoo, as well as for internet experts. Semantic searches also save time, effort and storage because only the answers to the questions posed are returned to the user. Many companies are spending millions to mechanize this type of information gathering to gain a competitive edge. Semantic searching use cases relate to competitive analysis.


Semantic searching asks the question: What are people saying about our customers and our products? This new searching capability is used for data mining, campaign management and Customer Relationship Management (CRM), as well as to attract the best customers, retain them and flag or eliminate unprofitable ones.


I hope the reader finds this article different and insightful.