Big Data in Action - the Analytics Cure-all?

Most would agree that by now, "Big Data" is a household concept in strategic enterprise technology circles. In this post I take an actual example of Big Data use and explore the implications of adopting and integrating it into a traditional business process. While Gartner offers a formal model defining the three essential dimensions of Big Data (volume, velocity, and variety), for my purpose a simpler working definition will do: massive amounts of data, often unstructured (for example, free-flowing human-readable text such as this blog post, as opposed to structured data in relational databases), generated by humans or devices, that can be analyzed with advanced tools and may lead to unprecedented business insights.

Below is a demo video from IBM that delves into a specific business application of Big Data. I especially like it because it gives just the right amount of detail (okay, I'm speaking for myself here) to understand the usage. The commentary is neither at a thirty-thousand-foot marketing level nor so technical as to lose the big picture.

The departmental unit in focus here is the Office of Technology Transfer (OTT), which is present at most research universities, corporate research labs, and government research organizations (such as the National Institutes of Health OTT). The unit hangs in delicate balance at a point where incoming resources are thin while the expected business value of its output (from licensing and selling IP assets) is enormous, potentially worth millions of dollars in licensing deals. Sound like a familiar predicament? In fact, especially in universities, where the OTT is critical for creating a steady alternate revenue stream out of IP assets, the unit is frequently understaffed. The staff, a mix of students and full-time analysts, are left to their own devices to perform the arduous task of sifting through massive amounts of heavy-duty scholarly content strewn across the Internet, hoping to make the crucial match between demand and supply that will result in a signed deal. The video illustrates what it means to apply Big Data to such a business model and how a company may use the new capabilities gained from it to its advantage.



So does this mean Big Data is a panacea for all enterprise business analytics ailments?
Here are some of my takeaways from the video:

Being a marketing demo, the video expectedly touts the benefits of Big Data but says nothing about the costs and pitfalls. Big Data requires an upfront investment of time and money by the consumer. Some of this is direct, while some may be passed on in the form of a price premium or a usage fee. As illustrated in the video, this application requires a domain vocabulary or dictionary to be present. This ranges from a simple list of English keywords and synonyms entered by OTT staff, all the way to full-fledged named entity recognition (NER); in the video example, company names either come preloaded or are "discovered" by semantic analysis engines such as IBM LanguageWare and IBM Content Analytics. Sometimes these terminologies and taxonomies are available for a fee from third parties (for example, the medical vocabulary used in a clinical diagnostic and decision support system in a hospital). Often, however, one has to be built from scratch or enhanced for the problem domain and context.
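To make the vocabulary idea a bit more concrete, here is a minimal sketch of how a simple dictionary of canonical terms and synonyms might be used to spot entities in unstructured text before any heavier NER machinery gets involved. This is not IBM's actual implementation; the vocabulary entries and the sample disclosure text are invented for illustration.

```python
import re

# Hypothetical OTT vocabulary: canonical terms mapped to known synonyms.
# (Terms and the sample disclosure below are invented for illustration.)
VOCABULARY = {
    "gene therapy": ["gene therapy", "gene-based therapy"],
    "monoclonal antibody": ["monoclonal antibody", "mAb"],
    "licensing agreement": ["licensing agreement", "license agreement"],
}

def spot_entities(text, vocab):
    """Return a list of (canonical term, matched phrase) pairs found in text."""
    hits = []
    for canonical, variants in vocab.items():
        for variant in variants:
            pattern = r"\b" + re.escape(variant) + r"\b"   # whole-word match
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((canonical, match.group(0)))
    return hits

disclosure = ("The invention covers a mAb platform with potential "
              "applications in gene therapy; a license agreement is sought.")
print(spot_entities(disclosure, VOCABULARY))
# [('gene therapy', 'gene therapy'), ('monoclonal antibody', 'mAb'),
#  ('licensing agreement', 'license agreement')]
```

Even this toy version hints at the real cost: someone has to enter and maintain those terms and synonyms, and the quality of everything downstream depends on it.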

On top of its base natural language processing (NLP) capabilities, a machine learning system "learns from its mistakes" and gets better over time as the rules that analyze content and correlate it with context are refined. Sometimes a corpus of such rules already exists for the problem domain and produces reasonably accurate results from the get-go. But in new application areas, the time and effort to build these rules must be invested afresh until the system becomes acceptably accurate. A case in point is Watson, IBM's revolutionary Jeopardy!-playing machine. While the machine has defeated human champions at the game, its language processing skills tuned for Jeopardy! (dealing with sarcasm, puns, and rhymes) are not by themselves sufficient to understand the domain-specific scholarly research text involved in an application like the one in this video.
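As a toy illustration of what "learning from its mistakes" could mean in practice (again, not how IBM's tooling works; the rule names, patterns, and learning rate are made up), imagine each extraction rule carrying a confidence score that is nudged up or down as human reviewers accept or reject its output:

```python
# Hypothetical extraction rules, each with a confidence score that analyst
# feedback will adjust over time.
RULES = {
    "inventor_name": {"pattern": r"inventor[:\s]+([A-Z][a-z]+ [A-Z][a-z]+)",
                      "confidence": 0.5},
    "filing_date": {"pattern": r"filed on (\d{4}-\d{2}-\d{2})",
                    "confidence": 0.5},
}

def apply_feedback(rules, rule_id, accepted, learning_rate=0.1):
    """Move a rule's confidence toward 1.0 on acceptance, toward 0.0 on rejection."""
    rule = rules[rule_id]
    target = 1.0 if accepted else 0.0
    rule["confidence"] += learning_rate * (target - rule["confidence"])
    return rule["confidence"]

# An analyst confirms two extractions from one rule and rejects one from another.
apply_feedback(RULES, "inventor_name", accepted=True)
apply_feedback(RULES, "inventor_name", accepted=True)
apply_feedback(RULES, "filing_date", accepted=False)

for rule_id, rule in RULES.items():
    print(rule_id, round(rule["confidence"], 3))
# inventor_name 0.595
# filing_date 0.45
```

The point is simply that the feedback loop is itself a cost: until enough corrections have accumulated for the new domain, accuracy will lag.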

On the positive side, this application puts to rest (at least partially) criticism of the basic Big Data approach of working with entire datasets (as opposed to statistical sampling). That argument rests on the premise that considering the entire set may produce statistically biased results, that good data does not guarantee good decisions (see the HBR article), and that eight- or nine-figure sums are wasted on compute resources when acceptable results could be obtained at a fraction of the cost from smaller, "good enough" data sets. The case in the video is a classic situation where every piece of data (such as each patent disclosure or license agreement) counts, and a sample set would not suffice. It reinforces the point that, as opposed to applications in scientific research and marketing, a comprehensive initial data set is a must in areas such as legal and financial fraud detection, diagnosing rare diseases, and the like, where every piece of information counts. In other words, if your initial data does not contain "the answer," you will certainly not find it in the end.

The other issue is the role of human judgement in decision making. Big Data analytics gives a uniquely powerful starting point in a growing sea of information inside and outside a firm, and also arms researchers with advanced analytics capability they do not always have, boosting productivity in the process. That said, it is not a trivial task to determine what questions to ask your data and what subsequent questions must follow to reach actionable information. In my mind, that highly valuable role of human judgement and decision making is not going away anytime soon.
