Big data is not the master

Extracting tangible business value and insights from quality, integrated data matters more than the volume, velocity, or variety of Big Data. Business users who consume Big Data don’t know or care about its bigness. They want the right data for their particular business problem, and they want to trust that data and the analysis derived from it.

Data and models live in the context of a larger decision support system. The quality of the data and the quality of the model are two independent limiting factors of such a system, and one should always begin the decision support process with questions like:
• What problems must the business address?
• What questions does the business need answered?
• What insights does the business need for innovation?
• What business decisions and actions need quantitative support?

First, develop models to support the business needs above, and only then decide to pursue the required data. Big data is the servant, not the master. The real value in decision making is strongly influenced by the organizational decision support process, which can start in one of two ways.

Start with an assumption instead of a hypothesis (similar to being on a mission to “prove” something). The team perceives an outcome and tries to align data and analysis, doing everything necessary to achieve an outcome that matches the assumption. Starting with assumptions makes people cherry-pick data, models, algorithms, and visualizations, and even share information selectively, all with the objective to “prove”.

Start with a hypothesis to be proved or disproved. Testing happens. Verification happens. Multiple models are generally used instead of one. Here more attention is spent on the “whys” in addition to the “whats.” Even teams open to critical thinking risk slipping into bias here, but the genuine focus stays on the accuracy of the result.

Big Data Cycle

Big Data brings focus to three “V’s” (volume, velocity, and variety), but more of the value in Big Data comes from variety. With little or no domain expertise, techies focus system design on handling larger data volumes. Business folks therefore need to emphasize data variety, and need to be educated to think less about big volume and more about big variety.

  • Getting value out of variety is a data integration task. Ensure that no Big Silos get created, and first integrate the existing data silos.
  • Integrate far-flung disparate systems to generate insights with a holistic view of customer and product attributes, along with sales data by channel, region, and brand.

Even when variety takes precedence over volume, data still accumulates exponentially, so we need to address data error correction on an exception basis. Manual data correction cannot scale to keep up with increased data volumes. We must automate data quality processes to catch and fix errors up front, with tools at least as robust as our data collection and storage resources. For unstructured data, too, we need to apply the same data quality standards that are applied to more traditional transaction data. To achieve data quality, it is important to understand how business users interact with the data.
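As a minimal sketch of what such an automated quality gate might look like (the field names and validation rules below are hypothetical, not from any particular system):

```python
# Hypothetical automated data-quality gate applied at ingestion time,
# before records reach downstream analytics.
def validate_record(record):
    """Return a list of quality errors for one incoming record."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    amount = record.get("amount")
    if amount is None or amount < 0:
        errors.append("amount must be a non-negative number")
    if record.get("region") not in {"north", "south", "east", "west"}:
        errors.append("unknown region")
    return errors

def partition_by_quality(records):
    """Split records into clean rows and an exception queue for review."""
    clean, exceptions = [], []
    for record in records:
        errors = validate_record(record)
        (exceptions if errors else clean).append((record, errors))
    return clean, exceptions
```

The point is that the exception queue, not a human scanning raw data, becomes the unit of manual work; everything that passes the rules flows through untouched.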

Traditionally, applications were designed, developed, and consumed by users until it became apparent that they needed to be modified or replaced. Computing’s role was to tell us what happened in the past and let humans speculate about the future. Today, predictive analytics lets machines speculate about the future and lets humans decide and act.

Predictive analytics needs continuous process evolution: it is not a requirement followed by a solution, and it is not plugging in some technology; it is a process and a journey, not a project. When predictive analytics helps build a better understanding of customers, the focus needs to be on “getting inside their souls” and on continually tuning adaptation to changes in the business and customer activity.

  • Which should be prioritized: the data present in the report, or the person consuming the report?
  • Does applying analytics change things for you in your role as a professional and in your role as a citizen? Are you comfortable with the change in both roles?
  • What did you do with data 5 years ago compared to what you are doing with it today?
  • As effective data usage is a moving target, has “more data” become a more realistic desire than “all data”?
  • Does velocity matter more than having data in a format that is actionable and that maps to business objectives?

Privacy Vs Security In A Big Data World

[Copied contents of the article “Privacy Vs Security In A Big Data World” for my reference]

What I do know, however — and I thank him for this — is that Snowden helped bring the discussion of big data privacy and security to the public square — and not just the American public square, but the global one as well. This is a good thing, because in this era of big data, not to mention the Internet of Things, we can no longer relegate this discussion to the privacy freaks and security geeks in the back room. It’s a discussion in which we all should participate.

To understand it better, let’s take a brief look at some of the privacy and security issues in the context of the (big) data lifecycle.

In data security circles, the six stages of the data lifecycle are well known: create, store, use, share, archive, and destroy. While these six stages have a strong foundation in security, an interesting thing to note is the fact that the two privacy-related stages — use and share — are situated squarely in the middle. Is it just a coincidence that privacy is at the heart of the matter?

If data is not collected and/or created, there is no need to secure it. This may seem obvious, but it’s astonishing how many websites and apps forget or disregard this point. They collect it all “just in case,” with little consideration of how the data may be handled downstream.
Why this matters: Data security begins at the point of creation or collection. Organizations need to be deliberate in the data they request or receive, and individuals should be mindful of the data they’re sharing — whether it’s sensitive data on a financial site or a viral video on YouTube. If this data is not secured, it could end up in the wrong hands.

With the volume of big data being generated these days, it’s not just a question of what data to store, but also how to store it all without blowing the budget. Open-source big data technologies are helping to greatly reduce the cost of data storage, both on-premises and in the cloud.
Why this matters: If an organization creates or collects data, it becomes their responsibility, not the individuals’, to secure and protect it from corruption, destruction, interception, loss, or unauthorized access. Some organizations take this responsibility more seriously than others.

When an individual sets up a new account with an organization through its website/app, the individual is asked to read and agree to the terms of service and/or privacy policy. This legal contract typically defines how the individual’s data will be managed and used inside and outside the organization. Granted, few people read this legalese, but our expectation is that the organization will use our data “responsibly,” and when this usage changes, we expect to be notified.
Why this matters: It’s the usage — not the collection or storage — of data that concerns most people. It’s this stage where individuals want to be in control. For example, they want to set the dial on how public or private their data should be, who can access their data, and whether their data (aggregated or not) can be sold or rented to third parties. In this big data era, when organizations don’t provide this level of privacy control, they risk losing the loyalty and trust of their customers and users.

Organizations continue to share data between internal systems and external partners, but with the advent of social networks and “smart” devices, sharing data has become a public pastime — even to the point of “selfie” narcissism.
Why this matters: On one hand, individuals want control over how their personal data is being used. Yet some of these same individuals show little to no restraint in what personal data they’re sharing. Even though it’s the responsibility of the organization behind the website or app to secure users’ data and respect privacy settings (if they exist), it’s up to the individual to determine what and how much information they’re willing to share. If you put it on the Internet, it’s not a question of if, but when, your information may be used in unintended ways.

Between big data technologies and the cloud, it’s become relatively cost-effective for organizations to store data for longer periods of time, if not indefinitely. In some cases, regulations stipulate how long certain data will live — like in the US financial and health industries — but, in most cases, the budget and space constraints are being alleviated.
Why this matters: Being able to store more data for longer periods of time at a fraction of the cost is an appealing proposition for organizations. The more exciting proposition, however, is the ability to analyze even more data over greater periods of time to discover new questions, patterns, trends, and anomalies. The gotcha here is: The more data an organization stores and archives, the more data it has to secure.

If and when data is tagged for destruction, the question is to what extent. For example, if a website user requests that his account be deleted, what does this mean? Is it just the access to his account/data removed (so that he can request access later if he changes his mind) or does a deletion request trigger the destruction of all his data, including archived data? The answer most likely lies somewhere in between for most organizations.
Why this matters: Regulations and governance policies will dictate the extent to which data may be destroyed for many organizations. The data that does not get destroyed must then be secured. So using the example above, if a website user requests that his account be deleted, and he receives an email notification to that effect, what he doesn’t know is what personal data, if any, still exists in the organization’s systems. He may still be vulnerable to a potential data breach, long after he’s been deleted.
It cuts both ways
While a citizen’s right to privacy and freedom from government surveillance has been top of mind for Edward Snowden, national security has been top of mind for the US government.

And therein lies the rub: security cuts both ways. On one hand, it’s the responsibility of an organization to secure and protect any digital information it collects, stores, and transmits. But on the other hand, our governments are knocking on organizations’ doors demanding access to this protected information — all in the name of preserving national security.

How do you test an application implementing an ML algorithm?

To test a software program, one arrives at a set of test steps, the test data to be provided at each testable step, and the expected output from the program given that data and step. If the actual output matches the expected output, we declare that the program is functioning correctly. The program is tested for correctness across boundary and exception scenarios of both the program and its data inputs.

Having specified algorithms, coded algorithms, unit tested them, and tested them as part of applications in my earlier days, I want to understand how people test machine learning programs. This is my current understanding, which I want to improve.

When it comes to software testing of a machine learning program, directly applying the conventional software engineering process may not work. It is a challenge to detect errors, faults, and defects in a machine learning program that takes arbitrary input, and to determine whether the program’s output is correct or reliable for those inputs. Are ML programs non-testable?

Should testing of a machine learning program focus less on whether the ML algorithm learns well, and more on whether the application using the algorithm implements the specification and fulfills the user’s expectations?

First, start by understanding the problem domain and the suitability of the algorithm in the problem context, based on the potential range of data inputs arriving in real time, in terms of real-world data sets. Thinking about data sets can start with the following characteristics: small vs. large; repeating vs. non-repeating values; missing vs. non-missing attribute values; repeating vs. non-repeating attribute labels; predictable vs. non-predictable attribute values; attributes that take non-negative values only vs. attributes that can also take negative values; and the precision required for floating point numbers.
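As one possible sketch, a small generator can produce data sets spanning several of these characteristics; the parameter names here are illustrative, not from any standard tool:

```python
import random

# Hypothetical generator producing data sets with the characteristics
# listed above: size, missing values, repeated values, negative values.
def make_dataset(n_rows, missing_rate=0.0, allow_negative=False,
                 repeat_pool=None, seed=0):
    """Build a list of numeric attribute values for testing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        if rng.random() < missing_rate:
            rows.append(None)                     # missing attribute value
        elif repeat_pool:
            rows.append(rng.choice(repeat_pool))  # repeating values
        else:
            low = -100 if allow_negative else 0
            rows.append(rng.uniform(low, 100))    # non-negative by default
    return rows

small = make_dataset(10)
sparse = make_dataset(1000, missing_rate=0.3)
signed = make_dataset(1000, allow_negative=True)
```

Each flag maps to one characteristic from the list above, so a test suite can enumerate combinations of them rather than hand-crafting every data set.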

Second, test the working of the algorithm, and third, test the algorithm by providing data inputs.

  • Are you implementing an algorithm? Design a series of primitive tests for the various sub-parts of the algorithm, and an end-to-end test of the final output or algorithm behavior.
  • Are you making use of an existing algorithm? Understand the algorithm, the validation required for user inputs to get the best possible results, and how to determine, in the problem context, whether the algorithm’s results are sensible.
  • Check the reported upper bounds on time and space used by the algorithm, and get a measure of efficiency in terms of the size or complexity of its input (Big O notation).
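One lightweight way to sanity-check those bounds empirically is to time the algorithm at doubling input sizes and inspect how the cost grows; `sorted` here is only a stand-in for the algorithm under test:

```python
import time

# Rough empirical check of growth rate: time the algorithm on inputs of
# doubling size. If the reported bound is O(n log n), the measured cost
# should grow only slightly faster than linearly as n doubles.
def timing_profile(fn, sizes):
    """Return (size, seconds) pairs for inputs of increasing size."""
    profile = []
    for n in sizes:
        data = list(range(n, 0, -1))   # reversed input as a hard case
        start = time.perf_counter()
        fn(data)
        profile.append((n, time.perf_counter() - start))
    return profile

profile = timing_profile(sorted, [1000, 2000, 4000])
```

Wall-clock timings are noisy, so this is a smoke test for gross mismatches against the claimed bound, not a substitute for complexity analysis.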

Think in terms of unit tests and regression tests for machine learning programs.

  • Add unit tests to your code and test your expected results approximately.
  • Create multiple data sets with different difficulty levels, such as easy, difficult, and adversarial. Whenever the code changes to add a feature or fix a bug, run it against all of these data sets to ensure that the outputs stay within a reasonable error range and that existing functionality is not broken.
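The two bullets above might be sketched as a small regression harness; `train_and_score`, the stub scorer, and the data-set names below are hypothetical stand-ins for a real model and real data:

```python
# Sketch of a regression harness for an ML program: run the model
# against data sets of increasing difficulty and flag any whose error
# exceeds its agreed ceiling.
def run_regression_suite(train_and_score, datasets):
    """Return {name: error} for every data set that breaks its ceiling."""
    failures = {}
    for name, (data, ceiling) in datasets.items():
        error = train_and_score(data)
        if error > ceiling:
            failures[name] = error
    return failures

datasets = {
    "easy":        ([1, 2, 3], 0.10),   # tightest tolerated error band
    "difficult":   ([9, 8, 7], 0.25),
    "adversarial": ([0, 0, 0], 0.50),   # widest tolerated error band
}
stub_scorer = lambda data: 0.05 * len(data)   # pretend error metric
failures = run_regression_suite(stub_scorer, datasets)
```

Running this suite on every code change turns “expected outputs lie in a reasonable error range” into a mechanical check rather than a judgment call made after each edit.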

Working with domain specialists, arrive at criteria that determine the meaning of correctness.

Discuss, decide, and determine the margin of error or correctness percentage before testing the machine learning program. For example, if the program interprets 75% of the test data correctly, the program is considered good enough. Remember that it may not be possible to demand 100% correctness in test validation, as the intent of machine learning is to tolerate ambiguity.
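Such an agreed-upon gate reduces to a one-line check once the threshold is fixed; this is only a sketch of the idea, with the 75% bar from the example above:

```python
# Minimal sketch of an agreed-upon correctness gate: the program passes
# when at least 75% of labelled test records are interpreted correctly.
def passes_correctness_gate(predictions, labels, threshold=0.75):
    """Return True when the fraction of correct predictions meets the bar."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels) >= threshold
```

The important part is not the arithmetic but that `threshold` is negotiated with domain specialists up front, so a test run has an unambiguous pass/fail outcome.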

Testing would benefit from software engineers providing a data set generator, along with tools that help compare the output results and their correctness against the data inputs. You need methods to capture and view trace output inserted into the ML program, and tools to analyse the traces in order to debug, test, and validate intermediate results in specific steps of the algorithm.

On Being Broadly Correct Rather Than Precisely Wrong

[Copied paragraphs from an Equitymaster article for my reference.]

The profession of prediction is tricky. The consumers of predictions most often look for instant gratification. Unfortunately, it can become more important for the professional to look right rather than be right.

The talking heads on television predict everything about the next day that helps them look right. Mirroring the sentiment of the world helps them appear right. Rarely does anyone bother to check the accuracy of the prediction the next day.

Predictors go to great lengths to create an impression that appears right. Every impression has the potential to mislead investors.  The truth is that every decision based on predictions can be a double-edged sword. The skill of human beings lies in being broadly correct rather than being precisely wrong.

  • If you invest, you will lose money if the market declines. If you do not invest, you will miss out on gains if the market rises.
  • Market timing can add value if it can be done precisely right. Buy and hold can produce better results if the market timing can’t be done right.
  • Aggressiveness will help when the market rises but hurt when it falls. Defensiveness will help when the market falls but hurt when it rises.
  • If you use leverage, your success will be magnified. If you use leverage, your mistakes will be magnified.
  • If you concentrate your portfolio, your mistakes will kill you. If you diversify, your payoff from your successes will be limited.

It is not just difficult but impossible to always make favorable portfolio decisions based on the accuracy of predictions.

Did you verify that the news is a fact and not a rumor?

Social networks helped collect money and relief materials for charity. The smartphone was leveraged by volunteers in Bangalore to collect and identify the right set of relief materials needed in Chennai. People interested in donating reached volunteers using smartphones. With the advent of the internet and WhatsApp on mobile, everyone is connected to the free flow of information. This gives the feeling that the smartphone is smart.

Today information is available in a few clicks, and I observe that you do not need to look for information; the information finds you.

During the Chennai rains, people were sharing the need for help via social networks. Few people confirmed the source before sharing on the network. False news like “Porur lake is opened and going to flood” creates panic. One AID volunteer on the ground shared: “A genuine need was shared and was also verified. No help reached the person in need. A life was lost in the end for want of timely help.” While this message was shared by a large number of people, perhaps everyone assumed that someone else would reach the person in need to offer support.

  • How do we identify whether a particular piece of news is a rumor or a fact? What happens when people act on a rumor, which can be disastrous?
  • When we see a message asking for help, do we check whether someone has already offered help? Maybe we could use the phone?
  • When we share a message that asks for help, do we take responsibility to verify after some time whether real help has reached the needy?

Human nature carries a desire to feel important, and we end up doing things that satisfy that desire. The way we gain importance has changed over the years, but the urge persists.

Is it that one feels important when one shares the news first on a social network? I am perfectly fine with that when the news is confirmed to be authentic and factual. We commit a great sin when we share without confirming the source or the reality.

When everyone says the same thing, a blind faith begins: a belief that it must be true, along with the claim that otherwise so many people would not say the same thing. We fail to verify the information ourselves, and the onus of verifying it gets shifted to others.

Remember that the information you receive on your smartphone need not always be ‘smart’ after all. Please avoid sensational media and news, and check the validity of suspicious information. Try to trace the original source of a message before sharing it, whether news or rumor.


Does “not picking up the mobile” mean “buyer not at home”?

A few months back I wrote the blog “Pain at work for e-commerce delivery boy”. I was really concerned about the plight of the delivery boys. While that post spoke of the delivery boys’ needs, the insufficient or absent training offered to delivery boys has become a pain for the e-retail buyer.

I am not a regular buyer on e-commerce websites beyond books. Books can be delivered to security or a neighbor, and we faced no issues with delivery boys. We purchased a mobile on Amazon. At the time of order, Amazon committed to deliver on the 4th or 5th day from the order date. A day after ordering, an email arrived saying that delivery would be done on the second day after the order. It looked impressive, and I was open for delivery.

[Day 3] I got a call on my mobile at 11:30 am asking whether the delivery could be done at home. I informed him of my absence and asked him to deliver after 4 pm, when my wife would be at home. When asked whether he was in front of my house door, his response was negative. I called this person at 6:30 pm, and he shared that he was somewhere on Old Airport Road and that he would deliver.

Multiple follow-ups, with the person agreeing to deliver and then failing to, made for an irritating experience. The delivery did not happen, and the status changed to “could not be delivered as no one was available to receive it”. We know for sure that the guy did not come to our door and had not even visited the apartment security gate.

[Day 4] The next day, I forgot my mobile at home. When I returned home, I saw missed calls on the phone, called back, and the person asked me to call the carrier. When I checked my order email, there was no contact number to reach the e-commerce retailer. Then I figured out an option where I could provide a mobile number and have the call center reach me. The service representative offered to get the package delivered at the earliest.

When I talked to the customer representative, I heard interesting words: “We have to request the delivery agency”. On talking to the supervisor, you hear the underlying message that they have no control and are not sure when things will get delivered. Luckily, the package was delivered today [Day 5].

Maybe e-commerce firms can collect buyer preferences before specifying the delivery date for an order.

  • Provide the buyer with time slots in the day and ask the buyer to choose a slot for delivery. Instead of asking the customer to pick an exact time, provide ranges like 8 to 10 am, 10 am to 6 pm, and 6 to 8 pm.
  • For items of low value, ask the buyer whether the item can be left with security or a neighbor. The option need not be offered for high-value items.

Maybe e-commerce firms can capture more data to verify the claims of the delivery person and the buyer in the middle of a conflict. When the delivery person visits the house and finds it locked, ask him to take a picture of the locked house and upload it to the site. Display this to the buyer, and you have verified that the delivery boy really tried.

My wife started worrying whether the order would be delivered at all. I am not sure whether her claim that “after 2 ‘door is locked’ messages, you need to cancel the existing order and re-order” is true. It looks like the horror of delivery is known to all customers. It makes me wonder whether the delivery boy has become the most important person in an e-commerce transaction.

Have you heard of the Seattle speech?

You may be surprised by the title and follow with the question, “Is Seattle not a place? How can a place give an address?” Go to Seattle on Wikipedia and you will find a page starting with 3 sentences about Native Americans before moving to Europeans visiting Seattle. If you had the above experience, you are not alone; I had the same questions and was in a similar situation. It is an irony that I visited Seattle years back and was not aware of the history.

Please look for the Wikipedia article “Chief Seattle’s Speech“. This is a good example of what I do not know. It seems the city gets its name from Chief Seattle. I used to like reading history in school. The history of the USA starts with Columbus landing in America and gradually moves to the Boston Tea Party, American independence, Abraham Lincoln, the Great Depression, Pearl Harbor, and the atom bomb of the Second World War. I checked the current CBSE site, and the experience is similar. Now I understand why historians in India crib about political influence on history. Do children in Seattle know about this address?

History is one of the largest big data sets we have, and the data is not clean and is subject to the preferences of the historian. In the future, will only data viewed by many people (with comments and social sharing) be treated as true? Will data viewed by few people disappear from search results and never be read? What would happen to data that is not documented in digital format? Will its creator have challenges proving that it is indeed real data? Let us leave history and big data and focus on the Seattle address.

Here is a link that has the full text and a short audio of the re-created speech. The address is a good intellectual read and seems relevant to today’s scenario. The speech was introduced to me in a Tamil book, after which I did some research. Out of respect for Chief Seattle, I typed out the Tamil article and found it cultivating the same empathy and compassion for others.

“The white men cannot understand our way of life. The earth does not belong to man; man belongs to the earth. Whatever befalls the earth befalls the children of the earth too. How can one buy or sell the sky or the earth? The very idea is strange to us.

How will man live without the animals? If all the animals were to vanish from this earth, man would die of a great loneliness. Whatever happens to the animals will happen to man tomorrow. All things are connected.

The murmur of the water is the voice of my father’s father. The ground beneath our feet is the ashes of our ancestors. You must teach this to your children. Teach your children what we believe: that this earth is our mother.

The rivers are our brothers. They quench our thirst. They carry our canoes. They feed our children. If we sell you our land, you must remember that the rivers are brothers to us and to you. You must tell this to your children, and you must also show the rivers the kindness you would show any brother.

There is no quiet place in the cities built by the white man. There is no place to hear the leaves of spring fall, or the sound of insects rubbing their wings. The confused clamor of the cities seems only to insult the ear.

The air is very precious to the red man, for all living things share the same breath: the animal, the tree, and the man all breathe the same breath. The white man does not seem to notice the air he breathes.

The questions asked years ago by the Red Indian man named Seattle still seem fitting today. The land that was once his home now stands transformed into a great city.”