Wednesday 7 May 2014

THE EVOLUTION OF ANALYTICS - PART 2


An analysis of the Business Intelligence industry: past, present and predictions for the future.


Limitations of current generation technologies and options to solve them

Even with all the software advances that have improved modern analytics tools, there are still limits on the level of interaction available to the broader user community.
Barriers remain that, in varying ways, prevent greater take-up of self-service analytics and BI solutions. We look at a few of them in detail here and offer potential solutions, based on existing products and current-day practices, that may not have been self-evident:

  • Problem: Insights have been mainly the domain of specially trained staff

There is still a notion that data analytics is the domain of only a handful of specially trained staff, who either come from specific educational backgrounds or have received specialised training to operate the tools needed to gain insights from data. This notion can be traced back to the fact that many of the current technologies require extensive technical expertise to operate, and even the simpler ones still require some coding knowledge.

The belief is that not enough of these types of people exist, so demand for them is high but supply is short.

A report by E-Skills (UK) and SAS sees a growing need for big data specialists over the coming years, predicting that demand will grow by 243% over the next five years, to 69,000 in the UK alone.[xii]


The graphic above comes from Gartner’s 2012 report on how to deliver Self-Service BI[xiii]. It draws a divide between information consumers and power users. This line in the sand reinforces the idea that users differ fundamentally in their involvement with data and data tools, and in the skill level needed to interact with data. It also feeds problems such as the mismatch between supply of, and demand for, users with the knowledge to do research on the data a company holds.

“The first mistake we made was in the organisational model. Centralised, IT-dominated BI teams are not conducive to empowering end users.”[xiv]

A team that blends IT and business skills is in a much better position to service this need than a strictly IT focused one.

Solution:

Instead of waiting for users to mature via traditional methods (i.e. creating more and more specialised users), the new suite of analytics applications removes the need to divide users into distinct groups of power users and information consumers. With the right tools, the majority of staff members become empowered enough to call themselves power users too.

However, it is important to have the right tools in place before this can happen. If the tools are still complex then users will still need to be trained to find the answers they seek. On the other hand, if the tools are intuitive, easy to use and require little training then users are more likely to become involved and start getting real benefits from data analytics.

Whilst providing technical training to users has some benefit, it takes time, and with software technologies constantly evolving, going down this path means ongoing training and development will likely be required. This ultimately proves costly to organisations in terms of time, money and effort, all of which could be better spent elsewhere.

It is both more efficient and more cost-effective to give users access to tools that require little to no training because they are intuitive and simple to use, allowing users to focus on more value-adding tasks in the day-to-day running of the business.

Whenever a user starts working with a new software tool, there is always a divide between where they start and being able to use the software effectively, as they typically need training, reading and course materials, and practice time before they can even begin finding the answers they want or need:


However, self-service models that are intuitive and easy to use can shrink that divide and shorten the time it takes to reach real benefits:



Closing the complexity divide entirely is an ongoing piece of work, and involves further techniques such as smarter predictive analytics and augmented intelligence (topics I will discuss in other posts).

Search-based BI tools with “Google-like” interfaces allow users to get started right away, exploring data with little training. Analysts no longer need to spend substantial amounts of time preparing reports; instead, they can create reports with a few clicks and add value by spending more time drawing insights from the data.
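To make this concrete, here is a minimal, purely illustrative sketch of the idea behind search-based exploration: a free-text phrase is matched against known dataset fields and turned into a simple query specification. The field names and the query format are my own assumptions for illustration, not the workings of any particular product.

# Hypothetical sketch of search-based BI: a "Google-like" phrase is matched
# against known dataset fields and turned into a simple query specification.
# Field names and the query format are illustrative only.

DATASET_FIELDS = {"sales", "region", "year", "product", "customers"}

def parse_search(phrase: str) -> dict:
    """Map a free-text search phrase onto a simple query specification."""
    tokens = [t.strip(",.").lower() for t in phrase.split()]
    fields = [t for t in tokens if t in DATASET_FIELDS]
    filters = {}
    for t in tokens:
        if t.isdigit():
            filters["year"] = int(t)  # treat a bare number as a year filter
    return {"select": fields or ["*"], "filters": filters}

# A business user types a phrase instead of writing SQL or building a report:
print(parse_search("sales by region for 2013"))
# {'select': ['sales', 'region'], 'filters': {'year': 2013}}

The point is not the parsing itself but the interaction model: the user expresses intent in plain language and the tool does the structured work.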

This type of environment also means changes for typical IT service staff. When the right tools are in place, they are no longer required to be heavily involved in the report building process. The important thing to remember is that while business users may start to do what was traditionally the role of IT resources, there will still be a need for IT resources. Their role will simply evolve from being report providers or creators to being focused on data management: custodianship, security and privacy, and efficiency in getting the data to the right people.


  • Problem: Only aggregated datasets available to answer organisational questions


Another issue is that although the current tools may have started to move towards a self-service model, they are only doing so over limited datasets. A lot of solutions in this space have to aggregate the information available, either for security reasons or due to the amount of time it takes to prepare large datasets for self-service dissemination.

These tools will serve up analytics to the limit of what can be achieved within hardware and software capacity. Tools like this do not necessarily connect to all the data to begin with and may require a lot of configuration in the build process rather than just being able to plug and play. Hardware, software, implementation or timing constraints mean that even with all the right accessibility and authorisations in place, a user might still be limited to looking at only a portion of the available data.

This is especially true where a program claims to be self-service for its end users but is really looking at a small sample of the data or a pre-aggregated report. The user can explore the information in the view, but if the view is limited it cannot really be considered a full self-service option. Others have already made a decision about what data is made available and what is not. And the end user might not even be aware that information is missing.

Of course, an organisation needs to control what it can show, but doing this in collaboration with users allows the users to set the agenda. This leads to a better user experience and less time spent iteratively creating and updating reports. By giving users access to most of the available data rather than a small portion, the organisation also future-proofs itself against having to create new reports whenever additional information is needed.

Solution:

To solve this problem, we need a software tool that automatically gives the full set of data to users so they can decide what is important. The tools must be capable of looking at entire datasets and have the ability to give this power to all users not just a select few. Any restrictions on who can see what data should only be imposed due to business rules, not hardware or software limitations.
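As a purely illustrative sketch of that principle, access restrictions can be expressed as business rules applied at query time, so every user queries the full dataset and only the rules decide what they see. The roles, fields and rule structure below are assumptions invented for the example, not any particular product's model.

# Hypothetical sketch: restrictions come from business rules applied at query
# time, not from pre-aggregating or trimming the data beforehand.
# Roles, fields and rule structures are invented for illustration.

FULL_DATASET = [
    {"region": "North", "sales": 120, "customer_name": "Acme"},
    {"region": "South", "sales": 90,  "customer_name": "Bestco"},
]

BUSINESS_RULES = {
    "manager":  {"allowed_regions": {"North", "South"}, "hidden_fields": set()},
    "analyst":  {"allowed_regions": {"North", "South"}, "hidden_fields": {"customer_name"}},
    "regional": {"allowed_regions": {"North"},          "hidden_fields": {"customer_name"}},
}

def query(role: str):
    """Return only the rows and columns this role is entitled to see."""
    rule = BUSINESS_RULES[role]
    rows = [r for r in FULL_DATASET if r["region"] in rule["allowed_regions"]]
    return [{k: v for k, v in r.items() if k not in rule["hidden_fields"]} for r in rows]

print(query("regional"))  # [{'region': 'North', 'sales': 120}]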

In this system, end users create the reports they want to see. The data providers can build a few pre-packaged reports as a guide if they want, but they no longer need to handle all the report building, freeing them up for work that provides other value added benefits to the data.

The tool also needs a feedback loop for end users, to understand their data needs and ensure those needs are met in building any future self-service capabilities. 


  • Problem: Privacy Concerns


There are serious privacy challenges faced by organisations that collect and disseminate personal and business information.

While statistical information can lead to insights into trends, growth and demographics, organisations dealing with this information must be careful not to disclose private information.

In the past official statistics providers have given external researchers and analysts limited and tightly controlled access to the microdata from their censuses and surveys because of their duty to protect the privacy of their survey respondents.

Typically this controlled access takes the form of in-house or remotely accessed data laboratories or research centres, or the provision of pre-confidentialised sample files. All of these scenarios typically involve a statistics provider’s staff having to do some form of manual review and vetting of the information generated in response to a data query before it is delivered back to the researcher.

Demands to release greater volumes of data, at increasing levels of detail, are becoming the norm, especially in light of the open data policies of federal and state governments.

Pressures that were once felt mainly by National Statistical Organisations (NSOs) are now being felt by many more private and public organisations.

Ensuring the confidentiality of the data an organisation gathers is necessary to maintain trust, and to ensure that individuals and organisations are not reluctant to provide information.

Solution: 

There are a variety of disclosure control methods that play an important role in helping companies achieve a certain level of confidentiality. For example:

- Aggregation – creating summary tables (“cubes”)
- Confidentialisation of microdata – sampling or perturbing the values of data records so that an anonymous set can be safely released
- Confidentialisation of tabular data – concealing or adjusting values in aggregate data before being released
- Business rules – controlling the level of detail in queries using pre-defined rules
- Trust and access control – providing more detailed information to trusted groups
- Monitoring – recording and reviewing the types of queries executed by users

When selecting the appropriate disclosure control methodology, organisations need to strike the right balance between making information available and meeting their privacy obligations. The ideal solution will be one that conceals just enough data to meet those obligations. Perturbation is typically the best method for achieving this.
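As a rough illustration of the idea (not of any specific product's method), one common family of perturbation techniques is random rounding of aggregate counts, so that small cells cannot be traced back to individual respondents. The rounding base of 3 and the data below are made up for the example.

import random

def random_round(count: int, base: int = 3) -> int:
    """Round a cell count up or down to a multiple of `base` at random.

    Rounding up happens with probability remainder/base, so the expected
    value of the published cell equals the true count.
    """
    remainder = count % base
    if remainder == 0:
        return count
    if random.random() < remainder / base:
        return count + (base - remainder)
    return count - remainder

true_table = {"North": 11, "South": 4, "East": 1}
published = {region: random_round(n) for region, n in true_table.items()}
print(published)  # e.g. {'North': 12, 'South': 3, 'East': 0}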

More on perturbation can be found here: http://www.spacetimeresearch.com/s=perturbation&Submit=Search

This is a topic I will discuss in further detail in later blogs.

  • Problem: Information overload and over-reliance on machine-based rules

Information overload problem

Current generation technology can now capture data faster than ever before, and there is a danger that users will become overwhelmed by all this information. If users’ ability to digest the abundance of reports and data cannot keep up with the amount of information collected, there is as much chance of burying the useful information as of uncovering it.[xv] However, this is not necessarily a problem of too much information, as long as the right tools are in place to help users manage it.

Machine-based rules problem
Additionally, in the current climate, only a limited number of users have the capability and know-how to traverse these huge databases. This leads to another part of the information overload problem: an over-reliance on machine-driven analysis.

For example, the National Security Agency in the US has a separation step for its Big Data repository that strips out “noise”. But it is possible that what the software perceives as noise is in fact a signal, one that could have been spotted had there been human intervention in the process.

Solution:
Software becomes part of the solution here, but it is vital that the software is easy to use. Easy-to-use software can help create an information economy in which all members of a company have the potential to mine data. They can all add value by becoming managers of information, data miners, data analysers and data explorers.

It simply becomes a numbers game. In the past, organisations struggled to make sense of the wealth of information because there were not enough capable users. By adopting tools that are easy enough for all employees to use, user numbers can increase dramatically.

Whilst it appears useful to create smarter systems and algorithms that can automatically find the relevant correlations in data, an over-reliance on software algorithms brings its own problems.

By increasing the number of competent users, we create an environment where the rules written into any data dissemination engine are reviewed and re-reviewed by many human eyes. This vastly reduces the chances that important data will fall through the cracks.


  •  Problem: BI User discussion

This last section looks at the problems noted in the “2014 Analytics, BI, and Information Management Survey”, which gathered responses from 248 respondents at organisations using or planning to deploy data analytics, BI or statistical analysis software[xvi].

  1. 59% said data quality problems are the biggest barrier to successful analytics or BI initiatives
  2. 44% said "predicting customer behaviour" is the biggest factor driving interest in big data analysis
  3. 47% listed "expertise being scarce and expensive" as the primary concern about using big data software
  4. 58% listed "accessing relevant, timely or reliable data" as their organisation's biggest impediment to success regarding information management


Solutions:

Data quality problems

Organisations can go a long way towards eliminating data quality problems by implementing a single source of truth and by applying and maintaining proper metadata practices.

To ensure a single source of truth, data captured by the business must be recorded only once and held in a single place, accessible by the different enterprise software systems. Whether those systems are spread across geographic areas or not, reading from the one system ensures that everyone looks at the same, consistent data.

Metadata is data that provides context or additional information about other data, such as the title, subject or author of a document. It may also describe the conditions under which the data stored in a database was acquired, such as its accuracy, date, time, and method of compilation and processing.

Proper metadata practices mean that users of the data know exactly where it came from, as well as understanding any contextual information necessary for analysing it.
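To illustrate what proper metadata might look like in practice, here is a hypothetical sketch of a metadata record kept alongside a dataset. The field names and values are my own assumptions for the example, not a formal metadata standard.

from dataclasses import dataclass, field
from datetime import date

# Illustrative sketch of a metadata record kept alongside a dataset so users
# know where the data came from and how it should be interpreted.
# The field names are assumptions, not a formal standard.

@dataclass
class DatasetMetadata:
    title: str
    subject: str
    author: str
    collection_method: str      # how the data was compiled
    reference_date: date        # the period the data describes
    accuracy_note: str          # known limitations or sampling issues
    keywords: list = field(default_factory=list)

sales_meta = DatasetMetadata(
    title="Quarterly Retail Sales",
    subject="Retail turnover by region",
    author="Finance team",
    collection_method="Point-of-sale extract, monthly batch load",
    reference_date=date(2014, 3, 31),
    accuracy_note="Excludes online-only outlets (around 2% of turnover)",
    keywords=["retail", "sales", "quarterly"],
)
print(sales_meta.title, sales_meta.reference_date)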

Furthermore, using the Generic Statistical Business Process Model (GSBPM), an international best-practice model, helps ensure that the data lifecycle management process guides organisations to get the most from their data and metadata, while determining appropriate use and retention to safely navigate the various legislative minefields.[xvii]


Predicting behaviour

In order to use data to predict customer behaviour, users need to have access to appropriate analytical tools. 

For example, they need statistical methods and functions that have traditionally been available only in specialised statistical software packages. As BI tools develop, these techniques are becoming more mainstream.
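As a minimal sketch of the kind of statistical technique being folded into BI tools, the example below fits a logistic regression to predict a customer behaviour (here, churn) from simple usage features. It assumes the scikit-learn library is available, and the data is invented purely for illustration.

from sklearn.linear_model import LogisticRegression

# Features per customer: [months_as_customer, support_calls_last_quarter]
X = [[24, 0], [3, 5], [36, 1], [2, 4], [18, 2], [1, 6], [30, 0], [4, 3]]
y = [0, 1, 0, 1, 0, 1, 0, 1]  # 1 = the customer churned

model = LogisticRegression()
model.fit(X, y)

# Score a new customer: 6 months of tenure, 4 recent support calls.
print(model.predict_proba([[6, 4]])[0][1])  # estimated probability of churn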

Of course, it is important to involve the users of the data in this process, to understand how they currently use and manipulate data to gain insights and then work out ways that software can automate all or part of that process.

Later posts will talk about the exciting world of predictive analytics, so stay tuned.


Expertise limitations

We have already looked at the problem of expertise limitations. As the tools become easier to use, more and more users will be able to take advantage of the power of data analytics, no longer having to rely on a select few individuals with the software expertise.


Accessing relevant, timely or reliable data

Software can help solve the problem of reliable and up-to-date information. It is vitally important that analytical tools make it as easy as possible to update the data: updating must become an easily automated process, as opposed to a time-consuming and highly manual exercise.

It is also important that data needs to be updated only once, in a single source of truth, rather than in many different databases.


Conclusion

Being aware of the gaps and limitations of current generation technology allows software developers to look at creating the capabilities that end users will start demanding in the future.

It is clear that disclosure control will become increasingly important: data must only be made available to users with the right credentials, and the system must automate this process as much as possible, making it easier to protect data and distribute it.

Finally, the tools must be easy to use and intuitive, helping to build a smarter user base from the ground up, and increasing the number of insights that can be gained as the number of users looking at the data grows.


References 


Davenport, Thomas H.; Harris, Jeanne G. (2007). Competing on Analytics: The New Science of Winning. Boston, Mass.: Harvard Business School Press. ISBN 978-1-4221-0332-6.

Analytics 3.0, Harvard Business Review – http://hbr.org/2013/12/analytics-30/ar/1

SuperDataHub – www.superdatahub.com

http://spacetimeresearch.com/products/superstar-platform/



