Credit scoring is the process of classifying individuals according to their creditworthiness. The term ‘credit’ originates from the Latin word ‘credo’, which means ‘trust in’ or ‘rely on’. In these problems, the objective is therefore to identify individuals who are reliable from the bank’s point of view, i.e. least likely to default on or delay payment of interest. Let us discuss the process of building credit risk models and credit scores in analytics.

Why Credit Scoring

Broadly speaking, the method of credit scoring can be used to address the following concerns:

  • Estimation of the creditworthiness (willingness and ability to repay) of any customer
  • Identification of potential credit risk (the potential financial impact of any real or perceived change in a borrower’s creditworthiness)
  • Classification of prime and sub-prime borrowers (classification among good customers)
Credit Risk Modeling

Factors that Potentially Affect Creditworthiness in Analytics

There can be many factors that potentially affect the creditworthiness of individuals or borrowers, and these factors need to be considered while developing a credit scoring model in analytics. Some of them are listed below; this list is indicative, not exhaustive.

  • Borrower’s Age
  • Borrower’s Gender
  • Borrower’s Educational Qualification
  • Borrower’s Job Type (e.g. Private, Govt, Professional such as Doctor, Lawyer etc.)
  • Number of Years in Current Job
  • Borrower’s Total Experience
  • Borrower’s Current Income
  • Borrower’s Family Details (e.g. Age, Qualifications, Whether Working, Income, Number of Children and their Age etc.)
  • Borrower’s Previous Credit History (such as Current Obligations, EMIs, Payment Default Cases etc.)
  • Borrower’s Health Conditions and Insurances
  • Borrower’s Total Worth (e.g. Savings and Assets)

Techniques for Credit Scoring

There are different data mining techniques that can be used for classifying borrowers and identifying their creditworthiness. Some of these are –

  • Logistic Regression
  • Decision Tree
  • Bayesian Networks
  • Random Forests

Among the above methods, logistic regression is the most popular. The others are also widely used and known for their classification power.
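As an illustration, a minimal credit-scoring classifier can be sketched with logistic regression. Everything here is hypothetical: the feature names, the synthetic data and the toy default rule are invented for demonstration, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(21, 65, n),           # borrower's age
    rng.normal(50_000, 15_000, n),     # current income
    rng.integers(0, 20, n),            # years in current job
])
# Toy labelling rule: higher income and longer tenure lower the default odds
z = (X[:, 1] - 50_000) / 15_000 + X[:, 2] / 10
y = (rng.random(n) < 1 / (1 + np.exp(z))).astype(int)   # 1 = default

model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]   # estimated probability of default
print(scores[:3])
```

The predicted probabilities can then be binned into score bands (e.g. prime vs. sub-prime) according to the bank’s risk appetite.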

Like any predictive modeling, in credit scoring, sampling plays a very important role. The possible sampling techniques are as follows:

  1. Random Sub-Sampling
  2. K-fold Cross Validation
  3. Leave-One-Out
  4. Bootstrap (Random Sampling With Replacement)
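For instance, k-fold cross-validation (technique 2 above) can be sketched as follows; the dataset is a synthetic stand-in for borrower data, and scikit-learn is assumed to be available.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a borrower dataset: 6 features, binary default label
X, y = make_classification(n_samples=400, n_features=6, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)   # 5 folds
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=kf)            # accuracy per held-out fold
print(len(scores), scores.mean())
```

Each fold serves once as the validation set, so the averaged score is a less optimistic estimate of out-of-sample performance than training-set accuracy.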

In the next part, we will discuss variable selection and model estimation for credit scoring.


Apache Hadoop is an open-source implementation of distributed storage and distributed processing for the analysis of big datasets. To manage storage resources across the distributed cluster, Hadoop uses a distributed user-level file system. This file system, HDFS, is written in Java and designed for portability across heterogeneous hardware and software platforms. In this article we discuss the performance of HDFS and uncover several performance issues.

First, architectural bottlenecks in the Hadoop implementation result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Second, portability limitations prevent the Java implementation from exploiting features of the native platform. Third, HDFS makes implicit assumptions about how the native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior.

The bottlenecks in HDFS are divided into three parts:

Software Architectural Bottlenecks — HDFS is not utilized to its full potential due to scheduling delays in the Hadoop architecture that result in cluster nodes waiting for new tasks. Instead of using the disk in a streaming manner, the access pattern is periodic. Further, even when tasks are available for computation, the HDFS client code, particularly for file reads, serializes computation and I/O instead of de-coupling and pipelining those operations. Data prefetching is not employed to improve performance, even though the typical MapReduce streaming access pattern is highly predictable.
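The de-coupled, pipelined read pattern described above can be sketched in a few lines: a background reader thread keeps a bounded buffer of prefetched chunks full while the consumer processes the current chunk. The stream, chunk size and search task are illustrative (and matches spanning chunk boundaries are ignored for brevity).

```python
import io
import queue
import threading

def pipelined_search(stream, needle, chunk_size=64 * 1024):
    chunks = queue.Queue(maxsize=4)          # bounded buffer: prefetch up to 4 chunks

    def reader():
        while True:
            chunk = stream.read(chunk_size)
            chunks.put(chunk)                # blocks if the consumer falls behind
            if not chunk:
                return                       # empty chunk doubles as the EOF sentinel

    threading.Thread(target=reader, daemon=True).start()
    hits = 0
    while True:
        chunk = chunks.get()                 # reader is already fetching the next chunk
        if not chunk:
            return hits
        hits += chunk.count(needle)          # "compute" step, overlapped with I/O

data = io.BytesIO(b"needle in a haystack, " * 1000)
print(pipelined_search(data, b"needle"))     # prints 1000
```

The point is structural: the consumer never waits for a read it could have started earlier, which is exactly the overlap the serialized HDFS client read path forgoes.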

Portability Limitations — Some performance-enhancing features in the native filesystem are not available in Java in a platform-independent manner. This includes options such as bypassing the filesystem page cache and transferring data directly from disk into user buffers. As such, the HDFS implementation runs less efficiently and has higher processor usage than would otherwise be necessary.

Portability Assumptions — The classic notion of software portability is simple: does the application run on multiple platforms? But, a broader notion of portability is: does the application perform well on multiple platforms? While HDFS is strictly portable, its performance is highly dependent on the behavior of underlying software layers, specifically the OS I/O scheduler and native filesystem allocation algorithm.

MapReduce systems such as Hadoop are used in large-scale deployments. Eliminating HDFS bottlenecks will not only boost application performance, but also improve overall cluster efficiency, thereby reducing power and cooling costs and allowing more computation to be accomplished with the same number of cluster nodes.

Hadoop application performance suffers due to architectural bottlenecks in the way that applications use the Hadoop filesystem. Ideally, MapReduce applications should manipulate the disk using streaming access patterns. The application framework should allow for data to be read or written to the disk continuously, and overlap computation with I/O. Many simple applications with low computation requirements do not achieve this ideal operating mode. Instead, they utilize the disk in a periodic fashion, decreasing performance.

Consider the behavior of disk and processor utilization over time for a simple search benchmark. Disk utilization was measured as the percentage of time that the disk had at least one I/O request outstanding. This profiling did not measure the relative efficiency of disk accesses (which is influenced by excessive seeks and request size), but simply examined whether or not the disk was kept sufficiently busy with outstanding service requests. Here, the system is not accessing the disk in a continuous streaming fashion as desired, even though ample processor resources are still available. Rather, the system reads data in bursts, processes it (by searching for a short text string in each input line), and then fetches more data in a periodic manner. This behavior is also evident in other applications such as the sort benchmark.

The overall system impact of this periodic behavior shows up in the average processor and HDFS disk utilization (measured, again, as the percentage of time the disk had one or more outstanding requests) for each application in the test suite. The AIO test programs (running as native applications, not in Hadoop) kept the disk saturated with I/O requests nearly all the time (97.5%) with very low processor utilization (under 3.5%). Some Hadoop programs (such as S-Wr and Rnd-Bin) also kept the disk equivalently busy, albeit at much higher processor usage due to Hadoop and Java virtual machine overheads. In contrast, the remaining programs have poor resource utilization. For instance, the search program accesses the disk less than 40% of the time, and uses the processors less than 60% of the time.

This poor efficiency is a result of the way applications are scheduled in Hadoop, and is not a bottleneck caused by HDFS. By default, the test applications like search and sort were divided into hundreds of map tasks that each process only a single HDFS block or less before exiting. This can speed recovery from node failure (by reducing the amount of work lost) and simplify cluster scheduling. It is easy to take a map task that accesses a single HDFS block and assign it to the node that contains the data. Scheduling becomes more difficult, however, when map tasks access a region of multiple HDFS blocks, each of which could reside on different nodes. Unfortunately, the benefits of using a large number of small tasks come with a performance price that is particularly high for applications like the search test that complete tasks quickly. When a map task completes, the node can be idle for several seconds until the TaskTracker polls the JobTracker for more tasks. By default, the minimum polling interval is 3 seconds for a small cluster, and increases with cluster size. Then, the JobTracker runs a scheduling algorithm and returns the next task to the TaskTracker. Finally, a new Java virtual machine (JVM) is started, after which the node can resume application processing.

The Hadoop framework and filesystem impose a significant processor overhead on the cluster. While some of this overhead is inherent in providing necessary functionality, other overhead is incurred due to the design goal of creating a portable MapReduce implementation. These are referred to as Portability Limitations.

A final class of performance bottlenecks exists in the Hadoop filesystem that we refer to as Portability Assumptions. Specifically, these bottlenecks exist because the HDFS implementation makes implicit assumptions that the underlying OS and filesystem will behave in an optimal manner for Hadoop. Unfortunately, I/O schedulers can cause excessive seeks under concurrent workloads, and disk allocation algorithms can cause excessive fragmentation, both of which degrade HDFS performance significantly. These agents are outside the direct control of HDFS, which runs inside a Java virtual machine and manages storage as a user-level application.

Missing Value Analysis

Posted: May 26, 2015 in Uncategorized

If you think the work of a data analyst is to analyze only data values, you might be wrong, because analysts have to analyze missing values too!

What are missing values?

Remember the last time you participated in a survey and skipped some of the questions? The information that is consequently unavailable in your data sheet constitutes the missing values. As an analyst or researcher, one needs to deal with the problem of missing values in a dataset.

Missing Values in a Dataset

What are the issues with Missing Data?

For a univariate dataset, i.e. a dataset with only one variable, you can decide to consider only complete cases for the analysis. The downside is that the sample size gets reduced. The second approach is to replace the missing values and make the dataset complete. Choosing between the two scientifically, based on the business scenario, yields better results.

Similarly, for a multivariate dataset, i.e. a dataset with multiple variables, you can decide whether to keep cases with missing values or to ignore them. If the sample size is large and the percentage of missing values is less than 20%, then you may consider only complete cases.

But if the sample size is small and the percentage of missing values is high, then completely ignoring the cases with missing values will leave an even smaller sample.

There are three main quantities to evaluate in missing value analysis (MVA):

  • The number of cases missing per variable
  • The number of variables missing per case
  • The missing value pattern, i.e. the pattern of correlations among indicator variables created to represent missing and valid data.
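The first two quantities are straightforward to compute; a sketch with pandas (using a made-up data frame) might look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 45_000],
    "city":   ["Pune", "Delhi", "Delhi", None, "Mumbai"],
})

missing_per_variable = df.isna().sum()        # cases missing per variable
missing_per_case     = df.isna().sum(axis=1)  # variables missing per case
pattern = df.isna()                           # True/False missingness indicators
print(missing_per_variable.to_dict())
```

Correlating the columns of `pattern` then reveals whether variables tend to be missing together, which hints at the missingness mechanism.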

How do you deal with missing values?

So we have found that our dataset has missing values; now what do we do for the analysis?

  • Do nothing! Based on the scenario (e.g. e-commerce data vs. post-surgery patient data), the percentage of missing values, or the size of the dataset (large or small), you can decide to do nothing about the missing values.
  • Delete the missing-value cases – If the percentage of missing values is small, one may decide to delete the cases with missing values and take only complete cases. There are two methods for this: listwise deletion and pairwise deletion. More on the methods of deletion and missing value analysis in research surveys can be found in our earlier blog on Treatment of Missing Values in Survey Analysis.
  • Substitution or imputation – When the sample size is small, one may decide to fill in the missing values using substitution or imputation methods. Here it is important to consider the missingness pattern.
  • If values are Missing Completely At Random (MCAR), a missing value can be substituted by any measure of central tendency, like the mean, median or mode. Imputation can also be done by regression, where missing values are predicted from the existing values.
  • For values Missing At Random (MAR), missing values can be substituted using the EM (Expectation-Maximization) algorithm, which is an iterative method.
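A minimal sketch of mean and median substitution with pandas, assuming the values are MCAR; the income figures are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 62_000, np.nan, 58_000]})

mean_filled   = df["income"].fillna(df["income"].mean())
median_filled = df["income"].fillna(df["income"].median())

# Regression or EM-based imputation would instead model income from other
# variables rather than substituting a single central value.
print(mean_filled.tolist())
```

Note that any single-value substitution shrinks the variable's variance, which is one reason model-based imputation is preferred when the missingness is not completely random.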


In recent years, the growth of efficient analytical tools and new methods has given marketing heads a new oil, fueling organizations with new decision-making firepower. Growth and marketing return on investment (MROI) are greatly influenced by marketing analytics. The use of analytics in marketing has increased greatly in the last couple of years, and a Forbes Magazine article reported a 67% growth in hiring in marketing analytics.

Consumer is Dynamic!

Understanding the target audience’s buying behaviour is a key factor in achieving an effective MROI. With the advent of the internet and e-commerce, products that were traditionally purchased from brick-and-mortar stores, from microwave ovens, refrigerators, apparel and books to insurance policies, are also purchased on the web today! This calls for accepting the fact that the consumer is more dynamic, and that consumer behaviour is subject to many different moments of influence, which demands changes in the media mix, marketing and advertising budgets, distribution channels etc.

Marketing Indicators

For a long time, strategies were typically based on, and evaluated against, last year’s dollars spent or how well a product fared in the market. A much better approach measures plans based on their strategic return, economic value, and payback window. Evaluating plans using such measures provides a consistent method for comparison, and these measurements can also be combined with preconditions such as baseline spending, thresholds for certain media in the media mix based on the type of product and target customers, prior commitments etc.

Benefits of Marketing Analytics

In order to gain proper insights and generate smart business decisions, marketing analytics follows a methodical and efficient process of understanding and interpreting data. Collection of appropriate and logical data is one of the key elements in measuring marketing effectiveness, but data would remain mere numbers unless meaningful analysis is done. Marketing analytics measures effectiveness in business metrics like traffic, leads, and sales, which in turn influence whether leads become customers. In many marketing situations you need to slice and dice data to gain valuable insights as well as decide future strategies. For example,

  • Study sales volume by store location, week/month and product, and then use it to predict future sales, which helps in advance planning of marketing as well as supply chain resources
  • Investigate global drift of a business due to global and local policies, new technology etc., which influences current and future product marketing strategy
  • Assess the effects of marketing campaigns in different media and decide on media mix, budget and spend
  • Measure the impact of demographics on a product
  • Understand and gain insights into customer preferences and trends, to drive current sales as well as future product strategies
  • Use internet marketing analytics and social media analytics to monitor campaigns and their outcomes, enabling money to be spent effectively
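The first example, slicing sales volume by store and month, might look like this in pandas; the store names and figures are invented:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["North", "North", "South", "South", "North", "South"],
    "month": ["Jan", "Feb", "Jan", "Feb", "Mar", "Mar"],
    "units": [120, 135, 90, 110, 150, 95],
})

by_store_month = sales.groupby(["store", "month"])["units"].sum()  # the slice
by_store = sales.groupby("store")["units"].sum()  # input to a sales forecast
print(by_store.to_dict())
```

Aggregates like these feed directly into forecasting models for advance planning of marketing and supply chain resources.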
Marketing – Integrated Approach


The importance of product development, customer satisfaction and brand awareness has made its presence significantly felt, and marketing is no longer limited to just customers searching for a product. How people look for you indicates how the market talks about you and how your brand and products are being portrayed. Therefore, it is imperative to analyse the available data correctly, using the right algorithms, tools and techniques, in order to take the right action. We need to imagine analysis and action as connected and dependent upon each other, not detached, since together their value becomes greater than the sum of the parts. In other words:

  • Analytics + Actions = Actalytics

We have numerous data sources as well as analytical tools to make smart choices in business. Existing techniques, from descriptive statistics to predictive modeling, optimization, time-series forecasting and newer text mining algorithms for analyzing unstructured text from social media, mobile devices etc., are widely used in marketing analytics. To unveil significant and meaningful insights and drive above-market growth for brands, businesses must put analysis at the heart of the organization and take an integrated analytical approach to succeed.

This article has been contributed by our guest writer Nicholas Filler. Nicholas lives in Idaho, US and has deep interests in technology, education, and medicine. He is currently working on game theory and design. He enjoys spending his days outside, skiing during the winter, and learning about engineering concepts. 

With new innovations entering the medical field, like the Apple ResearchKit, large data is becoming much more manageable, but there are still a variety of obstacles to overcome. Many practices are attempting to start filing digitally, and once again the problem of sorting through large data is a looming issue. There is hope on the horizon, and it’s coming in the form of open source software languages designed to handle this exact dilemma.

Open source languages are not new, and have actually been around for quite some time. Open source software is defined as “any program whose source code is made available for use or modification as users or other developers see fit.” This allows for a variety of applications when it comes to medical use, and there are specific languages that could make sifting through large data much easier than reading one Excel sheet at a time. One language comes to mind, and that language is known as “R”.

According to the R Project’s own description, “R is an integrated suite of software facilities for data manipulation, calculation, and graphical display”. It includes:

  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either on-screen or on hardcopy, and
  • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.”

As you can tell, these types of languages are designed to tackle large amounts of data.

This type of data can be helpful in a variety of ways, and the benefits within the medical field are quite interesting. Health informatics in particular could greatly benefit from this type of technology. Considering that this field uses technology to gather information regarding patient health and other various statistics, implementation with open source programming languages could greatly improve the statistical data that is being generated.

According to an article by Dr. Victoria Wangia of the University of Cincinnati, “Understanding factors that influence the use of an implemented public health information system such as immunization registry is of great importance to those implementing the system and those interested in the positive impact of using the technology for positive public health outcomes.” This type of data doesn’t necessarily have to be geared directly towards doctors or those with a medical degree. It has far broader applications, and could be used in the public sector as well.

Information like this could be used to map out a variety of interesting statistics that would benefit not only medical professionals, but curious individuals as well. Considering the article above mentions immunization, it would be incredibly interesting to view mapped-out areas or locations within cities that are up to date on their treatments, or view the ones that are falling behind. This would be helpful for parents trying to find a new school, or looking to buy a house with their children in mind. This type of software doesn’t necessarily have to be tied to the medical field, but that’s where the focus should remain.

On that note, with the transition of medical records moving into the digital realm, patient information is now much easier to map, and analyze using this software. There is some great information from the University of Ohio that discusses implementing electronic health records more in depth. Implementing information correctly regarding electronic health records could greatly improve the way in which we can compile important information regarding patient health. Doing this now could be a complete game changer in the way we view large data in the medical field.

Although R is a great resource and tool, there are other options to consider as well. Platforms like OpenEMR, designed to handle electronic medical data and practice management, have a great database and community devoted to improving the open source software; OpenEMR is also ONC certified. Another great site seems to be OpenMRS, which aspires to create a global network of people centered around an open source medical platform. There are also sites dedicated to up-to-date information on data analytics, some with very interesting radio broadcasts on the subject. These are just a few of the possibilities when looking at medical records and dealing with large amounts of data.

Hopefully, as time goes on, new and interesting solutions will continue to reshape the medical data field. Open source languages such as R will ideally become a leading force in the way we analyze these new and endless sheets of data.

Could we have prevented the 26/11 Mumbai terrorist attack in 2008, or the Charlie Hebdo attack in Paris this month (Jan 2015)?

Donald Rumsfeld’s famous comment to a US Department of Defense news briefing in 2002 comes to my mind. “As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also – unknown unknowns – the ones we don’t know we don’t know … it is the latter category that tend to be the difficult ones.”

At the end of the day, neither of the two is a “Black or Grey Swan” event, an event which is possible and known about, potentially extremely significant, but considered unlikely to happen. We have seen many terrorist attacks against individual countries’ homelands, and against mankind, for many years now. So these are more of the known unknowns.

Predictive Analytics

Predictive analytics is the method of studying past incidents to anticipate, with some precision and accuracy, the likelihood of similar future incidents. It uses statistical analysis to extract information from data and unveil patterns and trends. The basic tenets of predictive analytics are models or algorithms that combine data, mathematics and at times artificial intelligence techniques to create a function between an outcome (a terrorist attack) and a set of input variables.


Model Approach

Let’s focus on the case of the known unknowns for the model building. So imagine that we want to create a model that is able to tell us the probability of a possible terrorist attack in the near future. Such a model needs to be generic to an extent, have relatively high accuracy, and be pragmatic: it should work in real life rather than just in a theoretical paper.

We should also have ‘training’ data to validate the model and fine-tune its parameters. Information about terrorist attacks across the world zones prone to such attacks will suffice. India, Bangladesh, Afghanistan and the Middle East will give us the bulk of the data, while other parts of the world can give us fewer, but still significant, informative events as well.

Variables in the Model

For building a model that uses analytics to counter terrorism, an ‘attack by terrorists’ is the outcome or response variable. A model that could say which individual or group is behind an attack would be utopian at this stage; maybe building such a model will be less challenging in the near future. So we will keep our focus on the probability of such an attack for now.

To explain our model, we also need operational definitions of the variables or inputs. The functional parameters (x) could contain information variables like the impact of the attack (local, regional, national or international), the importance of the event (the Olympics vs. a league match), coverage of the event (is it being broadcast locally, nationally or internationally? Is there a lot of interest in the event?), location of the event (urban vs. rural), density of police/intelligence resources deployed for the event, population diversity (different ethnic/religious affiliations; whether we like it or not, people are the drivers of these horrific acts), proximity to air/sea/rail connections, time of the event, audience majority (teenagers at a jazz event vs. marathon watchers), participant profile of the event, and others. Identifying all possible parameters or variables is a humongous task, and not prudent for this kind of model, so iteration in model building is important. The principle of parsimony should also be followed in order to keep the model pragmatic.
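The model form described here, a function from event variables to an attack probability, can be sketched as a logistic function. Every feature name and weight below is hypothetical and purely illustrative, not derived from any real data:

```python
import math

def attack_probability(features, weights, bias=-4.0):
    """Logistic model: p = 1 / (1 + exp(-(bias + sum(w_i * x_i))))."""
    z = bias + sum(weights[k] * features[k] for k in weights)
    return 1 / (1 + math.exp(-z))

weights = {                       # hypothetical coefficients on a 0-1 scale
    "event_importance": 1.2,      # Olympics vs. a league match
    "coverage": 0.8,              # local = 0 ... international = 1
    "crowd_density": 0.9,
    "police_presence": -1.5,      # more security lowers the odds
}
p = attack_probability(
    {"event_importance": 1.0, "coverage": 1.0,
     "crowd_density": 0.8, "police_presence": 0.3},
    weights,
)
print(round(p, 3))
```

In practice the coefficients would be estimated from historical event data rather than set by hand, and the iteration mentioned above would add, drop and rescale features.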


This kind of analytics model for countering terrorism will have false positives as well as false negatives. Experts from the domains of terrorism and crime should be involved in building and validating the model, as well as interpreting its outcomes. It cannot be foolproof, but it is still a very powerful tool that can help our law enforcement agencies fight terrorism.

You have started your journey in the world of data science and analytics, but are wondering which tool to learn? There are many established analytics tools in the market, and new ones keep coming up. Today we shall discuss a few good and popular tools to help you make your decision.

SAS – SAS® from SAS Institute is still the market leader in the world of data analytics due to its features, functionality and robust security. It has many modules.

  • You can write your own code and syntax in Base SAS, while SAS Enterprise Miner gives you a GUI which is easier to navigate, learn and use. For those interested in the pharma or clinical trials area, the SAS CDI (Clinical Data Integration) GUI helps organize, standardize and manage clinical research data and metadata, and reduces the need to write unique code for each study.
  • Visualization – There are a lot of good features in SAS BI, as well as the new SAS Visual Analytics, for data slicing, dicing and reporting.
  • Free learning for students – SAS has come out with SAS University Edition, a free tool for students who want to learn and master the software.
  • Though SAS is an expensive tool, it has a large installed base globally, and in India too. Typically, in the BFSI and clinical trials areas, SAS is the preferred tool.
  • For a fresher, or an experienced candidate making a lateral entry into the analytics industry, it makes sense to learn SAS, as there are currently more jobs in it.
  • There are different levels of certification available for these modules. Check out our blog on “Guidelines for SAS Certification” for more.

SPSS – SPSS® was a product of SPSS Inc., which was taken over by IBM in 2009. Later, IBM re-launched Clementine as SPSS Modeler for predictive modeling.

  • Functionalities – SPSS has a broad range of statistical analysis and data crunching functionality, though limited visualization features.
  • It is easy and intuitive to learn, and hence popular among users from different backgrounds, including those without any knowledge of statistics or coding.
  • SPSS is less expensive than SAS and is used widely in industry, market research and social research.
  • IBM also offers different levels of certification on SPSS. Check out IBM’s site for details.

R – R is an open source tool that is rapidly gaining popularity and giving a tough fight to the established players in analytics!

  • Being an open source tool, its pace of development and feature addition is really fast compared to any proprietary tool. Hence it is very rich in features and functionality.
  • Though it takes time to learn, it is becoming the preferred tool in academia as well as in corporates, due to its lower cost, its flexibility to work with big data and Hadoop, and the flexibility to write your own code and algorithms. For those with some programming knowledge, however, the learning time is much shorter.
  • R is rich in charts, graphs and visualization, including GIS mapping, through various packages.
  • Revolution Analytics came out with Rev R – a commercial version of R. Seeing the growing popularity of this tool, the IT behemoth Microsoft bought the company in Jan 2015. Whether Microsoft will incorporate it into their big data strategy or launch it as a separate tool, only time will tell!

We have presented a summary of these 3 popular tools below –

Comparison of Analytics Tools

There are many other tools used in the industry, like Microsoft Excel (for basic data analysis and reporting, as well as some statistical analysis using the Analysis ToolPak), SAP BusinessObjects (for reporting), MATLAB (for advanced analytics), Stata (for analysis) etc. Each of these tools has some limitation or other, such as in handling large volumes of data, power of analysis, or visualization.

For a fresher, or someone new to the field of analytics, it makes sense to start with one tool. But we have seen that, over time, an experienced analytics professional will know at least two analytics tools, if not more!