Friday, November 21, 2014

Popular software skills in Data Science job postings


This exercise was done to identify the most popular skills required in data science job postings on popular job websites. The skill requirements text was extracted, cleansed and mined. R and its packages tm and arules were used to cleanse and analyze the data. The findings were as follows.

The following is the word cloud created from the software skill requirements in the postings.
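The cleansing and word cloud steps can be sketched as below with tm and the wordcloud package. The character vector `skills_text` is a hypothetical stand-in for the scraped requirement snippets; this is a sketch of the process, not the exact code used.

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

# skills_text: hypothetical character vector, one element per posting
corpus <- Corpus(VectorSource(skills_text))

# standard cleansing: lowercase, strip punctuation, stopwords, extra spaces
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# term frequencies across all postings
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# draw the word cloud, sized by frequency
wordcloud(names(freq), freq, min.freq = 2, colors = brewer.pal(8, "Dark2"))
```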



The following chart gives the relative importance of skills. A frequency of 0.5 means the skill is found in 50% of the postings.
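A relative-frequency chart like this can be produced directly with arules, assuming the per-posting skill sets have been read into a (hypothetical) transactions object `skill_trans`:

```r
library(arules)

# skill_trans: hypothetical transactions object, one transaction per posting.
# type = "relative" plots the share of postings containing each skill,
# so a bar at 0.5 means the skill appears in 50% of the postings.
itemFrequencyPlot(skill_trans, topN = 15, type = "relative",
                  main = "Skill frequency across postings")
```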




As seen, R, Python and SQL are the top 3 skills found. Java continues to be a favorite programming language. Interestingly, SQL trumps Hadoop in the skill list.

Association rule mining (ARM) was done to find which skills occur together. The following are the resulting rules on this skill set.
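A minimal sketch of the mining step, again assuming each posting's skill set is one transaction in a hypothetical `skill_trans` object. The thresholds shown roughly match the output below (support around 0.1 and up, confidence 0.3 and up):

```r
library(arules)

# mine association rules over the skill transactions;
# minlen = 1 also allows empty-LHS rules like {} => {R}
rules <- apriori(skill_trans,
                 parameter = list(support = 0.1, confidence = 0.3,
                                  minlen = 1, maxlen = 3))

# print the rules, strongest lift first
inspect(sort(rules, by = "lift"))
```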

 lhs             rhs            support confidence      lift
1   {}           => {sas}        0.3469388  0.3469388 1.0000000
2   {}           => {java}       0.4081633  0.4081633 1.0000000
3   {}           => {hadoop}     0.4693878  0.4693878 1.0000000
4   {}           => {sql}        0.5714286  0.5714286 1.0000000
5   {}           => {python}     0.6326531  0.6326531 1.0000000
6   {}           => {R}          0.7142857  0.7142857 1.0000000
7   {tableau}    => {R}          0.1020408  1.0000000 1.4000000
8   {javascript} => {java}       0.1224490  1.0000000 2.4500000
9   {java}       => {javascript} 0.1224490  0.3000000 2.4500000
10  {javascript} => {sql}        0.1020408  0.8333333 1.4583333
11  {javascript} => {python}     0.1020408  0.8333333 1.3172043
12  {big data}   => {hadoop}     0.1020408  0.7142857 1.5217391
13  {spark}      => {hive}       0.1224490  0.8571429 3.2307692
14  {hive}       => {spark}      0.1224490  0.4615385 3.2307692
15  {spark}      => {hadoop}     0.1224490  0.8571429 1.8260870
16  {spark}      => {R}          0.1020408  0.7142857 1.0000000
17  {perl}       => {sql}        0.1224490  1.0000000 1.7500000
18  {perl}       => {python}     0.1224490  1.0000000 1.5806452
19  {perl}       => {R}          0.1020408  0.8333333 1.1666667
20  {mapreduce}  => {hive}       0.1020408  0.5555556 2.0940171
21  {hive}       => {mapreduce}  0.1020408  0.3846154 2.0940171
22  {mapreduce}  => {hadoop}     0.1632653  0.8888889 1.8937198
23  {hadoop}     => {mapreduce}  0.1632653  0.3478261 1.8937198
24  {mapreduce}  => {R}          0.1224490  0.6666667 0.9333333
25  {ruby}       => {java}       0.1020408  0.6250000 1.5312500
26  {ruby}       => {sql}        0.1632653  1.0000000 1.7500000
27  {ruby}       => {python}     0.1428571  0.8750000 1.3830645
28  {ruby}       => {R}          0.1020408  0.6250000 0.8750000
29  {pig}        => {hive}       0.1428571  0.7777778 2.9316239
30  {hive}       => {pig}        0.1428571  0.5384615 2.9316239
31  {pig}        => {java}       0.1020408  0.5555556 1.3611111
32  {pig}        => {hadoop}     0.1428571  0.7777778 1.6570048
33  {hadoop}     => {pig}        0.1428571  0.3043478 1.6570048
34  {pig}        => {sql}        0.1224490  0.6666667 1.1666667
35  {pig}        => {python}     0.1224490  0.6666667 1.0537634
36  {pig}        => {R}          0.1632653  0.8888889 1.2444444
37  {matlab}     => {hive}       0.1224490  0.4615385 1.7396450
38  {hive}       => {matlab}     0.1224490  0.4615385 1.7396450
39  {matlab}     => {java}       0.1020408  0.3846154 0.9423077
40  {matlab}     => {hadoop}     0.1224490  0.4615385 0.9832776
41  {matlab}     => {sql}        0.1428571  0.5384615 0.9423077
42  {matlab}     => {python}     0.2040816  0.7692308 1.2158809
43  {python}     => {matlab}     0.2040816  0.3225806 1.2158809
44  {matlab}     => {R}          0.2448980  0.9230769 1.2923077
45  {R}          => {matlab}     0.2448980  0.3428571 1.2923077
46  {hive}       => {java}       0.1020408  0.3846154 0.9423077
47  {hive}       => {hadoop}     0.2040816  0.7692308 1.6387960
48  {hadoop}     => {hive}       0.2040816  0.4347826 1.6387960
49  {hive}       => {sql}        0.2040816  0.7692308 1.3461538
50  {sql}        => {hive}       0.2040816  0.3571429 1.3461538
51  {hive}       => {python}     0.1632653  0.6153846 0.9727047
52  {hive}       => {R}          0.2040816  0.7692308 1.0769231
53  {sas}        => {java}       0.1224490  0.3529412 0.8647059
54  {java}       => {sas}        0.1224490  0.3000000 0.8647059
55  {sas}        => {hadoop}     0.1428571  0.4117647 0.8772379
56  {hadoop}     => {sas}        0.1428571  0.3043478 0.8772379
57  {sas}        => {sql}        0.2040816  0.5882353 1.0294118
58  {sql}        => {sas}        0.2040816  0.3571429 1.0294118
59  {sas}        => {python}     0.2040816  0.5882353 0.9297913
60  {python}     => {sas}        0.2040816  0.3225806 0.9297913
61  {sas}        => {R}          0.3061224  0.8823529 1.2352941
62  {R}          => {sas}        0.3061224  0.4285714 1.2352941
63  {java}       => {hadoop}     0.1428571  0.3500000 0.7456522
64  {hadoop}     => {java}       0.1428571  0.3043478 0.7456522
65  {java}       => {sql}        0.2653061  0.6500000 1.1375000
66  {sql}        => {java}       0.2653061  0.4642857 1.1375000
67  {java}       => {python}     0.3469388  0.8500000 1.3435484
68  {python}     => {java}       0.3469388  0.5483871 1.3435484
69  {java}       => {R}          0.3265306  0.8000000 1.1200000
70  {R}          => {java}       0.3265306  0.4571429 1.1200000
71  {hadoop}     => {sql}        0.2448980  0.5217391 0.9130435
72  {sql}        => {hadoop}     0.2448980  0.4285714 0.9130435
73  {hadoop}     => {python}     0.2448980  0.5217391 0.8246844
74  {python}     => {hadoop}     0.2448980  0.3870968 0.8246844
75  {hadoop}     => {R}          0.3265306  0.6956522 0.9739130
76  {R}          => {hadoop}     0.3265306  0.4571429 0.9739130
77  {sql}        => {python}     0.4081633  0.7142857 1.1290323
78  {python}     => {sql}        0.4081633  0.6451613 1.1290323
79  {sql}        => {R}          0.4081633  0.7142857 1.0000000
80  {R}          => {sql}        0.4081633  0.5714286 1.0000000
81  {python}     => {R}          0.5306122  0.8387097 1.1741935
82  {R}          => {python}     0.5306122  0.7428571 1.1741935
83  {java,                                                     
     javascript} => {sql}        0.1020408  0.8333333 1.4583333
84  {javascript,                                               
     sql}        => {java}       0.1020408  1.0000000 2.4500000
85  {java,                                                     
     sql}        => {javascript} 0.1020408  0.3846154 3.1410256
86  {java,                                                     
     javascript} => {python}     0.1020408  0.8333333 1.3172043
87  {javascript,                                               
     python}     => {java}       0.1020408  1.0000000 2.4500000
88  {hive,                                                     
     spark}      => {hadoop}     0.1020408  0.8333333 1.7753623
89  {hadoop,                                                   
     spark}      => {hive}       0.1020408  0.8333333 3.1410256
90  {hadoop,                                                   
     hive}       => {spark}      0.1020408  0.5000000 3.5000000
91  {perl,                                                     
     sql}        => {python}     0.1224490  1.0000000 1.5806452
92  {perl,                                                     
     python}     => {sql}        0.1224490  1.0000000 1.7500000
93  {python,                                                   
     sql}        => {perl}       0.1224490  0.3000000 2.4500000
94  {perl,                                                     
     sql}        => {R}          0.1020408  0.8333333 1.1666667
95  {perl,                                                     
     R}          => {sql}        0.1020408  1.0000000 1.7500000
96  {perl,                                                     
     python}     => {R}          0.1020408  0.8333333 1.1666667
97  {perl,                                                     
     R}          => {python}     0.1020408  1.0000000 1.5806452
98  {hive,                                                     
     mapreduce}  => {hadoop}     0.1020408  1.0000000 2.1304348
99  {hadoop,                                                   
     mapreduce}  => {hive}       0.1020408  0.6250000 2.3557692
100 {hadoop,                                                   
     hive}       => {mapreduce}  0.1020408  0.5000000 2.7222222





Tuesday, November 4, 2014

Exercise to compare classifier performance


This is an exercise to compare the performance of different machine learning algorithms (classifiers), both in terms of accuracy and speed. This is not a comprehensive exercise; it's merely a trial to come up with a process for comparing performance. The caret package of R helps here, since it can execute different classifiers with just the change of a parameter.

The data used is an income data set that contains details about individuals - age, years of education, gender and work hours per week. It also has their salary range as a binary variable - whether they earn more than 50K or not. The goal of the classifier is to predict the income range based on the other attributes. R is used for the exercise.


Loading and exploring the data set


data <- read.csv("incomedata.csv")
str(data)

Let us look at the correlation between the predictors and the target.
library(psych)
pairs.panels(data)

Correlation Plot
All 4 predictors are weak, with correlation coefficients against Salary only in the 0.2 range. So it will be interesting to see how different algorithms perform on this data.


Model Building and Prediction


Split training and testing sets

library(caret)
inTrain <- createDataPartition(y=data$Salary,p=0.7,list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]

Predict using different models

#list of all algorithms
predlist <- c("bagFDA","lda","LogitBoost",
              "nb","nnet","rf","rpart","svmRadialCost",
              "C5.0", "glm")

results <- data.frame( Algorithm=character(), Duration=numeric(), Accuracy=numeric(),
                       stringsAsFactors=FALSE)

#loop through algorithm list and perform model building and prediction

for (i in 1:length(predlist)) {
  pred <- predlist[i]
  print(paste("Algorithm = ", pred))
  startTime <- as.integer(Sys.time())
  
  model <- train(Salary ~ ., data = training, method = pred)
  predicted <- predict(model, testing)
  matrix <- confusionMatrix(predicted, testing$Salary)
  endTime <- as.integer(Sys.time())
  
  # record the algorithm name, elapsed time (seconds) and accuracy (%)
  results[i, 1] <- pred
  results[i, 2] <- endTime - startTime
  results[i, 3] <- round(as.numeric(matrix$overall[1]) * 100, 2)
}

results

Analyzing results

Classifier                                  Accuracy (%)   Execution Time (minutes)
Linear Discriminant Analysis                    80.38              0.0
Classification and Regression Trees             80.86              0.2
Generalized Linear Models                       80.53              0.2
Boosted Logistic Regression                     79.01              1.4
Decision Trees (C5.0)                           80.9               4.5
Naïve Bayes                                     80.94              4.9
Neural Networks                                 80.8               6.4
Random Forest                                   80.84             15.1
Bagging (Flexible Discriminant Analysis)        81.15             23.2
Support Vector Machines                         81.17             66.3

The table above shows the results of the comparison. Some thoughts:

1. Different classifiers perform differently depending on the data set. The correlation between the predictors and the variability of predictions between runs go a long way in influencing the algorithms. Sometimes all algorithms predict the same way.

2. There are differences in accuracy of a few points between many of them. Are these differences statistically significant?
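One way to check, sketched here under the assumption that the fitted train() objects were also kept in a hypothetical named list `models` (all trained with the same trainControl resampling indices), is caret's built-in resampling comparison:

```r
library(caret)

# models: hypothetical named list of train() fits,
# e.g. list(lda = fit_lda, rf = fit_rf, svm = fit_svm)
resamps <- resamples(models)

# accuracy/kappa distributions per model across resamples
summary(resamps)

# pairwise differences with paired t-tests; the summary reports
# p-values, which answer the significance question directly
diffs <- diff(resamps)
summary(diffs)
```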

3. There are huge differences in execution times. This indicates that for any data science project, trying out multiple algorithms and comparing their results is a worthwhile exercise. For similar levels of accuracy, some algorithms have far shorter execution times, which helps in choosing the most efficient algorithm for the problem in question.

Here is a further comparison of accuracy (%) on different data sets.


Classifier                                  Income Prediction   SPAM Filter   Credit Approval
Linear Discriminant Analysis                      80.38            91.95          72
Classification and Regression Trees               80.86            88.59          66
Generalized Linear Models                         80.53            92.62          71.67
Boosted Logistic Regression                       79.01            90.6           69.33
Decision Trees (C5.0)                             80.9             89.26          71
Naïve Bayes                                       80.94            87.92          68.33
Neural Networks                                   80.8             91.28          71.67
Random Forest                                     80.84            89.93          70.67
Bagging (Flexible Discriminant Analysis)          81.15            93.96          70.33
Support Vector Machines                           81.17            89.93          72.33

Monday, October 27, 2014

Predictions - Effect of unique number of target classes on accuracy




When we perform machine learning of type classification, the target variable is a categorical (nominal) variable that has a set of unique values or classes. It could be a simple two-class target variable like "approve application?" with classes (values) of "yes" or "no". Sometimes the classes indicate ranges, like "Excellent", "Good" etc. for a target variable such as a satisfaction score. We might also convert continuous variables like test scores (1 - 100) into classes like grades (A, B, C etc.).

This experiment is to find the effect of the number of unique classes in the target variable on the accuracy of the prediction. The hypothesis is that accuracy will go down as the number of classes increases. This is because, with each additional class boundary, there is an additional chance that a predicted sample ends up on the wrong side of a boundary.

For this experiment, I used a data set of blood pressure levels. Each observation contains the patient's demographics and the actual systolic blood pressure measured. The value of the blood pressure is then binned into multiple classes (blood pressure ranges). Prediction of the blood pressure range is then done for varying numbers of bins (classes). The results are then tabulated as follows.
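The binning step can be sketched as follows, assuming a hypothetical data frame `bp` with the measured value in a `systolic` column:

```r
# bp: hypothetical data frame with demographics and a numeric 'systolic' column.
# For each bin count k, cut() turns the continuous reading into k equal-width
# classes; a classifier is then retrained on the class variable for that k.
for (k in 2:10) {
  bp$range <- cut(bp$systolic, breaks = k, labels = paste0("class", 1:k))
  # ...train a classifier on bp$range and record its accuracy for this k
}
```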


The experiment confirms the hypothesis. Accuracy drops sharply as the number of classes in the target variable increases. It does taper off beyond a size of 8.






Sunday, October 19, 2014

K Means Clustering - Effect of random seed


When the k-means clustering algorithm runs, it uses a randomly generated seed to determine the starting centroids of the clusters (see the Wikipedia article on k-means clustering).

If the feature variables exhibit patterns that naturally group them into visible clusters, then the starting seed will not have an impact on the final cluster memberships. However, if the data is evenly distributed, then we might end up with different cluster memberships based on the initial random seed. An example of such behavior is shown below.

R is used for the experiment. The code to load the data and the contents of the data are as follows. We try to group the samples based on two feature variables - age and bmi.

data <- read.csv("AgeBMIEven.csv")
str(data)
## 'data.frame':    1338 obs. of  2 variables:
##  $ age: int  19 18 28 33 32 31 46 37 37 60 ...
##  $ bmi: num  27.9 33.8 33 22.7 28.9 ...
summary(data)
##       age            bmi      
##  Min.   :18.0   Min.   :16.0  
##  1st Qu.:27.0   1st Qu.:26.3  
##  Median :39.0   Median :30.4  
##  Mean   :39.2   Mean   :30.7  
##  3rd Qu.:51.0   3rd Qu.:34.7  
##  Max.   :64.0   Max.   :53.1
plot(data$age, data$bmi)
(Scatter plot of age vs. bmi)

As we can see from the above plot, the data points are distributed almost evenly over the scatter plot. The initial cluster center positions would therefore affect the final cluster shapes and memberships. We run the clustering 4 times, grouping the data into 4 clusters, and plot the cluster outputs here.

par(mfrow=c(2,2))
for (i in 1:4 ) {
  clusters<- kmeans(data,4)
  plot(data$age, data$bmi, col=clusters$cluster)
}

(Cluster plots from the 4 runs)


Each time the clustering algorithm runs, it picks a random seed, and that seems to impact the shapes and memberships of the clusters. The first two runs generate the same groups, but the next two give different groupings of the data. Setting the seed explicitly to a specific value is required to generate the same results every time.
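A minimal sketch of that fix, reusing the data and cluster count from the code above:

```r
# fixing the seed makes the clustering reproducible across runs;
# nstart restarts k-means from several random centroid sets and keeps
# the best solution, which also reduces sensitivity to any one seed
set.seed(42)
clusters <- kmeans(data, centers = 4, nstart = 25)
plot(data$age, data$bmi, col = clusters$cluster)
```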