Friday, November 21, 2014

Popular software skills in Data Science job postings


This exercise was done to understand the most popular skills required in Data Science job postings found on popular job websites. The skill-requirements text was extracted, cleansed and mined using R and its packages tm and arules. The findings were as follows.
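A minimal sketch of the cleansing step with tm (illustrative only; the toy postings vector stands in for the scraped text and is not the original pipeline):

```r
library(tm)

# one job posting's skill-requirements text per element (toy examples)
postings <- c("Experience with R, Python and SQL required",
              "Hadoop, Hive and Spark; Java is a plus")

corpus <- VCorpus(VectorSource(postings))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# one row per posting, one column per term
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```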

The following is the word cloud created from the job software skill requirements.
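Such a word cloud can be drawn with the wordcloud package from the term frequencies (a sketch; the counts here are illustrative stand-ins, as noted in the comment):

```r
library(wordcloud)
library(RColorBrewer)

# illustrative term counts; in practice: freq <- colSums(as.matrix(dtm))
freq <- c(r = 35, python = 31, sql = 28, hadoop = 23, java = 20, sas = 17)
wordcloud(names(freq), freq, min.freq = 1,
          colors = brewer.pal(6, "Dark2"))
```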



The following chart gives the relative importance of skills. A frequency of 0.5 means the skill is found in 50% of the postings.
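These frequencies are per-skill document frequencies, so the chart reduces to a simple barplot (a sketch; the values used here are the rounded single-skill supports reported in the rule table below):

```r
# fraction of postings mentioning each skill; in practice:
# skill_freq <- colMeans(as.matrix(dtm) > 0)
skill_freq <- c(R = 0.714, python = 0.633, sql = 0.571,
                hadoop = 0.469, java = 0.408, sas = 0.347)
barplot(sort(skill_freq, decreasing = TRUE),
        las = 2, ylab = "Fraction of postings")
```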




As seen, R, Python and SQL are the top 3 skills found. Java continues to be a favorite programming language. Interestingly, SQL trumps Hadoop in the skill list.

Association rule mining was done to find which skills occur together. The following are the rules mined from this skill set, with support, confidence and lift for each.
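The rules can be generated with arules roughly as follows (a sketch; the toy incidence matrix and the support/confidence thresholds are assumptions inferred from the output below):

```r
library(arules)

# toy postings-by-skills incidence matrix; the real one comes from the
# cleansed job postings
m <- matrix(c(1, 1, 0,
              1, 0, 1,
              1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("R", "python", "sql")))

# each posting becomes a transaction: the set of skills it mentions
trans <- as(m == 1, "transactions")
rules <- apriori(trans,
                 parameter = list(supp = 0.1, conf = 0.3, minlen = 1))
inspect(sort(rules, by = "lift"))
```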

 lhs             rhs            support confidence      lift
1   {}           => {sas}        0.3469388  0.3469388 1.0000000
2   {}           => {java}       0.4081633  0.4081633 1.0000000
3   {}           => {hadoop}     0.4693878  0.4693878 1.0000000
4   {}           => {sql}        0.5714286  0.5714286 1.0000000
5   {}           => {python}     0.6326531  0.6326531 1.0000000
6   {}           => {R}          0.7142857  0.7142857 1.0000000
7   {tableau}    => {R}          0.1020408  1.0000000 1.4000000
8   {javascript} => {java}       0.1224490  1.0000000 2.4500000
9   {java}       => {javascript} 0.1224490  0.3000000 2.4500000
10  {javascript} => {sql}        0.1020408  0.8333333 1.4583333
11  {javascript} => {python}     0.1020408  0.8333333 1.3172043
12  {big data}   => {hadoop}     0.1020408  0.7142857 1.5217391
13  {spark}      => {hive}       0.1224490  0.8571429 3.2307692
14  {hive}       => {spark}      0.1224490  0.4615385 3.2307692
15  {spark}      => {hadoop}     0.1224490  0.8571429 1.8260870
16  {spark}      => {R}          0.1020408  0.7142857 1.0000000
17  {perl}       => {sql}        0.1224490  1.0000000 1.7500000
18  {perl}       => {python}     0.1224490  1.0000000 1.5806452
19  {perl}       => {R}          0.1020408  0.8333333 1.1666667
20  {mapreduce}  => {hive}       0.1020408  0.5555556 2.0940171
21  {hive}       => {mapreduce}  0.1020408  0.3846154 2.0940171
22  {mapreduce}  => {hadoop}     0.1632653  0.8888889 1.8937198
23  {hadoop}     => {mapreduce}  0.1632653  0.3478261 1.8937198
24  {mapreduce}  => {R}          0.1224490  0.6666667 0.9333333
25  {ruby}       => {java}       0.1020408  0.6250000 1.5312500
26  {ruby}       => {sql}        0.1632653  1.0000000 1.7500000
27  {ruby}       => {python}     0.1428571  0.8750000 1.3830645
28  {ruby}       => {R}          0.1020408  0.6250000 0.8750000
29  {pig}        => {hive}       0.1428571  0.7777778 2.9316239
30  {hive}       => {pig}        0.1428571  0.5384615 2.9316239
31  {pig}        => {java}       0.1020408  0.5555556 1.3611111
32  {pig}        => {hadoop}     0.1428571  0.7777778 1.6570048
33  {hadoop}     => {pig}        0.1428571  0.3043478 1.6570048
34  {pig}        => {sql}        0.1224490  0.6666667 1.1666667
35  {pig}        => {python}     0.1224490  0.6666667 1.0537634
36  {pig}        => {R}          0.1632653  0.8888889 1.2444444
37  {matlab}     => {hive}       0.1224490  0.4615385 1.7396450
38  {hive}       => {matlab}     0.1224490  0.4615385 1.7396450
39  {matlab}     => {java}       0.1020408  0.3846154 0.9423077
40  {matlab}     => {hadoop}     0.1224490  0.4615385 0.9832776
41  {matlab}     => {sql}        0.1428571  0.5384615 0.9423077
42  {matlab}     => {python}     0.2040816  0.7692308 1.2158809
43  {python}     => {matlab}     0.2040816  0.3225806 1.2158809
44  {matlab}     => {R}          0.2448980  0.9230769 1.2923077
45  {R}          => {matlab}     0.2448980  0.3428571 1.2923077
46  {hive}       => {java}       0.1020408  0.3846154 0.9423077
47  {hive}       => {hadoop}     0.2040816  0.7692308 1.6387960
48  {hadoop}     => {hive}       0.2040816  0.4347826 1.6387960
49  {hive}       => {sql}        0.2040816  0.7692308 1.3461538
50  {sql}        => {hive}       0.2040816  0.3571429 1.3461538
51  {hive}       => {python}     0.1632653  0.6153846 0.9727047
52  {hive}       => {R}          0.2040816  0.7692308 1.0769231
53  {sas}        => {java}       0.1224490  0.3529412 0.8647059
54  {java}       => {sas}        0.1224490  0.3000000 0.8647059
55  {sas}        => {hadoop}     0.1428571  0.4117647 0.8772379
56  {hadoop}     => {sas}        0.1428571  0.3043478 0.8772379
57  {sas}        => {sql}        0.2040816  0.5882353 1.0294118
58  {sql}        => {sas}        0.2040816  0.3571429 1.0294118
59  {sas}        => {python}     0.2040816  0.5882353 0.9297913
60  {python}     => {sas}        0.2040816  0.3225806 0.9297913
61  {sas}        => {R}          0.3061224  0.8823529 1.2352941
62  {R}          => {sas}        0.3061224  0.4285714 1.2352941
63  {java}       => {hadoop}     0.1428571  0.3500000 0.7456522
64  {hadoop}     => {java}       0.1428571  0.3043478 0.7456522
65  {java}       => {sql}        0.2653061  0.6500000 1.1375000
66  {sql}        => {java}       0.2653061  0.4642857 1.1375000
67  {java}       => {python}     0.3469388  0.8500000 1.3435484
68  {python}     => {java}       0.3469388  0.5483871 1.3435484
69  {java}       => {R}          0.3265306  0.8000000 1.1200000
70  {R}          => {java}       0.3265306  0.4571429 1.1200000
71  {hadoop}     => {sql}        0.2448980  0.5217391 0.9130435
72  {sql}        => {hadoop}     0.2448980  0.4285714 0.9130435
73  {hadoop}     => {python}     0.2448980  0.5217391 0.8246844
74  {python}     => {hadoop}     0.2448980  0.3870968 0.8246844
75  {hadoop}     => {R}          0.3265306  0.6956522 0.9739130
76  {R}          => {hadoop}     0.3265306  0.4571429 0.9739130
77  {sql}        => {python}     0.4081633  0.7142857 1.1290323
78  {python}     => {sql}        0.4081633  0.6451613 1.1290323
79  {sql}        => {R}          0.4081633  0.7142857 1.0000000
80  {R}          => {sql}        0.4081633  0.5714286 1.0000000
81  {python}     => {R}          0.5306122  0.8387097 1.1741935
82  {R}          => {python}     0.5306122  0.7428571 1.1741935
83  {java,                                                     
     javascript} => {sql}        0.1020408  0.8333333 1.4583333
84  {javascript,                                               
     sql}        => {java}       0.1020408  1.0000000 2.4500000
85  {java,                                                     
     sql}        => {javascript} 0.1020408  0.3846154 3.1410256
86  {java,                                                     
     javascript} => {python}     0.1020408  0.8333333 1.3172043
87  {javascript,                                               
     python}     => {java}       0.1020408  1.0000000 2.4500000
88  {hive,                                                     
     spark}      => {hadoop}     0.1020408  0.8333333 1.7753623
89  {hadoop,                                                   
     spark}      => {hive}       0.1020408  0.8333333 3.1410256
90  {hadoop,                                                   
     hive}       => {spark}      0.1020408  0.5000000 3.5000000
91  {perl,                                                     
     sql}        => {python}     0.1224490  1.0000000 1.5806452
92  {perl,                                                     
     python}     => {sql}        0.1224490  1.0000000 1.7500000
93  {python,                                                   
     sql}        => {perl}       0.1224490  0.3000000 2.4500000
94  {perl,                                                     
     sql}        => {R}          0.1020408  0.8333333 1.1666667
95  {perl,                                                     
     R}          => {sql}        0.1020408  1.0000000 1.7500000
96  {perl,                                                     
     python}     => {R}          0.1020408  0.8333333 1.1666667
97  {perl,                                                     
     R}          => {python}     0.1020408  1.0000000 1.5806452
98  {hive,                                                     
     mapreduce}  => {hadoop}     0.1020408  1.0000000 2.1304348
99  {hadoop,                                                   
     mapreduce}  => {hive}       0.1020408  0.6250000 2.3557692
100 {hadoop,                                                   
     hive}       => {mapreduce}  0.1020408  0.5000000 2.7222222
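As a sanity check, each lift in the table equals the rule's confidence divided by the overall support of its right-hand side. For rule 13, {spark} => {hive}:

```r
# support of hive is implied by rule 48 ({hadoop} => {hive}): confidence / lift
supp_hive <- 0.4347826 / 1.6387960
conf_13   <- 0.8571429            # confidence of rule 13, {spark} => {hive}
conf_13 / supp_hive               # lift of rule 13, ~3.23 as reported
```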





Tuesday, November 4, 2014

Exercise to compare classifier performance


This is an exercise to compare the performance of different machine learning algorithms (classifiers), both in terms of accuracy and speed. It is not a comprehensive study; it is merely a trial to come up with a process for comparing performance. The caret package of R helps here, since it can execute different classifiers with just the change of a parameter.

The data used is an income data set that contains details about individuals - age, years of education, gender and work hours per week. It also has their salary range as a binary variable - whether they earn more than 50K or not. The goal of the classifier is to predict the income range based on the other attributes. R is used for the exercise.


Loading and exploring the data set


data <- read.csv("incomedata.csv")
str(data)

Let us look at the correlation between the predictors and the target.
library(psych)
pairs.panels(data)

Correlation Plot

All 4 predictors are weak, with correlation coefficients against Salary only in the twenties. So it would be interesting to see how different algorithms perform on this data.


Model Building and Prediction


Split training and testing sets

library(caret)
set.seed(123)  # for a reproducible split
inTrain <- createDataPartition(y = data$Salary, p = 0.7, list = FALSE)
training <- data[inTrain, ]
testing  <- data[-inTrain, ]

Predict using different models

#list of all algorithms
predlist <- c("bagFDA","lda","LogitBoost",
              "nb","nnet","rf","rpart","svmRadialCost",
              "C5.0", "glm")

results <- data.frame( Algorithm=character(), Duration=numeric(), Accuracy=numeric(),
                       stringsAsFactors=FALSE)

# loop through the algorithm list; build each model, predict, and record results

for (i in seq_along(predlist)) {
  pred <- predlist[i]
  print(paste("Algorithm =", pred))
  startTime <- as.integer(Sys.time())

  model <- train(Salary ~ ., data = training, method = pred)
  predicted <- predict(model, testing)
  cm <- confusionMatrix(predicted, testing$Salary)   # avoid masking base::matrix
  endTime <- as.integer(Sys.time())

  results[i, 1] <- pred
  results[i, 2] <- endTime - startTime               # duration in seconds
  results[i, 3] <- round(as.numeric(cm$overall["Accuracy"]) * 100, 2)
}

results
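The accuracy/speed trade-off in the results frame can also be eyeballed with a quick scatter plot (a small addition using the columns built above; Duration is recorded in seconds):

```r
# accuracy vs. execution time for each classifier
plot(results$Duration, results$Accuracy,
     xlab = "Duration (seconds)", ylab = "Accuracy (%)")
text(results$Duration, results$Accuracy,
     labels = results$Algorithm, pos = 3, cex = 0.7)
```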

Analyzing results

Classifier                                  Accuracy (%)   Execution Time (minutes)
Linear Discriminant Analysis                    80.38          0.0
Classification and Regression Trees             80.86          0.2
Generalized Linear Models                       80.53          0.2
Boosted Logistic Regression                     79.01          1.4
Decision Trees (C5.0)                           80.90          4.5
Naïve Bayes                                     80.94          4.9
Neural Networks                                 80.80          6.4
Random Forest                                   80.84         15.1
Bagging (Flexible Discriminant Analysis)        81.15         23.2
Support Vector Machines                         81.17         66.3

The table above shows the results of the comparison. Some thoughts:

1. Different classifiers perform differently depending on the data set. The correlations between predictors and the variability of predictions between runs go a long way in influencing the algorithms. Sometimes all algorithms predict the same way.

2. The differences in accuracy between many of them are only a few points. Are these differences statistically significant?
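Point 2 can be probed with a two-proportion test on any pair of classifiers — a sketch, assuming a test set of about 9,000 rows (the actual test-set size is not stated here):

```r
# compare SVM (81.17%) against LDA (80.38%)
n <- 9000                                # assumption, not the actual size
correct <- round(c(0.8117, 0.8038) * n)  # correct predictions implied by accuracy
prop.test(correct, c(n, n))              # two-sided test for a difference
```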

3. There are huge differences in execution times. For similar levels of accuracy, some algorithms run far faster than others. This indicates that for any data science project, trying out multiple algorithms and comparing their results is a worthwhile exercise; it helps in choosing the most efficient algorithm for the problem in question.

Here is a further comparison of accuracy (%) across different datasets.


Classifier                                  Income Prediction   SPAM Filter   Credit Approval
Linear Discriminant Analysis                     80.38             91.95          72.00
Classification and Regression Trees              80.86             88.59          66.00
Generalized Linear Models                        80.53             92.62          71.67
Boosted Logistic Regression                      79.01             90.60          69.33
Decision Trees (C5.0)                            80.90             89.26          71.00
Naïve Bayes                                      80.94             87.92          68.33
Neural Networks                                  80.80             91.28          71.67
Random Forest                                    80.84             89.93          70.67
Bagging (Flexible Discriminant Analysis)         81.15             93.96          70.33
Support Vector Machines                          81.17             89.93          72.33