Data Science - Illustration with Cancer Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

os.chdir('C:\\Users\\John Robertson\\Documents\\python_test')

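headerList = []
# read one column name per line, skipping blank lines
with open('field_names.txt', 'r') as headerFile:
    for line in headerFile:
        nextHeader = line.rstrip()
        if nextHeader:
            headerList.append(nextHeader)
print(headerList)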
['ID', 'diagnosis', 'radius_mean', 'radius_sd_error', 'radius_worst', 'texture_mean', 'texture_sd_error', 'texture_worst', 'perimeter_mean', 'perimeter_sd_error', 'perimeter_worst', 'area_mean', 'area_sd_error', 'area_worst', 'smoothness_mean', 'smoothness_sd_error', 'smoothness_worst', 'compactness_mean', 'compactness_sd_error', 'compactness_worst', 'concavity_mean', 'concavity_sd_error', 'concavity_worst', 'concave_points_mean', 'concave_points_sd_error', 'concave_points_worst', 'symmetry_mean', 'symmetry_sd_error', 'symmetry_worst', 'fractal_dimension_mean', 'fractal_dimension_sd_error', 'fractal_dimension_worst']
df = pd.read_csv('breast-cancer.csv', header = None, names = headerList, index_col = 0)
print("Dimensions are " + str(df.shape))
print("Number of malignant " + str(len(df[df.diagnosis == 'M'])))
print("Number of benign " + str(len(df[df.diagnosis == 'B'])))
print(df.iloc[0]) # take a look at an element
Dimensions are (569, 31)
Number of malignant 212
Number of benign 357
diagnosis                            M
radius_mean                      17.99
radius_sd_error                  10.38
radius_worst                     122.8
texture_mean                      1001
texture_sd_error                0.1184
texture_worst                   0.2776
perimeter_mean                  0.3001
perimeter_sd_error              0.1471
perimeter_worst                 0.2419
area_mean                      0.07871
area_sd_error                    1.095
area_worst                      0.9053
smoothness_mean                  8.589
smoothness_sd_error              153.4
smoothness_worst              0.006399
compactness_mean               0.04904
compactness_sd_error           0.05373
compactness_worst              0.01587
concavity_mean                 0.03003
concavity_sd_error            0.006193
concavity_worst                  25.38
concave_points_mean              17.33
concave_points_sd_error          184.6
concave_points_worst              2019
symmetry_mean                   0.1622
symmetry_sd_error               0.6656
symmetry_worst                  0.7119
fractal_dimension_mean          0.2654
fractal_dimension_sd_error      0.4601
fractal_dimension_worst         0.1189
Name: 842302, dtype: object
# create a normalized version, where each variable is centered and normalized by the std dev
df_numerical = df.drop(['diagnosis'],axis = 1)
df_numerical_norm = (df_numerical - df_numerical.mean())/df_numerical.std()
df_norm = df.loc[:,['diagnosis']].join(df_numerical_norm)
print(df_norm.iloc[0]) # take a look at an element
diagnosis                            M
radius_mean                     1.0961
radius_sd_error               -2.07151
radius_worst                   1.26882
texture_mean                   0.98351
texture_sd_error               1.56709
texture_worst                  3.28063
perimeter_mean                 2.65054
perimeter_sd_error             2.53025
perimeter_worst                2.21557
area_mean                      2.25376
area_sd_error                  2.48755
area_worst                   -0.564768
smoothness_mean                2.83054
smoothness_sd_error            2.48539
smoothness_worst             -0.213814
compactness_mean                1.3157
compactness_sd_error           0.72339
compactness_worst             0.660239
concavity_mean                 1.14775
concavity_sd_error            0.906286
concavity_worst                1.88503
concave_points_mean            -1.3581
concave_points_sd_error        2.30158
concave_points_worst           1.99948
symmetry_mean                  1.30654
symmetry_sd_error              2.61436
symmetry_worst                 2.10767
fractal_dimension_mean         2.29406
fractal_dimension_sd_error      2.7482
fractal_dimension_worst        1.93531
Name: 842302, dtype: object
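# The normalization above is a z-score: subtract the column mean and divide by the
# column's sample standard deviation. As a minimal sketch of the same arithmetic on a
# single column, using only the standard library (the function name zscore and the
# example values are illustrative assumptions, not part of the original analysis):

```python
import statistics

def zscore(values):
    """Center values on their mean and scale by the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample std dev (n - 1), the same default pandas .std() uses
    return [(v - mu) / sigma for v in values]

# Hypothetical measurements, not taken from breast-cancer.csv
example_values = [17.99, 20.57, 19.69, 11.42, 20.29]
example_z = zscore(example_values)
print([round(z, 3) for z in example_z])
```

# After this transformation every column has mean 0 and standard deviation 1, which is
# why a single set of histogram bins can be shared across all variables.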
# We want to plot the data.
# Visualization can help us recognize dangers and unusual features,
# and our end results should correspond with what we can see visually,
# so it helps prevent technical mistakes from
# leading us to wrong conclusions.
bins = np.linspace(-4,4,100)
for label in headerList[2:]:
    plt.hist(df_norm[label][df.diagnosis == 'B'],bins, alpha = .5, label = 'B')
    plt.hist(df_norm[label][df.diagnosis == 'M'],bins, alpha = .5, label = 'M')
    plt.title(label)
    plt.legend(loc='upper right')
    plt.show() # one figure per variable instead of overlaying all of them
# Compute the mean and median smoothness and compactness for benign and malignant tumors - 
# do they differ? 
# Explain how you would identify this.

# Answer. We have three columns for smoothness and three columns for compactness.
# It is not clear what smoothness_sd_error or compactness_sd_error mean. Without more understanding of the data,
# I would assume that I am being asked for the mean and median of the columns smoothness_mean and compactness_mean.
# Note that the histograms plotted above already give a meaningful answer.

# I am computing the mean and median of the normalized values, which makes it easy to tell by inspection that the difference is significant.
# Only smoothness_mean is computed below; compactness_mean can be handled the same way.

print("Malignant smoothness mean = " + str(np.mean(df_norm.smoothness_mean[df.diagnosis == 'M'])))
print("Benign smoothness mean = " + str(np.mean(df_norm.smoothness_mean[df.diagnosis == 'B'])))

print("Malignant smoothness median = " + str(np.median(df_norm.smoothness_mean[df.diagnosis == 'M'])))
print("Benign smoothness median = " + str(np.median(df_norm.smoothness_mean[df.diagnosis == 'B'])))

# If this were for a scientific study that was going to be published, I would use traditional statistical tests; some version of
# Student's t-test is the standard, I believe.
# Problems like this are classical statistics. They
# are not commonly called "big data" because they were doable before the days of terabytes of data and hardware
# capable of processing that. In fact, they could be computed (tediously) by hand before computers existed.
# By inspection, since each group has at least 200 samples and the normalized data has unit standard deviation,
# the variation in each group mean should be on the order of 1/sqrt(200), or about 1/14.
# Instead the means differ by about 1.1 and the medians by about 0.9, on the order of ten standard errors apart, which means they are genuinely different.
Malignant smoothness mean = 0.7210558324558541
Benign smoothness mean = -0.4281900181530531
Malignant smoothness median = 0.40232407996917197
Benign smoothness median = -0.502043643388796
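# The t-test mentioned above can be sketched without extra dependencies. This is a minimal
# illustration of Welch's variant (which does not assume equal variances), not the test
# the original analysis ran. The function name welch_t and the two small samples are
# assumptions made up for the example; on the real data one would pass the malignant and
# benign smoothness_mean values.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # standard error of the difference between the two sample means
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical normalized values standing in for the two diagnosis groups
malignant_like = [0.9, 0.4, 1.2, 0.7, 0.3, 1.0]
benign_like = [-0.5, -0.2, -0.7, -0.4, 0.1, -0.6]
t_stat = welch_t(malignant_like, benign_like)
print("Welch t statistic:", round(t_stat, 2))
```

# The statistic is the difference in means measured in units of its standard error, which
# is the same "how many standard errors apart" reasoning used in the comments above.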
# Write a function to generate bootstrap samples of the data

# Bootstrap samples are samples with replacement so to get a sample of N rows with replacement we would use

from random import randint

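def getSamples(n, dataFrame):
    # draw n row positions uniformly at random, with replacement
    rowCount = len(dataFrame)
    newList = [randint(0, rowCount - 1) for _ in range(n)]
    # index into the frame that was passed in, not the global df
    return dataFrame.iloc[newList]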
# test our bootstrap function to generate a set of 10 samples

getSamples(10, df)
diagnosis radius_mean radius_sd_error radius_worst texture_mean texture_sd_error texture_worst perimeter_mean perimeter_sd_error perimeter_worst ... concavity_worst concave_points_mean concave_points_sd_error concave_points_worst symmetry_mean symmetry_sd_error symmetry_worst fractal_dimension_mean fractal_dimension_sd_error fractal_dimension_worst
865468 B 13.37 16.39 86.10 553.5 0.07115 0.07325 0.080920 0.028000 0.1422 ... 14.260 22.75 91.99 632.1 0.10250 0.25310 0.33080 0.08978 0.2048 0.07628
89346 B 9.00 14.40 56.36 246.3 0.07005 0.03116 0.003681 0.003472 0.1788 ... 9.699 20.07 60.90 285.5 0.09861 0.05232 0.01472 0.01389 0.2991 0.07804
874858 M 14.22 23.12 94.37 609.9 0.10750 0.24130 0.198100 0.066180 0.2384 ... 15.740 37.18 106.40 762.4 0.15330 0.93270 0.84880 0.17720 0.5166 0.14460
857010 M 18.65 17.60 123.70 1076.0 0.10990 0.16860 0.197400 0.100900 0.1907 ... 22.820 21.32 150.60 1567.0 0.16790 0.50900 0.73450 0.23780 0.3799 0.09185
869218 B 11.43 17.31 73.66 398.0 0.10920 0.09486 0.020310 0.018610 0.1645 ... 12.780 26.76 82.66 503.0 0.14130 0.17920 0.07708 0.06402 0.2584 0.08096
864726 B 8.95 15.76 58.74 245.2 0.09462 0.12430 0.092630 0.023080 0.1305 ... 9.414 17.07 63.34 270.0 0.11790 0.18790 0.15440 0.03846 0.1652 0.07722
901028 B 13.87 16.21 88.52 593.7 0.08743 0.05492 0.015020 0.020880 0.1424 ... 15.110 25.58 96.74 694.4 0.11530 0.10080 0.05285 0.05556 0.2362 0.07113
923465 B 10.82 24.21 68.89 361.6 0.08192 0.06602 0.015480 0.008160 0.1976 ... 13.030 31.45 83.90 505.6 0.12040 0.16330 0.06194 0.03264 0.3059 0.07626
91376702 B 17.85 13.23 114.60 992.1 0.07838 0.06217 0.044450 0.041780 0.1220 ... 19.820 18.42 127.10 1210.0 0.09862 0.09976 0.10480 0.08341 0.1783 0.05871
91544002 B 11.06 17.12 71.25 366.5 0.11940 0.10710 0.040630 0.042680 0.1954 ... 11.690 20.74 76.08 411.1 0.16620 0.20310 0.12560 0.09514 0.2780 0.11680

10 rows × 31 columns
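
# Bootstrap samples are usually drawn many times in order to estimate how much a statistic
# varies. As a hedged sketch of that use, here is a percentile interval for a mean built
# from resamples with replacement. It uses plain lists and the standard library rather than
# getSamples; the function names, the seed, and the example values are assumptions chosen
# for illustration.

```python
import random
import statistics

def bootstrap_sample(values, rng):
    """Draw len(values) items from values uniformly at random, with replacement."""
    return rng.choices(values, k=len(values))

def bootstrap_mean_interval(values, n_resamples=1000, seed=0):
    """Approximate 95% percentile interval for the mean of values."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        statistics.mean(bootstrap_sample(values, rng)) for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper

# Hypothetical measurements, not taken from breast-cancer.csv
example_values = [13.4, 9.0, 14.2, 18.7, 11.4, 9.0, 13.9, 10.8, 17.9, 11.1]
low, high = bootstrap_mean_interval(example_values)
print("bootstrap 95% interval for the mean:", round(low, 2), round(high, 2))
```

# The same idea applies to a DataFrame: call getSamples(len(df), df) repeatedly and
# recompute the statistic of interest on each resample.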

# Variable importance from a random-forest-style ensemble
# (here extremely randomized trees, ExtraTreesClassifier) is a common way
# to pick out which variables are most important

from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators = 500)
forest.fit(df_numerical_norm, df.diagnosis) # fit on the normalized features so feature_importances_ is available
importances = forest.feature_importances_
# importance_stds = np.std([tree.feature_importances_ for tree in forest.estimators_], axis = 0)
importance_indices = np.argsort( importances )[::-1]
for i in range(df_numerical_norm.shape[1]):
    print( list(df_numerical_norm)[i] + " " + str(importances[i]))
print("In order of importance")
for i in range(df_numerical_norm.shape[1]):
    j = importance_indices[i]
    print( list(df_numerical_norm)[j] + " " + str(importances[j]))

plt.plot(range( len(importance_indices)), importances[ importance_indices ], 'ro')
radius_mean 0.053421045747899666
radius_sd_error 0.018758082281329424
radius_worst 0.059980675352813054
texture_mean 0.05582382656416694
texture_sd_error 0.010970244884948332
texture_worst 0.018674215638445447
perimeter_mean 0.061087710397355624
perimeter_sd_error 0.09380445329204413
perimeter_worst 0.007643975139285527
area_mean 0.006275077476443084
area_sd_error 0.02357754182383073
area_worst 0.005203512929981728
smoothness_mean 0.020029228669550432
smoothness_sd_error 0.037318116760044644
smoothness_worst 0.006180432373471859
compactness_mean 0.0068107591790187855
compactness_sd_error 0.008274600673487047
compactness_worst 0.009290031042625912
concavity_mean 0.006046108181319825
concavity_sd_error 0.00630317987710586
concavity_worst 0.0947999219538862
concave_points_mean 0.025900099207817617
concave_points_sd_error 0.08206509580588969
concave_points_worst 0.08425461459043444
symmetry_mean 0.01957570597038478
symmetry_sd_error 0.02465788075135929
symmetry_worst 0.041246989728128326
fractal_dimension_mean 0.0868849053263718
fractal_dimension_sd_error 0.015133958397684163
fractal_dimension_worst 0.010008009982875762
In order of importance
concavity_worst 0.0947999219538862
perimeter_sd_error 0.09380445329204413
fractal_dimension_mean 0.0868849053263718
concave_points_worst 0.08425461459043444
concave_points_sd_error 0.08206509580588969
perimeter_mean 0.061087710397355624
radius_worst 0.059980675352813054
texture_mean 0.05582382656416694
radius_mean 0.053421045747899666
symmetry_worst 0.041246989728128326
smoothness_sd_error 0.037318116760044644
concave_points_mean 0.025900099207817617
symmetry_sd_error 0.02465788075135929
area_sd_error 0.02357754182383073
smoothness_mean 0.020029228669550432
symmetry_mean 0.01957570597038478
radius_sd_error 0.018758082281329424
texture_worst 0.018674215638445447
fractal_dimension_sd_error 0.015133958397684163
texture_sd_error 0.010970244884948332
fractal_dimension_worst 0.010008009982875762
compactness_worst 0.009290031042625912
compactness_sd_error 0.008274600673487047
perimeter_worst 0.007643975139285527
compactness_mean 0.0068107591790187855
concavity_sd_error 0.00630317987710586
area_mean 0.006275077476443084
smoothness_worst 0.006180432373471859
concavity_mean 0.006046108181319825
area_worst 0.005203512929981728
# The plot of variable importance using random forests is very useful
# Offhand, it is not necessarily best to just grab the top 3 or 5 
# most important variables. We see distinct groups of variables with 
# comparable importance in this plot, and it may be that they have comparable
# importance because they are strongly correlated, i.e. possibly variables ranked 
# 6,7,8 above are so close in importance because they are tightly correlated
# and each one gives no more information than the others. But we have cut down the 
# playing field of interesting variables significantly.
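# One way to check whether similarly ranked variables are redundant is to compute their
# pairwise correlation. The sketch below implements the Pearson coefficient by hand on two
# made-up columns; the function name pearson_r and the numbers are assumptions for
# illustration. With pandas, df_numerical.corr() computes the same quantity for every
# pair of columns at once.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical feature columns: a radius and a perimeter that is nearly 2*pi*radius
radius_like = [6.0, 9.5, 11.0, 14.2, 17.9, 20.5]
perimeter_like = [38.1, 59.4, 69.5, 88.9, 112.8, 128.3]
r = pearson_r(radius_like, perimeter_like)
print("correlation:", round(r, 3))
```

# A coefficient close to 1 or -1 suggests the two variables carry nearly the same
# information, so keeping both adds little to a model.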
# Identify 2-3 variables that are predictive of a malignant tumor.
# Display the relationship visually and write 1-2 sentences explaining the relationship.

# Two of the strongest ones are fractal_dimension_mean and concavity_worst, and malignant tumors
# have larger values of both of those. I don't know precisely how those geometric quantities were
# measured. Offhand, it sounds like this means malignant tumors have a more pitted and crinkled surface.
# I have already displayed the relationship visually with the histograms above.

plt.plot(df_norm.fractal_dimension_mean[df.diagnosis == 'B'], df_norm.concavity_worst[df.diagnosis == 'B'],'o', alpha = 0.2, label='Benign')
plt.plot(df_norm.fractal_dimension_mean[df.diagnosis == 'M'], df_norm.concavity_worst[df.diagnosis == 'M'],'o', alpha = 0.2, label='Malignant')
plt.legend(loc = 'upper left')
# Plotting these two variables for both groups together, it appears that they are not too strongly correlated
# and that each of these two variables independently helps reduce the overlap between malignant and benign tumors.
# That is, the x,y pairs are more separated than either the x coordinates alone or the y coordinates alone would be.
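# sklearn.cross_validation was removed from scikit-learn; cross_val_score now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_score
X = df_numerical_norm
Y = df_norm.diagnosis
forest = ExtraTreesClassifier(n_estimators = 500)
forest_result = cross_val_score(forest, X, Y, cv = 5)
print(forest_result)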
[0.94782609 0.96521739 0.98230088 0.96460177 0.96460177]
# These scores are the portion of correctly classified samples. 
# These are good scores, and they are consistent scores.
# One limitation of cross_val_score is that it returns only the scores on the
# held-out folds, not on the training folds (sklearn.model_selection.cross_validate
# with return_train_score=True reports both).
# You will normally see better scores on the training set than on the 
# test set. But if you see significantly better scores on the 
# training set than on the test set, that is because you are overfitting
# the data. Effectively, these scores are so high that we 
# know we are not overfitting dramatically anyway.
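# To make the train-versus-test comparison concrete, here is a dependency-free sketch of
# k-fold cross-validation that reports both scores for each fold. The classifier is a toy
# nearest-centroid rule on one feature, standing in for the forest; the function names and
# the data are assumptions invented for the example. In scikit-learn itself, cross_validate
# with return_train_score=True reports both scores.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def fit_centroids(xs, labels):
    """Nearest-centroid 'model': the mean feature value of each class."""
    centroids = {}
    for label in set(labels):
        members = [x for x, lab in zip(xs, labels) if lab == label]
        centroids[label] = sum(members) / len(members)
    return centroids

def accuracy(centroids, xs, labels):
    """Fraction of samples whose nearest centroid matches their label."""
    hits = 0
    for x, label in zip(xs, labels):
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        hits += predicted == label
    return hits / len(xs)

# Hypothetical one-feature data set, interleaved so every fold sees both classes
xs = [0.9, -0.5, 1.2, -0.2, 0.3, -0.7, 1.0, -0.4, 0.7, 0.1, 0.4, -0.6]
labels = ['M', 'B'] * 6

for train_idx, test_idx in k_fold_indices(len(xs), k=3):
    train_x = [xs[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    test_x = [xs[i] for i in test_idx]
    test_y = [labels[i] for i in test_idx]
    model = fit_centroids(train_x, train_y)
    print("train accuracy", round(accuracy(model, train_x, train_y), 2),
          "test accuracy", round(accuracy(model, test_x, test_y), 2))
```

# A training accuracy far above the test accuracy on the same fold would be the
# overfitting signal described above.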
# I already determined the most important variables in a random forest model.
# I like SVMs but they are poor at helping you identify the most important variables
# So the second model below (an SVC) is used for prediction only, not for ranking variables
from sklearn.svm import SVC
svm = SVC()
svmResult = cross_val_score(svm, X, Y, cv = 5)
svmResult
array([0.97391304, 0.96521739, 1.        , 0.96460177, 0.97345133])
# It is not easy from an SVM to determine what the most important variables are.
# SVMs are more of a black box. They are best where there is 
# sparse data and you want a black box predictor rather than insight
# about the meaning of the predictions. That is why they
# are used so frequently in computer vision, where the data is ALWAYS sparse.
# The cross-validated scores are very high, so we know we did not overfit badly.
# Note that SVC() uses an RBF kernel by default rather than a linear one. Overfitting with a
# linear SVM is very unlikely; they specialize in being robust against overfitting.

