Data Science - Illustration with Cancer Data

python_test
In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [3]:
import os
os.getcwd()
Out[3]:
'C:\\Users\\John Robertson\\Documents'
In [4]:
os.chdir('C:\\Users\\John Robertson\\Documents\\python_test')
os.getcwd()
Out[4]:
'C:\\Users\\John Robertson\\Documents\\python_test'
In [7]:
headerList = []
headerFile = open('field_names.txt','r')
for line in headerFile:
    nextHeader = line.rstrip()
    if nextHeader:
        headerList.append(nextHeader)
headerFile.close()
print(headerList)
['ID', 'diagnosis', 'radius_mean', 'radius_sd_error', 'radius_worst', 'texture_mean', 'texture_sd_error', 'texture_worst', 'perimeter_mean', 'perimeter_sd_error', 'perimeter_worst', 'area_mean', 'area_sd_error', 'area_worst', 'smoothness_mean', 'smoothness_sd_error', 'smoothness_worst', 'compactness_mean', 'compactness_sd_error', 'compactness_worst', 'concavity_mean', 'concavity_sd_error', 'concavity_worst', 'concave_points_mean', 'concave_points_sd_error', 'concave_points_worst', 'symmetry_mean', 'symmetry_sd_error', 'symmetry_worst', 'fractal_dimension_mean', 'fractal_dimension_sd_error', 'fractal_dimension_worst']
In [8]:
df = pd.read_csv('breast-cancer.csv', header = None, names = headerList, index_col = 0)
In [112]:
print("Dimensions are " + str(df.shape))
print("Number of malignant " + str(len(df[df.diagnosis == 'M'])))
print("Number of benign " + str(len(df[df.diagnosis == 'B'])))
print(df.iloc[0]) # take a look at an element
Dimensions are (569, 31)
Number of malignant 212
Number of benign 357
diagnosis                            M
radius_mean                      17.99
radius_sd_error                  10.38
radius_worst                     122.8
texture_mean                      1001
texture_sd_error                0.1184
texture_worst                   0.2776
perimeter_mean                  0.3001
perimeter_sd_error              0.1471
perimeter_worst                 0.2419
area_mean                      0.07871
area_sd_error                    1.095
area_worst                      0.9053
smoothness_mean                  8.589
smoothness_sd_error              153.4
smoothness_worst              0.006399
compactness_mean               0.04904
compactness_sd_error           0.05373
compactness_worst              0.01587
concavity_mean                 0.03003
concavity_sd_error            0.006193
concavity_worst                  25.38
concave_points_mean              17.33
concave_points_sd_error          184.6
concave_points_worst              2019
symmetry_mean                   0.1622
symmetry_sd_error               0.6656
symmetry_worst                  0.7119
fractal_dimension_mean          0.2654
fractal_dimension_sd_error      0.4601
fractal_dimension_worst         0.1189
Name: 842302, dtype: object
In [9]:
# create a normalized version, where each variable is centered and normalized by the std dev
df_numerical = df.drop(['diagnosis'],axis = 1)
df_numerical_norm = (df_numerical - df_numerical.mean())/df_numerical.std()
df_norm = df.loc[:,['diagnosis']].join(df_numerical_norm)
print(df_norm.iloc[0]) # take a look at an element
diagnosis                            M
radius_mean                     1.0961
radius_sd_error               -2.07151
radius_worst                   1.26882
texture_mean                   0.98351
texture_sd_error               1.56709
texture_worst                  3.28063
perimeter_mean                 2.65054
perimeter_sd_error             2.53025
perimeter_worst                2.21557
area_mean                      2.25376
area_sd_error                  2.48755
area_worst                   -0.564768
smoothness_mean                2.83054
smoothness_sd_error            2.48539
smoothness_worst             -0.213814
compactness_mean                1.3157
compactness_sd_error           0.72339
compactness_worst             0.660239
concavity_mean                 1.14775
concavity_sd_error            0.906286
concavity_worst                1.88503
concave_points_mean            -1.3581
concave_points_sd_error        2.30158
concave_points_worst           1.99948
symmetry_mean                  1.30654
symmetry_sd_error              2.61436
symmetry_worst                 2.10767
fractal_dimension_mean         2.29406
fractal_dimension_sd_error      2.7482
fractal_dimension_worst        1.93531
Name: 842302, dtype: object
In [10]:
# We want to plot the data
# Visualization can help us recognize dangers, unusual features, 
# and our end results should correspond with what we can see visually
# so it helps prevent techinical mistakes from 
# leading us to wrong conclusions
for label in headerList[2:]:
    bins = np.linspace(-4,4,100)
    plt.hold(True)
    plt.hist(df_norm[label][df.diagnosis == 'B'],bins, alpha = .5, label = 'B')
    plt.hist(df_norm[label][df.diagnosis == 'M'],bins, alpha = .5, label = 'M')
    plt.legend(loc='upper right')
    plt.suptitle(label)
    plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: MatplotlibDeprecationWarning: pyplot.hold is deprecated.
    Future behavior will be consistent with the long-time default:
    plot commands add elements without first clearing the
    Axes and/or Figure.
  
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\__init__.py:805: MatplotlibDeprecationWarning: axes.hold is deprecated. Please remove it from your matplotlibrc and/or style files.
  mplDeprecation)
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\rcsetup.py:155: MatplotlibDeprecationWarning: axes.hold is deprecated, will be removed in 3.0
  mplDeprecation)
In [11]:
# Compute the mean and median smoothness and compactness for benign and malignant tumors - 
# do they differ? 
# Explain how you would identify this.

# Answer. We have three columns for smoothness and three columns for compactness. 
# It is not clear what smoothness_sd_error or compactness_sd_error mean. Without more understanding of the data
# I would assume that I am being asked for the mean and median of the columns smoothness_mean and compactness_mean
# it should be noted that the visual histograms plotted above give a meaningful answer already. 

# I am computing a normalized mean and median which makes it easy to tell by inspection that their difference is significant

print("Malignant smoothness mean = " + str(np.mean(df_norm.smoothness_mean[df.diagnosis == 'M'])))
print("Benign smoothness mean = " + str(np.mean(df_norm.smoothness_mean[df.diagnosis == 'B'])))

print("Malignant smoothness median = " + str(np.median(df_norm.smoothness_mean[df.diagnosis == 'M'])))
print("Benign smoothness median = " + str(np.median(df_norm.smoothness_mean[df.diagnosis == 'B'])))

# If this was for a scientific study was going to be published then I would use traditional statistical tests -- some version of 
# students t-test is the standard I believe.
# Problems like this that are classical statistics
# are not commonly called "big data" because they were doable before the days of terabytes of data and hardware
# capable of processing that. In fact, they could be computed (tediously) before computers existed by hand. 
# By inspection, since there are 500 computations, the variation in the means should be on magnitute of 1/sqrt(200) or about 1/14
# But instead they differ by 1.1 and by .9. So they are roughly 8 sample deviations apart, which means they are genuinely different
Malignant smoothness mean = 0.7210558324558541
Benign smoothness mean = -0.4281900181530531
Malignant smoothness median = 0.40232407996917197
Benign smoothness median = -0.502043643388796
In [12]:
# Write a function to generate bootstrap samples of the data

# Bootstrap samples are samples with replacement so to get a sample of N rows with replacement we would use

from random import randint

def getSamples(n,dataFrame):
    newList = []
    rowCount = len(dataFrame)
    for i in range(n):
        newList.append(randint(0, rowCount-1))
    return df.iloc[newList]
In [13]:
# test our Bootstrap function to generate a set of 10 samples

getSamples(10,df_norm)
Out[13]:
diagnosis radius_mean radius_sd_error radius_worst texture_mean texture_sd_error texture_worst perimeter_mean perimeter_sd_error perimeter_worst ... concavity_worst concave_points_mean concave_points_sd_error concave_points_worst symmetry_mean symmetry_sd_error symmetry_worst fractal_dimension_mean fractal_dimension_sd_error fractal_dimension_worst
ID
865468 B 13.37 16.39 86.10 553.5 0.07115 0.07325 0.080920 0.028000 0.1422 ... 14.260 22.75 91.99 632.1 0.10250 0.25310 0.33080 0.08978 0.2048 0.07628
89346 B 9.00 14.40 56.36 246.3 0.07005 0.03116 0.003681 0.003472 0.1788 ... 9.699 20.07 60.90 285.5 0.09861 0.05232 0.01472 0.01389 0.2991 0.07804
874858 M 14.22 23.12 94.37 609.9 0.10750 0.24130 0.198100 0.066180 0.2384 ... 15.740 37.18 106.40 762.4 0.15330 0.93270 0.84880 0.17720 0.5166 0.14460
857010 M 18.65 17.60 123.70 1076.0 0.10990 0.16860 0.197400 0.100900 0.1907 ... 22.820 21.32 150.60 1567.0 0.16790 0.50900 0.73450 0.23780 0.3799 0.09185
869218 B 11.43 17.31 73.66 398.0 0.10920 0.09486 0.020310 0.018610 0.1645 ... 12.780 26.76 82.66 503.0 0.14130 0.17920 0.07708 0.06402 0.2584 0.08096
864726 B 8.95 15.76 58.74 245.2 0.09462 0.12430 0.092630 0.023080 0.1305 ... 9.414 17.07 63.34 270.0 0.11790 0.18790 0.15440 0.03846 0.1652 0.07722
901028 B 13.87 16.21 88.52 593.7 0.08743 0.05492 0.015020 0.020880 0.1424 ... 15.110 25.58 96.74 694.4 0.11530 0.10080 0.05285 0.05556 0.2362 0.07113
923465 B 10.82 24.21 68.89 361.6 0.08192 0.06602 0.015480 0.008160 0.1976 ... 13.030 31.45 83.90 505.6 0.12040 0.16330 0.06194 0.03264 0.3059 0.07626
91376702 B 17.85 13.23 114.60 992.1 0.07838 0.06217 0.044450 0.041780 0.1220 ... 19.820 18.42 127.10 1210.0 0.09862 0.09976 0.10480 0.08341 0.1783 0.05871
91544002 B 11.06 17.12 71.25 366.5 0.11940 0.10710 0.040630 0.042680 0.1954 ... 11.690 20.74 76.08 411.1 0.16620 0.20310 0.12560 0.09514 0.2780 0.11680

10 rows × 31 columns

In [14]:
# Random forest variable importance is a common way
# to pick out which variables are most important

from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators = 500)
forest.fit(df_numerical_norm, df.diagnosis)
importances = forest.feature_importances_
# importance_stds = np.std([tree.feature_importances_ for tree in forest.estimators_], axis = 0)
importance_indices = np.argsort( importances )[::-1]
for i in range(df_numerical_norm.shape[1]):
    print( list(df_numerical_norm)[i] + " " + str(importances[i]))
print("-------------")
print("In order of importance")
for i in range(df_numerical_norm.shape[1]):
    j = importance_indices[i]
    print( list(df_numerical_norm)[j] + " " + str(importances[j]))

plt.plot(range( len(importance_indices)), importances[ importance_indices ], 'ro')
plt.show()
radius_mean 0.053421045747899666
radius_sd_error 0.018758082281329424
radius_worst 0.059980675352813054
texture_mean 0.05582382656416694
texture_sd_error 0.010970244884948332
texture_worst 0.018674215638445447
perimeter_mean 0.061087710397355624
perimeter_sd_error 0.09380445329204413
perimeter_worst 0.007643975139285527
area_mean 0.006275077476443084
area_sd_error 0.02357754182383073
area_worst 0.005203512929981728
smoothness_mean 0.020029228669550432
smoothness_sd_error 0.037318116760044644
smoothness_worst 0.006180432373471859
compactness_mean 0.0068107591790187855
compactness_sd_error 0.008274600673487047
compactness_worst 0.009290031042625912
concavity_mean 0.006046108181319825
concavity_sd_error 0.00630317987710586
concavity_worst 0.0947999219538862
concave_points_mean 0.025900099207817617
concave_points_sd_error 0.08206509580588969
concave_points_worst 0.08425461459043444
symmetry_mean 0.01957570597038478
symmetry_sd_error 0.02465788075135929
symmetry_worst 0.041246989728128326
fractal_dimension_mean 0.0868849053263718
fractal_dimension_sd_error 0.015133958397684163
fractal_dimension_worst 0.010008009982875762
-------------
In order of importance
concavity_worst 0.0947999219538862
perimeter_sd_error 0.09380445329204413
fractal_dimension_mean 0.0868849053263718
concave_points_worst 0.08425461459043444
concave_points_sd_error 0.08206509580588969
perimeter_mean 0.061087710397355624
radius_worst 0.059980675352813054
texture_mean 0.05582382656416694
radius_mean 0.053421045747899666
symmetry_worst 0.041246989728128326
smoothness_sd_error 0.037318116760044644
concave_points_mean 0.025900099207817617
symmetry_sd_error 0.02465788075135929
area_sd_error 0.02357754182383073
smoothness_mean 0.020029228669550432
symmetry_mean 0.01957570597038478
radius_sd_error 0.018758082281329424
texture_worst 0.018674215638445447
fractal_dimension_sd_error 0.015133958397684163
texture_sd_error 0.010970244884948332
fractal_dimension_worst 0.010008009982875762
compactness_worst 0.009290031042625912
compactness_sd_error 0.008274600673487047
perimeter_worst 0.007643975139285527
compactness_mean 0.0068107591790187855
concavity_sd_error 0.00630317987710586
area_mean 0.006275077476443084
smoothness_worst 0.006180432373471859
concavity_mean 0.006046108181319825
area_worst 0.005203512929981728
In [15]:
# The plot of variable importance using random forests is very useful
# Offhand, it is not necessarily best to just grab the top 3 or 5 
# most important variables. We see distinct groups of variables with 
# comparable importance in this plot, and it may be that they have comparable
# importance because they are strongly correlated, i.e. possibly variables ranked 
# 6,7,8 above are so close in importance because they are tightly correlated
# and each one gives no more information than the others. But we have cut down the 
# playing field of interesting variables significantly.
In [37]:
# Identify 2-3 variables that are predictive of a malignant tumor.
# Display the relationship visually and write 1-2 sentences explaining the relationship.

# The two strongest ones are fractal_dimension_mean and concavity_worst and malignant tumors 
# have larger values of both of those. I don't know precisely how those geometric quantities were
# measured. Offhand, one sounds like it means malignant tumors have a more pitted and crinkled surface.
# I have already displayed the relationship visually with the histograms above.

plt.plot(df_norm.fractal_dimension_mean[df.diagnosis == 'B'], df_norm.concavity_worst[df.diagnosis == 'B'],'o', alpha = 0.2, label='Benign')
plt.plot(df_norm.fractal_dimension_mean[df.diagnosis == 'M'], df_norm.concavity_worst[df.diagnosis == 'M'],'o', alpha = 0.2, label='Malignant')
plt.legend(loc = 'upper left')
plt.axis('scaled')
plt.show()
# Plotting these two variables for both groups together it appears that they are not too strongly correlated 
# and that each of these two variable independently helps reduce the overlap between malignant and benign tumors
# That is, the x,y pairs are more separated than either the x coordinates alone or the y coordinates alone would be
In [18]:
from sklearn.cross_validation import cross_val_score
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [19]:
X = df_numerical_norm
Y = df_norm.diagnosis
In [20]:
forest = ExtraTreesClassifier(n_estimators = 500)
In [21]:
forest_result = cross_val_score(forest, X, Y, cv = 5)
In [22]:
print(forest_result)
[0.94782609 0.96521739 0.98230088 0.96460177 0.96460177]
In [23]:
# These scores are the portion of correctly classified samples. 
# These are good scores, and they are consistent scores.
# One of the downsides of cross validation in python is that it doesn't
# return the scores on the training set as well as on the test set.
# You will normally see better scores on the training set than on the 
# test set. But if you see significantly better scores on the 
# training set than on the test set, that is because you are overfitting
# the data. Effectively, these scores are so high that we 
# know we are not overfitting dramatically anyway.
In [24]:
# I already determine the most important variables in a random forest model
In [25]:
# I like SVMs but they are poor at helping you identify the most important variables
# So for the second case I will just use linear regression
In [26]:
from sklearn.svm import SVC
In [27]:
svm = SVC()
In [28]:
svmResult = cross_val_score(svm, X, Y, cv = 5)
In [29]:
svmResult
Out[29]:
array([0.97391304, 0.96521739, 1.        , 0.96460177, 0.97345133])
In [30]:
# It is not easy from an SVM to determine what the most important variables are.
# SVMs are more of a black box. They are best where there is 
# sparse data and you want a black box predictor rather than insight
# about the meaning of the predictions. That is why they
# are used so frequently in computer vision when the data is ALWAYS sparse.
# We know we didn't overfit because the results are so very high. 
# Overfitting with a linear SVM is very unlikely. They specialize in being robust
# against overfitting.

No comments:

Post a Comment

More about rotations in 4 dimensions.

It is easy to take a quaternion and recognize the rotation it actually represents. This is about doing the same thing for 4 dimensional rota...