Loading [MathJax]/jax/output/CommonHTML/jax.js

More about rotations in 4 dimensions.

It is easy to take a quaternion and recognize the rotation it actually represents. This is about doing the same thing for 4 dimensional rotations. It is easy to multiply two four dimensional rotations together. Can we extract the orthogonal planes of rotation and their angles of rotations from that. We can represent a rotation as a product of two vectors ab via w mapstobavab and this will be the rotation of the ab plane which leaves the space orthogonal to that plane pointwise fixed. It will be the rotations from the vector a to the vector b applied twice. This specifies both the direction and that amount of the rotation. Since ab=g(a,b)+ab then this rotation is cos(α/2)+sin(α/2)ω where α is the amount of rotation and ω is ab scaled to have norm one by a positive scalar constant. The product ab specifies the direction in which the rotation α is applied, since any vectors cd which equals ab specifies the direction of the rotation as it is a rotation of angle α rotating starting from c and moving toward d. We can extract the plane a,b from the product ab because it is those vectors u with u(ab)=0. This is a system of 4 homogeneous equations which has rank 2 over R4. We can extract the orthogonal plane ϕ=cd to the a,b plane from ω alone. Write ω=r12e1e2+r13e1e3+r14e1e4+r23e2e3+r24e2e4+r34e3e4 where r12r34r13r24+r14r23=0. Then the orthogonal plane is ψ=r34e1e2r24e1e3+r23e1e4+r14e2e3r13e2e4+r12e3e4. Then ωψ=(r12e1e2+r13e1e3+r14e1e4+r23e2e3+r24e2e4+r34e3e4)(r34e1e2r24e1e3+r23e1e4+r14e2e3r13e2e4+r12e3e4). The constant term is r12r34+r13r24r14r23r23r14+r24r13r34r12=0. The e1e2e3e4 term is r212+r213+r214+r223+r224+r234. The e1e2 term of ωψ is r13r14e1e3e2e3r14r13e1e4e2e4r23r24e2e3e1e3+r24r23e2e4e1e4 This simplifies to zero. The other grade 2 terms should simply to zero as well. Then (a+bω)(c+dψ)=ac+(bcω+adψ)+bdωψ. So if ω and ψ were norm one then ωψ=e1e2e3e4. Thus our product becomes ac+(bcω+adψ)+bde1e2e3e4. If a=cos(α/2) and b=sin(α/2) and if c=cos(β/2) and d=sin(β/2) then ac=cos(α/2)cos(β/2) and bd=sin(α/2)sin(β/2) and consequently acbd=cos((βα)/2) and ac+bd=cos((α+β)/2). Since ac and bd are known (being the constant and the volume coefficients) then we can solve this for α and β. Now ad+bc=sin((α+β)/2) And adbc=sin((βα)/2). Now bcω+adψ=12(ad+bc)(ψ+ω)+12(adbc)(ψω). This then equals 12sin((α+β)/2)(ψ+ω)+12sin((βα)/2)(ψω) This is useful because ac and bd are not known. But we can find sin((α+β)/2) and sin((βα)/2) up to sign (of each) using ac and bd. This then becomes easy to solve for the actual coefficients rij of ω (and ψ).

Some notes on geometric algebra in 4 dimensions

Working in R4 here with the geometric algebra corresponding to the usual norm, i.e. eiej=ejei for ij and eiei=1. The inverse of a vector v is just v/v2. A geometric algebra is an algebra with zero divisors, for if v is a vector of norm 1 so vv=1 then 12(1+v) is idempotent and is also a zero divisor (by multiplying by (1v). An element of grade k is often called a k vector. Now the product of two vectors a,b is g(a,b)+ab. The product of a and bag(a,b)/g(a,a) is therefore g(a,b)g(a,a)g(a,b)/g(a,a)+(a(bag(a,b)/g(a,a)))=ab since aa=0. So the grade 2 part of a 2 blade is also always a 2 blade. Put differently, ab is always a 2 blade. Now given a 2 blade ab then as above we can assume without loss of generality that g(a,b)=0 and we can also assume that a and b are unit vectors. Then for any scalar α the two blade a(b+αa)=g(a,b+αa)+a(b+αa)=α+ab. So the grade 0 part is irrelevant to whether something is a 2 blade. If A=uv for othogonormal vectors u,v then AA=uvuv=uuvv=1. Let A=r12e1e2+r13e1e3+r14e1e4+r23e2e3+r24e2e4+r34e3e4. Then AA=1 means (r12e1e2+r13e1e3+r14e1e4+r23e2e3+r24e2e4+r34e3e4)(r12e1e2+r13e1e3+r14e1e4+r23e2e3+r24e2e4+r34e3e4)=1 Now rijeiejrijeiej=r2ijeiejeiej=r2ijeieiejej=r2ij. For the mixed terms things are uglier. The e1e2e3e4 term is r12r34e1e2e3e4+r13r24e1e3e2e4+r14r23e1e4e2e3+r23r14e2e3e1e4+r24r13e2e4e1e3+r34r12e3e4e1e2 This is equal to (r12r34r13r24+r14r23+r23r14r24r13+r34r12). Thus r12r34r13r24+r14r23=0 for a blade. The e1e2 part of AA is r13r23e1e3e2e3+r14r24e1e4e2e4+r23r13e2e3e1e3+r24r14e2e4e1e4=(r13r32r14r24+r13r23+r24r14)e1e2=0. Let B=r12e1e2+r13e1e3+r14e1e4+r23e2e3+r24e2e4+r34e3e4 in R4 and assume that r12r34r13r24+r14r23=0. How big is the space of vectors u with uB=0? If u=x1e1+x2e2+x3e3+x4e4 then uB=0 becomes x1r23e1e2e3+x1r24e1e2e4+x1r34e1e3e4+x2r13e2e1e3+x2r14e2e1e4+x2r34e2e3e4+x3r12e3e1e2+x3r14e3e1e4+x3r24e3e2e4+x4r12e4e1e2+x4r13e4e1e3+x4r23e4e2e3=0 This then becomes the system of 4 homogeneous equations: x1r23x2r13+x3r12=0, x1r24x2r14+x4r12=0, x1r34x3r14+x4r13=0, x2r34x3r24+x4r23=0. Taking r14 times the first equation minus r13 times the second we obtain x1(r14r23r13r24)+x3r14r12x4r13r12=0 If r12r34r13r24+r14r23=0 then we can rewrite this as x1r12r34+x3r14r12x4r13r12=0 which is r12 times the third equation. Similarly if we take the linear combination of the first two equations which eliminates the x1 term we should obtain the fourth equation. Thus the system has rank at most two. If B is a wedge of two orthonormal vectors then we can obtain the span of those vectors as the space of vectors u satisfying uB=0.

Rotations in 4D

So the geometric algebra specifies a rotation on Rn in quite a nice form. Given unit vectors a,b then defining σ:vbavab we get the σ is a rotation that leaves everything perpendicular to the a,b plan pointwise fixed, and within the a,b plane σ is the rotation from a to b applied twice (so that the rotation of σ is twice the angle between a and b). Now in the plane if we choose a to be e1=(1,0) and we choose b=cos(θ/2)e1+sin(θ/2)e2=(cos(θ/2),sin(θ/2)) so that applying the rotation from a to b twice is a counterclockwise rotation of the plane by angle θ then the geometric product ab=cos(θ/2)+sin(θ/2)e1e2 and the geometric ba=cos(θ/2)sin(θ/2)e1e2. Now in 4 dimensions a rotation is made up of a plane, and its orthogonal plane, each of them being setwise invariant, and each undergoing a rotation. Thus we can see a model of a 4 dimensional rotation by taking the product (cos(α/2)+sin(α/2)e1e2)(cos(β/2)+sin(β/2)e3e4). Then vq1vq is a rotation in R4 which is a rotation in the direction from e1 to e2 by angle α in the e1,e2 plane, and is a rotation in the e3,e4 plane rotating in the direction from e3 to e4 by angle β. More interestingly, given a general unit member z=a+b12e1e2+b13e1e3+b14e1e4+b23e2e3+b24e2e4+b34e3e4+ce1e2e3e4 one wonders whether we can factor it as a product of two planar orthogonal rotations. That is, can you read off the planes of rotation and the corresponding angles of rotation from any unit even member of the geometric algebra. The answer is that yes one can. Assume that the two rotations are g=cos(α/2)+sin(α/2)(r12e1e2+r13e1e3+r14e1e4+r23e2e3+r24e2e4+r34e3e4) and by h=cos(β/2)+sin(β/2)(s12e1e2+s13e1e3+s14e1e4+s23e2e3+s24e2e4+s34e3e4). For a rotation to represent a plane it must have r12r34+r13r24+r14r23=0 For the planes of these two members of the product to be orthogonal we must have r12=s34, r13=s24, r14=s23, r23=s14, r24=s13, r34=s12. Consequently the grade 4 part of the product gh is r12s34+r13s24+r14s23+r23s14+r24s13+r34s14 which is equal to r212+r213+r214+r232+r224+r234=1. Consequently if z=gh then a=cos(α/2)cos(β/2) and c=sin(α/2)sin(β/2). We then have ac=cos((α+β)/2) and a+c=cos((αβ)/2). From this we can readily extract both α and β. Then since bij=cos(β/2)sin(α/2)rij+sin(β/2)cos(α/2)sij=cos(β/2)sin(α/2)rij+sin(β/2)cos(α/2)rkl and bkl=cos(β/2)sin(α/2)rkl+sin(β/2)cos(α/2)skl=cos(β/2)sin(α/2)rkl+sin(β/2)cos(α/2)rij where kl are the two members different than ij we have equations we can readily solve for rij and rkl in terms of bij and bkl (since β and α are known). Thus one can decompose any even unit member of the geometric algebra of R4 into two orthogonal plane rotations with known angles and directions of rotation.
So the wedge product can be defined by simply setting a basis ei and requiring that eiej=ejei for ij and that eiei=0 along with commuting with scalars and distribution and associativity. It can be shown then that for any vector a one has aa=0. The wedge product defines a bit of signed area (or volume) on an oriented plane, hyperplane, etc.... just as a vetor represents a signed quantity in a specific direction. The (typical) geometric product is defined as follows for an orthonormal basis: eiej=eiej for ij and eiei=1. If a is a vector then aa=aa the usual dot product of a with itself. In general for vectors a and b the geometric product is ab=ab+ab. If b has unit length then bb=1, so b is its own inverse. Consider the conjugation τv:ve1ve1. Then for any basis vector other than e1 the result is eie1eie1=e1e1ei=ei. But e1e1e1e1=e1. Thus τv preserves the e1 axis, but is an inversion on the perpendicular space. The same is true for τa:vava for any unit vector a. The line through a is fixed pointwise, and the perpendicular space to a undergoes an involution where everything is negated. The combination of τa followed by τb then fixes everything perpendicular to the a,b plane. However, on the a,b plane, if h is rotation from a to b then τbτa is h applied twice, giving twice the rotation of a direct rotation from a to b. Thus the map v(ba)v(ab) rotates the a,b plane as if a single rotation directly from a to b had been applied twice, whereas it leaves the subspace perpendicular to the a,b plane pointwise fixed. The scalar part of ab is the cosine of ψ where ψ is the angle from a to b. Put differently, the angle of rotation is θ and the scalar part of this rotation is cos(θ/2). Because abab=aabb=1 then if we write out the taylor series expansion for exp(tab) where t is a scalar we get cos(t)+sin(t)ab. The inverse of cos(t)+sin(t)ab is cos(t)+sin(t)ba=cos(t)sin(t)ab. Thus there is a rotation subgroup cos(t) + sin(t)ab, i.e. v(cos(t)sin(t)ab)v(cos(t)+sin(t)ab) When t=0 this is the identity map and when t=π/2 this is the rotation v(ba)v(ab) Now ab=ab+ab. If a,b are unit vectors, then multiplying [cos(t)+sin(t)ab][cos(s)+sin(s)ab] yields cos(t)cos(s)+sin(t)sin(s)abab+cos(t)sin(s)ab+sin(t)cos(s)ab=(cos(t)cos(s)sin(t)sin(s))+[cos(t)sin(s)+sin(t)cos(s)]ab but this is just cos(t+s)+sin(t+s)ab In 4 dimensions a rotation is determined by the element (cos(s/2)+sin(s/2)e1e2)(cos(t/2)+sin(t/2)e3e4) Expanding gives cos(s/2)cos(t/2)+cos(s/2)sin(t/2)e3e4+cos(t/2)sin(s/2)e1e2+sin(s/2)sin(t/2)e1e2e3e4 So there is a vector and a pseudovector and one is cos(s/2)cos(t/2) and the other is sin(s/2)sin(t/2)e1e2e3e4 There is also a bivector part which is where we get real direction (because the vector and pseudovector are directionless). So given a unit versor for R4 is there necessarily a corresponding rotation, and if so, can we read off the planes of rotations and their rotation angles as we can in 3 dimensions? That is, given a + b_{12) e_1 e_2 + b_{13} e_1 e_3 + b_{14} e_1 e_4 + b_{23} e_2 e_3 + b_{24} e_2 e_4 + b_{34} e_3 e_4 + c e_1 e_2 e_3 e_4 a unit vector where the sum of squares of the coefficients is one, does this always give a rotation, and if so, what are the planes and their rotations? Now the grade two part should be the sum of two different blades, each one the orthogonal complement of the other. A single 2 blade in R4 has grade two part satisfying u12u34+u13u24+u14u23=0. The orthogonal blade grade 2 part has v12=u34 and v13=u24 etc.... One would expect that every possible choice of bij can be represented nearly uniquely as a weighted sum of two orthogonal 2 blades, i.e. choose 1, then the other is detemined, and one can choose the coefficient of the other which gives just the right dimension.

Deforming complex space using Beltrami coefficients on a pair of foliations

Since the pair of equations for deforming a pair of foliations is a pair of PDSs involving the beltrami coefficients, I checked to see whether they were simpler to work with if the beltrami coefficients were in turn represented as a ratio involving ˉz and z but that didn’t help.
I had worked out a series solution to them years ago, looking for simple solutions. The simple solutions all looked like they turned out to be trivial as I recall (i.e. no different than a deformation of each coordinate separately).
However, I went ahead and wrote out the series solution again just because the PDEs involved obviously have a series solution, and PDE’s can be hard to solve.
If b1 and b2 are the beltrami coefficients of the horizontal and vertical holomorphic leaves, the PDE pair is

b1ˉz2=b2b1z2

and

b2ˉz1=b1b2z1


You can select any two convergent series aijkzi1zj2ˉzk1 and cijkzi1zj2ˉzk2 then there is a nice series relationship that gives you actual solutions b1 and b2 as an infinite series. The series solution is nicer if you actually represent b1 and b2 as
b1=Aijˉzi1ˉzj2
and
b2=Cijˉzi1ˉzj2
where Aij and Cij are holomorphic functions. 

You then get the relationships:

(j+1)Ai,j+1=ki,jAkz2Cik,j

(i+1)Ci+1,j=ki,jCkz1Aik,j

This recursive relationship requires that you specify each Ai0 and each C0j as a holomorphic function.

Of course, where resulting series converges is a different matter altogether. But it appears this gives lots of solutions to the system of PDEs. For example, if the constants aijk and cijk get small very rapidly, this method would give you convergent solutions to the PDE.
Now, of course, there are many solutions, because one can simply take two transverse holomorphic motions and turn them into a deformation by treating them as a coordinate system. That is, take two different pairs of transverse foliations. For each foliation in each pair choose a leaf. Choose a way to match the selected two leaves that came from one pair with the selected two leaves that came from the other pair. Now extend by treating the transverse foliations as giving coordinates.
The PDE is written for the case where one of these pairs is simply the regular horizontal and vertical leaves, and we are specifying beltrami coefficients to deform the structure of those leaves.

So there are lots of solutions. But it is interesting to have solutions specified by beltrami coefficients. It means you can deform an existing structure and still have a holomorphic space. 

Deforming C2 with a pair of holomorphic foliations

I went over my old notes on deforming C2. Pairs of laminations of C2 by holomorphic curves was a recurring theme in holomorphic dynamics, and I worked out the conditions using the Neienhaus tensor for doing quasiconformal deformation on a pair of laminations and obtaining a new complex structure on C2. The point is more or less that two transverse laminations are deformed into two different transverse laminations, which goes along with the leaves of each of those laminations undergoing a quasiconformal deformation. I had worked out a few geometric basics as well, sort of to check the viability of the idea. For smooth deformations locally one would have an R linear map of C2. It turns out that there are two open conditions for such maps - they either map two complex lines to two complex lines or else they map no complex lines to complex lines. I don't think this is a direction to pursue right now, as the condition one obtains on the beltrami coefficients is simple and elegant, yet it is a nonlinear PDE. Given the implicit geometric meaning it might be interesting in looking at whether the property of being less than norm 1 propagates in a solution.

I need to rewrite the equations using bi=fzbarfz since bi IS a beltrami coefficient.

There was some horrible thing that happened when you tried to develop a series solution, and I don't recall what it was.

Data Science - Illustration with Cancer Data

python_test
In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [3]:
import os
os.getcwd()
Out[3]:
'C:\\Users\\John Robertson\\Documents'
In [4]:
os.chdir('C:\\Users\\John Robertson\\Documents\\python_test')
os.getcwd()
Out[4]:
'C:\\Users\\John Robertson\\Documents\\python_test'
In [7]:
headerList = []
headerFile = open('field_names.txt','r')
for line in headerFile:
    nextHeader = line.rstrip()
    if nextHeader:
        headerList.append(nextHeader)
headerFile.close()
print(headerList)
['ID', 'diagnosis', 'radius_mean', 'radius_sd_error', 'radius_worst', 'texture_mean', 'texture_sd_error', 'texture_worst', 'perimeter_mean', 'perimeter_sd_error', 'perimeter_worst', 'area_mean', 'area_sd_error', 'area_worst', 'smoothness_mean', 'smoothness_sd_error', 'smoothness_worst', 'compactness_mean', 'compactness_sd_error', 'compactness_worst', 'concavity_mean', 'concavity_sd_error', 'concavity_worst', 'concave_points_mean', 'concave_points_sd_error', 'concave_points_worst', 'symmetry_mean', 'symmetry_sd_error', 'symmetry_worst', 'fractal_dimension_mean', 'fractal_dimension_sd_error', 'fractal_dimension_worst']
In [8]:
df = pd.read_csv('breast-cancer.csv', header = None, names = headerList, index_col = 0)
In [112]:
print("Dimensions are " + str(df.shape))
print("Number of malignant " + str(len(df[df.diagnosis == 'M'])))
print("Number of benign " + str(len(df[df.diagnosis == 'B'])))
print(df.iloc[0]) # take a look at an element
Dimensions are (569, 31)
Number of malignant 212
Number of benign 357
diagnosis                            M
radius_mean                      17.99
radius_sd_error                  10.38
radius_worst                     122.8
texture_mean                      1001
texture_sd_error                0.1184
texture_worst                   0.2776
perimeter_mean                  0.3001
perimeter_sd_error              0.1471
perimeter_worst                 0.2419
area_mean                      0.07871
area_sd_error                    1.095
area_worst                      0.9053
smoothness_mean                  8.589
smoothness_sd_error              153.4
smoothness_worst              0.006399
compactness_mean               0.04904
compactness_sd_error           0.05373
compactness_worst              0.01587
concavity_mean                 0.03003
concavity_sd_error            0.006193
concavity_worst                  25.38
concave_points_mean              17.33
concave_points_sd_error          184.6
concave_points_worst              2019
symmetry_mean                   0.1622
symmetry_sd_error               0.6656
symmetry_worst                  0.7119
fractal_dimension_mean          0.2654
fractal_dimension_sd_error      0.4601
fractal_dimension_worst         0.1189
Name: 842302, dtype: object
In [9]:
# create a normalized version, where each variable is centered and normalized by the std dev
df_numerical = df.drop(['diagnosis'],axis = 1)
df_numerical_norm = (df_numerical - df_numerical.mean())/df_numerical.std()
df_norm = df.loc[:,['diagnosis']].join(df_numerical_norm)
print(df_norm.iloc[0]) # take a look at an element
diagnosis                            M
radius_mean                     1.0961
radius_sd_error               -2.07151
radius_worst                   1.26882
texture_mean                   0.98351
texture_sd_error               1.56709
texture_worst                  3.28063
perimeter_mean                 2.65054
perimeter_sd_error             2.53025
perimeter_worst                2.21557
area_mean                      2.25376
area_sd_error                  2.48755
area_worst                   -0.564768
smoothness_mean                2.83054
smoothness_sd_error            2.48539
smoothness_worst             -0.213814
compactness_mean                1.3157
compactness_sd_error           0.72339
compactness_worst             0.660239
concavity_mean                 1.14775
concavity_sd_error            0.906286
concavity_worst                1.88503
concave_points_mean            -1.3581
concave_points_sd_error        2.30158
concave_points_worst           1.99948
symmetry_mean                  1.30654
symmetry_sd_error              2.61436
symmetry_worst                 2.10767
fractal_dimension_mean         2.29406
fractal_dimension_sd_error      2.7482
fractal_dimension_worst        1.93531
Name: 842302, dtype: object
In [10]:
# We want to plot the data
# Visualization can help us recognize dangers, unusual features, 
# and our end results should correspond with what we can see visually
# so it helps prevent techinical mistakes from 
# leading us to wrong conclusions
for label in headerList[2:]:
    bins = np.linspace(-4,4,100)
    plt.hold(True)
    plt.hist(df_norm[label][df.diagnosis == 'B'],bins, alpha = .5, label = 'B')
    plt.hist(df_norm[label][df.diagnosis == 'M'],bins, alpha = .5, label = 'M')
    plt.legend(loc='upper right')
    plt.suptitle(label)
    plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: MatplotlibDeprecationWarning: pyplot.hold is deprecated.
    Future behavior will be consistent with the long-time default:
    plot commands add elements without first clearing the
    Axes and/or Figure.
  
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\__init__.py:805: MatplotlibDeprecationWarning: axes.hold is deprecated. Please remove it from your matplotlibrc and/or style files.
  mplDeprecation)
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\rcsetup.py:155: MatplotlibDeprecationWarning: axes.hold is deprecated, will be removed in 3.0
  mplDeprecation)
In [11]:
# Compute the mean and median smoothness and compactness for benign and malignant tumors - 
# do they differ? 
# Explain how you would identify this.

# Answer. We have three columns for smoothness and three columns for compactness. 
# It is not clear what smoothness_sd_error or compactness_sd_error mean. Without more understanding of the data
# I would assume that I am being asked for the mean and median of the columns smoothness_mean and compactness_mean
# it should be noted that the visual histograms plotted above give a meaningful answer already. 

# I am computing a normalized mean and median which makes it easy to tell by inspection that their difference is significant

print("Malignant smoothness mean = " + str(np.mean(df_norm.smoothness_mean[df.diagnosis == 'M'])))
print("Benign smoothness mean = " + str(np.mean(df_norm.smoothness_mean[df.diagnosis == 'B'])))

print("Malignant smoothness median = " + str(np.median(df_norm.smoothness_mean[df.diagnosis == 'M'])))
print("Benign smoothness median = " + str(np.median(df_norm.smoothness_mean[df.diagnosis == 'B'])))

# If this was for a scientific study was going to be published then I would use traditional statistical tests -- some version of 
# students t-test is the standard I believe.
# Problems like this that are classical statistics
# are not commonly called "big data" because they were doable before the days of terabytes of data and hardware
# capable of processing that. In fact, they could be computed (tediously) before computers existed by hand. 
# By inspection, since there are 500 computations, the variation in the means should be on magnitute of 1/sqrt(200) or about 1/14
# But instead they differ by 1.1 and by .9. So they are roughly 8 sample deviations apart, which means they are genuinely different
Malignant smoothness mean = 0.7210558324558541
Benign smoothness mean = -0.4281900181530531
Malignant smoothness median = 0.40232407996917197
Benign smoothness median = -0.502043643388796
In [12]:
# Write a function to generate bootstrap samples of the data

# Bootstrap samples are samples with replacement so to get a sample of N rows with replacement we would use

from random import randint

def getSamples(n,dataFrame):
    newList = []
    rowCount = len(dataFrame)
    for i in range(n):
        newList.append(randint(0, rowCount-1))
    return df.iloc[newList]
In [13]:
# test our Bootstrap function to generate a set of 10 samples

getSamples(10,df_norm)
Out[13]:
diagnosis radius_mean radius_sd_error radius_worst texture_mean texture_sd_error texture_worst perimeter_mean perimeter_sd_error perimeter_worst ... concavity_worst concave_points_mean concave_points_sd_error concave_points_worst symmetry_mean symmetry_sd_error symmetry_worst fractal_dimension_mean fractal_dimension_sd_error fractal_dimension_worst
ID
865468 B 13.37 16.39 86.10 553.5 0.07115 0.07325 0.080920 0.028000 0.1422 ... 14.260 22.75 91.99 632.1 0.10250 0.25310 0.33080 0.08978 0.2048 0.07628
89346 B 9.00 14.40 56.36 246.3 0.07005 0.03116 0.003681 0.003472 0.1788 ... 9.699 20.07 60.90 285.5 0.09861 0.05232 0.01472 0.01389 0.2991 0.07804
874858 M 14.22 23.12 94.37 609.9 0.10750 0.24130 0.198100 0.066180 0.2384 ... 15.740 37.18 106.40 762.4 0.15330 0.93270 0.84880 0.17720 0.5166 0.14460
857010 M 18.65 17.60 123.70 1076.0 0.10990 0.16860 0.197400 0.100900 0.1907 ... 22.820 21.32 150.60 1567.0 0.16790 0.50900 0.73450 0.23780 0.3799 0.09185
869218 B 11.43 17.31 73.66 398.0 0.10920 0.09486 0.020310 0.018610 0.1645 ... 12.780 26.76 82.66 503.0 0.14130 0.17920 0.07708 0.06402 0.2584 0.08096
864726 B 8.95 15.76 58.74 245.2 0.09462 0.12430 0.092630 0.023080 0.1305 ... 9.414 17.07 63.34 270.0 0.11790 0.18790 0.15440 0.03846 0.1652 0.07722
901028 B 13.87 16.21 88.52 593.7 0.08743 0.05492 0.015020 0.020880 0.1424 ... 15.110 25.58 96.74 694.4 0.11530 0.10080 0.05285 0.05556 0.2362 0.07113
923465 B 10.82 24.21 68.89 361.6 0.08192 0.06602 0.015480 0.008160 0.1976 ... 13.030 31.45 83.90 505.6 0.12040 0.16330 0.06194 0.03264 0.3059 0.07626
91376702 B 17.85 13.23 114.60 992.1 0.07838 0.06217 0.044450 0.041780 0.1220 ... 19.820 18.42 127.10 1210.0 0.09862 0.09976 0.10480 0.08341 0.1783 0.05871
91544002 B 11.06 17.12 71.25 366.5 0.11940 0.10710 0.040630 0.042680 0.1954 ... 11.690 20.74 76.08 411.1 0.16620 0.20310 0.12560 0.09514 0.2780 0.11680

10 rows × 31 columns

In [14]:
# Random forest variable importance is a common way
# to pick out which variables are most important

from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators = 500)
forest.fit(df_numerical_norm, df.diagnosis)
importances = forest.feature_importances_
# importance_stds = np.std([tree.feature_importances_ for tree in forest.estimators_], axis = 0)
importance_indices = np.argsort( importances )[::-1]
for i in range(df_numerical_norm.shape[1]):
    print( list(df_numerical_norm)[i] + " " + str(importances[i]))
print("-------------")
print("In order of importance")
for i in range(df_numerical_norm.shape[1]):
    j = importance_indices[i]
    print( list(df_numerical_norm)[j] + " " + str(importances[j]))

plt.plot(range( len(importance_indices)), importances[ importance_indices ], 'ro')
plt.show()
radius_mean 0.053421045747899666
radius_sd_error 0.018758082281329424
radius_worst 0.059980675352813054
texture_mean 0.05582382656416694
texture_sd_error 0.010970244884948332
texture_worst 0.018674215638445447
perimeter_mean 0.061087710397355624
perimeter_sd_error 0.09380445329204413
perimeter_worst 0.007643975139285527
area_mean 0.006275077476443084
area_sd_error 0.02357754182383073
area_worst 0.005203512929981728
smoothness_mean 0.020029228669550432
smoothness_sd_error 0.037318116760044644
smoothness_worst 0.006180432373471859
compactness_mean 0.0068107591790187855
compactness_sd_error 0.008274600673487047
compactness_worst 0.009290031042625912
concavity_mean 0.006046108181319825
concavity_sd_error 0.00630317987710586
concavity_worst 0.0947999219538862
concave_points_mean 0.025900099207817617
concave_points_sd_error 0.08206509580588969
concave_points_worst 0.08425461459043444
symmetry_mean 0.01957570597038478
symmetry_sd_error 0.02465788075135929
symmetry_worst 0.041246989728128326
fractal_dimension_mean 0.0868849053263718
fractal_dimension_sd_error 0.015133958397684163
fractal_dimension_worst 0.010008009982875762
-------------
In order of importance
concavity_worst 0.0947999219538862
perimeter_sd_error 0.09380445329204413
fractal_dimension_mean 0.0868849053263718
concave_points_worst 0.08425461459043444
concave_points_sd_error 0.08206509580588969
perimeter_mean 0.061087710397355624
radius_worst 0.059980675352813054
texture_mean 0.05582382656416694
radius_mean 0.053421045747899666
symmetry_worst 0.041246989728128326
smoothness_sd_error 0.037318116760044644
concave_points_mean 0.025900099207817617
symmetry_sd_error 0.02465788075135929
area_sd_error 0.02357754182383073
smoothness_mean 0.020029228669550432
symmetry_mean 0.01957570597038478
radius_sd_error 0.018758082281329424
texture_worst 0.018674215638445447
fractal_dimension_sd_error 0.015133958397684163
texture_sd_error 0.010970244884948332
fractal_dimension_worst 0.010008009982875762
compactness_worst 0.009290031042625912
compactness_sd_error 0.008274600673487047
perimeter_worst 0.007643975139285527
compactness_mean 0.0068107591790187855
concavity_sd_error 0.00630317987710586
area_mean 0.006275077476443084
smoothness_worst 0.006180432373471859
concavity_mean 0.006046108181319825
area_worst 0.005203512929981728
In [15]:
# The plot of variable importance using random forests is very useful
# Offhand, it is not necessarily best to just grab the top 3 or 5 
# most important variables. We see distinct groups of variables with 
# comparable importance in this plot, and it may be that they have comparable
# importance because they are strongly correlated, i.e. possibly variables ranked 
# 6,7,8 above are so close in importance because they are tightly correlated
# and each one gives no more information than the others. But we have cut down the 
# playing field of interesting variables significantly.
In [37]:
# Identify 2-3 variables that are predictive of a malignant tumor.
# Display the relationship visually and write 1-2 sentences explaining the relationship.

# The two strongest ones are fractal_dimension_mean and concavity_worst and malignant tumors 
# have larger values of both of those. I don't know precisely how those geometric quantities were
# measured. Offhand, one sounds like it means malignant tumors have a more pitted and crinkled surface.
# I have already displayed the relationship visually with the histograms above.

plt.plot(df_norm.fractal_dimension_mean[df.diagnosis == 'B'], df_norm.concavity_worst[df.diagnosis == 'B'],'o', alpha = 0.2, label='Benign')
plt.plot(df_norm.fractal_dimension_mean[df.diagnosis == 'M'], df_norm.concavity_worst[df.diagnosis == 'M'],'o', alpha = 0.2, label='Malignant')
plt.legend(loc = 'upper left')
plt.axis('scaled')
plt.show()
# Plotting these two variables for both groups together it appears that they are not too strongly correlated 
# and that each of these two variable independently helps reduce the overlap between malignant and benign tumors
# That is, the x,y pairs are more separated than either the x coordinates alone or the y coordinates alone would be
In [18]:
from sklearn.cross_validation import cross_val_score
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [19]:
X = df_numerical_norm
Y = df_norm.diagnosis
In [20]:
forest = ExtraTreesClassifier(n_estimators = 500)
In [21]:
forest_result = cross_val_score(forest, X, Y, cv = 5)
In [22]:
print(forest_result)
[0.94782609 0.96521739 0.98230088 0.96460177 0.96460177]
In [23]:
# These scores are the portion of correctly classified samples. 
# These are good scores, and they are consistent scores.
# One of the downsides of cross validation in python is that it doesn't
# return the scores on the training set as well as on the test set.
# You will normally see better scores on the training set than on the 
# test set. But if you see significantly better scores on the 
# training set than on the test set, that is because you are overfitting
# the data. Effectively, these scores are so high that we 
# know we are not overfitting dramatically anyway.
In [24]:
# I already determine the most important variables in a random forest model
In [25]:
# I like SVMs but they are poor at helping you identify the most important variables
# So for the second case I will just use linear regression
In [26]:
from sklearn.svm import SVC
In [27]:
svm = SVC()
In [28]:
svmResult = cross_val_score(svm, X, Y, cv = 5)
In [29]:
svmResult
Out[29]:
array([0.97391304, 0.96521739, 1.        , 0.96460177, 0.97345133])
In [30]:
# It is not easy from an SVM to determine what the most important variables are.
# SVMs are more of a black box. They are best where there is 
# sparse data and you want a black box predictor rather than insight
# about the meaning of the predictions. That is why they
# are used so frequently in computer vision when the data is ALWAYS sparse.
# We know we didn't overfit because the results are so very high. 
# Overfitting with a linear SVM is very unlikely. They specialize in being robust
# against overfitting.

More about rotations in 4 dimensions.

It is easy to take a quaternion and recognize the rotation it actually represents. This is about doing the same thing for 4 dimensional rota...