Given a data set containing features of cell nuclei, we need to determine whether a cell is malignant or benign. This is a classification problem, which is a type of supervised learning: the model learns to assign data to classes and then predicts the class of any new sample passed as input.
We will use the sklearn module, as it provides a range of supervised and unsupervised learning algorithms. It is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
First, we import various libraries that we will need along with the data from sklearn.datasets.
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_curve, auc
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import seaborn as sns
from sklearn.preprocessing import StandardScaler
We also imported pandas to work with dataframes, NumPy to work with arrays (it also supports various mathematical functions), and matplotlib and seaborn for data visualisation.
Now, we need to load the data and convert it into a dataframe, as this makes the data easier to manipulate and analyse.
data = load_breast_cancer()
a=np.c_[data.data, data.target]
columns = np.append(data.feature_names, ["target"])
df_cancer=pd.DataFrame(a,columns=columns)
df_cancer.head()
The head() function shows the top 5 rows of our dataframe. We used it to check that the data has been correctly converted into a dataframe with the right column names, and to get a better understanding of our data.
The dataframe looks like this:
There are 10 different cell nuclei parameters:
Radius: The distance from the centre to points on the perimeter.
Perimeter: The size of the core tumour. The total distance between the boundary points gives the perimeter.
Area: The area of the cancer cells.
Smoothness: The local variation in the radius lengths, given by the difference between a radial length and the mean length of the lines around it.
Compactness: A value estimated from the perimeter and the area, given by (perimeter^2 / area - 1.0).
Concavity: The severity of the concave portions of the contour. Smaller chords capture small concavities better, so this feature is affected by the chord length.
Concave points: While concavity measures the magnitude of the contour concavities, concave points measures only their number.
Symmetry: The longest chord is taken as the major axis, and the difference in length between the lines perpendicular to the major axis on either side of it is measured. This is known as the symmetry.
Fractal dimension: A measure of non-linear growth. As the ruler used to measure the perimeter gets larger, the precision decreases and hence the measured perimeter decreases. Plotting these values on a log scale, the downward slope gives an approximation of the fractal dimension.
Texture: The standard deviation of the gray-scale values, which is helpful for finding the variation.
Higher values of all the shape features imply an irregular contour, which in turn suggests a malignant cell.
The worst and error values are included because only a few malignant cells may be present in a given sample, and these values help to correlate better with malignancy. Surgery also depends on the size of the tumour, hence the worst values are necessary.
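To make the compactness formula from the list above concrete, here is a minimal sketch that evaluates it for two idealised shapes. The shapes and the helper name `compactness` are illustrative assumptions, not part of the data set; the point is only that a longer, more irregular contour for the same area yields a larger value.

```python
import math

def compactness(perimeter, area):
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# A circle is the most compact shape: perimeter^2 / area equals 4*pi for any radius.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A hypothetical elongated 1 x 10 rectangle has a much longer contour for its area.
rectangle = compactness(2 * (1 + 10), 1 * 10)

print(round(circle, 3), round(rectangle, 3))  # 11.566 47.4
```

The circle gives the smallest possible value under this definition, so larger values indicate a less regular outline.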
The target value is zero for malignant and one for benign.
We divide the data into two classes: Malignant and Benign.
Malignant=df_cancer[df_cancer['target'] ==0]
Benign=df_cancer[df_cancer['target'] ==1]
We divide the feature names into three categories: mean, error and worst.
mean_features= ['mean radius',
'mean texture',
'mean perimeter',
'mean area',
'mean smoothness',
'mean compactness',
'mean concavity',
'mean concave points',
'mean symmetry',
'mean fractal dimension']
error_features=['radius error',
'texture error',
'perimeter error',
'area error',
'smoothness error',
'compactness error',
'concavity error',
'concave points error',
'symmetry error',
'fractal dimension error']
worst_features=['worst radius',
'worst texture',
'worst perimeter',
'worst area',
'worst smoothness',
'worst compactness',
'worst concavity',
'worst concave points',
'worst symmetry',
'worst fractal dimension']
We will create a function to plot histograms with 10 subplots.
bins = 20 #Number of bins: the range of values is divided into 20 intervals
def histogram(features):
    plt.figure(figsize=(10,15))
    for i, feature in enumerate(features):
        #5 rows and 2 columns of subplots; subplot numbers start at 1, hence i+1
        plt.subplot(5, 2, i+1)
        #Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True, stat='density') is the newer alternative
        sns.distplot(Malignant[feature], bins=bins, color='red', label='Malignant')
        sns.distplot(Benign[feature], bins=bins, color='green', label='Benign')
        plt.title('Density Plot of: '+str(feature))
        plt.xlabel('X variable')
        plt.ylabel('Density Function')
        plt.legend(loc='upper right')
    plt.tight_layout()
    plt.show()
We call the function for mean, error and worst features of the malignant and benign cells.
histogram(mean_features)
histogram(error_features)
histogram(worst_features)
We will now write a function to plot a ROC (Receiver Operating Characteristic) curve. It is a measure of performance for classification problems at various threshold points. The curve plots the true positive rate against the false positive rate. The larger the area under the curve, the better the model is at distinguishing between the two classes, which in our case are malignant and benign.
def ROC_curve(X,Y,string):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4) #Splitting the data for training and testing in a 60/40 ratio
    model=LogisticRegression(solver='liblinear') #Using a logistic regression model
    model.fit(X_train,y_train)
    probability=model.predict_proba(X_test) #Predicting class probabilities
    fpr, tpr, thresholds = roc_curve(y_test, probability[:,1]) #Returns the false positive rates, true positive rates and thresholds
    roc_auc = auc(fpr, tpr) #Area under the curve
    plt.figure()
    plt.plot(fpr, tpr, lw=1, color='green', label=f'AUC = {roc_auc:.3f}')
    plt.plot([0,1],[0,1],linestyle='--',label='Baseline') #Plotting the baseline
    plt.title(string)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.show()
ROC_curve(df_cancer[mean_features],df_cancer['target'],'ROC for mean features ')
ROC_curve(df_cancer[error_features],df_cancer['target'],'ROC for error features')
ROC_curve(df_cancer[worst_features],df_cancer['target'],'ROC for worst features')
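To show what roc_curve and auc compute under the hood, the following is a minimal pure-Python sketch on a handful of made-up scores. The labels, scores and helper names (`roc_points`, `trapezoid_auc`) are hypothetical, and the sketch assumes binary labels with no tied scores; it is not sklearn's implementation.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold one score at a time.

    Assumes binary labels (1 = positive class) and no tied scores.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    # Visit samples from the highest score to the lowest.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0]              # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical predicted probabilities
points = roc_points(labels, scores)
roc_auc_value = trapezoid_auc(points)
print(points[-1], round(roc_auc_value, 3))  # (1.0, 1.0) 0.889
```

Here 8 of the 9 positive/negative pairs are ranked correctly, which is exactly the AUC of 8/9.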
An excellent model has an area under the curve close to 1, which means it separates the classes well. In the above ROC curves we can see that the mean and worst features give a high AUC. Therefore, we will not take the error features into consideration. Also, in the histograms plotted above we see that the features of malignant and benign cells overlap. To help our model distinguish the classes better, we need to select the features with the least overlapping values. The top 5 features according to this are:
worst area
worst perimeter
worst radius
mean concave points
mean concavity
We will save these features in a list called imp_features:
imp_features=['worst area','worst perimeter','worst radius','mean concave points','mean concavity']
The mean of each feature over all instances of the Malignant and Benign classes is computed as follows:
m_feature_space=Malignant.mean(axis=0)
b_feature_space=Benign.mean(axis=0)
Now, we concatenate the two series of class means, one for Malignant and one for Benign, and calculate the mean of the two values corresponding to each feature.
z=pd.concat([m_feature_space,b_feature_space],axis=1)
analysis_point=z.mean(axis=1)
analysis_point.head() #Analysis point
The output is:
mean radius         14.804677
mean texture        19.759834
mean perimeter      96.720392
mean area          720.583306
mean smoothness      0.097688
dtype: float64
Creating X and Y, where X has all the features and Y contains target:
X=df_cancer.drop(['target'],axis=1)
Y=df_cancer['target']
Now, our data has been preprocessed and is ready to be used for training.
But before that, since our data set is imbalanced, with fewer malignant samples than benign ones, we will use SMOTE (Synthetic Minority Oversampling TEchnique) from the imbalanced-learn module. A problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary. This technique creates new examples of the minority class (i.e. malignant cells). SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and generating a new sample at a point along that line.
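The interpolation step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea under simplifying assumptions (a tiny made-up minority set, Euclidean distance, one synthetic point); the helper names are hypothetical and this is not the imbalanced-learn implementation.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a random minority sample
    and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the line segment, in [0, 1)
    synthetic = [b + gap * (n - b) for b, n in zip(base, neighbour)]
    return base, neighbour, synthetic

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [8.0, 9.0]]
rng = random.Random(0)
base, neighbour, synthetic = smote_like_sample(minority, k=2, rng=rng)
print(base, neighbour, synthetic)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two real samples, so the new example stays inside the region already occupied by the minority class.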
We will split our data into training and test sets and then build a pipeline using the Pipeline class from the imblearn library, as it can also include an oversampling step (SMOTE in this case).
Using this pipeline, we will apply StandardScaler, SMOTE and DecisionTreeClassifier to our data.
We will also use GridSearchCV to search over the hyper-parameters max_depth and min_samples_leaf of the DecisionTreeClassifier.
The program code is given below:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
max_depth = list(range(1,24))
min_leaf=list(range(1,20))
params = [{'classifier__max_depth':max_depth,'classifier__min_samples_leaf':min_leaf}] #Defining parameters for the grid search
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)
pipe=Pipeline([('sc',StandardScaler()),('smt',SMOTE()),('classifier',DecisionTreeClassifier(random_state=0,min_samples_split=6,max_features=10))]) #Creating a pipeline
grid_search_cv = GridSearchCV(pipe,params,scoring='accuracy',refit=True, verbose=1,cv=5) #Grid Search function which will put different combinations of the parameters
grid_search_cv.fit(X_train,y_train)
We specified a value for max_features, which sets how many features are considered at a time when looking for the best split; considering too many features makes the search computationally heavy. We take a value of 10, as we have 30 features. The min_samples_split parameter is used to control over-fitting; as a rule of thumb, a reasonable value lies somewhere up to about 40 (sklearn requires at least 2). If the value is too low, we see over-fitting.
Decision trees don't generally require scaling, but we used it here so that the decision tree can be compared with the SVM. Scaling makes no drastic difference to decision trees. The different class sizes might result in bias; although the difference is not very large, it is still better to have a balanced data set, so we applied oversampling using SMOTE to solve this problem.
As a rule of thumb, we keep max_depth at or below the square root of the number of instances (569 here, so about 23), hence we search the range 1 to 23. If the depth is too large we see over-fitting, and if it is too low we see under-fitting. The min_samples_leaf parameter gives the minimum number of samples required at a leaf node. Too low a value leads to over-fitting, while too large a value prevents the tree from splitting enough and leads to under-fitting, hence we search the range 1 to 19.
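As a quick sanity check on the size of this search, the short sketch below counts the parameter combinations and cross-validation fits implied by the ranges in the code (note that range(1, 24) stops at 23 and range(1, 20) stops at 19). The figure of 569 samples is the size of sklearn's breast cancer data set; the variable names are only illustrative.

```python
import math

n_samples = 569                  # number of rows in sklearn's breast cancer data set
max_depth = list(range(1, 24))   # 1, 2, ..., 23
min_leaf = list(range(1, 20))    # 1, 2, ..., 19
cv_folds = 5                     # cv=5 in GridSearchCV

depth_limit = math.isqrt(n_samples)            # integer part of sqrt(569)
n_candidates = len(max_depth) * len(min_leaf)  # parameter combinations tried
n_fits = n_candidates * cv_folds               # pipeline fits during cross-validation

print(depth_limit, n_candidates, n_fits)  # 23 437 2185
```

With over two thousand pipeline fits, each of which also runs SMOTE, the grid search accounts for most of the running time of this step.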
The output is as follows:
GridSearchCV(cv=5, error_score=nan,
estimator=Pipeline(memory=None,
steps=[('sc',
StandardScaler(copy=True,
with_mean=True,
with_std=True)),
('smt',
SMOTE(k_neighbors=5,
kind='deprecated',
m_neighbors='deprecated',
n_jobs=1,
out_step='deprecated',
random_state=None,
ratio=None,
sampling_strategy='auto',
svm_estimator='deprecated')),
('classifier',
DecisionTreeClassi...
presort='deprecated',
random_state=0,
splitter='best'))],
verbose=False),
iid='deprecated', n_jobs=None,
param_grid=[{'classifier__max_depth': [1,2,3,4,5,6,7,8,9,
10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22,23],
'classifier__min_samples_leaf': [1, 2, 3, 4, 5, 6,7,
8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19]}], pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring='accuracy', verbose=1)
Finding the best model from grid search
model=grid_search_cv.best_estimator_
We will check the accuracy of the model:
from sklearn.metrics import accuracy_score
model.fit(X_train,y_train) #Fitting the model
test_pred = model.predict(X_test)
print(accuracy_score(y_test, test_pred)) #accuracy score function, to print the accuracy of the model
y_test.value_counts()
The output is:
0.9429824561403509
1.0    136
0.0     92
Name: target, dtype: int64
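The two outputs above can be related with a little arithmetic: accuracy is simply the fraction of test samples predicted correctly. The sketch below infers the number of correct predictions from the printed values; that count is derived here and was not printed by the original code.

```python
n_benign_test = 136     # count of target 1.0 in y_test, from the output above
n_malignant_test = 92   # count of target 0.0 in y_test, from the output above
n_test = n_benign_test + n_malignant_test

reported_accuracy = 0.9429824561403509
n_correct = round(reported_accuracy * n_test)  # inferred, not printed by the original code

print(n_test, n_correct, n_test - n_correct)  # 228 215 13
```

In other words, the reported accuracy corresponds to 13 misclassified samples out of 228, which is why the confusion matrix is a useful next step: it shows which kind of mistake those are.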
We will define a variable params to save the parameters of the model:
params=model.get_params()
Now, we will produce the confusion matrix and classification report.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
matrix=np.array(confusion_matrix(y_test,test_pred,labels=[0,1])) #Creating confusion matrix
pd.DataFrame(matrix,index=['Cancer','No Cancer'],columns=['Predicted_Cancer','Predicted_No_Cancer']) #Labelling the matrix
print(classification_report(y_test, test_pred)) #The classification report gives precision,recall and f1 score
Failing to detect a sample that has cancer corresponds to the cell of the confusion matrix where the actual class is Cancer and the prediction is No Cancer. Here that number is 4 out of 91, so the chance that a person who has cancer is predicted as having no cancer is about 0.044. The weakness of the classifier is its computational overhead. Its strength is that it shows good accuracy, and the chance of classifying a sample as no cancer when it actually is cancer is low, which is what we want in this case.
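The miss rate quoted above is the false negative rate for the cancer class. As a minimal sketch, the helpers below (hypothetical names, toy labels) show how the four cells of a confusion matrix are counted and how the rate follows from them, treating label 0 (malignant) as the class of interest.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count TP, FN, FP, TN, treating `positive` (0 = malignant here) as the class of interest."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1  # cancer present but predicted as no cancer
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn

def miss_rate(tp, fn):
    """False negative rate: share of actual cancer cases that were missed."""
    return fn / (tp + fn)

# Hypothetical toy labels, only to exercise the counting logic.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn, miss_rate(tp, fn))  # 3 1 1 3 0.25

# The figures quoted in the text: 4 missed cases out of 91 cancer samples.
article_rate = miss_rate(tp=87, fn=4)
print(round(article_rate, 3))  # 0.044
```

The miss rate is one minus the recall of the cancer class, so it can also be read off the classification report.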
We will now plot the decision tree. Afterwards, we will draw scatter plots for pairs of the features in imp_features.
from sklearn import tree
plt.figure(figsize=(40,40))
tree.plot_tree(model['classifier']) #function used to plot decision tree
Output:
[Text(1116.0, 1993.2, 'X[7] <= 0.128\ngini = 0.5\nsamples = 440\nvalue = [220, 220]'), Text(558.0, 1630.8000000000002, 'X[23] <= 0.083\ngini = 0.182\nsamples = 227\nvalue = [23, 204]'), Text(279.0, 1268.4, 'X[1] <= 0.488\ngini = 0.057\nsamples = 205\nvalue = [6, 199]'), Text(139.5, 906.0, 'gini = 0.0\nsamples = 163\nvalue = [0, 163]'), Text(418.5, 906.0, 'X[27] <= -0.302\ngini = 0.245\nsamples = 42\nvalue = [6, 36]'), Text(279.0, 543.5999999999999, 'X[17] <= -0.261\ngini = 0.062\nsamples = 31\nvalue = [1, 30]'), Text(139.5, 181.19999999999982, 'gini = 0.0\nsamples = 21\nvalue = [0, 21]'), Text(418.5, 181.19999999999982, 'gini = 0.18\nsamples = 10\nvalue = [1, 9]'), Text(558.0, 543.5999999999999, 'gini = 0.496\nsamples = 11\nvalue = [5, 6]'), Text(837.0, 1268.4, 'X[24] <= -0.355\ngini = 0.351\nsamples = 22\nvalue = [17, 5]'), Text(697.5, 906.0, 'gini = 0.48\nsamples = 10\nvalue = [6, 4]'), Text(976.5, 906.0, 'gini = 0.153\nsamples = 12\nvalue = [11, 1]'), Text(1674.0, 1630.8000000000002, 'X[21] <= -0.321\ngini = 0.139\nsamples = 213\nvalue = [197, 16]'), Text(1395.0, 1268.4, 'X[20] <= 0.266\ngini = 0.493\nsamples = 34\nvalue = [19, 15]'), Text(1255.5, 906.0, 'gini = 0.117\nsamples = 16\nvalue = [1, 15]'), Text(1534.5, 906.0, 'gini = 0.0\nsamples = 18\nvalue = [18, 0]'), Text(1953.0, 1268.4, 'X[0] <= -0.227\ngini = 0.011\nsamples = 179\nvalue = [178, 1]'), Text(1813.5, 906.0, 'gini = 0.18\nsamples = 10\nvalue = [9, 1]'), Text(2092.5, 906.0, 'gini = 0.0\nsamples = 169\nvalue = [169, 0]')]
The tree is a flowchart in which each observation is routed according to the value of some feature. There are two ways to go from each node: if the condition is true the observation goes one way, and if it is false it goes the other way. The first line of a node (here X[7] at the root) names the feature and compares it with a threshold. The second line gives the Gini index at that node, which is computed from the class proportions. A Gini index of 0 means the node is pure and we get a definite class. The samples line gives the number of samples reaching the node, and the value line gives the number of samples in each class. At every node the candidate features are evaluated, and the feature that gives the best Gini index is chosen for the split.
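The Gini values printed in the tree output can be reproduced from the value counts of each node. A minimal sketch, using the standard definition of Gini impurity (one minus the sum of squared class proportions) and counts taken from the output above:

```python
def gini(class_counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Value counts copied from four nodes of the tree output above.
for counts in ([220, 220], [23, 204], [6, 199], [0, 163]):
    print(counts, round(gini(counts), 3))
# [220, 220] 0.5
# [23, 204] 0.182
# [6, 199] 0.057
# [0, 163] 0.0
```

The root node is perfectly mixed after oversampling (0.5, the maximum for two classes), while the leaf with counts [0, 163] is pure.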
clf=DecisionTreeClassifier(random_state=0,min_samples_leaf=2,min_samples_split=6,max_depth=11) #Replicating the decision tree classifier; our classifier had max_features as 10, which cannot be applied here as only 2 features are used
k=1
plt.figure(figsize=(20,40))
for i in range(0,4):
    for j in range(1,5):
        inp=pd.concat([X[imp_features[i]],X[imp_features[j]]],axis=1) #Taking data from two features
        clf.fit(inp.values,Y)
        plt.subplot(4, 4, k)
        k=k+1
        plt.scatter(X[imp_features[i]], X[imp_features[j]], c=Y, s=30) #Creating scatter plot
        ax = plt.gca()
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()
        xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),np.linspace(ylim[0], ylim[1], 50)) #Creating a meshgrid of data points
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) #Predicting the class at every grid point
        Z = Z.reshape(xx.shape)
        plt.contourf(xx, yy, Z, alpha=0.5, cmap=plt.cm.Paired) #Shading the decision regions
        plt.title(str(imp_features[i])+' & '+str(imp_features[j]))
We will now write code to print the most important features according to the best model found by the grid search, and compare them with the important features we selected at the beginning of the program.
feat_importances = pd.Series(model['classifier'].feature_importances_, index=X.columns) #function to save the most important features
feat_importances = feat_importances.nlargest(5) #as we need only 5 features nlargest() is used
feat_importances.plot(kind='barh',figsize=(12,8),title='Most Important Features') #plotting bar graph
imp_features=list(feat_importances.index)
print(feat_importances)
Output:
mean concave points    0.756231
worst area             0.138760
worst texture          0.072523
worst radius           0.019084
mean texture           0.010388
dtype: float64
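As we can see, three of the five features we selected (mean concave points, worst area and worst radius) also appear among the five most important features of the model found by the grid search, with mean concave points dominating. The two selections broadly agree, which suggests that the program is running as expected.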
Now, let's train our data using SVM (Support Vector Machine). It is a supervised machine learning algorithm which can be used for classification or regression problems. It uses a technique called the kernel trick to transform your data and then based on these transformations it finds an optimal boundary between the possible outputs.
Firstly, we will use GridSearch to select the best values of C and gamma. Also, we will use a pipeline to implement StandardScaler, SMOTE, and SVM Classifier on our data.
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
c=[0.01,0.1,1,10]
gamma=[0.01,0.1,1,10]
params = [{'classifier__C':c,'classifier__gamma':gamma}] #Setting the parameters
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)
pipe=Pipeline([('sc',StandardScaler()),('smt',SMOTE()),('classifier',SVC(kernel='rbf'))]) #Creating the pipeline
grid_search_cv = GridSearchCV(pipe,params,refit=True, verbose=1,cv=5)
grid_search_cv.fit(X_train,y_train)
Output:
GridSearchCV(cv=5, error_score=nan,
estimator=Pipeline(memory=None,
steps=[('sc',
StandardScaler(copy=True,
with_mean=True,
with_std=True)),
('smt',
SMOTE(k_neighbors=5, kind='deprecated',
m_neighbors='deprecated',
n_jobs=1, out_step='deprecated',
random_state=None, ratio=None,
sampling_strategy='auto',
svm_estimator='deprecated')),
('classifier',
SVC(C=1.0, break_ti...
decision_function_shape='ovr',
degree=3, gamma='scale',
kernel='rbf', max_iter=-1,
probability=False,
random_state=None, shrinking=True,
tol=0.001, verbose=False))],
verbose=False),
iid='deprecated', n_jobs=None,
param_grid=[{'classifier__C': [0.01, 0.1, 1, 10],
'classifier__gamma': [0.01, 0.1, 1, 10]}], pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring=None, verbose=1)
In the grid shown in the output, both C and gamma range from 0.01 to 10. C is the regularization parameter, so we try the values 0.01, 0.1, 1 and 10; similarly, for gamma we try 0.01, 0.1, 1 and 10. Values below 0.01 would have been too low and values above 10 would have been too high, hence we choose this range. GridSearchCV finds the best combination of these two parameters.
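To give a feel for what the gamma values in this grid mean, the sketch below evaluates the RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), for two points one unit apart. The points and the helper name `rbf_kernel` are illustrative assumptions; after standardisation, a distance of about one unit per feature is a plausible scale.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x, y = [0.0, 0.0], [1.0, 0.0]   # hypothetical points one unit apart
gammas = [0.01, 0.1, 1, 10]     # the gamma grid used above
similarities = {g: rbf_kernel(x, y, g) for g in gammas}

for g, value in similarities.items():
    print(g, round(value, 5))
# 0.01 0.99005
# 0.1 0.90484
# 1 0.36788
# 10 5e-05
```

With a small gamma, distant points still look similar and the decision boundary is smooth; with a large gamma, only very close points influence each other and the boundary can become highly irregular.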
We will check the accuracy of the model:
svc=grid_search_cv.best_estimator_ #Saving the best estimator
svc.fit(X_train,y_train)
test_pred = svc.predict(X_test)
print(accuracy_score(y_test, test_pred))
Output:
0.9868421052631579
We will create a confusion matrix and classification report for this model as well.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
matrix=np.array(confusion_matrix(y_test,test_pred,labels=[0,1]))
pd.DataFrame(matrix,index=['Cancer','No Cancer'],columns=['Predicted_Cancer','Predicted_No_Cancer'])
print(classification_report(y_test, test_pred))
In this model, 3 out of 91 people who have cancer are predicted as having no cancer, which is better than the decision tree classifier. The chance of failing to detect cancer is about 0.03. The advantage of the support vector classifier is that it is relatively more efficient. The disadvantage is that we need to scale the data before using it, as a support vector machine can be biased towards a feature with a larger range if the data is not scaled.
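The scaling issue mentioned above can be illustrated with a small standardisation sketch. The feature values are made up to mimic the very different ranges of an area-like and a smoothness-like feature; the function mirrors what StandardScaler does per column (subtract the mean, divide by the population standard deviation), but it is only an illustration.

```python
from statistics import fmean, pstdev

def standardise(column):
    """Scale a column to zero mean and unit (population) standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)
    return [(value - mu) / sigma for value in column]

area = [500.0, 700.0, 1200.0, 2000.0]   # hypothetical area-like values
smoothness = [0.08, 0.10, 0.12, 0.14]   # hypothetical smoothness-like values

# Before scaling, differences in area dwarf differences in smoothness,
# so distance-based kernels would be driven almost entirely by area.
print(max(area) - min(area), max(smoothness) - min(smoothness))

area_scaled = standardise(area)
smoothness_scaled = standardise(smoothness)
print([round(v, 2) for v in area_scaled])
print([round(v, 2) for v in smoothness_scaled])
```

After standardisation both columns have comparable spreads, so neither dominates the distance computation inside the kernel.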
Visualising the result:
k=1
plt.figure(figsize=(20,40))
for i in range(0,4):
    for j in range(1,5):
        inp=pd.concat([X[imp_features[i]],X[imp_features[j]]],axis=1) #Taking data from two features
        s=svc['classifier'].fit(inp.values,Y) #Refitting the SVM on the two selected features only
        plt.subplot(4, 4, k)
        k=k+1
        plt.scatter(X[imp_features[i]], X[imp_features[j]], c=Y, s=30, cmap=plt.cm.Paired)
        ax = plt.gca()
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()
        xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),np.linspace(ylim[0], ylim[1], 50)) #Creating a meshgrid of data points
        xy = np.vstack([xx.ravel(), yy.ravel()]).T
        Z = s.decision_function(xy).reshape(xx.shape) #Signed distance to the decision boundary at every grid point
        plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, levels=[-1, 0, 1], alpha=0.5) #Shading the margin on each side of the boundary
        ax.scatter(s.support_vectors_[:, 0], s.support_vectors_[:, 1], s=10,linewidth=1, facecolors='none', edgecolors='k') #Showing support vectors
        plt.title(str(imp_features[i])+' & '+str(imp_features[j]))
The model does not appear to be over-fitting, as it gives good accuracy on the test data as well. Over-fitting occurs when the accuracy on the training data is very high but the accuracy on the test data is low.