Breast Cancer Prediction with Logistic Regression

Introduction

For this binary classification problem, we can use Python and Logistic Regression, since our breast cancer data has two classes, Class 2 and Class 4. Logistic Regression models the probability that a sample belongs to a class by passing a weighted sum of the features through the sigmoid function. To learn more about Logistic Regression, check out its documentation on sklearn and Wikipedia.
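To make the role of the sigmoid concrete, here is a minimal sketch of how a weighted sum of features becomes a probability. The weights, bias, and sample values are hypothetical numbers chosen for illustration; they are not learned from the breast cancer data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number z to a value between 0 and 1."""
    # Branch on the sign of z so math.exp never gets a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical weights, bias, and feature values, for illustration only.
weights = [0.8, -0.4]
bias = -1.0
sample = [3.0, 1.0]

# Weighted sum of the features, then squashed into a probability.
z = bias + sum(w * x for w, x in zip(weights, sample))
probability = sigmoid(z)

# A common convention is to predict the positive class at probability >= 0.5.
predicted_positive = probability >= 0.5
print(round(probability, 3), predicted_positive)
```

A probability of exactly 0.5 corresponds to z = 0, which is why the decision boundary of Logistic Regression is linear in the features.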

Data

To test our understanding of Logistic Regression and predict the class of breast cancer, we can use the breast cancer dataset available on Kaggle (licensed under 'public domain'). The data consists of 10 columns: nine features and the Class label. We will use all nine features for class prediction. Let's divide this data into X and Y:

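import pandas as pd

# Hypothetical filename; point this at your local copy of the Kaggle CSV.
data = pd.read_csv('breast_cancer.csv')

# Target: the class label (2 or 4).
Y = data['Class']

# The nine feature columns used as predictors.
features = ['Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses']

X = data[features]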

Here X holds the features and Y holds the class of breast cancer. Once the model is trained, we provide it with X and it predicts Y.

Splitting the data into training and test batches

After assigning X and Y, we can import the train_test_split function from sklearn to split our data into training and test datasets so that we can test our model once it's been trained. Splitting the data lets us measure the accuracy of the trained model on data that it has never seen before.

from sklearn.model_selection import train_test_split

(X_train, X_test, Y_train, Y_test) = train_test_split(X, Y, test_size = 0.3, random_state = 1)

Note: I used test_size = 0.3 as a parameter, which means that 30% of the provided data will be chosen for testing and the rest will be assigned to training by the function. Setting random_state = 1 fixes the random shuffle so that the split is reproducible.
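To see what these two parameters do, here is a small sketch on toy stand-in data (10 made-up rows, not the breast cancer dataset). The variable names are my own and are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 10 samples with 2 features each (not the real dataset).
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.array([2, 4, 2, 4, 2, 4, 2, 4, 2, 4])

# test_size=0.3 sends 30% of the rows to the test set.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.3, random_state=1
)
print(len(X_tr), len(X_te))  # 7 training rows and 3 test rows

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, Y_tr2, Y_te2 = train_test_split(
    X_demo, Y_demo, test_size=0.3, random_state=1
)
same_split = np.array_equal(X_te, X_te2) and np.array_equal(Y_te, Y_te2)
print(same_split)
```

Without a fixed random_state, each run shuffles differently, so the reported accuracy can vary from run to run.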

Importing Logistic Regression

We can import the Logistic Regression classifier from sklearn.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

Once the classifier has been created, we call the 'fit' function on our training data (X_train and Y_train) to train the model.

training = lr.fit(X_train, Y_train)

The trained model is then used to predict the class of breast cancer by passing X_test as a parameter. The predictions it returns can later be compared against Y_test.

prediction = training.predict(X_test)
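Behind predict, the model estimates a probability for each class and returns the more likely one. The sketch below illustrates this on toy one-feature data that I made up for the purpose; it is not the breast cancer dataset, and the labels 2 and 4 are reused only to mirror the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data: small values labelled 2, large values labelled 4.
X_toy = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
Y_toy = np.array([2, 2, 2, 4, 4, 4])

model = LogisticRegression().fit(X_toy, Y_toy)

X_new = np.array([[2.5], [8.5]])
proba = model.predict_proba(X_new)  # one column per entry of model.classes_
labels = model.predict(X_new)

# Picking the most probable class by hand gives the same labels as predict.
from_proba = model.classes_[np.argmax(proba, axis=1)]
print(model.classes_, labels, from_proba)
```

Looking at predict_proba can be useful when you care about how confident the model is, not just which class it picked.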

Accuracy of prediction

To measure the accuracy of the model, let's import the accuracy_score function from sklearn and pass it Y_test along with the predicted values stored in prediction:

from sklearn.metrics import accuracy_score

print(accuracy_score(Y_test, prediction))

0.9609756097560975
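The accuracy score is simply the fraction of predictions that match the true labels. As a rough sketch of that idea, here is a hand-written version on a few made-up labels; the accuracy helper is a hypothetical name of my own, not part of sklearn.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where y_pred equals y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of empty inputs")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels for illustration: 3 of the 4 predictions are correct.
y_true_demo = [2, 4, 4, 2]
y_pred_demo = [2, 4, 2, 2]
score = accuracy(y_true_demo, y_pred_demo)
print(score)  # 0.75
```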

The model gave a prediction accuracy of 96%. This may be improved further by feature selection, which means reducing the dimensionality by keeping only the features that truly contribute to predicting Y.
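One possible way to try feature selection is sklearn's SelectKBest, sketched below. This is my own choice of technique, not something the steps above used, and it runs on synthetic data from make_classification rather than the breast cancer dataset. The value k = 4 is an arbitrary assumption, and the reduced model is not guaranteed to score higher.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 9 features, only some of them informative.
X_syn, Y_syn = make_classification(
    n_samples=300, n_features=9, n_informative=4, n_redundant=2, random_state=1
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_syn, Y_syn, test_size=0.3, random_state=1
)

# Keep the k features with the highest ANOVA F-scores (k=4 is arbitrary).
# Fit the selector on the training data only, then apply it to the test data.
selector = SelectKBest(score_func=f_classif, k=4)
X_tr_sel = selector.fit_transform(X_tr, Y_tr)
X_te_sel = selector.transform(X_te)

# Train one model on all features and one on the selected features.
full_model = LogisticRegression(max_iter=1000).fit(X_tr, Y_tr)
reduced_model = LogisticRegression(max_iter=1000).fit(X_tr_sel, Y_tr)

full_acc = accuracy_score(Y_te, full_model.predict(X_te))
reduced_acc = accuracy_score(Y_te, reduced_model.predict(X_te_sel))
print(full_acc, reduced_acc)
```

Comparing the two scores on the held-out test set is what tells you whether dropping features helped for your data.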