Introduction
For this binary classification task, we can use Python and Logistic Regression, since our breast cancer data has two classes, Class 2 and Class 4. Logistic Regression models the probability that a sample belongs to a class given its features, and the function it relies on is the sigmoid function. To learn more about Logistic Regression, check out its documentation on sklearn and Wikipedia.
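As a rough illustration of the sigmoid function mentioned above, the sketch below squashes a weighted sum of features into a probability between 0 and 1 and then thresholds it at 0.5. The weights, bias, and sample values are made up for illustration and do not come from a trained model; the labels 2 and 4 simply mirror the class names in our data.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_class(sample, weights, bias, threshold=0.5):
    """Return (label, probability); label is 4 if P(class 4) >= threshold, else 2."""
    z = bias + sum(w * x for w, x in zip(weights, sample))
    probability = sigmoid(z)
    return (4 if probability >= threshold else 2), probability

# Hypothetical weights and bias, chosen by hand for illustration only.
toy_weights = [0.5, 0.3]
toy_bias = -4.0

label_low, p_low = predict_class([1, 1], toy_weights, toy_bias)     # small feature values
label_high, p_high = predict_class([10, 8], toy_weights, toy_bias)  # large feature values
print(label_low, round(p_low, 3))
print(label_high, round(p_high, 3))
```

In a real model the weights and bias are learned from the training data rather than picked by hand.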
Data
To test our understanding of Logistic Regression and predict the class of breast cancer, we can use the breast cancer dataset available on Kaggle (licensed under 'public domain'). The data consists of 10 columns: nine feature columns and the Class column. We will use all nine features for class prediction. Let's divide this data into X and Y:
Y = data['Class']
features = ['Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses']
X = data[features]
where X holds the features and Y holds the class of breast cancer. The idea is that once the model is trained, we provide it with X and it predicts Y.
Splitting the data into training and test batches
After assigning X and Y, we can import the train_test_split function from sklearn to split our data into training and test datasets, so that we can test the model once it has been trained. Splitting the data lets us measure the accuracy of the trained model on data that it has never seen before.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
Note: I used test_size = 0.3 as a parameter, which means that 30% of the provided data will be chosen for testing and the rest will be assigned to training by the function. Setting random_state = 1 fixes the random shuffle, so the split is the same on every run.
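To make the effect of these two parameters concrete, here is a small sketch on made-up data. The ten toy samples and their labels are invented for illustration and are not part of the breast cancer dataset.

```python
from sklearn.model_selection import train_test_split

# Ten toy samples with one feature each, plus made-up labels (2 or 4).
X_toy = [[i] for i in range(10)]
Y_toy = [2, 2, 4, 2, 4, 4, 2, 2, 4, 2]

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.3, random_state=1
)
print(len(X_tr), len(X_te))  # 7 training samples, 3 test samples

# Same random_state, same split: a second call returns identical test rows.
_, X_te_again, _, _ = train_test_split(
    X_toy, Y_toy, test_size=0.3, random_state=1
)
print(X_te == X_te_again)
```

Changing random_state (or leaving it out) would shuffle the rows differently, which is why a reported accuracy can vary slightly between runs.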
Importing Logistic Regression
We can import the Logistic Regression classifier from sklearn.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
Once it's been imported, we call the 'fit' function on our training data (X_train and Y_train) to train the model.
training = lr.fit(X_train, Y_train)
The trained model is then used to predict the class of breast cancer by passing X_test as a parameter. The predictions it returns can then be compared against the true labels in Y_test.
prediction = training.predict(X_test)
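Since Logistic Regression works with probabilities, it can help to look at them next to the predicted labels. The sketch below does this on a tiny made-up dataset with a single feature; the toy values are assumptions for illustration, and the same predict_proba call should work on the model trained above.

```python
from sklearn.linear_model import LogisticRegression

# Made-up data: small feature values are class 2, large ones are class 4.
X_toy = [[1], [2], [3], [7], [8], [9]]
Y_toy = [2, 2, 2, 4, 4, 4]

toy_model = LogisticRegression().fit(X_toy, Y_toy)

new_samples = [[2], [8]]
toy_labels = toy_model.predict(new_samples)       # hard class labels
toy_probs = toy_model.predict_proba(new_samples)  # one probability per class

print(toy_model.classes_)  # column order used by predict_proba
print(toy_labels)
print(toy_probs.round(3))
```

Each row of predict_proba sums to 1, and predict returns the class with the higher probability.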
Accuracy of prediction
To test the accuracy of the model, let's import the accuracy_score function from sklearn using the code below. We pass it Y_test along with the predicted values stored in prediction, so that the predictions are checked against the true labels:
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, prediction))
0.9609756097560975
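Accuracy here is simply the fraction of test samples whose predicted class matches the true class. A minimal hand-rolled version, using made-up labels rather than the real test set, might look like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for illustration: 4 of the 5 predictions are correct.
y_true_toy = [2, 4, 2, 2, 4]
y_pred_toy = [2, 4, 4, 2, 4]

toy_accuracy = accuracy(y_true_toy, y_pred_toy)
print(toy_accuracy)
```

This is only meant to show what the number measures; accuracy_score remains the convenient choice in practice.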
The model gave a prediction accuracy of about 96%. This may be improved further by feature selection, which essentially means reducing the dimensionality and keeping only the features that truly contribute to predicting Y.
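One possible way to try feature selection is sketched below, using sklearn's SelectKBest inside a Pipeline. This is an assumption about how one might approach it, not a step from the analysis above: the data is synthetic (generated with make_classification), and k = 4 is an arbitrary choice. Whether it actually improves accuracy on the breast cancer data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 9 features, of which only 4 are informative.
X_syn, Y_syn = make_classification(
    n_samples=300, n_features=9, n_informative=4, n_redundant=2, random_state=1
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_syn, Y_syn, test_size=0.3, random_state=1
)

# Keeping the selector inside the pipeline means it is fitted on the
# training data only, so the test set does not influence which features stay.
selector_model = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=4)),  # k=4 is arbitrary
    ("clf", LogisticRegression()),
])
selector_model.fit(X_tr, Y_tr)

kept_mask = selector_model.named_steps["select"].get_support()
kept_indices = [i for i, keep in enumerate(kept_mask) if keep]
selected_accuracy = selector_model.score(X_te, Y_te)

print("kept feature indices:", kept_indices)
print("test accuracy:", selected_accuracy)
```

Comparing this score with that of a model trained on all features, over several values of k, is a reasonable way to judge whether dropping features helps.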