Calculating Decision Tree Accuracy in Python Without Sklearn
Decision trees are powerful machine learning models that can classify data points into categories. Calculating their accuracy is essential for evaluating model performance. This guide explains how to calculate decision tree accuracy in Python without using scikit-learn, including a step-by-step implementation and practical example.
How to Calculate Decision Tree Accuracy in Python
Decision tree accuracy measures how often the model correctly predicts the target class. The calculation involves comparing predicted values with actual values in your dataset. Here's how to implement this without scikit-learn:
Step-by-Step Process
- Prepare your dataset with features and target labels
- Split the data into training and testing sets
- Build a decision tree model from scratch
- Make predictions on the test set
- Compare predictions with actual values
- Calculate the accuracy percentage
Accuracy is calculated as the ratio of correct predictions to total predictions. It's a simple but effective metric for evaluating classification models.
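The splitting step above can be sketched with the standard library alone. This is a minimal illustration, not a replacement for a tested utility: the function name `train_test_split`, the `test_ratio` parameter, and the fixed `seed` are assumptions made for this example.

```python
import random

def train_test_split(rows, labels, test_ratio=0.2, seed=0):
    """Shuffle indices and split rows/labels into train and test parts."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, int(round(len(rows) * test_ratio)))  # keep at least one test row
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

rows = [[5.1, 3.5], [6.2, 2.8], [4.9, 3.1], [5.8, 2.6], [5.5, 2.4]]
labels = [0, 1, 0, 1, 1]
X_train, X_test, y_train, y_test = train_test_split(rows, labels, test_ratio=0.4)
print(len(X_train), len(X_test))  # 3 2
```

Measuring accuracy on the held-out part gives a fairer estimate than scoring the rows the tree was trained on.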
The Accuracy Formula
The accuracy formula is straightforward:
Accuracy = (Number of Correct Predictions / Total Number of Predictions) × 100
Where:
- Number of Correct Predictions = Count of instances where predicted class matches actual class
- Total Number of Predictions = Total instances in the test set
This formula gives you a percentage that represents how often your model makes correct predictions.
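As a quick check of the formula, here is a small sketch in plain Python. The function name `accuracy_percent` and the sample predictions are made up for illustration.

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100 * correct / len(y_true)

# Hypothetical predictions with one mistake at index 2: 4 of 5 correct.
print(accuracy_percent([0, 1, 0, 1, 1], [0, 1, 1, 1, 1]))  # 80.0
```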
Worked Example
Let's walk through a practical example using a simple dataset:
Example Dataset
| Feature 1 | Feature 2 | Actual Class |
|---|---|---|
| 5.1 | 3.5 | 0 |
| 6.2 | 2.8 | 1 |
| 4.9 | 3.1 | 0 |
| 5.8 | 2.6 | 1 |
| 5.5 | 2.4 | 1 |
Calculation Steps
- Assume our model predicts: [0, 1, 0, 1, 1]
- Compare with actual classes: [0, 1, 0, 1, 1]
- Count correct predictions: 5
- Total predictions: 5
- Accuracy = (5/5) × 100 = 100%
In this perfect scenario, the model achieves 100% accuracy. Accuracy on unseen data is usually lower, and how much lower depends on the dataset, the features, and how the tree is tuned.
Python Implementation Without Sklearn
Here's a complete Python implementation of a decision tree classifier and accuracy calculator:
import numpy as np


class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def _gini(self, y):
        # Gini impurity: 1 - sum of squared class proportions
        classes, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return 1 - np.sum(probabilities ** 2)

    def _best_split(self, X, y):
        best_gini = float('inf')
        best_feature, best_threshold = None, None
        for feature_idx in range(X.shape[1]):
            # Every distinct feature value is a candidate threshold
            thresholds = np.unique(X[:, feature_idx])
            for threshold in thresholds:
                left_mask = X[:, feature_idx] <= threshold
                right_mask = ~left_mask
                # Skip splits that leave one side empty
                if len(y[left_mask]) == 0 or len(y[right_mask]) == 0:
                    continue
                gini_left = self._gini(y[left_mask])
                gini_right = self._gini(y[right_mask])
                # Impurity of the split, weighted by the size of each side
                weighted_gini = (len(y[left_mask]) * gini_left
                                 + len(y[right_mask]) * gini_right) / len(y)
                if weighted_gini < best_gini:
                    best_gini = weighted_gini
                    best_feature = feature_idx
                    best_threshold = threshold
        return best_feature, best_threshold

    def _build_tree(self, X, y, depth=0):
        num_classes = len(np.unique(y))
        # Stop at the depth limit or when the node is pure.
        # np.bincount requires non-negative integer labels.
        if (self.max_depth is not None and depth >= self.max_depth) or num_classes == 1:
            return np.bincount(y).argmax()
        feature_idx, threshold = self._best_split(X, y)
        if feature_idx is None:
            # No valid split exists: return the majority class as a leaf
            return np.bincount(y).argmax()
        left_mask = X[:, feature_idx] <= threshold
        right_mask = ~left_mask
        left_subtree = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_subtree = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        return {'feature_idx': feature_idx,
                'threshold': threshold,
                'left': left_subtree,
                'right': right_subtree}

    def fit(self, X, y):
        self.tree = self._build_tree(X, y)

    def _predict_sample(self, x, node):
        # Internal nodes are dicts; leaves are class labels
        if isinstance(node, dict):
            if x[node['feature_idx']] <= node['threshold']:
                return self._predict_sample(x, node['left'])
            else:
                return self._predict_sample(x, node['right'])
        else:
            return node

    def predict(self, X):
        return np.array([self._predict_sample(x, self.tree) for x in X])


def calculate_accuracy(y_true, y_pred):
    # Convert to arrays so the comparison is element-wise even for plain lists
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    correct = np.sum(y_true == y_pred)
    total = len(y_true)
    return (correct / total) * 100


# Example usage
# Note: this scores the tree on the same rows it was trained on, so the
# result is training accuracy. Use a held-out test set for a fair estimate.
X = np.array([[5.1, 3.5], [6.2, 2.8], [4.9, 3.1], [5.8, 2.6], [5.5, 2.4]])
y = np.array([0, 1, 0, 1, 1])
tree = DecisionTree(max_depth=3)
tree.fit(X, y)
predictions = tree.predict(X)
accuracy = calculate_accuracy(y, predictions)
print(f"Accuracy: {accuracy:.2f}%")
This implementation includes:
- A DecisionTree class with Gini impurity calculation
- Tree building with recursive splitting
- Prediction functionality
- Separate accuracy calculation function
This implementation uses Gini impurity as its splitting criterion, which is common in decision tree algorithms. You can modify it to use another impurity measure, such as entropy, if needed.
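If you want to try entropy, a sketch of the impurity function could look like the following. The name `entropy` is an assumption for this example; it takes any sequence of labels, and because lower values still mean purer nodes, it could in principle stand in for the body of `_gini` in the class above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    # Sum p * log2(1/p) over the classes that actually occur
    return sum((count / total) * math.log2(total / count)
               for count in Counter(labels).values())

print(entropy([0, 0, 1, 1]))  # evenly mixed node
print(entropy([1, 1, 1]))     # pure node
```

An evenly mixed two-class node has entropy 1 bit and a pure node has entropy 0, so the split search can minimise the weighted entropy in the same way it minimises weighted Gini.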
FAQ
Why is accuracy not always the best metric?
While accuracy is simple, it can be misleading with imbalanced datasets. For example, if 95% of your data is class A, a model that always predicts A would have 95% accuracy but be useless. Consider metrics like precision, recall, or F1-score for imbalanced data.
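The 95% scenario can be reproduced in a few lines. This is an illustrative sketch: the helper name `precision_recall_f1` is made up, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: report 0.0 when a metric is undefined (zero denominator)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

y_true = ["A"] * 95 + ["B"] * 5
y_pred = ["A"] * 100  # a "model" that always predicts A

accuracy = 100 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                            # 95.0
print(precision_recall_f1(y_true, y_pred, positive="B"))   # (0.0, 0.0, 0.0)
```

The accuracy looks strong, yet recall for class B is zero: the model never finds a single B instance.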
How does max_depth affect accuracy?
The max_depth parameter controls how deep the tree can grow. Shallow trees (small max_depth) may underfit, while deep trees may overfit. Finding the right depth through techniques like cross-validation is crucial for optimal accuracy.
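Cross-validation needs a way to carve the data into folds. Below is a minimal sketch of k-fold index generation using only the standard library; the function name `k_fold_indices` and its parameters are assumptions for this example, not part of the implementation above.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducible folds
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(test_idx))  # 8 2 on each of the five folds
```

To compare depths, you would fit a tree with each candidate max_depth on the training indices of every fold, compute accuracy on the matching test indices, and keep the depth with the best average.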
Can I use this without numpy?
Yes, you can rewrite the implementation using Python's built-in data structures, though it would be less efficient. The numpy version is provided for clarity and performance.
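As a starting point for a numpy-free version, the impurity calculation can be written with `collections.Counter`. The function names `gini` and `weighted_gini` are assumptions for this sketch; the rest of the tree would need a similar rewrite using lists.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(left_labels, right_labels):
    """Impurity of a split, weighted by the number of samples on each side."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini(left_labels)
            + len(right_labels) * gini(right_labels)) / n

print(gini([0, 0, 1, 1]))                # 0.5
print(weighted_gini([0, 0], [1, 1, 1]))  # 0.0
```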