Calculating Decision Tree Accuracy in Python Without Sklearn

Decision trees are powerful machine learning models that can classify data points into categories. Calculating their accuracy is essential for evaluating model performance. This guide explains how to calculate decision tree accuracy in Python without using scikit-learn, including a step-by-step implementation and practical example.

How to Calculate Decision Tree Accuracy in Python

Decision tree accuracy measures how often the model correctly predicts the target class. The calculation involves comparing predicted values with actual values in your dataset. Here's how to implement this without scikit-learn:

Step-by-Step Process

1. Prepare your dataset with features and target labels
2. Split the data into training and testing sets
3. Build a decision tree model from scratch
4. Make predictions on the test set
5. Compare predictions with actual values
6. Calculate the accuracy percentage

Accuracy is calculated as the ratio of correct predictions to total predictions. It's a simple but effective metric for evaluating classification models.
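The splitting step can be sketched without any libraries. This is a minimal illustration, not a definitive implementation: the function name split_train_test, the test_ratio default, and the fixed seed are all assumptions made for this example.

```python
import random

def split_train_test(rows, labels, test_ratio=0.2, seed=42):
    """Shuffle indices, then split rows and labels into train and test parts."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, int(len(rows) * test_ratio))  # keep at least one test row
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Five illustrative rows with two features each
rows = [[5.1, 3.5], [6.2, 2.8], [4.9, 3.1], [5.8, 2.6], [5.5, 2.4]]
labels = [0, 1, 0, 1, 1]
X_train, X_test, y_train, y_test = split_train_test(rows, labels)
print(len(X_train), "training rows,", len(X_test), "test row(s)")
```

Shuffling before splitting matters when the rows are ordered by class; without it, the test set could contain only one class.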

The Accuracy Formula

The accuracy formula is straightforward:

Accuracy = (Number of Correct Predictions / Total Number of Predictions) × 100

Where:

Number of Correct Predictions = Count of instances where predicted class matches actual class
Total Number of Predictions = Total instances in the test set

This formula gives you a percentage that represents how often your model makes correct predictions.
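The formula translates almost directly into plain Python. The sketch below assumes both inputs are equal-length sequences of labels; the name accuracy_percent and the sample predictions are made up for illustration.

```python
def accuracy_percent(y_true, y_pred):
    """Return the percentage of positions where prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100 * correct / len(y_true)

# Hypothetical predictions with one mistake (third item)
actual = [0, 1, 0, 1, 1]
predicted = [0, 1, 1, 1, 1]
print(accuracy_percent(actual, predicted))  # 4 of 5 correct -> 80.0
```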

Worked Example

Let's walk through a practical example using a simple dataset:

Example Dataset

Feature 1	Feature 2	Actual Class
5.1	3.5	0
6.2	2.8	1
4.9	3.1	0
5.8	2.6	1
5.5	2.4	1

Calculation Steps

Assume our model predicts: [0, 1, 0, 1, 1]
Compare with actual classes: [0, 1, 0, 1, 1]
Count correct predictions: 5
Total predictions: 5
Accuracy = (5/5) × 100 = 100%

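In this perfect scenario, our model achieves 100% accuracy. Real-world accuracy is usually lower and varies widely with the dataset, the class balance, and how the tree is configured, so there is no single "typical" range to expect.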
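To see where predictions like those in the worked example could come from, here is a sketch that applies a single threshold rule to Feature 1. The threshold of 5.1 is an assumption picked by inspecting the table, not something the example dataset prescribes.

```python
# Rows from the example dataset: (Feature 1, Feature 2)
rows = [(5.1, 3.5), (6.2, 2.8), (4.9, 3.1), (5.8, 2.6), (5.5, 2.4)]
actual = [0, 1, 0, 1, 1]

THRESHOLD = 5.1  # assumed cut-off on Feature 1, chosen by eye

# Predict class 0 when Feature 1 is at or below the threshold, else class 1
predicted = [0 if feature_1 <= THRESHOLD else 1 for feature_1, _ in rows]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = 100 * correct / len(actual)

print(predicted)  # [0, 1, 0, 1, 1]
print(accuracy)   # 100.0
```

A one-level rule like this is effectively a decision tree of depth 1, which is why such a small dataset can be classified perfectly.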

Python Implementation Without Sklearn

Here's a complete Python implementation of a decision tree classifier and accuracy calculator:

import numpy as np

class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def _gini(self, y):
        classes, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return 1 - np.sum(probabilities ** 2)

    def _best_split(self, X, y):
        best_gini = float('inf')
        best_feature, best_threshold = None, None

        for feature_idx in range(X.shape[1]):
            thresholds = np.unique(X[:, feature_idx])
            for threshold in thresholds:
                left_mask = X[:, feature_idx] <= threshold
                right_mask = ~left_mask

                if len(y[left_mask]) == 0 or len(y[right_mask]) == 0:
                    continue

                gini_left = self._gini(y[left_mask])
                gini_right = self._gini(y[right_mask])
                weighted_gini = (len(y[left_mask]) * gini_left + len(y[right_mask]) * gini_right) / len(y)

                if weighted_gini < best_gini:
                    best_gini = weighted_gini
                    best_feature = feature_idx
                    best_threshold = threshold

        return best_feature, best_threshold

    def _build_tree(self, X, y, depth=0):
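        # Stop splitting when the depth limit is reached or the node is pure
        num_classes = len(np.unique(y))

        if (self.max_depth is not None and depth >= self.max_depth) or num_classes == 1:
            # Leaf: predict the most common class (labels must be non-negative ints)
            return np.bincount(y).argmax()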

        feature_idx, threshold = self._best_split(X, y)

        if feature_idx is None:
            return np.bincount(y).argmax()

        left_mask = X[:, feature_idx] <= threshold
        right_mask = ~left_mask

        left_subtree = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_subtree = self._build_tree(X[right_mask], y[right_mask], depth + 1)

        return {'feature_idx': feature_idx,
                'threshold': threshold,
                'left': left_subtree,
                'right': right_subtree}

    def fit(self, X, y):
        self.tree = self._build_tree(X, y)

    def _predict_sample(self, x, node):
        if isinstance(node, dict):
            if x[node['feature_idx']] <= node['threshold']:
                return self._predict_sample(x, node['left'])
            else:
                return self._predict_sample(x, node['right'])
        else:
            return node

    def predict(self, X):
        return np.array([self._predict_sample(x, self.tree) for x in X])

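def calculate_accuracy(y_true, y_pred):
    # Convert to arrays so plain Python lists are also compared element-wise
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if len(y_true) == 0:
        raise ValueError("Cannot compute accuracy on an empty set of labels")
    correct = np.sum(y_true == y_pred)
    total = len(y_true)
    return (correct / total) * 100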

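# Example usage
# Note: this dataset is too small to split, so the tree is evaluated on the
# same rows it was trained on. That measures training accuracy, which is
# usually optimistic; use a held-out test set for a realistic estimate.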
X = np.array([[5.1, 3.5], [6.2, 2.8], [4.9, 3.1], [5.8, 2.6], [5.5, 2.4]])
y = np.array([0, 1, 0, 1, 1])

tree = DecisionTree(max_depth=3)
tree.fit(X, y)
predictions = tree.predict(X)
accuracy = calculate_accuracy(y, predictions)

print(f"Accuracy: {accuracy:.2f}%")

This implementation includes:

A DecisionTree class with Gini impurity calculation
Tree building with recursive splitting
Prediction functionality
Separate accuracy calculation function

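This implementation uses Gini impurity as its splitting criterion, a common choice in decision tree algorithms. You can modify it to use other measures such as entropy if needed. Because leaf predictions are computed with np.bincount, class labels must be non-negative integers.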
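As a sketch of the entropy alternative mentioned above, the function below computes Shannon entropy (in bits) for an array of labels. The name entropy is an assumption; to use it, the calls to _gini inside _best_split would be replaced with calls to a method like this one.

```python
import numpy as np

def entropy(y):
    """Shannon entropy in bits of the class distribution in y."""
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / counts.sum()
    # np.unique only returns classes that occur, so every probability is > 0
    # and log2 is always defined here.
    return float(-np.sum(probabilities * np.log2(probabilities)))

balanced = entropy(np.array([0, 0, 1, 1]))  # two equally likely classes
pure = entropy(np.array([1, 1, 1]))         # a single class, no uncertainty
print(balanced)
```

Like Gini impurity, entropy is zero for a pure node and largest when the classes are evenly mixed, so the weighted-average comparison in the split search can stay the same.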

FAQ

Why is accuracy not always the best metric?

While accuracy is simple, it can be misleading with imbalanced datasets. For example, if 95% of your data is class A, a model that always predicts A would have 95% accuracy but be useless. Consider metrics like precision, recall, or F1-score for imbalanced data.
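The imbalanced scenario described above can be reproduced in a few lines. This is a sketch with made-up labels; the helper names accuracy_percent and precision_recall are assumptions, and precision or recall is reported as 0.0 when its denominator is zero.

```python
def accuracy_percent(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100 * correct / len(y_true)

def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 95% of the data is class A; the "model" always predicts A
y_true = ["A"] * 95 + ["B"] * 5
y_pred = ["A"] * 100

acc = accuracy_percent(y_true, y_pred)
precision_b, recall_b = precision_recall(y_true, y_pred, positive="B")
print(acc)                    # 95.0
print(precision_b, recall_b)  # 0.0 0.0
```

The accuracy looks strong, but recall for class B shows that the model never finds a single minority example.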

How does max_depth affect accuracy?

The max_depth parameter controls how deep the tree can grow. Shallow trees (small max_depth) may underfit, while deep trees may overfit. Finding the right depth through techniques like cross-validation is crucial for optimal accuracy.
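One way to compare depths is k-fold cross-validation. The sketch below is hedged in several ways: k_fold_indices and cross_val_accuracy are hypothetical helper names, the code assumes k is no larger than the number of samples, and MajorityClassifier is only a stand-in so the example runs on its own. Any object with fit and predict methods, such as the DecisionTree class from this guide, could be passed instead.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and cut them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_val_accuracy(make_model, X, y, k=5, seed=0):
    """Average held-out accuracy (in percent) over k folds."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model()  # fresh model for every fold
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        scores.append(np.mean(predictions == y[test_idx]) * 100)
    return float(np.mean(scores))

class MajorityClassifier:
    """Stand-in model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = np.bincount(y).argmax()
    def predict(self, X):
        return np.full(len(X), self.label_)

X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
score = cross_val_accuracy(MajorityClassifier, X_demo, y_demo, k=5)
print(f"Mean cross-validated accuracy: {score:.1f}%")

# With the tree from this guide, depths could be compared like this:
# for depth in (1, 2, 3, 5):
#     print(depth, cross_val_accuracy(lambda: DecisionTree(max_depth=depth), X, y))
```

Choosing the depth with the best cross-validated score, rather than the best training score, is what guards against picking an overfit tree.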

Can I use this without numpy?

Yes, you can rewrite the implementation using Python's built-in data structures, though it would be less efficient. The numpy version is provided for clarity and performance.
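As a starting point for a numpy-free version, here is a sketch of the two building blocks the tree relies on, written with only the standard library. The function names gini_impurity and majority_label are assumptions for this example.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels, using only built-ins."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

def majority_label(labels):
    """Most common label, used as the prediction at a leaf."""
    return Counter(labels).most_common(1)[0][0]

print(gini_impurity([0, 0, 1, 1]))  # 0.5
print(gini_impurity([1, 1, 1]))     # 0.0
print(majority_label([0, 1, 1]))    # 1
```

Unlike np.bincount, Counter also works with string labels, so this variant does not require integer class labels.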