
1. Data Pipeline & Preprocessing

Real-world educational data is often noisy and incomplete. The system implements a robust preprocessing pipeline before feeding data to the classifier:

Feature Engineering: Irrelevant identifiers (such as student index columns) are dropped so that the model cannot learn spurious correlations from them.
Mean Imputation: Missing continuous quiz scores are detected automatically and replaced with the mean of their column. This preserves each column's mean, although it slightly reduces the column's variance.
Equal-Width Discretization (Binning): Because categorical Naive Bayes operates on discrete feature values, the continuous numeric scores are transformed into equal-width categorical bins (e.g., Low, Medium, High). The bin count $k$ is configurable (default $k=3$) so that different levels of data granularity can be tested.
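
The three preprocessing steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess`, its parameters, and the column names `student_id`, `quiz1`, and `passed` are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def preprocess(df, id_columns, score_columns, n_bins=3):
    """Drop identifier columns, mean-impute scores, then bin them (illustrative sketch)."""
    out = df.drop(columns=id_columns)
    for col in score_columns:
        # Mean imputation: replace missing values with the column mean.
        out[col] = out[col].fillna(out[col].mean())
        # Equal-width binning: an integer `bins` splits the column's range
        # into that many equal-width intervals; labels=False returns bin indices.
        out[col] = pd.cut(out[col], bins=n_bins, labels=False)
    return out


# Hypothetical toy data: one identifier, one quiz score with a gap, one label.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "quiz1": [0.0, 5.0, np.nan, 10.0],
    "passed": [0, 0, 1, 1],
})
clean = preprocess(df, id_columns=["student_id"], score_columns=["quiz1"])
print(clean)
```

In this sketch the bin edges are computed from whatever data is passed in. When a held-out test set is used, deriving the edges and the imputation mean from the training portion only avoids leaking test information into preprocessing.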

2. Core Algorithm & Mathematical Modeling

The classifier is built on Bayes' Theorem, calculating the prior probabilities of the classes and the likelihood of the features given the class. To ensure robustness, two critical mathematical enhancements were implemented:

Laplace Smoothing (Additive Smoothing)

At prediction time, the model may encounter a feature value that it never saw together with a given class during training (the zero-frequency problem). The estimated likelihood for that value would be zero, which collapses the whole product of probabilities for that class to zero. To prevent this, the algorithm applies Laplace Smoothing ($\alpha = 1$), which adds $\alpha$ to every count so that every possible feature value has a non-zero probability.
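
As a rough illustration of the smoothing step, the estimate for a feature with $K$ possible values is $P(x_i = v \mid c) = \frac{N_{v,c} + \alpha}{N_c + \alpha K}$, where $N_{v,c}$ counts training rows of class $c$ with value $v$ and $N_c$ counts all rows of class $c$. The helper name `smoothed_likelihoods` below is hypothetical and only sketches this formula.

```python
from collections import Counter


def smoothed_likelihoods(values, n_categories, alpha=1.0):
    """Estimate P(x = v | c) for v in 0..n_categories-1.

    `values` holds the (integer-coded) feature values observed within one class.
    """
    counts = Counter(values)
    total = len(values) + alpha * n_categories
    return [(counts[v] + alpha) / total for v in range(n_categories)]


# Bin 2 never occurs in this class, yet it still gets a non-zero probability.
probs = smoothed_likelihoods([0, 0, 1], n_categories=3)
print(probs)  # counts (2, 1, 0) become (3/6, 2/6, 1/6)
```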

Logarithmic Probabilities (Underflow Prevention)

A standard Naive Bayes prediction requires multiplying many small probabilities together. As the number of features grows, this can cause floating-point underflow, where the computer registers the product as 0.0. To avoid this, the algorithm works in log-space. Instead of multiplying raw probabilities, it sums their natural logarithms: $\log P(c \mid X) = \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) + \text{const}$, where the constant $-\log P(X)$ is identical for every class and can be ignored when comparing classes. This keeps the computation numerically stable even for high-dimensional data, provided no individual probability is exactly zero, which Laplace Smoothing ensures.
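
The effect is easy to reproduce. The snippet below is a small sketch with made-up numbers (400 features, each with likelihood 0.01, and a prior of 0.5). The function name `log_score` is an assumption for illustration.

```python
import math


def log_score(prior, likelihoods):
    """Unnormalized log-posterior: log P(c) + sum_i log P(x_i | c)."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)


# Hypothetical case: 400 features, each with likelihood 0.01 under some class.
likelihoods = [0.01] * 400

raw_product = 0.5 * math.prod(likelihoods)  # underflows to 0.0 in double precision
score = log_score(0.5, likelihoods)         # stays a finite negative number

print(raw_product, score)
```

Because the logarithm is monotonic, the class with the largest log-score is also the class with the largest posterior probability, so the predicted label is unchanged.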

3. Evaluation & Hyperparameter Tuning

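The model is evaluated using an 80/20 train-test split. To understand the impact of the data pipeline on the model's predictive power, an automated experiment iterates the discretization bin count ($k$) from 1 to 10. With $k=1$ every score falls into the same bin, so the features carry no information and that setting acts as a prior-only baseline.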

The system tracks how the model's test accuracy changes with the bin count.
Results are visualized using matplotlib, demonstrating the balance between underfitting (too few bins, loss of detail) and overfitting (too many bins, data fragmentation).
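
One way such an experiment could be structured is sketched below. It assumes a NumPy feature matrix `X`, a label vector `y`, and any classifier object exposing `fit` and `predict`. The names `equal_width_codes`, `bin_sweep`, and `MajorityClass` are hypothetical. `MajorityClass` is only a stand-in so the sketch runs on its own; the Naive Bayes model would be passed in its place.

```python
import numpy as np


def equal_width_codes(col, lo, hi, k):
    """Map values to bin indices 0..k-1 using k equal-width bins over [lo, hi]."""
    width = (hi - lo) / k or 1.0             # guard against a constant column
    codes = np.floor((col - lo) / width).astype(int)
    return np.clip(codes, 0, k - 1)          # maximum and out-of-range values go to edge bins


def bin_sweep(X, y, make_model, bin_counts=range(1, 11), test_fraction=0.2, seed=0):
    """Return {k: test accuracy} for each bin count k, using one fixed split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = int(round(test_fraction * len(y)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    results = {}
    for k in bin_counts:
        X_train = np.empty((len(train_idx), X.shape[1]), dtype=int)
        X_test = np.empty((len(test_idx), X.shape[1]), dtype=int)
        for j in range(X.shape[1]):
            # Bin edges come from the training rows only.
            lo, hi = X[train_idx, j].min(), X[train_idx, j].max()
            X_train[:, j] = equal_width_codes(X[train_idx, j], lo, hi, k)
            X_test[:, j] = equal_width_codes(X[test_idx, j], lo, hi, k)
        model = make_model()
        model.fit(X_train, y[train_idx])
        results[k] = float(np.mean(model.predict(X_test) == y[test_idx]))
    return results


class MajorityClass:
    """Hypothetical stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)
```

The returned dictionary can then be plotted, for example with `plt.plot(list(results), list(results.values()), marker="o")` after `import matplotlib.pyplot as plt`.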

💻 Technical Architecture

Language: Python 3.x
Data Manipulation: pandas, numpy (for vectorized mathematical operations).
Visualization: matplotlib (for plotting accuracy curves and discretization histograms).
Architecture: Object-oriented design. The model can be instantiated, trained (fit), and used for prediction (predict), mirroring the fit/predict convention common in machine-learning libraries such as scikit-learn.
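
To make the fit/predict structure concrete, here is a compact sketch of what such a class might look like. It is not the project's actual implementation: the class name `CategoricalNaiveBayes`, its attributes, and the toy data are assumptions. It combines Laplace smoothing with log-space scoring and expects features already coded as integer bin indices.

```python
import math
from collections import Counter


class CategoricalNaiveBayes:
    """Illustrative Naive Bayes for integer-coded categorical features."""

    def __init__(self, n_categories, alpha=1.0):
        self.n_categories = n_categories  # number of bins per feature
        self.alpha = alpha                # Laplace smoothing strength

    def fit(self, X, y):
        self.class_counts_ = Counter(y)
        self.n_samples_ = len(y)
        # counts_[c][i][v] = training rows of class c whose feature i equals v
        self.counts_ = {c: [Counter() for _ in X[0]] for c in self.class_counts_}
        for row, label in zip(X, y):
            for i, v in enumerate(row):
                self.counts_[label][i][v] += 1
        return self

    def _log_score(self, row, c):
        n_c = self.class_counts_[c]
        score = math.log(n_c / self.n_samples_)  # log prior
        for i, v in enumerate(row):
            num = self.counts_[c][i][v] + self.alpha
            den = n_c + self.alpha * self.n_categories
            score += math.log(num / den)         # smoothed log likelihood
        return score

    def predict(self, X):
        return [max(self.class_counts_, key=lambda c: self._log_score(row, c))
                for row in X]


# Toy usage with two binned features (values 0..2) and made-up labels.
X_train = [[0, 0], [0, 1], [2, 2], [2, 1]]
y_train = ["fail", "fail", "pass", "pass"]
model = CategoricalNaiveBayes(n_categories=3).fit(X_train, y_train)
predictions = model.predict([[0, 0], [2, 2]])
print(predictions)  # ['fail', 'pass']
```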

Naive Bayes Classifier: Student Performance Prediction