Naive Bayes Classifier: Student Performance Prediction
This project implements a custom **Naive Bayes Classifier built entirely from scratch** in Python. The objective is to predict student outcomes (Pass/Fail) based on a continuous dataset of sequential quiz scores. Rather than relying on high-level machine learning libraries like `scikit-learn`, this project mathematically constructs the probabilistic model, handling everything from data imputation to numerical stability optimizations.

1. Data Pipeline & Preprocessing
Real-world educational data is often noisy and incomplete. The system implements a robust preprocessing pipeline before feeding data to the classifier:
- Feature Engineering: Irrelevant identifiers (like student index columns) are stripped to prevent the model from learning artificial correlations.
- Mean Imputation: Missing continuous quiz scores are automatically detected and replaced with the mean of their respective columns, which keeps every row usable and leaves each column's mean unchanged.
- Equal-Width Discretization (Binning): Because categorical Naive Bayes operates on discrete feature values, the continuous numeric scores are transformed into equal-width categorical bins (e.g., Low, Medium, High). The bin count is configurable, so different levels of data granularity can be tested.
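The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the column names (`index`, `quiz1`, `quiz2`), and the choice of three bins are all assumptions made for the example, and the frame is assumed to hold only identifier and score columns (no Pass/Fail label).

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame, id_cols=("index",), n_bins: int = 3) -> pd.DataFrame:
    """Drop identifier columns, mean-impute, then equal-width bin each score column."""
    # Remove identifier columns if present (the column names are hypothetical).
    features = df.drop(columns=[c for c in id_cols if c in df.columns])
    # Replace each missing score with the mean of its own column.
    features = features.fillna(features.mean(numeric_only=True))
    # pd.cut with an integer `bins` splits [min, max] into equal-width intervals;
    # labels=False returns the integer bin index (0 .. n_bins - 1).
    return features.apply(lambda col: pd.cut(col, bins=n_bins, labels=False))


# Tiny made-up example with one missing value per quiz column.
scores = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "quiz1": [2.0, np.nan, 8.0, 10.0],
    "quiz2": [0.0, 5.0, 10.0, np.nan],
})
binned = preprocess(scores, n_bins=3)
print(binned)
```

One design point worth noting: this sketch computes column means and bin edges on the whole frame. In an evaluation setting it is usually safer to derive them from the training split only and reuse them on the test split.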
2. Core Algorithm & Mathematical Modeling
The classifier is built on Bayes' Theorem, calculating the prior probabilities of the classes and the likelihood of the features given the class. To ensure robustness, two critical mathematical enhancements were implemented:
Laplace Smoothing (Additive Smoothing)
In real-world testing, a model might encounter a feature value that it never saw during training for a given class (the zero-frequency problem). Because the likelihoods are multiplied together, a single zero would collapse the entire probability for that class to zero. To prevent this, the algorithm applies Laplace smoothing, adding a small constant to every count so that every possible feature value keeps a non-zero probability.
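A small sketch of additive smoothing for a single feature within a single class is shown below. The helper name `smoothed_likelihoods` is hypothetical, and the smoothing constant `alpha=1.0` (classic add-one smoothing) is an assumption; the project's own value is not stated here.

```python
from collections import Counter


def smoothed_likelihoods(values, n_categories, alpha=1.0):
    """Estimate P(value | class) for one feature, using additive smoothing.

    values: bin indices observed in the training rows of one class.
    n_categories: number of possible bins for this feature.
    alpha: smoothing strength (alpha=1.0 is an assumed value).
    """
    counts = Counter(values)
    total = len(values)
    # (count + alpha) / (total + alpha * K) sums to 1 over the K categories.
    return [
        (counts.get(k, 0) + alpha) / (total + alpha * n_categories)
        for k in range(n_categories)
    ]


# Bin 2 never appears in these training rows, yet it keeps a non-zero probability.
probs = smoothed_likelihoods([0, 0, 1, 0], n_categories=3)
print(probs)
```

With the counts 3, 1, and 0 and three categories, the estimates come out as 4/7, 2/7, and 1/7, so the unseen bin is unlikely but not impossible.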
Logarithmic Probabilities (Underflow Prevention)
A standard Naive Bayes prediction requires multiplying many small fractional probabilities together. As the number of features grows, this causes floating-point underflow, where the computer registers the product as 0.0.
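To solve this, the algorithm moves the computation into log-space. Instead of multiplying raw probabilities, it sums their natural logarithms and selects the class with the highest total:

$$\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right]$$

Here $c$ ranges over the classes (Pass/Fail), $x_i$ is the binned value of the $i$-th quiz score, and $n$ is the number of features. Because the logarithm is monotonic, the predicted class is unchanged, and the sum stays within floating-point range even when the dataset has many features.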
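The underflow problem is easy to reproduce. The snippet below is only an illustration with made-up numbers (1000 likelihoods of 0.01 each), not data from the project:

```python
import math

# 1000 hypothetical per-feature likelihoods, each equal to 0.01.
probs = [0.01] * 1000

# Multiplying directly: the true value (1e-2000) is far below the smallest
# positive double, so the product underflows to exactly 0.0.
naive_product = math.prod(probs)

# Summing logarithms instead: the result is a large negative but finite number.
log_score = sum(math.log(p) for p in probs)

print(naive_product)  # 0.0
print(log_score)      # roughly -4605.17
```

Since only the ranking of classes matters for prediction, the log scores can be compared directly and never need to be converted back to probabilities.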
3. Evaluation & Hyperparameter Tuning
The model is evaluated using a strict 80/20 train-test split. To understand the impact of the data pipeline on the model's predictive power, an automated experiment varies the discretization bin count from 1 to 10.
- The system tracks how the model's accuracy scales with different bin sizes.
- Results are visualized using `matplotlib`, demonstrating the balance between underfitting (too few bins, loss of detail) and overfitting (too many bins, data fragmentation).
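The experiment loop might look something like the sketch below. All names here (`train_test_split`, `accuracy`, `sweep_bins`, and the `evaluate` callback) are hypothetical, and `majority_baseline` is only a stand-in so the example runs on its own; in the real project the callback would discretize with the given bin count, train the Naive Bayes model, and return its predictions.

```python
import random


def train_test_split(rows, labels, test_ratio=0.2, seed=0):
    """Shuffle row indices and split them into 80% train / 20% test."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) - round(len(idx) * test_ratio)
    train_idx, test_idx = idx[:cut], idx[cut:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sweep_bins(rows, labels, evaluate, bin_counts=range(1, 11)):
    """Return {n_bins: test accuracy} for each bin count, using one fixed split."""
    X_tr, X_te, y_tr, y_te = train_test_split(rows, labels)
    return {k: accuracy(y_te, evaluate(X_tr, y_tr, X_te, k)) for k in bin_counts}


def majority_baseline(X_train, y_train, X_test, n_bins):
    # Stand-in for the real classifier: ignores n_bins and always predicts
    # the most common training label.
    top = max(set(y_train), key=y_train.count)
    return [top] * len(X_test)


# Made-up data: 20 students with one score each, 14 of whom pass.
rows = [[float(i)] for i in range(20)]
labels = ["Pass" if i >= 6 else "Fail" for i in range(20)]
results = sweep_bins(rows, labels, majority_baseline)
print(results)
```

The resulting dictionary maps each bin count to a test accuracy, so its keys and values can be passed straight to `matplotlib.pyplot.plot` to draw the accuracy curve.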
💻 Technical Architecture
- Language: Python 3.x
- Data Manipulation: `pandas`, `numpy` (for vectorized mathematical operations).
- Visualization: `matplotlib` (for plotting accuracy curves and discretization histograms).
- Architecture: Object-Oriented Design, allowing the model to be instantiated, trained (`fit`), and tested (`predict`), mirroring standard industry API structures.
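To make the `fit`/`predict` structure concrete, here is a compact sketch of what such a class could look like for already-binned features. It is an illustration under assumptions, not the repository's implementation: the class name `NaiveBayesSketch`, the attribute names, and the defaults `n_bins=3` and `alpha=1.0` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Categorical Naive Bayes over integer bin indices (hypothetical class)."""

    def __init__(self, n_bins=3, alpha=1.0):
        self.n_bins = n_bins  # number of possible categories per feature
        self.alpha = alpha    # additive smoothing strength (assumed value)

    def fit(self, X, y):
        self.class_counts_ = Counter(y)
        # counts_[c][j][v] = training rows of class c whose feature j equals v
        self.counts_ = defaultdict(lambda: defaultdict(Counter))
        for row, label in zip(X, y):
            for j, value in enumerate(row):
                self.counts_[label][j][value] += 1
        return self

    def _log_score(self, row, c):
        n_c = self.class_counts_[c]
        # Log prior: share of training rows that belong to class c.
        score = math.log(n_c / sum(self.class_counts_.values()))
        for j, value in enumerate(row):
            num = self.counts_[c][j][value] + self.alpha
            den = n_c + self.alpha * self.n_bins
            score += math.log(num / den)  # smoothed log likelihood
        return score

    def predict(self, X):
        # For each row, pick the class with the highest log score.
        return [
            max(self.class_counts_, key=lambda c: self._log_score(row, c))
            for row in X
        ]


# Made-up binned training data: two quiz features, values in {0, 1, 2}.
X_train = [[0, 0], [0, 1], [2, 2], [2, 1]]
y_train = ["Fail", "Fail", "Pass", "Pass"]
model = NaiveBayesSketch(n_bins=3).fit(X_train, y_train)
predictions = model.predict([[0, 0], [2, 2]])
print(predictions)  # ['Fail', 'Pass']
```

Returning `self` from `fit` allows the instantiate-then-train call to be chained on one line, which is the same convention `scikit-learn` estimators follow.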
Further Reading & Source Code
To view the underlying mathematical implementation, the data processing pipeline, and the generated evaluation charts, please visit the repository: https://github.com/LPH1110/tdtu_ai_final_N14/tree/main/source/task4.