$ pip install -r requirements.txt && streamlit run app.py
Key results
SVM selected as the best model in a three-model comparison using 5-fold cross-validation.
- Macro-F1: 0.87
- Accuracy: 89%
- Reviews classified: 5,000
- Sentiment classes: 3
Methodology
Built on 5,000 synthetic product reviews across 5 categories with 3 sentiment classes. Text is preprocessed with lowercasing, stopword removal, and lemmatization, then vectorized as TF-IDF unigram+bigram features. Three classifiers are compared using 5-fold cross-validation on macro-F1, with SVM (LinearSVC) achieving the best score. Error analysis shows that most misclassifications involve neutral reviews at the boundary with the positive or negative class.
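The comparison step can be sketched with scikit-learn; the library itself is an assumption, but the LinearSVC model, the TF-IDF settings, and macro-F1 scoring come from the methodology above. A minimal version on toy stand-in data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Toy stand-in for the preprocessed reviews (the real project uses 5,000).
texts = (["great product love it"] * 6
         + ["terrible broke after a day"] * 6
         + ["it is okay nothing special"] * 6)
labels = ["positive"] * 6 + ["negative"] * 6 + ["neutral"] * 6

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "RF": RandomForestClassifier(n_estimators=100),
}

for name, clf in models.items():
    pipe = Pipeline([
        # Unigram+bigram TF-IDF capped at 10K features, as in the project.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=10000)),
        ("clf", clf),
    ])
    # 5-fold cross-validation scored on macro-F1, matching the methodology.
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.3f}")
```

Selecting the winner is then just an argmax over the mean CV scores; the project reports SVM coming out on top.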
1. Text preprocessing: cleaning, stopword removal, lemmatization
2. TF-IDF features: 10K features, (1,2) n-grams
3. Model training: LR, SVM, Random Forest
4. Evaluation: confusion matrix, error analysis
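The evaluation step (confusion matrix plus inspection of misclassified neutral reviews) can be sketched like this; scikit-learn is an assumption, and the tiny train/test sets stand in for the real split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix

# Toy train/test stand-ins for the real 5,000-review dataset.
train_texts = ["love it", "great quality", "awful product", "broke quickly",
               "it is fine", "nothing special"]
train_labels = ["positive", "positive", "negative", "negative",
                "neutral", "neutral"]
test_texts = ["great fine product", "it is awful"]
test_labels = ["neutral", "negative"]

pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
                 ("clf", LinearSVC())]).fit(train_texts, train_labels)
pred = pipe.predict(test_texts)

# Confusion matrix with a fixed label order, then surface the cases where a
# neutral review was pushed to positive or negative -- the dominant error
# pattern reported above.
labels_order = ["negative", "neutral", "positive"]
print(confusion_matrix(test_labels, pred, labels=labels_order))
for text, true, p in zip(test_texts, test_labels, pred):
    if true != p and true == "neutral":
        print(f"neutral misread as {p}: {text!r}")
```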
How to run
Three commands to generate data, train models, and launch the dashboard
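The three commands might look like the sketch below. Only the install and dashboard-launch commands are taken from the project itself; the script names for the data-generation and training steps are hypothetical placeholders.

```shell
pip install -r requirements.txt
python generate_data.py    # 1. generate the 5,000 synthetic reviews (hypothetical name)
python train_models.py     # 2. preprocess, vectorize, train and compare models (hypothetical name)
streamlit run app.py       # 3. launch the dashboard
```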
The dataset contains 5,000 synthetic product reviews built from realistic vocabulary and sentence structure. Reviews span 5 product categories (Electronics, Clothing, Home & Kitchen, Books, Sports & Outdoors) with 3 sentiment classes: positive (45%), neutral (30%), and negative (25%). Each review includes text, rating (1-5), product category, character length, and word count.
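A single record can be pictured as below. The field names are illustrative rather than taken from the project's actual schema, but the fields themselves (text, rating, category, character length, word count) and the class proportions are from the description above:

```python
# Illustrative record; field names are assumptions, the fields are from the text.
review = {
    "text": "Great headphones, battery easily lasts a full day.",
    "rating": 5,                # 1-5 star rating
    "category": "Electronics",  # one of the 5 product categories
    "sentiment": "positive",    # positive / neutral / negative
}
# Derived length features stored alongside each review.
review["char_length"] = len(review["text"])
review["word_count"] = len(review["text"].split())

# Class sizes implied by the stated proportions: 45% / 30% / 25% of 5,000.
counts = {s: int(5000 * p) for s, p in
          [("positive", 0.45), ("neutral", 0.30), ("negative", 0.25)]}
print(review["word_count"], counts)
```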