
Product review sentiment analysis

Classifying 5,000 product reviews into positive, neutral, and negative sentiment using TF-IDF features and linear classifiers.

TF-IDF, SVM, NLP, Text classification, Sentiment analysis, Product reviews
0.87
Macro-F1 score

Interactive dashboard

Five-page Streamlit application for exploring sentiment patterns and model outputs

Live prediction
  • Type any review for real-time classification
  • Confidence scores with bar chart
  • Preprocessed text preview
Word clouds
  • Word cloud per sentiment class
  • Most frequent terms after preprocessing
  • Visual comparison across classes
Model comparison
  • Three-model metric comparison table
  • Confusion matrices side by side
  • Per-class F1 breakdown
Predictive terms
  • Top TF-IDF features per sentiment class
  • Interactive coefficient explorer
  • Words that drive classification decisions
Data explorer
  • Sentiment distribution and rating breakdown
  • Word count histograms by class
  • Category-level analysis and sample review browser
$ pip install -r requirements.txt && streamlit run app.py
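
The live prediction page lists confidence scores, but the page does not say how they are derived. A linear SVM produces raw per-class decision scores rather than probabilities, so one plausible approach is to pass those scores through a softmax. The sketch below assumes that approach; the function names and the text bar are hypothetical stand-ins for the app's chart, and the resulting values are not calibrated probabilities.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into positive values that sum to 1."""
    peak = max(scores)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_by_class(scores, labels):
    """Pair each label with its softmax-normalised score."""
    return dict(zip(labels, softmax(scores)))


def text_bar(value, width=20):
    """Render a value in [0, 1] as a fixed-width text bar."""
    filled = round(value * width)
    return "#" * filled + "." * (width - filled)


if __name__ == "__main__":
    labels = ["positive", "neutral", "negative"]
    # Illustrative decision scores, not output from the project's model.
    for label, conf in confidence_by_class([1.2, 0.1, -0.9], labels).items():
        print(f"{label:<9} {text_bar(conf)} {conf:.2f}")
```

The ranking of classes is unchanged by the softmax, so the predicted label is the same as picking the largest raw score.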

Key results

SVM selected as the best model after three-model comparison with 5-fold cross-validation

0.87
Macro-F1
89%
Accuracy
5,000
Reviews classified
3
Sentiment classes
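
Macro-F1 and accuracy can differ because macro-F1 averages the per-class F1 scores with equal weight, so a weaker class pulls it down regardless of class size. The sketch below computes both from a confusion matrix. The counts are made up for illustration and are not the project's actual results.

```python
def per_class_f1(confusion, labels):
    """Return {label: F1} from a square matrix (rows = true, columns = predicted)."""
    scores = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(len(labels))) - tp
        fn = sum(confusion[i]) - tp
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(confusion, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(confusion, labels)
    return sum(scores.values()) / len(scores)


def accuracy(confusion):
    """Share of all samples that sit on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical counts, chosen only to show the mechanics.
LABELS = ["positive", "neutral", "negative"]
EXAMPLE = [
    [430, 18, 2],   # true positive reviews
    [30, 240, 30],  # true neutral reviews
    [2, 18, 230],   # true negative reviews
]

if __name__ == "__main__":
    print("accuracy:", round(accuracy(EXAMPLE), 3))
    print("macro-F1:", round(macro_f1(EXAMPLE, LABELS), 3))
    print("per class:", per_class_f1(EXAMPLE, LABELS))
```

In this made-up matrix the neutral class has the lowest F1, which is why the macro average lands below the accuracy.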

Methodology

Built on 5,000 synthetic product reviews across 5 categories with 3 sentiment classes. Text is preprocessed with lowercasing, stopword removal, and lemmatization, then vectorized using TF-IDF with unigram and bigram features. Three classifiers are compared using 5-fold cross-validation on macro-F1, with SVM (LinearSVC) achieving the best performance. Error analysis shows that most misclassifications involve neutral reviews at the boundary with the positive or negative class.
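
The preprocessing and vectorization steps described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the stopword set and lemma lookup are tiny hypothetical stand-ins for a full stopword list and a real lemmatizer, and the IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1 with L2 normalization, which is an assumption about the weighting.

```python
import math
import re
from collections import Counter

# Hypothetical miniature resources, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "was", "it", "and", "this", "i", "to", "of"}
LEMMAS = {"batteries": "battery", "lasted": "last", "lasts": "last", "days": "day"}


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and map tokens to lemmas."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams from n_min to n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    term_lists = [ngrams(preprocess(d)) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        weights = {
            t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


if __name__ == "__main__":
    print(preprocess("The batteries lasted two days"))
    for vector in tfidf(["great battery", "great screen"]):
        print(vector)
```

Terms that appear in fewer documents receive a higher weight, so "battery" outweighs "great" in the first example document. The sketch does not cap the vocabulary; the 10K feature limit mentioned on this page would be an extra step that keeps only the most frequent terms.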

Text preprocessing
Clean, stopwords, lemmatize
TF-IDF features
10K features, (1,2) n-grams
Model training
LR, SVM, Random Forest
Evaluation
Confusion matrices, error analysis
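
For the 5-fold cross-validation used in model comparison, the sketch below shows one way to build folds that keep the class proportions roughly equal in every split. It assumes stratified folds, which this page does not confirm, and the function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_test_splits(labels, k=5):
    """Yield (train_indices, test_indices), holding out each fold once."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


if __name__ == "__main__":
    # 100 labels in the 45/30/25 proportions reported for the dataset.
    labels = ["positive"] * 45 + ["neutral"] * 30 + ["negative"] * 25
    for train, test in train_test_splits(labels):
        print(len(train), len(test))
```

The sketch does not shuffle; in practice the indices within each class would usually be shuffled with a fixed seed before dealing them out. Each model is then trained on the training indices and scored on the held-out fold, and the five macro-F1 values are averaged.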

How to run

Install dependencies, then generate data, train models, and launch the dashboard

$ pip install -r requirements.txt
$ python data/generate_data.py
$ python -c "from src.data_loader import load_and_prepare; from src.model import train_and_evaluate; X_tr, X_te, y_tr, y_te, _ = load_and_prepare(); train_and_evaluate(X_tr, X_te, y_tr, y_te)"
$ streamlit run app.py

Data source

The dataset contains 5,000 synthetic product reviews built from realistic vocabulary and sentence structure. Reviews span 5 product categories (Electronics, Clothing, Home & Kitchen, Books, Sports & Outdoors) with 3 sentiment classes: positive (45%), neutral (30%), and negative (25%). Each review includes text, rating (1-5), product category, character length, and word count.
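
As a rough illustration of the record layout described above, the sketch below models one review and derives the two length fields from the text. The class name, field names, and validation rule are hypothetical, since the page does not show the real schema. The helper at the end converts the stated class shares into expected counts.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    """One review record; field names are assumptions, not the real schema."""
    text: str
    rating: int
    category: str
    sentiment: str
    char_length: int = field(init=False)
    word_count: int = field(init=False)

    def __post_init__(self):
        if not 1 <= self.rating <= 5:
            raise ValueError(f"rating must be between 1 and 5, got {self.rating}")
        # Derived fields are computed from the text rather than passed in.
        self.char_length = len(self.text)
        self.word_count = len(self.text.split())


# Class shares as stated in the data source description.
CLASS_SHARES = {"positive": 0.45, "neutral": 0.30, "negative": 0.25}


def expected_counts(total, shares):
    """Translate class shares into whole-number review counts."""
    return {label: round(total * share) for label, share in shares.items()}


if __name__ == "__main__":
    print(Review("Battery lasts two days", 5, "Electronics", "positive"))
    print(expected_counts(5000, CLASS_SHARES))
```

With 5,000 reviews, the stated shares correspond to 2,250 positive, 1,500 neutral, and 1,250 negative reviews.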

Links