$ pip install -r requirements.txt && streamlit run app.py
Key results
SVM selected as the best model in a three-model comparison using 5-fold cross-validation.
- Macro-F1: 0.87
- Accuracy: 89%
- Reviews classified: 5,000
- Sentiment classes: 3
Methodology
Built on 5,000 synthetic product reviews across 5 categories with 3 sentiment classes. Text is preprocessed with lowercasing, stopword removal, and lemmatization, then vectorized as TF-IDF unigram+bigram features. Three classifiers are compared using 5-fold cross-validation on macro-F1, with SVM (LinearSVC) achieving the best score. Error analysis shows that most misclassifications involve neutral reviews at the boundary with the positive or negative class.
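The comparison step can be sketched with scikit-learn; the library itself is an assumption, but the LinearSVC model, the TF-IDF settings, and macro-F1 scoring come from the methodology above. A minimal version on toy stand-in data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Toy stand-in for the preprocessed reviews (the real project uses 5,000).
texts = (["great product love it"] * 6
         + ["terrible broke after a day"] * 6
         + ["it is okay nothing special"] * 6)
labels = ["positive"] * 6 + ["negative"] * 6 + ["neutral"] * 6

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "RF": RandomForestClassifier(n_estimators=100),
}

for name, clf in models.items():
    pipe = Pipeline([
        # Unigram+bigram TF-IDF capped at 10K features, as in the project.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=10000)),
        ("clf", clf),
    ])
    # 5-fold cross-validation scored on macro-F1, matching the methodology.
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.3f}")
```

Selecting the winner is then just an argmax over the mean CV scores; the project reports SVM coming out on top.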
1. Text preprocessing: cleaning, stopword removal, lemmatization
2. TF-IDF features: 10K features, (1,2) n-grams
3. Model training: LR, SVM, Random Forest
4. Evaluation: confusion matrix, error analysis
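The evaluation step (confusion matrix plus inspection of misclassified neutral reviews) can be sketched like this; scikit-learn is an assumption, and the tiny train/test sets stand in for the real split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix

# Toy train/test stand-ins for the real 5,000-review dataset.
train_texts = ["love it", "great quality", "awful product", "broke quickly",
               "it is fine", "nothing special"]
train_labels = ["positive", "positive", "negative", "negative",
                "neutral", "neutral"]
test_texts = ["great fine product", "it is awful"]
test_labels = ["neutral", "negative"]

pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
                 ("clf", LinearSVC())]).fit(train_texts, train_labels)
pred = pipe.predict(test_texts)

# Confusion matrix with a fixed label order, then surface the cases where a
# neutral review was pushed to positive or negative -- the dominant error
# pattern reported above.
labels_order = ["negative", "neutral", "positive"]
print(confusion_matrix(test_labels, pred, labels=labels_order))
for text, true, p in zip(test_texts, test_labels, pred):
    if true != p and true == "neutral":
        print(f"neutral misread as {p}: {text!r}")
```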
How to run
Three commands to generate data, train models, and launch the dashboard
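The three commands might look like the sketch below. Only the install and dashboard-launch commands are taken from the project itself; the script names for the data-generation and training steps are hypothetical placeholders.

```shell
pip install -r requirements.txt
python generate_data.py    # 1. generate the 5,000 synthetic reviews (hypothetical name)
python train_models.py     # 2. preprocess, vectorize, train and compare models (hypothetical name)
streamlit run app.py       # 3. launch the dashboard
```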
The dataset contains 5,000 synthetic product reviews built from realistic vocabulary and sentence structure. Reviews span 5 product categories (Electronics, Clothing, Home & Kitchen, Books, Sports & Outdoors) with 3 sentiment classes: positive (45%), neutral (30%), and negative (25%). Each review includes text, rating (1-5), product category, character length, and word count.
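A single record can be pictured as below. The field names are illustrative rather than taken from the project's actual schema, but the fields themselves (text, rating, category, character length, word count) and the class proportions are from the description above:

```python
# Illustrative record; field names are assumptions, the fields are from the text.
review = {
    "text": "Great headphones, battery easily lasts a full day.",
    "rating": 5,                # 1-5 star rating
    "category": "Electronics",  # one of the 5 product categories
    "sentiment": "positive",    # positive / neutral / negative
}
# Derived length features stored alongside each review.
review["char_length"] = len(review["text"])
review["word_count"] = len(review["text"].split())

# Class sizes implied by the stated proportions: 45% / 30% / 25% of 5,000.
counts = {s: int(5000 * p) for s, p in
          [("positive", 0.45), ("neutral", 0.30), ("negative", 0.25)]}
print(review["word_count"], counts)
```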