A machine learning project that classifies Turkish emails as spam or ham using the K-Nearest Neighbors (KNN) algorithm. Built on a dataset of 825 emails, the model reaches 81.85% accuracy with K=3, the best of the K values tested.
Core features and capabilities of the project
Custom implementation of the K-Nearest Neighbors algorithm. Closeness between emails is measured with the Euclidean distance between their word-count vectors.
Text preprocessing with Turkish stopwords. Includes Turkish character support and language-specific optimizations.
Word frequency analysis with histograms and bar charts. Comparative visualization of ham and spam emails.
Comprehensive text processing with punctuation removal, lowercase conversion, and stopword filtering.
Interactive analysis with step-by-step explanations. Ideal format for learning and education.
Optimized code structure and efficient computation methods. Fast training and prediction times.
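As a rough illustration of the KNN feature listed above, the following is a minimal sketch of nearest-neighbor voting with Euclidean distance. The function names `euclidean` and `knn_predict` and the toy vectors are assumptions made for this example; they are not taken from the project's code.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors.

    `train` is a list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy word-count vectors (hypothetical data, not from trspam.csv)
train = [
    ([3, 0, 1], "spam"),
    ([4, 1, 0], "spam"),
    ([0, 2, 3], "ham"),
    ([1, 3, 2], "ham"),
    ([0, 3, 4], "ham"),
]
print(knn_predict(train, [3, 1, 0], k=3))
```

An odd K such as 3 avoids tied votes when there are only two classes.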
Project workflow and methods used
The Turkish Spam V01 dataset (trspam.csv), which contains 825 emails, is loaded. Each email is labeled as either "spam" or "ham".
Dataset: trspam.csv
Total Emails: 825
Format: Text, Classification
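The loading step might look roughly like the sketch below. It assumes a comma-separated file with a header row naming the `Text` and `Classification` columns. The two sample rows and the helper name `load_emails` are invented for illustration.

```python
import csv
import io

# Hypothetical two-row sample in the assumed trspam.csv layout
SAMPLE = io.StringIO(
    "Text,Classification\n"
    '"Tebrikler, hediye kazandınız!",spam\n'
    '"Toplantı yarın saat 10:00 olarak planlandı.",ham\n'
)

def load_emails(handle):
    """Read (text, label) pairs from a CSV file object."""
    reader = csv.DictReader(handle)
    return [
        (row["Text"], row["Classification"].strip().lower())
        for row in reader
    ]

emails = load_emails(SAMPLE)
print(len(emails), emails[0][1])
```

For the real file, the same function could be called with `open("trspam.csv", encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption and may need adjusting.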
The text is cleaned in three steps: punctuation is removed, all characters are converted to lowercase, and Turkish stopwords (conjunctions, prepositions, etc.) are filtered out.
• Punctuation removal
• Lowercase conversion
• Turkish stopwords filtering
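The three steps above can be sketched as follows. The stopword set here is a tiny stand-in for the contents of stopwords-tr.txt, and the function name `preprocess` is hypothetical. One Turkish-specific detail: Python's built-in `str.lower()` does not map "I" to "ı", so the dotted and dotless capitals are translated explicitly first.

```python
import string

# Tiny illustrative stopword set; the project reads its list from stopwords-tr.txt
STOPWORDS = {"ve", "ile", "bir", "bu", "için", "de", "da"}

# str.lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
# so the Turkish dotted/dotless capitals are translated before lowercasing.
TURKISH_LOWER = str.maketrans({"I": "ı", "İ": "i"})

# Replace ASCII punctuation with spaces so adjacent words stay separated
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def preprocess(text, stopwords=STOPWORDS):
    """Return the cleaned, lowercased, stopword-free tokens of `text`."""
    text = text.translate(PUNCT_TO_SPACE)
    text = text.translate(TURKISH_LOWER).lower()
    return [word for word in text.split() if word not in stopwords]

print(preprocess("Bu FIRSAT için BÜYÜK İNDİRİM, ve bir hediye!"))
```

`string.punctuation` covers ASCII punctuation only; typographic quotes or other Unicode symbols would need to be added to the table.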
Word frequencies in each email are calculated. Word counts are represented as vectors and prepared for the KNN algorithm.
wordCounts = get_count(text)
features = word_frequencies
vector_representation
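One plausible way to turn word counts into fixed-length vectors is shown below. The internals of the project's `get_count` are not shown in the text, so `build_vocabulary` and `to_vector` are hypothetical stand-ins that use a shared vocabulary to give every email a vector of the same length.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Sorted list of every distinct word across all emails."""
    return sorted({word for tokens in token_lists for word in tokens})

def to_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one email."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical preprocessed emails (lists of tokens)
docs = [
    ["indirim", "hediye", "indirim"],
    ["toplantı", "yarın"],
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Because every vector follows the same vocabulary order, distances between emails compare counts of the same word position by position.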
The data is split into 70% training and 30% testing. Several K values (3, 5, 7, 9, 11, 15, 19, 24, 30) are evaluated on the test set to find the best-performing one. Best result: K=3 with 81.85% accuracy.
train_test_split(test_size=0.30)
Best K = 3
Accuracy: 81.85%
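The split-and-search procedure can be sketched with the standard library alone. This is an assumption-laden illustration: `split_data` stands in for the `train_test_split` call cited above, `knn_1d` is a toy one-feature classifier, and the data is synthetic, so the scores it prints say nothing about the project's 81.85% result.

```python
import random
from collections import Counter

K_VALUES = [3, 5, 7, 9, 11, 15, 19, 24, 30]  # candidates listed in the text

def split_data(samples, test_size=0.30, seed=0):
    """Shuffle a copy of `samples` and return (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def knn_1d(train, x, k):
    """Toy classifier: majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def sweep_k(train, test, k_values, predict):
    """Return {k: accuracy on the test set} for every candidate k."""
    return {
        k: sum(predict(train, x, k) == y for x, y in test) / len(test)
        for k in k_values
    }

# Synthetic, hypothetical data: one numeric feature per sample
rng = random.Random(1)
samples = [(rng.gauss(0, 1), "ham") for _ in range(50)]
samples += [(rng.gauss(3, 1), "spam") for _ in range(50)]

train, test = split_data(samples)
scores = sweep_k(train, test, K_VALUES, knn_1d)
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Choosing K by test-set accuracy, as described in the text, tends to give an optimistic estimate; a separate validation split is a common refinement.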
Model performance and data analysis visualizations
Confusion matrix of model performance, showing true positives, true negatives, false positives, and false negatives for spam and ham classification.
Word count distribution in legitimate (ham) emails showing the frequency of email lengths after preprocessing.
Word count distribution in spam emails revealing typical patterns in spam message lengths.
The most frequent words (100 to 150 occurrences) in ham and spam emails, which help identify features that distinguish the two classes.
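The true/false positive and negative counts behind the model performance visualization can be computed as in the sketch below. It treats "spam" as the positive class, which is an assumption, and uses made-up labels; `confusion_counts` and `accuracy` are hypothetical helper names.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def accuracy(counts):
    """Share of predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total

# Made-up labels for illustration only
y_true = ["spam", "ham", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "ham", "ham", "spam"]
c = confusion_counts(y_true, y_pred)
print(c, accuracy(c))
```

Here a false positive is a legitimate email flagged as spam, and a false negative is a spam email that was let through as ham.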
Libraries and tools used in the project
git clone https://github.com/yusufarbc/turkish-spam-mail-detection.git
cd turkish-spam-mail-detection
pip install -r requirements.txt
python spam_detection.py
jupyter notebook spam_detection.ipynb
The Turkish Spam V01 dataset is used. This dataset consists of 825 Turkish emails, with each email labeled as either spam or ham (legitimate).
trspam.csv
├── Text: Email content
└── Classification: spam/ham
stopwords-tr.txt
└── Turkish stopwords list