Machine Learning Project

Turkish Email Spam Detection

A machine learning project that classifies Turkish emails as spam or ham using the K-Nearest Neighbors (KNN) algorithm. On a dataset of 825 emails split 70/30 into training and test sets, it achieves 81.85% accuracy with K=3, the best of the K values tested.

Features

Core features and capabilities of the project

KNN Algorithm

Custom implementation of the K-Nearest Neighbors algorithm. Compares emails using the Euclidean distance metric.

Turkish Language Support

Text preprocessing with Turkish stopwords. Includes Turkish character support and language-specific optimizations.

Visualization

Word frequency analysis with histograms and bar charts. Comparative visualization of ham and spam emails.

Data Preprocessing

Comprehensive text processing with punctuation removal, lowercase conversion, and stopword filtering.

Jupyter Notebook

Interactive analysis with step-by-step explanations. Ideal format for learning and education.

High Performance

Optimized code structure and efficient computation methods. Fast training and prediction times.

Total Emails: 825
Optimized K Value: 3
Train/Test Split: 70/30
Accuracy Rate: 81.85%

Methodology

Project workflow and methods used

Data Loading

The Turkish Spam V01 dataset (trspam.csv), containing 825 emails, is loaded. Each email is labeled as either "spam" or "ham".

Dataset: trspam.csv
Total Emails: 825
Format: Text, Classification
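
A minimal sketch of the loading step, assuming the CSV header uses the column names Text and Classification listed above. The helper name load_emails and the sample rows are hypothetical, and the file's encoding is not stated in the project description, so UTF-8 is an assumption.

```python
import csv
import io


def load_emails(handle):
    """Read (text, label) pairs from an open CSV handle.

    Assumes the header row names the columns "Text" and "Classification".
    """
    reader = csv.DictReader(handle)
    return [(row["Text"], row["Classification"].strip().lower()) for row in reader]


# Tiny in-memory stand-in for trspam.csv (made-up rows, for illustration only).
sample = io.StringIO(
    "Text,Classification\n"
    '"Tebrikler, ödül kazandınız!",spam\n'
    '"Toplantı yarın saat 10\'da.",ham\n'
)
emails = load_emails(sample)
print(len(emails), emails[0][1])
```

With the real file, the same function could be given a handle from open("trspam.csv", newline="", encoding="utf-8"), adjusting the encoding if the file turns out to use a different one.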

Preprocessing

Text is cleaned: punctuation is removed, the text is converted to lowercase, and Turkish stopwords (conjunctions, prepositions, etc.) are filtered out.

• Punctuation removal
• Lowercase conversion
• Turkish stopwords filtering
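
The three steps above can be sketched as follows. This is an illustration, not the project's code: the function names are hypothetical, and the small STOPWORDS set stands in for the contents of stopwords-tr.txt. One detail worth noting is that Python's built-in str.lower() does not follow Turkish casing for "I" and "İ", so the sketch maps those two letters explicitly first.

```python
import string

# Hypothetical stand-in for the contents of stopwords-tr.txt.
STOPWORDS = {"ve", "ile", "için", "bir", "bu"}


def turkish_lower(text):
    """Lowercase with Turkish casing rules for dotted and dotless I."""
    # str.lower() alone maps "I" to "i" and "İ" to "i" plus a combining dot,
    # so both letters are replaced before lowercasing the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


def preprocess(text, stopwords=STOPWORDS):
    """Remove punctuation, lowercase, and drop stopwords; returns a token list."""
    # string.punctuation covers ASCII punctuation only.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = turkish_lower(no_punct).split()
    return [tok for tok in tokens if tok not in stopwords]


print(preprocess("Bu ÜRÜN için BÜYÜK İNDİRİM ve hediye!"))
```

Typographic quotes and other non-ASCII punctuation would need to be added to the removal table if they occur in the emails.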

Feature Extraction

Word frequencies in each email are calculated. Word counts are represented as vectors and prepared for the KNN algorithm.

wordCounts = get_count(text)
features = word_frequencies
vector_representation
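
One way to turn per-email word counts into fixed-length vectors is sketched below. The behavior of the project's get_count function is not shown here, so this example uses its own hypothetical helpers (build_vocabulary, to_vector) and made-up token lists.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Sorted list of every distinct word across all emails."""
    return sorted({tok for tokens in token_lists for tok in tokens})


def to_vector(tokens, vocabulary):
    """Count vector for one email, aligned with the vocabulary order."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocabulary]


# Two already-preprocessed emails (illustrative tokens only).
docs = [["büyük", "indirim", "indirim"], ["toplantı", "yarın"]]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Every vector has one entry per vocabulary word, which is what lets a distance such as the Euclidean metric compare any two emails. In practice the vocabulary would typically be built from the training emails only.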

Model Training

The data is split into 70% training and 30% testing. Multiple K values are tested (3, 5, 7, 9, 11, 15, 19, 24, 30) to find the one with the highest test accuracy. Best result: K=3 with 81.85% accuracy.

train_test_split(test_size=0.30)
Best K = 3
Accuracy: 81.85%
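
The K search described above can be sketched in plain Python as follows. This is not the project's implementation: the helper names are hypothetical, a simple shuffled split stands in for scikit-learn's train_test_split so the example has no dependencies, and the toy data and shortened K list are chosen only to keep the example small.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Label x by majority vote among its k nearest training vectors."""
    # Rank training points by Euclidean distance to x.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test vectors whose predicted label matches the true one."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def split_70_30(X, y, seed=0):
    """Shuffle indices and hold out the last 30% for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) * 7 // 10
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


# Toy, well-separated data (not the project's dataset).
X = [[i, 0] for i in range(10)] + [[i, 10] for i in range(10)]
y = ["ham"] * 10 + ["spam"] * 10
tr_X, tr_y, te_X, te_y = split_70_30(X, y)

# Shortened K list because the toy training set has only 14 points.
scores = {k: accuracy(tr_X, tr_y, te_X, te_y, k) for k in (3, 5, 7)}
best_k = max(scores, key=scores.get)  # ties go to the smallest K listed first
print(scores, best_k)
```

Because KNN stores the training vectors and defers all computation to prediction time, trying another K only means repeating the vote with a different neighbor count.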

Visualizations

Model performance and data analysis visualizations

Confusion Matrix

Model performance visualization showing true positives, true negatives, false positives, and false negatives for spam and ham classification.
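
The four cells of such a matrix, and the accuracy derived from them, can be computed as in this small sketch. The function name and the label lists are hypothetical, and treating "spam" as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, tn, fp, fn), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # spam correctly flagged
        elif pred == positive:
            fp += 1  # ham wrongly flagged as spam
        elif truth == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1  # ham correctly passed
    return tp, tn, fp, fn


# Made-up labels for illustration only.
y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "ham"]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)
```

Accuracy is the share of correct predictions, (TP + TN) divided by the total, so the matrix shows which kind of error accounts for the remainder.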

Ham Email Distribution
Figure: Ham Word Count

Word count distribution in legitimate (ham) emails showing the frequency of email lengths after preprocessing.

Spam Email Distribution
Figure: Spam Word Count

Word count distribution in spam emails revealing typical patterns in spam message lengths.

Word Frequency Analysis
Figures: Ham Words, Spam Words

Most frequent words (100-150 occurrences) in ham and spam emails, helping identify distinguishing features.

Technologies

Libraries and tools used in the project

Python 3, NumPy, Pandas, Matplotlib, scikit-learn, Jupyter Notebook, CSV

Installation and Usage

1. Clone the Repository
git clone https://github.com/yusufarbc/turkish-spam-mail-detection.git
cd turkish-spam-mail-detection

2. Install Required Packages
pip install -r requirements.txt

3. Run the Python Script
python spam_detection.py

Or Use the Jupyter Notebook
jupyter notebook spam_detection.ipynb

Dataset

The Turkish Spam V01 dataset is used. This dataset consists of 825 Turkish emails, with each email labeled as either spam or ham (legitimate).

  • 825 total emails
  • 2 classes: spam and ham
  • Turkish language content
  • CSV format structured data

File Structure

trspam.csv
├── Text: Email content
└── Classification: spam/ham

stopwords-tr.txt
└── Turkish stopwords list