Turkish Spam Detection | Machine Learning Project

Features

Core features and capabilities of the project

KNN Algorithm

A custom implementation of the K-Nearest Neighbors (KNN) algorithm. Emails are compared using the Euclidean distance between their feature vectors.
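As a rough illustration of the idea, the sketch below classifies a vector by majority vote among its k nearest neighbors under Euclidean distance. It is not the project's code: the function names `euclidean` and `knn_predict` and the toy vectors are assumptions made for this example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    distances = sorted(
        (euclidean(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy word-count vectors (made-up data, not taken from trspam.csv)
train_vectors = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]]
train_labels = ["spam", "spam", "ham", "ham"]
prediction = knn_predict(train_vectors, train_labels, [2, 1, 0], k=3)
```

Sorting every distance is the simplest approach and is adequate for a dataset of a few hundred emails.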

Turkish Language Support

Text preprocessing with Turkish stopwords. Includes Turkish character support and language-specific optimizations.

Visualization

Word frequency analysis with histograms and bar charts. Comparative visualization of ham and spam emails.

Data Preprocessing

Comprehensive text processing with punctuation removal, lowercase conversion, and stopword filtering.

Jupyter Notebook

Interactive analysis with step-by-step explanations. Ideal format for learning and education.

High Performance

Optimized code structure and efficient computation methods. Fast training and prediction times.

Methodology

Project workflow and methods used

Data Loading

Turkish Spam V01 dataset (trspam.csv) containing 825 emails is loaded. Each email is labeled as either "spam" or "ham".

Dataset: trspam.csv
Total Emails: 825
Format: Text, Classification
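A minimal loading sketch, assuming the CSV header row uses exactly the column names Text and Classification listed above. The helper `load_emails` is hypothetical, and the two sample rows are invented so the example runs without the real file.

```python
import csv
import io
from collections import Counter

def load_emails(handle):
    """Read (text, label) pairs from a CSV with Text and Classification columns."""
    reader = csv.DictReader(handle)
    return [(row["Text"], row["Classification"].strip().lower()) for row in reader]

# In-memory stand-in for trspam.csv; both rows are invented for illustration.
sample = io.StringIO(
    "Text,Classification\n"
    '"Tebrikler, ödül kazandınız!",spam\n'
    '"Toplantı yarın saat onda.",ham\n'
)
emails = load_emails(sample)
label_counts = Counter(label for _, label in emails)
```

With the real file, the same function could be called on `open("trspam.csv", encoding="utf-8", newline="")`. The UTF-8 encoding is an assumption and may need adjusting.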

Preprocessing

The text is cleaned in three steps: punctuation is removed, all characters are converted to lowercase, and Turkish stopwords (conjunctions, prepositions, etc.) are filtered out.

• Punctuation removal
• Lowercase conversion
• Turkish stopwords filtering
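The three steps above could be sketched as follows. The stopword set here is a tiny invented sample standing in for stopwords-tr.txt. The Turkish-aware lowercasing (I to ı, İ to i) is an assumption about what "Turkish character support" may involve, not a confirmed detail of the project.

```python
import string

# Tiny illustrative stopword set; the project reads its list from stopwords-tr.txt.
STOPWORDS = {"ve", "ile", "bir", "bu", "için", "de", "da"}

def turkish_lower(text):
    """Lowercase with Turkish casing rules: I -> ı and İ -> i."""
    return text.replace("I", "ı").replace("İ", "i").lower()

def preprocess(text, stopwords=STOPWORDS):
    """Strip ASCII punctuation, lowercase, split on whitespace, drop stopwords."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = turkish_lower(no_punct).split()
    return [tok for tok in tokens if tok not in stopwords]

tokens = preprocess("İndirim ve KAMPANYA için HEMEN tıklayın!")
```

Python's built-in `str.lower()` maps "I" to "i" rather than "ı", which is why the two replacements run first. Note that `string.punctuation` covers ASCII punctuation only.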

Feature Extraction

Word frequencies in each email are calculated. Word counts are represented as vectors and prepared for the KNN algorithm.

wordCounts = get_count(text)
features = word_frequencies
vector_representation
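One way to turn word counts into fixed-length vectors is to build a shared vocabulary and count each vocabulary word per email. The sketch below assumes that approach. The helpers `build_vocabulary` and `to_vector` are hypothetical and are not the project's `get_count`.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Sorted list of every distinct word across all emails."""
    return sorted({tok for tokens in token_lists for tok in tokens})

def to_vector(tokens, vocabulary):
    """Count vector aligned with `vocabulary`; words outside it are ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Already-preprocessed toy emails (invented for illustration)
token_lists = [
    ["indirim", "kampanya", "indirim"],
    ["toplantı", "yarın"],
]
vocabulary = build_vocabulary(token_lists)
vectors = [to_vector(tokens, vocabulary) for tokens in token_lists]
```

Because every vector follows the same vocabulary order, any two emails can be compared position by position with a distance metric.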

Model Training

The data is split into 70% training and 30% testing. Several K values (3, 5, 7, 9, 11, 15, 19, 24, 30) are tested to find the best-performing one. Best result: K=3 with 81.85% accuracy.

train_test_split(test_size=0.30)
Best K = 3
Accuracy: 81.85%
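The split-and-sweep procedure might look roughly like the sketch below. It uses a standard-library stand-in for scikit-learn's `train_test_split` so that it runs without dependencies. The helper names, the seed, and the synthetic points are all assumptions, so the scores say nothing about the real dataset.

```python
import math
import random
from collections import Counter

def split_data(samples, test_size=0.30, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def predict(train, query, k):
    """Majority label among the k training vectors closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test samples whose predicted label matches the true one."""
    hits = sum(predict(train, vec, k) == label for vec, label in test)
    return hits / len(test)

# Synthetic, well-separated 2-D points standing in for word-count vectors
samples = [([i, 0], "spam") for i in range(10)] + [([i, 50], "ham") for i in range(10)]
train, test = split_data(samples)
scores = {k: accuracy(train, test, k) for k in (3, 5, 7)}
best_k = max(scores, key=scores.get)
```

On real data the candidate list would be the nine K values named above, and fixing the seed keeps the comparison between K values reproducible.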

Visualizations

Model performance and data analysis visualizations

Confusion Matrix

Model performance visualization showing true positives, true negatives, false positives, and false negatives for spam and ham classification.
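The four cells of the confusion matrix can be computed directly from true and predicted labels. The sketch below treats spam as the positive class, which is an assumption. The function name and the label lists are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return TP, TN, FP, FN counts, treating `positive` as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts

# Invented labels for illustration
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "ham", "spam"]
counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
```

Accuracy is the sum of the two correct cells (TP and TN) divided by the total number of predictions.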

Ham Email Distribution

Word count distribution in legitimate (ham) emails showing the frequency of email lengths after preprocessing.

Spam Email Distribution

Word count distribution in spam emails revealing typical patterns in spam message lengths.

Word Frequency Analysis

Most frequent words (100-150 occurrences) in ham and spam emails, helping identify distinguishing features.
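Selecting words by occurrence count could be done as sketched below, assuming the "100-150 occurrences" figure refers to a word's total count across one class of emails. The helper `words_in_range` and the toy tokens are hypothetical, and the range is scaled down to fit the tiny sample.

```python
from collections import Counter

def words_in_range(token_lists, low, high):
    """Words whose total count lies in [low, high], most frequent first."""
    totals = Counter(tok for tokens in token_lists for tok in tokens)
    return [(word, n) for word, n in totals.most_common() if low <= n <= high]

# Preprocessed toy spam emails (invented for illustration)
spam_tokens = [
    ["indirim", "kampanya", "indirim"],
    ["indirim", "kampanya", "tıkla"],
]
frequent = words_in_range(spam_tokens, low=2, high=3)
```

Running the same function separately on ham and spam tokens gives two lists that can be plotted side by side as bar charts.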

Technologies

Libraries and tools used in the project

Python 3, NumPy, Pandas, Matplotlib, scikit-learn, Jupyter Notebook, CSV

Installation and Usage

1. Clone the Repository

git clone https://github.com/yusufarbc/turkish-spam-mail-detection.git
cd turkish-spam-mail-detection

2. Install Required Packages

pip install -r requirements.txt

3. Run the Python Script

python spam_detection.py

Or Use Jupyter Notebook

jupyter notebook spam_detection.ipynb

Dataset

The Turkish Spam V01 dataset is used. This dataset consists of 825 Turkish emails, with each email labeled as either spam or ham (legitimate).

• 825 total emails
• 2 classes: spam and ham
• Turkish language content
• CSV format structured data

File Structure

trspam.csv
├── Text: Email content
└── Classification: spam/ham

stopwords-tr.txt
└── Turkish stopwords list
