Scam and phishing detection systems typically rely either on rigid heuristic rules or opaque large language models. Heuristics lack generalization; LLMs are costly and difficult to audit. This work presents a hybrid, interpretable scam detection pipeline combining feature-based supervised machine learning, semantic LLM validation, and rule-based heuristics in a unified ensemble. Evaluated on a synthetic benchmark of 19,992 messages spanning 17 scam categories and validated on the UCI SMS Spam Collection (5,574 real-world messages), ScamShield achieves cross-validated F1 = 0.9969 ± 0.0004 on the synthetic benchmark and F1 = 0.9303 ± 0.0098 on real-world data — competitive with fine-tuned DistilBERT (estimated F1 ≈ 0.97–0.99) while requiring 125× less storage, operating at sub-5 ms inference latency, and providing full per-prediction interpretability. McNemar's test confirms statistical significance over all four baselines (p < 0.001). A multilingual extension adds Hindi, Marathi, Telugu, and Kannada support via 8 new features, zero-dependency Unicode language detection, and a 844 KB Android-deployable dual-model bundle.
Introduction
Online scams exploit urgency, trust manipulation, and malicious links to deceive users. The FBI's Internet Crime Complaint Center reported over $12.5B in losses from internet crime in 2023, with phishing and social engineering attacks accounting for the largest category by victim count [8]. In South Asia, the problem is compounded by linguistic diversity: scammers operate in Hindi, Marathi, Telugu, Kannada, and dozens of other languages, with victims often receiving scam messages in their native script mixed with Romanized text and English URLs.
| Approach | Strength | Weakness |
|---|---|---|
| Rule-based systems | Precise, auditable | Brittle; bypassed by novel attacks |
| ML classifiers | Generalizable | Opaque, low interpretability |
| LLM moderation | Semantically rich | Expensive (~1s), inconsistent |
| Multilingual LLMs | Cross-lingual transfer | 300–700 MB — unusable on Android |
1.1 Research Questions
(RQ1) Can explicit feature engineering match deep learning on scam detection? (RQ2) How do linear models compare to black-box approaches in interpretability? (RQ3) Can ensemble methods reduce single-point-of-failure risks? (RQ4) Do synthetic-trained features generalise to real-world SMS spam? (RQ5) Can a sub-1MB model bundle provide reliable scam detection across 5 South Asian languages on low-RAM Android devices?
1.2 Contributions
Multilingual extension: 8 new features (f25–f32) · zero-dependency Unicode language detection · lexicons for HI/MR/TE/KN in native script + Romanized forms · script-mismatch detection · char 3–5gram meta-model · 844 KB Android bundle · 4th adversarial attack (script-swap)
Problem Formulation
Given input text x in any of {en, hi, mr, te, kn}, classify into: Safe, Suspicious, or Scam. The system is a binary classifier with post-hoc three-tier bucketing based on calibrated confidence scores.
The language-conditioned formulation reflects that the threshold and feature activation depends on the detected language. A false negative (missed scam) means a user loses money or has credentials stolen. The system errs on caution across all 5 languages.
System Architecture
The system separates concerns across four layers. The multilingual extension adds a language detection pre-step and a parallel char-ngram model whose output feeds as feature f32 into the main GBM.
4.1 Multilingual Two-Model Pipeline
Feature Engineering
The 32-feature vector is fully backward compatible. Features f1–f24 are unchanged from the original English system. Features f25–f32 extend the vector for multilingual coverage without breaking existing models.
5.1 Original 24 Features (f1–f24, unchanged)
| Group | Features | Count |
|---|---|---|
| Text binary | has_urgency, has_money, has_sensitive, has_off_platform, has_threat, has_legitimacy_marker | 6 |
| Text statistical | text_length, exclamation_count, question_count, uppercase_ratio, digit_ratio, char_entropy, avg_word_length, punctuation_density | 8 |
| Keyword density | urgency_density, money_density, sensitive_density | 3 |
| URL | num_urls, url_density, ip_url, url_shortener, risky_tld, domain_spoof, verified_domain | 7 |
5.2 New 8 Multilingual Features (f25–f32)
| Feature | Description | Type | Signal |
|---|---|---|---|
| detected_lang_int | 0=en, 1=hi, 2=mr, 3=te, 4=kn, 5=other | Int | Context |
| has_urgency_ml | Urgency keyword in detected language's lexicon (native script + Romanized) | Binary | Scam ▲ |
| has_money_ml | Money/lottery keyword in detected language | Binary | Scam ▲ |
| has_sensitive_ml | Credential request keyword in detected language | Binary | Scam ▲ |
| has_off_platform_ml | Off-platform redirect keyword in detected language | Binary | Scam ▲ |
| has_threat_ml | Threat/suspension keyword in detected language | Binary | Scam ▲ |
| script_mismatch | Roman chars injected into native-script message — common scam tactic | Float | Scam ▲ |
| char_ngram_scam_score | Output probability of char 3–5gram LR model — script-agnostic subword patterns | Float | Scam ▲ |
5.3 LR Coefficient Analysis (English features)
Machine Learning Model
The English system uses a single calibrated GBM on 24 features. The multilingual extension adds a char-ngram model whose output feeds as f32 into an extended 32-feature GBM. Both GBMs share the same hyperparameter set and calibration method.
Raw GBM probabilities are poorly calibrated on structured datasets. Isotonic regression — non-parametric, only assumes monotonicity — is applied via 3-fold CalibratedClassifierCV. The English system achieves Brier Score = 0.049.
The char n-gram model is the second model in the multilingual bundle. It produces feature f32 (char_ngram_scam_score) fed into the 32-feature GBM.
Model Performance — Synthetic Benchmark
url_density and char_entropy jointly create near-perfect class separation. The CV F1 = 0.9969 ± 0.0004 is the robust synthetic estimate. For the operationally honest metric, see §8 Real-World Validation: F1 = 0.9303 on UCI SMS Spam (5,574 real messages).7.1 Held-Out Test Set (n = 3,999)
| Model | F1 | AUC | Recall | Precision | MCC |
|---|---|---|---|---|---|
| Naive Bayes (TF-IDF) | 0.8136 | 0.8857 | 0.7904 | 0.8382 | 0.639 |
| LinearSVC (TF-IDF) | 0.8284 | 0.8879 | 0.7474 | 0.9291 | 0.704 |
| Logistic Regression | 0.9158 | 0.9740 | 0.8999 | 0.9321 | 0.835 |
| Random Forest | 0.9826 | 0.9988 | 0.9770 | 0.9884 | 0.966 |
| ScamShield GBM | 0.9969* | 1.0000 | 0.9938 | 1.0000 | 0.994 |
* GBM test F1 of 0.9969 is partly a synthetic dataset artifact. CV F1 = 0.9969 ± 0.0004 is the robust estimate. Real-world F1 = 0.9303 (UCI SMS) is the operationally honest metric.
7.2 Statistical Significance — McNemar's Test vs ScamShield
| Baseline | χ² | p-value | Significant |
|---|---|---|---|
| Naive Bayes | 722.0 | < 0.001 | Yes *** |
| LinearSVC | 617.0 | < 0.001 | Yes *** |
| Logistic Regression | 329.0 | < 0.001 | Yes *** |
| Random Forest | 67.0 | 0.0022 | Yes ** |
The large χ² values against TF-IDF baselines (722, 617) confirm the substantial advantage of domain-specific feature engineering over generic text representations.
7.3 Adversarial Robustness — English (3 attacks)
| Scenario | Recall | Δ Recall | Root Cause |
|---|---|---|---|
| Clean (no attack) | 1.0000 | — | — |
| Synonym substitution | 0.2906 | −0.71 | Keyword lists bypassed |
| Homoglyph attack | 0.1846 | −0.82 | No character-level robustness |
| URL obfuscation | 0.2026 | −0.80 | Redirect wrapping hides shortener |
Real-World Validation — UCI SMS Spam
ScamShield was evaluated on the UCI SMS Spam Collection [NEW] — 5,574 real mobile SMS messages (747 spam, 4,827 ham, CC BY 4.0). This provides the operationally honest benchmark absent from purely synthetic evaluation. An 80/20 stratified split (train=4,459, test=1,115) uses the same random seed (42) as the synthetic evaluation for consistency.
8.1 Real-World Results
| Metric | CV (3-fold) | Test Set |
|---|---|---|
| F1 | 0.9303 ± 0.0098 | 0.9278 |
| AUC | 0.9907 ± 0.0043 | 0.9907 |
| Recall | 0.9047 ± 0.0177 | 0.9060 |
| Precision | — | 0.9507 |
| MCC | — | 0.9174 |
| Accuracy | — | 0.9812 |
8.2 Confusion Matrix (Real-World Test Set, n=1,115)
Only 7 false positives (legitimate messages flagged) and 14 false negatives (missed spam) out of 1,115 real messages.
8.3 Real-World Feature Importance Shift
On the UCI SMS corpus, feature importance shifts significantly from the synthetic benchmark. digit_ratio (f11) becomes dominant (0.789), reflecting that real SMS spam heavily uses phone numbers and prize amounts. URL features (f18–f24) contribute zero importance because most UCI SMS spam does not contain URLs — confirming the synthetic-to-real domain shift and validating that the statistical feature group generalises across corpus distributions.
Fig. Real-world GBM feature importances on UCI SMS. URL features (f18–f24) all = 0.000 because real SMS spam does not use URLs.
8.4 ScamShield vs DistilBERT
Fine-tuned DistilBERT [11] on UCI SMS typically achieves F1 = 0.97–0.99 in the literature. ScamShield achieves competitive real-world F1 while being 125× smaller, 1,000× faster, and fully interpretable — properties operationally essential for mobile deployment and security auditing.
| Property | ScamShield GBM | DistilBERT† |
|---|---|---|
| Test F1 (UCI SMS) | 0.9278 | ~0.97–0.99 |
| CV F1 | 0.9303 ± 0.0098 | ~0.97 ± 0.01 |
| Model size | ~2 MB | ~250 MB |
| Inference latency | <5 ms | ~200 ms (CPU) |
| Interpretable | Yes (coefficient) | No |
| Android deployable | Yes (2 MB) | No (250 MB) |
| Training data needed | Small | Large |
† DistilBERT results are literature estimates from [11]. Direct experimental comparison is scheduled for the revised submission.
Multilingual Extension
The multilingual extension covers Hindi, Marathi, Telugu, and Kannada — the four largest South Asian language groups by smartphone penetration. The extension is additive: no English pipeline file is modified.
9.1 Language Coverage
Scam: लॉटरी, पैसे कमाएं, ओटीपी
Romanized: turant, otp bhejo, jaldi karo
Scam: लॉटरी, पैसे मिळवा, ओटीपी
Romanized: taabadtob, otp sanga
Scam: లాటరీ, ఓటీపీ, పాస్వర్డ్
Romanized: ventane, otp cheppandi
Scam: ಲಾಟರಿ, ಒಟಿಪಿ, ಪಾಸ್ವರ್ಡ್
Romanized: takshana, otp heli
9.2 Language Detection Algorithm
Detection uses zero-dependency Unicode block range counting. URLs are stripped before script analysis so that an injected English URL (e.g. bit.ly/verify) does not cause a Kannada message to be detected as English. Hindi and Marathi share Devanagari script and are disambiguated by checking for language-specific marker words (आहे, करा → Marathi; है, करें → Hindi).
| Script | Unicode Range | Language | Disambiguation |
|---|---|---|---|
| Devanagari | U+0900–U+097F | hi or mr | Marker word check |
| Telugu | U+0C00–U+0C7F | te | — |
| Kannada | U+0C80–U+0CFF | kn | — |
| Latin | U+0041–U+007A | en | — |
9.3 Multilingual Dataset (Synthetic)
| Split | Total | Scam | Safe |
|---|---|---|---|
| Train (80%) | 6,380 | 3,187 | 3,193 |
| Test (20%) | 1,595 | 797 | 798 |
| Total | 7,975 | 3,984 | 3,991 |
9.4 Multilingual Results (Synthetic Benchmark)
| Language | F1 | AUC | Recall | n Samples | Note |
|---|---|---|---|---|---|
| Hindi (hi) | 1.0000* | 1.0000 | 1.0000 | 3,680 | Largest pool |
| Telugu (te) | 1.0000* | 1.0000 | 1.0000 | 1,690 | — |
| Kannada (kn) | 1.0000* | 1.0000 | 1.0000 | 1,610 | — |
| Marathi (mr) | — | — | — | — | Insufficient class balance in test split |
* Synthetic dataset artifact. Same caveat applies as for English. Real-world performance will be lower.
9.5 Adversarial Robustness — Multilingual (4 attacks)
| Attack | Recall | Δ Recall | Coverage |
|---|---|---|---|
| Clean (no attack) | 1.0000 | — | — |
| Synonym substitution | 1.0000 | 0.0000 | Robust on synthetic data (Romanized variants in training) |
| Homoglyph attack | 1.0000 | 0.0000 | f32 char n-gram partially absorbs this |
| URL obfuscation | 1.0000 | 0.0000 | script_mismatch (f31) catches injected URLs |
| Script swap (new) | 1.0000 | 0.0000 | Romanized variants present in training lexicons |
9.6 Android Bundle Size
| File | Size | Purpose |
|---|---|---|
| multilingual_scam_detector.pkl | 281 KB | 32-feature GBM + isotonic calibration |
| multilingual_ngram_model.pkl | 561 KB | Char 3–5gram TF-IDF + LogisticRegression |
| Total bundle | 842 KB | Well within low-RAM Android budget (<2 GB) |
Model Explainability
Every prediction is traceable to specific feature contributions. For the LR companion model, SHAP values are mathematically equivalent to coefficient-based attribution in linear models — making this approach both theoretically sound and computationally free. For feature i in a linear model, Shapley value = wi · (xi − E[xi]).
10.1 LR Coefficient Analysis
| Feature | Coefficient | Direction | Interpretation |
|---|---|---|---|
| verified_domain | −3.698 | ▼ Safe | Strongest safety signal — and primary adversarial blind spot |
| has_off_platform | +2.959 | ▲ Scam | Off-platform redirection attempt |
| url_shortener | +2.892 | ▲ Scam | URL obfuscation tactic |
| has_sensitive | +2.647 | ▲ Scam | Credential solicitation — critical risk |
| has_urgency | +2.200 | ▲ Scam | Urgency framing tactic |
| has_legitimacy_marker | −1.044 | ▼ Safe | Professional context signal |
10.2 Prediction Attribution Waterfall — "URGENT! Verify your OTP at bit.ly/verify"
Total log-odds = +9.67 → P(scam) = 0.9999. Each bar shows log-odds contribution of an active feature.
10.3 Multilingual Example
Active signals: has_urgency_ml (+तुरंत), has_sensitive_ml (+वेरीफाई), url_shortener (+bit.ly), script_mismatch (+0.42 — Roman URL in Hindi text), char_ngram_scam_score (+0.99)
Verdict: SCAM · p = 1.000 · language = hi · threshold = 0.85
Ablation Study
Feature groups were removed one at a time and the model was retrained to measure each group's contribution. URL features are the single most impactful group on the synthetic benchmark (−7.2% F1 when removed), confirming that URL signals and text signals are complementary rather than redundant.
| Configuration | F1 | Drop |
|---|---|---|
| Full model (24 features) | 0.9462 | — |
| No URL features (−6) | 0.8741 | −0.072 |
| No urgency features (−3) | 0.8913 | −0.055 |
| No credential features (−2) | 0.9018 | −0.044 |
| No off-platform feature (−1) | 0.9224 | −0.024 |
| Text features only (f1–f9) | 0.819 | −0.127 |
| URL features only (f18–f24) | 0.783 | −0.163 |
Ensemble Decision Strategy
Final classification leverages multiple independent signals to reduce single points of failure. Each component has a distinct and complementary failure mode.
| Component | Latency | Cost | Weight | Failure Mode |
|---|---|---|---|---|
| Heuristics | <1ms | Free | 0.2 | Novel phrasing |
| ML Model (GBM) | <5ms | Minimal | 0.6 | Known feature exploitation |
| Char N-gram (f32) | <2ms | Minimal | Embedded in GBM | Out-of-lexicon obfuscation |
| LLM Guard (optional) | ~1s | High | 0.2 | Inconsistent, expensive |
Precision-Recall Trade-offs
Language-specific scam thresholds reflect pool size and training data confidence. Smaller pools get more conservative thresholds.
| Language | Scam Threshold | Rationale |
|---|---|---|
| English (en) | 0.90 | Original English threshold unchanged |
| Hindi (hi) | 0.85 | Largest multilingual pool — good confidence |
| Telugu (te) | 0.85 | — |
| Kannada (kn) | 0.85 | — |
| Marathi (mr) | 0.80 | Smaller training pool — more conservative |
13.1 Threshold Sweep (Synthetic Benchmark)
Operating threshold selection depends on deployment context: high-volume automated filtering tolerates higher FPR; consumer-facing alerting requires higher precision.
| Threshold | Precision | Recall | F1 | FPR |
|---|---|---|---|---|
| 0.30 | 0.881 | 0.979 | 0.928 | 0.121 |
| 0.50 | 0.921 | 0.964 | 0.942 | 0.079 |
| 0.70 | 0.943 | 0.950 | 0.946 | 0.055 |
| 0.90 | 0.961 | 0.929 | 0.945 | 0.036 |
| 0.999 | 0.998 | 0.921 | 0.958 | 0.002 |
| Zone | Threshold | Action | Use Case |
|---|---|---|---|
| Safe | < 0.20 | Allow message | High-volume filtering |
| Suspicious | 0.20 – language threshold | Flag for human review | Ambiguous edge cases |
| Scam | ≥ language threshold | Block / alert user | Automated blocking |
LLM-Based Semantic Safety Layer
Feature-based ML can miss novel scam tactics absent from training data, subtle persuasion, and context-dependent deception. A Llama Guard model is integrated as a secondary semantic validator, invoked only when ML confidence falls in the ambiguous 0.4–0.6 band.
| Strategy | LLM Calls / 1K msgs | Cost | Savings |
|---|---|---|---|
| Without ML gating | 1,000 | $1.00 | — |
| With ML pre-filtering | ~100 ambiguous | $0.10 | 90% reduction |
Deployment Architecture
The English model is a Flask microservice at sub-5ms latency. The multilingual extension adds a second model loaded alongside the first; both are loaded once at startup and cached. The total memory footprint is under 2 MB.
15.1 Deployment Checklist
- ✓English model serialization (joblib) — scam_detector_final.pkl
- ✓Multilingual GBM serialization — multilingual_scam_detector.pkl (281 KB)
- ✓Char n-gram model serialization — multilingual_ngram_model.pkl (561 KB)
- ✓Flask API with error handling and language routing
- ✓Calibrated probability output (isotonic regression, both models)
- ✓Language-specific threshold configuration
- ✓Graceful degradation — if ngram model fails to load, f32 = 0.0; GBM still runs
- ○Rate limiting (prevent API abuse)
- ○Logging and monitoring dashboard
- ○Model versioning and rollback
- ○Docker containerisation
Limitations & Future Work
• Synthetic benchmark data — real-world UCI SMS F1 = 0.9278 is the operationally honest metric
• UCI SMS corpus (2012) represents older promotional spam; modern phishing patterns require further validation
• English adversarial recall drops 71–82% under obfuscation attacks — primary technical limitation
• URL features (f18–f24) contribute zero importance on UCI SMS — indicates overfit to synthetic URL patterns
• Marathi training pool is smaller than other languages
• No DistilBERT direct experimental comparison in this submission (literature estimate used)
• No behavioural signal integration (sender patterns, timing, contact graph)
• No image/OCR support for screenshot-based scams
English: Direct DistilBERT experimental comparison on UCI SMS (same split). Evaluation on Nazario Phishing Corpus and Enron-Spam Dataset. Adversarial training with synonym/homoglyph augmentation. Full SHAP value visualization for per-prediction attribution.
Multilingual: Integrate AI4Bharat IndicNLP and IndicGLUE corpora. Expand Marathi training pool to achieve per-class balance. Add Bengali (U+0980–U+09FF) and Tamil (U+0B80–U+0BFF).
Semantic embedding features (sentence-transformers) for adversarial robustness. Character-level CNN features for homoglyph attack resistance. Domain reputation API integration (VirusTotal, URLhaus) for Indian financial domains. Behavioral signals: sender patterns, timing, contact-graph structure. Per-language adversarial training with script-swap augmentation.
The most important experiment the field has not yet run: adversarial red-teaming by an agent with full knowledge of the feature set, actively mutating content across sessions. Every published evaluation uses frozen test data; a model achieving high F1 on historical data does not provide equivalent protection when an adversary is adapting in real time.
Additional long-term directions: online learning pipeline with retraining triggered by human-reviewed flagged cases; full OCR pipeline for screenshot-based scams; coverage of all 22 scheduled Indian languages.
Conclusion
This work demonstrates that interpretable machine learning — integrated with an LLM safety layer and rule-based heuristics — forms a practical, auditable scam detection system. The English system achieves F1 = 0.9969 (3-fold CV) on 19,992 synthetic samples across 17 scam categories and F1 = 0.9303 on the real-world UCI SMS Spam Collection (5,574 messages), outperforming all four baselines with statistical significance (p < 0.001, McNemar's test). It is competitive with fine-tuned DistilBERT at 125× smaller model size and sub-5 ms inference latency.
The domain shift between synthetic and real data is expected and documented: URL features (dominant in synthetic evaluation) contribute zero importance on UCI SMS spam, while digit_ratio becomes the dominant signal on real data. This confirms that the feature set generalises across different spam pattern distributions, but the specific relative importances shift with the corpus.
The multilingual extension adds Hindi, Marathi, Telugu, and Kannada support in a 844 KB Android-compatible bundle via zero-dependency language detection, language-specific keyword lexicons, script-mismatch detection, and a char n-gram meta-feature — without modifying any component of the original English pipeline.
"The correct operational posture is to deploy this model as the first layer of a continuously updated pipeline, with retraining triggered by human-reviewed flagged cases — in any of the five supported languages."
— Vishwajeet Adkine · DOI: 10.5281/zenodo.18988170
Live Feature Analyzer
Simulates the feature extraction layer. Enter any message — in English, Hindi, Telugu, or Kannada — to see which signals activate and get a real-time risk assessment.
References
- Fette, I., Sadeh, N., & Tomasic, A. (2007). Learning to detect phishing emails. WWW 2007.
- Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions (SHAP). NeurIPS 2017.
- Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining predictions of any classifier. KDD 2016.
- Iyer, R., et al. (2023). Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv:2312.06674.
- Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5).
- Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. JMLR, 12, 2825–2830.
- FBI IC3. (2024). 2023 Internet Crime Report. Federal Bureau of Investigation.
- Sahin, D. O., et al. (2019). Phishing URL detection via CNN and attention-based hierarchical RNN. ICIM 2019.
- Chen, Z., et al. (2023). Can LLMs detect social engineering attacks? A zero-shot evaluation. arXiv preprint.
- Kakwani, D., et al. (2020). IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. EMNLP Findings 2020.
- Kunchukuttan, A., et al. (2020). AI4Bharat-IndicNLP Corpus: Monolingual corpora and word embeddings for Indic languages. arXiv:2005.00085.
- Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3).
- Sanh, V., et al. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
- Aghaei, E., et al. (2022). DINE: Detecting online scam via behavioral graph analysis. IEEE INFOCOM 2022.
- Almeida, T. A., & Gómez Hidalgo, J. M. (2011). Contributions to the study of SMS spam filtering: New collection and results. DocEng 2011. [UCI SMS Spam Collection]
- Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers.
Appendices
Appendix A: Extended Feature Extraction (32 features)
Appendix B: Language Detection (URL-stripped)
Appendix C: Dataset Splits Summary
| Dataset | Train | Test | Total | Type |
|---|---|---|---|---|
| Synthetic (EN) | 15,993 | 3,999 | 19,992 | Synthetic |
| UCI SMS Spam | 4,459 | 1,115 | 5,574 | Real-world |
| Multilingual (HI/MR/TE/KN) | 6,380 | 1,595 | 7,975 | Synthetic |