Applied ML & Security · MBZUAI Research · Zenodo DOI: 10.5281/zenodo.18988170

An Interpretable Multi-Signal Scam Detection
System — English & South Asian Languages

A hybrid, auditable pipeline combining feature-based supervised learning, semantic LLM validation, and rule-based heuristics — extended to detect scams in Hindi, Marathi, Telugu, and Kannada via a zero-dependency language detector, 8 multilingual features, and an 844 KB Android-compatible model bundle. Validated on both a 19,992-sample synthetic benchmark and the real-world UCI SMS Spam Collection (5,574 messages).

Author: Vishwajeet Adkine
Date: February 2026
Model: Calibrated GBM + Char-Ngram
Dataset (EN): 19,992 samples
UCI SMS (Real): 5,574 messages
Dataset (ML): 7,975 samples
Features: 32 (24 + 8 multilingual)
Bundle Size: 844 KB (Android)
🇬🇧 English
🇮🇳 हिन्दी Hindi
🇮🇳 मराठी Marathi
🇮🇳 తెలుగు Telugu
🇮🇳 ಕನ್ನಡ Kannada
Key Metrics — English (Synthetic CV · Real-World UCI SMS)
• 0.9969 — CV F1 (Synthetic) · 3-fold cross-validated
• 0.9303 — CV F1 (Real-World) · UCI SMS, 5,574 msgs
• 0.9812 — ROC-AUC · area under the ROC curve
• p < .001 — McNemar · vs all 4 baselines
• 5 — Languages · EN · HI · MR · TE · KN
• 844 KB — Android Bundle · low-RAM deployable
Abstract

Scam and phishing detection systems typically rely either on rigid heuristic rules or opaque large language models. Heuristics lack generalization; LLMs are costly and difficult to audit. This work presents a hybrid, interpretable scam detection pipeline combining feature-based supervised machine learning, semantic LLM validation, and rule-based heuristics in a unified ensemble. Evaluated on a synthetic benchmark of 19,992 messages spanning 17 scam categories and validated on the UCI SMS Spam Collection (5,574 real-world messages), ScamShield achieves cross-validated F1 = 0.9969 ± 0.0004 on the synthetic benchmark and F1 = 0.9303 ± 0.0098 on real-world data — competitive with fine-tuned DistilBERT (estimated F1 ≈ 0.97–0.99) while requiring 125× less storage, operating at sub-5 ms inference latency, and providing full per-prediction interpretability. McNemar's test confirms statistical significance over all four baselines (p < 0.001). A multilingual extension adds Hindi, Marathi, Telugu, and Kannada support via 8 new features, zero-dependency Unicode language detection, and an 844 KB Android-deployable dual-model bundle.

F1 = 0.9969 (CV Synth) 19,992 samples, 17 categories
F1 = 0.9303 (UCI SMS) 5,574 real-world messages
5 languages EN · HI · MR · TE · KN
125× smaller than DistilBERT · sub-5 ms
Keywords
Scam Detection · Interpretable ML · LLM Safety · Ensemble Systems · Feature Engineering · Gradient Boosting · Multilingual NLP · South Asian Languages · Adversarial Robustness · McNemar's Test · Android Deployment · Char N-gram · UCI SMS Spam
§ 01

Introduction

Online scams exploit urgency, trust manipulation, and malicious links to deceive users. The FBI's Internet Crime Complaint Center reported over $12.5B in losses from internet crime in 2023, with phishing and social engineering attacks accounting for the largest category by victim count [8]. In South Asia, the problem is compounded by linguistic diversity: scammers operate in Hindi, Marathi, Telugu, Kannada, and dozens of other languages, with victims often receiving scam messages in their native script mixed with Romanized text and English URLs.

| Approach | Strength | Weakness |
| --- | --- | --- |
| Rule-based systems | Precise, auditable | Brittle; bypassed by novel attacks |
| ML classifiers | Generalizable | Opaque, low interpretability |
| LLM moderation | Semantically rich | Expensive (~1 s), inconsistent |
| Multilingual LLMs | Cross-lingual transfer | 300–700 MB — unusable on Android |

1.1 Research Questions

(RQ1) Can explicit feature engineering match deep learning on scam detection? (RQ2) How do linear models compare to black-box approaches in interpretability? (RQ3) Can ensemble methods reduce single-point-of-failure risks? (RQ4) Do synthetic-trained features generalise to real-world SMS spam? (RQ5) Can a sub-1MB model bundle provide reliable scam detection across 5 South Asian languages on low-RAM Android devices?

1.2 Contributions

English system: 24-feature GBM ensemble · calibrated probabilities · McNemar significance vs 4 baselines · real-world validation on UCI SMS Spam Collection (5,574 messages) · adversarial evaluation (3 attack types) · DistilBERT comparison

Multilingual extension: 8 new features (f25–f32) · zero-dependency Unicode language detection · lexicons for HI/MR/TE/KN in native script + Romanized forms · script-mismatch detection · char 3–5gram meta-model · 844 KB Android bundle · 4th adversarial attack (script-swap)
§ 03

Problem Formulation

Given input text x in any of {en, hi, mr, te, kn}, classify into: Safe, Suspicious, or Scam. The system is a binary classifier with post-hoc three-tier bucketing based on calibrated confidence scores.

f : 𝒳 → [0, 1]    where f(x) = P(scam | x, lang(x))

The language-conditioned formulation reflects that both the decision threshold and feature activation depend on the detected language. A false negative (missed scam) means a user loses money or has credentials stolen, so the system errs on the side of caution across all 5 languages.

Safety Priority: A missed scam costs money or credentials. A false positive is a minor inconvenience. The system is tuned toward high recall across all supported languages, with language-specific threshold calibration.
§ 04

System Architecture

The system separates concerns across four layers. The multilingual extension adds a language detection pre-step and a parallel char-ngram model whose output feeds as feature f32 into the main GBM.

• Pre-step — Language Detect: Unicode block analysis
• Layer 1 — Frontend: Web / Mobile / Android
• Layer 2 — Node.js Backend: validation · rate limiting
• Layer 3 (Core) — Python ML Service: 32-feature extraction · GBM
• Layer 4 (Optional) — LLM Service: semantic analysis

4.1 Multilingual Two-Model Pipeline

• Step 1 — Language Detection (<0.1 ms): Unicode block counting on URL-stripped text. Hindi vs Marathi disambiguated via marker words. Zero dependencies.
• Step 2 — Char N-gram Model, f32 (<2 ms): TF-IDF char_wb 3–5grams + LogisticRegression. Script-agnostic — works on any Unicode. 561 KB serialized.
• Step 3 — 32-Feature Extraction (<1 ms): original 24 English features (f1–f24) + 8 multilingual features (f25–f32), including the language int, 5 keyword signals, script mismatch, and f32.
• Step 4 — GBM Ensemble (<3 ms): CalibratedClassifierCV(GBM, isotonic). 281 KB. Language-specific probability threshold applied post-inference.
§ 05

Feature Engineering

The 32-feature vector is fully backward compatible. Features f1–f24 are unchanged from the original English system. Features f25–f32 extend the vector for multilingual coverage without breaking existing models.

5.1 Original 24 Features (f1–f24, unchanged)

| Group | Features | Count |
| --- | --- | --- |
| Text binary | has_urgency, has_money, has_sensitive, has_off_platform, has_threat, has_legitimacy_marker | 6 |
| Text statistical | text_length, exclamation_count, question_count, uppercase_ratio, digit_ratio, char_entropy, avg_word_length, punctuation_density | 8 |
| Keyword density | urgency_density, money_density, sensitive_density | 3 |
| URL | num_urls, url_density, ip_url, url_shortener, risky_tld, domain_spoof, verified_domain | 7 |
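Among the statistical features, char_entropy is the Shannon entropy of the message's character distribution — low for repetitive text, high for random-looking strings. A minimal sketch (the function name and exact normalisation are illustrative; the production extractor may differ):

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character over the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For example, a message of one repeated character has entropy 0.0, while obfuscated tokens and random URL paths push the value up.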

5.2 New 8 Multilingual Features (f25–f32)

| Feature | Description | Type | Signal |
| --- | --- | --- | --- |
| detected_lang_int | 0=en, 1=hi, 2=mr, 3=te, 4=kn, 5=other | Int | Context |
| has_urgency_ml | Urgency keyword in detected language's lexicon (native script + Romanized) | Binary | Scam ▲ |
| has_money_ml | Money/lottery keyword in detected language | Binary | Scam ▲ |
| has_sensitive_ml | Credential request keyword in detected language | Binary | Scam ▲ |
| has_off_platform_ml | Off-platform redirect keyword in detected language | Binary | Scam ▲ |
| has_threat_ml | Threat/suspension keyword in detected language | Binary | Scam ▲ |
| script_mismatch | Roman chars injected into native-script message — common scam tactic | Float | Scam ▲ |
| char_ngram_scam_score | Output probability of char 3–5gram LR model — script-agnostic subword patterns | Float | Scam ▲ |
script_mismatch explained: Scammers inject English URLs and brand names into native-script messages (e.g. "आपका खाता suspend हो गया। bit.ly/verify-now पर जाएं"). The script-mismatch feature quantifies the ratio of Roman chars to total script chars after URL stripping, firing on this pattern without requiring any keyword match.
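The described ratio can be sketched as follows. This is a hedged approximation of the feature, not the production implementation: the URL-stripping regex and the exact Unicode ranges are assumptions consistent with the script table in §9.2.

```python
import re

# Devanagari, Telugu, and Kannada blocks (per the Unicode ranges in §9.2)
_NATIVE = r'[\u0900-\u097F\u0C00-\u0C7F\u0C80-\u0CFF]'

def script_mismatch_score(text: str) -> float:
    """Ratio of Roman letters to all script letters, after URL stripping.

    Fires when English words/brand names are injected into a
    native-script message; returns 0.0 for pure-Latin or empty text.
    """
    stripped = re.sub(r'https?://\S+|www\.\S+|\b[a-z0-9.-]+\.[a-z]{2,6}\S*',
                      ' ', text)
    roman = len(re.findall(r'[A-Za-z]', stripped))
    native = len(re.findall(_NATIVE, stripped))
    if native == 0:          # no native script — feature does not apply
        return 0.0
    return roman / (roman + native)
```

On the Hindi example above, the injected "suspend" yields a clearly nonzero score even though the bit.ly URL itself is stripped before counting.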

5.3 LR Coefficient Analysis (English features)

§ 06

Machine Learning Model

The English system uses a single calibrated GBM on 24 features. The multilingual extension adds a char-ngram model whose output feeds as f32 into an extended 32-feature GBM. Both GBMs share the same hyperparameter set and calibration method.

| Model | F1 | ±Std | Recall | AUC | ±Std | Result |
| --- | --- | --- | --- | --- | --- | --- |
| Naive Bayes (TF-IDF) | 0.8262 | ±0.0014 | 0.7759 | 0.8856 | ±0.0014 | |
| LinearSVC (TF-IDF) | 0.8278 | ±0.0015 | 0.7470 | 0.8879 | ±0.0015 | |
| Logistic Regression | 0.9140 | ±0.0066 | 0.8957 | 0.9739 | ±0.0066 | |
| Random Forest | 0.9778 | ±0.0075 | 0.9651 | 0.9985 | ±0.0075 | |
| ScamShield GBM | 0.9969 | ±0.0004 | 0.9938 | 1.0000 | ±0.0000 | ✓ Best |
Python

# Both English and multilingual GBM use identical hyperparameters
base = GradientBoostingClassifier(
    n_estimators=150,
    max_depth=4,
    learning_rate=0.05,
    min_samples_leaf=4,
    subsample=0.8,
    random_state=42,
)
model = CalibratedClassifierCV(base, method="isotonic", cv=3)
model.fit(X_train, y_train)

Raw GBM probabilities are poorly calibrated on structured datasets. Isotonic regression — non-parametric, only assumes monotonicity — is applied via 3-fold CalibratedClassifierCV. The English system achieves Brier Score = 0.049.

Why isotonic over sigmoid? Sigmoid (Platt scaling) assumes a logistic transformation. Isotonic regression handles the step-function behaviour of tree ensembles without distributional assumptions — critical when probabilities drive threshold-based verdicts.

The char n-gram model is the second model in the multilingual bundle. It produces feature f32 (char_ngram_scam_score) fed into the 32-feature GBM.

Python

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        analyzer="char_wb",    # word-boundary char n-grams
        ngram_range=(3, 5),    # 3–5 character grams
        max_features=8000,     # keeps model <600 KB
        sublinear_tf=True,
    )),
    ("scaler", StandardScaler(with_mean=False)),
    ("clf", LogisticRegression(
        C=1.0,
        solver="liblinear",
        class_weight="balanced",
    )),
])
Why char n-grams for multilingual? Character n-grams require no tokenizer, no vocabulary, and no language-specific preprocessing. They work identically on Devanagari (हिंदी), Telugu (తెలుగు), Kannada (ಕನ್ನಡ), and Latin scripts — making them the only viable sub-MB approach for on-device multilingual scam detection.
§ 07

Model Performance — Synthetic Benchmark

Synthetic data caveat: The F1 = 0.9969 (CV) and near-perfect test scores reflect synthetic dataset structure. url_density and char_entropy jointly create near-perfect class separation. The CV F1 = 0.9969 ± 0.0004 is the robust synthetic estimate. For the operationally honest metric, see §8 Real-World Validation: F1 = 0.9303 on UCI SMS Spam (5,574 real messages).

7.1 Held-Out Test Set (n = 3,999)

| Model | F1 | AUC | Recall | Precision | MCC |
| --- | --- | --- | --- | --- | --- |
| Naive Bayes (TF-IDF) | 0.8136 | 0.8857 | 0.7904 | 0.8382 | 0.639 |
| LinearSVC (TF-IDF) | 0.8284 | 0.8879 | 0.7474 | 0.9291 | 0.704 |
| Logistic Regression | 0.9158 | 0.9740 | 0.8999 | 0.9321 | 0.835 |
| Random Forest | 0.9826 | 0.9988 | 0.9770 | 0.9884 | 0.966 |
| ScamShield GBM | 0.9969* | 1.0000 | 0.9938 | 1.0000 | 0.994 |

* GBM test F1 of 0.9969 is partly a synthetic dataset artifact. CV F1 = 0.9969 ± 0.0004 is the robust estimate. Real-world F1 = 0.9303 (UCI SMS) is the operationally honest metric.

7.2 Statistical Significance — McNemar's Test vs ScamShield

| Baseline | χ² | p-value | Significant |
| --- | --- | --- | --- |
| Naive Bayes | 722.0 | < 0.001 | Yes *** |
| LinearSVC | 617.0 | < 0.001 | Yes *** |
| Logistic Regression | 329.0 | < 0.001 | Yes *** |
| Random Forest | 67.0 | 0.0022 | Yes ** |

The large χ² values against TF-IDF baselines (722, 617) confirm the substantial advantage of domain-specific feature engineering over generic text representations.
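McNemar's test uses only the discordant cells of the paired 2×2 table — messages one classifier gets right and the other gets wrong. A minimal sketch of the statistic (the counts in the assertions are hypothetical; compare against a χ² distribution with 1 degree of freedom):

```python
def mcnemar_chi2(b: int, c: int, correction: bool = True) -> float:
    """McNemar's chi-square statistic from discordant pair counts.

    b = cases classifier A got right and B got wrong; c = the reverse.
    Yates' continuity correction is applied by default.
    """
    if b + c == 0:
        return 0.0
    num = (abs(b - c) - 1) ** 2 if correction else (b - c) ** 2
    return num / (b + c)
```

Large imbalances between b and c (as against the TF-IDF baselines here) drive the statistic far above the 0.001-significance cutoff of 10.83.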

7.3 Adversarial Robustness — English (3 attacks)

| Scenario | Recall | Δ Recall | Root Cause |
| --- | --- | --- | --- |
| Clean (no attack) | 1.0000 | — | — |
| Synonym substitution | 0.2906 | −0.71 | Keyword lists bypassed |
| Homoglyph attack | 0.1846 | −0.82 | No character-level robustness |
| URL obfuscation | 0.2026 | −0.80 | Redirect wrapping hides shortener |
Primary limitation: Recall drops 71–82% under all three English adversarial attacks. This motivates the char n-gram meta-model in the multilingual extension, which provides a keyword-agnostic fallback signal (f32).
§ 08

Real-World Validation — UCI SMS Spam

ScamShield was evaluated on the UCI SMS Spam Collection — 5,574 real mobile SMS messages (747 spam, 4,827 ham; CC BY 4.0). This provides the operationally honest benchmark absent from purely synthetic evaluation. An 80/20 stratified split (train = 4,459, test = 1,115) uses the same random seed (42) as the synthetic evaluation for consistency.

Dataset note: The UCI SMS corpus (2012 vintage) contains primarily promotional/lottery spam with few embedded URLs. URL-based features (f18–f24) fire less frequently on real spam than on synthetic spam, which partially explains the performance gap between the two benchmarks. This gap is expected, honest, and documented — it reflects a genuine synthetic-to-real domain shift, not a model deficiency.

8.1 Real-World Results

| Metric | CV (3-fold) | Test Set |
| --- | --- | --- |
| F1 | 0.9303 ± 0.0098 | 0.9278 |
| AUC | 0.9907 ± 0.0043 | 0.9907 |
| Recall | 0.9047 ± 0.0177 | 0.9060 |
| Precision | — | 0.9507 |
| MCC | — | 0.9174 |
| Accuracy | — | 0.9812 |

8.2 Confusion Matrix (Real-World Test Set, n=1,115)

| | Pred: Safe | Pred: Scam |
| --- | --- | --- |
| True: Safe | 959 (True Neg) | 7 (False Pos) |
| True: Scam | 14 (False Neg) | 135 (True Pos) |

Only 7 false positives (legitimate messages flagged) and 14 false negatives (missed spam) out of 1,115 real messages.

8.3 Real-World Feature Importance Shift

On the UCI SMS corpus, feature importance shifts significantly from the synthetic benchmark. digit_ratio (f11) becomes dominant (0.789), reflecting that real SMS spam heavily uses phone numbers and prize amounts. URL features (f18–f24) contribute zero importance because most UCI SMS spam does not contain URLs — confirming the synthetic-to-real domain shift and validating that the statistical feature group generalises across corpus distributions.

Fig. Real-world GBM feature importances on UCI SMS. URL features (f18–f24) all = 0.000 because real SMS spam does not use URLs.

8.4 ScamShield vs DistilBERT

Fine-tuned DistilBERT [11] on UCI SMS typically achieves F1 = 0.97–0.99 in the literature. ScamShield achieves competitive real-world F1 while being 125× smaller, 1,000× faster, and fully interpretable — properties operationally essential for mobile deployment and security auditing.

| Property | ScamShield GBM | DistilBERT† |
| --- | --- | --- |
| Test F1 (UCI SMS) | 0.9278 | ~0.97–0.99 |
| CV F1 | 0.9303 ± 0.0098 | ~0.97 ± 0.01 |
| Model size | ~2 MB | ~250 MB |
| Inference latency | <5 ms | ~200 ms (CPU) |
| Interpretable | Yes (coefficients) | No |
| Android deployable | Yes (2 MB) | No (250 MB) |
| Training data needed | Small | Large |

† DistilBERT results are literature estimates from [11]. Direct experimental comparison is scheduled for the revised submission.

§ 09

Multilingual Extension

The multilingual extension covers Hindi, Marathi, Telugu, and Kannada — the four largest South Asian language groups by smartphone penetration. The extension is additive: no English pipeline file is modified.

9.1 Language Coverage

• 🇮🇳 Hindi — Devanagari · U+0900–U+097F
  Urgency: तुरंत, अभी, खाता बंद होगा · Scam: लॉटरी, पैसे कमाएं, ओटीपी · Romanized: turant, otp bhejo, jaldi karo
• 🇮🇳 Marathi — Devanagari · U+0900–U+097F
  Urgency: ताबडतोब, आत्ताच, खाते बंद होईल · Scam: लॉटरी, पैसे मिळवा, ओटीपी · Romanized: taabadtob, otp sanga
• 🇮🇳 Telugu — Telugu script · U+0C00–U+0C7F
  Urgency: వెంటనే, ఇప్పుడే, ఖాతా మూసివేయబడుతుంది · Scam: లాటరీ, ఓటీపీ, పాస్వర్డ్ · Romanized: ventane, otp cheppandi
• 🇮🇳 Kannada — Kannada script · U+0C80–U+0CFF
  Urgency: ತಕ್ಷಣ, ಈಗಲೇ, ಖಾತೆ ಮುಚ್ಚಲಾಗುವುದು · Scam: ಲಾಟರಿ, ಒಟಿಪಿ, ಪಾಸ್‌ವರ್ಡ್ · Romanized: takshana, otp heli

9.2 Language Detection Algorithm

Detection uses zero-dependency Unicode block range counting. URLs are stripped before script analysis so that an injected English URL (e.g. bit.ly/verify) does not cause a Kannada message to be detected as English. Hindi and Marathi share Devanagari script and are disambiguated by checking for language-specific marker words (आहे, करा → Marathi; है, करें → Hindi).

| Script | Unicode Range | Language | Disambiguation |
| --- | --- | --- | --- |
| Devanagari | U+0900–U+097F | hi or mr | Marker word check |
| Telugu | U+0C00–U+0C7F | te | — |
| Kannada | U+0C80–U+0CFF | kn | — |
| Latin | U+0041–U+007A | en | — |
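The block-counting step can be sketched as follows. The code assumes a helper of the shape count_script_chars used in Appendix B; the ranges come from the table above (note the Latin range as given also spans a few ASCII punctuation codepoints, which this sketch does not special-case):

```python
# Unicode block ranges per the script table (inclusive codepoint bounds)
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "telugu":     (0x0C00, 0x0C7F),
    "kannada":    (0x0C80, 0x0CFF),
    "latin":      (0x0041, 0x007A),
}

def count_script_chars(text: str) -> dict:
    """Count characters falling into each script's Unicode block."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts
```

The dominant block then selects the language, with the Devanagari case falling through to the Hindi/Marathi marker-word check.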

9.3 Multilingual Dataset (Synthetic)

| Split | Total | Scam | Safe |
| --- | --- | --- | --- |
| Train (80%) | 6,380 | 3,187 | 3,193 |
| Test (20%) | 1,595 | 797 | 798 |
| Total | 7,975 | 3,984 | 3,991 |
Synthetic data caveat: All multilingual messages are synthetically generated. Real-world Indian SMS/WhatsApp corpora (AI4Bharat IndicNLP, IndicGLUE) will produce lower benchmark numbers. The F1 = 1.000 on synthetic data reflects near-perfect class separation on synthetically generated text — not real-world generalisation.

9.4 Multilingual Results (Synthetic Benchmark)

| Language | F1 | AUC | Recall | n Samples | Note |
| --- | --- | --- | --- | --- | --- |
| Hindi (hi) | 1.0000* | 1.0000 | 1.0000 | 3,680 | Largest pool |
| Telugu (te) | 1.0000* | 1.0000 | 1.0000 | 1,690 | — |
| Kannada (kn) | 1.0000* | 1.0000 | 1.0000 | 1,610 | — |
| Marathi (mr) | — | — | — | — | Insufficient class balance in test split |

* Synthetic dataset artifact. Same caveat applies as for English. Real-world performance will be lower.

9.5 Adversarial Robustness — Multilingual (4 attacks)

| Attack | Recall | Δ Recall | Coverage |
| --- | --- | --- | --- |
| Clean (no attack) | 1.0000 | — | — |
| Synonym substitution | 1.0000 | 0.0000 | Robust on synthetic data (Romanized variants in training) |
| Homoglyph attack | 1.0000 | 0.0000 | f32 char n-gram partially absorbs this |
| URL obfuscation | 1.0000 | 0.0000 | script_mismatch (f31) catches injected URLs |
| Script swap (new) | 1.0000 | 0.0000 | Romanized variants present in training lexicons |

9.6 Android Bundle Size

| File | Size | Purpose |
| --- | --- | --- |
| multilingual_scam_detector.pkl | 281 KB | 32-feature GBM + isotonic calibration |
| multilingual_ngram_model.pkl | 561 KB | Char 3–5gram TF-IDF + LogisticRegression |
| Total bundle | 842 KB | Well within low-RAM Android budget (<2 GB) |
Why not IndicBERT or mBERT? IndicBERT is ~300 MB; mBERT is ~700 MB. Neither fits on low-RAM Android without significant quantisation infrastructure. The total bundle here is 842 KB — 350× smaller than IndicBERT.
§ 10

Model Explainability

Every prediction is traceable to specific feature contributions. For the LR companion model, SHAP values are mathematically equivalent to coefficient-based attribution in linear models — making this approach both theoretically sound and computationally free. For feature i in a linear model, the Shapley value is φᵢ = wᵢ · (xᵢ − E[xᵢ]).
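The linear-model identity can be checked numerically. The weights and background data below are hypothetical, chosen only to mirror three of the coefficients in §10.1; the attributions plus the base value reconstruct the model's log-odds exactly:

```python
import numpy as np

# Hypothetical weights: has_urgency, url_shortener, verified_domain
w = np.array([2.2, 2.9, -3.7])

# Hypothetical background dataset defining E[x_i]
X_background = np.array([[0, 0, 1],
                         [1, 0, 1],
                         [0, 1, 0]], dtype=float)

# A message activating urgency + shortener, with no verified domain
x = np.array([1.0, 1.0, 0.0])

# Shapley values of a linear model: phi_i = w_i * (x_i - E[x_i])
phi = w * (x - X_background.mean(axis=0))

# Attributions plus the base value recover the model output w @ x
base = float(w @ X_background.mean(axis=0))
assert np.isclose(base + phi.sum(), float(w @ x))
```

Note that the absence of verified_domain gets a positive attribution: relative to the background expectation, the missing safety signal pushes toward scam.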

10.1 LR Coefficient Analysis

| Feature | Coefficient | Direction | Interpretation |
| --- | --- | --- | --- |
| verified_domain | −3.698 | ▼ Safe | Strongest safety signal — and primary adversarial blind spot |
| has_off_platform | +2.959 | ▲ Scam | Off-platform redirection attempt |
| url_shortener | +2.892 | ▲ Scam | URL obfuscation tactic |
| has_sensitive | +2.647 | ▲ Scam | Credential solicitation — critical risk |
| has_urgency | +2.200 | ▲ Scam | Urgency framing tactic |
| has_legitimacy_marker | −1.044 | ▼ Safe | Professional context signal |

10.2 Prediction Attribution Waterfall — "URGENT! Verify your OTP at bit.ly/verify"

Total log-odds = +9.67 → P(scam) = 0.9999. Each bar shows log-odds contribution of an active feature.
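The mapping from total log-odds to the quoted probability is the logistic function; reproducing the headline number:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function mapping log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Total log-odds from the waterfall example
p = sigmoid(9.67)   # ≈ 0.9999
```

Because the curve saturates, every active scam signal past the first few adds vanishingly little probability — which is why the attribution waterfall, not the final probability, is the informative artifact.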

10.3 Multilingual Example

Hindi scam: "आपके बैंक खाते में संदिग्ध गतिविधि पाई गई है। तुरंत वेरीफाई करें: bit.ly/bank-verify"

Active signals: has_urgency_ml (+तुरंत), has_sensitive_ml (+वेरीफाई), url_shortener (+bit.ly), script_mismatch (+0.42 — Roman URL in Hindi text), char_ngram_scam_score (+0.99)

Verdict: SCAM · p = 1.000 · language = hi · threshold = 0.85
§ 11

Ablation Study

Feature groups were removed one at a time and the model was retrained to measure each group's contribution. URL features are the single most impactful group on the synthetic benchmark (−7.2% F1 when removed), confirming that URL signals and text signals are complementary rather than redundant.

| Configuration | F1 | Drop |
| --- | --- | --- |
| Full model (24 features) | 0.9462 | — |
| No URL features (−6) | 0.8741 | −0.072 |
| No urgency features (−3) | 0.8913 | −0.055 |
| No credential features (−2) | 0.9018 | −0.044 |
| No off-platform feature (−1) | 0.9224 | −0.024 |
| Text features only (f1–f9) | 0.819 | −0.127 |
| URL features only (f18–f24) | 0.783 | −0.163 |
Real-world ablation insight: On UCI SMS, URL features contribute 0.0 importance because most real SMS spam lacks URLs. The model's real-world F1 = 0.9278 is therefore driven entirely by statistical and keyword features — confirming that the feature set generalises beyond URL-dependent synthetic patterns. Text and URL features are complementary: neither alone achieves the full model's performance.
§ 12

Ensemble Decision Strategy

Final classification leverages multiple independent signals to reduce single points of failure. Each component has a distinct and complementary failure mode.

| Component | Latency | Cost | Weight | Failure Mode |
| --- | --- | --- | --- | --- |
| Heuristics | <1 ms | Free | 0.2 | Novel phrasing |
| ML Model (GBM) | <5 ms | Minimal | 0.6 | Known feature exploitation |
| Char N-gram (f32) | <2 ms | Minimal | Embedded in GBM | Out-of-lexicon obfuscation |
| LLM Guard (optional) | ~1 s | High | 0.2 | Inconsistent, expensive |
Python

if ml_score > threshold_for_lang(lang):        # language-aware threshold
    verdict = "scam"
elif ml_score < 0.20 and heuristic_score < 30:
    verdict = "safe"
elif llm_unsafe or (ml_score > 0.70 and heuristic_score > 60):
    verdict = "scam"
elif ml_score > 0.50 or heuristic_score > 50:
    verdict = "suspicious"
else:
    verdict = "safe"
§ 13

Precision-Recall Trade-offs

Language-specific scam thresholds reflect pool size and training data confidence. Smaller pools get more conservative thresholds.

| Language | Scam Threshold | Rationale |
| --- | --- | --- |
| English (en) | 0.90 | Original English threshold unchanged |
| Hindi (hi) | 0.85 | Largest multilingual pool — good confidence |
| Telugu (te) | 0.85 | — |
| Kannada (kn) | 0.85 | — |
| Marathi (mr) | 0.80 | Smaller training pool — more conservative |
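The ensemble code in §12 calls threshold_for_lang; a minimal sketch consistent with the table above (the fallback policy for unsupported language codes is an assumption):

```python
# Per-language scam thresholds from the table above
SCAM_THRESHOLDS = {"en": 0.90, "hi": 0.85, "te": 0.85, "kn": 0.85, "mr": 0.80}

def threshold_for_lang(lang: str) -> float:
    """Return the scam-probability threshold for a detected language.

    Unknown languages fall back to the lowest (most cautious) threshold,
    matching the document's recall-first safety posture.
    """
    return SCAM_THRESHOLDS.get(lang, min(SCAM_THRESHOLDS.values()))
```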

13.1 Threshold Sweep (Synthetic Benchmark)

Operating threshold selection depends on deployment context: high-volume automated filtering tolerates higher FPR; consumer-facing alerting requires higher precision.

| Threshold | Precision | Recall | F1 | FPR |
| --- | --- | --- | --- | --- |
| 0.30 | 0.881 | 0.979 | 0.928 | 0.121 |
| 0.50 | 0.921 | 0.964 | 0.942 | 0.079 |
| 0.70 | 0.943 | 0.950 | 0.946 | 0.055 |
| 0.90 | 0.961 | 0.929 | 0.945 | 0.036 |
| 0.999 | 0.998 | 0.921 | 0.958 | 0.002 |

| Zone | Threshold | Action | Use Case |
| --- | --- | --- | --- |
| Safe | < 0.20 | Allow message | High-volume filtering |
| Suspicious | 0.20 – language threshold | Flag for human review | Ambiguous edge cases |
| Scam | ≥ language threshold | Block / alert user | Automated blocking |
§ 14

LLM-Based Semantic Safety Layer

Feature-based ML can miss novel scam tactics absent from training data, subtle persuasion, and context-dependent deception. A Llama Guard model is integrated as a secondary semantic validator, invoked only when ML confidence falls in the ambiguous 0.4–0.6 band.

Multilingual LLM note: For non-English messages, the LLM layer requires a multilingual safety model (e.g. multilingual Llama Guard or GPT-4o). The 90% cost savings from ML pre-filtering apply equally across all 5 languages.
| Strategy | LLM Calls / 1K msgs | Cost | Savings |
| --- | --- | --- | --- |
| Without ML gating | 1,000 | $1.00 | — |
| With ML pre-filtering | ~100 (ambiguous) | $0.10 | 90% reduction |
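The gating policy itself is a one-line band check. A sketch, assuming the 0.4–0.6 ambiguity band stated above (the helper name needs_llm_review is illustrative):

```python
def needs_llm_review(ml_score: float,
                     low: float = 0.4, high: float = 0.6) -> bool:
    """Invoke the LLM only when the calibrated ML probability is ambiguous."""
    return low <= ml_score <= high

# Routing a hypothetical batch: only in-band scores incur an LLM call
scores = [0.05, 0.95, 0.45, 0.99, 0.55, 0.01]
llm_calls = sum(needs_llm_review(s) for s in scores)
```

Because calibrated probabilities cluster near 0 and 1 for most messages, only a small fraction lands in the band, which is what drives the ~90% cost reduction.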
§ 15

Deployment Architecture

The English model is a Flask microservice at sub-5ms latency. The multilingual extension adds a second model loaded alongside the first; both are loaded once at startup and cached. The total memory footprint is under 2 MB.

JSON

// POST /predict — works for all 5 languages
{
  "content": "आपके बैंक खाते में संदिग्ध गतिविधि। bit.ly/verify-now"
}

// Response
{
  "verdict": "scam",
  "probability": 1.0,
  "risk_score": 100,
  "language": "hi",
  "threshold": 0.85,
  "top_signals": [
    { "feature": "char_ngram_scam_score", "importance": 1.0 }
  ]
}

15.1 Deployment Checklist

  • English model serialization (joblib) — scam_detector_final.pkl
  • Multilingual GBM serialization — multilingual_scam_detector.pkl (281 KB)
  • Char n-gram model serialization — multilingual_ngram_model.pkl (561 KB)
  • Flask API with error handling and language routing
  • Calibrated probability output (isotonic regression, both models)
  • Language-specific threshold configuration
  • Graceful degradation — if ngram model fails to load, f32 = 0.0; GBM still runs
  • Rate limiting (prevent API abuse)
  • Logging and monitoring dashboard
  • Model versioning and rollback
  • Docker containerisation
§ 16

Limitations & Future Work

Current Limitations:
• Synthetic benchmark data — real-world UCI SMS F1 = 0.9278 is the operationally honest metric
• UCI SMS corpus (2012) represents older promotional spam; modern phishing patterns require further validation
• English adversarial recall drops 71–82% under obfuscation attacks — primary technical limitation
• URL features (f18–f24) contribute zero importance on UCI SMS — indicates overfit to synthetic URL patterns
• Marathi training pool is smaller than other languages
• No DistilBERT direct experimental comparison in this submission (literature estimate used)
• No behavioural signal integration (sender patterns, timing, contact graph)
• No image/OCR support for screenshot-based scams

English: Direct DistilBERT experimental comparison on UCI SMS (same split). Evaluation on Nazario Phishing Corpus and Enron-Spam Dataset. Adversarial training with synonym/homoglyph augmentation. Full SHAP value visualization for per-prediction attribution.

Multilingual: Integrate AI4Bharat IndicNLP and IndicGLUE corpora. Expand Marathi training pool to achieve per-class balance. Add Bengali (U+0980–U+09FF) and Tamil (U+0B80–U+0BFF).

Semantic embedding features (sentence-transformers) for adversarial robustness. Character-level CNN features for homoglyph attack resistance. Domain reputation API integration (VirusTotal, URLhaus) for Indian financial domains. Behavioral signals: sender patterns, timing, contact-graph structure. Per-language adversarial training with script-swap augmentation.

The most important experiment the field has not yet run: adversarial red-teaming by an agent with full knowledge of the feature set, actively mutating content across sessions. Every published evaluation uses frozen test data; a model achieving high F1 on historical data does not provide equivalent protection when an adversary is adapting in real time.

Additional long-term directions: online learning pipeline with retraining triggered by human-reviewed flagged cases; full OCR pipeline for screenshot-based scams; coverage of all 22 scheduled Indian languages.

§ 17

Conclusion

This work demonstrates that interpretable machine learning — integrated with an LLM safety layer and rule-based heuristics — forms a practical, auditable scam detection system. The English system achieves F1 = 0.9969 (3-fold CV) on 19,992 synthetic samples across 17 scam categories and F1 = 0.9303 on the real-world UCI SMS Spam Collection (5,574 messages), outperforming all four baselines with statistical significance (p < 0.001, McNemar's test). It is competitive with fine-tuned DistilBERT at 125× smaller model size and sub-5 ms inference latency.

The domain shift between synthetic and real data is expected and documented: URL features (dominant in synthetic evaluation) contribute zero importance on UCI SMS spam, while digit_ratio becomes the dominant signal on real data. This confirms that the feature set generalises across different spam pattern distributions, but the specific relative importances shift with the corpus.

The multilingual extension adds Hindi, Marathi, Telugu, and Kannada support in an 844 KB Android-compatible bundle via zero-dependency language detection, language-specific keyword lexicons, script-mismatch detection, and a char n-gram meta-feature — without modifying any component of the original English pipeline.

Key Takeaways: Feature engineering outperforms generic text representations. Linear models provide sufficient performance with superior interpretability. Real-world validation (F1 = 0.9303 on UCI SMS) confirms generalisation beyond synthetic benchmarks. Char n-grams are the only viable sub-MB approach for on-device multilingual scam detection. Adversarial fragility, not headline metrics, reveals the true operational failure modes.

"The correct operational posture is to deploy this model as the first layer of a continuously updated pipeline, with retraining triggered by human-reviewed flagged cases — in any of the five supported languages."

Vishwajeet Adkine · DOI: 10.5281/zenodo.18988170

§

References

  • Fette, I., Sadeh, N., & Tomasic, A. (2007). Learning to detect phishing emails. WWW 2007.
  • Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions (SHAP). NeurIPS 2017.
  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining predictions of any classifier. KDD 2016.
  • Iyer, R., et al. (2023). Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv:2312.06674.
  • Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5).
  • Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. JMLR, 12, 2825–2830.
  • FBI IC3. (2024). 2023 Internet Crime Report. Federal Bureau of Investigation.
  • Sahin, D. O., et al. (2019). Phishing URL detection via CNN and attention-based hierarchical RNN. ICIM 2019.
  • Chen, Z., et al. (2023). Can LLMs detect social engineering attacks? A zero-shot evaluation. arXiv preprint.
  • Kakwani, D., et al. (2020). IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. EMNLP Findings 2020.
  • Kunchukuttan, A., et al. (2020). AI4Bharat-IndicNLP Corpus: Monolingual corpora and word embeddings for Indic languages. arXiv:2005.00085.
  • Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3).
  • Sanh, V., et al. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
  • Aghaei, E., et al. (2022). DINE: Detecting online scam via behavioral graph analysis. IEEE INFOCOM 2022.
  • Almeida, T. A., & Gómez Hidalgo, J. M. (2011). Contributions to the study of SMS spam filtering: New collection and results. DocEng 2011. [UCI SMS Spam Collection]
  • Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers.
App

Appendices

Appendix A: Extended Feature Extraction (32 features)

Python

def extract_features_extended(text, ngram_model=None):
    # f1–f24: original English features (unchanged)
    original = extract_original_24(text)

    # f25: language detection (zero-dependency Unicode)
    lang = detect_language(text)  # strips URLs before analysis
    f25 = lang_to_int(lang)       # 0=en, 1=hi, 2=mr, 3=te, 4=kn

    # f26–f30: multilingual keyword signals
    f26 = has_keyword(lang, "URGENCY", text)
    f27 = has_keyword(lang, "MONEY", text)
    f28 = has_keyword(lang, "SENSITIVE", text)
    f29 = has_keyword(lang, "OFF_PLATFORM", text)
    f30 = has_keyword(lang, "THREAT", text)

    # f31: script mismatch (Roman chars in native-script message)
    f31 = script_mismatch_score(text)

    # f32: char n-gram model output (0.0 if model not loaded)
    f32 = ngram_model.predict_proba([text])[0][1] if ngram_model else 0.0

    return original + [f25, f26, f27, f28, f29, f30, f31, f32]

Appendix B: Language Detection (URL-stripped)

Python

def detect_language(text):
    # Strip URLs before script analysis — prevents injected English
    # URLs in native-script messages from causing misclassification
    stripped = re.sub(
        r'https?://\S+|www\.\S+|[a-z0-9.-]+\.[a-z]{2,6}(/\S*)?',
        ' ',
        text,
    )
    counts = count_script_chars(stripped)
    dominant = max(counts, key=counts.get)
    if dominant == 'devanagari':
        return 'mr' if any(w in text for w in MARATHI_MARKERS) else 'hi'
    return {'telugu': 'te', 'kannada': 'kn', 'latin': 'en'}.get(dominant, 'other')

Appendix C: Dataset Splits Summary

| Dataset | Train | Test | Total | Type |
| --- | --- | --- | --- | --- |
| Synthetic (EN) | 15,993 | 3,999 | 19,992 | Synthetic |
| UCI SMS Spam | 4,459 | 1,115 | 5,574 | Real-world |
| Multilingual (HI/MR/TE/KN) | 6,380 | 1,595 | 7,975 | Synthetic |