This study investigates the use of interpretable machine learning to identify fraudulent e-commerce websites by analyzing publicly observable attributes such as domain structure, SSL certificate metadata, and external reputation signals.
Using a balanced dataset of 1,140 online shops, the research identifies distinct patterns: fraudulent sites typically feature newly registered domains, short-lived SSL certificates, and a notable absence of external validation from sources such as Trustpilot reviews or the Tranco ranking. Unlike legitimate businesses, these deceptive operations often lack operational transparency, frequently omitting official logos or relying on free email providers while attempting to mimic professional branding.
To operationalize these findings, the study evaluated three distinct algorithms—Logistic Regression, Random Forest, and XGBoost—ultimately selecting Random Forest as the optimal model for its balance of high accuracy (93%) and clear feature interpretability. The final output is a transparent risk-scoring framework that assigns a probability-based score (0-100) to web domains, categorizing them into low, medium, or high-risk tiers.
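As a rough illustration of the scoring step, the sketch below maps a classifier's fraud probability to a 0-100 score and a risk tier. The cutoffs (40 and 70) and the names `assess` and `RiskAssessment` are assumptions made for this example; the study's actual tier boundaries are not given in this summary.

```python
from typing import NamedTuple


class RiskAssessment(NamedTuple):
    score: int  # 0-100, higher means more likely fraudulent
    tier: str   # "low", "medium", or "high"


# Hypothetical cutoffs chosen for illustration only; the study's
# actual tier boundaries are not stated in this summary.
MEDIUM_CUTOFF = 40
HIGH_CUTOFF = 70


def assess(fraud_probability: float) -> RiskAssessment:
    """Map a model's fraud probability (0.0-1.0) to a score and tier."""
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    score = round(fraud_probability * 100)
    if score >= HIGH_CUTOFF:
        tier = "high"
    elif score >= MEDIUM_CUTOFF:
        tier = "medium"
    else:
        tier = "low"
    return RiskAssessment(score, tier)
```

In practice the probability would come from the trained model (for a scikit-learn classifier, the fraud-class column of `predict_proba`), so the tier can always be traced back to a single number that an analyst can inspect.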
This system is deployed via a command-line tool capable of batch processing, allowing analysts to distinguish between genuine merchants and transient fraud schemes based on quantifiable data rather than opaque, "black-box" predictions. The CLI tool has also been adapted into a Streamlit web app for public testing.
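Batch processing of this kind can be sketched as a loop over a list of domains that writes one scored row per domain. The function below is a minimal, hypothetical example and not the project's actual CLI: `score_batch`, the CSV column names, and the `predict_proba` callable (a stand-in for the trained model) are all assumptions.

```python
import csv
import io
from typing import Callable, Iterable


def score_batch(domains: Iterable[str],
                predict_proba: Callable[[str], float]) -> str:
    """Score each domain and return the results as CSV text.

    `predict_proba` is a stand-in for the trained model: it takes a
    domain and returns a fraud probability between 0.0 and 1.0.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["domain", "risk_score"])
    for raw in domains:
        domain = raw.strip().lower()
        if not domain:
            continue  # skip blank lines in the input
        writer.writerow([domain, round(predict_proba(domain) * 100)])
    return out.getvalue()
```

A caller could pass the lines of a text file as `domains` and write the returned CSV to disk, keeping the model behind a single callable so it can be swapped or mocked in tests.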
Read full (PDF): Scribd - Google Drive
GitHub: https://github.com/rafifmsn/ecom-fraud-risk-scoring
Website: https://efr-scoring.streamlit.app