SEMPHISH: A Phishing Detection Tool Based on Semantic Hashes
| Title | SEMPHISH: A Phishing Detection Tool Based on Semantic Hashes |
| Publication Type | Conference Paper |
| Year of Publication | 2025 |
| Authors | Romeo, F, Blefari, F, Pironti, F Aurelio, Lupinacci, M, Furfaro, A, Otranto, F |
| Conference Name | The 12th International Conference on Future Internet of Things and Cloud (FiCloud 2025) |
| Abstract | Phishing is a type of social engineering attack in which users are deceived into performing specific actions, often under the guise of a legitimate organization such as Google, their employer, or a financial institution. This paper presents SEMPHISH, a phishing detection tool that leverages semantic hashes and machine learning techniques to identify webpages that visually or structurally mimic well-known legitimate websites. The underlying approach relies on Semantic Hashing techniques, applied to both the source code and screenshots of webpages, to compute similarity scores. The extracted similarity scores are subsequently analyzed using machine learning-based classifiers. To evaluate the performance of SEMPHISH, a custom dataset has been built. Multiple performance metrics were evaluated through extensive experimentation with various machine-learning algorithms. This enabled assessing the impact on detection performance of each similarity score individually, as well as evaluating the hybrid approach leveraging multiple scores. Additionally, it facilitated determining the optimal algorithm and parameter configuration for detecting, preventing, and mitigating phishing threats. The configuration of SEMPHISH that uses the eXtreme Gradient Boosting classifier performed best, achieving an accuracy of 95.15% and an F1-score of 94.99% on the analyzed dataset. |
| DOI | 10.1109/FiCloud66139.2025.00009 |
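
The abstract describes computing similarity scores from hashes of a page's source code and passing those scores to a classifier. The record does not say which hashing scheme SEMPHISH uses, so the sketch below substitutes a plain 64-bit SimHash over HTML tokens purely as an illustration. The function names (`tokens`, `simhash`, `similarity`) and the tokenization rule are hypothetical and are not taken from the paper.

```python
import hashlib
import re

BITS = 64  # fingerprint width; an illustrative choice, not from the paper


def tokens(html: str) -> list[str]:
    # Crude tokenizer: lowercase alphanumeric runs (tag names, attributes, words).
    return re.findall(r"[a-z0-9]+", html.lower())


def simhash(html: str) -> int:
    # SimHash stand-in: each token votes +1/-1 on every bit of the fingerprint.
    weights = [0] * BITS
    for tok in tokens(html):
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def similarity(a: str, b: str) -> float:
    # 1.0 means identical fingerprints; lower values mean more differing bits.
    distance = bin(simhash(a) ^ simhash(b)).count("1")
    return 1.0 - distance / BITS


legit = "<html><body><form action='/login'><input name='user'></form></body></html>"
suspect = "<html><body><form action='/login.php'><input name='user'></form></body></html>"
score = similarity(legit, suspect)
```

In a pipeline like the one the abstract outlines, a score such as `score` would be one feature among several (for example, alongside a screenshot-based score) given to a classifier; that step is omitted here.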
