Take a sample of phishing emails and find the most common words

Our Objective

Take a sample of phishing emails and find the most common words in them.


The Theory

  • Phishing emails are fraudulent messages sent by cybercriminals to deceive individuals into divulging sensitive information, such as login credentials, financial details, or personal data. These emails often appear to be from reputable sources, such as banks, online services, or government institutions, in an attempt to trick recipients into taking some action that benefits the attackers.
  • Here are some common characteristics of phishing emails:
  • Urgent or alarming content: Phishing emails often create a sense of urgency or alarm to push recipients into immediate action. For example, they may claim that an account has been compromised and that action must be taken to prevent unauthorized access.
  • Suspicious links: Phishing emails may contain links that appear legitimate but actually direct users to fake websites designed to steal login credentials or personal information.
  • Fake attachments: Some phishing emails include attachments that contain malware. The attackers try to entice recipients to open these attachments, which can compromise their devices.
  • Spoofed sender addresses: Phishers may use email addresses that resemble legitimate ones to deceive recipients. These email addresses might have minor variations or misspellings.
  • Poor grammar and spelling: Phishing emails often contain grammar and spelling mistakes, for example because they were put together quickly or written by non-native English speakers.
  • Requests for personal information: Phishing emails may ask recipients to provide personal information, such as passwords, Social Security numbers, or credit card details, under the guise of verifying accounts or claiming rewards.


Learning Outcomes 

  • Code Modularity: The student will see the benefits of writing modular code by encapsulating the word-counting logic in a separate function (find_common_words). Modular code is easier to maintain and reuse.
  • List and Loop Manipulation: The student will become familiar with iterating over elements in a list.
  • Word Frequency Analysis: The student will understand how to analyze word frequencies in a given dataset. The code finds and counts the occurrences of each word in the phishing emails.
  • Text Processing: The student will learn how to process text data by converting the text to lowercase and extracting individual words from strings. This is a common step in text analysis tasks.
  • Regular Expression Usage: The student will learn how to use regular expressions (the re module) to extract words from strings. The pattern \b\w+\b matches each run of one or more word characters (letters, digits, and underscores) delimited by word boundaries.
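
The outcomes above can be tied together in a short sketch. This is one possible version of find_common_words, not a definitive implementation: the top_n parameter, the use of collections.Counter for the counting, and the sample messages are all assumptions made for illustration.

```python
import re
from collections import Counter


def find_common_words(emails, top_n=5):
    """Return the top_n most frequent words across a list of email strings.

    The top_n parameter and the use of Counter are assumptions of this sketch.
    """
    counts = Counter()
    for email in emails:
        # Lowercase first so "Account" and "account" count as the same word.
        words = re.findall(r"\b\w+\b", email.lower())
        counts.update(words)
    return counts.most_common(top_n)


# Made-up sample messages, for illustration only.
sample_emails = [
    "Urgent: your account has been compromised. Verify your account now.",
    "Your account will be suspended. Click here to verify your password.",
    "Claim your reward now! Verify your details to receive your prize.",
]

for word, count in find_common_words(sample_emails, top_n=3):
    print(f"{word}: {count}")
```

Keeping the counting logic in its own function means the same code can be reused on any list of strings, such as a larger sample loaded from files. Very common words like "your" tend to dominate the raw counts, so a real analysis might filter them out before ranking.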