Counting unique words in Python

Counting unique words in a text is a common task in applications ranging from simple word processors to natural language processing (NLP) projects. In Python, it can be done with a regular expression (to extract the words) and the Counter class from the collections module (to count them). This guide walks through the process step by step.

1. Why Count Unique Words? Applications and Insights

Understanding the frequency of words in a text can reveal valuable insights:

  • Authorship Analysis: Identify writing styles and patterns.
  • Topic Modeling: Discover the main themes in a document.
  • Text Summarization: Extract the most important keywords and phrases.
  • Sentiment Analysis: Analyze the emotional tone of text.

2. Python Tools: Regular Expressions and Counter

  • Regular Expressions (re module): Powerful pattern-matching tools for extracting words from text.
  • Counter (from collections): A specialized dictionary subclass for counting hashable objects efficiently.
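
As a small illustration of how the two tools fit together, the sketch below runs re.findall() on a made-up sentence and passes the result to Counter. The sample string and the letters-only pattern are assumptions chosen for the example, not part of the guide's function.

```python
import re
from collections import Counter

# Hypothetical sample text, chosen only for illustration.
sample = "The cat saw the other cat."

# re.findall returns every non-overlapping match as a list of strings.
tokens = re.findall(r"[a-z]+", sample.lower())

# Counter maps each distinct token to the number of times it occurs.
counts = Counter(tokens)

print(tokens)                 # all tokens, in order of appearance
print(counts.most_common(2))  # the two most frequent tokens
print(len(counts))            # number of distinct tokens
```

The length of the Counter gives the number of unique words, while the length of the token list gives the total.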

3. Building a Unique Word Counter: Step-by-Step Guide

import re
from collections import Counter

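def count_words(filepath):
    # Read the whole file (assumes the file is UTF-8 encoded).
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()

    # Lowercase the text, then extract runs of letters, digits, apostrophes, and hyphens.
    words = re.findall(r'[a-z0-9\'\-]+', text.lower())
    print(f"Total words: {len(words)}")

    # Counter maps each distinct word to its number of occurrences.
    word_counts = Counter(words)
    print(f"Unique words: {len(word_counts)}")

    # Show the 20 most frequent words.
    for word, count in word_counts.most_common(20):
        print(f"{word}: {count}")

Explanation:

  1. Read File: Open the text file in read mode and read its contents into a single string.
  2. Extract Words: Convert the text to lowercase, then use a regular expression to find all sequences of letters, numbers, hyphens, and apostrophes (our definition of a word). The pattern covers only the ASCII letters a-z, so accented letters act as word boundaries.
  3. Count Words: Create a Counter object and populate it with the extracted words. The Counter automatically counts the occurrences of each word, so its length is the number of unique words.
  4. Display Results: Print the total number of words, the number of unique words, and the top 20 most frequent words along with their counts.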
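
One caveat about this definition of a word: because hyphens and apostrophes are part of the character class, a run made only of punctuation, such as a standalone "--", also counts as a word. The sketch below compares the guide's pattern (written here with double quotes) against a stricter alternative that keeps hyphens and apostrophes only between alphanumeric characters. The stricter pattern and the sample sentence are assumptions for illustration, not part of the original function.

```python
import re

# Hypothetical sample text with a quoted word and a double hyphen.
sample = "It's a well-known 'fact' -- or so they say."

# Pattern from the guide: any run of letters, digits, apostrophes, hyphens.
loose = re.findall(r"[a-z0-9'\-]+", sample.lower())

# Stricter alternative (an assumption, not the guide's pattern): apostrophes
# and hyphens are kept only when alphanumeric characters surround them.
strict = re.findall(r"[a-z0-9]+(?:['\-][a-z0-9]+)*", sample.lower())

print(loose)   # includes "'fact'" and "--" as words
print(strict)  # keeps "it's" and "well-known", drops stray punctuation
```

Which pattern is preferable depends on the text; the loose one is simpler, while the strict one avoids counting quotation marks and dashes.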

4. Running the Word Counter: Test It Out!

count_words('shakespeare.txt')  # Replace with your text file

5. Key Takeaways: Efficient Word Frequency Analysis

  • Regular Expressions: Master pattern matching to extract relevant data from text.
  • Counter Class: Leverage this specialized dictionary for fast and convenient counting.
  • Customizable: Adjust the regular expression or number of top words to suit your needs.

Frequently Asked Questions (FAQ)

1. Can I use this function to count words in strings instead of files?

Yes. Remove the file-reading part and pass the string, lowercased with .lower(), directly to re.findall().
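
A minimal sketch of that change, assuming a hypothetical helper named count_words_in_text that returns the Counter instead of printing it:

```python
import re
from collections import Counter

def count_words_in_text(text):
    """Return a Counter of the words in a string (hypothetical helper name)."""
    # Same word pattern as the file-based function, applied to lowercased text.
    words = re.findall(r"[a-z0-9'\-]+", text.lower())
    return Counter(words)

counts = count_words_in_text("To be, or not to be")
print(len(counts))            # number of unique words
print(sum(counts.values()))   # total number of words
print(counts.most_common(2))  # two most frequent words
```

Returning the Counter rather than printing leaves the caller free to decide how to report the results.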

2. How can I exclude common words (like “the,” “and,” “a”) from the count?

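Create a set of stop words and filter them out after extracting the words and before counting. The words are already lowercase at that point, so no further case conversion is needed, and a set makes each membership test fast:

stop_words = {"the", "and", "a", "in"}
words = [word for word in words if word not in stop_words]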

3. Can I customize how the results are displayed?

Absolutely! Modify the print statements to format the output in your preferred style (e.g., table, CSV, etc.).
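
As one possible way to produce CSV output, the sketch below uses the standard csv module. The helper name write_counts_csv, its parameters, and the sample words are assumptions made for illustration; any writable text stream, including an open file, could be passed as out.

```python
import csv
import io
from collections import Counter

def write_counts_csv(word_counts, out, top_n=20):
    """Write the top_n most frequent words as CSV rows (hypothetical helper)."""
    writer = csv.writer(out)
    writer.writerow(["word", "count"])                 # header row
    writer.writerows(word_counts.most_common(top_n))   # one (word, count) row each

# Demonstration with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_counts_csv(Counter(["spam", "eggs", "spam"]), buffer)
print(buffer.getvalue())
```

When writing to a real file, the csv documentation recommends opening it with newline='' so that line endings are not translated twice.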