
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Finding Significant Words Using TF/IDF

Description: This notebook shows how to discover the most significant words in a corpus using TF-IDF. The following processes are described:

  • An educational overview of TF-IDF, including how it is calculated

  • Using the constellate client to retrieve a dataset

  • Filtering based on a pre-processed ID list

  • Cleaning the tokens in the dataset

  • Creating a gensim dictionary

  • Creating a gensim bag of words corpus

  • Computing the most significant words in your corpus using gensim's implementation of TF-IDF

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Take me to the Research Version of this notebook ->

Difficulty: Intermediate

Completion time: 60 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: JSON Lines (.jsonl)

Libraries Used:

  • constellate client to collect, unzip, and read our dataset

  • pandas to load a preprocessing list

  • gensim to help compute the TF-IDF calculations

  • NLTK to create a stopwords list (if no list is supplied)

Research Pipeline:

  1. Build a dataset

  2. Create a “Pre-Processing CSV” with Exploring Metadata (Optional)

  3. Complete the TF-IDF analysis with this notebook


What is “Term Frequency-Inverse Document Frequency” (TF-IDF)?

TF-IDF is used in machine learning and natural language processing for measuring the significance of particular terms for a given document. It consists of two parts that are multiplied together:

  1. Term Frequency: A measure of how many times a given word appears in a document

  2. Inverse Document Frequency: A measure of how rare the word is across all the documents in the corpus, based on the number of documents that contain the word

Before starting this lesson, we recommend reading the explanation of TF/IDF in the Constellate documentation.

TF-IDF Calculation in Plain English

\[\text{(Times the word occurs in given document)} \cdot \log_{10} \frac{\text{(Total number of documents)}}{\text{(Number of documents containing the word)}}\]

There are variations on the TF-IDF formula, but this is the most widely-used version.
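To make the calculation concrete, here is a small worked example with made-up numbers (a hypothetical word that occurs 7 times in a document, in a corpus of 200 documents where 5 documents contain the word):

# A toy TF-IDF calculation using hypothetical numbers (not from the dataset)
import math

times_word_occurs_in_document = 7   # term frequency
total_number_of_documents = 200     # size of the corpus
documents_containing_word = 5       # document frequency

tf_idf = times_word_occurs_in_document * math.log10(
    total_number_of_documents / documents_containing_word
)
print(tf_idf) # 7 * log10(40), roughly 11.21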

Computing TF-IDF with your Dataset

We’ll use the constellate client to automatically retrieve the dataset in the JSON Lines (.jsonl) file format.

Enter a dataset ID in the next code cell.

If you don’t have a dataset ID, you can use the default sample dataset ID already supplied in the next code cell.

# Default dataset is "Shakespeare Quarterly," 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Next, import the constellate client and pass the dataset_id as an argument to the get_dataset method.

# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string containing
# the file name and location
dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
# dataset_file = constellate.download(dataset_id, 'jsonl')

Apply Pre-Processing Filters (if available)

If you completed pre-processing with the “Exploring Metadata and Pre-processing” notebook, you can use your CSV file of dataset IDs to automatically filter the dataset. Your pre-processed CSV file must be in the data folder, matching the path used in the code below.

# Import a pre-processed CSV file of filtered dataset IDs.
# If you do not have a pre-processed CSV file, the analysis
# will run on the full dataset and may take longer to complete.
import pandas as pd
import os

pre_processed_file_name = f'data/pre-processed_{dataset_id}.csv'

if os.path.exists(pre_processed_file_name):
    df = pd.read_csv(pre_processed_file_name)
    filtered_id_list = df["id"].tolist()
    use_filtered_list = True
    print('Pre-Processed CSV found. Successfully read in ' + str(len(df)) + ' documents.')
else: 
    use_filtered_list = False
    print('No pre-processed CSV file found. Full dataset will be used.')

Define a Unigram Processing Function

In this step, we gather the unigrams. If there is a pre-processing filter, we will only analyze documents from the filtered ID list. We will also clean each unigram, assessing it individually. We will complete the following tasks:

  • Lowercase all tokens

  • Remove tokens found in a stopwords list (a sketch for creating the list follows this list)

  • Remove tokens with fewer than 4 characters

  • Remove tokens with non-alphabetic characters
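The cleaning function below checks each token against a stopwords list, so we need one available first. A minimal sketch for building one with NLTK, assuming no custom stopwords list is supplied (the variable name stop_words is an assumption here):

# A minimal sketch: build a stopwords list with NLTK
# (assumes no custom stopwords list has been supplied;
# the variable name `stop_words` is our own choice)
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords') # Download the stopwords corpus if it is not already present
stop_words = set(stopwords.words('english'))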

We can define this process in a function.

# Define a function that will process individual tokens
# Only a token that passes through all three `if` 
# statements will be returned. A `True` result for
# any `if` statement does not return the token. 

def process_token(token):
    token = token.lower()
    if token in stop_words: # If True, do not return token
        return None
    if len(token) < 4: # If True, do not return token
        return None
    if not(token.isalpha()): # If True, do not return token
        return None
    return token # If all are False, return the lowercased token

Collect Lists of Document IDs, Titles, and Unigrams

Next, we process all the unigrams into a list called documents. For demonstration purposes, this code runs on a limit of 500 documents, but we can change (or remove) the limit to process all the documents. We are also collecting the document titles and IDs so we can reference them later.

limit = 500 # Change this value to process more documents; set to None to process every document

documents = [] # A list that will contain all of our unigrams
document_ids = [] # A list that will contain all of our document ids
document_titles = [] # A list that will contain all of our titles

for document in constellate.dataset_reader(dataset_file):
    processed_document = [] # Temporarily store the unigrams for this document
    document_id = document['id'] # Temporarily store the document id for this document
    document_title = document['title'] # Temporarily store the document title for this document
    if use_filtered_list is True:
        # Skip documents not in our filtered_id_list
        if document_id not in filtered_id_list:
            continue
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = process_token(gram)
        if clean_gram is None:
            continue
        processed_document += [clean_gram] * count # Add the unigram as many times as it was counted
    if len(processed_document) > 0:
        document_ids.append(document_id)
        document_titles.append(document_title)
        documents.append(processed_document)
    if limit is not None and len(documents) >= limit:
        break

At this point, we have unigrams collected for all our documents inside the documents list variable. Each index of our list is a single document, starting with documents[0]. Each document is, in turn, a list containing a string for each unigram occurrence.

Note: As we collect the unigrams for each document, we are simply including them in a list of strings. This is not the same as collecting them into word counts, and we are not using a Counter() object here as the Word Frequencies notebook does.

The next cell demonstrates the contents of a single item in our documents list.

# Show the unigrams collected for a particular document
# Change the index to see a different document
print(document_titles[0])
list(documents[0])

If we wanted to see word frequencies, we could convert the lists at this point into Counter() objects. The next cell demonstrates that operation.

# Convert a given document into a Counter object to determine
# word frequency counts

# Import Counter to help count word frequencies
from collections import Counter

word_freq = Counter(documents[0]) # Change the documents index to see a different document
word_freq.most_common(25) # Display the 25 most frequent words in the document

Now that we have all the cleaned unigrams for every document in a list called documents, we can use Gensim to compute TF/IDF.


Using Gensim to Compute “Term Frequency-Inverse Document Frequency”

It will be helpful to remember the basic steps we did in the explanatory TF-IDF example:

  1. Create a list of the frequency of every word in every document

  2. Create a list of every word in the corpus

  3. Compute TF-IDF based on that data

So far, we have completed the first item by creating a list of the frequency of every word in every document. Now we need to create a list of every word in the corpus. In gensim, this is called a “dictionary”. A gensim dictionary is similar to a Python dictionary, but we call it a gensim dictionary to make clear that it is a specialized kind of mapping, not a built-in Python dictionary.

Creating a Gensim Dictionary

Let’s create our gensim dictionary. A gensim dictionary is a kind of master list of all the words across all the documents in our corpus. Each unique word is assigned an ID in the gensim dictionary. The result is a set of key/value pairs of unique tokens and their unique IDs.

import gensim
dictionary = gensim.corpora.Dictionary(documents)

Now that we have a gensim dictionary, we can get a preview that displays the number of unique tokens across all of our texts.

print(dictionary)

The gensim dictionary stores a unique identifier (starting with 0) for every unique token in the corpus. The gensim dictionary does not contain information on word frequencies; it only catalogs all the unique words in the corpus. You can see the unique ID for each token in the text using the .token2id attribute.

list(dictionary.token2id.items())

We could also look up the corresponding ID for a token using the .get method.

# Get the value for the key 'people'. Return 0 if there is no token matching 'people'. 
# The number returned is the gensim dictionary ID for the token. 

dictionary.token2id.get('people', 0) 

For the sake of example, we could also discover a particular token using just the ID number. This is not something likely to happen in practice, but it serves here as a demonstration of the connection between tokens and their ID number.

Normally, Python dictionaries only map from keys to values (not from values to keys). However, we can write a quick for loop to go the other direction. This cell simply demonstrates how the gensim dictionary is connected to the list entries in the gensim bow_corpus.

# Find the token associated with a token id number
token_id = 100

# If the token id matches, print out the associated token
for dict_id, token in dictionary.items():
    if dict_id == token_id:
        print(token)

Creating a Bag of Words Corpus

The next step is to connect our word frequency data found within documents to our gensim dictionary token IDs. For every document, we want to know how many times a word (denoted by its ID) occurs. We will create a Python list called bow_corpus that will turn our word counts into a series of tuples where the first number is the gensim dictionary token ID and the second number is the word frequency.

Combining Gensim dictionary with documents list to create Bag of Words Corpus

# Create a bag of words corpus

bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

print('Bag of words corpus created successfully.')
# Examine the bag of words corpus for a specific document

list(bow_corpus[0][:10]) # List out a slice of the first ten items

Using IDs can seem a little abstract, but we can discover the word associated with a particular ID. For demonstration purposes, the following code will replace the token IDs in the last example with the actual tokens.

word_counts = [[(dictionary[id], count) for id, count in line] for line in bow_corpus]
list(word_counts[0][:10])

Create the TfidfModel

The next step is to create the TF-IDF model, which will set the parameters for our implementation of TF-IDF. In our TF-IDF example, the formula for TF-IDF was:

\[\text{(Times the word occurs in given document)} \cdot \log_{10} \frac{\text{(Total number of documents)}}{\text{(Number of documents containing the word)}}\]

In gensim, the default formula for measuring TF-IDF uses log base 2 instead of log base 10, as shown:

\[\text{(Times the word occurs in given document)} \cdot \log_{2} \frac{\text{(Total number of documents)}}{\text{(Number of documents containing the word)}}\]
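Using the same toy numbers as before (a word occurring 7 times in a document, a corpus of 200 documents, 5 of which contain the word), this short sketch compares the two bases:

# Compare log base 2 (gensim's default) with log base 10 using the toy numbers above
import math

tf = 7                 # times the word occurs in the document
total_docs = 200       # total number of documents
docs_with_word = 5     # number of documents containing the word

print(tf * math.log2(total_docs / docs_with_word))  # roughly 37.25
print(tf * math.log10(total_docs / docs_with_word)) # roughly 11.21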

If you would like to use a different formula for your TF-IDF calculation, the gensim documentation describes the parameters you can pass to TfidfModel.

# Create our gensim TF-IDF model
model = gensim.models.TfidfModel(bow_corpus) 
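If you want to experiment with a different weighting scheme, gensim's TfidfModel also accepts a smartirs parameter written in SMART notation. A brief sketch (the 'ltc' value is only an example; consult the gensim documentation for TfidfModel for the full set of options):

# A sketch of creating a TF-IDF model with an alternative weighting scheme.
# The smartirs value 'ltc' is only an example; see the gensim documentation
# for TfidfModel for all available SMART notation options.
alternative_model = gensim.models.TfidfModel(bow_corpus, smartirs='ltc')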

Now, we apply our model to the bow_corpus to create our results in corpus_tfidf. The corpus_tfidf is a Python list with one entry per document, similar to bow_corpus. Instead of listing the frequency next to the gensim dictionary ID, however, it contains the TF-IDF score for the associated token. Below, we display the first document in corpus_tfidf.

# Create TF-IDF scores for the ``bow_corpus`` using our model
corpus_tfidf = model[bow_corpus]

# List out the TF-IDF scores for the first 10 tokens of the first text in the corpus
list(corpus_tfidf[0][:10])

Let’s display the tokens instead of the gensim dictionary IDs.

example_tfidf_scores = [[(dictionary[id], count) for id, count in line] for line in corpus_tfidf]

# List out the TF-IDF scores for the first 10 tokens of the first text in the corpus
list(example_tfidf_scores[0][:10]) 

Find Top Terms in a Single Document

Finally, let’s sort the terms by their TF-IDF weights to find the most significant terms in the document.

# Sort the tuples in our TF-IDF scores list

# Choose a document by its index number
# Change n to see a different document
n = 0

# Print the document id and title
print('Title: ', document_titles[n])
print('ID: ', document_ids[n])

# List the top twenty tokens in our example document by their TF-IDF scores
sorted(example_tfidf_scores[n], key=lambda x: x[1], reverse=True)[:20]

We could also analyze across the entire corpus to find the most unique terms. These are terms that appear in a particular text, but rarely or never appear in other texts. (Often, these will be proper names since a particular article may mention a name often but the name may rarely appear in other articles. There’s also a fairly good chance these will be typos or errors in optical character recognition.)

# Define a dictionary ``td`` that maps each token in the corpus
# to a TF-IDF score (the last score seen for that token)
td = {
    dictionary.get(_id): value
    for doc in corpus_tfidf
    for _id, value in doc
}

# Sort the items of ``td`` into a new variable ``sorted_td``
# ``reverse=True`` sorts the scores from highest to lowest
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

# Print the top 25 terms in the entire corpus
for term, weight in sorted_td[:25]:
    print(term, weight)

Display Most Significant Term for each Document

We can see the most significant term in every document. For brevity, the next cell stops after the first few documents; remove the break statement to see them all.

# For each document, print the ID, most significant/unique word, and TF-IDF score

for n, doc in enumerate(corpus_tfidf):
    if len(doc) < 1: # Skip documents with no scored tokens
        continue
    word_id, score = max(doc, key=lambda x: x[1]) # Token with the highest TF-IDF score
    print(document_ids[n], dictionary.get(word_id), score)
    if n >= 10: # Stop after the first few documents; remove to see every document
        break

Ranking Documents by TF-IDF Score for a Search Word

Finally, we can flip the data around: build a mapping from each term to the documents in which it appears, then rank those documents by the term’s TF-IDF score for a chosen search word.

from collections import defaultdict

# Map each term to a list of (document index, TF-IDF score) tuples
terms_to_docs = defaultdict(list)
for doc_id, doc in enumerate(corpus_tfidf):
    for term_id, value in doc:
        term = dictionary.get(term_id)
        terms_to_docs[term].append((doc_id, value))
    if doc_id >= 500: # Limit to the first several hundred documents; remove to include all
        break
# Pick a unigram to discover its score across documents
search_term = 'coriolanus'

# Display a list of documents and scores for the search term
matching = terms_to_docs.get(search_term)
for doc_id, score in sorted(matching, key=lambda x: x[1], reverse=True):
    print(document_ids[doc_id], score)