
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under a Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Creating a Stopwords List

Description: This notebook creates a stopwords list and exports it into a CSV file. The following processes are described:

  • Loading a stopwords list from NLTK, Gensim, or spaCy

  • Modifying the stopwords list in Python

  • Saving a stopwords list to a .csv file

  • Loading a stopwords list from a .csv file

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Take me to the Learning Version of this notebook ->

Difficulty: Intermediate

Completion time: 5 minutes

Knowledge Required:

Knowledge Recommended: None

Data Format: CSV files

Libraries Used:

  • nltk (or gensim or spacy) to create an initial stopwords list

  • csv to read and write the stopwords to a file

Research Pipeline: None


Create a Basic Stopwords List

# Choose one of the following three options; each produces a stopwords list named `stop_words`

# Option 1: Start from NLTK
import nltk
from nltk.corpus import stopwords # Import stopwords from nltk.corpus

# nltk.download('stopwords') # Uncomment if the stopwords corpus has not been downloaded yet
stop_words = stopwords.words('english') # Create a list `stop_words` containing the English stopwords

# Option 2: Start from Gensim. Creates a list from a frozen set.
from gensim.parsing.preprocessing import STOPWORDS

stop_words = list(STOPWORDS)

# Option 3: Start from spaCy.
import spacy
sp = spacy.load('en_core_web_sm') # Requires the 'en_core_web_sm' model to be installed

stop_words = list(sp.Defaults.stop_words) # Convert spaCy's set of stopwords to a list
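
The notebook also covers modifying the stopwords list. As a minimal sketch (the words added and removed below are only illustrative examples), standard Python list operations work on the `stop_words` list created above:

# Modify the stopwords list (assumes `stop_words` is a Python list, as created above)

# Add example custom words that should also be treated as stopwords
stop_words.extend(['particularly', 'however', 'also'])

# Remove example words that should be kept in the analysis
for word in ['no', 'not']:
    if word in stop_words:
        stop_words.remove(word)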

Output stop_words to a CSV File

# Output stop_words to a CSV file

import csv # Import the csv module to work with csv files
import os

os.makedirs('data', exist_ok=True) # Create the 'data' directory if it does not already exist

with open('data/stop_words.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(stop_words) # Write all of the stopwords as a single comma-separated row
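
The resulting stop_words.csv holds every stopword on a single comma-separated row, which is the format the reading step below expects.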

Read in the Stopwords CSV File

# Open the CSV file and list the contents

with open('data/stop_words.csv', 'r') as f:
    stop_words = f.read().strip().split(",") # Split the single comma-separated row back into a list of stopwords