https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png

Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Working with Dataset Files

Description: This notebook describes how to:

  • Read and write files (.txt, .csv, .json)

  • Use the tdm_client to read in metadata

  • Use the tdm_client to read in data

This notebook describes how to read and write text, CSV, and JSON files using Python. Additionally, it explains how the tdm_client can help users easily load and analyze their datasets.

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Completion Time: 60 minutes

Knowledge Required:

Knowledge Recommended: None

Data Format: Text (.txt), CSV (.csv), JSON (.json)

Libraries Used:

  • pandas to read and write CSV files

  • json to read and write JSON files

  • tdm_client to retrieve and read data

Research Pipeline: None


Files in Python

Working with files is an essential part of Python programming. When we execute code in Python, we manipulate data through the use of variables. When the program is closed, however, any data stored in those variables is erased. To save the information stored in variables, we must learn how to write it to a file.

At the same time, we may have notebooks for applying specific analyses, but we need to have a way to bring data into the notebook for analysis. Otherwise, we would have to type all the data into the program ourselves! Both reading-in from files and writing-out data out to files are important skills for data science and the digital humanities.

This section describes how to work with three kinds of common data files in Python:

  • Plain Text Files (.txt)

  • Comma-Separated Value files (.csv)

  • Javascript Object Notation files (.json)

Each of these filetypes are in wide use in data science, digital humanities, and general programming.

Three Common Data File Types

Plain Text Files (.txt)

A plain text file is one of the simplest kinds of computer files. Plain text files can be opened with a text editor like Notepad (Windows 10) or TextEdit (OS X). The file can contain only basic textual characters such as: letters, numbers, spaces, and line breaks. Plain text files do not contain styling such as: heading sizes, italic, bold, or specialized fonts. (To including styling in a text file, writers may use other file formats such as rich text format (.rtf) or markdown (.md).)

Plain text files (.txt) can be easily viewed and modified by humans by changing the text within. This is an important distinction from binary files such as images (.jpg), archives (.gzip), audio (.wav), or video (.mp4). If a binary file is opened with a text editor, the content will be largely unreadable.

Comma-Separated Value Files (.csv)

A comma-separated value file is also a text file that can easily be modifed with a text editor. A CSV file is generally used to store data that fits in a series or table (like a list or spreadsheet). A spreadsheet application (like Microsoft Excel or Google Sheets) will allow you to view and edit a CSV data in the form of a table.

Each row of a CSV file represents a single row of a table. The values in a CSV are separated by commas (with no space between), but other delimiters can be chosen such as a tab or pipe (|). A tab-separated value file is called a TSV file (.tsv). Using tabs or pipes may be preferable if the data being stored contains commas (since this could make it confusing whether a comma is part of a single entry or a delimiter between entries).

The text contents of a sample CSV file

Username,Login email,Identifier,First name,Last name
booker12,rachel@example.com,9012,Rachel,Booker
grey07,,2070,Laura,Grey
johnson81,,4081,Craig,Johnson
jenkins46,mary@example.com,9346,Mary,Jenkins
smith79,jamie@example.com,5079,Jamie,Smith

The same CSV file represented in Google Sheets:

CSV table view in Google Sheets

JavaScript Object Notation (.json)

A Javascript Object Notation file is also a text file that can be modified with a text editor. A JSON file stores data in key/value pairs, very similar to a Python dictionary. One of the key benefits of JSON is its compactness which makes it ideal for exchanging data between web browsers and servers.

While smaller JSON files can be opened in a text editor, larger files can be difficult to read. Viewing and editing JSON is easier in specialized editors, available online at sites like:

A JSON file has a nested structure, where smaller concepts are grouped under larger ones. Like extensible markup language (.xml), a JSON file can be checked to determine that it is valid (follows the proper format for JSON) and/or well-formed (follows the proper format defined in a specialized example, called a schema).

The text contents of a sample JSON file

{
    "firstName": "Julia",
    "lastName": "Smith",
    "gender": "woman",
    "age": 57,
    "address": {
        "streetAddress": "11434",
        "city": "Detroit",
        "state": "Mi",
        "postalCode": "48202"
    },
    "phoneNumbers": [
        { "type": "home", "number": "7383627627" }
    ]
}

The same JSON file represented in JSON Editor Online

An image of the JSON file showing the structure

Opening, Reading, and Writing Text Files (.txt)

Open the File

Before we can read or write to text file, we must open the file. Normally, when we open a file, a new window appears where we can see the contents. In Python, opening a file simply means create a file object that refers to the particular file. When the file has been opened, we can read or write to the file. Finally, we must close the file. Here are the three steps:

  1. Use the open() function to create a file object

  2. Use the .read(), .readlines(), or .write() method on the file object

  3. Use the close() function to close the file object

Let’s practice on sample.txt, a sample text file.

# Open the text file `sample.txt` creating
# a file object called `f`

f = open('data/sample.txt', 'r')

We have created a file object called f. The first argument ('sample.txt') is a string containing the file name. You can see the sample.txt in the same directory as this lesson. If your file was called reports.txt, you would replace that argument with 'reports.txt'. The second argument ('r') determines that we are opening the file in “read” mode, where the file can be read but not modified. There are three main modes that can be specified:

Argument

Mode Name

Description

‘r’

read

Reads file but no writing allowed (protects file from modification)

‘w’

write

Writes to file, overwriting any information in the file (saves over the current version of the file

‘a’

append

Appends (or adds) information to the end of the file (new information is added below old information)

Read the File

.read() method

Now that we have a file object f opened in “read mode,” let’s read the contents with the .read() method. We will create a variable called file_contents to hold the data that we are reading in.

# Create a variable called `file_contents`
# that will hold the result of using the
# .read() method on our file object
file_contents = f.read()
print(file_contents)

When we are finished with the file object, we must close it using the .close() method on it. It is very important to always close a file, otherwise your program may crash or create memory problems.

# Close the file by using the .close() method
# on the file object
f.close()

.readlines() method

If a file is very large, we may want to read a single line at a time so as not to fill all of the available computer memory. To read a single line at a time, we can use the .readlines() method instead of the .read() method.

# Open the file sample.txt in read mode
# creating the file object `f`
f = open('data/sample.txt', 'r')
file_contents = f.readlines()
print(file_contents)

With the .read() method, we read in the whole text file as a single string. The .readlines() gives us a Python list, where each item is a single string representing a single line of the text document. You may also notice that each line ends with \n which represents a line break in a string. If we print the string, the line break is visible in our output.

# Print the first item in the file_contents list
# Note the \n turns into a line break
print(file_contents[0])
f.close()

Write to the File

To write to a file, we need to open it and create our file object in either write (‘w’) or append (‘a’) mode.

Append mode

Let’s start with append mode which adds new data to the bottom of the file while leaving any previous information intact.

# Opening a file in append mode
# and creating a new file object 
# called `f`
f = open('data/sample.txt', 'a')

Now we can use the .write() method to append a string to the file.

# Appending an eleventh line to the file
f.write('\nThis is the eleventh line')
f.close()

Can you read the file back in to see whether the .write() was successful?

# Open the the file in read mode
# create a file object called `sample_file`
f = 

# Use the .read() method on the file object
# Store the result in a variable `file_contents`
file_contents = 

# Print the contents
print(file_contents)

# Close the file
f.close()

Write mode

Opening a file in write mode is useful in two scenarios:

  • Creating a new text file and writing data to it

  • Overwriting all data in the file with new data

Here is an example:

# Creating a new file in write mode
f = open('data/new_sample.txt', 'w')

# Define a string variable to add to the new file
string = 'Here is some data\nWith a second line'

# Using write method on the file object
contents = f.write(string)

# Close file object
f.close()

Open/Close Files with open

The with open technique is commonly used in Python because it has two significant advantages:

  • It is more compact

  • It automatically closes the file afterward

The basic form resembles a flow control statement, ending in a colon and then executing an indented block of code. After the block executes, the file is closed automatically.

with open('data/sample.txt', 'r') as f:
    print(f.read())

Opening, Reading, and Writing CSV Files (.csv)

CSV file data can be easily opened, read, and written using the pandas library. (For large CSV files (>500 mb), you may wish to use the csv library to read in a single row at a time to reduce the memory footprint.) Pandas is flexible for working with tabular data, and the process for importing and exporting to CSV is simple.

# Import pandas 
import pandas as pd

# Create our dataframe
df = pd.read_csv('data/sample.csv')
# Display the dataframe
print(df)

After you’ve made any necessary changes in Pandas, write the dataframe back to the CSV file. (Remember to always back up your data before writing over the file.)

# Write data to new file
# Keeping the Header but removing the index
df.to_csv('data/new_sample.csv', header=True, index=False)

Opening, Reading, and Writing JSON Files (.json)

JSON files use a key/value structure very similar to a Python dictionary. We will start with a Python dictionary called py_dict and then write the data to a JSON file using the json library.

# Defining sample data in a Python dictionary
py_dict = {
    "firstName": "Julia",
    "lastName": "Smith",
    "gender": "woman",
    "age": 57,
    "address": {
        "streetAddress": "11434",
        "city": "Detroit",
        "state": "Mi",
        "postalCode": "48202"
    },
    "phoneNumbers": [
        { "type": "home", "number": "7383627627" }
    ]
}

To write our dictionary to a JSON file, we will use the with open technique we learned that automatically closes file objects. We also need the json library to help dump our dictionary data into the file object. The json.dump function works a little differently than the write method we saw with text files.

We need to specify two arguments:

  • The data to be dumped

  • The file object where we are dumping

# Open/create sample.json in write mode
# as the file object `f`. The data in py_dict
# is dumped into `f` and then `f` is closed

import json
with open('data/sample.json', 'w') as f:
    json.dump(py_dict, f)

To read data in from a JSON file, we can use the json.load function on our file object. Here we load all the content into a variable called content. We can then print values based on particular keys.

with open('data/sample.json') as f:
    contents = json.load(f)
    print('First Name: ' + contents['firstName'])
    print('Last Name: '+ contents['lastName'])
    print('Age: ' + str(contents['age']))
    print('Phone Number: ', contents['phoneNumbers'][0]['number'])

Opening datasets with tdm_client

The tdm_client helps retrieve a given dataset and/or its associated metadata. The metadata is supplied in a CSV file and the full dataset is supplied in a compressed JSON Lines file (.jsonl.gz). For any analysis focused on metadata pre-processing, we recommend users start with the CSV file since it is both easier and faster to view, parse, and manipulate.

Metadata CSV vs. JSON Lines Data File

All of the textual data and metadata is available inside of the JSON Lines files, but we have chosen to offer the metadata CSV for two primary reasons:

  1. The JSON Lines data is a little more complex to parse since it is nested. It cannot be easily represented in a table form in something like Pandas. It is nice to be able to easily view all the metadata in a Pandas dataframe.

  2. The JSON Lines data can be very large. Each file contains all of the metadata plus unigram counts, bigram counts, trigram counts, and full-text (when available). Manipulating all that data takes significant computer time and costs. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.

More information is available, including the metadata categories, in the FAQ “What is the data file format?”.

Retrieving data (tdm_client methods)

By passing the tdm_client a dataset ID (here called dataset_id), we can automatically download the metadata CSV file or the full JSON Lines dataset created by the dataset builder.

  • Use the .get_metadata() method to retrieve the metadata CSV file

  • Use the get_dataset() method to retrieve the full JSON data file

Code

Result

f = tdm_client.get_metadata(dataset_id)

Automatically retrieves a metadata CSV file and creates a file object f

f = tdm_client.get_dataset(dataset_id)

Automatically retrieves a compressed JSON Lines dataset file (jsonl.gz) and creates a file object f

The JSON Lines file will be downloaded in a compressed gzip format (jsonl.gz). We can iterate over each document in the corpus by using the dataset_reader() method.

Reading in data from a dataset object

The dataset_reader() method will read in data from the compressed JSON dataset file object. By keeping the data in a compressed format and reading in a single line at a time, we reduce the processing time and memory use. These can be substantial for large datasets. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.

The dataset_reader() essentially unzips each row of the dataset at a time. Each row constitutes all the metadata and data available for a single document. Here is what that looks like the actual tdm_client:

        for row in input_file:
            yield json.loads(row)

In most cases, users will want to iterate over every document/row in the JSON Lines file. In practice, that looks like:

Code

Result

for document in tdm_client.dataset_reader(f):

Iterates over each document in file object f