4. Obtaining and Processing Data

Reading 4: Obtaining and Processing Data #

As a software designer, you will be expected to generate, gather, process, show, or explain data. These tasks are a core part of software design work, even if they may not be explicit tasks. For example, you may be asked to show that rewriting code a certain way makes it run faster than before, or explain to what extent adding a new feature increases user engagement with your application or service.

To help you build these skills, this reading provides a short primer on working with data in different ways. While the topics discussed here have some relation to the field of data science, this reading focuses more on the programmatic aspects of working with data. This includes reading and writing files, downloading data from the Web, organizing and processing data, and visualizing data. As part of these topics, we will introduce a Python language feature (file input/output) and three libraries (requests, pandas, and pyplot).

You will also notice that this reading has significantly more externally-linked content. Rather than explaining everything in this reading itself, there are more points at which we link to documentation pages on the Web instead. Sooner or later you will need to learn a library or feature mostly through its documentation, so learning things through external documentation is a good skill to pick up now. As you do this, it’s important to remember that you do not need to absorb everything from these readings, but you should be familiar enough with the content that none of the major concepts are surprising to you.

Default Arguments and Keyword Arguments #

As you read through documentation and work with reading and writing files, you will see uses of two Python features you may be unfamiliar with: default arguments and keyword arguments. (See the official documentation of print for an example.) A few of the libraries whose documentation pages we link to have functions that use these features extensively, so before getting into these libraries, we explain what default arguments and keyword arguments are.

Default arguments make function parameters optional #

A default argument is a way of making a function parameter optional, setting it to some default value if the argument is not provided. Here is a trivial example:

def answer_to_life(number=42):
    return f"The answer to life, the universe, and everything is {number}"

Notice that we write number=42 as a parameter, with no spaces around the equals sign (=). This means that we can call the function one of two ways: with an argument, like answer_to_life(27) or simply as answer_to_life(). If we use the latter, number is set to 42 by default.

Keyword arguments change only some default argument values #

Some functions have many parameters with default arguments, like this:

def many_default_arguments(param_1=42, param_2="spam", param_3=True,
                           param_4=0.0):  # placeholder default; the original value is not shown
    # Do things here...
    pass

If you just wanted to use a different value for param_4, it would be tedious to retype the default values for every other parameter, like this:

many_default_arguments(42, "spam", True, 1.618)

To avoid this, you can use a keyword argument, which allows you to set the values of specific parameters by name. Visually, it looks quite similar to a default argument, except that it is used when calling the function:

many_default_arguments(param_4=1.618)

This leaves the other arguments as their default values.
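
As a minimal sketch of how default arguments and keyword arguments combine, consider the function below. The name make_greeting and its parameters are invented for illustration and do not come from any library.

```python
def make_greeting(name, greeting="Hello", punctuation="!"):
    """Build a greeting; greeting and punctuation are optional parameters."""
    return f"{greeting}, {name}{punctuation}"


# Only the required argument is given, so both defaults are used.
all_defaults = make_greeting("Ada")

# A keyword argument overrides one default and leaves the other alone.
only_punctuation = make_greeting("Ada", punctuation="?")

# Keyword arguments can be given in any order.
reordered = make_greeting(punctuation=".", greeting="Goodbye", name="Ada")
```

Here all_defaults is "Hello, Ada!", only_punctuation is "Hello, Ada?", and reordered is "Goodbye, Ada.".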

File Input/Output #

In the context of computing, you are likely familiar with the term file as some data that is stored as a unit on your machine, like a document, application, or video. However, a file is simply a way to store or record data, just like a file of paper documents. In the UNIX family of operating systems (of which Linux is one), nearly everything is treated as a file: directories, network devices like Wi-Fi cards, USB drives, and HDMI ports are all treated as things that the computer can write data to or read data from. This is an example of abstraction: by treating all of the above as files, they can be written to or read from in almost exactly the same way, using the same set of functions.

Below, we describe the elements of a file and the ways to work with them.

Paths identify and locate files #

A file’s path provides two important pieces of information about a file: what it is called and where on your machine it is located. You have actually already seen paths in this course: ~/softdes/foo.py is a path, for example. That path describes the file location: within the current user’s home directory (~), in the softdes directory, the file foo.py is the one that this path represents. The file’s name is a bit trickier: we would colloquially call this file foo.py, but to differentiate from a different file named foo.py in another directory, the file’s name is actually the entire path (~/softdes/foo.py).

(As an aside, the file’s location is also not completely straightforward, since the file data is actually stored on a device like a hard drive, and the location of the data on this hard drive is not always sequentially organized. That being said, the operating system takes care of this for us, so we can treat the path as being the definitive location of a file.)

The os.path module of the Python standard library provides some convenience functions for working with paths. Its functions work on both UNIX operating systems and Windows. This can be useful, because among other things, Windows writes paths with backslashes (\) instead of forward slashes (/), such as in C:\Users\admin\Documents.
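
The following sketch shows a few os.path functions in use. The directory and file names are just the softdes/foo.py example from above, and the choice of functions is ours rather than a complete tour of the module.

```python
import os.path

# Build a path from its pieces; os.path.join uses the separator
# that is correct for the current operating system.
path = os.path.join("softdes", "foo.py")

# Split the path into the directory part and the final file name.
directory, filename = os.path.split(path)

# Split the file name into its stem and its extension.
stem, extension = os.path.splitext(filename)

# Expand ~ into the current user's home directory.
home_path = os.path.expanduser(os.path.join("~", "softdes", "foo.py"))
```

After this runs, directory is "softdes", filename is "foo.py", stem is "foo", and extension is ".py". The value of home_path depends on the machine and user running the code.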

Use open (and sometimes close) to access a file #

There are two ways to access a file: reading data from it and writing data to it. Both are core operations in data processing. For example, if you are writing a Markov text generator, as you already did in a previous assignment, reading a source text from a file allows you to simply download a text from the Web and load its contents into your program to be able to generate random text.

In Python, to read from or write to a file, you need to ask the operating system to provide you access to it first. You can do this by using the built-in function open, which provides you with a file object that you can use to access the file. The “Reading and Writing Files” section of the official Python tutorial describes how to do this. You should read this section before moving on.

The file object's close method tells the operating system that you no longer need access to the file. Forgetting to close an open file is a common mistake, and in rare cases can have confusing or catastrophic consequences. If another program is trying to write to a file before your program has closed it, for example, the file contents may be what your program wrote to it, what the other program wrote to it, some combination of the two, or something entirely different. To avoid this, we recommend always using the with form of opening files, which closes the file for you when the block ends.
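
As a small sketch of the with form, the code below writes a file and reads it back. The file name example.txt and its contents are made up, and the file lives in a temporary directory so that the example does not touch any real files.

```python
import os
import tempfile

# A temporary directory is deleted automatically when its block ends.
with tempfile.TemporaryDirectory() as temp_dir:
    path = os.path.join(temp_dir, "example.txt")

    # "w" opens the file for writing, creating it if needed.
    with open(path, "w") as f:
        f.write("spam and eggs\n")

    # Leaving the with block closed the file; no call to f.close() is needed.
    closed_after_write = f.closed

    # "r" opens the file for reading.
    with open(path, "r") as f:
        contents = f.read()
```

Here closed_after_write is True, which shows that the with block closed the file on its own.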

read and write sometimes have better alternatives #

Once you have an open file, you need to actually read the data into a form that can be used by the rest of your program, such as a string. The “Methods of File Objects” section of the official Python tutorial describes ways to do this using the read and write methods. You should read this section before moving on.

As the documentation mentions, you can use f.readlines() on a file object f to get a list of strings representing every line in the file. But if the file is large, this loads the whole file into memory at once and can slow your machine down quite a bit. Because of this, we strongly recommend using the for line in f: syntax, which reads one line at a time:

with open("foo.txt", "r") as f:
    for line in f:
        # Do something with line here, like the following.
        print(line.strip())

The only thing to be aware of when using this syntax or f.readlines() is that each line will end with a newline character (\n) (except perhaps the last), so you should use the string method strip to get just the text of the line. If you used print(line) in the example above, you would instead get an extra blank line between every line in the original file.

For writing to files, you need to add the newline character yourself to start a new line if using f.write(). If you are writing a file line by line and want to have line breaks added for you, as print does, use this syntax instead:

print("Hello world!", file=f)

The file=f keyword argument here tells Python to write Hello world! to the file object f instead of to the screen (sometimes called standard output or stdout).
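
Putting the reading and writing pieces together, here is a minimal sketch that writes a few lines with print and then reads them back one at a time. The file name lines.txt and the text are invented for illustration.

```python
import os
import tempfile

lines_to_write = ["first line", "second line", "third line"]

with tempfile.TemporaryDirectory() as temp_dir:
    path = os.path.join(temp_dir, "lines.txt")

    # print adds the newline after each line for us.
    with open(path, "w") as f:
        for text in lines_to_write:
            print(text, file=f)

    # Read the file back line by line, stripping the trailing newline.
    lines_read = []
    with open(path, "r") as f:
        for line in f:
            lines_read.append(line.strip())
```

After this runs, lines_read is equal to lines_to_write. Note that strip also removes spaces and tabs at both ends of the line; if that whitespace matters, line.rstrip("\n") removes only the newline.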

The requests Library #

If you are not generating your own data, it is likely that you will get much of your data from the Web. Downloading a single file through your Web browser is one way to get this data, but if you need to download a large number of files, you will find it easier to automate this task using Python.

In this reading, we will describe how to use the requests library to programmatically download data from the Web. We recommend requests because it has a relatively straightforward syntax and a large number of convenience features, but keep in mind that there are other libraries that can provide similar functionality.

You need a bit of HTTP knowledge before using requests #

The requests library assumes that you know a little about HTTP, the protocol by which you access Web pages. In short, when you visit a webpage like https://www.youtube.com/watch?v=dQw4w9WgXcQ, your computer is making an HTTP request to YouTube, whose response is the webpage content. Your browser then processes this content to show you the webpage, putting the video player, links, etc. in the correct places.

An HTTP request is a specially formatted message that provides details on what is being requested from the server. When visiting the page above, your browser contacts https://www.youtube.com with an HTTP request that includes the following information:

  • The desired page on the YouTube server, which in this case is /watch (we describe what ?v=dQw4w9WgXcQ means below)
  • Cookies for any accounts logged into YouTube on the browser
  • The languages, file formats, etc., that the browser will accept as a response
  • Information about the browser’s version number and operating system (called the user agent)
  • Parameters for the request - in this case, there is only one parameter v with the value dQw4w9WgXcQ

This is called a GET request, and is one of the most common types of HTTP requests. Notice that the URL ends with /watch?v=dQw4w9WgXcQ. The part after the question mark (?) is called a query string. What comes before the question mark is the page to request from the server, and the query string itself is a series of parameters and values written in the form x=1&y=foo&z=true, where each parameter (x, y, and z) is set to some value, and each parameter-value pair is separated by the ampersand (&) symbol. When using parameters in requests, note that parameter values should all be written as strings, even if they represent other types like integers.
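
To make the structure of a URL concrete, the sketch below takes the example URL apart and builds a query string from a dictionary. It uses the standard library module urllib.parse rather than requests, purely so that it runs without a network connection.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Split the URL into its components.
parts = urlsplit(url)
page = parts.path           # the page requested from the server
query_string = parts.query  # everything after the question mark

# Turn the query string into a dictionary; each value is a list,
# because a parameter may appear more than once.
parameters = parse_qs(query_string)

# Go the other way: build a query string from parameter-value pairs.
built_query = urlencode({"x": "1", "y": "foo", "z": "true"})
```

Here page is "/watch", query_string is "v=dQw4w9WgXcQ", parameters is {"v": ["dQw4w9WgXcQ"]}, and built_query is "x=1&y=foo&z=true".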

Learn basic requests and responses in requests #

With this in mind, you likely have enough context to learn how to use the requests library. You can do this by visiting their Quickstart page and reading through the end of the “Response Content” section. Do this before moving on.

If you want to test this feature for yourself, you can try accessing this text file, which contains the text of the Lewis Carroll poem “Jabberwocky”. Specifically, you can try a GET request for this file, and simply print the contents of the response text to see if it matches the text found in the file itself. If you want to challenge yourself, you can write the text to a new file using the techniques from above.
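
One possible shape for this exercise is sketched below. The helper names download_text and save_text are our own, the 10-second timeout is an arbitrary choice, and the URL in the comments is a placeholder rather than the address of the actual text file.

```python
import requests


def download_text(url, params=None, timeout=10):
    """Send a GET request and return the response body as a string."""
    response = requests.get(url, params=params, timeout=timeout)
    # Raise an exception for error responses such as 404 or 500.
    response.raise_for_status()
    return response.text


def save_text(text, path):
    """Write text to the file at path, replacing any existing contents."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Example use (needs a network connection; the URL is a placeholder):
# poem = download_text("https://example.com/jabberwocky.txt")
# print(poem)
# save_text(poem, "jabberwocky.txt")
```

Keeping the download and the file writing in separate functions means each one can be tried on its own, for instance calling save_text with a short string before involving the network at all.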

Web Data Formats #

Now that you have a sense of how requests works for text data, you should familiarize yourself with different types of data that you can commonly get from the Web. Below, we describe some of the most common types of data, along with libraries for working with them.

Note that all of the tutorials below are optional, but we will briefly use some of them in examples later on in this reading. We remind you that especially on your first readthrough, you only need to understand the major ideas, and thus we encourage you to focus on the big ideas for now and use the remaining links as a reference later.

Use Pillow to work with images #

To read media files, such as images, GIFs, and videos, you can read the HTTP response content as binary data (a sequence of bytes) instead. You can see the “Binary Response Content” section of the requests Quickstart guide for how to do this. If you want to work with images, we recommend the Pillow library, which is designed to process image files in Python.

Use requests or json to work with structured API data #

Many sites also have APIs (application programming interfaces), which allow you to send specifically formatted requests to certain URLs to receive text data for processing. For example, you may use an API to get upcoming weather and temperature data or stock prices. This data is often returned in a format called JSON (JavaScript Object Notation). The way that JSON is structured corresponds nicely to Python data types, and requests provides a method on the response object that parses the text of the JSON response into the appropriate Python data types. You can read about this in the “JSON Response Content” section of the Quickstart guide. The json module in the Python standard library also provides some useful functions for working with JSON data, including writing and reading with both strings and files.
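
The sketch below shows how JSON text maps onto Python data types using the standard library json module. The weather data is made up for illustration and does not come from a real API.

```python
import json

# Text such as an API might return (invented for this example).
response_text = (
    '{"city": "Boston", "temperatures_f": [68, 71.5], '
    '"raining": false, "alerts": null}'
)

# Parse the JSON text into Python objects: a JSON object becomes a dict,
# an array becomes a list, false becomes False, and null becomes None.
weather = json.loads(response_text)
first_temperature = weather["temperatures_f"][0]

# Convert the Python objects back into JSON text.
round_trip = json.dumps(weather, sort_keys=True)
```

Here weather["raining"] is False, weather["alerts"] is None, and first_temperature is 68. When the text comes from a requests response, calling the response's json() method does the same parsing as json.loads on the response text.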

Use Beautiful Soup to work with webpages in HTML format #

For most URLs, sending an HTTP request will simply get you the content of the webpage itself. This is almost always in a format called HTML (HyperText Markup Language). The formatting of HTML can often be quite difficult to read, and thus we recommend using the Beautiful Soup library to process HTML data in Python. (Unfortunately, the documentation can be a bit difficult to understand without a decent knowledge of HTML.) If you want to learn more about HTML, we recommend HTML Dog’s Tutorial, but again, doing this is optional.

Use Pandas to work with tabular data #

Finally, a common format for tables of data (similar to those found in spreadsheets) is CSV (comma-separated values). For this, we recommend using the Pandas library, which makes it quite easy to work with tabular data. Pandas defines two new types: Series, which represents a single column of data, and DataFrame, which represents a spreadsheet-like table of data. As you might expect from a spreadsheet, it is possible to name or otherwise index the rows and columns. You can then work with the data using these names.
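
As a brief taste of these two types, the sketch below reads a tiny CSV table from a string. The column names and values are invented; with a real file you would pass its path to read_csv instead of a StringIO object.

```python
import io

import pandas as pd

# A small CSV table held in a string (made-up data).
csv_text = "city,temp_c\nBoston,21.5\nNeedham,20.5\n"

# read_csv accepts a path or any file-like object and returns a DataFrame.
table = pd.read_csv(io.StringIO(csv_text))

# Selecting a single column by name gives a Series.
temperatures = table["temp_c"]
mean_temp = temperatures.mean()

# Keep only the rows whose temperature is above 21.0, then list the cities.
warm_cities = table[table["temp_c"] > 21.0]["city"].tolist()
```

Here mean_temp is 21.0 and warm_cities is ["Boston"].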

While we could try to reinvent the wheel and explain the different features of Pandas, their tutorials are excellent, and so we will link the relevant ones here:

Use NumPy to work with arrays of numbers #

If you are working with purely numerical data, we recommend using NumPy, which is a library optimized for scientific computing with arrays. If you have worked with MATLAB in the past, you can think of NumPy as a reasonably close equivalent in Python. Particularly when working with large arrays of data, you will find that NumPy is often far faster than even Python’s built-in libraries.
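
For a sense of what working with arrays looks like, here is a minimal sketch in which operations apply to every element at once, with no explicit loop. The numbers are arbitrary.

```python
import numpy as np

# Distances in centimeters (arbitrary values).
distances_cm = np.array([50.0, 40.0, 30.0, 20.0])

# Arithmetic applies to every element of the array.
distances_m = distances_cm / 100.0
total_m = distances_m.sum()

# A comparison produces an array of booleans, which can be used
# to select only the matching elements.
above_threshold = distances_cm[distances_cm > 25.0]
```

Here distances_m holds 0.5, 0.4, 0.3, and 0.2, and above_threshold holds 50.0, 40.0, and 30.0.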

NumPy has several documentation pages aimed at different audiences. If you are newer to Python programming, we recommend their beginner’s guide. If you have some scientific computing experience and want a guide with more code and less explanation, we recommend the quickstart tutorial. We have provided (or will soon, depending on when you read this) a Jupyter notebook that provides some more explanations and some exercises for you to work with NumPy.

Data Visualization #

Simply getting and processing data is not always enough - often, the most important task is to make a compelling point with data. To effectively do this, you should learn different ways of visualizing data.

For this purpose, we will use Matplotlib and Pyplot. Matplotlib is a plotting library for Python whose syntax is designed to be relatively similar to the plotting syntax of MATLAB. Pyplot is a part of Matplotlib that is particularly well-suited for interactively working with plots, as you often do in Jupyter notebooks.

In this section, we will point you to some resources for learning these libraries, and provide some tips on how to effectively visualize data.

Prepare a Jupyter notebook for plotting #

By default, plots in a Jupyter notebook will not be shown in the notebook itself. To make plots appear within a notebook, you will need to add and run a cell with the following code at the top of your Jupyter notebook:

%matplotlib inline

As with the code that you run to make VS Code read the latest version of files, you will need to run this every time you start or restart your notebook.

Learn Matplotlib and Pyplot #

As with Pandas, Matplotlib includes some excellent tutorials, which you should go through to learn about how to use the library. We recommend you read the tutorials below in order:

Make the point of your visualization clear #

Though the tutorials above explain a good deal about how to create and work with plots, they don’t say much about what actually makes a good plot.

By far the most important thing to know when you design and create a plot is the point you are trying to make.

Here is a set of two plots from the Pyplot tutorial:

[Figure: two unlabeled plots made in Pyplot]

We don’t know the point this graphic is trying to make for a number of reasons: (1) we have no idea what the numbers on the $x$ or $y$ axis represent, (2) the plot doesn’t have a title, so we don’t know what the overall plot is showing, and (3) we don’t have the context, so we don’t know why we are seeing this plot.

The purpose of a plot is to provide evidence or support for a claim you make. So for example, suppose that you have two swings, one of which is rusty, and claim that the rusty swing stops earlier than the other. You could conduct an experiment to analyze this: sit the same person on each swing in turn, start them forward of center, and release them, tracking their distance from the normal hanging center of the swing over time. Then, saying that the blue plot represents the position of the person in the more rusted swing over time gives more credibility to your claim.

In general, always explain the point of your plots with proper context, especially when making them publicly accessible.

Label your plots with relevant information #

While we now know what the point of the above plot is, we still don’t know some key details. How heavy was the person sitting in the swing? How long did you track their position for? Is their position measured in feet, inches, or meters?

To make sure that the plot effectively provides evidence for a claim, this type of information is necessary. However, not all of it has to be included in the plot itself. Typically, it’s best if the plot includes at least a title and a labeled set of axes.

If we adapt the code used to produce the two plots above, we can get a plot that looks like this:

[Figure: the same plots as above, with titles and axis labels]

Now we can clearly see what each plot is showing and that they are plotting the same units, making it easier to quickly compare the two sets of data visually. In general, it’s important to show both the quantity represented and the units when labeling axes. (The exception is if the quantity has no units.) In addition, you should choose a title that simply describes what the data represents, not the point that you are trying to make. However, you should also make sure that you do not simply state what the data is; in the figure above, titling one of the plots “Swing position over time” would be too vague, particularly since both plots can be described this way.
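
A sketch of how titles and axis labels can be added in Pyplot follows. The swing data here is synthetic, generated from a simple damped cosine that we made up; it is not the data behind the figure above, and the titles and variable names are our own choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: one swing per second, starting 50 cm from the center.
time_s = np.linspace(0.0, 5.0, 200)
normal_cm = 50.0 * np.cos(2 * np.pi * time_s)
rusty_cm = 50.0 * np.exp(-time_s) * np.cos(2 * np.pi * time_s)

# Two plots stacked vertically that share the same x-axis.
fig, (ax_rusty, ax_normal) = plt.subplots(2, 1, sharex=True)

ax_rusty.plot(time_s, rusty_cm)
ax_rusty.set_title("Position of rusty swing over time")
ax_rusty.set_ylabel("Distance from center (cm)")

ax_normal.plot(time_s, normal_cm)
ax_normal.set_title("Position of normal swing over time")
ax_normal.set_ylabel("Distance from center (cm)")
ax_normal.set_xlabel("Time (s)")

# Adjust spacing so that titles and labels do not overlap.
fig.tight_layout()
```

Each axis label names both the quantity and its unit, and each title says which swing the plot describes. In a script, calling plt.show() afterward displays the figure.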

Explain the significance of your plots in text #

Beyond properly labeling the plots and making a point, it’s important to highlight the important parts of the plot so it’s clear how the plot provides evidence for your point. Don’t assume that the reader will just make the connection on their own. For example, you might write the following after showing the plot above:

The above figure shows the positions of the two swings over time. As we can see, the rusty swing shows a clear decrease in the distance from the center over time, with the peak distance at less than 50 cm from the center at 1 second. The normal swing, on the other hand, has almost the same distance (if not the same) even at 5 seconds, when the rusty swing has almost stopped moving.

This explanation makes it very clear how the plot connects with your claim, and the plot itself includes enough precise information to effectively back up your claim.