If you are not generating your own data, it is likely that you will get much of your data from the Web. Downloading a single file through your Web browser is one way to get this data, but if you need to download a large number of files, you will find it easier to automate this task using Python.
In this reading, we will describe how to use the requests library to programmatically download data from the Web. We recommend requests because it has a relatively straightforward syntax and a large number of convenience features, but keep in mind that there are other libraries that can provide similar functionality.
The requests library assumes that you know a little about HTTP, the protocol by which you access Web pages. In short, when you visit a webpage like https://www.youtube.com/watch?v=dQw4w9WgXcQ, your computer makes an HTTP request to YouTube, and YouTube's response is the webpage content. Your browser then processes this content to show you the webpage, putting the video player, links, etc. in the correct places.
An HTTP request is a specially formatted message that provides details on what is being requested from the server. When visiting the page above, your browser contacts https://www.youtube.com with an HTTP request that includes the following information:
- The desired page on the YouTube server, which in this case is /watch (the rest of the URL is described below)
- Cookies for any accounts logged into YouTube on the browser
- The languages, file formats, etc., that the browser will accept as a response
- Information about the browser’s version number and operating system (called the user agent)
- Parameters for the request: in this case, there is only one parameter, v, with the value dQw4w9WgXcQ
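To make the pieces of the list above concrete, here is a minimal sketch that builds the same kind of request with the requests library without actually sending it. The header values and the cookie name are hypothetical placeholders chosen for illustration; a real browser sends its own, much longer values.

```python
import requests

# Build (but do not send) a GET request for the page discussed above.
# The User-Agent, Accept-Language, and cookie values are made-up placeholders.
request = requests.Request(
    method="GET",
    url="https://www.youtube.com/watch",
    params={"v": "dQw4w9WgXcQ"},
    headers={
        "User-Agent": "example-browser/1.0",
        "Accept-Language": "en-US,en;q=0.9",
    },
    cookies={"example_cookie": "placeholder"},
)

# prepare() assembles the final message without contacting the server.
prepared = request.prepare()

print(prepared.method)    # GET
print(prepared.path_url)  # /watch?v=dQw4w9WgXcQ
print(prepared.url)       # https://www.youtube.com/watch?v=dQw4w9WgXcQ

# The remaining details travel as headers, one per line.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Nothing is sent over the network here, so the sketch is safe to run while experimenting with different headers and parameters.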
The request described above is called a GET request, and it is one of the most common types of HTTP requests. Notice that the URL ends with /watch?v=dQw4w9WgXcQ. The part after the question mark (?) is called a query string. What comes before the question mark is the page to request from the server, and what follows is a series of parameters and values written in the form x=1&y=foo&z=true, where each parameter (x, y, and z) is set to some value, and each parameter-value pair is separated by the ampersand (&) symbol. When using parameters in requests, note that parameter values should all be written as strings, even if they represent other types like integers.
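The structure of a query string can be checked with a short sketch. It uses urllib.parse from the Python standard library rather than requests, purely as a convenient way to take a URL apart and put a query string together; the x, y, and z parameters are the illustrative ones from the paragraph above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Split the URL into its parts: the page is before the question mark,
# the query string is after it.
parts = urlsplit(url)
print(parts.path)   # /watch
print(parts.query)  # v=dQw4w9WgXcQ

# Turn the query string into a dictionary of parameter names and values.
params = dict(parse_qsl(parts.query))
print(params)       # {'v': 'dQw4w9WgXcQ'}

# Going the other way: build a query string from parameters.
# All values are written as strings, even the ones that look like
# an integer or a boolean.
example_params = {"x": "1", "y": "foo", "z": "true"}
query = urlencode(example_params)
print(query)        # x=1&y=foo&z=true
```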
Tutorial and Exercise
With this in mind, you likely have enough context to learn how to use the requests library. You can do this by visiting the library's Quickstart page and reading through the end of the “Response Content” section. Do this before moving on.
If you want to test this feature for yourself, you can try accessing this text file, which contains the text of the Lewis Carroll poem “Jabberwocky”. Specifically, you can try a GET request for this file, and simply print the contents of the response text to see if it matches the text found in the file itself. If you want to challenge yourself, you can write the text to a new file using the techniques from above.
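If you get stuck, one possible shape for a solution is sketched below. It is only a sketch under stated assumptions: the function names fetch_text and save_text are made up for illustration, the 10-second timeout is an arbitrary choice, and the URL in the usage comments is a hypothetical placeholder that you would replace with the address of the actual text file.

```python
import requests


def fetch_text(url):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, timeout=10)
    # Raise an exception for 4xx/5xx status codes instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text


def save_text(text, path):
    """Write text to a new file at the given path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Example usage (hypothetical URL; substitute the address of the real file):
# poem = fetch_text("https://example.com/jabberwocky.txt")
# print(poem)
# save_text(poem, "jabberwocky.txt")
```

Keeping the download and the file writing in separate functions makes it possible to try each step on its own, for example by printing the poem first and only saving it once the text looks right.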