Data-Driven Stories

Midterm Project: Data-Driven Stories #

Introduction #

Data is central to almost anything you do in computing. Whether you are generating, obtaining, processing, analyzing, or presenting data, at the end of this course you should be quite comfortable working with data in various ways. In the larger scheme of things, data can be used to tell interesting or useful stories: with data, you can make a compelling argument, explore an important question, or highlight cool facts.

Project Summary #

In this project, you will be applying your Python knowledge (and hopefully learning a new library or two along the way) to obtain data and tell a compelling or interesting story with it. Beyond the deliverables and a few constraints, this project is very open-ended - you may use any type of data you want, and your story can be on any topic you choose (within reason).

Like with the Gene Finder problem, this project is an opportunity to build a polished software product. It is also an opportunity to learn some new skills, such as libraries to help you obtain or process specific types of data, or to visualize data in different ways.

This project also tests your ability to properly scope your ideas. There are many good ideas and stories out there, but you should be careful to not be too ambitious or risky, and instead structure your project in terms of small milestones and stretch goals. The skills you build in this regard will be crucial for successfully completing Project 3.

Examples of Data-Driven Stories #

To give a few ideas of how you can tell interesting stories with data, here are a few examples. (These are all projects that would be infeasible to do in two weeks, and we don’t expect you to do a project like this - but these ideas may provide some inspiration.)

Making a Compelling Argument: Social Distancing #

At the start of the COVID-19 pandemic, the Washington Post published an article called “Why outbreaks like coronavirus spread exponentially, and how to ‘flatten the curve’”. The article shows animated simulations of various quarantining and social distancing measures and how they would affect the spread of the virus among populations (on average).

The simulations showed that social distancing was effective at flattening the curve, reducing the peak number of cases at any given time. Furthermore, the data showed that stronger social distancing measures would lead to an even flatter curve. While many subtleties of the SARS-CoV-2 virus were not known at the time, the article made a compelling case for the effectiveness of social distancing.

Exploring an Important Question: Protests in Washington DC #

In May 2017, the New York Times published an interactive article called “Did the Turkish President’s Security Detail Attack Protesters in Washington? What the Video Shows”, which examined videos of protests that took place at the Turkish Ambassador’s residence in Washington, DC. The article presented the videos in a remarkably well-annotated way to explore what happened during the protests.

While arrest warrants were issued and a few people were charged in the following months, many of the criminal charges against those involved were dropped in 2018 (though civil cases have been filed and are currently pending).

Highlighting Cool Facts: TV Shows' Worst Episodes #

A post to the Data is Beautiful subreddit called Worst Episode Ever? The Most Commonly Rated Shows on IMDb and Their Lowest Rated Episodes examined how popular shows' worst-rated episodes compared to the ratings for the remainder of their episodes. You can see an interactive version of the graphic as a Tableau visualization.

Project Deliverables #

This project consists of five deliverables:

  1. Project Proposal
  2. Project Check-In
  3. Implementation
  4. Computational Essay
  5. Presentation

Below, we explain what each component involves.

Proposal #

To ensure that your project idea is reasonably scoped and that you have thought through the key details of your data-driven story, you must complete and submit a project proposal.

In this proposal, you will answer three key questions:

  1. Research Question: What is the primary question that you aim to answer in doing this project?
  2. Data: What data will you collect to answer this question? Where will you get this data from?
  3. Visuals: In what way(s) will you present your data? How will these visuals help you to answer the question?

This proposal should be submitted on Canvas.

If you want to discuss your story ideas with your assigned project advisor (also on Canvas), feel free to do so.

Your proposal is graded on a completion basis, but your project advisor must sign off on it before you start work on your project. Failing to get a proposal accepted will not affect your final grade (unless you submit a proposal with missing information or fail to submit one altogether). However, to keep your work properly scoped, you will then be given a selection of alternate topics and will have to choose one of them as your project.

As you write your proposal, we encourage you to think not only about interesting research questions, but also about whether the data is programmatically accessible or generatable. In particular, this might mean looking for file links that follow a specific pattern, parsing through webpage content, finding an accessible API, etc. It is well worth doing a bit of sampling to see how feasible this is. For example, Amazon webpage data is notoriously hard to scrape with Python because of their anti-scraping measures, so even the fact that their product data is publicly available does not guarantee that it will be feasible to get data from there.
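One way to sample feasibility is to check whether file links follow a predictable pattern. The following is a minimal sketch, assuming a hypothetical site that publishes one CSV file per month; the URL template and the function name are illustrative and do not refer to a real data source.

```python
from datetime import date

# Hypothetical URL template; substitute the pattern you actually observe.
URL_TEMPLATE = "https://example.com/data/{year}-{month:02d}.csv"


def monthly_urls(start, end):
    """
    Build one URL per month from start to end, inclusive.

    Args:
        start: A date whose year and month mark the first file.
        end: A date whose year and month mark the last file.

    Returns:
        A list of URL strings in chronological order.
    """
    urls = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        urls.append(URL_TEMPLATE.format(year=year, month=month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls
```

If a handful of the generated links open correctly in a browser, that is a reasonable sign that the rest can be downloaded with code.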

Check-In #

Halfway through the project, you will have a short check-in meeting (~5 minutes) with your project advisor. In advance of this meeting, you should submit a brief agenda with the following information:

  1. Data: How much data do you have? What does your data look like?
  2. Reflection: What has gone well so far? What did not go well?
  3. Planning: What still needs to be done? How are you planning and prioritizing the remaining tasks to ensure that everything gets done?

The check-in is a chance to reflect on the work you are doing in the context of the overall story, and it is also a chance for you to pivot if the project has any unexpected hurdles. In particular, if you do not have data by the check-in meeting, we will give you a selection of alternate topics and you will be required to complete a project on one of these alternate topics instead. This is to minimize last-minute pivots and to ensure that you have a project that can be reasonably completed by the deadline.

Implementation #

Your implementation should consist of two components: (1) code to obtain or generate data and write it to one or more files, and (2) code to read, process, and summarize the data. The expectation is that this will consist of at least two .py files. Other than this, you are free to structure your code and functions in any way you choose, but be aware that writing large functions can make debugging more difficult.
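As a rough illustration of this split, here is a minimal sketch with one function for each component. In a real project the two functions would live in separate .py files; the function names and the CSV format are assumptions made for this example, not requirements.

```python
import csv
from collections import Counter


def write_rows(path, header, rows):
    """
    Write a header and a list of rows to a CSV file.

    Args:
        path: The path of the file to write.
        header: A list of column names.
        rows: A list of rows, each a list of values.
    """
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        writer.writerows(rows)


def count_by_column(path, column):
    """
    Count how often each value appears in one column of a CSV file.

    Args:
        path: The path of the file to read.
        column: The name of the column to summarize.

    Returns:
        A Counter mapping each value in the column to its frequency.
    """
    with open(path, newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        return Counter(row[column] for row in reader)
```

Because the second function only reads the file, it can be rerun and tested without obtaining the data again.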

If you are obtaining data, you are free to choose any source of data you like, though in the proposal stage, you will have to demonstrate that you are likely to actually obtain this data by the check-in. You may find the requests library helpful, and we have also provided some tips for working with a few select sources later in this guide. You must obtain data from the Web, and you must do so programmatically, that is, by writing one or more Python functions to obtain the data.

If you are generating data, then you are also free to choose data that you like, though again, in the proposal stage, you must demonstrate that you are likely to be able to complete the generation of this data by the check-in. The only hard constraints are that you must generate your data programmatically (via Python code) and that you must write it to a file.
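For generated data, a minimal sketch might look like the following. The dice simulation is only a stand-in for whatever you choose to generate, and the function names and the JSON format are assumptions made for this example.

```python
import json
import random


def simulate_dice_sums(num_rolls, seed):
    """
    Simulate rolling two six-sided dice and record each sum.

    Args:
        num_rolls: The number of times to roll the pair of dice.
        seed: A seed for the random number generator, so that the same
            data can be regenerated later.

    Returns:
        A list of num_rolls integers, each between 2 and 12.
    """
    rng = random.Random(seed)
    return [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(num_rolls)]


def save_as_json(data, path):
    """
    Write data to a JSON file.

    Args:
        data: Any JSON-serializable value.
        path: The path of the file to write.
    """
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file)
```

Fixing the seed is a design choice that helps reproducibility: anyone following your README gets the same file you analyzed.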

For processing and summarizing your data, you may consider different libraries, depending on what type of data you are working with. The reading on Web data formats has a few libraries that you may find helpful. You are free to do whatever processing and summarization you feel is necessary to effectively tell your story, but you must read from the files you generated with the other component of your implementation, and you must produce at least three substantially different visualizations of your data. This does not mean that each of your visualizations needs to look different, but they must communicate different points that help tell your story.

Your code, including your testing files, should be written with proper style as measured in previous assignments in this course. This means that your code should not produce style errors reported by Pylint, your code should use reasonably clear and precise variable names, and if you write helper functions for any part of your code, the names of these functions and their parameters should also be clear.

Additionally, your functions should be appropriately scoped. As an approximate rule of thumb, you should avoid writing functions that are more than 50 lines or so, breaking such functions into several smaller functions instead. Ideally, each of your functions should do one thing, and do it well. If in doubt, talk to a NINJA or instructor.

Documentation #

As with previous assignments, every function should have a docstring in line with what we have seen so far in this course.

Your code should be well-documented so that it is both runnable and reproducible. In particular, your submission repository should have a file called README.md written in Markdown. This file should briefly summarize your project and explain how to use your code to obtain and/or analyze your data. By following the instructions in the README, anyone else should be able to clone your project code to their machine and perform a similar analysis to your work.

If you installed any additional packages or libraries, you must mention them in the README. Ideally, you should also include installation instructions for these libraries, such as by using the conda or pip commands (which are often found in the documentation for other libraries).

It should also be relatively easy for someone else to use your project. This means, among other things, that you should avoid “hard-coding” paths such as /home/user/softdes-2020-03 in your code, as then someone else would have to change this to their own folder path before your code will work.

Additionally, you should also avoid using “magic numbers” for the same reason. If you do have a path such as /home/user/softdes-2020-03 in your code, and you repeatedly use that string, someone else would have to change each instance of it. A better approach would be to assign the path to a variable once, and then use the variable in your code. This way, someone would only have to change the value of the variable and could be sure that the rest of your code works as intended.

If for some reason you do use a hard-coded path, mention this in your README so that others can easily use your code.
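The approach of defining a path once can be sketched as follows. The folder name data and the helper function are hypothetical; the relative path is resolved from wherever the code is run, so your README should say which folder that is.

```python
from pathlib import Path

# Defined once; anyone reusing the code changes only this line.
DATA_DIR = Path("data")


def data_file(name, data_dir=DATA_DIR):
    """
    Build the path to a data file.

    Args:
        name: The file name, such as "flags.csv".
        data_dir: The folder that holds the data files.

    Returns:
        A Path pointing to the file inside data_dir.
    """
    return Path(data_dir) / name
```

The rest of the code can then call data_file("flags.csv") instead of repeating the folder name.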

Unit Tests #

To the extent possible, you should thoroughly test the code you write. It is not necessary to test every part of your code, as some functionality, like obtaining Web data or creating visualizations, is difficult to test. Your code should be organized enough that there is a reasonable number of functions that you can test with unit tests.

As each of your implementations will be unique, you will need to create your own test file(s) for this project. You can use the testing files for previous assignments as a starting point for your tests. You are free to write tests in whatever way you choose, as long as you are clear about why you are running each test. Please feel free to reach out to the CAs or instructors if you need help writing unit tests.

Because this will be your first time writing your own test file, we will provide considerable flexibility in how you structure your tests - you may write a large number of testing functions, or define a list of cases and run a single test that goes through each of those cases as in some previous assignments. If you want to structure tests in entirely your own way, you are certainly welcome to do so as well.
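As one possible shape for the list-of-cases style, here is a minimal sketch. The function under test, count_unique_words, is a hypothetical helper written for this example; in your project it would be imported from your implementation instead.

```python
def count_unique_words(text):
    """
    Count the distinct whitespace-separated words in text, ignoring case.

    Args:
        text: A string to analyze.

    Returns:
        An integer number of distinct words.
    """
    return len(set(text.lower().split()))


# Each case pairs an input with its expected output, and the comment
# explains why the case is being tested.
UNIQUE_WORD_CASES = [
    # Empty text has no words.
    ("", 0),
    # A single word counts once.
    ("data", 1),
    # Differences in case do not make words distinct.
    ("Data data DATA", 1),
    # Words that are all different are each counted.
    ("tell a story", 3),
]


def test_count_unique_words():
    """Check count_unique_words against every case in the list."""
    for text, expected in UNIQUE_WORD_CASES:
        assert count_unique_words(text) == expected
```

When a file containing this code has a name that starts with test_, running pytest should discover and run the test function.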

As you write your unit tests, you may find the pytest documentation helpful. If you want a helpful starting point for writing your own unit tests, we would recommend starting with the Installation and Getting Started guide.

A Note About API Keys and Sensitive Information #

For some data sources or some types of analysis, you will need to generate a secret value or set of values called an API key or secret, or keep part or all of your data secret. Often, you need to define these as variables in Python code and then use these to access your data.

While it is tempting and easy to check this file directly into your repository, this can be very dangerous. If your repository is public, someone can use this key to access the API as you, which can lead to a variety of undesirable consequences, such as you being suspended or banned from the service. If this happens to be a social media site, you can very easily lose access to your account. If you check in sensitive information (particularly personally identifiable information like names, addresses, etc.), the consequences can be even more disastrous.

Because of this, you should avoid checking any type of secret information into GitHub. The easiest way to do this is the following:

  • Create a separate file with the secrets in it, and call this something like api_keys.py.
  • Copy this into the appropriate location in your repository, but do not add or commit it.
  • Create a file called .gitignore in the root of your repository. The root of your repository is the main folder of the entire repository, not any of its subfolders.
  • In the .gitignore file, write the path to the secret keys file. If you have completed this step correctly, you should not see the secret keys file when running git status (but you will see .gitignore).
  • Add and commit the .gitignore file instead.

To use the keys in your code, you can simply import from the file. Because the keys file is never checked into GitHub, you can avoid publicizing your secrets.
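Importing from the keys file might look like the following minimal sketch. The variable name API_KEY and the helper require_key are assumptions made for this example; the try/except is one way to give a clearer message to someone who has cloned the repository without the untracked file.

```python
try:
    # api_keys.py is the untracked secrets file; API_KEY is a hypothetical
    # variable name defined inside it.
    from api_keys import API_KEY
except ImportError:
    API_KEY = None


def require_key(key):
    """
    Check that an API key was loaded.

    Args:
        key: The key value, or None if the keys file could not be imported.

    Returns:
        The key, unchanged.

    Raises:
        RuntimeError: If key is None.
    """
    if key is None:
        raise RuntimeError(
            "No API key found. Create api_keys.py and define API_KEY in it."
        )
    return key
```

Your README can then tell readers to create their own api_keys.py rather than sharing yours.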

We will say it again: do not check in sensitive information or API keys to GitHub. Doing so will reflect on your grade, as doing this in a more high-stakes project can create catastrophic problems. If in doubt, talk to a CA or instructor.

Computational Essay #

The computational essay is where you will tell your story in detail. You have seen several examples of computational essays already, both on previous assignments and in class.

Your computational essay should be formatted as a Jupyter notebook and should interleave text and code. You should use code from your implementation by importing it into your notebook - long blocks of code in your notebook will make the flow of the essay hard to follow, and we would recommend avoiding it. If you find yourself defining long functions in your notebook, you should strongly consider moving those functions to your implementation instead.

Your notebook file may be named anything you like, as long as the name is reasonably clear. A generic name like Untitled.ipynb is not appropriate. At the top of your essay, you should also include a reasonably clear title, your name, and your partner’s name (if applicable).

Your essay should include the four sections listed below.

Introduction #

In the introduction, you should aim to tell the reader what your project is about. In doing so, you should answer the following questions:

  • What is the question you are trying to answer or the story that you are trying to tell?
  • Why is this question or story important?
  • What were the main steps your project made towards answering the question or telling the story?

Beyond answering these questions, you are free to structure this section however you wish.

Methodology #

In the methodology, you should explain how you obtained, processed, and summarized or visualized the data. In doing so, you should answer the following questions:

  • Where did you get your data from?
  • How did you get this data (i.e., did you programmatically download it or did you access it through an API)?
  • How did you store and/or process this data (e.g., did you store and process it in Pandas)?
  • What information did you get from this data that you used in the presentation of your results?

Results #

In the results, you should show the main summaries or visualizations of your data, along with any accompanying information. In doing so, you should answer the following questions:

  • What summaries or visualizations did you create?
  • What are the interesting and/or important parts of these summaries or visualizations?
  • How do these results answer your questions or tell your story?

Conclusion #

In the conclusion, you should provide key takeaways for the reader. In doing so, you should answer the following questions, where applicable:

  • What are the important insights that the reader should get from this project?
  • What are the contextual or ethical implications of your topic or work?
  • What lessons did you learn as you did the project?
  • What were the most difficult, challenging, or frustrating parts of the project?
  • In what ways would you extend or change your project if you had more time?

Presentation #

You should summarize your story and results in a presentation that you will share with the class on Monday 4/5. Depending on how many projects there are, you should expect that you will have around 5 minutes to present your results.

Your presentation should clearly state the following four pieces of information:

  1. What was the central question you tried to answer, or what story were you trying to tell? This should be presented in a way that is broadly accessible, even to someone who does not know about your project topic.
  2. How did you collect and analyze data to answer your question or tell your story? At a minimum, you should describe where you got your data from and how you used it to generate your results.
  3. What were the main results of your project? Ideally, this will be a claim supported by a summary or visualization of your data.
  4. What are the key takeaways of your project? This might include lessons that you learned from doing the work, insights that the audience should get from your project, or a call to action.

Because this can be quite a lot of information to condense into 5 minutes, what you do not present is just as important as what you do present. You will likely have many details that you want to talk about, but do not have time for. We suggest that you practice the presentation with only the key elements required to tell your story, and then add other material as time allows.

Your presentation should be visually appealing, and your presentation should not simply be a read-through of your computational essay. You should think of your presentation as a way to present the highlights of your essay and to make the audience want to read your essay for the full details. Again, if in doubt, talk to a NINJA or instructor.

If time allows, there will be a short opportunity for the audience to ask a question or two after your presentation. You should be prepared to answer such questions about your code, results, or overall story.

Professionalism is important in public presentations. You should stay on time - timing will be strictly enforced. Additionally, while humor and other methods of telling a more compelling or understandable story may be helpful, inappropriate language or disparaging others will not be tolerated. A potentially useful question you may want to ask yourself is the following: if this presentation circulated on social media at the height of my career, how comfortable would I be with that?

As the presentation date nears, we will provide a shared slide deck here to which you can add your presentation slides.

Submission #

The proposal is due at 3:30 pm Eastern on Tuesday, 3/23. This gives the instructors time to read and provide feedback on your proposals. In the event that your proposal is not accepted at this stage, you will have a few minutes on Wednesday 3/24 to come up with a suitable story before being assigned an alternative topic.

The check-in agenda is due at 3:30 pm Eastern on Sunday, 3/28. This again gives the instructors time to read your updates before the check-in meetings on Monday, 3/29.

All other project deliverables should be submitted and available in GitHub by 2 pm Eastern on Monday, 4/5.

For your GitHub repository, please create a new repository in the olincollege GitHub organization, which you can find here. This makes it easy to add the instructors and CAs to your repo. Please give this repository a descriptive name for your project; a name like midterm-project is not acceptable. Ensure your files are visible on GitHub by visiting the repository page and checking your files.

Barring any emergencies, there are no late days for this project. If something comes up that will prevent you from submitting the project on time, contact us as soon as possible.

Assessment and Grading #

A rubric for this project will be made available shortly. It will be based on the deliverables above and, as in previous assignments, will evaluate criteria such as correctness and style.

Project Resources #

Below are a few resources you might find helpful for this project, including possible topics for inspiration and a few sources you might look at for data, along with how to get data from these sources.

Possible Topics #

Please note that these topics are only examples, and some may be more complicated than is feasible to do in a couple weeks. Feel free to scale some of these topics down or to consider them in a different context.

Steps to Philosophy #

An often-repeated rumor is that in nearly every page on Wikipedia, following the first “non-trivial” link to get to a new page, and then repeating this process eventually leads to the page for “Philosophy”. For example, if we started with the page for “Olin College”, we could follow the first links to get to (in order): “Private university”, “Tax break”, “Tax avoidance”, “Tax”, “Legal person”, “Law”, “System”, “Interaction”, “Causality”, “Event (relativity)”, “Physics”, “Natural science”, “Branches of science”, “Science”, “Scientific method”, “Empirical evidence”, “Information”, “Uncertainty”, “Epistemology”, and finally, “Philosophy”. (Note that this is an unusually long chain of articles for this phenomenon.)

You could simply ask whether this rumor is true on average, but you may also want to look at the average number of pages traversed to get to the page for “Philosophy”. Additionally, different ways of determining the first “non-trivial” link in the page may lead to different results.
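The link-following process can be sketched independently of how pages are actually fetched. In this minimal sketch, first_link is a hypothetical function you would write yourself (deciding what counts as “non-trivial” is part of it), and the toy dictionary reuses the last few steps of the Olin College chain in place of real Wikipedia data.

```python
def follow_first_links(start, first_link, target="Philosophy", max_steps=100):
    """
    Follow first links from a starting page until reaching the target.

    Args:
        start: The title of the starting page.
        first_link: A function that takes a page title and returns the
            title of its first link, or None if there is no usable link.
        target: The title at which to stop.
        max_steps: The maximum number of links to follow.

    Returns:
        The list of titles visited, from start to target, or None if the
        chain hits a dead end, revisits a page, or exceeds max_steps.
    """
    path = [start]
    seen = {start}
    while path[-1] != target:
        if len(path) - 1 >= max_steps:
            return None
        next_title = first_link(path[-1])
        if next_title is None or next_title in seen:
            return None
        path.append(next_title)
        seen.add(next_title)
    return path


# Toy stand-in for Wikipedia: each title maps to its first link.
TOY_LINKS = {
    "Information": "Uncertainty",
    "Uncertainty": "Epistemology",
    "Epistemology": "Philosophy",
}

chain = follow_first_links("Information", TOY_LINKS.get)
```

The number of pages traversed is then len(chain) - 1 whenever chain is not None.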

An Average Flag #

If you are more interested in processing images, you could ask what the “average” US state flag looks like. There are many ways to go about exploring this question: you could simply try to average the color of every pixel in the flag. However, you will also run into some interesting subquestions along the way: should you include the flags of Washington, DC or the territories of the US, and does this substantially change the result? What do you do with flags of different sizes or ratios of length to height?

You could also ask the question in a more international context: what does the average national flag of each continent look like? If you want to try a more ambitious project, you could explore different ways of calculating the “average” flag and how they compare to each other.
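The simplest form of averaging, taking the mean color of each pixel, can be sketched in plain Python. This sketch assumes the images have already been resized to the same dimensions and are stored as nested lists of (red, green, blue) tuples; an image library would normally handle loading and resizing, and the function name is illustrative.

```python
def average_images(images):
    """
    Average a list of same-sized images pixel by pixel.

    Args:
        images: A non-empty list of images. Each image is a list of rows,
            and each row is a list of (red, green, blue) tuples.

    Returns:
        One image in the same format, where each channel of each pixel is
        the mean of that channel across all images, rounded to an integer.
    """
    count = len(images)
    averaged = []
    # zip(*images) groups the same row from every image.
    for rows in zip(*images):
        averaged_row = []
        # zip(*rows) groups the same pixel from every image.
        for pixels in zip(*rows):
            averaged_row.append(
                tuple(round(sum(channel) / count) for channel in zip(*pixels))
            )
        averaged.append(averaged_row)
    return averaged
```

Note that zip silently stops at the shortest input, so images of different sizes would be cropped rather than rejected; how to handle different sizes is one of the subquestions mentioned above.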

Translational Equilibrium #

An interesting linguistics game you can play with a service like Google Translate is to try translating some English text into another language and then back from that language into English, and repeat this process until the English text does not change anymore. If you want to see examples of this, check out TranslationParty.

A question you could ask is whether there are certain languages, translation engines, or types of text that take an exceptionally large number of rounds of doing this for the translation to stabilize (or to reach “translational equilibrium”). You might also ask how changing languages or translation engines between each round of translation affects your results.
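Counting rounds until the text stabilizes can be sketched as a loop around a round-trip function. Here round_trip is a hypothetical function you would supply, wrapping whichever translation service you use; the cap of 50 rounds is an arbitrary safeguard against text that never settles.

```python
def rounds_to_equilibrium(text, round_trip, max_rounds=50):
    """
    Count how many round-trip translations it takes for text to stabilize.

    Args:
        text: The starting English text.
        round_trip: A function that translates English text into another
            language and back, returning the resulting English text.
        max_rounds: The maximum number of round trips to attempt.

    Returns:
        A tuple (rounds, final_text). rounds is the number of round trips
        that changed the text before it stabilized, or None if the text
        was still changing after max_rounds round trips.
    """
    for rounds in range(max_rounds):
        translated = round_trip(text)
        if translated == text:
            return rounds, text
        text = translated
    return None, text
```

Passing the translation step in as a function also makes the loop easy to unit test with a fake translator that needs no network access.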

Answering this question also has a helpful real-world application: it helps to identify aspects of automated online translation that are still considered relatively difficult, since modern translator sites are generally quite good at this task.

Flight Patterns, Literally #

If you are interested in numerical data analysis, here is a deceptively complex question: what are the busiest flight segments in the US? While you could look it up and get a reasonably close answer, you could also calculate this for yourself. A number of public sources on flight data within the US are available, and you might use this data to try and answer the question.

The complexity of this question is actually in the approximations that you have to make. Many sources will aggregate flight data, so you will simply get the total number of passengers that traveled from one airport to another in a given month. If you define the busiest flight segments by the number of flights rather than the number of passengers, you will need to come up with a way to estimate the average number of passengers carried on each flight.
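To make the approximation concrete, here is a minimal sketch that turns monthly passenger totals into estimated flight counts. The record format, the function names, and especially the figure of 150 passengers per flight are assumptions made for this example, not measured values.

```python
from collections import defaultdict

# Illustrative guess only; a real analysis would need to justify this.
ASSUMED_PASSENGERS_PER_FLIGHT = 150


def estimate_flights(monthly_records,
                     passengers_per_flight=ASSUMED_PASSENGERS_PER_FLIGHT):
    """
    Estimate the number of flights on each segment.

    Args:
        monthly_records: An iterable of (origin, destination, passengers)
            tuples, one per segment per month.
        passengers_per_flight: The assumed average number of passengers
            carried on each flight.

    Returns:
        A dictionary mapping each (origin, destination) pair to its
        estimated number of flights across all records.
    """
    totals = defaultdict(int)
    for origin, destination, passengers in monthly_records:
        totals[(origin, destination)] += passengers
    return {
        segment: total / passengers_per_flight
        for segment, total in totals.items()
    }


def busiest_segments(estimates, top=3):
    """Return the top segments, ordered from most to fewest flights."""
    return sorted(estimates, key=estimates.get, reverse=True)[:top]
```

With a single constant, ranking by estimated flights is the same as ranking by passengers; the estimate only becomes interesting once the assumed passengers per flight varies by route or aircraft.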

A Unique College #

As you were choosing a college to attend, every college almost certainly explained the ways in which it was unique and could provide you an academic experience that no other school could provide. With your data analysis skills, you could put this claim to the test, at least approximately.

You can scrape a variety of pages from the website of your college and a number of other comparable colleges. You can then analyze the set of words used on these pages to determine the set (and possibly counts) of “unique” words not used by any other college, then determine which college is, at least linguistically, the most unique.
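Once the page text has been collected, the comparison itself can be sketched with sets. The input format and function names are assumptions made for this example, and the simple regular expression is only one of several reasonable ways to split text into words.

```python
import re


def words_in(text):
    """Return the set of lowercase words in text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def unique_words(pages):
    """
    Find the words that only one college uses.

    Args:
        pages: A dictionary mapping each college name to the combined
            text of its scraped pages.

    Returns:
        A dictionary mapping each college name to the set of words that
        appear in its text but in no other college's text.
    """
    vocabularies = {college: words_in(text) for college, text in pages.items()}
    unique = {}
    for college, words in vocabularies.items():
        others = set().union(
            *(vocab for name, vocab in vocabularies.items() if name != college)
        )
        unique[college] = words - others
    return unique
```

The college with the largest set could then be found with max(unique, key=lambda college: len(unique[college])).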

As you explore this question, you may find it interesting to restrict your analysis to a specific set of colleges (e.g., those in the same area or those of similar size) or to a specific set of pages on each school’s website (e.g., admissions).

If you choose to pursue this question, it is likely that familiarity with the Beautiful Soup library will be helpful. You should also test your code before grabbing data, and minimize the number of times you scrape each school’s website for data. Some colleges have limited resources and IT staff, and performing large amounts of scraping may end up making someone’s job more difficult.
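One way to avoid scraping the same page repeatedly is to keep a copy on disk after the first download. In this minimal sketch, fetch is a hypothetical function you would supply (for example, a small wrapper around the requests library), and the cache folder name and file naming scheme are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="cache"):
    """
    Get the text of a page, downloading it only if no saved copy exists.

    Args:
        url: The URL of the page.
        fetch: A function that takes a URL and returns the page text.
        cache_dir: The folder in which to keep saved copies.

    Returns:
        The page text, read from disk if a saved copy exists and obtained
        by calling fetch otherwise.
    """
    # Hash the URL to get a file name that is safe on any file system.
    file_name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_path = Path(cache_dir) / file_name
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    text = fetch(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    return text
```

While developing, you can rerun your analysis as often as you like, and each page is requested from the school's server at most once.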

Lyrical Complexity #

If you are interested in music, you may find it compelling to analyze the lyrical complexity of your favorite artists and others in their genre. There are a number of ways you can approach this problem: you can simply consider the size of an artist’s vocabulary (i.e., the number of unique words they use), or you can look at words they use that few or no other artists use.
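Finding words that few or no other artists use can be sketched by counting how many artists use each word. The input format and function name are assumptions made for this example, and splitting on whitespace leaves punctuation attached to words, which a real analysis would likely want to clean up first.

```python
from collections import Counter


def rare_words(lyrics_by_artist, max_artists=1):
    """
    Find each artist's words that few artists use.

    Args:
        lyrics_by_artist: A dictionary mapping each artist's name to a
            string containing all of their lyrics.
        max_artists: The largest number of artists (including the artist
            in question) that may use a word for it to count as rare.

    Returns:
        A dictionary mapping each artist's name to the set of their words
        that are used by at most max_artists artists.
    """
    vocabularies = {
        artist: set(lyrics.lower().split())
        for artist, lyrics in lyrics_by_artist.items()
    }
    # Count the number of artists whose vocabulary contains each word.
    artists_using = Counter(
        word for vocabulary in vocabularies.values() for word in vocabulary
    )
    return {
        artist: {word for word in vocabulary if artists_using[word] <= max_artists}
        for artist, vocabulary in vocabularies.items()
    }
```

With the default max_artists=1, the result contains only words that no other artist in the dictionary uses.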

If you want to challenge yourself, you could use a natural language processing library to do a deeper analysis. For example, you could map out an artist’s sentence structure (i.e., the placement of nouns, verbs, and other grammatical constructs) and determine which artists have the most complex sentence structures on average.

Data Sources #

Here are a few data sources you might find useful, as well as how to access data on them. While a few of them directly relate to questions above, you do not need to use them for this purpose (or even at all).

Project Gutenberg #

Project Gutenberg is a website that has tens of thousands of freely available e-books that are no longer protected by copyright. Many literature classics can be found on Project Gutenberg, such as most, if not all, of the works of Charles Dickens. The data on this site is in plaintext form, which is useful for analysis (as opposed to trying to pull text out of a PDF document).

Note that files on the site have text at the beginning and end (e.g., blurbs about Project Gutenberg) that you might want to strip out for your analysis.
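Stripping this text can be sketched as follows, assuming the book is delimited by marker lines that begin with "*** START OF" and "*** END OF". Many files on the site appear to use markers of this form, but the exact wording may vary between files, so check the ones you download. The function name is illustrative.

```python
def strip_gutenberg_boilerplate(text):
    """
    Remove the text before the start marker and after the end marker.

    Args:
        text: The full text of a downloaded book.

    Returns:
        The text between the marker lines, with surrounding whitespace
        removed. If a marker is missing, the text on that side is kept.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for index, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = index + 1
        elif line.startswith("*** END OF"):
            end = index
            break
    return "\n".join(lines[start:end]).strip()
```

Because the function works on a string rather than a URL, it can be tested on a short sample without downloading anything from the site.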

Another hurdle with using Project Gutenberg is that they impose a limit on how many texts you can download in a 24-hour period. You should therefore avoid writing code using Project Gutenberg’s URL and running it repeatedly to test, as you may get banned. If you are interested in using this data source and provide us with enough notice, we are happy to help set up some testing code that you can run instead as you develop code to obtain data.

Wikipedia #

To get articles from Wikipedia, we recommend using the wikipedia package. You can download it using the following command:

$ pip install wikipedia

To get started, we recommend looking at the quickstart guide.

Twitter #

If you are using Twitter, be warned that the service has rather strict limits on what data you can access and how much. This limit is especially strict on getting data from accounts with many followers (@realDonaldTrump, for example, produces very few tweets when accessed through the API).

To access data through Twitter, if you’d like to be experimental, you can try using Tweepy. Users of this at Olin have reported it as being easy to work with, but to our knowledge it has not been used in this course yet. You can instead use the python-twitter package, which has been used in projects for this course before.

You may need to create a Twitter application to access API data. Be warned that this process can take some time for approval. If you have any issues with the process, please come talk to a member of the teaching team.

Reddit #

To access Reddit data, you can install the PRAW package:

$ pip install praw

Follow the instructions here to create a Reddit application, which you will need to access the API. Then, use the quickstart guide to get a sense of how to use the library.

Note that this library has not been extensively tested by the teaching team, but feel free to ask us if you try this method and run into any issues.

Google #

To access search data from Google, you can install the googlesearch library as follows:

$ pip install google

To perform a search, you can use the following.

import googlesearch

for result in googlesearch.search(query="Software Design"):
    print(result)

Note that this library essentially accesses the HTML of the search page on Google, and then attempts to extract the relevant information from it. A cleaner approach is to use Google’s official Python package, but note that doing so is significantly more complicated.

Public Datasets #

The teaching team collectively knows about a fair few other datasets - feel free to come to talk to us if you are looking for more ideas.