
Project 2: Data Mining #

Introduction #

Data is central to almost everything you do in computing. Whether you are generating, obtaining, processing, analyzing, or presenting data, by the end of this course you should be quite comfortable working with data in various ways. In the larger scheme of things, data can be used to tell interesting or useful stories: with data, you can make a compelling argument, explore an important question, or highlight cool facts.

Project Summary #

In this project, you will be applying your Python knowledge (and hopefully learning a new library or two along the way) to obtain data and tell a compelling or interesting story with it. Beyond the deliverables and a few constraints, this project is very open-ended - you may use any type of data you want, and your story can be on any topic you choose (within reason).

Like Project 1, this project is an opportunity to build a polished software product. It is also an opportunity to learn some new skills, such as libraries to help you obtain or process specific types of data, or to visualize data in different ways.

This project also tests your ability to properly scope your ideas. There are many good ideas and stories out there, but you should be careful to not be too ambitious or risky, and instead structure your project in terms of small milestones and stretch goals. The skills you build in this regard will be crucial for successfully completing Project 3.

Examples of Data-Driven Stories #

To give you a sense of how you can tell interesting stories with data, here are a few examples. (Note that these are not necessarily the quality of story we expect - but they provide some inspiration.)

Making a Compelling Argument: Social Distancing #

At the start of the COVID-19 pandemic, the Washington Post published an article called “Why outbreaks like coronavirus spread exponentially, and how to ‘flatten the curve’”. The article shows animated simulations of various quarantining and social distancing measures and how they would affect the spread of the virus among populations (on average).

The simulations showed that social distancing was effective at flattening the curve, reducing the peak number of cases at any given time. Furthermore, the data showed that stronger social distancing measures would lead to an even flatter curve. While many subtleties of the SARS-CoV-2 virus were not known at the time, the article made a compelling case for the effectiveness of social distancing.

Exploring an Important Question: Protests in Washington DC #

In May 2017, the New York Times published an interactive article called “Did the Turkish President’s Security Detail Attack Protesters in Washington? What the Video Shows”, which examined videos of protests that took place at the Turkish Ambassador’s residence in Washington, DC. The article presented the videos in a remarkably well-annotated way to explore what happened during the protests.

While arrest warrants were issued and a few people were charged in the following months, many of the criminal charges against those involved were dropped in 2018 (though civil cases have been filed and are currently pending).

Highlighting Cool Facts: TV Shows' Worst Episodes #

A post to the Data is Beautiful subreddit called “Worst Episode Ever? The Most Commonly Rated Shows on IMDb and Their Lowest Rated Episodes” examined how popular shows' worst-rated episodes compared to the ratings for the remainder of their episodes. You can see an interactive version of the graphic as a Tableau visualization.

Project Deliverables #

There are six deliverables for this project:

  1. Implementation
  2. Unit Tests
  3. In-Code Documentation
  4. README
  5. Computational Essay
  6. Presentation

Below, we explain what each component involves.

Implementation #

Your implementation should consist of two components: (1) code to obtain your data and write it to one or more files, and (2) code to read, process, and summarize the data. The expectation is that this will consist of at least two .py files, but you are free to structure your code and functions in any way you choose.

For obtaining your data, you are free to choose any source of data you like. We have provided some tips for working with a few select sources later in this guide, which you might find helpful. The only restrictions for obtaining your data are that you must obtain data from the Web, and you must do so programmatically, that is, by writing one or more Python functions to obtain the data.
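As a minimal sketch of what “programmatically” means here, the following function downloads the contents of a URL and writes it to a file, assuming the requests package is installed (the URL and file path you pass in are up to you):

import requests


def download_data(url, path):
    """Download the contents of a URL and write them to a file.

    Args:
        url: A string representing the URL to download data from.
        path: A string representing the path of the file to write to.
    """
    response = requests.get(url)
    response.raise_for_status()  # Stop early if the download failed.
    with open(path, "w") as data_file:
        data_file.write(response.text)

Separating the download step from the processing step this way also means you only need to hit your data source once.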

For processing and summarizing your data, you may consider different libraries to process your data, depending on what type of data you are working with. The relevant section of Reading 4 has a few libraries that you may find helpful. You are free to do whatever processing and summarization that you feel is necessary to effectively tell your story. As you work through this stage, we strongly recommend going to a NINJA if you have questions or are unsure about how well your data and story go together.

Unit Tests #

You should thoroughly test the processing and summarization component of your implementation. It is not necessary to test the parts of your implementation that obtain data, as doing so requires concepts (called “mock objects”) that we have not yet seen.

As each of your implementations will be unique, you will need to create your own test file(s) for this project. You can use the testing files for previous assignments as a starting point for your tests. You are free to write tests in whatever way you choose, as long as you are clear about why you are running each test.

Because this will be your first time writing your own test file, we will provide considerable flexibility in how you structure your tests - you may write a large number of testing functions, or define a list of cases and run a single test that goes through each of those cases as in Project 1. If you want to structure tests in entirely your own way, you are certainly welcome to do so as well.

As you write your unit tests, you may find the pytest documentation helpful. If you want a helpful starting point for writing your own unit tests, we would recommend starting with the Installation and Getting Started guide.
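For example, here is a minimal sketch of the case-list approach mentioned above, using pytest's parametrize feature; the count_words function and the process_data module are hypothetical stand-ins for your own code:

import pytest

from process_data import count_words

# Each case is a tuple of (input text, expected word count).
COUNT_WORDS_CASES = [
    ("", 0),
    ("hello", 1),
    ("hello world", 2),
]


@pytest.mark.parametrize("text,count", COUNT_WORDS_CASES)
def test_count_words(text, count):
    """Check that count_words returns the correct number of words."""
    assert count_words(text) == count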

In-Code Documentation #

As with the previous project, to help both your future self and others who may want to use and/or extend your work, you should ensure that your code is readable and well-documented.

Code Style #

Your code, including both your implementation and testing files, should be written with proper style. This means that running pycodestyle on these files should produce no output indicating a style warning or error. Additionally, your code should use reasonably clear and precise variable names, and if you write helper functions for any part of your code, the names of these functions and their parameters should also be clear.

Additionally, your functions should be appropriately scoped. As an approximate rule of thumb, you should avoid writing functions that are more than 50 lines or so, breaking such functions into several smaller functions instead. Ideally, each of your functions should do one thing, and do it well. If in doubt, talk to a NINJA or instructor.

Documentation #

Each function that you write should be well-documented. At a minimum, this means having a docstring that explains what each function does, and if applicable, the type and description of each parameter and return value. The docstring should also list any assumptions made by the function about its inputs.

As a reminder, a docstring should consist of the following:

  • A one-sentence description of what the function does, written in the imperative (“Return…” instead of “Returns…").
  • If applicable, one or more paragraphs that provide more detail on what the function does, assumptions it makes about its inputs, or its behaviors in certain cases.
  • A list of the function’s arguments, with each describing the argument’s type and what it represents. If both items are clear for all arguments based on the function’s one-sentence description, then this section can be omitted.
  • A description of the return value’s type and what it represents. If the function returns None or the return value’s type and description are clear from the function’s one-sentence description, this section can be omitted.
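For example, a docstring following these conventions might look like this (the function itself is a hypothetical example):

def mean_rating(episodes):
    """Compute the mean rating of a list of episodes.

    Assumes that the list is non-empty.

    Args:
        episodes: A list of dictionaries, each with a "rating" key mapping
            to a float.

    Returns:
        A float representing the mean rating over all episodes.
    """
    return sum(episode["rating"] for episode in episodes) / len(episodes)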

Additionally, any part of your code that requires additional explanation or justification should have a line comment with the relevant details.

README #

Your submission repository should have a file called README.md written in Markdown. This file should briefly summarize your project and explain how to use your code to obtain and/or analyze your data. By following the instructions in the README, anyone else should be able to clone your project code to their machine and perform a similar analysis to your work.

If you installed any additional packages or libraries, you must mention them in the README. Ideally, you should also include installation instructions for these libraries, such as the conda or pip commands to run (these are often given in a library's documentation).
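For example, if your project used the (hypothetical) requests and pandas packages, your README might include an instruction such as:

$ pip install requests pandas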

Beyond these requirements, the structure and content of the README is up to you, but we expect that your README should be relatively easy to follow and understand.

Computational Essay #

The computational essay is where you will tell your story in detail. You have seen several examples of computational essays already, in Project 1 and in Assignment 4, as well as in the examples above.

Your computational essay should be formatted as a Jupyter notebook and should interleave text and code. You should use code from your implementation by importing it into your notebook - long blocks of code in your notebook will make the flow of the essay hard to follow, and we recommend avoiding them. If you find yourself defining long functions in your notebook, you should strongly consider moving those functions to your implementation instead.
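For example, if your processing code lives in a (hypothetical) file called process_data.py, a notebook cell might contain only the following:

from process_data import load_data, plot_summary

data = load_data()
plot_summary(data)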

Your notebook file may be named anything you like, as long as the name is reasonably clear; a generic name like Untitled.ipynb is not appropriate. At the top of your essay, you should include a reasonably clear title, along with your name and your partner's name (if applicable).

Your essay should include the four sections listed below.

Introduction #

In the introduction, you should aim to tell the reader what your project is about. In doing so, you should answer the following questions:

  • What is the question you are trying to answer or the story that you are trying to tell?
  • Why is this question or story important?
  • What were the main steps you took towards answering the question or telling the story?

Beyond answering these questions, you are free to structure this section however you wish.

Methodology #

In the methodology, you should explain how you obtained, processed, and summarized or visualized the data. In doing so, you should answer the following questions:

  • Where did you get your data from?
  • How did you get this data (e.g., by downloading data files directly or by accessing an API)?
  • How did you store and/or process this data (e.g., did you store and process it in Pandas)?
  • What information did you get from this data that you used in the presentation of your results?

Results #

In the results, you should show the main summaries or visualizations of your data, along with any accompanying information. In doing so, you should answer the following questions:

  • What summaries or visualizations did you create?
  • What are the interesting and/or important parts of these summaries or visualizations?
  • How do these results answer your questions or tell your story?

Conclusion #

In the conclusion, you should provide key takeaways for the reader. In doing so, you should answer the following questions, where applicable:

  • What are the important insights that the reader should get from this project?
  • What are the contextual or ethical implications of your topic or work?
  • What lessons did you learn as you did the project?
  • What were the most difficult, challenging, or frustrating parts of the project?
  • In what ways would you extend or change your project if you had more time?

Presentation #

You should summarize your story and results in a presentation that you will share with the class on Tuesday, 11/3. Depending on how many projects there are, you should expect that you will have between 5 and 10 minutes to present your results, with the exact time limit being clarified in the first few days of the project.

Your presentation should clearly state the following four pieces of information:

  1. What was the central question you tried to answer, or what story you were trying to tell? This should be presented in a way that is broadly accessible, even to someone who does not know about your project topic.
  2. How did you collect and analyze data to answer your question or tell your story? At a minimum, you should describe where you got your data from and how you used it to generate your results.
  3. What were the main results of your project? Ideally, this will be a claim supported by a summary or visualization of your data.
  4. What are the key takeaways of your project? This might include lessons that you learned from doing the work, insights that the audience should get from your project, or a call to action.

Because this can be quite a lot of information to condense into 5 minutes, what you do not present is just as important as what you do present. You will likely have many details that you want to talk about, but do not have time for. We suggest that you practice the presentation with only the key elements required to tell your story, and then add other material as time allows.

Your presentation should be visually appealing, and it should not simply be a read-through of your computational essay. Think of your presentation as a way to present the highlights of your essay and to make the audience want to read the essay for the full details. Again, if in doubt, talk to a NINJA or instructor.

If time allows, there will be a short opportunity for the audience to ask a question or two after your presentation. You should be prepared to answer such questions about your code, results, or overall story.

Professionalism is important in public presentations. You should stay on time - timing will be strictly enforced. Additionally, while humor and other methods of telling a more compelling or understandable story may be helpful, inappropriate language or disparaging others will not be tolerated. A potentially useful question you may want to ask yourself is the following: if this presentation circulated on social media at the height of my career, how comfortable would I be with that?

As the presentation date nears, we will provide a shared slide deck here to which you can add your presentation slides.

Roadmap #

Check-Ins #

We have two check-ins to help assess where you are in the project and to help you plan for its remainder.

The first check-in is with your NINJA, and will take place during 10/22-10/25. You should come prepared to discuss several ideas for the project and where you might get data. You should also come to the check-in with any specific questions you may have, such as whether a specific idea is feasible, where to find data on a specific topic, or whether there are examples of projects similar to yours.

The second check-in is with a NINJA and instructor, and will take place around Thursday, 10/29 (potentially a day before or after, based on scheduling, but we expect that the majority can occur on Thursday). In this check-in, you will present what work you currently have and your plan for the remainder of the project. The teaching team will give you feedback based on the current state of your project. To make good use of the limited check-in time, you are encouraged to plan out what you will present or any questions you have; simply walking through what you have without a question in mind will likely not result in us providing useful feedback.

Suggested Milestones #

Because this project is open-ended, we suggest the following rough milestones. Note that these are the latest dates by which you should aim to have things done to avoid a huge workload near the end of the project; getting things done a day or two earlier is even better.

  • By Sunday, 10/25, you should have a clear idea of your project topic and have a few candidate sources for where you can get this data.
  • By the end of Tuesday, 10/27, you should have attempted to obtain data from your candidate sources and write this data to one or more files.
  • By the end of Thursday, 10/29, you should have a clear picture of what you are doing in the remainder of your project, and should not make further changes to what data you will collect or what summaries/visualizations you will generate.
  • By Sunday, 11/1, you should be more or less done with the implementation and should focus your efforts on the remaining deliverables.

Submission #

With the exception of the presentation, which should be given in class on Tuesday, 11/3, all project deliverables should be submitted and available in GitHub by 10 am Eastern on Wednesday, 11/4.

To submit your project, add, commit, and push any files that you changed to GitHub. Ensure your files are visible on GitHub by visiting the repository page and checking your files. For this project, your repository can be called anything you like; to help us find your code, you must therefore submit the URL of your project's GitHub repository on Canvas.

Barring any emergencies, there are no late days for this project. If something comes up that will prevent you from submitting the project on time, contact us as soon as possible.

Assessment and Grading #

The project is worth a total of 240 points. Again, there is no specific, detailed rubric for this project, but here is the breakdown of how each deliverable factors into your grade.

  • Implementation: 75 points
  • Computational Essay: 50 points
  • README: 30 points
  • Unit Tests: 30 points
  • Style and Documentation: 30 points
  • Presentation: 25 points

As with the previous project, we may award extra credit in any of these categories for exceptional work.

Below are some deliverable-specific notes on grading.

Implementation #

In this project, you have the option to try obtaining data from a new source (e.g., using an API) or to learn a new library or framework for obtaining or processing data. We will take these decisions into account when grading your project. This will affect both the way we assign grades overall for this deliverable and how we weight the obtaining and processing components of your implementation. If you learn a new library while completing this project, we will not expect as much from your specific analysis and processing of the data as we would if you had not learned the library.

As an example, if you obtain data by programmatically downloading a series of files with requests and process the data with Pandas, we would expect to see a significantly more thorough question or analysis. On the other hand, if you learned a new API to obtain data and used advanced features of Pyplot to visualize your data, then we understand if your analysis is not as sophisticated, and will grade your implementation accordingly.

In this way, you should think of the implementation for this project as a sort of “choose your own adventure” - you can optimize for learning new topics (breadth) or for using what you know to tackle a hard problem (depth), and we will grade you based on how you have allocated your effort between these two. There is no “right way” to go about this project - whichever way you choose, we will aim to assess your work under the best possible interpretation.

Style and Documentation #

Reproducibility #

As you style your code, one aspect of this project that is particularly important to keep in mind is reproducibility. It should be relatively easy for someone else to use your project. This means, among other things, that you should avoid “hard-coding” paths such as /home/user/softdes-2020-03 in your code, as then someone else would have to change this to their own folder path before your code will work.

You should also avoid “magic numbers” (or strings) for the same reason. If you have a path such as /home/user/softdes-2020-03 in your code and you repeatedly use that string, someone else would have to change every instance of it. A better approach is to assign the path to a variable once and then use the variable throughout your code. This way, someone would only have to change the value of the variable and could be sure that the rest of your code works as intended.
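As a minimal sketch, with a placeholder file name, this might look like the following:

# Path to the downloaded data; change this one line to match your setup.
DATA_PATH = "data/downloaded_data.csv"


def load_data():
    """Return the contents of the downloaded data file as a string."""
    with open(DATA_PATH) as data_file:
        return data_file.read()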

If for some reason you do use a hard-coded path, mention this in your README so that others can easily use your code.

A Note About API Keys and Sensitive Information #

For some data sources or types of analysis, you will need to generate a secret value or set of values (called an API key or secret), or keep part or all of your data secret. Oftentimes, you need to define these as variables in Python code and then use them to access your data.

While it is tempting and easy to check a file containing these values directly into your repository, doing so can be very dangerous. If your repository is public, someone can use this key to access the API as you, which can lead to a variety of undesirable consequences, such as being suspended or banned from the service. If the service happens to be a social media site, you can very easily lose access to your account. If you check in sensitive information (particularly personally identifiable information like names, addresses, etc.), the consequences can be even more disastrous.

Because of this, you should avoid checking any type of secret information into GitHub. The easiest way to do this is the following:

  • Create a separate file with the secrets in it, and call this something like api_keys.py.
  • Copy this into the appropriate location in your repository, but do not add or commit it.
  • Create a file called .gitignore in the root of your repository. The root of your repository is the main folder of the entire repository, not any of its subfolders.
  • In the .gitignore file, write the path to the secret keys file. If you have completed this step correctly, you should not see the secret keys file when running git status (but you will see .gitignore).
  • Add and commit the .gitignore file instead.

To use the keys in your code, you can simply import from the file. Because the keys file is never checked into GitHub, you can avoid publicizing your secrets.
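As a sketch, if api_keys.py defines a variable called API_KEY, your code might look like the following (the variable name, URL, and header format are placeholders - consult your data source's documentation for its real authentication scheme):

import requests

from api_keys import API_KEY

# Authenticate to the (hypothetical) API using the secret key.
response = requests.get(
    "https://api.example.com/data",
    headers={"Authorization": f"Bearer {API_KEY}"},
)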

We will say it again: do not check sensitive information or API keys into GitHub. If you do, this will be reflected in your score for the Style and Documentation deliverable. As always, if in doubt, talk to a NINJA or instructor.

Project Resources #

Below are a few resources you might find helpful for this project, including possible topics for inspiration and a few sources you might look at for data, along with how to get data from these sources.

Possible Topics #

Please note that these topics are only examples, and some may be more complicated than is feasible to do in a couple weeks. Feel free to scale some of these topics down or to consider them in a different context.

Steps to Philosophy #

An often-repeated rumor is that in nearly every page on Wikipedia, following the first “non-trivial” link to get to a new page, and then repeating this process eventually leads to the page for “Philosophy”. For example, if we started with the page for “Olin College”, we could follow the first links to get to (in order): “Private university”, “Tax break”, “Tax avoidance”, “Tax”, “Legal person”, “Law”, “System”, “Interaction”, “Causality”, “Event (relativity)”, “Physics”, “Natural science”, “Branches of science”, “Science”, “Scientific method”, “Empirical evidence”, “Information”, “Uncertainty”, “Epistemology”, and finally, “Philosophy”. (Note that this is an unusually long chain of articles for this phenomenon.)

You could simply ask whether this rumor is true on average, but you may also want to look at the average number of pages traversed to get to the page for “Philosophy”. Additionally, determining how to follow the first “non-trivial” link in the page may lead to different results.
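If you pursue this idea, here is a minimal (and deliberately naive) sketch of following the first in-paragraph link on a page, assuming the requests and beautifulsoup4 packages are installed; deciding which links count as “non-trivial” is the interesting part, and is left out here:

import requests
from bs4 import BeautifulSoup


def first_link(title):
    """Return the title of the first article linked in a page's body."""
    url = f"https://en.wikipedia.org/wiki/{title}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for paragraph in soup.select("div.mw-parser-output > p"):
        for link in paragraph.find_all("a", href=True):
            href = link["href"]
            # Skip links to non-article pages such as "File:" or "Help:".
            if href.startswith("/wiki/") and ":" not in href:
                return href[len("/wiki/"):]
    return None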

An Average Flag #

If you are more interested in processing images, you could ask what the “average” US state flag looks like. There are many ways to go about exploring this question: you could simply try to average the color of every pixel in the flag. However, you will also run into some interesting subquestions along the way: should you include the flags of Washington, DC or the territories of the US, and does this substantially change the result? What do you do with flags of different sizes or ratios of length to height?
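For example, here is a minimal sketch of the pixel-averaging approach, assuming the Pillow and numpy packages are installed (the file name is a placeholder):

import numpy as np
from PIL import Image

# Load the flag as a (height, width, 3) array of RGB values.
pixels = np.asarray(Image.open("flags/maine.png").convert("RGB"), dtype=float)

# Average over the height and width axes to get a single [R, G, B] color.
average_color = pixels.mean(axis=(0, 1))
print(average_color)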

You could also ask the question in a more international context: what does the average national flag of each continent look like? If you want to try a more ambitious project, you could explore different ways of calculating the “average” flag and how they compare to each other.

Translational Equilibrium #

An interesting linguistics game you can play with a service like Google Translate is to try translating some English text into another language and then back from that language into English, and repeat this process until the English text does not change anymore. If you want to see examples of this, check out TranslationParty.

A question you could ask is whether there are certain languages, translation engines, or types of text that take an exceptionally large number of rounds of doing this for the translation to stabilize (or to reach “translational equilibrium”). You might also ask how changing languages or translation engines between each round of translation affects your results.
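The iteration itself is straightforward to sketch; here, translate is a hypothetical function backed by whatever translation service you choose:

def rounds_to_equilibrium(text, language, translate):
    """Count the round trips needed for English text to stop changing."""
    rounds = 0
    while True:
        rounds += 1
        round_trip = translate(translate(text, "en", language), language, "en")
        if round_trip == text:
            return rounds
        text = round_trip

In practice, you may want to cap the number of rounds, since some text may never stabilize.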

Answering this question also has a helpful real-world application: because modern translation services are generally quite good at this task, the cases that take many rounds to stabilize help identify aspects of automated online translation that are still relatively difficult.

Flight Patterns, Literally #

If you are interested in numerical data analysis, here is a deceptively complex question: what are the busiest flight segments in the US? While you could look it up and get a reasonably close answer, you could also calculate this for yourself. A number of public sources on flight data within the US are available, and you might use this data to try and answer the question.

The complexity of this question is actually in the approximations that you have to make. Many sources aggregate flight data, so you will simply get the total number of passengers that traveled from one airport to another in a given month. If you instead define the busiest flight segments by the number of flights, you will need to come up with a way to estimate the average number of passengers carried on each flight.
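For example, if you had aggregated data in a CSV file, a minimal Pandas sketch for ranking segments by passenger count might look like this (the file name and column names are assumptions about your data):

import pandas as pd

# Hypothetical file with one row per (origin, destination, month).
flights = pd.read_csv("flight_segments.csv")

# Total the passengers per segment and rank the segments.
busiest = (
    flights.groupby(["origin", "destination"])["passengers"]
    .sum()
    .sort_values(ascending=False)
)
print(busiest.head(10))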

A Unique College #

As you were choosing a college to attend, every college almost certainly explained the ways in which it was unique and could provide you an academic experience that no other school could provide. With your data analysis skills, you could put this claim to the test, at least approximately.

You can scrape a variety of pages from the website of your college and a number of other comparable colleges. You can then analyze the set of words used on these pages to determine the set (and possibly counts) of “unique” words not used by any other college, then determine which college is, at least linguistically, the most unique.
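As a minimal sketch of the word analysis (leaving the scraping aside), suppose pages is a dictionary mapping each college's name to the text scraped from its site:

def unique_words(pages):
    """Map each college to the set of words used by no other college."""
    word_sets = {name: set(text.lower().split()) for name, text in pages.items()}
    result = {}
    for name, words in word_sets.items():
        other_words = set().union(
            *(other for other_name, other in word_sets.items()
              if other_name != name)
        )
        result[name] = words - other_words
    return result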

As you explore this question, you may find it interesting to restrict your analysis to a specific set of colleges (e.g., those in the same area or those of similar size) or to a specific set of pages on each school’s website (e.g., admissions).

If you choose to pursue this question, it is likely that familiarity with the Beautiful Soup library will be helpful. You should also test your code before grabbing data, and minimize the number of times you scrape each school’s website for data. Some colleges have limited resources and IT staff, and performing large amounts of scraping may end up making someone’s job more difficult.

Lyrical Complexity #

If you are interested in music, you may find it compelling to analyze the lyrical complexity of your favorite artists and others in their genre. There are a number of ways you can approach this problem: you can simply consider the size of an artist’s vocabulary (i.e., the number of unique words they use), or you can look at words they use that few or no other artists use.

If you want to challenge yourself, you could use a natural language processing library to do a deeper analysis. For example, you could map out an artist’s sentence structure (i.e., the placement of nouns, verbs, and other grammatical constructs) and determine which artists have the most complex sentence structures on average.

Data Sources #

Here are a few data sources you might find useful, as well as how to access data on them. While a few of them directly relate to questions above, you do not need to use them for this purpose (or even at all).

Project Gutenberg #

Project Gutenberg is a website that has tens of thousands of freely available e-books that are no longer protected by copyright. Many literature classics can be found on Project Gutenberg, such as most, if not all, of the works of Charles Dickens. The data on this site is in plaintext form, which is useful for analysis (as opposed to trying to pull text out of a PDF document).

Note that files on the site have text at the beginning and end (e.g., blurbs about Project Gutenberg) that you might want to strip out for your analysis.
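Most (but not all) files on the site mark the body of the text with lines beginning “*** START OF” and “*** END OF”, so a sketch of stripping the extra text might look like the following; be warned that the exact markers vary between files:

def strip_gutenberg_extras(text):
    """Return only the body of a Project Gutenberg text.

    Assumes the text contains "*** START OF" and "*** END OF" marker
    lines, which is true of most (but not all) files on the site.
    """
    start = text.index("*** START OF")
    start = text.index("\n", start) + 1  # Skip to the line after the marker.
    end = text.index("*** END OF")
    return text[start:end]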

Another hurdle with using Project Gutenberg is that the site imposes a limit on how many texts you can download in a 24-hour period. You should therefore avoid writing code that uses Project Gutenberg's URLs and running it repeatedly to test, as you may get banned. If you are interested in using this data source and provide us with enough notice, we are happy to help set up some testing code that you can run instead as you develop code to obtain data.

Wikipedia #

To get articles from Wikipedia, we recommend using the wikipedia package. You can download it using the following command:

$ pip install wikipedia

To get started, we recommend looking at the quickstart guide.
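As a minimal sketch of using the package once it is installed:

import wikipedia

# Fetch an article and print its title and plain-text summary.
page = wikipedia.page("Olin College")
print(page.title)
print(page.summary)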

Twitter #

If you are using Twitter, be warned that the service has rather strict limits on what data you can access and how much. This limit is especially strict on getting data from accounts with many followers (@realDonaldTrump, for example, produces very few tweets when accessed through the API).

To access Twitter data, you can try using Tweepy if you'd like to be experimental. Users of this library at Olin have reported it as being easy to work with, but to our knowledge it has not been used in this course yet. You can instead use the python-twitter package, which has been used in projects for this course before.

You may need to create a Twitter application to access API data. Be warned that this process can take some time for approval. If you have any issues with the process, please come talk to a member of the teaching team.

Reddit #

To access Reddit data, you can install the PRAW package:

$ pip install praw

Follow the instructions here to create a Reddit application, which you will need to access the API. Then, use the quickstart guide to get a sense of how to use the library.
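Once you have created an application, a minimal sketch of reading posts looks like the following (the credentials are placeholders for the values from your Reddit application):

import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="softdes data mining project",
)

# Print the titles of the current top posts on a subreddit.
for submission in reddit.subreddit("dataisbeautiful").hot(limit=5):
    print(submission.title)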

Note that this library has not been extensively tested by the teaching team, but feel free to ask us if you try this method and run into any issues.

Google #

To access search data from Google, you can install the googlesearch library as follows:

$ pip install google

To perform a search, you can use the following.

import googlesearch

# Stop after the first 10 results; without `stop`, the iteration can continue
# indefinitely.
for result in googlesearch.search(query="Software Design", stop=10):
    print(result)

Note that this library essentially accesses the HTML of the search page on Google and then attempts to extract the relevant information from it. A cleaner approach is to use Google's official Python package, but note that doing so is significantly more complicated.

Public Datasets #

The teaching team collectively knows about a fair few other datasets - feel free to come talk to us if you are looking for more ideas.