For the project, you will work in groups of 2 or 3 to create a thorough analysis of a particular dataset using at least one relatively advanced method that we have discussed in class and at least one method that we have not discussed in class. Deliverables for the project include:

  1. Your primary analysis write-up. This will include:
    1. a single R markdown file (that can be knitted) and
    2. the resulting knitted pdf file
  2. An appendix containing your exploratory or preliminary analyses. This will include:
    1. a single R markdown file (that can be knitted) and
    2. the resulting knitted pdf file

Deadlines

This part of the page will be updated with more details and deadlines later.

Data Sets

Choice of a strong dataset will be particularly important. Since our objective is to explore methods beyond multiple regression, your data set should be complex enough that more advanced methods are required or at least helpful. I am providing a few possible data sets below. You are welcome to use one of these options, or to propose your own.

Data Set Option 1: Dialects in America

Vaux, Bert, and Scott Golder. “The Harvard dialect survey.” Cambridge, MA: Harvard University Linguistics Department (2003).

A few years ago, a bunch of maps showing the distribution of dialects in America were published online. Here’s an example of maps like this: https://www.businessinsider.com/american-english-dialects-maps-2018-1

I have written to Prof. Bert Vaux at the University of Cambridge, and he has agreed to share data from his 2003 survey of dialect use in the United States. We have survey responses from 47,472 people. For each person, we have a record of where that person lives (their city, state, and zip code) and their responses to 122 questions. The questions and possible responses for each question are posted here: https://www4.uwm.edu/FLL/linguistics/dialect/maps.html

I think a good goal for this project would be to fit a classification model to these data and use it to create one or more maps similar to the map linked above, showing predicted class membership at each location in the US.

A second analysis could use hierarchical clustering to group the survey respondents according to their responses to the 122 survey questions.
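
To make this concrete, here is a rough sketch of the clustering idea in R. The data frame name `dialect`, the question column names, the subsample size, and the choice of 4 clusters are all assumptions for illustration; Gower dissimilarity from the cluster package is just one reasonable way to handle all-categorical survey responses.

```r
# Rough sketch: hierarchical clustering of survey respondents by their answers.
# `dialect` is assumed to be a data frame with one factor column per question.
library(cluster)  # daisy() computes dissimilarities for categorical variables

question_cols <- grep("^q", names(dialect), value = TRUE)  # assumed naming

# With ~47,000 respondents, a full pairwise dissimilarity matrix is enormous,
# so you might cluster a random subsample (or aggregate by location) instead.
set.seed(1)
sub <- dialect[sample(nrow(dialect), 2000), question_cols]

d  <- daisy(sub, metric = "gower")    # dissimilarity for categorical responses
hc <- hclust(d, method = "complete")  # agglomerative hierarchical clustering
groups <- cutree(hc, k = 4)           # cut the dendrogram into 4 groups
table(groups)
```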

You may also come up with other questions you’d like to answer.

Note that while Prof. Vaux has generously shared his data with us, he has asked that we not make the data public. You will be able to make your analysis public if you wish, but not the data set.

Data Set Option 2: Shelter Animal Outcomes

https://www.kaggle.com/c/shelter-animal-outcomes

Here’s a description of this contest from Kaggle:

Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. 2.7 million dogs and cats are euthanized in the US every year.

Using a dataset of intake information including breed, color, sex, and age from the Austin Animal Center, we’re asking Kagglers to predict the outcome for each animal.

We also believe this dataset can help us understand trends in animal outcomes. These insights could help shelters focus their energy on specific animals who need a little extra help finding a new home.

A fair amount of data cleaning and preprocessing would need to be done to work with this data set.
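
As one example of the kind of cleaning involved, here is a hypothetical sketch that converts the age field, recorded as text such as "2 years" or "3 weeks", into a numeric age in days. The file name train.csv and the column name AgeuponOutcome are taken from the Kaggle data description and should be treated as assumptions.

```r
# Hypothetical preprocessing sketch: convert ages like "2 years" or "3 weeks"
# to a number of days. File and column names are assumptions based on the
# Kaggle data description.
library(dplyr)
library(stringr)

shelter <- read.csv("train.csv", stringsAsFactors = FALSE)

to_days <- function(age) {
  n    <- as.numeric(str_extract(age, "\\d+"))  # the numeric part
  unit <- str_extract(age, "[a-z]+")            # "year", "month", "week", ...
  n * case_when(
    str_detect(unit, "year")  ~ 365,
    str_detect(unit, "month") ~ 30,
    str_detect(unit, "week")  ~ 7,
    str_detect(unit, "day")   ~ 1,
    TRUE                      ~ NA_real_
  )
}

shelter <- shelter %>%
  mutate(age_days = to_days(tolower(AgeuponOutcome)))
```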

Data Set Option 3: Superconductivity

https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data

https://www.sciencedirect.com/science/article/pii/S0927025618304877?via%3Dihub

Here’s a quote from the original article linked above, Hamidieh, Kam, A data-driven statistical model for predicting the critical temperature of a superconductor, Computational Materials Science, Volume 154, November 2018, Pages 346-354.

Superconducting materials - materials that conduct current with zero resistance - have significant practical applications. Perhaps the best known application is in the Magnetic Resonance Imaging (MRI) systems widely employed by health care professionals for detailed internal body imaging…

However, the wide spread applications of superconductors have been held back by two major issues: (1) A superconductor conducts current with zero resistance only at or below its superconducting critical temperature (\(T_c\))… (2) The scientific model and theory that predicts \(T_c\) is an open problem which has been baffling the scientific community since the discovery of superconductivity in 1911 by Heike Kamerlingh Onnes, in Leiden.

In the absence of any theory-based prediction models, simple empirical rules based on experimental results have guided researchers in synthesizing superconducting materials for many years…

In this study, we take an entirely data-driven approach to create a statistical model that predicts \(T_c\) based on its chemical formula.
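
If your group chose this option, the core task is regression with \(T_c\) as the response. Here is a minimal sketch, assuming the UCI file train.csv with one row per material and a response column named critical_temp (names taken from the data set's documentation and worth double-checking). The random forest is shown only as an illustration; it is not the model used in the paper.

```r
# Minimal sketch of the regression task: predict critical temperature from
# the material features. File and column names are assumptions based on the
# UCI documentation; the random forest is just one possible model.
library(randomForest)

sc <- read.csv("train.csv")

set.seed(1)
train_rows <- sample(nrow(sc), size = round(0.8 * nrow(sc)))

# This can take a few minutes on the full data set.
rf_fit <- randomForest(critical_temp ~ ., data = sc[train_rows, ], ntree = 200)

preds <- predict(rf_fit, newdata = sc[-train_rows, ])
sqrt(mean((preds - sc$critical_temp[-train_rows])^2))  # test-set RMSE
```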

Data Set Option 4: Bank Marketing

http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Here’s a description of this data set from the link above:

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).

Source: S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
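
A minimal sketch of this classification goal is below. The file name bank-full.csv and the semicolon separator follow the UCI documentation but should be treated as assumptions; plain logistic regression is shown only as a baseline and would not count as one of your more advanced methods.

```r
# Simple baseline for the classification goal: logistic regression for whether
# a client subscribes (y = "yes"/"no"). File name and separator are assumptions
# based on the UCI documentation.
bank <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = TRUE)

logit_fit <- glm(y ~ ., data = bank, family = binomial)

# Predicted probability that each client subscribes to the term deposit
bank$p_hat <- predict(logit_fit, type = "response")
head(bank$p_hat)
```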

Data Set Option 5: Bike share use

https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

Here’s a description of this data set from the link above:

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.

An interesting thing about this data set is that it is a regression problem (the response variable is the number of bike rentals), but a normal distribution for the response is not appropriate, since the data are integer counts. A good option for this data set would be a Poisson regression model.
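
For example, assuming the daily file day.csv from the UCI page and its documented variable names (cnt for the rental count; temp, hum, windspeed, season, and workingday as predictors), a Poisson regression could be fit with glm:

```r
# Short sketch of Poisson regression for daily rental counts. The file and
# variable names follow the UCI documentation; verify them after downloading.
bikes <- read.csv("day.csv")

pois_fit <- glm(cnt ~ temp + hum + windspeed + factor(season) + factor(workingday),
                data = bikes, family = poisson)

summary(pois_fit)
```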

Data Set Option 6: Your Proposal

You are also welcome to propose a data set of your choosing. If you go this route, I reserve the right to veto your proposal if I don't think it will work out for the purposes of this project, so you may wish to propose multiple options. As you do this, I suggest that you avoid time series data; we have not studied methods for handling the dependence structures that come with data collected over time. I'm also less likely to accept proposals involving image data, unless the data set is well curated and pre-processed.

Here are some places where you can look for data sets:

Methods

Although we spent a fair amount of time in this class discussing linear regression models, I fundamentally consider that to be a topic from Stat 242 that we reviewed in this course. You may include an analysis using a “plain vanilla” linear regression model if you like, but you should also choose at least one of the other more advanced methods we discussed in class, as well as one other more advanced method that we did not have a chance to discuss in class. Some options are listed below.

Methods discussed in class (pick at least one from this list):

Methods not discussed in class:

If you would like, I can suggest methods appropriate to your group’s chosen data set.

Guidelines for the primary analysis write-up

Overall, the project write-up should be written in clear, concise prose. No R code should be shown unless it is explicitly needed to make a point (I will show you how to hide R code in an R markdown document so that it runs but is not displayed in the knitted document).
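
One common way to do this, shown here as a sketch, is the knitr chunk option echo = FALSE: the code in such a chunk runs when the document is knitted, but only its output appears. You can also set this globally with knitr::opts_chunk$set(echo = FALSE) in a setup chunk. The model below is just a placeholder.

```{r, echo=FALSE}
# This code runs when the document is knitted but is not displayed;
# only its results (if any) appear in the knitted pdf.
example_fit <- lm(mpg ~ wt, data = mtcars)
summary(example_fit)$coefficients
```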

Please follow the structure below:

  1. Title
  2. Summary: an introduction to the statistical problem you are addressing, a brief description of the methods you consider, and a summary of the results. Aim for less than a page.
  3. Data: a brief summary of key features of the dataset. You should define each variable that will be used (to the level that it is possible to do this, given the information provided by your data source). It would also be good to include a few plots showing key insights about the data set. Note that there will probably not be enough space to present every plot you make during the course of conducting your analysis; you will have to select a small number of the most informative plots to include. At least a few sentences of context and description of the dataset should be included, and the number of observations in the data set should be stated. Aim for about 2 pages.
  4. Methods: a description of the statistical methods included in your analysis, including your use of train/test splits and/or cross-validation to evaluate model performance (a minimal sketch of such a split appears after this list). This should include a formal statement of any models fit and/or a description of the procedures used to fit them. Aim for about 2 pages.
  5. Results: a presentation of your results, including at a minimum a comparison of the relative performance of the different models you considered. For some projects, other results may need to be presented as well.
  6. Discussion: summarize your work, its limitations, and possible future steps/improvements.
  7. References: cite all sources in a standard format.
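
As referenced in the Methods item above, here is a minimal sketch of a train/test split and k-fold cross-validation in base R. The data frame dat, the response column y, and the linear model are placeholders for whatever your group actually uses.

```r
# Minimal sketch: hold-out test set plus 10-fold cross-validation in base R.
# `dat`, the response column `y`, and lm() are placeholders.
set.seed(1)

n <- nrow(dat)
test_rows <- sample(n, size = round(0.2 * n))  # hold out 20% for final testing
train <- dat[-test_rows, ]
test  <- dat[test_rows, ]

# 10-fold cross-validation on the training set to estimate prediction error
folds <- sample(rep(1:10, length.out = nrow(train)))
cv_mse <- sapply(1:10, function(k) {
  fit   <- lm(y ~ ., data = train[folds != k, ])          # placeholder model
  preds <- predict(fit, newdata = train[folds == k, ])
  mean((train$y[folds == k] - preds)^2)
})
mean(cv_mse)  # cross-validated MSE for the placeholder model
```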

Items 1 through 6 above should total at most 10 pages, including figures and tables. Fewer pages are also fine, but if your report looks like it will come in under 5 pages, please check with me to make sure you're describing your work in enough detail. You should use 11 or 12 point font and margins of at least 1 inch all around.

Proper Citations

Any direct quotes from other sources should be minimal (a sentence or two, not multiple paragraphs!), should be between quotation marks, and should be cited at the point of quotation. It’s ok to include a figure from another source, but this should be cited in the figure caption.

I will deduct a very large amount of points if you take text from another source without citing it properly. If you have any questions on how to cite appropriately, ask!

Collaboration on GitHub

It is expected that each group will craft their analysis collaboratively on GitHub. I will be checking to ensure that there are multiple commits of substantive content from each group member’s GitHub account. We’ll talk more about collaborative use of GitHub in coming days.

Group Dynamic Report

Ideally, all group members would be equally involved in, capable of contributing to, and committed to the project. In reality, it doesn't always work that way. I'd like to reward people fairly for their efforts in this group endeavor, because it's inevitable that there will be variation in how high a priority people place on this class and how much effort they put into this project.

To this end I will ask each of you (individually) to describe how well (or how poorly!) your project group worked together and shared the load. Also give some specific comments describing each member’s overall effort. Were there certain group members who really put out exceptional effort and deserve special recognition? Conversely, were there group members who really weren’t carrying their own weight? And then, at the end of your assessment, estimate the percentage of the total amount of work/effort done by each member. (Be sure your percentages sum to 100%!)

For example, suppose you have 3 group members: X, Y, and Z. In the (unlikely) event that each member contributed equally, you could assign X: 33.3%, Y: 33.3%, Z: 33.4%.

Or in case person Z did twice as much work as each other member, you could assign X: 25%, Y: 25%, Z: 50%.

Or if member Y didn't really do much at all, you could assign something like X: 45%, Y: 10%, Z: 45%.

I’ll find a fair way to synthesize the (possibly conflicting) assessments within each group. And eventually I’ll find a way to fairly incorporate this assessment of effort and cooperation in each individual’s overall grade. Don’t pressure one another to give everyone glowing reports unless it’s warranted, and don’t feel pressured to share your reports with one another. Just be fair to yourselves and to one another. Let me know if you have any questions or if you run into any problems.

Grading and Assessment Criteria

Your project grade makes up 20% of your final grade for the class. Basically, we’re doing this instead of a final.