DATA 146: Introduction to Data Science

Course ID: DATA 146
Course Attribute: GE1
Title: Introduction to Data Science
Credit Hours: 3
Meeting Times: 11:00 to 11:50 MWF
Location: Remote Synchronous Off-Campus
Date Range: Spring Semester 2021

Course Description

This course will focus on research design in the context of data, providing an overview of different modeling approaches, their differences, and the context(s) in which each might be most appropriate to apply. Special attention will be given to cases in which complete information is not available. Each modeling framework's disciplinary history will be considered, and the overlaps and distinctions between them discussed. Students will be expected to acquire a strong capability to identify the most appropriate modeling strategies given a problem and problem context, as well as learn the limitations or advantages of a given approach.

Course Objectives (Overview)

In this course students will learn the fundamentals of data processing and modeling in the context of Data Science. Emphasis will be placed on careful planning and deliberate decision making when working with data and building models. Programming will be done in the Python language and we will be making extensive use of the scikit-learn collection. After learning about the basics of having a good Data Pipeline, students will be introduced to a variety of supervised and unsupervised machine-learning techniques including various methods for regression, classification, and clustering. By the end of the course, students are not expected to be experts on any particular technique, but should exhibit a solid high-level understanding of the goals of each method, be able to determine when a particular type of model is more or less suitable to a real-world problem and, most importantly, demonstrate a keen attention to detail when working with data. Throughout the course, there will be a very strong emphasis placed on understanding why we are doing what we are doing.

Honor Code

Among our most significant traditions is the student-administered honor system. The Honor Code is an enduring tradition with a documented history that originates as far back as 1736. The essence of our honor system is individual responsibility. Today, students, such as yourself, administer the Honor pledge to each incoming student while also serving to educate faculty and administration on the relevance of the Code and its application to students’ lives.

The Pledge

“As a member of the William and Mary community, I pledge on my honor not to lie, cheat, or steal, either in my academic or personal life. I understand that such acts violate the Honor Code and undermine the community of trust, of which we are all stewards.”

Accessibility, Attendance & Universal Learning

William & Mary accommodates students with disabilities in accordance with federal laws and university policy. Any student who feels s/he may need accommodation based on the impact of a learning, psychiatric, physical, or chronic health diagnosis should contact Student Accessibility Services staff at 757-221-2509 or at sas@wm.edu to determine if accommodations are warranted and to obtain an official letter of accommodation. For more information, please see www.wm.edu/sas.

I am committed to the principle of universal learning. This means that our classroom, our virtual spaces, our practices, and our interactions should be as inclusive as possible. Mutual respect, civility, and the ability to listen and observe others carefully are crucial to universal learning. Active, thoughtful, and respectful participation in all aspects of the course will make our time together as productive and engaging as possible.

Grade Categories

exceptional A = 100 ≥ 97.0
excellent A = 96.9 ≥ 93.0
superior A- = 92.9 ≥ 90.0
very good B+ = 89.9 ≥ 87.0
good B = 86.9 ≥ 83.0
above average B- = 82.9 ≥ 80.0
normal C+ = 79.9 ≥ 77.0
average C = 76.9 ≥ 73.0
sub par C- = 72.9 ≥ 70.0
below average D+ = 69.9 ≥ 67.0
poor D = 66.9 ≥ 63.0
very poor D- = 62.9 ≥ 60.0
failing F < 60.0

Note: the .9 in each upper bound denotes .9 repeating (bar notation), so there are no gaps between adjacent categories.
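
For a quick check of where a numeric score falls, the cutoffs above can be expressed as a lookup. This is only a minimal sketch, not an official grading tool: the function name letter_grade is hypothetical, and the labels are copied from the table exactly as written.

```python
from bisect import bisect_right

# Lower bounds and labels copied from the Grade Categories table.
# letter_grade is a hypothetical helper name, not part of any course tooling.
CUTOFFS = [60.0, 63.0, 67.0, 70.0, 73.0, 77.0, 80.0, 83.0, 87.0, 90.0, 93.0, 97.0]
LETTERS = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A"]


def letter_grade(score):
    """Return the letter for a score on a 0-100 scale."""
    # bisect_right counts how many lower bounds the score meets or exceeds.
    return LETTERS[bisect_right(CUTOFFS, score)]


print(letter_grade(92.95))  # just under 93.0, so still in the 90.0 band
```

Because each band is defined by its lower bound, a score such as 92.95 stays in the 90.0 band, which is what the bar notation implies.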

Grading Opportunities

Six Data Science Labs: 10% each, 60% total
Two Individual Projects: 15% each, 30% total
Participation: 5% each half, 10% total
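
The weights above combine into a single course score as a weighted sum. The sketch below assumes every item is scored on a 0-100 scale; the function name final_score and the example scores are hypothetical.

```python
def final_score(labs, projects, participation):
    """Weighted course score from six labs, two projects, two participation halves.

    All inputs are assumed to be on a 0-100 scale (an assumption, not stated
    in the syllabus).
    """
    if len(labs) != 6 or len(projects) != 2 or len(participation) != 2:
        raise ValueError("expected 6 labs, 2 projects, 2 participation scores")
    # 10% per lab, 15% per project, 5% per participation half.
    return 0.10 * sum(labs) + 0.15 * sum(projects) + 0.05 * sum(participation)


# Made-up scores for illustration only.
example = final_score([90] * 6, [80, 80], [100, 100])
print(round(example, 1))
```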

Semester Schedule

Week 1 (1/27)

Week 2 (2/1)

Week 3 (2/8)

  • Monday:
    • Lecture: data pipeline
  • Wednesday:
    • Open session, workshop & review
  • Friday: Spring Break Day, no class (2/12)
    • Project 1: upload your response to your GitHub page by midnight on Wednesday, February 17th

Week 4 (2/15)

  • Monday:
    • Lecture: descriptive Analysis etc...
  • Wednesday:
    • Module 1
    • Project 1 due by midnight
  • Friday:
    • Module 2

Week 5 (2/22)

  • Monday:
    • Lecture: regression
  • Wednesday:
    • Module 2
  • Friday:
    • Module 2

Week 6 (3/1)

  • Monday:
    • Module 2
  • Wednesday:
    • Module 3
    • Project 2 due by midnight
  • Friday:
    • Module 3

Week 7 (3/8)

  • Monday:
    • Lecture: regularization
  • Wednesday:
    • Module 3
  • Friday:
    • Module 3

Week 8 (3/15)

  • Monday:
    • Review, no "common" meeting
  • Tuesday:
    • Project 3 due by midnight
  • Wednesday: Spring Break Day, No Class (3/17)
  • Thursday:
    • Mid-term project begins -- 8AM
  • Friday:
    • Mid-term continues
  • Sunday:
    • Mid-term project due

Week 9 (3/22)

  • Monday:
    • Mid-term project corrections
  • Wednesday:
    • Mid-term project corrections
  • Friday:
    • Mid-term project corrections

Week 10 (3/29)

  • Monday:
    • Lecture: decision trees & random forest
  • Wednesday:
    • Professor ill, class cancelled
  • Friday:
    • Introduce kNN

Week 11 (4/5)

  • Monday:
    • Review kNN
  • Wednesday: Spring Break Day, no class (4/7)
  • Friday:
    • Common Lecture: Ron Smith - clustering techniques: AHC, k-means & DBSCAN

Week 12 (4/12)

  • Monday:
    • Logistic regression
    • Project 5 assigned
  • Wednesday:
    • Decision trees and random forest
  • Friday:
    • ...
  • Sunday:
    • Project 5 due by midnight

Week 13 (4/19)

  • Monday:
    • ...
  • Wednesday:
    • ...
  • Friday:
    • ...

Week 14 (4/26)

  • Monday: Spring Break Day, no class
  • Wednesday:
    • ...
  • Friday:
    • ...

Week 15 (5/3)

  • Monday:
    • Final Project
  • Wednesday:
    • Final Project
  • Friday: Last day of class
    • Final Project

Final

  • Final Project is due on the last day of the finals period at 5PM.

Projects

Project 1

Create a new markdown file and upload it to your GitHub repository. Provide a link to your newly created project1.md file from your main index. Populate your newly created markdown file with your answers to the following questions. Upload your response no later than midnight on Wednesday, February 17th.

  • What is a package? What is a library? What are the two steps you need to execute in order to install a package and then make that library of functions accessible to your workspace and current Python work session? Provide examples of how you would execute these two steps using two of the packages we have used in class thus far. Be sure to include an alias in at least one of your two examples and explain why it is a good idea to do so.

  • What is a data frame? Identify a library of functions that is particularly useful for working with data frames. In order to read a file in its remote location within the file system of your operating system, which command would you use? Provide an example of how to read a file and import it into your work session in order to create a new data frame. Also, describe why specifying an argument within a read_() function can be significant. Does data that is saved as a file in a different type of format require a particular argument in order for a data frame to be successfully imported? Also, provide an example that describes a data frame you created. How do you determine how many rows and columns are in a data frame? Is there an alternate terminology for describing rows and columns?

  • Import the gapminder.tsv data set and create a new data frame. Interrogate and describe the year variable within the data frame you created. Does this variable exhibit regular intervals? If you were to add new outcomes to the raw data in order to update and make it more current, which years would you add to each subset of observations? Stretch goal: can you identify how many new outcomes in total you would be adding to your data frame?

  • Using the data frame you created by importing the gapminder.tsv data set, determine which country at what point in time had the lowest life expectancy. Conduct a cursory level investigation as to why this was the case and provide a brief explanation in support of your findings.

  • Using the data frame you created by importing the gapminder.tsv data set, multiply the variable pop by the variable gdpPercap and assign the results to a newly created variable. Then subset and order from highest to lowest the results for Germany, France, Italy and Spain in 2007. Create a table that illustrates your results (you are welcome to either create a table in markdown or plot/save in PyCharm and upload the image). Stretch goal: which of the four European countries exhibited the most significant increase in total gross domestic product during the previous 5-year period (to 2007)?

  • You have been introduced to four logical operators thus far: &, ==, | and ^. Describe each one including its purpose and function. Provide an example of how each might be used in the context of programming.

  • Describe the difference between .loc and .iloc. Provide an example of how to extract a series of consecutive observations from a data frame. Stretch goal: provide an example of how to extract all observations from a series of consecutive columns.

  • Describe how an API works. Provide an example of how to construct a request to a remote server in order to pull data, write it to a local file and then import it to your current work session.

  • Describe the apply() function from the pandas library. What is its purpose? Using apply() on various class objects is an alternative (and potentially preferable) approach to writing what other type of command? Why do you think apply() could be a preferred approach?

  • Also describe an alternative approach to filtering the number of columns in a data frame. Instead of using .iloc, what other approach might be used to select, filter and assign a subset number of variables to a new data frame?
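
Several of the operations named in the questions above can be seen side by side in a short sketch. It is not a model answer: the rows below are made-up stand-ins with gapminder-style column names, the country names are fictional, and the text is read from memory rather than from the real gapminder.tsv file.

```python
import io

import pandas as pd  # the alias pd keeps later calls short, e.g. pd.read_csv

# Made-up, tab-separated rows (hypothetical values, fictional countries).
tsv_text = (
    "country\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Northland\t2002\t70.0\t1000\t5.0\n"
    "Northland\t2007\t72.0\t1100\t6.0\n"
    "Southland\t2002\t60.0\t2000\t2.0\n"
    "Southland\t2007\t61.0\t2100\t3.0\n"
)

# sep matters: without sep="\t" pandas would assume commas and read one column.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
n_rows, n_cols = df.shape  # rows are observations, columns are variables

# Bracket access is required here: df.pop is a DataFrame method, not the column.
df["gdp_total"] = df["pop"] * df["gdpPercap"]

# & combines two element-wise conditions; each side needs its own parentheses.
mask = (df["year"] == 2007) & (df["lifeExp"] > 65)

# .loc slices by label (end included); .iloc slices by position (end excluded).
by_label = df.loc[0:1, ["country", "year"]]
by_position = df.iloc[0:2, 0:2]

# apply() runs a function on every element, replacing an explicit for loop.
df["decade"] = df["year"].apply(lambda y: y // 10 * 10)

latest = df.loc[df["year"] == 2007].sort_values("gdp_total", ascending=False)
print(latest[["country", "gdp_total"]])
```

Selecting columns by name, as in df[["country", "gdp_total"]], is one alternative to positional selection with .iloc.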

Project 2

Create a new markdown file and upload it to your GitHub repository. Provide a link to your newly created project2.md file from your main index. Populate your newly created markdown file with your answers to the following questions. Each question is worth 2.5 points. Upload your response no later than midnight on Wednesday, March 3rd.

  • Describe continuous, ordinal and nominal data. Provide examples of each. Describe a model of your own construction that incorporates variables of each type of data. You are perfectly welcome to describe your model using English rather than mathematical notation if you prefer. Include hypothetical variables that represent your features and target.

  • Comment out the seed from your randomly generated data set of 1000 observations and use the beta distribution to produce a plot that has a mean that approximates the 50th percentile. Also produce both a right skewed and left skewed plot by modifying the alpha and beta parameters from the distribution. Be sure to modify the widths of your columns in order to improve legibility of the bins (intervals). Include the mean and median for all three plots.

  • Using the gapminder data set, produce two overlapping histograms within the same plot describing life expectancy in 1952 and 2007. Plot the overlapping histograms using both the raw data and then after applying a logarithmic transformation (np.log10() is fine). Which of the two resulting plots best communicates the change in life expectancy amongst all of these countries from 1952 to 2007?

  • Using the seaborn library of functions, produce a box and whiskers plot of population for all countries at the given 5-year intervals. Also apply a logarithmic transformation to this data and produce a second plot. Which of the two resulting box and whiskers plots best communicates the change in population amongst all of these countries from 1952 to 2007?
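
As a rough illustration of how the alpha and beta parameters shape the beta distribution, the sketch below draws 1000 observations for three parameter pairs and compares the mean with the median. The parameter values are assumptions chosen for illustration, and a seed is fixed only so the sketch is repeatable (the question itself asks you to comment the seed out).

```python
import numpy as np

rng = np.random.default_rng(146)  # fixed seed only so this sketch is repeatable

# Hypothetical (alpha, beta) pairs: equal values give a symmetric shape,
# alpha < beta piles mass near 0 (long right tail), alpha > beta the reverse.
shapes = {"symmetric": (5, 5), "right-skewed": (2, 8), "left-skewed": (8, 2)}

summaries = {}
for label, (alpha, beta) in shapes.items():
    samples = rng.beta(alpha, beta, size=1000)
    # 20 equal-width bins on [0, 1]; these counts are what a histogram draws.
    counts, edges = np.histogram(samples, bins=20, range=(0.0, 1.0))
    summaries[label] = {
        "mean": samples.mean(),
        "median": np.median(samples),
        "counts": counts,
    }

for label, s in summaries.items():
    print(f"{label}: mean={s['mean']:.3f}, median={s['median']:.3f}")
```

In a right-skewed sample the long tail pulls the mean above the median, and in a left-skewed sample it pulls the mean below. If you plot with matplotlib, plt.hist(samples, bins=20, rwidth=0.9) narrows the bars slightly so that individual bins are easier to read.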

Project 3

Create a new markdown file and upload it to your GitHub repository. Provide a link to your newly created project3.md file from your main index. Populate your newly created markdown file with your answers to the following questions. This lab is worth 10 points. Upload your response no later than midnight on Tuesday, March 16th.

  • Download the dataset charleston_ask.csv and import it into your PyCharm project workspace. Specify and train a model that designates the asking price as your target variable and beds, baths and area (in square feet) as your features. Train and test your target and features using a linear regression model. Describe how your model performed. What were the training and testing scores you produced? How many folds did you assign when partitioning your training and testing data? Interpret and assess your output.

  • Now standardize your features (again beds, baths and area) prior to training and testing with a linear regression model (also again with asking price as your target). Now how did your model perform? What were the training and testing scores you produced? How many folds did you assign when partitioning your training and testing data? Interpret and assess your output.

  • Then train your dataset with the asking price as your target using a ridge regression model. Now how did your model perform? What were the training and testing scores you produced? Did you standardize the data? Interpret and assess your output.

  • Next, go back, train and test each of the three previous model types/specifications, but this time use the dataset charleston_act.csv (actual sale prices). How did each of these three models perform after using the dataset that replaced asking price with the actual sale price? What were the training and testing scores you produced? Interpret and assess your output.

  • Go back and also add the variables that indicate the zip code where each individual home is located within Charleston County, South Carolina. Train and test each of the three previous model types/specifications. What was the predictive power of each model? Interpret and assess your output.

  • Finally, consider the model that produced the best results. Would you estimate this model as being overfit or underfit? If you were working for Zillow as their chief data scientist, what action would you recommend in order to improve the predictive power of the model that produced your best results from the approximately 700 observations (716 asking / 660 actual)?
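
The workflow these questions describe (K-fold training and testing scores for a raw linear model, a standardized linear model and a ridge model) can be sketched with scikit-learn as follows. The data here are synthetic, generated with made-up coefficients, because the Charleston files are not reproduced in this syllabus; the fold count and the ridge alpha are arbitrary choices, not recommended values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for beds, baths, area and price (all values made up).
rng = np.random.default_rng(0)
n = 200
beds = rng.integers(1, 6, size=n)
baths = rng.integers(1, 4, size=n)
area = rng.normal(1800, 400, size=n)
X = np.column_stack([beds, baths, area])
y = 20000 * beds + 15000 * baths + 150 * area + rng.normal(0, 20000, size=n)

# Putting the scaler inside a pipeline refits it on each training fold,
# so no information from the test fold leaks into the standardization.
models = {
    "linear (raw)": LinearRegression(),
    "linear (standardized)": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge (standardized)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

kf = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=kf, return_train_score=True)
    # Scores are R^2 by default for regressors.
    results[name] = (cv["train_score"].mean(), cv["test_score"].mean())

for name, (train_r2, test_r2) in results.items():
    print(f"{name}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

Ordinary least squares gives the same fit whether or not the features are standardized, so the two linear rows should agree; standardization starts to matter once a penalty such as ridge is applied, because the penalty depends on the scale of the coefficients.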

Mid-term corrections

Project 5 - Part 1

Create a new markdown file and upload it to your GitHub repository. Provide a link to your newly created project5.md file from your main index. Populate your newly created markdown file with your answers to the following questions. This lab is worth 10 points. Upload your response no later than 5PM on Saturday, April 17th.

  • Download the anonymized dataset persons.csv, which describes persons from a West African country, and import it into your PyCharm project workspace (right click and download from the above link or you can also find the data pinned to the slack channel). First set the variable wealthC as your target. It is not necessary to set a seed.

  • Perform a linear regression and compute the MSE. Standardize the features and again compute the MSE. Compare the coefficients from each of the two models and describe how they have changed.

  • Run a ridge regression and report your best results.

  • Run a lasso regression and report your best results.

  • Repeat the previous steps using the variable wealthI as your target.

  • Which of the models produced the best results in predicting wealth of all persons throughout the smaller West African country being described? Support your results with plots, graphs and descriptions of your code and its implementation. You are welcome to incorporate snippets to illustrate an important step, but please do not paste verbose amounts of code within your project report. Alternatively, you are welcome to provide a link in your references at the end of your (part 1) Project 5 report.
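
Two ideas from the steps above are sketched below: how linear regression coefficients change when the features are standardized, and how a cross-validated MSE can be compared across a range of ridge and lasso penalties. The data are synthetic with made-up feature meanings, and the list of alpha values is an arbitrary assumption rather than a recommended search range.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical meanings).
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([
    rng.integers(1, 10, size=n),   # e.g. a household size
    rng.normal(35, 12, size=n),    # e.g. an age
    rng.integers(0, 2, size=n),    # e.g. a yes/no indicator
]).astype(float)
y = 0.3 * X[:, 0] + 0.02 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(0, 0.5, size=n)

# Standardizing rescales each coefficient by its feature's standard deviation.
raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)
print(raw.coef_, scaled.coef_)


def cv_mse(model):
    """Mean squared error averaged over 10 folds (sklearn returns it negated)."""
    scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
    return -scores.mean()


alphas = [0.01, 0.1, 1.0, 10.0]  # arbitrary illustrative grid
mse = {}
for alpha in alphas:
    mse[("ridge", alpha)] = cv_mse(make_pipeline(StandardScaler(), Ridge(alpha=alpha)))
    mse[("lasso", alpha)] = cv_mse(make_pipeline(StandardScaler(), Lasso(alpha=alpha)))

best = min(mse, key=mse.get)
print("lowest cross-validated MSE:", best, round(mse[best], 4))
```

The standardized coefficients are directly comparable with one another because each now describes the effect of a one-standard-deviation change in its feature.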

Project 5 - Part 2

Create a new markdown file and upload it to your GitHub repository. Provide a link to your newly created project5.md file from your main index (parts 1 & 2). Populate your newly created markdown file with your answers to the following questions. Each part of this lab is worth 10 points. Upload your response no later than midnight on Sunday, April 25th.

  • Download the anonymized dataset city_persons.csv, which describes persons from a larger city in a West African country, and import it into your PyCharm project workspace (right click and download from the above link or you can also find the data pinned to the slack channel). This time we will only use the variable wealthC as your target. It is not necessary to set a seed.

Import your libraries, functions and create the commands DoKFold, GetData and CompareClasses as we have previously done in class. Import the data and then complete the following steps.

  • Execute a K-nearest neighbors classification method on the data. What model specification returned the most accurate results? Did adding a distance weight help?

  • Execute a logistic regression method on the data. How did this model fare in terms of accuracy compared to K-nearest neighbors?

  • Next execute a random forest model and produce the results. Set the number of estimators (trees) to 100, 500, 1000 and 5000 and determine which specification is most likely to return the best model. Also test the minimum number of samples required to split an internal node with a range of values. Also produce results for your four different estimator values by comparing both standardized and non-standardized (raw) results.

  • Repeat the previous steps after recoding the wealth classes 2 and 3 into a single outcome. Do any of your models improve? Are you able to explain why your results have changed?

  • Which of the models produced the best results in predicting wealth of all persons throughout the large West African capital city being described? Support your results with plots, graphs and descriptions of your code and its implementation. You are welcome to incorporate snippets to illustrate an important step, but please do not paste verbose amounts of code within your project report. Avoiding setting a seed essentially guarantees the authenticity of your results. You are welcome to provide a link in your references at the end of your (part 2) Project 5 report.
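
The comparison described above can be sketched with scikit-learn directly. This does not reproduce the DoKFold, GetData or CompareClasses commands written in class: it uses cross_val_score instead, on synthetic data with four hypothetical classes labeled 0 to 3. All hyperparameter values are illustrative assumptions, and random_state is fixed only so the sketch is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey data: 4 hypothetical classes labeled 0..3.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
# Recode two classes into a single outcome (here label 3 is folded into 2).
y_merged = np.where(y == 3, 2, y)

models = {
    # weights="distance" lets closer neighbors count more than distant ones.
    "kNN": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=10, weights="distance")
    ),
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(
        n_estimators=100, min_samples_split=20, random_state=0
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accuracy = {}
for target_name, target in {"4 classes": y, "3 classes": y_merged}.items():
    for model_name, model in models.items():
        # Default scoring for classifiers is accuracy.
        scores = cross_val_score(model, X, target, cv=cv)
        accuracy[(model_name, target_name)] = scores.mean()

for key, value in accuracy.items():
    print(key, round(value, 3))
```

Larger forests (500, 1000 or 5000 trees) can be tried by changing n_estimators, at the cost of longer run times.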

Extra Credit Report - due May 18th by 5PM

Analyze demographic data that describes a larger West African country

For this extra credit assignment, you are asked to analyze a household survey that includes several demographic variables that describe a West African country. The dataset is named more_people.csv and has been pinned to the slack channel #data146_extracredit. A household survey is a random, clustered and stratified sample from the larger population. In order to successfully obtain the extra credit, you will need to complete the following steps.

  • Using the variable wealthC as your target, apply the following models.
    • linear regression, ridge, lasso
    • kNN, logistic regression, random forest
    • stretch goal: neural networks or another model
  • Change the target to the variable wealthI and apply the same models as above.
  • Assess the output from each of your models. Produce plots that demonstrate the predictive power of each model. Which model performed the best? Justify and support your results with plots and other metrics that you produced.
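
One metric that can support a claim about a classifier's predictive power is a confusion matrix alongside overall accuracy. The sketch below uses a handful of made-up labels rather than any model output, purely to show how the table is read.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted classes for six hypothetical observations.
y_true = [1, 2, 2, 3, 3, 3]
y_pred = [1, 2, 3, 3, 3, 2]

# Rows are the true classes, columns the predicted classes, in label order.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
acc = accuracy_score(y_true, y_pred)

print(cm)
print(round(acc, 3))
```

Correct predictions sit on the diagonal, so off-diagonal counts show which classes a model tends to confuse with one another.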

For your extra-credit deliverable, write a 2 to 3 page report that describes your investigation into the demographic composition of this West African country. Introduce your report by describing the data itself in terms of its size and shape. Produce plots of the data in order to further describe it. Describe each implementation as part of the report on your investigation into the data. Publish your extra credit report on GitHub as a webpage and share your link to the slack channel above. Please feel free to ask questions on the slack channel and I will be happy to provide further direction or advice as needed. Your extra credit report is due May 18th.