MATH/CMPU 280: Intermediate Data Science Syllabus

Spring 2023


Key Information at a Glance:

Classrooms
  • Lecture: SP 105
  • Lab: SP 309
Times
  • Lecture: T/Th 12:00–1:15pm
  • Lab: F 1:00–3:00pm
Website: Course Moodle page
Instructor: Simon Hoellerbauer
Instructor Email: shoellerbauer@vassar.edu
Instructor Office: Rockefeller 206
Office Hours

Dr. Hoellerbauer:

  • Wednesdays: 11:00am–12:30pm (On Zoom, link on Moodle page, sign up via Calendly)
  • Thursdays: 2:00–3:30pm (In-Person, Drop-In, Rocky 206)
  • By appointment (please don’t hesitate to ask about different times!)

Huaihan Shan (Coach):

  • Mondays (6-8PM, SP 307)
  • Wednesdays (6-8PM, SP 307)
Textbooks

Please note that there is a free version of this book! The link below leads to the free version.

There will be other readings linked in the schedule or posted to Moodle.

Assessment (Percent of Grade)
  • Homeworks: 30%
  • Labs: 20%
  • Short Essays: 5%
  • Project: 35%
  • Class Engagement: 10%

The target audience for this course is students interested in data science, social science, computer science, quantitative analysis, statistics, and coding, among many other topics, who already have some data science and data analysis background.

This course builds on CMPU/MATH 144 and further develops core data science skills using a variety of real-world datasets. In this course, we will refine and expand data wrangling, data visualization, and modeling (with a more expansive focus on prediction) skills. Additionally, we will continue to investigate and discuss the impact of data quality in and ethical implications of data science. As data science combines aspects of computer science and statistics, this course will be very quantitative in nature, with a heavy focus on the computational implementation of data science techniques such as machine learning. Students will get further practice with the R and Python programming languages. We will practice using git for version control and coding collaboration.

This is an intermediate course and therefore assumes some familiarity with coding and statistics. You are not expected to have an advanced understanding of statistics and machine learning – we will focus more on the implementation and intuition behind intermediate data science techniques. If you have taken any AI or machine learning courses in the Computer Science and Mathematics and Statistics Departments, this class may be a bit too elementary for you. If you are not comfortable with coding, you may find this class somewhat difficult, especially initially. Please feel free to come chat with me about it if you are unsure!


Course Structure

This class has two in-class components: lectures that will involve in-class activities, and weekly labs. Out of class, it involves (some) readings and periodic homework assignments. It is highly interactive. I will rarely lecture for a full class period; we learn best by doing. The readings come most often from the textbook (see below), but also include newspaper articles and some academic journal articles.

We will use Ed Discussion (look out for a sign-up link in the first week of classes) as a Wiki-style question and answer/discussion forum. This makes it easier for the instructor to answer common questions and also allows students to crowd-source answers. It allows you to write in code with proper formatting and with syntax highlighting, making it slightly easier to get proper feedback.

Unless indicated, you are expected to have completed the readings and assignments by the date they are listed in the course schedule.

Assignments and Grading

Homeworks (30%)

There will be five homework assignments due as noted in the schedule, almost always on a Thursday. They are weighted equally (so each is worth 6% of your overall grade). These homework assignments are due by 11:59pm on the days indicated, unless we decide something different in class or I announce a different due time. All homeworks are to be completed individually. You will typically have one week to complete a homework assignment.

Labs (20%)

There are weekly labs (except for certain project workdays indicated in the schedule). All labs are weighted equally. Labs are evaluated for effort and completion, not necessarily correctness. Each lab session you will complete an assignment in which you apply topics from that week, which you will turn in by 11:59pm on the following Monday. They are designed to be completed during the lab time.

We have a student coach who will help during lab.

Short Essays (5%)

During the semester you will write two short reflections (worth 2.5% of your grade each). These will be graded for effort and completion.

Project (35%)

The class has a capstone final project for which you, working in groups, will conduct and present an original data analysis on a dataset of your (collective) choice. I will put you into project groups once the Add Period has ended.

The aim of the project is for you to apply concepts and techniques we will cover in this course. You can use an existing dataset (you may not reuse data used in this course for examples, labs, and assignments) or collect your own data using a survey or an experiment.1

The project consists of the following components (in parentheses is the weight of that component’s grade in the overall project grade):

  1. Proposal and Data (10%)
  2. Peer Review of Preliminary Analysis (15%)
  3. Report (30%)
  4. Presentation (40%)
  5. Team member evaluation (5%)
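To see how much each project component counts toward the overall course grade, the component weights above can be combined with the project's 35% share of the course grade. The following is only an illustrative sketch of that arithmetic; the variable and function names are made up for this example and are not part of any course tooling.

```python
# Illustrative sketch only: names below are hypothetical, not course tooling.

# The project is worth 35% of the course grade (from the Assessment list).
PROJECT_SHARE_OF_COURSE = 35  # percent

# Component weights within the project grade (percent), as listed above.
PROJECT_COMPONENTS = {
    "Proposal and Data": 10,
    "Peer Review of Preliminary Analysis": 15,
    "Report": 30,
    "Presentation": 40,
    "Team member evaluation": 5,
}


def share_of_course_grade(component_weight):
    """Percent of the overall course grade carried by one project component."""
    return component_weight * PROJECT_SHARE_OF_COURSE / 100


if __name__ == "__main__":
    for name, weight in PROJECT_COMPONENTS.items():
        print(f"{name}: {share_of_course_grade(weight)}% of the course grade")
```

For instance, under these weights the presentation alone accounts for 14% of the course grade, and the five components together add back up to the project's 35%.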

At the end of the project, you will evaluate your own and your group members’ contribution to the project.

More information will be provided about the project and the individual components later in the semester.

To use a grace day for a project component, all group members have to decide to use a grace day. You cannot use a grace day on the final report and presentation.

Class Engagement (10%)

Class engagement is what you may often see called “participation” in other classes. While participation is important, I know that not everyone participates in the same way. I encourage you all to be involved in class and during labs. What I am really looking for, however, is engagement in the course. All of the following demonstrate engagement in the course:

  • Participation in class discussions
  • Participation in individual and group in-class activities
  • Being active on Ed Discussion (asking or answering questions)
  • Coming to office hours

Extra Credit Opportunities

The Data Science and Society Initiative is organizing a colloquium series this semester where data scientists will come to present on data science topics. If you attend and write a two (2) page, double-spaced reflection on the speaker’s presentation, including what you learned and how it related to what you have learned in class, you will get two (2) percentage points of extra credit on a homework of your choice. You can only get points for attending and writing a reflection three (3) times. However, you are strongly encouraged to attend all of them!

While other extra credit opportunities may arise during the rest of the semester as decided by me, please do not ask for any particular extra credit opportunities.

Grade Breakdown

I will use the following grading scale:

  • A: 100-93; A-: < 93-90
  • B+: < 90-87; B: < 87-83; B-: < 83-80
  • C+: < 80-77; C: < 77-73; C-: < 73-70
  • D+: < 70-67; D: < 67-60
  • F: < 60

Some professors make subjective decisions about rounding up or down in certain ranges (93-94, for example). This has always struck me as unfair and subjective. This grading scale makes it clear exactly what percentage you need to get for a particular letter grade. I will not do any rounding beyond this.
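As a rough illustration of how the assessment weights and the grading scale fit together, the sketch below computes a weighted course percentage and maps it to a letter without any rounding. It is not the instructor's actual gradebook; the function names and the example scores are assumptions made for this example, and integer weights are used so that the arithmetic stays exact at the cutoffs.

```python
# Illustrative sketch only: not the actual gradebook. Names are hypothetical.

# Category weights in whole percents, from the Assessment list.
WEIGHTS = {
    "homeworks": 30,
    "labs": 20,
    "short_essays": 5,
    "project": 35,
    "class_engagement": 10,
}

# Lower bound of each letter grade, highest first (from the grading scale).
LETTER_CUTOFFS = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (60, "D"),
]


def course_percentage(scores):
    """Weighted average of category scores, each given on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(percentage):
    """Return the letter for a percentage; no rounding is applied."""
    for cutoff, letter in LETTER_CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    # Made-up scores, purely for demonstration.
    example = {
        "homeworks": 88,
        "labs": 100,
        "short_essays": 100,
        "project": 90,
        "class_engagement": 95,
    }
    total = course_percentage(example)
    print(total, letter_grade(total))
```

Because there is no rounding, a 92.99 maps to an A- and a 93 maps to an A.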

Class Texts and Software

Texts

We will use the following textbooks:

Please note that the PDF version of this textbook is free! If you want a physical copy of this book, you will have to purchase it (I want to stress that in no way, shape, or form are you expected to purchase a physical copy). The link above leads to the free online version.

For other topics, there will be readings posted to Moodle and linked in the schedule.

Textbook Accessibility and Affordability

Vassar students often report challenges accessing and affording required course materials. The College is committed to ensuring that every student can participate fully in the curriculum, regardless of financial need. The Movement for Affordable Textbooks (MAT) website highlights a variety of resources – financial, library, departmental, and peer-to-peer – that can help students navigate the costs of textbooks and other materials.

Software

There are software requirements for the course.

I believe that working through installing and having an installation of programs like RStudio, R, Python, and Git on one’s personal computer is a very useful skill and experience. However, I do not want this to be an obstacle to learning class material. If you do not have a laptop or have a laptop that cannot run these programs for any reason, please let me know. Vassar has access to an RStudio Server, which will let you access RStudio and use R in a web browser, and there are ways to use Python remotely as well.

R

Students must download and install R, a free statistical program available at http://cran.r-project.org/, as well as RStudio (also free), which is available at https://posit.co/download/rstudio-desktop/. Please follow the instructions for downloading both here: https://moderndive.netlify.app/1-getting-started.html#installing (although note that since RStudio has become Posit, the steps for installing RStudio may be slightly different).

Reading More About R

If you want to brush up on your R skills, R for Data Science by Hadley Wickham and Garrett Grolemund is an excellent introduction to the tidyverse approach to R programming.

If you are interested in R as a programming language, feel free to check out Hadley Wickham’s2 excellent Advanced R. You can access its contents for free online here: https://adv-r.hadley.nz/. I also have a physical copy of this book in my office if you would like to peruse it.

Python

Students should also install Python and Jupyter Notebook on their computers. The most straightforward way to do so is to install the Anaconda distribution of Python; it collects a wide array of useful packages, libraries, and applications that make it possible to use Python on your own computer. You can install from https://www.anaconda.com/products/distribution. See https://docs.anaconda.com/anaconda/install/index.html for helpful instructions for different platforms.

Reading More About Python

There are many, many, many books written about Python. A good place to start is this link, which collects a list of 5 free books on learning Python. We will read excerpts from some of them this semester.

In particular, Python for Everybody by Charles R. Severance is a great resource for those learning to use Python for data science, and A Whirlwind Tour of Python by Jake VanderPlas is a great introduction to Python in general; it serves as a necessary prequel to his Python Data Science Handbook.

All three of these books are available for free at the preceding links.

Git

Students also need to install Git (if it is not installed on their computers already). Please follow the instructions at this link to do so: http://rafalab.dfci.harvard.edu/dsbook/accessing-the-terminal-and-installing-git.html. A helpful resource for using Git is the ProGit online book.

Students will also need to make a GitHub account, if they have not done so already for a different class or previously on their own. Please see here for instructions: https://docs.github.com/en/get-started/signing-up-for-github/signing-up-for-a-new-github-account. We will use GitHub Classroom (look for a sign-up link in the first week of class) to make using Git a bit easier.

Please do not worry about setting up Git yet; we will do so together during the first lab.

Detailed Course Policies

Office Hours and Contact Policy

I will hold two sets of office hours during the week:

  • Wednesdays: 11:00am–12:30pm (On Zoom, link on Moodle page)

    Please sign up for a 15 minute slot via this link: https://calendly.com/simon_hoellerbauer/office-hours. You may sign up for two slots (but no more, please) if needed.

  • Thursdays: 2:00–3:30pm (In-Person, Rocky 206).

    These are drop-in, first-come, first-served office hours. But if you are there before the end of office hours, I will do my best to get to you (that is, I won’t turn people who have been waiting away just because it is after 3:30pm).

If I have to change my office hours for any reason, I will let you know. If these times do not work for you, please email me; I am more than happy to schedule a meeting at a different time.

You are encouraged to come to my office hours and to contact me with any questions you may have, even if you just want to chat. You can come to office hours to talk about the class, but my office hours are open more generally as well. I am more than happy to talk about research interests, data science, quantitative political science, data science careers, classes at Vassar, and many other things.

For general questions about course materials, I encourage you to use Ed Discussion. If your question involves code, I would prefer that you use Ed Discussion, because it lets you write code with proper formatting and syntax highlighting; if it is a question about an assignment, you can make a private question that only I can see.

I will try to respond to emails and private questions on Ed Discussion as soon as possible, although I cannot guarantee a same-day response. Therefore, I encourage you to ask me questions about assignments and projects as far in advance as possible, which will hopefully help you get in the habit of working on assignments well before they are due.

Attendance

Presence in the classroom is a key factor in student success in a course. I will take attendance at the beginning of each lecture and lab meeting. You are allowed two (2) unexcused absences from lecture and one (1) unexcused absence from lab.

Please contact me for excused absences. Excused absences are generally limited to short-term or long-term illness, personal emergencies, varsity athletic participation, and religious holidays, although I am willing to discuss absences that do not exactly fit under one of those categories on a case-by-case basis before the date of the planned absence. If I do not promptly reply to an email related to absences, please follow up with me.

After two unexcused absences from lecture, students will be penalized a letter grade on their class engagement grade – if you do not come to lecture, you cannot engage in class. After one unexcused absence from lab, students will be penalized a letter grade on their lab grade. Each additional absence in either case will result in a further letter grade penalty. Please note that the potential consequences for unexcused absences are different for the two in-class aspects of the class.

Students who are consistently late for lecture or lab may also see the corresponding grades reduced. Excessive unexcused absences (more than 25% of class meetings, lab meetings, or a combination of the two) may also be grounds for failing the course or being dismissed from it.

Late Work

I have a grace days policy in my class. During the semester, you have three (3) grace days to use on any assignment or combination of assignments. One grace day allows you to turn in an assignment one day (24 hours) late. The new due time would be the same time of the day, but one day later. For example, if a homework is due on September 15 at 11:59 PM, and you take a grace day, you can turn in this homework up to 11:59 PM on September 16. Weekends (a Saturday-Sunday period) count as one grace day. If you want to use a grace day on an assignment, you must indicate so on the assignment itself.
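The grace-day arithmetic can be sketched as a small date calculation. This is only one possible reading of the policy, not an official rule: it assumes that a grace day which would end on a Saturday is extended through Sunday (so the Saturday-Sunday period uses up a single grace day). The function name and the year in the example (2022, in which September 15 fell on a Thursday) are also assumptions made for illustration.

```python
# Illustrative sketch only: one possible reading of the grace-day policy.
from datetime import datetime, timedelta

MAX_GRACE_DAYS = 3  # total grace days available in the semester
SATURDAY = 5        # datetime.weekday() value for Saturday


def extended_deadline(due, grace_days):
    """Return the new deadline after applying grace days to a due date.

    Assumption: if a grace day would end on a Saturday, the deadline moves
    on to Sunday, so the weekend consumes only one grace day.
    """
    if not 0 <= grace_days <= MAX_GRACE_DAYS:
        raise ValueError("grace_days must be between 0 and 3")
    deadline = due
    for _ in range(grace_days):
        deadline += timedelta(days=1)
        if deadline.weekday() == SATURDAY:
            deadline += timedelta(days=1)
    return deadline


if __name__ == "__main__":
    due = datetime(2022, 9, 15, 23, 59)  # a Thursday, 11:59 PM (assumed year)
    for used in range(MAX_GRACE_DAYS + 1):
        print(used, extended_deadline(due, used))
```

Under this reading, one grace day on a Thursday deadline moves it to Friday, two move it to Sunday, and three move it to Monday, always at the same time of day.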

If you do not use a grace day and have not talked to me beforehand, I will deduct a letter grade (10 percentage points) per day that an assignment is late from the maximum grade you can receive. I will then grade your assignment as normal and weight it so that it could not exceed this new maximum grade. For example, if you turn in an assignment one day late and do not use a grace day, the highest grade you can receive is a 90. If you then receive an 85 on the assignment, your actual grade will be .85 * 90 = 76.5. I do this because it helps me separate out where you lose points in ways that are not related to the lateness of your assignment.
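The late-penalty calculation described above can be written out as a short function. This is a minimal sketch, not the instructor's grading code: the function name is hypothetical, and it assumes that each grace day offsets one late day and that the maximum grade cannot drop below zero.

```python
# Illustrative sketch only: names and edge-case handling are assumptions.

PENALTY_PER_DAY = 10  # percentage points off the maximum grade per late day


def late_adjusted_grade(raw_score, days_late, grace_days_used=0):
    """Scale a 0-100 raw score by the reduced maximum grade for late work.

    Assumptions: each grace day cancels one late day, and the maximum
    grade is floored at zero.
    """
    penalized_days = max(days_late - grace_days_used, 0)
    max_grade = max(100 - PENALTY_PER_DAY * penalized_days, 0)
    return raw_score * max_grade / 100


if __name__ == "__main__":
    # The example from the text: an 85 turned in one day late, no grace day.
    print(late_adjusted_grade(85, days_late=1))
```

With a raw score of 85 and one late day, this reproduces the 76.5 from the example; if a grace day covers that late day, the score stays at 85.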

If you exhaust your grace days and think you may need more time on an assignment, please contact me, but please note that I will only grant extensions in emergencies.

Academic Integrity3

THIS SECTION IS VERY IMPORTANT, PLEASE READ TO MAKE SURE YOU DO NOT GET INTO TROUBLE!!

In a class setting, cooperative work has both benefits and pitfalls. Peers learn a lot by explaining things to each other. But it can also be easy to stumble into a passive mindset where you’re not really assimilating the concepts. In this course, you are allowed, and in fact encouraged, to discuss course content with your peers, including homework assignments. However, you must always write your own code and written answers, except for the final project, where you can share everything, including code, with your project partners, as you will turn in one assignment.4

In addition, in general, you are not allowed to share code in any way for any assignment (except, as stated above, the final project, and then only within your group). You are also not allowed to turn in code obtained from others or online.

Do not post publicly on Ed Discussion about homework and other assignments (except for general, conceptual questions). Ed Discussion is only to be used to ask conceptual questions about class materials. You can write me a private message to ask about homework and other assignments.

To make it totally clear, you can use the following guidelines to determine what collaboration is allowed on assignments that are to be turned in:

What is Cheating?

  • Sharing code or other electronic files: either by copying, retyping, looking at, or supplying a copy of a file from this or a previous semester. Also not allowed is verbal or other description of one person’s code to another.
  • Sharing written assignments: Looking at, copying, or supplying an assignment.
  • Using others’ code. Using code from this or previous offerings of this class, from other courses at Vassar or other institutions (e.g., software or code found on the Internet).
  • Yes, this means using ChatGPT or similar platforms to write code for you is not allowed.
  • Looking at others’ code. Although mentioned above, it bears repeating. Looking at other students’ code or allowing others to look at yours is cheating. This includes one person looking at code and describing it to another. There is no notion of looking “too much”, since no looking is allowed at all.

What is not Cheating?

  • Clarifying ambiguities or vague points in class handouts or textbooks.
  • Using code from the textbook or from the class web pages is always OK.

These guidelines will be slightly relaxed for in-class activities, which will not be turned in. For these you will be put in groups; you still need to write your own code, but you can work together to come up with solutions and can look at each other’s code.

Please remember that I am here to help and that all you have to do is ask for assistance if you need it. You do not have to face the course alone; I just want to make sure that all students are best situated to learn and practice the course material.

Regrade Requests

Requests for regrades have a time window. They cannot be submitted until at least 48 hours have passed since the assignment was returned (a cool-down period), and then they will only be accepted within three weeks of an assignment being returned (a statute of limitations). To request a regrade, you must submit a written memo (two pages max) explaining what aspect of your original grade you think was in error.

Please note that you do not have to do this if you think there is an error in the assignment or in the calculation of a grade. Just bring this to my attention.

Electronic Policy

Please put away all cell phones while class is in session. You are permitted to use laptops in class. On most days we will be doing some sort of activity for which you will have to use a computer.5 Please realize that I can tell when you are looking at materials that are not related to the class. When taking notes, I strongly encourage you to not use your laptops in class, as some studies have shown that using pen and paper is better for comprehension and understanding (and provides less opportunity for distraction), while laptop use can decrease participation.

COVID Policies

Although the situation has improved markedly, COVID-19 is still a presence in our lives, and there are many individuals for whom COVID-19 is still a significant risk, for a variety of reasons. I will be wearing a mask to teach and strongly encourage all of you to wear a mask while in the classroom as well; the rooms in Rocky are small and air circulation is not the best.

I will require masks during in-person office hours because my office is quite small.

Teaching Philosophy

I view my role as a teacher as a support person for you, my students. Because of my background and education, I have knowledge that I will strive to communicate to you, which is why lectures do form an important part of this course. My primary goal as a teacher, however, is to make you feel engaged and active and to help you learn skills that you will be able to use outside of the contexts of this course and even of this field of study. As such, I believe that active engagement with the course material is essential to helping you learn, and I structure the course in such a way that there are plenty of ways in which to participate and be active, as I recognize that not all students learn in the same way. At the same time, I do not believe that surface-level skimming of a topic is all that useful; therefore, this class is more detail-oriented than other introductory courses may be, without being overwhelming. Finally, I am always open to feedback; I want to make sure that you are getting both what you want and need from this course.

Discrimination and Harassment

I want to remind everyone that we are bound to abide by Vassar’s policies regarding discrimination and harassment, which you can read on page 16 in the Vassar College Regulations.

Names and Pronouns6

As noted by the Office of Equal Opportunity and Affirmative Action/Title IX, Vassar is committed to diversity, inclusion, equity, and non-discrimination. Many people might use a name that is different from their current legal name. In all areas of campus, we refer to people by the names, in addition to the pronouns, that they use for themselves. Students are invited to share their names and the pronouns that they use. Students are also encouraged to use gender-neutral language, if they aren’t sure of someone’s pronouns.

Resources for Students

Q-Center Peer Support Services

This course fulfills the quantitative analysis (QA) requirement for graduation. All Vassar students have access to free, drop-in, peer-to-peer quantitative tutoring at the Quantitative Reasoning Center (Q-Center). Quantitative tutors (Q-Tutors) excel in a variety of STEM courses. They are typically available Sunday-Thursday 3pm-11pm while classes are in session. Q-Tutors who specialize in Mathematics and Physics are located in the Main Library, Room 122 behind the Writing Center. Q-Tutors who specialize in Chemistry and Economics are located in the Main Library, Room 88 near Special Collections. If you have a quantitative question beyond these four disciplines, Q-Tutors are available to attempt to help you with this question or will help direct you to someone else who may be better able to help. Schedules and other important information can be found at https://ltrc.vassar.edu/qrc/.

Academic Accommodations

Academic accommodations are available for students registered with the Office for Accessibility and Educational Opportunity (AEO). Students in need of disability (ADA/504) accommodations should schedule an appointment with me early in the semester to discuss any accommodations for this course that have been approved by the Office for Accessibility and Educational Opportunity, as indicated in your AEO accommodation letter.

Title IX Resources

Vassar College is committed to providing a safe learning environment for all students that is free of all forms of discrimination and sexual harassment, including sexual assault, domestic violence, dating violence, and stalking. If you (or someone you know) has experienced or experiences any of these incidents, know that you are not alone. Vassar College has staff members trained to support you in navigating campus life, accessing health and counseling services, providing academic and housing accommodations, helping with legal protective orders, and more.

Please be aware all Vassar faculty members are “responsible employees,” which means that if you tell me about a situation involving sexual harassment, sexual assault, dating violence, domestic violence, or stalking, I must share that information with the Title IX Coordinator. Although I have to make that notification, you will control how your case will be handled, including whether or not you wish to pursue a formal complaint. Our goal is to make sure you are aware of the range of options available to you and have access to the resources you need.

If you wish to speak to someone privately, you can contact any of the following on-campus resources:

  • Counseling Service (counselingservice.vassar.edu, 845-437-5700)
  • Health Service (healthservice.vassar.edu, 845-437-5800)
  • Rachel Gellert, SAVP (Support, Advocacy, and Violence Prevention) director (845-437-7863)
  • SAVP advocate, available 24/7 by calling the CRC at 845-437-7333

The SAVP website (savp.vassar.edu) and the Title IX section of the EOAA website (eoaa.vassar.edu/title-ix/) have more information, as well as links to both on- and off-campus resources.

Changes to Syllabus and Schedule

I view the latter part of the schedule as somewhat tentative; I do not want to rush through the material and so may make adjustments as the semester goes on. Therefore, I reserve the right to make changes to this syllabus and schedule when necessary. I will always let you know when this occurs. For the most up-to-date syllabus, please always look on Moodle. I will never add more assignments; I will only ever remove them from the schedule (in some cases I may tweak readings).

Schedule


Footnotes

  1. If you choose to go this route, you must check with me before writing your proposal, otherwise I will not approve it.↩︎

  2. Hadley Wickham is a statistician, the creator of the ggplot2 package and the tidyverse, and Chief Scientist at RStudio. He’s from New Zealand, hence the .nz in the links here and above.↩︎

  3. Adapted from the DATA 144 syllabus created by Professors Monika Hu and Jason Waterman.↩︎

  4. There is to be no sharing between groups, to be clear.↩︎

  5. If you do not have access to a laptop, please let me know as soon as possible, and we will find a solution.↩︎

  6. Adapted from statement written by Professors Jacob Smith and Abbie Erler, Kenyon College.↩︎