Homework #2

Due 11:59 pm EST, Tuesday March 1st, 2022.

Email your solutions (both .ipynb and .html files) to: compscbio@gmail.com.

Background:

You have just joined a new lab that is interested in the mammalian cell cycle. During lab meeting, you mentioned that you are taking a class that purports to teach you how to analyze scRNAseq data, and thereby you unwittingly ‘volunteered’ to close a project that a departing postdoc left unfinished. The postdoc was trying (1) to determine the extent to which cell cycle is impacted by stage of differentiation, and (2) to identify novel genes associated with cell cycle. To do so, this postdoc sampled mouse embryonic stem cells (mESCs) undergoing directed differentiation to mesoderm every 24 hours for five days and subjected these cells to scRNAseq via the 10x Genomics system. However, before the data could be analyzed, the postdoc got an offer in the private sector that could not be refused. The PI of the lab knows that you are amazing, and expects you to work some magic on the data to address the following questions:

    A. To what extent does the proportion of cells at each stage of cell cycle (CC) differ during directed differentiation?

    B. What novel genes are associated with CC?

    C. To what extent do these novel cell cycle genes vary across stages of differentiation?

However, because you are a relative newbie to this task, you want to use this project as an opportunity to also explore how the data cleaning steps impact the final results. Therefore, to receive full credit, you will also need to answer the following question:

    1. How do the various ‘cleaning’ steps outlined in class impact the results of (A-C)?

Two wrinkles, or unfortunate things

One thing that makes analysis of this data problematic is that the sample annotation table has been lost. This means that you do not know, a priori, at which day of differentiation each cell was sampled. However, your lab has studied this system before and they know a handful of genes that are up-regulated specifically at each stage of differentiation.

The other problem has to do with how this experiment is performed. Often, mESCs are maintained and expanded on a layer of mouse embryonic fibroblasts (MEFs), which help to maintain self-renewal. MEFs are mitotically inactivated either chemically or via irradiation, so they never outgrow the mESCs during passage. When mESCs are prepared for directed differentiation, MEFs are depleted via an imperfect method such that some MEFs remain at the onset of the differentiation protocol. This means some MEFs are likely to have been sequenced in this experiment and you should exclude them in silico prior to downstream analysis.

The data

scRNAseq in the form of an h5ad file: This is the raw counts data. No data cleaning has been performed on it. There are 2,201 cells and 31,065 genes.

Gene lists

naive pluripotency: Genes preferentially expressed in undifferentiated mESCs.

primed pluripotency: Genes preferentially expressed in mESCs primed for differentiation.

primitive streak: Genes preferentially expressed in the primitive streak.

mesoderm: Genes preferentially expressed in the nascent mesoderm.

MEF: Genes preferentially expressed in fibroblasts.

cell cycle: This is the same gene list that we covered in lecture 7.

Your mission: Part 1

Load the data, identify contaminating MEFs, and remove them
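One possible way to flag MEFs is to score each cell on the MEF gene list and exclude cells with a high score. The following is only a minimal sketch on toy data, not a prescribed solution: the helper `marker_score`, the placeholder gene names, and the cutoff of 5.0 are all assumptions made for illustration, and the cutoff in particular should be chosen by inspecting the distribution of scores in the real data.

```python
import numpy as np

def marker_score(counts, gene_names, marker_genes):
    """Mean log-normalized expression of the marker genes, per cell.

    counts: (n_cells, n_genes) array of raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty barcodes
    lognorm = np.log1p(counts / totals * 1e4)  # counts per 10k, then log1p
    markers = set(marker_genes)
    idx = [i for i, g in enumerate(gene_names) if g in markers]
    if not idx:
        raise ValueError("none of the marker genes are present in the data")
    return lognorm[:, idx].mean(axis=1)

# Toy data: 4 cells x 3 genes. Gene names are placeholders, not real markers.
gene_names = ["MefGeneA", "MefGeneB", "EscGeneA"]
counts = np.array([
    [0, 1, 50],
    [0, 0, 40],
    [30, 25, 2],   # dominated by the MEF genes
    [1, 0, 60],
])

mef_score = marker_score(counts, gene_names, ["MefGeneA", "MefGeneB"])
is_mef = mef_score > 5.0        # hypothetical cutoff
mef_depleted = counts[~is_mef]  # cells kept for downstream analysis
```

On an h5ad file loaded with scanpy, the same idea could be expressed with `sc.tl.score_genes` followed by boolean indexing of the AnnData object.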

[10]:
###  Part 1 code, figures and explanatory text go here and in subsequent cells. Show all code.

Your mission: Part 2

Clean the data as we discussed in class, including, but not limited to, filtering out potential doublets and low quality barcodes, and excluding undetected genes.
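The filtering described above can be sketched as a pair of boolean masks over a counts matrix. This is an illustrative sketch only: the function `qc_filter` and all of its thresholds are hypothetical, and using a high total count as a stand-in for doublets is a crude assumption rather than the method covered in class.

```python
import numpy as np

def qc_filter(counts, min_genes=2, max_counts=100, min_cells=1):
    """Return (cell_mask, gene_mask) for a (n_cells, n_genes) counts matrix.

    min_genes:  drop barcodes detecting fewer genes (likely low quality).
    max_counts: drop barcodes with more total counts (crude doublet proxy).
    min_cells:  drop genes detected in fewer than this many kept cells.
    """
    counts = np.asarray(counts)
    genes_per_cell = (counts > 0).sum(axis=1)
    counts_per_cell = counts.sum(axis=1)
    cell_mask = (genes_per_cell >= min_genes) & (counts_per_cell <= max_counts)
    # Count detecting cells only among the barcodes that passed the cell filter.
    cells_per_gene = (counts[cell_mask] > 0).sum(axis=0)
    gene_mask = cells_per_gene >= min_cells
    return cell_mask, gene_mask

# Toy data: 4 barcodes x 4 genes.
counts = np.array([
    [5, 3, 0, 0],
    [1, 0, 0, 0],    # only one gene detected
    [90, 80, 0, 0],  # unusually high total count
    [4, 0, 0, 2],
])

cell_mask, gene_mask = qc_filter(counts)
filtered = counts[cell_mask][:, gene_mask]
```

Returning masks instead of a filtered matrix makes it easy to report how many cells and genes each step removes, which is useful when comparing cleaned and uncleaned results later. With scanpy, `sc.pp.filter_cells` and `sc.pp.filter_genes` cover similar ground.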

[11]:
###  Part 2 code, figures and explanatory text go here and in subsequent cells. Show all code.

Your mission: Part 3

Predict the stage of differentiation for each cell.
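One simple approach, given a per-cell score for each stage gene list, is to assign every cell to the stage whose standardized score is highest. The sketch below assumes the scores have already been computed; the helper `assign_stage` and the toy score values are hypothetical, and other strategies (for example clustering first and labeling clusters) may work better on the real data.

```python
import numpy as np

def assign_stage(scores, stage_names):
    """Label each cell with the stage whose z-scored marker score is highest.

    scores: (n_cells, n_stages) array, one column per stage gene list.
    """
    scores = np.asarray(scores, dtype=float)
    mu = scores.mean(axis=0)
    sd = scores.std(axis=0)
    sd[sd == 0] = 1.0  # a constant column carries no information
    z = (scores - mu) / sd  # puts the gene lists on a comparable scale
    return [stage_names[i] for i in z.argmax(axis=1)]

stage_names = ["naive", "primed", "primitive streak", "mesoderm"]

# Toy scores for 4 cells; the values are made up for illustration.
scores = np.array([
    [3, 1, 0, 0],
    [1, 3, 0, 0],
    [0, 1, 3, 1],
    [0, 0, 1, 3],
])

predicted = assign_stage(scores, stage_names)
```

Standardizing each column first keeps a gene list with generally high expression from dominating the assignment.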

[12]:
###  Part 3 code, figures and explanatory text go here and in subsequent cells. Show all code.

Your mission: Part 4

Predict the stage of cell cycle of each cell. Use this information to address question (A) above:

To what extent does the proportion of cells at each stage of cell cycle (CC) differ during directed differentiation?
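Once each cell has a predicted stage and a predicted CC phase, the question reduces to comparing phase proportions within each stage. A minimal sketch, assuming the two label vectors already exist (the function name `phase_proportions` and the toy labels are hypothetical):

```python
from collections import Counter

def phase_proportions(stages, phases):
    """Fraction of cells in each CC phase, computed separately per stage."""
    counts = {}
    for stage, phase in zip(stages, phases):
        counts.setdefault(stage, Counter())[phase] += 1
    return {
        stage: {phase: n / sum(c.values()) for phase, n in c.items()}
        for stage, c in counts.items()
    }

# Toy labels for 6 cells.
stages = ["naive", "naive", "naive", "naive", "mesoderm", "mesoderm"]
phases = ["S", "S", "G2M", "G1", "G1", "G1"]

props = phase_proportions(stages, phases)
```

A stacked bar chart of these proportions, one bar per stage, is one way to present the comparison.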

[13]:
###  Part 4 code, figures and explanatory text go here and in subsequent cells. Show all code.

Your mission: Part 5

Write a function to identify genes that are correlated with cell cycle stage (S and G2M) scores. You can use Pearson correlation.

Use this function to determine the set of genes that are correlated with S and G2M during each stage of differentiation. I recommend using a threshold of 0.45; feel free to explore and select your own threshold, but if you do so, please justify your choice.
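As a starting point, the Pearson correlation of every gene with a phase score can be computed in one vectorized pass. This is a sketch under stated assumptions, not the required design: the name `correlated_genes`, its `known` argument, and the toy data are hypothetical, and keeping only positively correlated genes (r above the threshold, not |r|) is one possible reading of "correlated".

```python
import numpy as np

def correlated_genes(expr, score, gene_names, threshold=0.45, known=()):
    """Genes whose Pearson r with `score` exceeds `threshold`.

    expr:  (n_cells, n_genes) normalized expression.
    score: (n_cells,) phase score, e.g. the S or G2M score.
    known: genes to exclude, e.g. the existing cell cycle list.
    """
    expr = np.asarray(expr, dtype=float)
    score = np.asarray(score, dtype=float)
    xc = expr - expr.mean(axis=0)  # center each gene
    yc = score - score.mean()      # center the score
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.zeros(expr.shape[1])
    ok = denom > 0                 # constant genes get r = 0
    r[ok] = (xc.T @ yc)[ok] / denom[ok]
    known = set(known)
    return {
        g: float(v)
        for g, v in zip(gene_names, r)
        if v > threshold and g not in known
    }

# Toy data: 4 cells x 5 genes; gene names are placeholders.
gene_names = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
expr = np.array([
    [0, 3, 1, 0, 0],
    [2, 2, 1, 1, 3],
    [4, 1, 1, 3, 1],
    [6, 0, 1, 2, 2],
])
score = np.array([0.0, 1.0, 2.0, 3.0])

hits = correlated_genes(expr, score, gene_names)
novel = correlated_genes(expr, score, gene_names, known=["GeneA"])
```

To get stage-specific results, the function can be called once per predicted stage on the subset of cells assigned to that stage, separately for the S and G2M scores.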

Use the resulting information to answer questions (B) and (C):

(B): What novel genes are associated with CC?

(C): To what extent do these novel CC genes vary across stages of differentiation?

[14]:
###  Part 5 code, figures and explanatory text go here and in subsequent cells. Show all code.

Your mission: Part 6

Go back and re-do your analysis of the MEF-depleted cells, but this time evaluate how the major cleaning steps impact …

    - the predicted stage of differentiation

    - the predicted stage of CC, and how this varies by stage of differentiation

    - the identification of novel CC genes, and the extent to which they vary across stage of differentiation.

[15]:
###  Part 6 code, figures and explanatory text go here and in subsequent cells. Show all code.

Bonus mission:

Write a function that will generate a heatmap of your novel CC genes and that orders and displays cells by predicted phase of CC.

[16]:
###  Bonus mission code, figures and explanatory text go here and in subsequent cells. Show all code.