Homework #1

Due 11:59 pm EST, Sunday Feb 13th, 2022 Email your solutions (both .ipnb and .html files) to: compscbio@gmail.com.

Goal: Ensure that you …

  • have basic, practical knowledge of Python

  • are able to install Python packages

  • are able to use and create Jupyter notebooks

  • can apply this knowledge to explore one aspect of self-renewal


Preamble: functions

In class, we did not cover the syntax for defining your own functions. Let’s cover this now. Note that we use the randint function from the random module, and so need to import this module. Typically, you should import all of the modules that you will need at the start of your session. So, I am going to put some other import statements here that we won’t need until later in the homework. Please install modules, as needed, using either pip or anaconda.

[1]:
import random
import nltk
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

Here is a funky little function. It just prints a string passed to the function a random number of times between 1 and 10.

[3]:
def myFirstFunction(arg1):
    """This is just an example; it doesn't do anything useful"""
    times_to_repeat = random.randint(1,10)
    for i in range(0,times_to_repeat):
        print(arg1)

    return times_to_repeat

[4]:
myFirstFunction("howdy")

howdy
howdy
howdy
howdy
howdy
howdy
howdy
howdy
[4]:
8

When you create a function, you need to use the def keyword, then supply your function name, followed by the names of arguments that need to be passed to the function enclosed in parenthesis, followed by :. Then the following indented suite of code is where you define what the function will do. The first line should be text enclosed in three pairs of double quotes. This is what will be returned when help(yourFunction) is invoked. Therefore, give some thought to this text and make it a pithy description of your function. If you do not end the function suite with an explicit call to return, the the function will by default return None.


Part 1: Write a function prose_generator.

The first thing that you must do for this homeowrk is to define a function that writes a number of English sentences. The user will be asked to specify the number of sentences. The function will also need to take in as input lists of nouns and predicates. The function will construct and print out sentences of the pattern: (noun, predicate), by randomly selecting words from the corresponding lists. Below, we get you started with the skeleton of the function.

To get full credit for Part 1, your answer must include a function that takes as input the number of sentences to create, and lists of user-supplied nouns and predicates. The lists can be entered either by prompting the user, or by specifying a file(s) that contains the words. If the former, define your examples in your notebook. If the latter, inlcude them as supplemental files with your submission. Your function must randomly select from these lists to make sentences that start with the noun and end with the predicate. Finally, your function must output the specified number of sentences.

[12]:
# Part 1: modify the function skeleton below to meet the conditions stated above

import random

def prose_generator():
    """ a function that writes [beautiful] prose """
    # put your code here

    return

[11]:
# Part 1: Show us an example of running your function here

Part 2: Improve your function

Your function probably does not have much error checking so far. And, it is possible that a random sentence will be generated more than once. I don’t know about you, but I find it mildly annoying when people say the same thing repeatedly. So, your second task is to modify your function (or create a new one) so that the following conditions are met:

  1. That the nouns are actually nouns.

  2. That at least two valid nouns and two predicates are entered

  3. That a sentence is printed at most once. In other words, prevent the function from outputing duplicate sentences.

To achieve (A), you can compare the user-supplied nouns against the long list of nouns that we show you how to extract from the NLTK resource below. If the noun supplied does not occur in this list, then your function must at the very least, notify the user. To get full credit for Part 2, your new or modified sentence-generator function must do the same things as in Part 1, plus it must meet the conditions (A-C).

You will need to install the following resources from nltk (see next code cell). For more information about the NLTK resource: https://www.nltk.org/data.html

The post that we found useful to get a list of lots of Enlgish nouns: https://stackoverflow.com/questions/17753182/getting-a-large-list-of-nouns-or-adjectives-in-python-with-nltk-or-python-mad

[9]:
nltk.download('omw-1.4')
nltk.download("wordnet")
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Volumes/forest/moved_users/pcahan1/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Volumes/forest/moved_users/pcahan1/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[9]:
True
[10]:
from nltk.corpus import wordnet as wn
[11]:
all_nouns = []
for synset in wn.all_synsets('n'):
    all_nouns.extend(synset.lemma_names())
[12]:
len(all_nouns)
[12]:
146347
[13]:
"carrot" in all_nouns
[13]:
True
[14]:
"faba-laba" in all_nouns
[14]:
False

Part 3: Use Pandas (+ Numpy), and Seaborn (+ Matplotlib)

Next week (actually this week), we are going to start using Scanpy to analyze scRNAseq data. The Scanpy package is excellent at making it easy to analyze this type of data. But when you are just getting started, this can almost be too easy because it hides a lot of important details. The goal of this section is to expose you to some aspects of cleaning and selecting data, and to some basic visualization functions. This will give you a better idea of what is happening under-the-hood with packages such as Scanpy (or Seurat in R).

In this section, you need to:

    1. Load a gene expression data set (expression table; meta-data)

    1. Remove genes that are not expressed above a selected threshold

    1. Create boxplot of the expression of one of my favorite genes, Pou5f1, across cell/tissue types

    1. Identify the top 5 genes in terms of mean expression per cell/tissue type & plot a heatmap of these genes (ordered)

    1. Compute the similarity of samples & plot a heatmap of them (ordered)

To do this somewhat efficiently, you might want to look at the following documentation:

If it sounds like a lot, don’t worry it will be fun!

3A: Load the data

Download the expression matrix and meta-data from the links above. Use the Pandas read_csv() to load these. This will create a Pandas DataFrame.

This mouse data comes from one of our papers, where we compiled previously published RNA-Seq data for classification purposes. There are 16 cell/tissue types, each represented by approximately 60 or more individual samples. Later, in Step 3C, you should also load the metadata, which indicates the cell/tissue type for each sample.

[6]:
# 3A: Enter your code for loading the file hw1_df.csv as a Data Frame and displaying part of it.

3B: Filter out genes whose maximum expression < 2

You might consider using the pd.loc and pd.max() functions

[138]:
# 3B: Enter your code for removing genes and print out the size of the resulting data frame

Limit genes to those that have maximum expression > 2

3C: Make a box and whiskers plot of the gene Pou5f1, grouped by cell/tissue type.

To do this you will need to load the meta-data, linked to above. You might want to use the pd.concat() and sns.boxplot() functions.

[2]:
# 3C: Show you code for doing what you need to do to ultimatley make the boxplot.
# And, of course, display the boxplot below, too

3D: Determine the top 5 genes expressed per cell/tissue type

Once you have done so, make a heatmap showing expression of these genes across all samples. To get full credit for this, you need to

  • exclude duplicate genes (i.e. genes that are in the top 5 in more than one cell/tissue type)

  • order the samples by cell/tissue type

You might want to use the Pandas functions groupby and mean and sort_values here, and it may be useful to do some type conversions. Do use sns.heatmap() for plotting.

[140]:
# 3D: Put your code for this part here.

3E: Compute the similarity between samples.

Compute the Pearson correaltion coefficient between all pairs of samples. Plot this as a heatmap. Make sure the samples are ordered by cell/tissue type. You might want to use the Pandas corr() function. Be careful of what your are computing correlation of.

[7]:
# 3E: Put you code for 3E here

Part 4: Modeling Birth and death of CFUs

Recall the self-renewal assay from Siminovitch, Till and McCulloch 1963 (right) 34a87c18ee004d6d9ca59533e815f76a In this assay, nodules were dissected, cells were dispersed, resuspended, and tranplanted into secondary recipients. Then after 10 days (or later depending on the experiment), the number of spleen nodules were counted per recipient. Recall that the number of nodules (or colonies) per spleen was highly heterogeneous. To the authors, this indicated that heterogeneity was “… a feature either of colony-forming cells or of the process of colony-formation.”

7bfd50a6be514201b6d42b41540072dd In a follow-up paper: Till, McCulloch, and Siminovitch, 1963, this trio proposed that the heterotengeity could be due to inherent stochasticity in each colony-forming cell. In other words, a colony forming cell “… either give[s] rise to progeny like itself (“birth”), or [is] removed in some way (“death”), and these two events occur in a random fashion.” This is often referred to as a “Birth-and-death process”. They reasoned that a cell will give birth to another colony forming cell with Probability = p2 or to a differentiated cells with Probability = p0 (see panel at left). They further reasoned that if this model is correct, then simulations using the model must be aligned with their empircal observations.

The goal of Part 4 is for you to replicate some of what the authors did to test this reasoning:

A test of this assumption was made by the use of the Monte Carlo method. In this method, the choice between occurrence or nonoccurrence of a random event is made by drawing a number from a table of random digits. Consider as an example the case in which the birth and death probabilities, p2 and p0, are arbitrarily set equal to 0.6 and 0.4, respectively. Let the six digits, 0, 1, 2, 3, 4, and 5 signify a “birth” and the four digits 6, 7, 8, and 9 signify a “death.” If the first random number to be drawn was a 5, then the model would show a birth, or an increase from one cell to two cells. Each subsequent choice of a number would determine the fate of a cell in a similar manner.

Part 4A: Write a function to simulate the number of colony forming units.

Your task is to precisley mimic what the authors state above: write a function that takes as input:

  • the number of time steps, or generations

  • P2 = probability of birth, resulting in two colony forming units

and that outputs the number of CFUs remaining. Note that a death event results in a loss of one CFU, whereas a birth yields two CFUs. You may consider using the Numpy function random.uniform() here.

[2]:
# 4A: Write your function definition here
[4]:
# 4A: Show us an example of how it works here

Part 4B: Compare in silico experiments to empirical data

Use your function to determine how values of P2 influence the distribution of CFUs, and how this compares to real data. To do this, you will need to call your function many times. For this homework, let’s stay close to the original paper (n=52 recipient spleens). You will also need the original data, which you can load with the usual pd.read_csv() function. We recommend testing 3 - 5 values for P2. As you will see from the original paper, the simulated and experimentally derived data were compared by examining cumulative distribution functions. You can readily plot these using the Seaborn ecdfplot() function.

Bonus points: Compare these distrubitions to the Gamma distribution mentioned in the paper.

[8]:
# 4B: Show your code and figure(s) here.