{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# scRNAseq analysis part 2\n",
"\n",
"#### Outline\n",
"HW2 will be posted by tomorrow morning.\n",
"\n",
"Today, we are going to continue where we left off last Thursday. We will cover the following:\n",
"\n",
"- background information on example data\n",
"- more on quality control\n",
"- normalization\n",
"- variable gene selection\n",
"- principle component analysis\n",
"- predicting cell cycle status\n",
"- k-means clustering\n",
"- hierarchical clustering\n",
"- kNN\n",
"- leiden clustering\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"#### Background\n",
"First, let's start with a bit of background on the data that we have been using. In lecture 1, we talked about gastrulation:\n",
"\n",
"\n",
"
\n",
"| | sampleName | n_genes_by_counts | total_counts | total_counts_ribo | pct_counts_ribo | total_counts_mt | pct_counts_mt |\n",
"|---|---|---|---|---|---|---|---|\n",
"| AAACATACCCTACC-1 | mEB_day4 | 1212 | 2238.0 | 629.0 | 28.105453 | 28.0 | 1.251117 |\n",
"| AAACATACGTCGTA-1 | mEB_day4 | 1588 | 3831.0 | 1267.0 | 33.072304 | 34.0 | 0.887497 |\n",
"| AAACATACTTTCAC-1 | mEB_day4 | 1538 | 3381.0 | 961.0 | 28.423544 | 2.0 | 0.059154 |\n",
"| AAACATTGCATTGG-1 | mEB_day4 | 1221 | 2489.0 | 750.0 | 30.132584 | 24.0 | 0.964243 |\n",
"| AAACATTGCTTGCC-1 | mEB_day4 | 2661 | 9510.0 | 3132.0 | 32.933754 | 71.0 | 0.746583 |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| TTTGACTGAGGCGA-1 | mEB_day4 | 2446 | 6908.0 | 1999.0 | 28.937466 | 65.0 | 0.940938 |\n",
"| TTTGACTGCATTGG-1 | mEB_day4 | 2906 | 9558.0 | 3067.0 | 32.088303 | 91.0 | 0.952082 |\n",
"| TTTGACTGCTGGAT-1 | mEB_day4 | 1475 | 3280.0 | 1035.0 | 31.554878 | 22.0 | 0.670732 |\n",
"| TTTGACTGGTGAGG-1 | mEB_day4 | 2808 | 9123.0 | 2923.0 | 32.039898 | 55.0 | 0.602872 |\n",
"| TTTGACTGTACAGC-1 | mEB_day4 | 3518 | 14918.0 | 5091.0 | 34.126560 | 91.0 | 0.610001 |\n",
"\n",
"5405 rows × 7 columns\n",
"\n",
"| | gene_ids | mt | ribo | n_cells_by_counts | mean_counts | pct_dropout_by_counts | total_counts | n_cells | highly_variable | means | dispersions | dispersions_norm |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| Xkr4 | ENSMUSG00000051951 | False | False | 37 | 0.007031 | 99.315449 | 38.0 | 36 | False | 0.016160 | 0.963224 | -0.793932 |\n",
"| Sox17 | ENSMUSG00000025902 | False | False | 214 | 0.121369 | 96.040703 | 656.0 | 200 | True | 0.245262 | 2.410920 | 5.548414 |\n",
"| Mrpl15 | ENSMUSG00000033845 | False | False | 3083 | 1.093617 | 42.960222 | 5911.0 | 2812 | False | 1.177424 | 1.363713 | 0.200615 |\n",
"| Lypla1 | ENSMUSG00000025903 | False | False | 1300 | 0.289732 | 75.948196 | 1566.0 | 1177 | False | 0.522485 | 1.201372 | -0.183017 |\n",
"| Tcea1 | ENSMUSG00000033813 | False | False | 2025 | 0.507678 | 62.534690 | 2744.0 | 1853 | False | 0.782621 | 1.229463 | -0.082616 |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| Vamp7 | ENSMUSG00000051412 | False | False | 810 | 0.168178 | 85.013876 | 909.0 | 729 | True | 0.342715 | 1.299276 | 0.339964 |\n",
"| Spry3 | ENSMUSG00000061654 | False | False | 5 | 0.000925 | 99.907493 | 5.0 | 5 | False | 0.002576 | 1.157724 | -0.113710 |\n",
"| PISD | ENSMUSG00000095041 | False | False | 2648 | 0.801295 | 51.008326 | 4331.0 | 2480 | True | 1.128138 | 1.547616 | 0.890088 |\n",
"| DHRSX | ENSMUSG00000063897 | False | False | 363 | 0.071230 | 93.283996 | 385.0 | 331 | False | 0.147803 | 1.127410 | -0.219726 |\n",
"| CAAA01147332.1 | ENSMUSG00000095742 | False | False | 13 | 0.002405 | 99.759482 | 13.0 | 10 | False | 0.006262 | 1.560301 | 1.294218 |\n",
"\n",
"15557 rows × 12 columns\n",
"\n",
"| | sampleName | n_genes_by_counts | total_counts | total_counts_ribo | pct_counts_ribo | total_counts_mt | pct_counts_mt | S_score | G2M_score | phase |\n",
"|---|---|---|---|---|---|---|---|---|---|---|\n",
"| AAACATACCCTACC-1 | mEB_day4 | 1212 | 2238.0 | 629.0 | 28.105453 | 28.0 | 1.251117 | -0.025818 | 0.066043 | G2M |\n",
"| AAACATACGTCGTA-1 | mEB_day4 | 1588 | 3831.0 | 1267.0 | 33.072304 | 34.0 | 0.887497 | 0.193831 | 0.545027 | G2M |\n",
"| AAACATACTTTCAC-1 | mEB_day4 | 1538 | 3381.0 | 961.0 | 28.423544 | 2.0 | 0.059154 | 0.453981 | 0.562032 | G2M |\n",
"| AAACATTGCATTGG-1 | mEB_day4 | 1221 | 2489.0 | 750.0 | 30.132584 | 24.0 | 0.964243 | 0.001577 | 0.840160 | G2M |\n",
"| AAACATTGCTTGCC-1 | mEB_day4 | 2661 | 9510.0 | 3132.0 | 32.933754 | 71.0 | 0.746583 | 0.446344 | 0.671123 | G2M |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| TTTGACTGACTCTT-1 | mEB_day4 | 2941 | 11419.0 | 4406.0 | 38.584812 | 44.0 | 0.385323 | -0.164811 | 0.345561 | G2M |\n",
"| TTTGACTGAGGCGA-1 | mEB_day4 | 2446 | 6908.0 | 1999.0 | 28.937466 | 65.0 | 0.940938 | 0.704966 | 0.669519 | S |\n",
"| TTTGACTGCATTGG-1 | mEB_day4 | 2906 | 9558.0 | 3067.0 | 32.088303 | 91.0 | 0.952082 | 1.178114 | 0.898342 | S |\n",
"| TTTGACTGCTGGAT-1 | mEB_day4 | 1475 | 3280.0 | 1035.0 | 31.554878 | 22.0 | 0.670732 | 0.246748 | 0.693743 | G2M |\n",
"| TTTGACTGGTGAGG-1 | mEB_day4 | 2808 | 9123.0 | 2923.0 | 32.039898 | 55.0 | 0.602872 | -0.162199 | 1.935829 | G2M |\n",
"\n",
"5109 rows × 10 columns
\n", "