An Introduction to scRNA-seq
a meaningful application of computational biology
scRNA-seq is a process used to differentiate and learn about cells. This guide (hopefully) condenses enough information so that other beginners like myself can replicate this process and understand how computer science can assist vital research areas such as cancer medicine, organ development, and drug effect studies.
Sequencing can be categorized under bioinformatics, a “multidisciplinary” field with a fair share of problems). In order to avoid useless data and wasted time, computer scientists must be aware of the underlying biological processes and the impact of the processed data. So the goal here isn’t to dive into the Python but to understand the flow of data; where it came from, and what it can imply in a research setting.
Necessary Background Information
1. A cell can interpret DNA in different ways
This diagram displays a cell’s DNA turning into mRNA, which then creates proteins (known as the ‘central dogma’ in biology). The types of proteins mRNA creates determines a lot about the organism as a whole – controlling cell division, creation of other proteins, hormones, and enzymes, and the flow of information and substances in and out of the cell.
However, mRNA does not create all possible proteins that are encoded in DNA; it creates a select few based on the exons and introns (the translatable and nontranslatable parts) of DNA. Depending on what parts of the DNA strand are used as exons or introns, the same block of DNA can be transcribed into mRNA differently. Whatever the final mRNA strand excludes is post-classified as an ‘intron’, but this does not mean that the intron is always an unusable piece of DNA. The same section of DNA can change; introns can be recruited into exons, and some exons are later unused (becoming introns), changing what mRNA is transcribed.
So, exons, which are used in the final strand of mRNA, can change. But not every exon is used either, and sometimes they are pieced together in different orders (a process called alternative splicing).
There are a variety of factors that could change what mRNA, and therefore what proteins, a cell creates:
- light, heat, radiation
- molecular availability (oxygen, hydrogen), diet
- other proteins
- mutagens
- age
By studying these factors, we learn how they affect cell ‘expression’; that is, a cell’s current attributes based on the mRNA it is currently creating. Notice that this ‘expression’ is only a snapshot of a cell’s mRNA at a given time. However, this snapshot gives us a lot of information.
2. These different ways determine a cell's gene expression profile
The part of the DNA strand that is transcribed is called a gene. We now know that these genes are transcribed to create proteins that build who we are, and can change depending on their environment. We are now adding another layer of complexity – just like how some parts of the gene are used or unused (the exons and introns), whole genes themselves may be used or unused; ‘on’ or ‘off’.
If a gene is being used to make mRNA, it is considered ‘on’; if it is not being used to make mRNA, it is considered ‘off’. scRNA-seq measures which genes are ‘on’, thereby defining a cell’s current gene expression profile.
Gene expression profiles can tell us about the cell, what kind of environment it is in, and what it is doing.
Some great examples can be found on this post that introduces gene expression profiling.
3. scRNA-seq uses the gene expression profile to differentiate cells
scRNA-seq, or single cell RNA sequencing, measures the gene expression profile of a cell in order to differentiate it from other cells in a sample.
- Preprocessing: The creation of a gene expression matrix from raw data. I do not go into the details about how this is done in this post, but I may upload another guide should I learn how live lab data is processed.
- Normalization: removing strange cells, outliers, etc.
- Dimensionality reduction: there are too many genes for us to map each cell on a graph. This step uses an unsupervised learning algorithm to flatten cell features but still retain information. More information on this algorithm can be found in this bioinformatics lecture from Harvard.
- Clustering: identifying patterns of gene expression by grouping genes based on their distance and allowing us to visualize relationships between groups of genes. This is the focus of the replicated experiment paper below.
- Differential Gene Expression: comparing cells of different clusters or from different groups
- Gene Expression Dynamics: algorithms that attempt to determine the developmental progression of cells based on their profiles. I’ll also likely be saving this topic for another post.
Replicating a Research Paper
This guide uses Dean Lee’s Figure 1 Lab; I will be recreating a figure 1. graph using Jupyter labs and data from “Pan-cancer single cell RNA-seq uncovers recurring programs of cellular heterogeneity”.
Abstract
Cultured cell lines are cells that have grown from a single cell, and are supposed to duplicate the genetic properties of the cell it grew from. Cancer cell lines can be indefinitely experimented on and regrown as if one were experimenting on the cancer itself – as long as the genetic properties of the original cell were properly duplicated.
Data Processing
The raw data can be downloaded from the Broad Institute Single Cell Portal using the identifier listed on the paper.
1. Prepare the data
# Download these in Jupyter using 'pip install x'
import os
import numpy as np
import pandas as pd
import scipy
import anndata
import scanpy as sc
import pybiomart
import scvi
import torch
import random
import seaborn as sns
warnings. filterwarnings('ignore')
cwd = os.getcwd()
# Use pandas to read the data we downloaded
# This data is about the cells themselves and their expressions
meta = pd.read_csv(cwd+'/Metadata.txt', sep='\t')
# Drop the first row that names types
meta.drop([0], axis=0, inplace=True)
# Rename first few columns
meta.rename(columns={'NAME':'CellID', 'Cell_line':'CellLine', 'Pool_ID':'Pool', 'Cancer_type':'Indication'}, inplace=True)
# Download the UMICount data which contains the
# actual genetic code of the mRNA transcriptions
# for each cell
pd.read_csv(cwd+'/UMIcount_data.txt', nrows=10, sep='\t', header=None)
# Skip the cell metadata rows and transpose
# so that the columns line up with our new
# cell id column as rows
counts = pd.read_csv(cwd+'/UMIcount_data.txt', sep='\t', skiprows=3, header=None, index_col=0)
counts = counts.transpose()
# Set the first column as our new column which
# provides a cell id for each row
counts.index = counts_cellid[0]
counts.index.name = None
# Remove cell ids in UMI that are not in our
# metadata
a = counts.index.isin(meta['CellID'])
counts = counts[a]
# Ensure that metadata and counts dataframes
# have the same index of cell IDs
meta = meta.set_index('CellID')
meta = meta.reindex(index=counts.index)
# Create an anndata object (annotated)
# X is our matrix of observations and variables
# Obs is our one-dimensional observations
# Var is our one-dimensional variables which
# in this case is our types of genome markers (the columns of count) transformed into a dataframe of one column
adata = anndata.AnnData(X=counts,
obs=meta,
var=counts.columns.to_frame())
adata.var.drop(columns=[0], inplace=True)
adata.var.index.name = None
adata.X = scipy.sparse.csr_matrix(adata.X.copy())
adata.layers['counts'] = scipy.sparse.csr_matrix(adata.X.copy())
sc.pp.filter_genes(adata, min_cells=10)
sc.pp.filter_cells(adata, min_genes=200)
Which should create output similar to:
2. Train the data
# Create an scvi model to reduce dimensions
# and create clusters
scvi.model.SCVI.setup_anndata(adata, layer='counts', batch_key='Pool')
scvi_model = scvi.model.SCVI(adata, n_layers=2, n_latent=30, n_hidden=128, gene_likelihood='nb')
scvi_model.train()
# Assign data types
adata.obs['CellLine'] = adata.obs['CellLine'].astype(str)
adata.obs['Pool'] = adata.obs['Pool'].astype(str)
adata.obs['Indication'] = adata.obs['Indication'].astype(str)
adata.obs['Genes_expressed'] = adata.obs['Genes_expressed'].astype(int)
adata.obs['Discrete_cluster_minpts5_eps1.8'] = adata.obs['Discrete_cluster_minpts5_eps1.8'].astype(str)
adata.obs['Discrete_cluster_minpts5_eps1.5'] = adata.obs['Discrete_cluster_minpts5_eps1.5'].astype(str)
adata.obs['Discrete_cluster_minpts5_eps1.2'] = adata.obs['Discrete_cluster_minpts5_eps1.2'].astype(str)
adata.obs['CNA_subclone'] = adata.obs['CNA_subclone'].astype(str)
adata.obs['SkinPig_score'] = adata.obs['SkinPig_score'].astype(float)
adata.obs['EMTI_score'] = adata.obs['EMTI_score'].astype(float)
adata.obs['EMTII_score'] = adata.obs['EMTII_score'].astype(float)
adata.obs['EMTIII_score'] = adata.obs['EMTIII_score'].astype(float)
adata.obs['IFNResp_score'] = adata.obs['IFNResp_score'].astype(float)
adata.obs['p53Sen_score'] = adata.obs['p53Sen_score'].astype(float)
adata.obs['EpiSen_score'] = adata.obs['EpiSen_score'].astype(float)
adata.obs['StressResp_score'] = adata.obs['StressResp_score'].astype(float)
adata.obs['ProtMatu_score'] = adata.obs['ProtMatu_score'].astype(float)
adata.obs['ProtDegra_score'] = adata.obs['ProtDegra_score'].astype(float)
adata.obs['G1/S_score'] = adata.obs['G1/S_score'].astype(float)
adata.obs['G2/M_score'] = adata.obs['G2/M_score'].astype(float)
adata.obs['n_genes'] = adata.obs['n_genes'].astype(int)
adata.obs['_scvi_batch'] = adata.obs['_scvi_batch'].astype(str)
adata.obs['_scvi_labels'] = adata.obs['_scvi_labels'].astype(str)
# Add data to our anndata model
adata.obsm['X_scvi'] = scvi_model.get_latent_representation()
adata.layers['counts_scvi'] = scvi_model.get_normalized_expression(library_size=10000)
# Create a graph
sc.pp.neighbors(adata, use_rep='X_scvi', key_added='neighbors_scvi', n_neighbors=20)
sc.tl.leiden(adata, neighbors_key='neighbors_scvi', key_added='leiden_scvi', resolution=3)
sc.tl.umap(adata, neighbors_key='neighbors_scvi')
sc.pl.umap(adata, color=['leiden_scvi'], legend_loc='on data')
The graph, similar to Figure 1 in the paper but not exact;
3. Categorize the clusters
# Create a graph using our anndata graph
def PlotUMAP(adata, markers, layer='log2_counts_scvi', size=2, vmin='p0', vmax='p99'):
for i in range(len(markers)):
sc.pl.umap(adata,
color=markers[i],
layer=layer,
size=size,
cmap=sns.blend_palette(['lightgray', sns.xkcd_rgb['red orange']], as_cmap=True),
vmin=vmin, vmax=vmax)
# Layers are dictionary objects with values that are the same
# dimensions as X
adata.layers['log2_counts'] = sc.pp.log1p(adata.layers['counts'].copy(), base=2)
adata.layers['log2_counts_scvi'] = sc.pp.log1p(adata.layers['counts_scvi'].copy(), base=2)
# Color the graph based on cell line data (the type of cancer)
sc.pl.umap(adata, color=['CellLine'], legend_loc=None, palette=list(plt.colors.CSS4_COLORS.values()))
Our graph:
The paper’s graph:
Conclusions
The cellular diversity in the samples appeared to be similar across the cultured cell lines and the tumor cells themselves; the authors “identified 12 RHPs (2 cell cycle and 10 others), 9 of which were highly similar to programs of heterogeneity observed within tumors, indicating that they are retained in the absence of a native microenvironment”(Kinker).
Slight differences in our Fig 1. graphs may be due to differing methods of data cleansing or data training models (we did not clean outliers, and we used scvi for training).