1 Dataset Introduction

1.1 CosMx SMI Instrument Background

CosMx SMI is currently the only spatial platform enabling quantification of 1000+ targets at subcellular resolution for maximum biological insight. The CosMx SMI in situ chemistry is designed to enable high-plex and high-resolution with probes and fluorophores designed to reduce optical crowding limitations, allowing multiplexing probes to 1000 unique transcript species and beyond. In addition to high-plex, CosMx SMI also achieves high resolution. Target localization accuracy of ∼50 nm in the XY plane allows for targeting of <1kb transcripts with no loss in performance.

1.2 Cell Segmentation

CosMx uses a combination of in situ hybridization (ISH) and antibody-conjugated fluorescent imaging methods to both accurately assign RNA transcripts to an X-Y coordinates, and identify the cells in which those transcripts reside. Antibody-based tissue labeling, and multi-modal AI-based cell segmentation approach enables precise cell boundary detection.

1.3 CosMx RNA assay Panel

The 1000-plex CosMx Human Universal Cell Characterization panel is designed to provide granular cell type mapping and detailed insight into cell states, signaling interactions, hormone activity, and microenvironments in healthy and diseased human organs. It provides robust tissue mapping and cell typing to reveals ligand-receptor interactions, hormone activity, and signaling pathways and profiles cell states and functions.

1.4 Liver Dataset

The liver is immensely important for human health, but the diversity and spatial features of individual liver cell types, their cell-cell interactives in context, and the associated molecular and cellular processes in both healthy and diseased tissue have not yet been fully uncovered

We present here two tissue sections from 5µm Formalin fixed Paraffin Embedded (FFPE) sections:

  • Normal liver, 35yo Male, RINe: 1.4, DV200 58%
  • Hepatocellular Carcinoma (Liver Cancer), stage II, Grade G3, 65yo Female, RINe 2.3, DV200 68%

Scan areas:

  • Normal Liver: 76mm2
  • Cancer Liver: 100mm2

Fields of View (FOV)

  • Normal Liver: 301
  • Cancer Liver: 383

Panel Used: CosMx Human Universal Cell Characterization Panel (1000-plex)

  • Validated and available commercial product

Segmentation markers:

  • Normal Liver: DAPI, CD298/B2M, Ck8/18, PanCK, CD45
  • Cancer Liver: DAPI, CD298/B2M, CD68, PanCK, CD45


Key Performance Metrics
FFPE Normal Liver 76mm2 FFPE Hepatocellular Carcinoma 100mm2
Total Cells Analyzed ~340,000 ~464,000
% of cell passed QC 99% 99%
Sensitivity Total Transcripts detected ~200M ~500M
Cell Segmentation % of transcripts assigned to cells 89% 93%
Sensitivity Mean total transcripts/cell 583 1150
Max total transcripts/cell 5,916 27,491
Mean total transcripts/um3 0.67 1.41
Specificity Mean Total Negative probe counts/cell 0.7 0.7
Mean Total FalseCode* counts/cell 3.95 4.16
Mean Negative probe counts/target/cell 0.04 0.07
Mean FalseCode counts/target/cell 0.008 0.008
% counts on targeted genes ~99% ~99%
Diversity Total unique genes detected above background 643 676

*FalseCodes are fluorescent sequences not present in the panel and are used to estimate decoding error

2 Read In Raw Data

library(ggplot2)
library(ggpubr)
library(ggrepel)
library(gridExtra)
library(matrixStats)
library(patchwork)
library(pheatmap)
library(Seurat)
library(RColorBrewer)
library(reshape2)

2.1 TileDB Array data

“TileDB is a powerful engine for storing and accessing dense and sparse multi-dimensional arrays, which can help you model any complex data efficiently.” TileDB GitHub

In addition to formats which store data in memory, the AtoMx SIP exports support TileDB formats which access and load data as memory is needed. This format, while less commonly used for single-cell projects, allows for scalability to very high-density analysis. As many CosMx studies will be well in excess of 1 million cells, this new format will enable robust and scalable computations across very large studies without requiring all data to be loaded into memory. CosMx SMI datasets can be large enough to quickly run out of RAM when doing analyses if the entire dataset is read into memory. By saving data in TileDB arrays we don’t need to have all of the data in memory, only the specific parts we are actively working on.

TileDBsc is an R implementation of the Stack of Matrices, Annotated (SOMA) API based on TileDB. Using the tiledbsc R package, one can access the data stored as TileDB arrays in the SOMA data model from our analysis pipeline. The TileDB organization, in association with the Chan Zuckerberg Initiative, is currently creating a supplemental library that will enable users to convert the two most popular single cell format, Seurat and AnnData, to and from SOMA to make data analysis in multiple languages like R and python simple. While all of the data is stored in TileDB arrays, it can be saved into a Seurat object to work in a more familiar format and to be able to use existing Seurat functions, see the Seurat Object section below for more information.


For this section of the tutorial, we will be focusing on two packages, TileDB-R and tiledbsc. TileDB-R allows us to use the low-level functions from TileDB like reading and writing matrices and tiledbsc is built on top of this framework specifically for single cell data that allows easy conversion to a Seurat object.

if (!require("tiledb", quietly = TRUE))
  remotes::install_github("TileDB-Inc/TileDB-R", force = TRUE, 
  ref = "0.17.0")

if (!require("tiledbsc", quietly = TRUE))
  remotes::install_github("tiledb-inc/tiledbsc", force = TRUE, 
  ref = "8157b7d54398b1f957832f37fff0b173d355530e")

library(tiledb)
library(tiledbsc)

These packages are still under development so it is good practice to know which version is being installing.

tiledbsc 0.1.4.9001+ is recommended

tiledb: 0.17.0
tiledbsc: 0.1.5

Set the tiledbURI variable below to point to the file path of the TileDB folder. For the example below, the TileDB folder is located in LiverDataRelease_TileDBArray folder. You can also set this to an Amazon Web Serivices (AWS) S3 bucket if your TileDB object is stored in an S3 bucket.

# File path to TileDB array on local machine
tiledbURI <- "LiverDataRelease_TileDBArray/"

If your data is in an Amazon S3 bucket, we need to set up the environment for TileDB to be able to interact with AWS.

# AWS region of the URI
s3Region <- "us-west-2"

config <- tiledb::tiledb_config()
config["vfs.s3.region"] <- s3Region
ctx <- tiledb::tiledb_ctx(config)

The main object in TileDB is a Stack of Matrices, Annotated (SOMA). SOMA is an open data model for representing annotated matrices, like those commonly used for single-cell data analysis. A SOMACollection contains multiple SOMAs. In CosMx SMI data, there is a SOMA for targets RNA species, negative control probes, and FalseCodes.

An example of the basic TileDB data structure can be seen below:

directory
|-- __group
|   |-- __data_specific_alphanumeric_name_here
|   |-- __data_specific_alphanumeric_name_here
|-- __meta
|   |-- __data_specific_alphanumeric_name_here
|-- __tiledb_group.tdb
|-- soma_RNA
|   |-- X
|   |-- __group
|   |-- __meta
|   |-- __tiledb_group.tdb
|   |-- obs
|   |-- obsm
|   |-- obsp
|   |-- uns
|   |-- var
|   |-- varm
|   |-- varp
|-- uns
    |-- __group
    |-- __meta
    |-- __tiledb_group.tdb
    |-- commands

These files are not touched except through TileDB accessor functions, shown below.

2.1.1 Reading into R

You can read the arrays stored in the TileDB object using the tiledbsc R package which offers accesor functions to easily read the data into memory or access only.

# read in SOMACollection
tiledb_scdataset <- tiledbsc::SOMACollection$new(uri = tiledbURI, 
                                                 verbose = FALSE)

The TileDB object is a collection of pointers to the RNA counts, normalized RNA counts, negative control probes, and falsecode SOMAs. Each SOMA follows the AnnData shape.

AnnData Shape

names(tiledb_scdataset$somas)
## [1] "falsecode"      "negprobes"      "RNA"            "RNA_normalized"
names(tiledb_scdataset$somas$RNA$members)
## [1] "X"    "varp" "obs"  "varm" "obsp" "uns"  "obsm" "var"

Having an object of pointers allows for small memory usage. This TileDB array with an entire study only used 3.21 MB of memory even though it contains nearly 800,000 cells!

The main parts of CosMx SMI data are:

  1. Cell-by-target counts which are stored in the X slot
  2. Cell metadata which is stored in the obs slot
  3. Target transcript coordinates which is stored in the obsm slot
  4. Results from analysis modules which are stored in various slots and will be detailed below

These parts of the data can be loaded into memory separately or all together.

Individual TileDB objects might have different orders of cells. Please pay attention to these orders.

2.1.1.1 Cell-by-target counts

All matrices in TileDB are stored as sparse matrices. This structure is used for querying the data.

Count matrices currently have targets on rows and cells on columns like a standard Seurat object. In future TileDB releases, these will be transposed to look more like AnnData. This change will decrease load times in TileDB but any coercion to Seurat, shown below, will not be affected.

2.1.1.1.1 Raw Counts

The raw data cell-by-target expression data are stored in the X slot and can be retrieved and stored in memory using the below code.

# batch_mode parallelizes the readin, decreasing computation time
counts <- tiledb_scdataset$somas$RNA$X$members$counts$to_matrix(batch_mode = TRUE) 
dim(counts)
## [1]   1000 793318
counts[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgTMatrix"
##       c_1_100_1000 c_1_100_1017 c_1_100_1028 c_1_100_1043
## AATK             1            1            2            1
## ABL1             .            .            .            .
## ABL2             .            .            .            .
## ACACB            .            .            .            .

Memory used: 2.4 GB

2.1.1.1.2 Normalized Counts

Use the code below to retrieve and save the normalized counts in memory.

This data was normalized using Pearson Residuals to account for library size factors to ensure that cell specific total transcript abundance and distribution of counts, which may vary some between FOVs and samples. See Normalization Section for more details.

In future releases normalized data will be located tiledb_scdataset$somas$RNA$X$members$data and not in a separate SOMA.

norm <- tiledb_scdataset$somas$RNA_normalized$X$members$data$to_matrix(batch_mode = TRUE)
dim(norm)
## [1]   1000 793318
norm[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgTMatrix"
##        c_1_100_1  c_1_100_10 c_1_100_100 c_1_100_1000
## AATK  -0.1429034 -0.06889529  -0.2645790    7.5780320
## ABL1  -0.1972264 -0.09509172  -0.3650734   -0.1790312
## ABL2  -0.1667603 -0.08039924  -0.3087212   -0.1513743
## ACACB -0.2951553 -0.14233400  -0.5460241   -0.2679370

Memory used: 12.75 GB

2.1.1.2 Cell Metadata

Cell-level metadata are stored in obs slot. It consists of cell information read from the output of CosMx SMI. Some worth noting are the x-y locations, flow cell, fov and other cell stats from the Cell_Stats csv files from raw data inputs. The obs slot will also store results from module run from the pipeline. Some examples are the QC module and clustering modules. The code below can be used to retrieve and to store the cell metadata from obs slot into memory.

metadata <- tiledb_scdataset$somas$RNA$obs$to_dataframe()
dim(metadata)
## [1] 793318     63
metadata[1:4,1:10]
##              RNA_pca_cluster_default RNA_pca_cluster_default.1 orig.ident nCount_RNA nFeature_RNA nCount_negprobes nFeature_negprobes nCount_falsecode
## c_1_100_1                         16                        20          c        142           77                0                  0                0
## c_1_100_10                        13                        15          c         33           26                0                  0                5
## c_1_100_100                        3                         7          c        487          143                0                  0                1
## c_1_100_1000                      13                        16          c        117           69                0                  0                2
##              nFeature_falsecode fov
## c_1_100_1                     0 100
## c_1_100_10                    5 100
## c_1_100_100                   1 100
## c_1_100_1000                  2 100

Memory used: 422.05 MB


If you do not want all the cell metadata to be read into memory, you can read specific columns by specifying the attrs parameter. We can use this metadata to visualize the tissue structure by specifically plotting where each cell is in physical coordinate space.

cellCoords <- tiledb_scdataset$somas$RNA$obs$to_dataframe(
  attrs = c("x_FOV_px", "y_FOV_px", "x_slide_mm", "y_slide_mm", 
            "slide_ID_numeric", "Run_Tissue_name", "fov"))
head(cellCoords)
##              x_FOV_px y_FOV_px x_slide_mm y_slide_mm slide_ID_numeric Run_Tissue_name fov
## c_1_100_1          42       36    8.70804    9.73368                1     NormalLiver 100
## c_1_100_10       2737       25    9.03144    9.73500                1     NormalLiver 100
## c_1_100_100      2888      409    9.04956    9.68892                1     NormalLiver 100
## c_1_100_1000     3457     3742    9.11784    9.28896                1     NormalLiver 100
## c_1_100_1001     1298     3763    8.85876    9.28644                1     NormalLiver 100
## c_1_100_1002     1001     3738    8.82312    9.28944                1     NormalLiver 100
ggplot(cellCoords, aes(x=x_slide_mm, y=y_slide_mm))+
  geom_point(alpha = 0.05, size = 0.01)+
  facet_wrap(~Run_Tissue_name)+
  coord_equal()+
  labs(title = "Cell coordinates in XY space")

2.1.1.3 Target Transcript Coordinates

Transcript coordinates are x-y locations of individual transcript. Note that one cell contains many transcripts. These are currently stored in obsm slot. In future releases transcript coordinates will be stored in the uns slot: tiledb_scdataset$somas$RNA$uns$members$transcriptCoords

To read the transcript coordinated into memory:

transcriptCoords <- tiledb::tiledb_array(
  tiledb_scdataset$somas$RNA$obsm$members$transcriptCoords$uri,
  return_as="data.frame")[]
head(transcriptCoords)
##   slideID fov x_FOV_px y_FOV_px z_FOV_slice  target CellId     cell_id  CellComp
## 1       1  15     1375        9           1   HMGN2   1271 c_1_15_1271   Nuclear
## 2       1  15       25       10           6  MALAT1      1    c_1_15_1   Nuclear
## 3       1  15       25       10           5    XBP1      1    c_1_15_1   Nuclear
## 4       1  15       36       10           3 SLCO2B1      1    c_1_15_1   Nuclear
## 5       1  15      167       10           6   RPL32     23   c_1_15_23 Cytoplasm
## 6       1  15      201       10           7   APOA1     23   c_1_15_23 Cytoplasm

One can visualize this by plotting transcripts from an individual sample (flow cell 1, normal liver) and a specific region within the tissue (FOV 15). In the graph below the centroid of each cell is shown as a black point and each transcript is shown as a colored point.

slide <- 1
fov <- 15

slideName <- unique(cellCoords$Run_Tissue_name[cellCoords$slide_ID_numeric == 
                                                 slide])

fovCoords <- cellCoords[cellCoords$slide_ID_numeric == slide & 
                          cellCoords$fov == fov,]
fovTranscriptCoords <- transcriptCoords[transcriptCoords$slideID == slide & 
                                          transcriptCoords$fov == fov,]

targetCounts <- table(fovTranscriptCoords$target)

targets <- names(targetCounts[which(targetCounts >= 2500 & 
                                      targetCounts <= 3000)])
fovTranscriptCoords <- fovTranscriptCoords[fovTranscriptCoords$target %in% 
                                             targets,]

ggplot(fovCoords, aes(x=x_FOV_px, y=y_FOV_px))+
  geom_point(alpha = 0.6, size = 0.1, color = "black")+
  geom_point(data = fovTranscriptCoords, 
             mapping = aes(x=x_FOV_px, 
                           y=y_FOV_px, 
                           color = target), 
             size = 0.3, alpha = 0.2)+
  theme_bw()+
  coord_equal()+
  guides(colour = guide_legend(override.aes = list(size=1,
                                                   alpha=1)))+
  labs(color = "RNA Target", title = paste0("RNA Transcripts in\n", 
                                            slideName, "\nFOV", fov))

2.1.1.4 Analysis Results

In addition to storing data about count information, the AtoMx SIP enables downstream analysis as well including methods such as differential expression and ligand-receptor analysis. Metadata from these analyses can also be included in the exported results from AtoMx. See Analysis Results Section for full description of each analysis module that can be run in AtoMx SIP.

The results from analysis modules can be found in slots dependent on their output shape. They follow the AnnData object shape shown above.

slot type of data results
obs cell metadata
obsm data on cells requiring more than 1 column PathwayCellsAUC, dimreduction_pca, dimreduction_approximateumap, latest.fovs, RNA_nbclust_loglikes, RNA_normalized_cluster_normExpr, cellScores, transcriptCoords, RNA_normalized_markers_result, RNA_normalized_top_markers, dimreduction_approximateUMAP_bySlide
obsp paired result with cells on rows and columns graph_RNA_pca_snn, graph_RNA_pca_nn, spatial_distances, RNA_spatialDE_W
var target metadata
varm data on targets requiring more than 1 column dimreduction_pca
varp paired result with targets on rows and columns RNA_LeesLRes, RNA_nbclust_log_likelyhood_profiles
uns unstructured data, doesn’t have to follow cells or targets spatEnrichRes, cellTypeScores, cellTypePairs, linkScores, cellProximity

Transcript Coordinates are currently in obsm but will be moved to uns in a later release

Here is a small example on how to read in two results arrays. All results follow a similar readin pattern, the only thing that needs to change is the slot and result matrix names:

tiledb_scdataset$somas$RNA$<put slot name here>$members$<put result matrix name here>$to_matrix()

To get the result for PCA, note from the table above, PCA is saved in the obsm slot and the name of the result matrix is dimreduction_pca:

tiledb_scdataset$somas$RNA$obsm$members$dimreduction_pca$to_matrix()

pca <- tiledb_scdataset$somas$RNA$obsm$members$dimreduction_pca$to_matrix()
pca[1:4,1:4]
##                   PCA_1       PCA_2       PCA_3      PCA_4
## c_1_100_1    -1.6417182 -2.33161620 -2.99827387 -0.1049817
## c_1_100_10   -1.1434024 -0.08528228 -1.04553850  0.3209358
## c_1_100_100  -7.0135208  0.52200934  0.09413956  0.1917221
## c_1_100_1000 -0.2238483 -2.22115908 -0.78119275  1.5374492
neighbs <- tiledb_scdataset$somas$RNA$obsp$members$graph_RNA_pca_nn$to_seurat_graph()
neighbs[1:4,1:25]
## 4 x 25 sparse Matrix of class "dgCMatrix"
##                                                               
## c_1_100_1    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 . . . . .
## c_1_100_10   . . . . . . . . . . . . . . . . . . . . 1 1 1 1 1
## c_1_100_100  . . . . . . . . . . . . . . . . . . . . . . . . .
## c_1_100_1000 . . . . . . . . . . . . . . . . . . . . . . . . .

2.1.1.5 Read Data into Seurat

The TileDB objects can also be read into Seurat objects. The TileDBsc native to_seurat function does not read in data from slots like uns so some analysis results will not be attached to the Seurat object and need to be read in individually from the TileDB array like shown above. NanoString’s internal pipeline, not released publicly, has a Seurat converter that attaches all results to Seurat object. The Seurat Object section below uses this custom function so Seurat objects will be different between sections. We currently recommend using the exported Seurat object, but will be making functions available for use with the TileDB object to enable seamless conversion in future releases.

# Seurat object with all SOMAs 
# seurat <- tiledb_scdataset$to_seurat(batch_mode = TRUE)
# seurat

RNA_seurat <- tiledb_scdataset$to_seurat(somas = c("RNA"), batch_mode = TRUE)
RNA_seurat
## An object of class Seurat 
## 1000 features across 793318 samples within 1 assay 
## Active assay: RNA (1000 features, 0 variable features)
##  3 dimensional reductions calculated: pca, approximateumap, approximateUMAP_bySlide

These Seurat objects can be used in Seurat functions for calculations or visualizations.

RNA_seurat@meta.data <- metadata
Idents(RNA_seurat) <- RNA_seurat$cellType

markers <- FindMarkers(RNA_seurat, ident.1 = unique(Idents(RNA_seurat))[1],
                       logfc.threshold = 0.25, test.use = "roc",
                       only.pos = TRUE)

VlnPlot(RNA_seurat, features = head(rownames(markers), 2),
        log = TRUE, pt.size = 0)

While a Seurat object might be easier to use, please be aware that reading an entire dataset into memory can be RAM exhaustive. This Seurat object requires 3.16 GB of RAM vs the TileDB array which only uses 3.26 MB at loading.

2.1.2 Reading into Python

TileDB objects can also be read into python.

This section is only used in R to set up python environment.

# install packages to use python in R
if (!require("rminiconda", quietly = TRUE))
  remotes::install_github("hafen/rminiconda")

library(rminiconda)
library(reticulate)

# create python environment
if(!rminiconda::is_miniconda_installed("tiledb_python")){
  rminiconda::install_miniconda(version = 3, name = "tiledb_python")
}

# use created python environment
py <- rminiconda::find_miniconda_python("tiledb_python")
reticulate::use_python(py, required = TRUE)

# install necessary python packages into environment
for(i in c("tiledb", "anndata", "tiledbsoma==0.1.19", "scanpy", "gc-python-utils")){
  rminiconda::rminiconda_pip_install(pkg_name = i, name = "tiledb_python")
}

The rest of this section is in python.

If your dataset is on S3, you will need to run the commented out code for TileDB to be able to interact with AWS.

import tiledb
import tiledbsoma
from anndata import AnnData
import scanpy as sc

s3Region = "us-west-2"
tiledbURI = "LiverDataRelease_TileDBArray/"

# set up s3 environment
config = tiledb.Config()
# config.update({"vfs.s3.region" : s3Region})
ctx = tiledb.Ctx(config)

# read in SOMACollection
pySoma = tiledbsoma.SOMACollection(tiledbURI, ctx=ctx)
pySoma.keys()
## ['uns', 'falsecode', 'negprobes', 'RNA', 'RNA_normalized']

2.1.2.1 Cell-by-target counts

All matrices in TileDB are stored as sparse matrices. This structure is used for querying the data.

Count matrices currently have targets on rows and cells on columns like a standard Seurat object. In future TileDB releases, these will be transposed to look more like AnnData. This change will decrease load times in TileDB but any coercion to Seurat will not be affected.

2.1.2.1.1 Raw Counts
counts = pySoma['RNA'].X['counts'].csr()
counts
## <793318x1000 sparse matrix of type '<class 'numpy.float64'>'
##  with 146503383 stored elements in Compressed Sparse Row format>
2.1.2.1.2 Normalized Counts

This data was normalized using Pearson Residuals to account for library size factors to ensure that cell specific total transcript abundance and distribution of counts, which may vary some between FOVs and samples. See Normalization Section for more details.

In future releases normalized data will be located pySoma['RNA'].X['data'] and not in a separate SOMA.

norm = pySoma['RNA_normalized'].X['data'].csr()
norm
## <793318x1000 sparse matrix of type '<class 'numpy.float64'>'
##  with 793317999 stored elements in Compressed Sparse Row format>

2.1.2.2 Cell Metadata

Cell-level metadata are stored in obs slot. It consists of cell information read from the output of CosMx SMI. Some worth noting are the x-y locations, flow cell, fov and other cell stats from the Cell_Stats csv files from raw data inputs. The obs slot will also store results from module run from the pipeline. Some examples are the QC module and clustering modules. The code below can be used to retireve and to store the cell metadata from obs slot into memory.

obs = pySoma['RNA'].obs.df()
obs.head()
##              RNA_pca_cluster_default  ...    niche
## obs_id                                ...         
## c_1_100_1                         16  ...  Zone_2a
## c_1_100_10                        13  ...  Zone_3a
## c_1_100_100                        3  ...  Zone_2a
## c_1_100_1000                      13  ...  Zone_2b
## c_1_100_1001                      17  ...  Zone_2b
## 
## [5 rows x 63 columns]

2.1.2.3 Target Transcript Coordinates

Transcript coordinates are x-y locations of individual transcript. Note that one cell contains multiple transcripts. These are currently stored in obsm slot. In future releases transcript coordinates will be located here: pySoma['RNA'].uns["transcriptCoords"].uri

transcriptCoords = tiledb.open_dataframe(pySoma['RNA'].obsm["transcriptCoords"].uri, ctx=ctx)
transcriptCoords
##          slideID  fov  x_FOV_px  ...  CellId      cell_id   CellComp
## 0              1   15      1375  ...    1271  c_1_15_1271    Nuclear
## 1              1   15        25  ...       1     c_1_15_1    Nuclear
## 2              1   15        25  ...       1     c_1_15_1    Nuclear
## 3              1   15        36  ...       1     c_1_15_1    Nuclear
## 4              1   15       167  ...      23    c_1_15_23  Cytoplasm
## ...          ...  ...       ...  ...     ...          ...        ...
## 1409703        2   15      2226  ...     590   c_2_15_590   Membrane
## 1409704        2   15      2237  ...     590   c_2_15_590  Cytoplasm
## 1409705        2   15      2477  ...     278   c_2_15_278   Membrane
## 1409706        2   15      2497  ...     285   c_2_15_285  Cytoplasm
## 1409707        2   15      3143  ...     586   c_2_15_586  Cytoplasm
## 
## [1409708 rows x 9 columns]

2.1.2.4 Analysis Results

See R read in section for full list of analysis results and Analysis Results Section for full description of module that can be run in AtoMx SIP.

pcad = pySoma['RNA'].obsm['dimreduction_pca'].df()
pcad.head()
##                  PCA_1     PCA_2     PCA_3  ...    PCA_48    PCA_49    PCA_50
## obs_id                                      ...                              
## c_1_100_1    -1.641718 -2.331616 -2.998274  ... -0.208275  0.535591 -0.390931
## c_1_100_10   -1.143402 -0.085282 -1.045539  ... -0.334928 -0.035539  0.606723
## c_1_100_100  -7.013521  0.522009  0.094140  ... -1.244416 -0.536143  1.016428
## c_1_100_1000 -0.223848 -2.221159 -0.781193  ... -2.616751  0.784868 -0.295499
## c_1_100_1001 -3.051702 -4.481670  0.119539  ...  0.903006  0.003687  1.911351
## 
## [5 rows x 50 columns]

This data can easily be coerced into AnnData format that can be plugged into squidpy or other python packages.

coordinates = obs[["x_slide_mm", "y_slide_mm"]]
adata = AnnData(norm, obs = obs, obsm={"spatial": coordinates}, dtype = "float32")
adata
## AnnData object with n_obs × n_vars = 793318 × 1000
##     obs: 'RNA_pca_cluster_default', 'RNA_pca_cluster_default.1', 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_negprobes', 'nFeature_negprobes', 'nCount_falsecode', 'nFeature_falsecode', 'fov', 'Area', 'AspectRatio', 'Width', 'Height', 'Mean.PanCK', 'Max.PanCK', 'Mean.CK8.18', 'Max.CK8.18', 'Mean.Membrane', 'Max.Membrane', 'Mean.CD45', 'Max.CD45', 'Mean.DAPI', 'Max.DAPI', 'cell_id', 'assay_type', 'Run_name', 'slide_ID_numeric', 'Run_Tissue_name', 'Panel', 'Mean.Yellow', 'Max.Yellow', 'Mean.CD298_B2M', 'Max.CD298_B2M', 'cell_ID', 'x_FOV_px', 'y_FOV_px', 'x_slide_mm', 'y_slide_mm', 'propNegative', 'complexity', 'errorCtEstimate', 'percOfDataFromError', 'qcFlagsRNACounts', 'qcFlagsCellCounts', 'qcFlagsCellPropNeg', 'qcFlagsCellComplex', 'qcFlagsCellArea', 'median_negprobes', 'negprobes_quantile_0.9', 'median_RNA', 'RNA_quantile_0.9', 'nCell', 'nCount', 'nCountPerCell', 'nFeaturePerCell', 'propNegativeCellAvg', 'complexityCellAvg', 'errorCtPerCellEstimate', 'percOfDataFromErrorPerCell', 'qcFlagsFOV', 'cellType', 'niche'
##     obsm: 'spatial'

2.1.3 TileDB Documentation

Other documentation for accessing TileDB objects can be found here:

2.2 Seurat Object data

2.2.1 Reading In Object

Because the TileDBsc native to_seurat function does not attach data from slots like uns we also export data in a native Seurat format. The AtoMx SIP export function has a Seurat converter that attaches all results to Seurat object, and future releases will be focused on expanding compatibility of TileDB data with Seurat. Here we will walk through the resulting full Seurat object.

Working off of S3 paths are a little trickier than using local files. In the comments is an example of how to read in an RDS file from S3. If you are using local data, as shown here, you can simply use readRDS()

fullSeuratPath <- "LiverDataReleaseSeurat.RDS"
# bucket <- strsplit(fullSeuratPath,"/")[[1]][3]
# file_key <- strsplit(fullSeuratPath, paste0(bucket,"/"))[[1]][2]
# system2(command = "aws", arg = c("s3api", "get-object","--bucket",bucket,
#                                  "--key", file_key,paste0("/tmp/tmp.RDS")),
#         stdout = FALSE)
# fullSeurat <- readRDS("/tmp/tmp.RDS")

fullSeurat <- readRDS(fullSeuratPath)
fullSeurat
## An object of class Seurat 
## 1197 features across 793318 samples within 3 assays 
## Active assay: RNA (1000 features, 0 variable features)
##  2 other assays present: falsecode, negprobes
##  3 dimensional reductions calculated: pca, approximateumap, approximateUMAP_bySlide

While a Seurat object can be easier to use, please be aware that reading an entire dataset into memory can be RAM exhaustive. This full Seurat object, containing all of the count data, metadata, and analysis results data is 70.87 GB in size.

2.2.1.1 Cell-by-target counts

2.2.1.1.1 Raw Counts
fullSeurat[["RNA"]]@counts[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##       c_1_100_10 c_1_100_1078 c_1_100_1135 c_1_100_267
## AATK           .            1            .           .
## ABL1           .            2            .           .
## ABL2           .            .            .           .
## ACACB          .            2            .           1
2.2.1.1.2 Normalized Counts

This data was normalized using Pearson Residuals to account for library size factors to ensure that cell specific total transcript abundance and distribution of counts, which may vary some between FOVs and samples. See Normalization Section for more details.

One major difference between our custom toSeurat function is that the normalized data is in the expected location for later releases. It is located in the RNA assay under the data slot instead of in its own assay.

fullSeurat[["RNA"]]@data[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##        c_1_100_10 c_1_100_1078 c_1_100_1135 c_1_100_267
## AATK  -0.06889529    1.5266149   -0.1577291  -0.3114469
## ABL1  -0.09509172    2.2435234   -0.2176835  -0.4296907
## ABL2  -0.08039924   -0.5759492   -0.1840596  -0.3633908
## ACACB -0.14233400    0.9318722   -0.3257528   0.9076156

2.2.1.2 Cell Metadata

fullSeurat@meta.data[1:4,1:10]
##              RNA_pca_cluster_default RNA_pca_cluster_default.1 orig.ident nCount_RNA nFeature_RNA nCount_negprobes nFeature_negprobes nCount_falsecode
## c_1_100_10                        13                        15          c         33           26                0                  0                5
## c_1_100_1078                       7                        11          c       1699          295                0                  0                7
## c_1_100_1135                      12                        10          c        173          131                1                  1                9
## c_1_100_267                       15                        17          c        675          204                2                  2                3
##              nFeature_falsecode fov
## c_1_100_10                    5 100
## c_1_100_1078                  7 100
## c_1_100_1135                  8 100
## c_1_100_267                   3 100

2.2.1.3 Target Transcript Coordinates

head(fullSeurat@misc$transcriptCoords)
##   slideID fov x_FOV_px y_FOV_px z_FOV_slice       target CellId   cell_id  CellComp
## 1       1   1      162     2985           6 FalseCode142    473 c_1_1_473   Nuclear
## 2       1   1     1912     3564           7 FalseCode148    313 c_1_1_313   Nuclear
## 3       1   1     2025     3712           7 FalseCode169    322 c_1_1_322 Cytoplasm
## 4       1   1     2249     2857           6  FalseCode36    472 c_1_1_472   Nuclear
## 5       1   1     2309     2596           7  FalseCode50    461 c_1_1_461   Nuclear
## 6       1   1     2580     3272           6 FalseCode136    479 c_1_1_479   Nuclear

2.2.1.4 Analysis Results

See Analysis Results Section for full description of module that can be run in AtoMx SIP.

slot type of data results
meta.data cell metadata
graphs nearest neighbor graphs RNA_pca_nn, RNA_pca_snn, spatialDE_W, spatial_distances
reductions DimReduc objects pca, approximateumap, approximateUMAP_bySlide
misc additional data on cells transcriptCoords, latest.fovs, MarkerID, InSituSort_cell_loglikes, cellProximity, LigandReceptor, PathwayCellsAUC, preDE, de
assay@misc additional data on target InSituSort_log_likelyhood_profiles, SpatialDE_LeesLRes
assay@meta.features target metadata

2.2.2 Seurat Documentation

2.3 Comparison of Methods

In this section we will demonstrate how to perform several common actions using TileDB in R and python, as well as with Seurat in R.

2.3.1 UMAP

This section shows how to plot a UMAP colored by the cell type metadata column. We create a simple plotting function to be used in multiple sections below (plotting).

colorColumn = "cellType"

Read in UMAP from TileDB R

umap_TileDB <- as.data.frame(tiledb_scdataset$somas$RNA$obsm$members$dimreduction_approximateumap$to_matrix())
umap_TileDB$colorBy <- tiledb_scdataset$somas$RNA$obs$to_dataframe(attrs = colorColumn)[[1]]

Read in UMAP from TileDB python

umap_py = pySoma['RNA'].obsm['dimreduction_approximateumap'].df()
metadata = pySoma['RNA'].obs.df()
umap_py["colorBy"] = metadata[r.colorColumn]

Read in UMAP from Seurat

umap_Seurat <- as.data.frame(fullSeurat@reductions$approximateumap@cell.embeddings)
umap_Seurat$colorBy <- fullSeurat[[colorColumn]][match(rownames(umap_Seurat), 
                                                       rownames(fullSeurat@meta.data)), 1]

umap_Seurat$APPROXIMATEUMAP_2 <- -umap_Seurat$APPROXIMATEUMAP_2

Note that the Seurat UMAP coordinates identical to the TileDB, but UMAP2 has a sign flip in the coordinates we will correct that here.

colrs <- brewer.pal.info[brewer.pal.info$colorblind == TRUE, ]
colorSchemes <- c("PuOr", "Dark2", "Set2", "BrBG")
colrs <- colrs[colorSchemes,]
col_vec <- unique(unlist(mapply(brewer.pal, colrs$maxcolors, colorSchemes)))

col_vec <- col_vec[-grep(col_vec, pattern = "#F|#E|#D")]

plotting <- function(data, title, Xcol, Ycol, Xname, Yname, color, 
                     size = 0.02, alpha = 0.05){
  gp <- ggplot(data, aes(x = .data[[Xcol]], y = .data[[Ycol]], 
                         color = .data[[color]]))+
    geom_point(size = size, alpha = alpha)+
    coord_equal()+
    labs(title = title, color = colorColumn,
         x = Xname, y = Yname)+
    scale_color_manual(values = col_vec)+
    guides(colour = guide_legend(override.aes = list(size=1,
                                                     alpha=1)))
  
  return(gp)
}

umapPlot <- function(data, title, colorBy = "colorBy", ...){
  return(plotting(data, title, Xcol = "APPROXIMATEUMAP_1", 
                  Ycol = "APPROXIMATEUMAP_2", Xname = "UMAP1", 
                  Yname = "UMAP2", color = colorBy, ...))
}

pcaPlot <- function(data, title, colorBy = "colorBy", ...){
  return(plotting(data, title, Xcol = "PCA_1", 
                  Ycol = "PCA_2", Xname = "PCA_1", 
                  Yname = "PCA_2", color = colorBy, ...))
}

xySlidePlot <- function(data, title, colorBy = "colorBy", ...){
  return(plotting(data, title, Xcol = "x_slide_mm", 
                  Ycol = "y_slide_mm", Xname = "x_slide_mm", 
                  Yname = "y_slide_mm", color = colorBy, ...))
}

All of the UMAPs are the same data regardless of how you read in the data.

Readin method Time Memory
TileDB R 1.234 seconds 22.3 MB
TileDB python 5.592 seconds 19.04 MB
Seurat 14.848 minutes 70.87 GB
umapGP_TileDB <- umapPlot(umap_TileDB, "TileDB - R")
umapGP_python <- umapPlot(reticulate::py$umap_py, "TileDB - python")
umapGP_Seurat <- umapPlot(umap_Seurat, "Seurat")

ggpubr::ggarrange(umapGP_TileDB, umapGP_python, umapGP_Seurat,
          common.legend = TRUE, legend = "right", ncol = 1)

3 NanoString Analysis Pipeline

These are examples of current and future analysis modules available in AtoMx SIP. They provide analytical methods for adding results to a TileDB study. Please note that each study may only use a subset of these modules, and that users can create their own analysis modules which may have different formats and data specifications. Each section below shows how to work with each of the resulting features in the TileDB and Seurat objects.

3.1 Load data from flat files

3.2 Spatial Network

  • Create a network or graph structure of the physical distribution of cells. Cells are converted to nodes in the graph, and connections between cells (e.g. nearest neighbors) are represented as edges. Network can be built in one of three ways: radius-based (all cells connected within a given radius), nearest neighbors, or Delaunay triangulation.
  • Output:
    • TileDB: tiledb_scdataset$somas$RNA$obsp$members$spatial_distances
    • Seurat: seurat@graphs$spatial_distances

3.3 Quality Control

  • This module is to flag unreliable negative probes, cells, FOVs, and target genes. The user can choose to remove those flagged negative probes, cells, target genes, and FOVs and generate a filtered dataset which is the input of the down-stream analyses.
  • Output:
    • Failing cells and targets are flagged in metadata and can be auto-removed
      • colnames may change due to changes in QC variables
      • colnames = c(“propNegative”, “complexity”, “errorCtEstimate”, “percOfDataFromError”, “qcFlagsRNACounts”, “qcFlagsCellCounts”, “qcFlagsCellPropNeg”, “qcFlagsCellComplex”, ” qcFlagsCellArea”, “median_negprobes”, “negprobes_quantile_0.9”, “median_RNA”, “RNA_quantile_0.9”, “nCell”, “nCount”, “nCountPerCell”, “nFeaturePerCell”, “propNegativeCellAvg”, ” complexityCellAvg”, “errorCtPerCellEstimate”, “percOfDataFromErrorPerCell”, “qcFlagsFOV”)

3.4 Normalization

  • Normalization is based on the concept of adjusting for library size factors to ensure that cell specific total transcript abundance and distribution of counts, which may vary some between FOVs and, more dramatically, between samples, does not influence downstream visualization and analysis of data. There are two normalization methods available:
    • Pearson’s residual normalization (default) is based on the estimated mean and variance: (raw counts - mean)/sd
    • The alternative normalization method is based on total count size factors: raw counts/total counts
  • Output:
    • TileDB (current): tiledb_scdataset$somas$RNA_normalized$X$members$data
    • TileDB (future): tiledb_scdataset$somas$RNA$X$members$data
    • Seurat: seurat[["RNA"]]@data

3.5 Principal Component Analysis (PCA)

  • Principal component analysis (PCA) provides an orthogonally constrained dimensional reduction analysis of the count data across all cells in a dataset. It produces an output values (PCs) which represent axes of variation within the data which are a combined value of weighted expression within a given cell. PCs are ordered by decreasing variation explained in the data. These can be used to better understand variation within a dataset, but is most commonly used in single-cell analysis as an input for the UMAP analysis.
  • Output:
    • TileDB: tiledb_scdataset$somas$RNA$obsm$members$dimreduction_pca
    • Seurat: seurat@reductions$pca
pca <- as.data.frame(tiledb_scdataset$somas$RNA$obsm$members$dimreduction_pca$to_matrix())
pca$colorBy <- tiledb_scdataset$somas$RNA$obs$to_dataframe(attrs = colorColumn)[[1]]

pcaPlot(pca, "PCA")

3.6 Uniform Manifold Approximation and Projection (UMAP)

  • UMAP provides a means to visualize high-plex complex datasets in 2-dimensional space using a non-linear approach estimating related groups of cells or features. This method is a common way of visualizing single-cell data to identify clusters of related cells which may be from the same lineage.
  • Output:
    • TileDB: tiledb_scdataset$somas$RNA$obsm$members$dimreduction_approximateumap
    • Seurat: seurat@reductions$approximateumap
umap_TileDB <- as.data.frame(tiledb_scdataset$somas$RNA$obsm$members$dimreduction_approximateumap$to_matrix())
umap_TileDB$colorBy <- tiledb_scdataset$somas$RNA$obs$to_dataframe(attrs = colorColumn)[[1]]
umapPlot(umap_TileDB, "TileDB - R", size = 0.2)

3.7 InSituType cell typing

  • A cell typing algorithm designed for statistical and computational efficiency in spatial transcriptomics data. Insitutype is based on a likelihood model that weighs the evidence from every expression value, extracting all the information available in each cell’s expression profile. This likelihood model underlies a Bayes classifier for supervised cell typing, and an Expectation-Maximization algorithm for unsupervised and semi-supervised clustering. Insitutype also leverages alternative data types collected in spatial studies, such as cell images and spatial context, by using them to inform prior probabilities of cell type calls.
  • Output:
    • Added metadata column with cell type & target log likelyhood profiles
      • colnames = c(“RNA_nbclust_clusters”, “RNA_nbclust_posterior_probability”)
    • TileDB: tiledb_scdataset$somas$RNA$varp$members$RNA_nbclust_log_likelyhood_profiles
    • Seurat: seurat[["RNA"]]@misc$InSituSort_log_likelyhood_profiles
  • Pre-print
  • For this dataset, we used the Liver_HCA profile matrix generated using Human Cell Atlas data. MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat Commun 9, 4383 (2018).
slideMetadata <- metadata[metadata$Run_Tissue_name == slideName & 
                            metadata$fov %in% c(1:5, 22:26, 43:47),]

xySlidePlot(data = slideMetadata, title = "Cell Types in Space",
            colorBy = "cellType", size = 0.3, alpha = 0.5)

3.8 Nearest Neighbor

  • Construct a KNN (k-Nearest Neighbor) graph default is based on the eucledian distance in PCA space and then construct the SNN (Shared Nearest Network) graph with edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard distance) and pruning of distant edges.
  • Output:
    • TileDB: tiledb_scdataset$somas$RNA$obsp$members$graph_RNA_pca_nn & tiledb_scdataset$somas$RNA$obsp$members$graph_RNA_pca_snn
    • Seurat: seurat@graphs$RNA_pca_nn & seurat@graphs$RNA_pca_snn

3.9 Leiden Clustering

  • Leiden clustering is an unsupervised clustering method that is used to identify groups of cells which are related based on how similar they are in a graph structure. Clusters are defined by moving cells to identify groups of cells that can be aggregated without changing the overall relationship of the graph and looking for unstable nodes which serve as bridges between related communities to help define the boundaries of different clusters.
  • Output:
    • Added metadata column with cluster
      • colnames may change due to changes in clustering variables
      • colnames = "RNA_pca_cluster_default"
xySlidePlot(data = slideMetadata, title = "Cell Clusters in Space",
            colorBy = "RNA_pca_cluster_default",
            size = 0.3, alpha = 0.5)+
  labs(color = "Cell Cluster")

3.10 Neighborhood Analysis

  • Identify distinct cellular neighborhood clusters based on cell type composition across tissue. This module helps define the structural composition of a tissue automatically by looking for regional differences in cell type composition. Structures can be repeated structures that are frequently found within a tissue but which are not contiguous (e.g. glomeruli in the kidney, germinal centers in the lymph node) or which are physically connected across a tissue (e.g. epithelial layer in the colon).
  • Output:
    • added metadata columns with niche and neighborhood matrix
      • colnames = c(“RNA_spatialClusteringNeighbours_*“,”spatialClusteringAssignments”)
  • Niches must be named by user, NanoString’s pathologist named these niches.
xySlidePlot(data = slideMetadata, title = "Cell Neighborhoods in Space",
            colorBy = "niche",
            size = 0.3, alpha = 0.5)+
  labs(color = "Cell Niche")

3.11 Spatial Expression Analysis

  • This module is used to identify genes which have a spatial distribution that is non-uniform/autocorrelated throughout a tissue, and which may be associated with specific tissue structures, microenvironment niches, or cell types. Use this module to look for genes which carry spatially-relevant information. The module also measures associated spatial expression between genes which can be used to group genes into different spatial expression patterns. The two statistics calculated related to spatial expression patterns are Moran’s I and Lee’s L.
  • Output:
    • var columns I, pval & padj
    • TileDB: tiledb_scdataset$somas$RNA$obsp$members$spatialDE_W, tiledb_scdataset$somas$RNA$varp$members$RNA_LeesLRes, & tiledb_scdataset$somas$RNA$var
    • Seurat: seurat@graphs$spatialDE_W, seurat[["RNA"]]@misc$SpatialDE_LeesLRes, & seurat[["RNA"]]@meta.features
LeesLRes <- tiledb_scdataset$members$RNA$varp$get_member("RNA_LeesLRes")$to_dataframe()

# reformat the names of genes (e.g. change 4-1BB to 4.1BB)
LeesLRes$var_id_i <- gsub('-', '.', LeesLRes$var_id_i)
LeesLRes$var_id_i <- gsub('/', '.', LeesLRes$var_id_i)
LeesLRes$var_id_j <- gsub('X4.1BB', '4.1BB', LeesLRes$var_id_j)

# convert to square matrix
namegenes <- sort(unique(unlist(LeesLRes[1:2])))
LeesLRes_matrix <- matrix(0, length(namegenes), length(namegenes),
                          dimnames = list(namegenes, namegenes))
LeesLRes_matrix[as.matrix(LeesLRes[c("var_id_i", "var_id_j")])] <-
  LeesLRes[["value"]]

# heatmap
leesHeatmap <- function(spat_lees_l, n_clust) {

  spat_l_hm <- pheatmap::pheatmap(as.matrix(spat_lees_l), scale="none",
                                  color=viridis::viridis(100, option="D"),
                                  border_color=NA, show_rownames=TRUE,
                                  cluster_rows=TRUE, cluster_cols=TRUE,
                                  cutree_rows=n_clust, cutree_cols=n_clust,
                                  main="Lee's L Spatial Association")

  return(spat_l_hm)
}

# leesHeatmap(LeesLRes_matrix, 10)

# we plot the top 20 genes with the largest magnitude off-diagonal values.

plot_n = 20
LeesLRes_off <- LeesLRes[LeesLRes$var_id_i != LeesLRes$var_id_j,]
LeesLRes_off <- LeesLRes_off[order(abs(LeesLRes_off$value), 
                                   decreasing = TRUE),]
plot_genes <- LeesLRes_off[!duplicated(LeesLRes_off$var_id_i),][1:plot_n,
                                                               "var_id_i"]
leesHeatmap(LeesLRes_matrix[plot_genes, plot_genes], 5)

3.12 Cell Type Co-localization

  • This module examines the tendency of different cell types to be located near each other. Each pair of cell types defined from supervised or unsupervised clustering is tested using Ripley’s K-function (a function of the distance between the different cell types) for whether the cells’ spatial distribution differs from a theoretical Poisson point process where a cell’s location is not dependent on another cell’s location. The results are summarized in a heatmap indicating which cell types tend to cluster together or isolate from each other. In addition, a more granular view is shown when plotting the pair correlation function for a given cell type pairing as a function of the radius which can reveal specific distances at which the cells of each type are co-localized.
  • Output:
    • TileDB: tiledb_scdataset$somas$RNA$uns$members$cellProximity
    • Seurat: seurat@misc$cellProximity
cell_pairs <- tiledb::toMatrix(tiledb_scdataset$members$RNA$uns$members$cellProximity$list_object_uris()[1])

# generate heatmap
pheatmap::pheatmap(cell_pairs, cluster_cols = TRUE, cluster_rows = TRUE,
                   breaks = seq(-2, 2, length.out = 100),
                   color = colorRampPalette(c("blue", "white", "red"))(100),
                   show_rownames = TRUE, show_colnames = TRUE)

3.13 Marker Genes

  • This module will identify marker genes associated with each cell type or cluster previously identified within a dataset. This module looks for genes which are expressed above background consistently, but also most specifically restricted to each cell type or cluster within the dataset. The module acts on each gene independently. This module may also be used to look for marker genes in neighborhoods that have been identified, but these genes will be related to the overall cellular composition of those neighborhoods.
  • Output:
    • TileDB: tiledb_scdataset$somas$RNA$obsm$members$RNA_normalized_markers_result & tiledb_scdataset$somas$RNA$obsm$members$RNA_normalized_top_markers
    • Seurat: seurat@misc$MarkerID
# Extract gene X cluster matrix for normalized expression of selected markers
cluster_normExpr <- tiledb_scdataset$somas$RNA$obsm$members$RNA_normalized_cluster_normExpr$to_matrix()
rownames(cluster_normExpr) <- rownames(tiledb_scdataset$somas$RNA$var$to_dataframe())
topMarkers <- as.data.frame(tiledb_scdataset$somas$RNA$obsm$members$RNA_normalized_top_markers$to_matrix())
topMarkers <- topMarkers$feats[as.numeric(topMarkers$final_rank) %in% 1:50]
cluster_normExpr <- cluster_normExpr[rownames(cluster_normExpr) %in%
                                       unique(topMarkers),]

# Heatmap
pheatmap::pheatmap(cluster_normExpr,
                   color = colorRampPalette(c("blue", "white", "red"))(100),
                   scale = 'row',
                   fontsize = 10, cellwidth = 10, border_color = NA,
                   show_rownames = TRUE, show_colnames = TRUE)

3.14 Ligand-Receptor Analysis

  • Score pairs of cells and individual cells for ligand-receptor signaling, and explore results. Ligand-receptor target expression in adjacent cells is used to calculate a co-expression score. A test is then performed to determine if the overall average of the scores for each ligated-receptor pair is enriched by the spatial arrangement of cells.
  • Output:
    • TileDB: tiledb_scdataset$somas$RNA$uns$members$spatEnrichRes & tiledb_scdataset$somas$RNA$uns$members$cellTypePairs & tiledb_scdataset$somas$RNA$uns$members$linkScores& tiledb_scdataset$somas$RNA$uns$members$cellTypeScores & tiledb_scdataset$somas$RNA$obsm$members$cellScores
    • Seurat: seurat@misc$LigandReceptor
## Retrieval of results
# For cellScores result from obsm folder
cellScores <- tiledb_scdataset$somas$RNA$obsm$members$cellScores$to_matrix()

# For LR results from UNS
uri_use <- tiledb_scdataset$members$RNA$uns$members$cellTypeScores$uri
cellTypeScores <- tiledb::tiledb_array(uri = uri_use, return_as = "data.table")[]

uri_use <- tiledb_scdataset$members$RNA$uns$members$cellTypePairs$uri
cellTypePairs <- tiledb::tiledb_array(uri = uri_use, return_as = "data.table")[]
cellTypePairs <- cellTypePairs[,-1]
cellTypePairs <- as.matrix(cellTypePairs[,-c("fromCT", "toCT"),with=FALSE], 
                          rownames = cellTypePairs[,paste0(fromCT, "_", toCT)])

pheatmap::pheatmap(cellTypePairs[head(order(matrixStats::rowVars(cellTypePairs),
                                            decreasing = TRUE), 25),
                                 head(order(matrixStats::colVars(cellTypePairs),
                                            decreasing = TRUE), 10)])

uri_use <- tiledb_scdataset$members$RNA$uns$members$spatEnrichRes$uri
spatEnrichRes <- tiledb::tiledb_array(uri = uri_use, return_as = "data.table")[]
maxPadj <- 0.05 #significance level
spatEnrichRes$PadjSig <- spatEnrichRes$padj < maxPadj
spatEnrichRes$lrname <- sapply(1:nrow(spatEnrichRes),
                               function(i){paste0(spatEnrichRes$Ligand[i], "/",
                                                  spatEnrichRes$Receptor[i])})
spatEnrichRes$intScoreRatio <- spatEnrichRes$intScore/spatEnrichRes$intScore_simAvg

ggplot(spatEnrichRes, aes(x=intScoreRatio, y=-log10(padj), color=PadjSig)) +
  geom_point(size=2) +
  labs(x = "L-R Interaction Score Ratio",
       y = "-log10(Padj)",
       color = "Significant",
       title = paste0(" Padj < ", maxPadj)) +
  geom_line(aes(y=-log10(maxPadj), color = "black"),linetype="dashed") + 
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
  geom_text_repel(aes(label=lrname)) +
  theme_bw()

3.15 Signaling Pathways

  • Signaling Pathway analysis is calculated on a per-cell basis using gene sets of pre-defined pathways. The method we use calculated relative enrichment of a pathway of the genes within each pathway. Gene sets which do not have sufficient coverage are excluded from analysis.
  • Output:
    • TileDB: tiledb_scdataset$somas$RNA$obsm$members$PathwayCellsAUC
    • Seurat: seurat@misc$PathwayCellsAUC
pathway <- tiledb_scdataset$members$RNA$obsm$members$PathwayCellsAUC$to_matrix()
maxPath <- order(matrixStats::colSds(pathway[rownames(pathway) %in%
                                                slideMetadata$cell_id,]),
                 decreasing = TRUE)[1]

pathwayMeta <- cbind(metadata, umap_TileDB, maxPathway=pathway[,maxPath])

lims <- range(pathway[,maxPath])
maxPathName <- colnames(pathway)[maxPath]
plot_umap_pathway <- ggplot(pathwayMeta, aes(x = APPROXIMATEUMAP_1 ,
                                            y = APPROXIMATEUMAP_2,
                                            color = maxPathway)) +
  geom_point(size = 0.3, alpha = 0.3) +
  theme_bw() +
  labs(color = maxPathName, title = "UMAP") +
  viridis::scale_color_viridis(limits = lims, option = "A")

plot_xy_pathway <- ggplot(pathwayMeta[rownames(pathwayMeta) %in%
                                        rownames(slideMetadata),],
                         aes(x_slide_mm, y_slide_mm)) +
  geom_point(aes(color = maxPathway), size = 0.2, alpha = 0.3) +
  coord_fixed() +
  viridis::scale_color_viridis(limits = lims, option = "A") +
  labs(color = maxPathName,
       title = paste("Pathway Spatial Plot")) +
  theme_bw()

(plot_umap_pathway | plot_xy_pathway ) +
  plot_layout(guides = "collect", nrow = 2)

3.16 Differential Expression

  • Estimates and summarizes generalized linear (mixed) models for single cell expression. We control for the expression of neighboring cells on a tissue by including ‘neighboring cell expression’ of the analyzed gene as a fixed-effect control variable in the DE model. Controlling for expression in neighboring cells is motivated by the observation that cells in close proximity on a tissue are not independent, and comparisons of DE between groups may be affected by cell-segmentation and cell-type uncertainty. Often, single cell DE analyses may test whether groups are DE within a specific cell-type. In practice, imperfect cell-segmentation can result in ‘contamination’ or ‘bleed-over’ of transcripts from neighboring cells, which can also increase uncertainty in downstream cell-typing analyses. For these reasons, including the expression of the analyzed gene in neighboring cells can be a useful control variable.
  • Output:
    • TileDB: tiledb_scdataset$somas$RNA_normalized$uns$members$prede_* & tiledb_scdataset$somas$RNA_normalized$uns$members$deresults_*
    • Seurat: seurat@misc$preDE & seurat@misc$de
quantile_breaks <- function(xs, n = 10) {
  breaks <- quantile(xs, probs = seq(0, 1, length.out = n))
  breaks[!duplicated(breaks)]
}
all_de_results <- grep("deresults",
                      tiledb_scdataset$members$RNA_normalized$uns$list_member_uris(),
                      value = TRUE)

comparison <- "one.vs.rest" #select comparison
one.vs.rest <- tiledb::tiledb_array(
        uri =grep(comparison, all_de_results, value = TRUE),
        return_as = "data.table")[]

one.vs.rest <- one.vs.rest[which(one.vs.rest$p.value < 0.05),]

one.vs.rest_fc <- reshape2::acast(one.vs.rest, target~contrast,
                                  value.var = "fold_change")
one.vs.rest_fc[is.na(one.vs.rest_fc)] <- 0

one.vs.rest$log10p <- -log10(one.vs.rest$p.value)
one.vs.rest_p <- reshape2::acast(one.vs.rest, target~contrast,
                                 value.var = "log10p")
one.vs.rest_p[is.na(one.vs.rest_p)] <- 0

plot_list <- list()

mat_breaks <- quantile_breaks(one.vs.rest_fc, n = 11)
plot_list[[1]] <- pheatmap::pheatmap(one.vs.rest_fc,
                   color = colorRampPalette(c("white", "blue"))(length(mat_breaks)-1),
                   fontsize = 10, cellwidth = 15, border_color = NA,
                   show_rownames = TRUE, show_colnames = TRUE,
                   main = "Fold Change", cluster_rows = FALSE,
                   silent = TRUE, breaks = mat_breaks)[[4]]

plot_list[[2]] <- pheatmap::pheatmap(one.vs.rest_p,
                   color = colorRampPalette(c("white", "red"))(length(mat_breaks)-1),
                   fontsize = 10, cellwidth = 15, border_color = NA,
                   show_rownames = TRUE, show_colnames = TRUE,
                   main = "-log10(p value)", cluster_rows = FALSE,
                   silent = TRUE, breaks = mat_breaks)[[4]]

do.call("grid.arrange", c(plot_list, ncol =2))

4 Raw Files Folder Structure

4.1 Most Important Files for Analysis

Flowcell Folder/
      Logs/
              SpatialProfiling_[sequence_name].fovs
                      - Cell Coordinates for study, slide number matches SlideNum in next folder
      [Flowcell]_[SlideNum]/
              CellStatsDir/
                      CellComposite/
                              CellComposite_FOV[FOV].jpg
                                      - Composite Colored Image for each FOV
                      CellOverlay/
                              CellOverlay_[FOV].jpg
                                      - B&W image with cell segmentation overlays for each FOV
                      FOV[FOV]/
                              CellLabels_F[FOV].TIF
                                      - cell segmentation labels
                                      - pixel intensity values correspond with the unique cell_id
                              CompartmentLabels_F[FOV].TIF
                                      - cell subcellular compartment labels
                                      - pixel intensity values correspond with the identified compartment label
                                      - Nuclear = 1, Membrane = 2, Cytoplasmic = 3
                              [run_name]_[sequence_name]_S[slot]_Cell_Stats_F[FOV].csv
                                      - cell info from image like area, centerX,Y coordinates and fluorescence values
                      Morphology2D/
                              [sequence_name]_S[slot]_C[cycle]_P[pool]_N[spot]_F[FOV].TIF
                                      - stacked TIFF of fluorescence values
                      Morphology3D/
                              FOV[FOV]
                                      [sequence_name]_S[slot]_C[cycle]_P[pool]_N[spot]_F[FOV]_Z[z].TIF
                                              - stacked TIFF of fluorescence values at each Z plane
                      RnD/
                              Run_[GUID]_[sequence_name]_S[slot]_Summary_F[FOV].csv
                                      - FOV summary statistics
              RunSummary/
                              Run_[GUID]_[date]_S[slot]_[instrument_name]_ExptConfig.txt
                                      - Instrument Config - includes the pixel to nm ratio
              AnalysisResults/[processing_id]/
                      FOV[FOV]/
                              FOV[FOV]_Analysis_Summary.txt
                                      - Limits of Detection
                              Run_[GUID]_FOV[FOV]__complete_code_cell_target_call_coord.csv
                                      - Target coordinates and counts per cell
                                      - other coord files are intermediate files

If you have any questions, please reach out to NanoString:

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8      
##  [8] LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rminiconda_0.0.1     RcppSpdlog_0.0.11    tiledbsc_0.1.5       tiledb_0.17.0        patchwork_1.1.2.9000 pheatmap_1.0.12      reshape2_1.4.4      
##  [8] reticulate_1.27      remotes_2.4.2        RColorBrewer_1.1-3   ggpubr_0.5.0         matrixStats_0.63.0   SeuratObject_4.1.3   Seurat_4.3.0        
## [15] ggrepel_0.9.2        gridExtra_2.3        ggplot2_3.4.0        pryr_0.1.5           BiocStyle_2.22.0    
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.4.1        plyr_1.8.8             igraph_1.3.5           lazyeval_0.2.2         sp_1.5-1               splines_4.1.2          listenv_0.9.0         
##   [8] scattermore_0.8        urltools_1.7.3         digest_0.6.31          htmltools_0.5.4        magick_2.7.3           viridis_0.6.2          fansi_1.0.3           
##  [15] magrittr_2.0.3         tensor_1.5             cluster_2.1.2          ROCR_1.0-11            globals_0.16.2         spatstat.sparse_3.0-0  colorspace_2.0-3      
##  [22] lobstr_1.1.2           xfun_0.36              dplyr_1.0.10           jsonlite_1.8.4         progressr_0.13.0       spatstat.data_3.0-0    survival_3.3-1        
##  [29] zoo_1.8-11             glue_1.6.2             RcppCCTZ_0.2.12        polyclip_1.10-4        gtable_0.3.1           leiden_0.4.3           car_3.1-1             
##  [36] future.apply_1.10.0    spdl_0.0.3             abind_1.4-5            scales_1.2.1           rstatix_0.7.1          spatstat.random_3.0-1  miniUI_0.1.1.1        
##  [43] Rcpp_1.0.9             viridisLite_0.4.1      xtable_1.8-4           bit_4.0.5              htmlwidgets_1.6.1      httr_1.4.4             ellipsis_0.3.2        
##  [50] ica_1.0-3              farver_2.1.1           pkgconfig_2.0.3        sass_0.4.4             uwot_0.1.14            deldir_1.0-6           here_1.0.1            
##  [57] utf8_1.2.2             labeling_0.4.2         tidyselect_1.2.0       rlang_1.0.6            later_1.3.0            munsell_0.5.0          tools_4.1.2           
##  [64] cachem_1.0.6           cli_3.6.0              generics_0.1.3         broom_0.7.12           ggridges_0.5.4         evaluate_0.20          stringr_1.5.0         
##  [71] fastmap_1.1.0          yaml_2.3.6             goftest_1.2-3          knitr_1.41             bit64_4.0.5            fs_1.5.2               fitdistrplus_1.1-8    
##  [78] purrr_1.0.1            RANN_2.6.1             pbapply_1.7-0          future_1.30.0          nlme_3.1-155           mime_0.12              compiler_4.1.2        
##  [85] rstudioapi_0.13        plotly_4.10.1          png_0.1-8              ggsignif_0.6.3         spatstat.utils_3.0-1   tibble_3.1.8           bslib_0.4.2           
##  [92] stringi_1.7.12         nanotime_0.3.7         highr_0.10             lattice_0.20-45        Matrix_1.5-3           vctrs_0.5.1            pillar_1.8.1          
##  [99] lifecycle_1.0.3        BiocManager_1.30.16    spatstat.geom_3.0-3    triebeard_0.3.0        lmtest_0.9-40          jquerylib_0.1.4        RcppAnnoy_0.0.20      
## [106] data.table_1.14.6      cowplot_1.1.1          irlba_2.3.5.1          httpuv_1.6.8           R6_2.5.1               bookdown_0.30          promises_1.2.0.1      
## [113] KernSmooth_2.23-20     parallelly_1.34.0      codetools_0.2-18       MASS_7.3-55            rprojroot_2.0.3        withr_2.5.0            sctransform_0.3.5     
## [120] parallel_4.1.2         grid_4.1.2             tidyr_1.2.1            rmarkdown_2.19         carData_3.0-5          Rtsne_0.16             spatstat.explore_3.0-5
## [127] shiny_1.7.4

For Research Use Only. Not for use in diagnostic procedures.