8 Downloads & Appendix
This section contains additional information on QC, sample processing, and analysis that may help in qualifying the results. A full data analysis report also includes sections for downloading the report R code.
8.1 Sample Intake Appendix
8.1.1 Annotation Distributions
We used the psych::describe function to get an overview of the annotation data. For a description of the function, please see the documentation. Note that summary statistics for categorical variables (i.e., those marked with an asterisk) are computed after first converting them to numeric, so results should be interpreted with caution.
dt_params$buttons <- list(list(extend = "copy"),
list(extend = "csv", filename = "SampleSummaryStatistics"),
list(extend = "excel", filename = "SampleSummaryStatistics"))
round_cols <- c('mean', 'sd', 'median', 'trimmed', 'mad', 'min', 'max', 'range', 'skew', 'kurtosis', 'se')
DT::formatRound(
DT::datatable(
psych::describe(pheno_data),
extensions = c("Buttons", "Scroller", "FixedColumns"),
options = dt_params),
columns = round_cols,
digits = 1)
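The caution about asterisked variables can be seen in a small base-R sketch. The toy `class_toy` factor below is hypothetical and is not taken from the study annotations; it only illustrates that the numeric codes follow the sorted level order, so a statistic such as the mean depends on that arbitrary coding.

```r
# Hypothetical categorical annotation (not the study data)
class_toy <- factor(c("normal", "DKD", "DKD", "normal", "DKD"))

# Levels are sorted, so "DKD" is coded 1 and "normal" is coded 2
levels(class_toy)

# Integer codes obtained when the factor is converted to numeric
class_codes <- as.numeric(class_toy)
class_codes

# The "mean" of a categorical variable is only the average of these codes
mean_code <- mean(class_codes)
mean_code
```

Relabeling or reordering the levels would change `mean_code` without any change in the underlying data, which is why such rows are best read as descriptive placeholders.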
8.1.2 Correlations Among Factors of Interest
Correlation is defined as the Pearson correlation coefficient between a pair of factors that are both numeric, as the polychoric correlation between two ordinal factors, and as the polyserial correlation between a numeric and an ordinal factor.
if(length(factors_of_interest)>1){
het_cor <- polycor::hetcor(pheno_data, std.err = FALSE)$correlations
het_cor_df <- t(combn(colnames(het_cor), 2))
het_cor_df <- data.frame(het_cor_df, Correlation=het_cor[het_cor_df])
het_cor_df$X1 <- factor(het_cor_df$X1, levels=colnames(pheno_data), ordered=TRUE)
het_cor_df$X2 <- factor(het_cor_df$X2, levels=colnames(pheno_data), ordered=TRUE)
p <- ggplot(het_cor_df, aes(x=X1, y=X2, fill=Correlation)) +
geom_tile() +
geom_text(aes(label=round(Correlation, 3))) +
scale_fill_gradientn(limits = c(-1,1),
colors = c("red", "#ded86a", "#c3dcfa", "#ded86a", "red")) +
theme_bw(base_size = 16) +
ylab("") + xlab("") +
coord_fixed() +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
ggsave(p, filename = file.path(qc_dir, "correlation_among_factors_of_interest.svg"), width=7, height=5)
p
} else {
message(paste0("Only ", length(factors_of_interest), " factor(s) of interest supplied. No correlations estimated."))
}
Analyst Notes: From the correlations, slide name and class are highly correlated. This is expected since each slide belongs to only one class:
##             class
## slide name   DKD normal
##   disease1B   24      0
##   disease2B   24      0
##   disease3    59      0
##   disease4    24      0
##   normal2B     0     22
##   normal3      0     59
##   normal4      0     23
segment and region are also highly correlated.
##                    region
## segment             glomerulus tubule
##   Geometric Segment        151      0
##   PanCK-                     0     42
##   PanCK+                     0     42
This is because glomerulus ROIs used Geometric Segmentation and tubules were segmented based on PanCK staining.
class and pathology are also correlated.
8.2 QC Appendix
This section describes the pre-processing steps applied during the Quality Control stage of the analysis. These steps are summarized in the following figure.

8.2.1 Segment QC Details
Every ROI/AOI segment was tested for:
- Raw sequencing reads: segments with <1000 raw reads are removed.
- % Aligned, % Trimmed, or % Stitched sequencing reads: segments below ~80% for one or more of these QC parameters are removed.
- % Sequencing saturation: Defined as ([1-deduplicated reads/aligned reads]%). Segments below ~50% require additional sequencing to capture full sample diversity and are not typically analyzed until improved.
- Negative Count: this is the geometric mean of the several unique negative probes in the GeoMx panel that do not target mRNA and establish the background count level per segment; segments with low negative counts (1-10) are not necessarily removed but may be studied closer for low endogenous gene signal and/or insufficient tissue sampling.
- No Template Control (NTC) count: values >1000 could indicate contamination for the segments associated with this NTC; however, in cases where the NTC count is between 1000-10000, the segments may be used if the NTC data is uniformly low (e.g. 0-2 counts for all probes).
- Nuclei: >100 nuclei per segment is generally recommended; however, this cutoff is highly study/tissue dependent and may need to be reduced; what is most important is consistency in the nuclei distribution for segments within the study.
- Area: generally correlates with nuclei; a strict cutoff is not generally applied based on area.
More information about these parameters can be found in the GeomxTools Bioconductor page and reference manual.
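As a rough illustration of how the read-based checks above combine, the base-R sketch below flags segments in a toy table. The column names, the example counts, and the `qc_flags` helper are hypothetical; the cutoffs mirror the approximate values listed above and are not the parameters used in this study, nor the GeomxTools implementation.

```r
# Toy segment-level metrics (hypothetical values)
seg <- data.frame(
  segment_id   = c("seg1", "seg2", "seg3"),
  raw_reads    = c(250000, 800, 120000),
  aligned      = c(200000, 700, 100000),
  deduplicated = c(40000, 600, 70000),
  pct_aligned  = c(92, 88, 85)
)

# Sequencing saturation (%) = 100 * (1 - deduplicated / aligned)
seg$saturation <- 100 * (1 - seg$deduplicated / seg$aligned)

# Flag segments that fall below any of the illustrative cutoffs
qc_flags <- function(d, min_reads = 1000, min_aligned = 80, min_saturation = 50) {
  data.frame(
    segment_id     = d$segment_id,
    low_reads      = d$raw_reads < min_reads,
    low_aligned    = d$pct_aligned < min_aligned,
    low_saturation = d$saturation < min_saturation
  )
}

flags <- qc_flags(seg)
flags$any_flag <- flags$low_reads | flags$low_aligned | flags$low_saturation
flags
```

Keeping one logical column per check, rather than a single pass/fail value, makes it easy to see which criterion a segment failed before deciding whether to remove it.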
8.2.1.1 QC parameters used in this study
Below are QC parameter values and the default values used in the GeomxTools package.
dt_params$autoWidth <- FALSE
dt_params$buttons <- list(list(extend = "copy"),
list(extend = "csv", filename = "QCParameterTable"),
list(extend = "excel", filename = "QCParameterTable"))
DT::datatable(
qc_param_tab,
extensions = c("Buttons", "Scroller", "FixedColumns"),
options = dt_params,
rownames = FALSE)
8.2.1.2 No Template Control (NTC) counts
The table below lists the NTC count for each plate and the number of biological (i.e., non-NTC) samples on that plate.
dt_params$autoWidth <- FALSE
dt_params$buttons <- list(list(extend = "copy"),
list(extend = "csv", filename = "NTC_Counts"),
list(extend = "excel", filename = "NTC_Counts"))
DT::datatable(
ntc_summary_tab,
extensions = c("Buttons", "Scroller", "FixedColumns"),
options = dt_params,
rownames = FALSE)
8.2.1.3 Segment QC Plots
The tabs below visualize the segment QC distributions for various sequence and segment quality metrics.
The plot below shows the distribution of trimmed reads in the data with the percent threshold of 80%.
# Plot trimmed reads with the helper function makeQCHistogram
p <- makeQCHistogram(probe_data_qc_for_seg_qc_plots,
annotation_col = "Trimmed (%)",
fill_by = "segment",
bins = 50,
xintercept = QC_params$percentTrimmed)
ggsave(p, filename = file.path(qc_dir, "trimmed.svg"), width=6, height=6)
p
The plot below shows the distribution of aligned reads in the data with the percent threshold of 75%.
p <- makeQCHistogram(probe_data_qc_for_seg_qc_plots,
annotation_col = "Aligned (%)",
fill_by = "segment",
bins = 50,
xintercept = QC_params$percentAligned)
ggsave(p, filename = file.path(qc_dir, "aligned.svg"), width=6, height=6)
p
The plot below shows the distribution of sequencing saturation in the data with the percent threshold of 50%. Saturation is defined as:
\[100* \left( 1 - \frac{deduplicated}{aligned} \right)\]
p <- makeQCHistogram(probe_data_qc_for_seg_qc_plots,
annotation_col = "Saturated (%)",
fill_by = "segment",
bins = 50,
xintercept = QC_params$percentSaturation) +
labs(title = "Sequencing Saturation (%)",
x = "Sequencing Saturation (%)")
ggsave(p, filename = file.path(qc_dir, "saturated.svg"), width=6, height=6)
p
The plot below shows the distribution of area in the data with the threshold of 1000 μm².
p <- makeQCHistogram(probe_data_qc_for_seg_qc_plots,
annotation_col = "area",
fill_by = "segment",
bins = 50,
xintercept = QC_params$minArea)
ggsave(p, filename = file.path(qc_dir, "area.svg"), width=6, height=6)
p
The plot below shows the distribution of nuclei counts in the data with the threshold of 20.
p <- makeQCHistogram(probe_data_qc_for_seg_qc_plots,
annotation_col = "nuclei",
fill_by = "segment",
bins = 50,
xintercept = QC_params$minNuclei)
ggsave(p, filename = file.path(qc_dir, "nuclei.svg"), width=6, height=6)
p
The plots below show the negative probe geometric mean for each module. The vertical line marks a value of 2; geometric mean background values below 2 are typically considered “unstable”.
for(ann in modules) {
p <- makeQCHistogram(probe_data_qc_for_seg_qc_plots,
annotation_col = ann,
fill_by = "segment",
bins = 50,
xintercept = 2,
scale_trans = "log10")
ggsave(p, filename = file.path(qc_dir, paste0(ann, ".svg")), width=6, height=6)
print(p)
}
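For reference, a minimal base-R sketch of the quantity plotted above: the geometric mean of a segment's negative probe counts, compared against the value of 2. The counts and the `geo_mean` helper are hypothetical, and the sketch assumes all counts are strictly positive (a zero count would drive the geometric mean to zero).

```r
# Hypothetical negative-probe counts for one segment
neg_counts <- c(1, 2, 4, 8)

# Geometric mean computed as the exponential of the mean log count
geo_mean <- function(x) exp(mean(log(x)))

neg_geo_mean <- geo_mean(neg_counts)
neg_geo_mean

# Compare against the background value of 2 marked in the plots
is_unstable <- neg_geo_mean < 2
is_unstable
```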
8.2.2 Normalization - Evaluation of Signal Intensity
This section evaluates the “signal intensity” of gene targets relative to each segment’s background (i.e., geometric mean of background probes for each panel). Ideally, there should be a separation between these two values to ensure we can robustly estimate our signal for normalization. If you do not see sufficient separation between these values, you may consider more aggressive filtering of low signal segments/genes, or more advanced methods for data analysis.
facets <- as.formula(paste0("~", paste(facs_to_graph, collapse='*')))
p <- ggplot(signal_intensity_long,
aes(x=value, fill=Metric)) +
geom_histogram(bins=40) + theme_bw() +
facet_wrap(facets) +
scale_x_continuous(trans = "log2") +
xlab("Counts") + ylab("Segments, #")
ggsave(p, filename = file.path(qc_dir, "intensities_histogram.svg"), width=10, height=6)
p
if(length(negative_probes_dots)>1L){
# average the per-panel negative geometric means; plain column indexing is used
# because all_of() is intended for use inside a tidyselect call such as dplyr::select()
signal_intensity$combinedNegMean <-
as.numeric(rowMeans(as.data.frame(signal_intensity)[, negative_probes_dots]))
xlab_msg <- "Background (geometric Mean; combined)"
} else {
signal_intensity$combinedNegMean <- (signal_intensity %>%
dplyr::select(dplyr::all_of(negative_probes_dots)) %>%
as.data.frame())[,1]
xlab_msg <- "Background (geometric Mean)"
}
p <- ggplot(signal_intensity %>% as.data.frame(),
aes(x=combinedNegMean, y=Q3/combinedNegMean)) +
geom_point(aes_string(colour=paste0("interaction(",
paste(facs_to_graph, collapse=", "),")"))) +
scale_x_continuous(trans = "log2") + theme_bw() +
geom_hline(yintercept = 1, colour="black", lty="dashed") +
geom_vline(xintercept = 2, colour="black", lty="dashed") +
xlab(xlab_msg) +
ylab("Dynamic Range (Q3 / background)") +
theme(aspect.ratio = 1) +
guides(colour = guide_legend(override.aes = list(size=4)))
ggsave(p, filename = file.path(qc_dir, "Q3_signal_vs_background.svg"), width=6, height=6)
p
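The dynamic range on the y-axis above can be sketched with base R. The count matrix, the background values, and all object names below are hypothetical; the example only shows how an upper-quartile (Q3) signal per segment is divided by that segment's background.

```r
# Hypothetical gene-by-segment count matrix (5 genes, 2 segments)
counts <- cbind(
  seg1 = c(2, 4, 6, 8, 10),
  seg2 = c(1, 1, 2, 2, 3)
)

# Hypothetical negative-probe geometric mean per segment
neg_geo_mean <- c(seg1 = 4, seg2 = 2.5)

# Upper quartile (Q3) of counts in each segment
q3 <- apply(counts, 2, quantile, probs = 0.75)

# Dynamic range: Q3 signal relative to background
dynamic_range <- unname(q3) / unname(neg_geo_mean)
dynamic_range

# Segments at or below 1 have a Q3 signal no higher than background
above_background <- dynamic_range > 1
above_background
```

In this toy case the second segment falls below the dashed horizontal line at 1, which is the situation where more aggressive filtering might be considered.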
8.3 Contrasting groups with linear mixed models
In a given contrast, we are interested in comparing the magnitude of differences between groups and quantifying the significance. In many GeoMx® experiments, multiple areas of illumination (AOIs) or regions of interest (ROIs) are sampled in a given slide. To account for the nested, non-independent sampling, we often use a linear mixed effect model (LMM) with slide as a random effect. The LMM accounts for the subsampling per tissue, allowing us to adjust for the fact that the multiple regions of interest placed per tissue section are not independent observations, as is the assumption with other traditional statistical tests.
Overall, there are two main types of LMM used with GeoMx® data: A) with a random slope and B) without a random slope.
When comparing features that co-exist in a given tissue section, a random slope is included in the LMM. When comparing features that are mutually exclusive in a given tissue section, the LMM does not require a random slope. We represent the two variations of the LMM in the schematic below. In this example, there are a number of slides (“Slide”). Each slide is classified as either healthy or diseased in the “Disease” category. Within a slide, different regions are present, and these are illustrated as pink or green ROIs.
Example A: When estimating the effect of “Region”, “Slide” is used as a random effect. Because the slope of “Region” may differ across slides, “Region” is used as a random slope.
Example B: In this example study design, Disease is mutually exclusive to a slide. In other words, a slide can only be a “disease” or “healthy”. In this case, there is no need to include the additional term in the model.
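As a sketch, the two variants can be written out as follows. The notation is introduced here for illustration and is an assumption, not taken from the report's code: \(y_{ij}\) is the (log-scale) expression of one gene in segment \(i\) on slide \(j\), \(\beta\) terms are fixed effects, \(b\) terms are slide-level random effects, and \(\varepsilon_{ij}\) is the residual.

Example A (random intercept and random slope for Region):
\[y_{ij} = \beta_0 + \beta_1 \, \mathrm{Region}_{ij} + b_{0j} + b_{1j} \, \mathrm{Region}_{ij} + \varepsilon_{ij}\]

Example B (random intercept only):
\[y_{ij} = \beta_0 + \beta_1 \, \mathrm{Disease}_{j} + b_{0j} + \varepsilon_{ij}\]

In both cases the random effects and residuals are assumed to be normally distributed with mean zero. In lme4-style formula notation, and under the same assumptions, these would correspond to `~ Region + (1 + Region | Slide)` and `~ Disease + (1 | Slide)`. Disease carries only the slide index \(j\) in Example B because it does not vary within a slide, which is why a slide-specific slope cannot be estimated for it.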
8.4 R Information
Print R session information for reference:
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] sDAS_1.1.0 org.Mm.eg.db_3.17.0 org.Hs.eg.db_3.17.0 ComplexHeatmap_2.16.0
## [5] SpatialDecon_1.10.0 GeoMxWorkflows_1.6.0 GeomxTools_3.4.0 svglite_2.1.1
## [9] NanoStringNCTools_1.8.0 FactoMineR_2.8 GSVA_1.48.2 GSEABase_1.62.0
## [13] graph_1.78.0 annotate_1.78.0 XML_3.99-0.14 AnnotationDbi_1.62.1
## [17] IRanges_2.34.1 S4Vectors_0.38.1 Biobase_2.60.0 BiocGenerics_0.46.0
## [21] parcats_0.0.4 ggprism_1.0.4 reshape2_1.4.4 tidyr_1.3.0
## [25] ggupset_0.3.0 pals_1.7 circlize_0.4.15 ggrepel_0.9.3
## [29] magick_2.7.4 GGally_2.1.2 kableExtra_1.3.4 psych_2.3.6
## [33] openxlsx_4.2.5.2 ggforce_0.4.1 ggplot2_3.4.3 dplyr_1.1.2
## [37] plyr_1.8.8
##
## loaded via a namespace (and not attached):
## [1] matrixStats_1.0.0 bitops_1.0-7 lubridate_1.9.2
## [4] doParallel_1.0.17 httr_1.4.6 webshot_0.5.5
## [7] RColorBrewer_1.1-3 numDeriv_2016.8-1.1 tools_4.3.1
## [10] utf8_1.2.3 R6_2.5.1 DT_0.28
## [13] HDF5Array_1.28.1 mgcv_1.8-42 GetoptLong_1.0.5
## [16] rhdf5filters_1.12.1 withr_2.5.1 sp_2.0-0
## [19] gridExtra_2.3 polycor_0.8-1 progressr_0.13.0
## [22] textshaping_0.3.6 cli_3.6.1 flashClust_1.01-2
## [25] labeling_0.4.3 sass_0.4.6 mvtnorm_1.2-2
## [28] ggridges_0.5.4 askpass_1.2.0 systemfonts_1.0.4
## [31] R.utils_2.12.2 dichromat_2.0-0.1 parallelly_1.36.0
## [34] maps_3.4.1 readxl_1.4.2 rstudioapi_0.14
## [37] RSQLite_2.3.1 shape_1.4.6 generics_0.1.3
## [40] crosstalk_1.2.0 zip_2.3.0 leaps_3.1
## [43] Matrix_1.6-5 ggbeeswarm_0.7.2 fansi_1.0.4
## [46] abind_1.4-5 R.methodsS3_1.8.2 lifecycle_1.0.3
## [49] scatterplot3d_0.3-44 yaml_2.3.7 SummarizedExperiment_1.30.2
## [52] rhdf5_2.44.0 recipes_1.0.6 Rtsne_0.16
## [55] blob_1.2.4 crayon_1.5.2 lattice_0.21-8
## [58] cowplot_1.1.1 beachmat_2.16.0 KEGGREST_1.40.0
## [61] mapproj_1.2.11 pillar_1.9.0 knitr_1.43
## [64] GenomicRanges_1.52.0 rjson_0.2.21 boot_1.3-28.1
## [67] estimability_1.4.1 admisc_0.32 future.apply_1.11.0
## [70] codetools_0.2-19 glue_1.6.2 ggiraph_0.8.7
## [73] outliers_0.15 data.table_1.14.8 vctrs_0.6.3
## [76] png_0.1-8 spam_2.10-0 cellranger_1.1.0
## [79] gtable_0.3.4 cachem_1.0.8 gower_1.0.1
## [82] xfun_0.39 S4Arrays_1.2.0 prodlim_2023.03.31
## [85] coda_0.19-4 survival_3.5-5 timeDate_4022.108
## [88] SingleCellExperiment_1.22.0 pheatmap_1.0.12 iterators_1.0.14
## [91] hardhat_1.3.0 lava_1.7.2.1 ellipsis_0.3.2
## [94] ipred_0.9-14 nlme_3.1-162 bit64_4.0.5
## [97] EnvStats_2.7.0 GenomeInfoDb_1.36.1 R.cache_0.16.0
## [100] bslib_0.5.0 irlba_2.3.5.1 vipor_0.4.5
## [103] rpart_4.1.19 colorspace_2.1-0 DBI_1.1.3
## [106] nnet_7.3-19 mnormt_2.1.1 tidyselect_1.2.0
## [109] repmis_0.5 logNormReg_0.5-0 emmeans_1.8.7
## [112] bit_4.0.5 compiler_4.3.1 rvest_1.0.3
## [115] xml2_1.3.4 DelayedArray_0.26.3 bookdown_0.34
## [118] scales_1.2.1 multcompView_0.1-9 stringr_1.5.0
## [121] digest_0.6.32 minqa_1.2.5 rmarkdown_2.22
## [124] XVector_0.40.0 htmltools_0.5.5 pkgconfig_2.0.3
## [127] lme4_1.1-35.1 umap_0.2.10.0 sparseMatrixStats_1.12.1
## [130] MatrixGenerics_1.12.2 highr_0.10 fastmap_1.1.1
## [133] GlobalOptions_0.1.2 rlang_1.1.0 htmlwidgets_1.6.2
## [136] ggthemes_4.2.4 DelayedMatrixStats_1.22.1 farver_2.1.1
## [139] jquerylib_0.1.4 jsonlite_1.8.7 BiocParallel_1.34.2
## [142] R.oo_1.25.0 BiocSingular_1.16.0 RCurl_1.98-1.12
## [145] magrittr_2.0.3 GenomeInfoDbData_1.2.10 dotCall64_1.1-1
## [148] Rhdf5lib_1.22.0 munsell_0.5.0 Rcpp_1.0.12
## [151] reticulate_1.32.0 stringi_1.7.12 ggalluvial_0.12.5
## [154] zlibbioc_1.46.0 MASS_7.3-60 parallel_4.3.1
## [157] listenv_0.9.0 forcats_1.0.0 Biostrings_2.68.1
## [160] splines_4.3.1 uuid_1.1-0 ScaledMatrix_1.8.1
## [163] evaluate_0.21 SeuratObject_5.0.1 BiocManager_1.30.21
## [166] foreach_1.5.2 nloptr_2.0.3 tweenr_2.0.2
## [169] openssl_2.1.1 purrr_1.0.1 polyclip_1.10-4
## [172] clue_0.3-64 future_1.32.0 reshape_0.8.9
## [175] rsvd_1.0.5 xtable_1.8-4 RSpectra_0.16-1
## [178] ragg_1.2.5 viridisLite_0.4.2 class_7.3-22
## [181] easyalluvial_0.3.1 tibble_3.2.1 lmerTest_3.1-3
## [184] memoise_2.0.1 beeswarm_0.4.0 cluster_2.1.4
## [187] timechange_0.2.0 globals_0.16.2 BiocStyle_2.28.0
Demo Report version: 1.0
8.5 Spatial Deconvolution Appendix
8.5.1 Cell Profile Matrix
A cell profile matrix is a pre-defined matrix that specifies the expected expression profiles of each cell type. A complete list of pre-processed profile matrices can be found in the Cell Profile Library.
For this analysis, we used this profile matrix which contains 33 cell types.
The following upset plot shows the number of features uniquely present in either the GeoMx® data or profile data and the number that intersect.
Analyst Notes: The majority of genes are present in both the DSP data and the profile matrix. There were 150 genes found only in the profile matrix. The table below lists these genes along with fuzzy matches to genes in the DSP data (i.e., a maximum overall distance of 2). This can be useful for identifying misspelled genes or small differences in nomenclature, but not for assessing similarity among genes. For the cell deconvolution analysis below, these 150 genes are removed.
only_in_profile_summary <- do.call(rbind,
lapply(only_in_profile, function(x){
# fuzzy-match each profile-only gene against the DSP features (at most 2 edits overall)
return(data.frame(`InProfile`=x,
`SimilarGeoMxMatches`=paste(
agrep(x, dsp_features, max.distance=list(all=2),
ignore.case=TRUE, value=TRUE),
collapse=",")))
}))
dt_params$autoWidth <- FALSE
dt_params$buttons <- list(list(extend = "copy"),
list(extend = "csv", filename = "FeaturesOnlyInProfileMatrix"),
list(extend = "excel", filename = "FeaturesOnlyInProfileMatrix"))
DT::datatable(
only_in_profile_summary,
extensions = c("Buttons", "Scroller", "FixedColumns"),
options = dt_params,
rownames = FALSE)
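To illustrate what a maximum overall distance of 2 means, the sketch below runs base R's `agrep` on made-up feature names (the vectors are hypothetical examples, not results from this study). A name is returned when the pattern can be found in it with at most two insertions, deletions, or substitutions in total.

```r
# Hypothetical feature names standing in for the GeoMx target list
dsp_features_toy <- c("HLADRA", "ACTB", "GAPDH")

# Profile-matrix symbol that differs from one feature only by a hyphen
matches <- agrep("HLA-DRA", dsp_features_toy,
                 max.distance = list(all = 2),
                 ignore.case = TRUE, value = TRUE)
matches
```

Because `agrep` looks for the pattern within each candidate string, short symbols can also match longer, unrelated names, so the suggested matches are best treated as candidates for manual review.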
The heatmap below is a visualization of the profile matrix itself (i.e., not the GeoMx® data).
8.6 Pathway Analysis Appendix
# reload data if not in working environment, throw warning if not found
plot_overlap <- TRUE
if(length(ls(pattern = "^geneSet_res")) == 0) {
gs_files <- dir(path = ge_data_dir, pattern = "\\.RDS$")
if(length(gs_files) == 0) {
plot_overlap <- FALSE
msg <- "Warning: No gene sets were detected. Please confirm you have not moved directories during report generation and have run the appropriate modules."
} else {
gs_results <- list()
# load every saved result, not only the last file
for(i in seq_along(gs_files))
gs_results[[i]] <- readRDS(file.path(ge_data_dir, gs_files[i]))
}
} else {
gs_results <- lapply(ls(pattern = "^geneSet_res"), get)
}
# re-calculate top features at above thresholds
gs_tests <- list()
for(i in seq_along(gs_results)) {
gs_tests[[i]] <- unlist(lapply(gs_results[i], function(x) {
strsplit(x[1, 'Comparison'], ' vs ')
}))
}
Below we will compare the overlap between the most significant gene sets identified in each contrast in the Gene Set Analysis section of the report.
for(i in seq_along(gs_results)) {
cat(paste0('\n### Contrast ', i, ': ', gs_tests[[i]][1], ' vs ', gs_tests[[i]][2], '\n\n'))
cat(paste0('::: {#gs_overlap', i,' .tabgroup}\n'))
cat('::::: {.tab}\n')
cat(paste0('<button class="tablinks active" onclick="unrolltab(event, \'upset', i, '_up\', \'gs_overlap', i, '\')">', gs_tests[[i]][1], ' upset plot</button>\n'))
cat(paste0('<button class="tablinks" onclick="unrolltab(event, \'upset', i, '_down\', \'gs_overlap', i, '\')">', gs_tests[[i]][2], ' upset plot</button>\n'))
cat(paste0('<button class="tablinks" onclick="unrolltab(event, \'ocheatmap', i, '\', \'gs_overlap', i, '\')">overlap coefficient heatmap</button>\n'))
cat(paste0('::::::: {#upset', i, '_up .tabcontent style="display: block"}\n'))
cat(paste0('The upset plot below is specific for the gene sets most significantly associated with ', gs_tests[[i]][1], '. Numbers indicate the numbers of targets within.'))
# we use a helper function (getGeneSetOverlap) to create upset plots and heatmaps.
top_feat <- getTopFeatures(gs_results[[i]],
n_features = 10,
p_adjust_column = p_adjust_method,
p_adjust_thr = p_adj_thr_op)
overlap_plots <- getGeneSetOverlap(geneSet_data, top_feat, gs_tests[[i]])
cat('\n \n')
print(overlap_plots$upset_up)
cat('\n \n')
cat(':::::::\n')
cat(paste0('::::::: {#upset', i, '_down .tabcontent}\n'))
cat(paste0('The upset plot below is specific for the gene sets most significantly associated with ',gs_tests[[i]][2], '.\n \n'))
print(overlap_plots$upset_down)
cat('\n \n')
cat(':::::::\n')
cat(paste0('::::::: {#ocheatmap', i, ' .tabcontent}\n'))
cat('In the below heatmap, color indicates the relative overlap ([Szymkiewicz–Simpson](https://en.wikipedia.org/wiki/Overlap_coefficient){target="_blank"} Overlap coefficient) of the different gene sets. Higher values, shown in red, have more overlap and should be considered to be capturing similar biology. The overlap coefficient is calculated as:\n\n')
cat('$$Overlap = \\frac{|Set1 \\bigcap Set2|}{min(|Set1|, |Set2|)}$$\n\n')
cat('$|Set1 \\bigcap Set2|$ is the number of overlapping genes in the two sets, and $min(|Set1|,|Set2|)$ is the number of genes in the smaller set of genes being scored. The relative size of the gene set is shown by the bubble next to each gene set, while the color of the block indicates the overlap coefficient.\n \n')
print(overlap_plots$heatmapOC)
cat('\n \n')
cat(':::::::\n \n')
cat(':::::\n')
cat(':::\n')
}
8.6.1 Contrast 1: tubule vs glomerulus
The upset plot below is specific for the gene sets most significantly associated with tubule. Numbers indicate the numbers of targets within.
The upset plot below is specific for the gene sets most significantly associated with glomerulus.
In the below heatmap, color indicates the relative overlap (Szymkiewicz–Simpson Overlap coefficient) of the different gene sets. Higher values, shown in red, have more overlap and should be considered to be capturing similar biology. The overlap coefficient is calculated as:
\[Overlap = \frac{|Set1 \bigcap Set2|}{min(|Set1|, |Set2|)}\]
\(|Set1 \bigcap Set2|\) is the number of overlapping genes in the two sets, and \(min(|Set1|,|Set2|)\) is the number of genes in the smaller set of genes being scored. The relative size of the gene set is shown by the bubble next to each gene set, while the color of the block indicates the overlap coefficient.
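A small worked example of the overlap coefficient in base R follows. The gene sets and the `overlap_coef` helper are hypothetical and are not the report's `getGeneSetOverlap` helper; the function simply applies the formula above.

```r
# Overlap (Szymkiewicz-Simpson) coefficient for two gene sets
overlap_coef <- function(set1, set2) {
  set1 <- unique(set1)
  set2 <- unique(set2)
  length(intersect(set1, set2)) / min(length(set1), length(set2))
}

# Hypothetical gene sets sharing two genes
set_a <- c("COL1A1", "COL3A1", "FN1", "VIM")
set_b <- c("FN1", "VIM", "ACTA2")

# 2 shared genes divided by the size of the smaller set (3)
oc <- overlap_coef(set_a, set_b)
oc

# A set fully contained in a larger set gives a coefficient of 1
oc_nested <- overlap_coef(c("FN1", "VIM"), set_a)
oc_nested
```

Dividing by the smaller set size means a small gene set nested inside a large one scores 1, which differs from the Jaccard index, where the size mismatch would lower the score.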
8.6.2 Contrast 2: normal vs DKD
The upset plot below is specific for the gene sets most significantly associated with normal. Numbers indicate the numbers of targets within.
The upset plot below is specific for the gene sets most significantly associated with DKD.
In the below heatmap, color indicates the relative overlap (Szymkiewicz–Simpson Overlap coefficient) of the different gene sets. Higher values, shown in red, have more overlap and should be considered to be capturing similar biology. The overlap coefficient is calculated as:
\[Overlap = \frac{|Set1 \bigcap Set2|}{min(|Set1|, |Set2|)}\]
\(|Set1 \bigcap Set2|\) is the number of overlapping genes in the two sets, and \(min(|Set1|,|Set2|)\) is the number of genes in the smaller set of genes being scored. The relative size of the gene set is shown by the bubble next to each gene set, while the color of the block indicates the overlap coefficient.