nCounter Knowledge Base: Data Analysis
This Knowledge Base serves as a technical resource specifically to answer common questions and assist with troubleshooting regarding nCounter® data analysis; NanoString University is the primary source for manuals, guides and other documentation.
For additional assistance, email support@nanostring.com
nCounter Data Analysis
General
The nSolver Analysis Software is a data analysis program that offers nCounter users the ability to quickly and easily QC, normalize, and analyze their data without having to purchase additional software packages. The nSolver software also provides seamless integration and compatibility with other software packages designed for more complex analyses and visualizations. It is free for all nCounter customers and available for download. Please consult the nSolver User Manual for instructions on how to analyze your data using this software or watch our video tutorials. For additional assistance, please contact support@nanostring.com.
The positive controls are spike-in oligos used for quality control. The positive control counts in each sample are influenced by a number of factors: pipetting accuracy, hybridization efficiency (e.g. inaccurate temperature or presence of contaminants from sample input that inhibit hybridization), as well as sample processing and binding efficiency.
Positive controls serve three general QC purposes:
- Assess the overall assay efficiency. nSolver raises a warning flag when the geometric mean of positive controls is >3 fold different from the mean of all samples.
- Assess assay linearity. Decreasing linear counts are expected from POS_A to POS_F.
- Assess limit of detection (LOD). It is expected that counts for POS_E will be higher than the mean of negative controls plus two standard deviations.
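The three checks above can be sketched in a few lines of Python. This is a minimal illustration and not nSolver's implementation: the function names, the use of all six positive controls for the geometric mean, the "strictly decreasing" test as a stand-in for linearity, and the use of the sample standard deviation are all assumptions made for this example.

```python
import math
import statistics

POS_LABELS = ["POS_A", "POS_B", "POS_C", "POS_D", "POS_E", "POS_F"]

def geometric_mean(values):
    """Geometric mean of strictly positive counts."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def positive_control_qc(pos_counts, neg_counts, all_lane_geomeans):
    """Run the three positive control checks for one lane.

    pos_counts: dict mapping POS_A..POS_F to counts for this lane.
    neg_counts: list of negative control counts for this lane.
    all_lane_geomeans: positive control geometric means of all lanes.
    """
    ordered = [pos_counts[label] for label in POS_LABELS]

    # 1) Overall efficiency: more than 3-fold away from the mean of all lanes.
    ratio = geometric_mean(ordered) / statistics.mean(all_lane_geomeans)
    efficiency_flag = ratio > 3 or ratio < 1 / 3

    # 2) Linearity (simplified): counts should decrease from POS_A to POS_F.
    monotonic = all(a > b for a, b in zip(ordered, ordered[1:]))

    # 3) Limit of detection: POS_E above mean(NEG) + 2 * SD(NEG).
    lod_threshold = statistics.mean(neg_counts) + 2 * statistics.stdev(neg_counts)
    lod_ok = pos_counts["POS_E"] > lod_threshold

    return {
        "efficiency_flag": efficiency_flag,
        "monotonic_decrease": monotonic,
        "lod_ok": lod_ok,
    }
```

A lane with no issues would return `efficiency_flag` as `False` and the other two entries as `True`.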
Some level of variability among positive control counts is expected. If you receive no positive/negative control QC flags in nSolver, you may rest assured that the assay worked as expected. Even if you do receive warning flags, it does not necessarily mean the assay has failed. You may send your RCC files to support@nanostring.com, and we will be happy to check for root cause of the flags for you.
The total surface area of each lane in a cartridge is scanned in multiple discrete units called fields of view (FOV). After scanning is complete, the FOV within each lane are aggregated together to generate total counts across the entire surface area within each lane. The “Imaging QC” metric quantifies the performance of this imaging process. Specifically, it is a fraction that is calculated by dividing the number of FOVs that have successfully been scanned (i.e., “FOV Counted” within nSolver) by the number of FOVs that were attempted to be scanned (i.e., “FOV Count” within nSolver). Significant discrepancy between the number of FOV for which imaging was attempted (“FOV Count”) and for which imaging was successful (“FOV Counted”) may indicate an issue with imaging performance.
Within nSolver, a sample that has an Imaging QC value less than 0.75 (or 75%) will be flagged. The threshold of 0.75 was selected based on internal testing that evaluated performance over a range of FOV values. The scanner is more likely to encounter difficulties near the edge of the slide. Therefore, when the maximum scan setting is selected for MAX or FLEX systems (the SPRINT instrument has one scan setting), it is more likely that some FOV will be dropped. Reduction in number of FOV counted does not compromise data quality and is accounted for during data normalization. However, when a substantial percentage of FOVs are not successfully counted, there may be issues with the resulting data. Consistent large reductions in percentages can be indicative of an issue associated with the instrumentation.
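As a minimal sketch of the calculation described above, the Imaging QC fraction and its flag can be reproduced from the two FOV values. The function name and return shape are hypothetical and chosen only for illustration.

```python
def imaging_qc(fov_counted, fov_count, threshold=0.75):
    """Return (fraction, flagged) for one lane.

    fov_counted: FOVs successfully scanned ("FOV Counted").
    fov_count:   FOVs attempted ("FOV Count").
    A lane is flagged when the fraction is less than the threshold.
    """
    if fov_count <= 0:
        raise ValueError("fov_count must be positive")
    fraction = fov_counted / fov_count
    return fraction, fraction < threshold
```

For example, a lane with 200 of 280 FOVs counted has a fraction of about 0.71 and would be flagged.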
If Imaging QC is greater than 0.75, then a re-scan may be performed, if desired, in an attempt to increase the number of FOV counted, though as a routine practice this is not necessary or recommended. If Imaging QC is less than 0.75, then clean the bottom of the cartridge with a lint-free wipe, and re-scan the cartridge, being sure that the cartridge lies flat in the scanner. Please note that the re-scan option is currently available for MAX and FLEX systems only; it is not available for the SPRINT system (as of October 25, 2016). If re-scan does not improve imaging performance in samples with Imaging QC less than 0.75, then email the raw data (RCC files) and instrument log files to support@nanostring.com. The data and logs will be examined for hardware or assay problems.
A QC flag does not necessarily mean that data from a flagged lane cannot be used. The thresholds for QC flags are set at a conservative level in order to both catch samples which may have failed, and also to identify samples with usable data which happened to experience a reduction in assay efficiency.
To determine whether a QC flag is indicating a critical problem, examine the raw and normalized data and check whether the flagged samples have a poorer limit of detection for low count transcripts when compared to non-flagged samples. For some genes, differences in expression level between samples will be caused by differences in treatment or pathology, so it may be more appropriate to determine if the expression of only the low count genes for any flagged lane falls within the range of expression values observed across a number of unflagged samples which come from different treatments or pathologies.
One can approach this potential limit of detection question in a number of ways. First, a simple visual scan of the data may suffice to detect problems in the flagged samples. This can be performed on raw data which have been background subtracted in nSolver to identify targets that are below the background. Alternatively, outlier samples could be identified by generating a heat map of normalized data from all samples to see if the flagged samples in question are strongly divergent from other samples with similar pathology. Another option would be to examine the calculated QC metrics within nSolver (right click or command click on one of the table columns in the raw data table, and choose ‘select hidden columns’). If these QC metrics have only exceeded the threshold by a very small margin (e.g., the FOV registration is 74% instead of 75%), then the resultant data are generally going to be quite robust and usable.
More details on QC flags can be found in the nSolver manual. If QC flags become more than a rare anomaly, we encourage you to contact our support team (support@nanostring.com and/or your local Applications Scientist) in order to assist you in tracking down the root cause of these potential problems with the assay consistency.
A positive control normalization flag indicates that the POS controls for the lane (sample) in question are more than three-fold different (greater or smaller) than the POS control counts from the other samples in the experiment. High POS control counts are rarely problematic, so a flag usually only indicates a problem when the POS controls are particularly low for a sample. Such low POS counts are indicative of relatively low assay efficiency at capturing and counting targets, which may lower sensitivity or introduce bias into the assay.
To determine whether a POS control normalization flag is indicating a critical problem, examine the raw and normalized data and check whether the flagged samples have a poorer limit of detection for low count transcripts when compared to non-flagged samples. For some genes one should anticipate differences in expression level between samples due to differences in treatment or pathology, so it may be more appropriate to see if the expression of the low count genes for any flagged lane falls in the range of expression values observed across a number of unflagged samples which come from different treatments or pathologies.
One can approach this potential limit of detection question in a number of ways. First, a simple visual scan of the data may suffice to detect problems in the flagged samples. This can be performed on raw data which have been background subtracted in nSolver to identify targets that are below the background. Alternatively, outlier samples could be identified by generating a heat map of normalized data from all samples to see if the flagged samples in question are strongly divergent from other samples with similar pathology. Another option would be to examine the calculated POS control normalization factors within nSolver (found in the normalized data table on the far right). If these factors have only exceeded the threshold by a very small margin (e.g., the POS control normalization factor is 3.2), then one can usually assume that the resultant data are generally going to be quite robust and usable for the majority of data sets.
More details on POS control normalization flags can be found in the nSolver manual. If POS control normalization flags become more than a rare anomaly, we encourage you to contact our support team (support@nanostring.com and/or your local Applications Scientist) in order to assist you in tracking down the root cause of these potential problems with the assay consistency.
A QC flag for content normalization indicates that the flagged sample had a content (or housekeeping gene) normalization factor more than 10-fold different from the average sample in the same experiment. In other words, the flagged sample had significantly lower or higher counts in the Housekeeping genes which are used to normalize sample input. Although unusually high housekeeping gene counts would not typically be problematic, it is much more common to see samples with lower housekeeping gene counts, and these would be flagged if the content correction factor for that sample were greater than 10.
Content normalization flags can be caused by either a significant reduction in overall assay efficiency for that sample, or because of an effective reduction in quantity or quality (fragmentation) of the input RNA. The likelihood of a reduction in assay efficiency can be assessed by the presence of any other QC flags for that sample. If the lane failed the QC specifications by a large margin for any of the other QC metrics (including POS control normalization), then overall counts may be reduced enough to also cause a Content normalization flag. Essentially, in this scenario the assay is working so poorly that the counts for endogenous and housekeeping genes are dramatically reduced even if sufficient RNA targets are present. If, however, the sample had no other QC flags except that for Content normalization, this usually means that the assay is working well, but there were insufficient RNA targets to count. This can be caused either by low RNA concentrations or highly fragmented RNA, such as from an archival FFPE sample.
To determine whether a Content normalization flag is creating a critical problem, examine the raw and normalized data and check whether the flagged samples have a poorer limit of detection for low count transcripts when compared to non-flagged samples. For some genes one should anticipate differences in expression level between samples due to differences in treatment or pathology, so it may be more appropriate to see if the expression of the low count genes for any flagged lane falls in the range of expression values observed across a number of unflagged samples which come from different treatments or pathologies.
One can approach this potential limit of detection question in a number of ways. First, a simple visual scan of the data may suffice to detect problems in the flagged samples. This can be performed on raw data which have been background subtracted in nSolver to identify targets that are below the background. Alternatively, outlier samples could be identified by generating a heat map of normalized data from all samples to see if the flagged samples in question are strongly divergent from other samples with similar pathology. Another option would be to examine the calculated Content normalization factors within nSolver (found in the normalized data table on the far right). If these factors have only exceeded the threshold by a very small margin (e.g., the Content normalization factor is 10.6), then one can usually assume that the resultant data are generally going to be quite robust and usable for the majority of data sets.
More details on Content normalization flags can be found in the nSolver manual. If QC flags become more than a rare anomaly, we encourage you to contact our support team (support@nanostring.com and/or your local Applications Scientist) in order to assist you in tracking down the root cause of these potential problems with the assay consistency.
Binding Density (BD) is affected by several different factors:
- The input amount. More sample input will result in an increased BD.
- The expression level of the targets in the CodeSet. If the targets in the CodeSet are highly expressed, BD will go up simply because there are more mRNA molecules present in your samples that are targeted by the probes in the CodeSet.
- The size of the CodeSet. If a CodeSet contains probes for more targets, then BD will usually be higher.
Binding density refers to the number of barcodes per μm². The recommended range is from 0.05 to 2.25 for MAX and FLEX instruments and 0.5 to 1.8 for SPRINT. If the density is less than 0.05, the instrument may not be able to focus on the cartridge due to a lack of optical information. If the density is greater than 2.25, the barcodes will begin to overlap, resulting in a loss of data, as overlapping barcodes are excluded from the analysis. As a general rule of thumb, one lane can accurately detect a total of about 2 million barcodes.
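If you track binding density outside of nSolver, a range check along the following lines may be useful. This is only a sketch: the range table simply repeats the values quoted above, and the function name and return labels are hypothetical.

```python
# Recommended binding density ranges (barcodes per square micron),
# copied from the text above; adjust if your documentation differs.
BD_RANGES = {
    "MAX": (0.05, 2.25),
    "FLEX": (0.05, 2.25),
    "SPRINT": (0.5, 1.8),
}

def binding_density_status(bd, instrument):
    """Classify a lane's binding density as 'low', 'ok', or 'high'."""
    low, high = BD_RANGES[instrument.upper()]
    if bd < low:
        return "low"   # possibly too little optical information to focus
    if bd > high:
        return "high"  # barcodes may overlap and be excluded
    return "ok"
```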
NanoString provides several options for performing background subtraction using the nSolver Analysis Software. To estimate background, NanoString provides several probes in each Codeset for which no target is present. These negative controls can be used to estimate background levels in your experiment. Background levels may be estimated using either the average of the negative controls for that lane or the average of the negative controls plus a multiple of the standard deviation of all the negative controls in a lane. Alternatively, background levels may also be estimated by running a blank lane in which nuclease-free water instead of RNA is added as input; this will generate a background measurement that will estimate probe-specific background levels instead of general background levels, as estimated from a set of negative controls. Once the appropriate background level has been determined, the background counts are subtracted from the raw counts to determine the true counts.
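The negative control approach described above can be sketched as follows. This is an illustration under stated assumptions, not the nSolver code: the function names are hypothetical, the sample standard deviation is used, and subtracted counts are floored at zero here simply because negative counts are not meaningful (nSolver may handle this differently).

```python
import statistics

def background_level(neg_counts, n_sd=0):
    """Mean of a lane's negative controls plus n_sd standard deviations."""
    return statistics.mean(neg_counts) + n_sd * statistics.stdev(neg_counts)

def subtract_background(raw_counts, background, floor=0):
    """Subtract the lane background from every target in that lane.

    raw_counts: dict mapping gene name to raw count for one lane.
    floor: lowest value allowed after subtraction (an assumption here).
    """
    return {gene: max(count - background, floor)
            for gene, count in raw_counts.items()}
```

With negative controls of 4, 6 and 8 and `n_sd=2`, the background level is 10, so a raw count of 100 becomes 90 and a raw count of 7 is set to the floor.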
The type of normalization strategy that you employ depends on your experiment. If you expect only a few gene targets to change, then either a reference gene normalization or a global normalization will be sufficient. However, if you expect the majority of your gene targets to change, then you should not perform a global normalization. In this case, a reference gene normalization would be most appropriate, provided that the group of reference genes selected is stably expressed across your experimental conditions.
Yes. The fold change data obtained from an nCounter analysis correlates well with fold change results obtained from microarray analyses. The level of concordance between the nCounter and microarray results is similar to comparisons of different microarray platforms.
Yes, we’ve found that there is an excellent correlation between nCounter analyses and qPCR analyses, both in terms of relative expression levels and fold changes. Moreover, the multiplexing capabilities of nCounter analyses increase the efficiency with which data can be obtained at qPCR levels of sensitivity. We therefore recommend using nCounter analyses to extend your current set of qPCR data.
While many mRNAs demonstrate low variance across tissues, there simply is no single set of mRNAs that can be used across all experimental conditions and tissues.
It is recommended that every CodeSet design have at least 3 – 6 “reference” or “housekeeping” targets to use for technical variance normalization. Characteristics of effective reference targets are 1) minimal variance across samples, and 2) high correlation with each other (assuming technical variance is much lower than biological variance).
If you have generated data on nCounter or other platforms previously that show certain targets do not vary across your treatment conditions, and that they fit the above criteria, these would be ideal targets to start with as reference mRNAs. However, if you haven’t characterized candidate reference targets yet, it is important to measure the expression of at least 6 – 8 candidate genes in a pilot experiment. Starting with this number of candidates should allow you to identify a set of 3 or more useful targets, as some may drop out due to higher than expected variance or biological effects across your samples and treatments.
To select candidate genes, potential reference targets can be gleaned from online reference gene tools (such as Refgenes or NormFinder), pre-existing data, or the literature in your field. Please note the reference gene tools are not affiliated with NanoString; please see the linked websites for support.
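When evaluating candidates from a pilot experiment, the two criteria above (low variance across samples and high correlation with the other candidates) can be summarized per gene. The sketch below is one simple way to do this and is not the geNorm algorithm or an nSolver feature; the function names and the choice of summary statistics (standard deviation of log2 counts, mean Pearson correlation with the other candidates) are assumptions for illustration.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def screen_reference_genes(counts):
    """Summarize each candidate reference gene.

    counts: dict mapping gene name to a list of positive counts,
            one per sample, in the same sample order for every gene.
    Returns a dict of {gene: {"sd_log2": ..., "mean_r": ...}}.
    """
    logs = {g: [math.log2(c) for c in vals] for g, vals in counts.items()}
    summary = {}
    for gene, vals in logs.items():
        # Correlation of this candidate with every other candidate.
        rs = [pearson(vals, logs[other]) for other in logs if other != gene]
        summary[gene] = {
            "sd_log2": statistics.stdev(vals),
            "mean_r": statistics.mean(rs),
        }
    return summary
```

Candidates with a comparatively high `sd_log2` or a low `mean_r` would be the first to consider dropping.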
Pathway scores are designed to summarize expression level changes of biologically related groups of genes. This score can help identify pathways that are being altered by the pathology or treatment under study, and thus can help contextualize differential expression changes observed for individual genes. Pathway scores are derived from the first Principal Component Analysis (PCA) scores (1st eigenvectors) for each sample based on the individual gene expression levels for all the measured genes within a specific pathway. Although expression levels from multiple genes will generally comprise this first PC, some of these genes will have much higher weight applied to them if they capture a greater proportion of the variability in the data.
Typically, Pathway Scores will be positive for pathways containing many up-regulated genes, and negative for those containing more down regulation of expression. One can generally make direct comparisons between scores of an individual pathway across samples within an experiment (that is, a comparison of one sample’s cell cycle pathway score to another sample’s cell cycle pathway score), and a higher score for the same pathway will generally mean greater levels of up-regulation. However, comparisons between different pathways within the same experiment or across different experiments are not recommended. Moreover, because of the complexity of the calculations for Pathway scores, interpretation of these should never be performed without correlating them to other analysis results to ensure that they are placed in the correct biological context. Thus, before concluding that a pathway has been upregulated in a group of samples, it is advised to correlate the pathway level findings to the expression levels of individual genes within that pathway.
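To illustrate how a first principal component turns many gene-level values into one score per sample, here is a small dependency-free sketch using power iteration. It is not the Advanced Analysis implementation: the function name, the starting vector, and the sign convention (orienting the component so that the gene weights sum to a positive value) are assumptions made for this example.

```python
import math

def first_pc_scores(log2_expr, n_iter=200):
    """First principal component scores for one pathway.

    log2_expr: list of samples, each a list of log2 expression values
               for the same pathway genes in the same order.
    Returns (scores, weights): one score per sample, one weight per gene.
    """
    n, p = len(log2_expr), len(log2_expr[0])
    means = [sum(row[j] for row in log2_expr) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in log2_expr]

    # Gene-by-gene covariance matrix.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]

    # Power iteration for the leading eigenvector (the gene weights).
    w = [1.0 + 0.1 * j for j in range(p)]
    for _ in range(n_iter):
        w_new = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w_new))
        if norm == 0:
            break
        w = [v / norm for v in w_new]

    # The sign of a principal component is arbitrary; orient it so that
    # higher expression tends to give a higher score (an assumption).
    if sum(w) < 0:
        w = [-v for v in w]

    scores = [sum(r[j] * w[j] for j in range(p)) for r in centered]
    return scores, w
```

Genes that account for more of the variance receive larger weights, which is why a few genes can dominate a pathway score.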
Gene Set Analysis scores are essentially an averaging of the significance measures across all the genes in the pathway, as calculated by the differential expression module. The exact calculations for these GSA scores can be found in the Advanced Analysis User Manual.
GSA scores come in two flavors: Global Significance Scores and Directed Global Significance Scores. The former (GSS) measures the overall significance of the changes within a pathway and will always be positive regardless of whether genes are up- or down-regulated. The Directed Global Significance Scores are more akin to Pathway Scores as they may have either negative or positive values (for down- and up-regulated pathways, respectively). Directed or undirected GSA Scores of greater magnitude will generally indicate a stronger pattern in the pathway level expression changes, and because these scores have been scaled to the same distribution (that of the t-statistic), they will be more robust to comparisons between different pathways or experiments. A high score indicates that a large proportion of the genes in a pathway are exhibiting changes in expression across groups of samples.
Both Pathway Analysis and Gene Set Analysis (GSA) are higher level assessments of expression changes that may be occurring within related sets of genes from the same pathway. Because both scores are generated from differences in expression between samples across many genes, the scores should be roughly concordant with each other. However, differences in the way that the calculations are performed may lead to some divergence between scores, as well as differences in the interpretation of these higher-level measurements.
One important difference is that Pathway scores are generated for individual samples, while GSA scores are ‘population’ or ‘group’ level statistics and thus measure patterns between sample groups. A subtler difference is that a Pathway score uses results from only the first PC of the PCA, meaning that it can explain only some proportion of the variance in the data, which may also cause some differences when making comparisons to the GSA scores.
Notably, Pathway scores are generated from weighted expression level data, while the genes from any pathway are given equal weight in the calculation of GSA scores. The differential gene weights in Pathway scores can allow them to detect changes that affect only a small portion of the genes in a pathway, which may be obscured in GSA if most genes in the pathway do not show significant changes in expression (that is, have a small t-statistic). Similarly, if many genes in a pathway show consistent trends in expression which are not individually significant, Pathway scores may have better sensitivity to detect these trends compared to the statistical summation approach of GSA.
In summary, comparing pathway scores directly to those from the GSA module should be performed with caution, and should always be correlated or cross-referenced with expression level changes in individual genes to ensure that biological interpretations are supported.
Parametric statistical tests operate on the assumption that the data conforms to some expected distribution, such as a normal distribution if performing a t-test. Transforming linear data into log2 values will generally satisfy this requirement, so it would be advised to transform normalized count data prior to any parametric statistical analyses.
Both the basic nSolver and the Advanced Analysis module automatically perform these log transformations in the background before performing any statistical testing, and as such all the reported p-values are based on log-transformed data. It should be noted that while data in basic nSolver are still generally displayed after being back transformed into linear space, results in the advanced analysis module (counts as well as fold changes) are displayed only in log2 space.
If performing any data analysis outside of the nSolver software, it is recommended to work from log-transformed nCounter® data.
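For analyses outside of nSolver, the transformation itself is straightforward. The sketch below assumes normalized counts that are all greater than zero (how to handle zeros, for example by applying a floor, is left to you), and the function names are hypothetical.

```python
import math

def log2_transform(counts):
    """Convert normalized linear counts (all > 0) to log2 values."""
    return [math.log2(c) for c in counts]

def log2_fold_change(group_a, group_b):
    """Mean log2 expression of group_b minus that of group_a.

    Both arguments are lists of normalized linear counts for one gene.
    A result of 1.0 corresponds to a 2-fold increase in linear space.
    """
    log_a = log2_transform(group_a)
    log_b = log2_transform(group_b)
    return sum(log_b) / len(log_b) - sum(log_a) / len(log_a)
```

A log2 fold change can be converted back to linear space with `2 ** value` when a linear fold change is needed for display.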
Cell type profiling scores are generated for immune cell types using expression levels of cell-type specific mRNAs as described in the literature. For details of the selection and validation process for these markers, see Danaher et al 2016 (Gene expression markers of Tumor Infiltrating Leukocytes BioRxiv August 11, 2016).
The cell type score itself is calculated as the mean of the log2 expression levels for all the probes included in the final calculation for that specific cell type. Because the scores are dependent on probe-specific counting and capturing efficiencies, these should only be interpreted as relative cell abundance values compared to the same cell type within other samples or groups of samples. The scores should not be used as measures for the abundance of a cell type relative to other cell types within the same sample, nor should it be used to quantitate cell abundance within a single sample.
Cell type scores may be calculated as raw or relative scores. The raw cell scores will measure the overall cell abundance for that type of cell, whereas the relative cell scores measure the specific cell abundance relative to (essentially normalized to) the abundance of Tumor Infiltrating Leukocytes (TILs) in that sample. These are defined as the average of B-cell, T-cell, CD45, Macrophage, and Cytotoxic cell scores. This relative score can alternatively be customized to incorporate a baseline cell type or mixed population other than TILs.
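The raw and relative scores described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and expressing the relative score as a difference of log2 scores (raw score minus TIL score) is an assumption about what "relative to" means in log space.

```python
import math
import statistics

# Cell types averaged to form the TIL score, as listed in the text above.
TIL_COMPONENTS = ("B-cell", "T-cell", "CD45", "Macrophage", "Cytotoxic")

def raw_cell_score(counts, markers):
    """Mean log2 expression of the marker probes for one cell type.

    counts: dict mapping gene name to normalized count for one sample.
    markers: gene names retained as markers for the cell type.
    """
    return statistics.mean(math.log2(counts[g]) for g in markers)

def til_score(raw_scores):
    """Average of the raw scores of the TIL component cell types."""
    return statistics.mean(raw_scores[c] for c in TIL_COMPONENTS)

def relative_cell_score(raw_scores, cell_type):
    """Raw score of a cell type expressed relative to the TIL score."""
    return raw_scores[cell_type] - til_score(raw_scores)
```

Because each score depends on probe-specific efficiencies, these values are only meaningful when the same cell type is compared across samples.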
The genes used for immune cell scoring comprise a subset of high confidence markers validated by co-expression patterns via a large survey of TCGA samples (N=9986), and confirmed by nCounter and protein analysis (Danaher et al, 2016 http://dx.doi.org/10.1101/068940). To some extent, these markers thus already represent high confidence markers for these cell types.
An additional level of quality control is by default performed within the Advanced Analysis module, whereby correlations are calculated between the expression levels of these candidate cell type markers. Those markers which do not correlate with other cell type-specific markers are discarded from the estimates of abundance. Such markers may be expressed at low levels in another cell type, or they may show highly variable expression levels within their specific cell type, in either case making the gene a poor marker for cell type abundance.
The Advanced Analysis module will also, by default, utilize a resampling technique to generate a significance level for confidence in the individual cell type scores. Cell scores with p-values below a threshold level of confidence (e.g., 0.05) would be considered higher confidence stand-alone estimates of abundance. Note that some cell scores will only ever be based on a single literature-validated cell-specific marker, and the statistical resampling method can only ever return a p value of ‘1.0’ for these scores (e.g., Tregs are characterized only by expression of the gene FOXP3). Importantly, the cell abundance levels for these and other cell type scores with p-values greater than 0.05 should not necessarily be ignored, nor should they be considered unrelated to immune cell abundance. Rather, scores without this independent confirmation should be considered hypotheses, with a confidence level based on the strength of these marker associations with cell type from the literature. The marker for Tregs, for example, is considered quite robust for this cell type and can therefore be reliably used as an estimate of Treg cell abundance, despite the single gene abundance score having a p-value of 1.0 in the software.
Multi-RLF experiments come in two different types: CrossRLF/Batch Calibration and Multi-RLF Merge.
The CrossRLF/Batch Calibration option allows you to consolidate datasets of primarily distinct samples that were each run on multiple CodeSets (RLF) or on different CodeSet or reagent lots. At least one calibrator sample must be run across all CodeSets/lots. An ideal calibrator sample has robust counts (>200) for all genes of interest.
The Multi-RLF Merge option allows for data from a set of identical samples run across two or more CodeSets to be aggregated.
To create a CrossRLF experiment in nSolver, both RLFs need to be uploaded before creating the experiment. nSolver will normalize within each RLF (i.e. each batch) followed by CodeSet calibration using the calibrator samples run in each lot. The manual goes over this process in the section entitled “Multi-RLF Experiments & Batch Calibration”, on pages 92-94.
To use Advanced Analysis with a CrossRLF/batch-corrected experiment, start with the “Normalized Data,” choose a Custom Analysis, and make sure to uncheck the “Normalize mRNA” box in the Normalization module options.
To create a multi-RLF Merge experiment in the Advanced Analysis module, one must first create a basic nSolver experiment for each separate CodeSet. It is critical to ensure that sample names are well annotated so identical samples can be easily matched up in the combined dataset. Since the geNorm algorithm for automatic housekeeping gene selection is not available in the Advanced Analysis module for multi-RLF Merge experiments, normalization should be finalized in the initial experiment before proceeding to the next stage. If housekeeping genes have been validated, these can be picked manually in each respective basic nSolver experiment. Alternatively, each experiment should be run separately through the Advanced Analysis module with the sole purpose of identifying the most stably expressed targets using the geNorm module. Thereafter, the basic nSolver experiment should be re-run using these selected genes for normalization.
Next, upload the RLF files for each CodeSet into nSolver.
Create a multi-RLF Merge experiment in nSolver. The manual goes over this process in the section entitled “Multi-RLF Experiments & Batch Calibration”, on pages 95-97. Here you will need to align the various nSolver files (one for each sample for each CodeSet), specifically matching up the sample names across CodeSets (panels).
Lastly, from the new multi RLF experiment, select all the samples from the normalized data table, and run them through the Advanced Analysis Module as normal. The option to normalize the data here is automatically disabled, as the software expects that the data have already been normalized according to the methods outlined above.
Data normalization is designed to remove sources of technical variability from an experiment, so that the remaining variance can be attributed to the underlying biology of the system under study. The precision and accuracy of nCounter Gene Expression assays are dependent upon robust methods of normalization to allow direct comparison between samples. There are many sources of variability that can potentially be introduced into nCounter assays. The largest and most common categories of variability originate from either the platform or the sample. Both types of variability can be normalized using standard normalization procedures for Gene Expression assays.
Standard normalization uses a combination of Positive Control Normalization, which uses synthetic positive control targets, and CodeSet Content Normalization, which uses housekeeping genes, to apply a sample-specific correction factor to all the target probes within that sample lane. These correction factors will control for sources of variability such as pipetting errors, instrument scan resolution, and sample input variability that affect all probes equally.
Note that Positive Control Normalization will not correct for sample input variability, and thus should usually be used in combination with CodeSet Content (housekeeping gene) Normalization. Performing such a two-step normalization will usually not differ mathematically from Content Normalization alone, and thus is mathematically somewhat redundant. Nevertheless, normalizing to both target classes will provide a good indicator of how technical variability is partitioned between the two major sources of assay noise (platform and sample), and thus may provide a good tool for troubleshooting low assay performance. Normalization workflows are described below.
nCounter Reporter probe (or TagSet) tubes are manufactured to contain six synthetic ssDNA control targets. The counts from these targets may be used to normalize all platform-associated sources of variation (e.g., automated purification, hybridization conditions, etc.).
The procedure is as follows:
- Calculate the geometric mean of the positive controls for each lane (POS_E to POS_A).
- Calculate the arithmetic mean of these geometric means for all sample lanes.
- Divide this arithmetic mean by the geometric mean of each lane to generate a lane-specific normalization factor.
- Multiply the counts for every gene by its lane-specific normalization factor.
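The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the nSolver implementation: the function names, lane names, and counts are hypothetical, and it assumes every control count is greater than zero (a zero count would break the logarithm).

```python
import math

def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalization_factors(control_counts):
    """control_counts maps lane -> list of control counts for that lane.
    Returns lane -> lane-specific normalization factor."""
    # Step 1: geometric mean of the controls in each lane.
    lane_geomeans = {lane: geometric_mean(c) for lane, c in control_counts.items()}
    # Step 2: arithmetic mean of those geometric means across all lanes.
    overall_mean = sum(lane_geomeans.values()) / len(lane_geomeans)
    # Step 3: divide the overall mean by each lane's geometric mean.
    return {lane: overall_mean / g for lane, g in lane_geomeans.items()}

def apply_factors(counts, factors):
    """Step 4: multiply every gene count in a lane by that lane's factor."""
    return {lane: {gene: c * factors[lane] for gene, c in genes.items()}
            for lane, genes in counts.items()}

# Hypothetical POS_A..POS_E counts; lane2 recovered half as many counts as lane1.
pos = {"lane1": [20000, 5000, 1250, 300, 80],
       "lane2": [10000, 2500, 625, 150, 40]}
factors = normalization_factors(pos)

# Hypothetical gene counts showing the same two-fold technical difference.
genes = {"lane1": {"GENE_X": 400}, "lane2": {"GENE_X": 200}}
normalized = apply_factors(genes, factors)
```

In this made-up example the two-fold difference between lanes is removed, so GENE_X ends up with the same normalized count in both lanes.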
It is expected that some noise will be introduced into the nCounter assay due to variability in sample input. For most experiments, normalization of sample input is most effectively done using so-called housekeeping genes. These are mRNA targets included in a CodeSet which are known to or are suspected to show little-to-no variability in expression across all treatment conditions in the experiment. Because of this, these targets will ideally vary only according to how much sample RNA was loaded.
Using the geometric mean of three housekeeping genes, at minimum, to calculate normalization factors is highly recommended. This is done in order to minimize the noise from individual genes and to ensure that the calculations are not weighted towards the highest expressing housekeeping targets. It is important to note that some previously-identified housekeeping genes may, in fact, behave poorly as normalizing targets in the current experiment, and may therefore need to be excluded from normalization.
The procedure is the same as that for Positive Control Normalization:
- Calculate the geometric mean of the selected housekeeping genes for each lane.
- Calculate the arithmetic mean of these geometric means for all sample lanes.
- Divide this arithmetic mean by the geometric mean of each lane to generate a lane-specific normalization factor.
- Multiply the counts for every gene by its lane-specific normalization factor.
Samples with normalization flags have counts for the positive controls and/or housekeeping genes that are much lower or higher than those of most samples included in the analysis. A Positive Control Normalization flag may indicate a notable difference in hybridization/assay performance compared to most of the samples in the analysis. In certain situations, samples with these flags may need to be re-run or excluded. A CodeSet Content Normalization flag may indicate a notable difference in RNA quality and/or input amount compared to most samples in the analysis. Samples receive a CodeSet Content Normalization flag if the CodeSet Content Normalization factor is < 0.1 or > 10, as anything beyond these values will result in inaccurate normalization. As such, samples with CodeSet Content Normalization flags may need to be excluded or (if possible) re-run at higher or lower input amounts depending on the normalization factor.
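As a rough illustration of the CodeSet Content Normalization flag, the sketch below checks a set of lane-specific factors against the 0.1 and 10 limits stated above. The function name, lane names, and factor values are hypothetical; nSolver applies this check itself, so this only shows the logic.

```python
def content_normalization_flags(factors, low=0.1, high=10.0):
    """Return the lanes whose content normalization factor is < low or > high."""
    return sorted(lane for lane, f in factors.items() if f < low or f > high)

# Hypothetical lane-specific CodeSet Content Normalization factors.
factors = {"lane1": 0.9, "lane2": 12.5, "lane3": 0.05, "lane4": 1.3}
flagged = content_normalization_flags(factors)
```

Because the factor is the all-lane mean divided by the lane's own housekeeping geometric mean, a factor above 10 corresponds to a lane with very low housekeeping counts, and a factor below 0.1 to a lane with very high ones.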
The best approach for normalizing miRNA data will depend mostly on the sample type they represent. For everything except biofluids (such as plasma or serum), using a “global” normalization method which normalizes to total counts of the 100 most highly expressed (on average) miRNA targets across all samples is recommended. This is called the TOP 100 method in the software. Importantly, this method does not use the Positive Control or Positive Ligation Control probes for any of these calculations.
However, it does get more complicated with biofluids (or any other sample type) where the number of expressed targets drops below ~150-200. As a frame of reference, targets expressed above background are usually identified by comparison to the Negative control probes (the mean, the mean + 2 standard deviations, the maximum value of the NEG probes, or 100 to be conservative).
When normalizing samples from biofluids, a judgement call can be made depending on how many targets are expressed above background. In the miRNA assay, background would usually be ~30 counts, but will vary from one experiment to the next. Therefore, sometimes a global approach (TOP 100 method) can still work with biofluids if samples express 100-150 miRNA targets above this cutoff.
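The judgement call described above can be sketched as follows. This is a minimal example under stated assumptions: it uses the mean + 2 standard deviations of the NEG probes as the background cutoff (one of the options listed above), uses the sample standard deviation, and treats 100 expressed targets as the lower bound for the TOP 100 method. The function names, probe names, and counts are hypothetical.

```python
import statistics

def background_threshold(neg_counts, n_sd=2):
    """Mean of the NEG probe counts plus n_sd sample standard deviations."""
    return statistics.mean(neg_counts) + n_sd * statistics.stdev(neg_counts)

def count_expressed(target_counts, threshold):
    """Number of miRNA targets whose count is above the background cutoff."""
    return sum(1 for c in target_counts.values() if c > threshold)

def suggest_method(n_expressed, min_targets=100):
    """Assumed rule of thumb: a global approach needs at least min_targets
    expressed targets; otherwise fall back to housekeeper miRNAs."""
    return "TOP 100" if n_expressed >= min_targets else "housekeeper miRNAs"

# Hypothetical NEG probe counts and a (very small) set of miRNA targets.
neg = [10, 20, 30, 20, 25, 15]
sample = {"miR-a": 500, "miR-b": 34, "miR-c": 35, "miR-d": 12}

threshold = background_threshold(neg)
n_expressed = count_expressed(sample, threshold)
method = suggest_method(n_expressed)
```

A real miRNA panel has several hundred targets; the four shown here only demonstrate how the cutoff separates expressed from non-expressed targets.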
If too few targets are expressed above background for a global approach, identifying good “housekeeper” miRNAs will likely allow you to normalize and obtain robust results. There are not many well-characterized housekeeper miRNA targets from plasma or other biofluids, as they seem to vary depending on the extraction kit and the pathology being studied. Consequently, a literature search will not necessarily help you determine appropriate housekeepers, and a more data-driven approach is better suited. Third-party software or algorithms can identify the most stably expressed targets within a particular experiment. It is recommended that this method of identifying housekeeping genes be repeated as more data is generated, to confirm that the chosen targets are appropriate for the entirety of the study and not just for the initial experiment.
Among the published algorithms for identifying stable housekeeper genes, NormFinder is the easiest starting point because it is free and easy to use.
Claus Lindbjerg Andersen, Jens Ledet Jensen and Torben Falck Ørntoft. Cancer Res 2004;64:5245-5250.
geNorm is another program that uses slightly different principles. Specifically, NormFinder chooses targets with the lowest within and between group variance, while geNorm also picks multiple targets that give the lowest estimates of variance when they are used together (NormFinder only picks them individually or gives the best two together). geNorm can be obtained with a license.
If Spike-In synthetic miRNAs are used to normalize variance introduced in purification of samples, it is assumed and highly recommended that equal volume inputs are used across samples. Synthetic oligos must be spiked in before sample extraction, and it is strongly recommended that Spike-Ins are used for all samples in that experiment.
THREE METHODS for NORMALIZATION
- Normalize using only the Spike-In control probes
- Normalize using only the Housekeeping miRNA targets as identified by the user.
- First normalize all the endogenous counts (including the putative miRNA housekeepers) to the Spike-In control probes. Then use the Spike-In normalized miRNA housekeeper counts to normalize the endogenous miRNA targets. This option is not available in the nSolver software, so it needs to be performed in Excel. The basic workflow is below:
WORKFLOW in EXCEL for normalization (only for method 3 above)
1. For each lane calculate the geometric mean of the Spike-In controls.
2. Calculate the arithmetic mean of these geometric means across all lanes.
3. Divide this arithmetic mean by the geometric mean in each lane (calculated in #1) to get a lane-specific normalization factor.
4. Multiply all the endogenous counts in a lane by its lane-specific normalization factor.
5. Repeat 1 through 4 using the Spike-In normalized housekeeper miRNA targets.
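For users who prefer a script to a spreadsheet, the five steps above might look like the following Python sketch. It is an illustration only: the probe names (SPIKE_1, HK_1, TARGET, and so on), lane names, and counts are hypothetical, all counts are assumed to be greater than zero, and for simplicity the factor is applied to every probe in the lane rather than to the endogenous counts alone.

```python
import math

def geomean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def scale_by_reference(lanes, reference_names):
    """Steps 1-4: scale every count in each lane by a factor derived from
    the named reference probes (Spike-Ins or housekeepers)."""
    # 1. Geometric mean of the reference probes in each lane.
    lane_geomeans = {lane: geomean([counts[name] for name in reference_names])
                     for lane, counts in lanes.items()}
    # 2. Arithmetic mean of those geometric means across all lanes.
    overall = sum(lane_geomeans.values()) / len(lane_geomeans)
    # 3 and 4. Lane-specific factor, applied to all counts in that lane.
    return {lane: {name: c * overall / lane_geomeans[lane]
                   for name, c in counts.items()}
            for lane, counts in lanes.items()}

# Hypothetical raw counts for two lanes.
raw = {
    "lane1": {"SPIKE_1": 100, "SPIKE_2": 400, "HK_1": 1000, "HK_2": 4000, "TARGET": 500},
    "lane2": {"SPIKE_1": 50, "SPIKE_2": 200, "HK_1": 1000, "HK_2": 4000, "TARGET": 800},
}

# Steps 1-4 with the Spike-In controls as reference.
spike_normalized = scale_by_reference(raw, ["SPIKE_1", "SPIKE_2"])
# Step 5: repeat with the Spike-In normalized housekeeper miRNAs as reference.
final = scale_by_reference(spike_normalized, ["HK_1", "HK_2"])
```

After both passes the housekeeper counts are identical across lanes, which is a quick sanity check that the second normalization was applied to the Spike-In normalized values rather than to the raw counts.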
The three methods for normalization may yield similar results. Typically, the better normalization approaches will result in overall lower variance. Below is an example graph depicting what would be expected of a typical normalization method. For each of the three methods, variance should be calculated, and the lowest variance method should be chosen. Theoretically, the third method provides the best reduction in technical and sample input variance.
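The text does not specify how the variance of each method should be summarized. One possible way, shown here purely as an assumption, is to compute the %CV of each target across lanes and average those values, then pick the method with the lowest average. The method labels, target names, and counts are hypothetical.

```python
import statistics

def mean_cv(normalized):
    """normalized maps lane -> {target: normalized count}.
    Returns the mean %CV across targets (an assumed summary statistic)."""
    targets = next(iter(normalized.values())).keys()
    cvs = []
    for target in targets:
        values = [lane_counts[target] for lane_counts in normalized.values()]
        cvs.append(100 * statistics.stdev(values) / statistics.mean(values))
    return statistics.mean(cvs)

def pick_lowest_variance(results):
    """results maps method name -> normalized data. Returns (best, scores)."""
    scores = {method: mean_cv(data) for method, data in results.items()}
    return min(scores, key=scores.get), scores

# Hypothetical normalized counts from two of the methods.
results = {
    "spike_in": {"lane1": {"T1": 100, "T2": 50}, "lane2": {"T1": 200, "T2": 100}},
    "housekeeper": {"lane1": {"T1": 150, "T2": 70}, "lane2": {"T1": 150, "T2": 80}},
}
best, scores = pick_lowest_variance(results)
```

This comparison is only meaningful across technical or biological replicates, where lower variance reflects less technical noise rather than the removal of real biological differences.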
The Housekeeping (HK) Gene selection in the Advanced Analysis is performed by default using the geNorm algorithm described in the following paper:
Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A, et al. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome biology. 2002;3(7):research0034.
The geNorm algorithm assumes that HK gene expression does not change across all samples, irrespective of the experimental condition. Based on that assumption, geNorm expects that the ratio between HK Gene A and HK Gene B within sample 1 will be the same as the ratio between HK Gene A and HK Gene B in sample 2, and sample 3, etc. If that is not true for the dataset, the aberrant gene is not used as a HK gene for normalization. As such, geNorm looks at the different ratios between the potential HK genes and iteratively removes HK genes that do not perform as expected. In the end, it retains a set of optimal HK genes to be used for the final normalization. A more detailed explanation can be found in the Advanced Analysis User Manual (MAN-10030) on page 41 here.
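The ratio-based idea can be sketched as follows, based on the stability measure M from the Vandesompele et al. paper: for each candidate gene, take the standard deviation of its log2 ratio to every other candidate across samples, and average those values. This is a simplified illustration and not the Advanced Analysis code; in particular, the stopping rule (`keep=3`), gene names, and counts are hypothetical.

```python
import math
import statistics

def genorm_m(expr):
    """expr maps gene -> list of counts across samples (same sample order).
    Returns gene -> M, the mean SD of its log2 ratios to all other genes."""
    m = {}
    for gene_j, counts_j in expr.items():
        sds = []
        for gene_k, counts_k in expr.items():
            if gene_k == gene_j:
                continue
            log_ratios = [math.log2(a / b) for a, b in zip(counts_j, counts_k)]
            sds.append(statistics.stdev(log_ratios))
        m[gene_j] = sum(sds) / len(sds)
    return m

def select_housekeepers(expr, keep=3):
    """Iteratively drop the least stable gene (highest M) until `keep` remain."""
    remaining = dict(expr)
    while len(remaining) > keep:
        m = genorm_m(remaining)
        worst = max(m, key=m.get)
        del remaining[worst]
    return sorted(remaining)

# Hypothetical candidates across four samples with increasing input.
# HK_A, HK_B and HK_C track input; HK_D does not, so its ratios vary.
candidates = {
    "HK_A": [100, 200, 400, 800],
    "HK_B": [50, 100, 200, 400],
    "HK_C": [210, 400, 790, 1600],
    "HK_D": [300, 300, 300, 300],
}
m_values = genorm_m(candidates)
selected = select_housekeepers(candidates, keep=3)
```

Note that HK_D has the lowest variability across samples in absolute terms, yet it is the one removed, because its ratio to the other candidates is not constant.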
Within a typical nSolver workflow, HK genes can be selected by the user based on their variability across samples (%CV) and/or their average expression levels. Overall, this is a relatively good approach. However, %CV can be biased in both directions: variability in the data can coincidentally lower the %CV of a less optimal gene so that it appears artificially stable, and input variability can raise the %CV of a normally stable gene so that it appears unstable. In the latter situation, geNorm will still identify the gene as a good housekeeping gene (as it is ratio-based within samples), whereas the %CV method will discard it. Therefore, relying solely on %CV is not recommended; a consistent trend amongst the annotated housekeeping genes must also be considered.
- FOV (field of view) registration is as close to 100% as possible, but minimally 75%.
- Binding density is in the linear dynamic range (between 0.05-2.25 for MAX/FLEX, 0.1-1.8 for SPRINT)
- POS controls (POS_A to POS_E) have robust counts and are in a linear range (R^2 higher than 0.95)
- NEG controls have low counts (average < 50 is expected)
- At least three housekeeping genes have reasonable counts that are above background and cover the range of gene expression (counts in the thousands and counts in the hundreds, etc)
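Three of the checks in the list above (FOV registration, POS control linearity, and NEG control background) can be expressed as a short script. This is a sketch under assumptions: the R^2 is computed between log2 counts and log2 concentrations, the POS concentrations are passed in by the caller (the example values are illustrative and should be checked against your assay documentation), and the function names and counts are hypothetical.

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def lane_qc(fov_registration_pct, pos_counts, pos_concentrations, neg_counts):
    """Return a pass/fail result for three of the checklist items."""
    log_conc = [math.log2(c) for c in pos_concentrations]
    log_counts = [math.log2(c) for c in pos_counts]
    return {
        "fov_registration_ok": fov_registration_pct >= 75,
        "pos_linearity_ok": r_squared(log_conc, log_counts) > 0.95,
        "neg_background_ok": sum(neg_counts) / len(neg_counts) < 50,
    }

# Hypothetical lane: POS_A..POS_E counts, assumed concentrations, NEG counts.
report = lane_qc(
    fov_registration_pct=98.0,
    pos_counts=[25600, 6400, 1600, 400, 100],
    pos_concentrations=[128, 32, 8, 2, 0.5],
    neg_counts=[12, 8, 20, 15, 10, 9],
)
```

Binding density and housekeeping gene coverage are left out here because they depend on the instrument type and on the CodeSet being used.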
The nCounter systems monitor the amount of barcode bound to each sample lane in a cartridge by calculating a metric called Binding Density. More precisely, the binding density for each sample is equal to the number of fluorescent spots per square micron that are bound to each sample lane in the cartridge. Saturation of barcodes bound to the cartridge surface can potentially compromise data quantification and linearity of the assay.
Within nSolver 4.0 (if using a version other than 4.0, consult the user manual for that version), the default, optimal range for Binding Density is:
- 0.1 – 2.25 for MAX/FLEX instruments
- 0.1 – 1.8 for SPRINT instruments
Binding densities flagged for being greater or less than the optimal range do not necessarily indicate assay failure. Closer inspection of the data to determine the specific impact, if any, of high or low binding density is highly recommended.
A combination of several factors can affect binding density, including:
- Assay input quantity: the higher the amount of input used for the assay, the higher the Binding Density will be. The relationship between input amount and Binding Density is linear until the point of assay saturation. Conversely, if the amount of sample input is too low, the Binding Density will likely be flagged for being less than the optimal range.
- Expression level of genes: if the target genes have high expression levels, there will be more molecules on the lane surface which will increase the Binding Density value.
- Size of the CodeSet: a large CodeSet with probes for many targets is more likely to have high Binding Density values than a CodeSet with probes for fewer targets. A small CodeSet with a limited number of targets is more likely to have low Binding Density values.
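As a small illustration, the default nSolver 4.0 ranges given above can be checked programmatically. The function name and the dictionary layout are hypothetical; nSolver raises these flags itself, so this only mirrors the logic for use in custom scripts.

```python
# Default nSolver 4.0 Binding Density ranges as stated in the text above.
BINDING_DENSITY_RANGES = {
    "MAX/FLEX": (0.1, 2.25),
    "SPRINT": (0.1, 1.8),
}

def binding_density_flag(value, instrument):
    """Return 'low', 'high', or None if the value is within the optimal range."""
    low, high = BINDING_DENSITY_RANGES[instrument]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return None

# The same lane value can be in range on one instrument and flagged on another.
flag_sprint = binding_density_flag(2.0, "SPRINT")
flag_max = binding_density_flag(2.0, "MAX/FLEX")
```

A flag returned here is a prompt to inspect the data more closely, not a verdict that the lane failed.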
The normalized data in Advanced Analysis can be downloaded by navigating to the “Normalization” tab in the HTML report containing the analysis results and clicking on the green button labeled “All Normalized Data” (see regions circled in red below). The normalized data is formatted as a CSV file which can then be opened in Excel. Additionally, the normalized data file can be found in the “Normalization” folder within the Advanced Analysis output file written to the user’s computer.
For Windows:
- First, make sure that you “unhide” hidden folders on your computer. Go to the Control Panel and, using the search bar, look for “Folder Options” (if using Win 7 or 8) or “File Explorer Options” (if using Win 10). Open it and click the “View” tab. Within the “View” tab, find “Show hidden files, folders, and drives” and select it. Click Apply and exit. Hidden files and folders should now be visible.
- Navigate to
c:\users\<username>\appdata\roaming\nSolver4
- Copy the folder, zip the copy, and save it to a USB drive.
- On the computer being transferred to, if nSolver is already present, find the nSolver4 folder at the above location and rename (or move to another location).
- Unzip then place the nSolver4 folder to be transferred into the same location on the computer.
For Mac:
(A) Using Finder
- Open Terminal and type in the following command:
defaults write com.apple.finder AppleShowAllFiles YES
This will make the invisible files show up in Finder (warning: if you have never done this, there are lots of invisible files and directories).
- Copy the .nSolver4 directory to be transferred (/Users/<user name>/.nSolver4; don’t forget the preceding dot in .nSolver4) to a flash drive, or zip the copy for transport via file share.
- On the other Mac, type in the same Terminal command to show hidden folders, then rename the existing .nSolver4 directory (e.g. .nSolver4a), unzip the replacement, and drag and drop it into the same location.
- When finished, go back to Terminal on both systems and type:
defaults write com.apple.finder AppleShowAllFiles NO
This will return everything to normal.
(B) Using Terminal
You can use Terminal to do all of this, which means that you don’t have to tell Finder to show hidden files. Type ls -la to list all files, including hidden ones.
Use the ditto command to copy (copy/paste should work as well).
Making hidden files and folders viewable for Mac:
- Open a terminal window by finding “Terminal” in the Utilities folder in the Applications directory. At the prompt copy and paste the following:
defaults write com.apple.finder AppleShowAllFiles TRUE
killall Finder
- Hit Enter.
We do not recommend changing the RLF name as this can cause difficulties with data collection and analysis as well as lead to confusion if the data are analyzed in the future by someone unaware of the RLF modification. We strive to maintain the single correct version of each RLF file within our bioinformatics database. If you are seeing differences in content within a single RLF version, please contact support@nanostring.com with the RLFs in question.
AA installation MAC
Here are some detailed instructions, including Mac-specific information, for getting the AA module working:
- Install R 3.3.2, using this download link: R-3.3.2.pkg
- Install XQuartz, which you can obtain here: https://www.xquartz.org/
- Download the zipped file titled nCounter Advanced Analysis 2.0.115. Do not extract or unzip this compressed file. Since Macs will often do this automatically, you will likely need to re-zip this folder to install the module.
- Save/move this file to a location of your choosing on your computer.
- Launch the nSolver™ Analysis Software.
- From the Analysis menu at the top of the home screen, select Advanced Analysis Manager.
- In the Advanced Analysis Manager window, click Import New Advanced Analysis.
- Browse to the directory where the nCounter Advanced Analysis 2.0.115 .ZIP file is located and click OK.
- Wait for import to complete and the name to populate automatically.
- Click OK to exit the Advanced Analysis Manager.