Last updated: 2025-03-24

Checks: 7 passed, 0 failed

Knit directory: McConville_Lab_Microbiome_PMC255A/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20250117)

The command set.seed(20250117) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 3a65c0b

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 3a65c0b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/raw_data/

Untracked files:
    Untracked:  analysis/PMC255A_ScreenQC_v4.Rmd
    Untracked:  analysis/prelim/

Unstaged changes:
    Modified:   analysis/PMC255A_ScreenComparison.Rmd
    Deleted:    analysis/PMC255A_ScreenQC_v3.Rmd
    Modified:   analysis/_site.yml

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/VCFG_analysis_methods.Rmd) and HTML (docs/VCFG_analysis_methods.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	4df8175	annrann.wong	2025-02-13	Update Viability Bins
html	fc42795	annrann.wong	2025-02-12	Update MeanSpotCount readout
html	25b8bb6	annrann.wong	2025-02-12	Update Analysed Data Table
html	6af885f	annrann.wong	2025-02-10	Build site.
html	9906a06	annrann.wong	2025-02-07	Build site.
html	a9b3cea	WongAnnRann	2025-02-07	Update analysis
Rmd	3cc6a61	WongAnnRann	2025-02-07	Updated analysis
html	3cc6a61	WongAnnRann	2025-02-07	Updated analysis
Rmd	66cc268	annrann.wong	2025-01-30	Initial Commit

Identify outliers

For data with more than two dimensions, a two-dimensional scatterplot is produced using the first two principal components. The “PCAproj” function computes a desired number of (robust) principal components using the algorithm of Croux and Ruiz-Gazen (JMVA, 2005).

Screen quality

Heat maps

Heat maps show the raw cell counts on every plate, with the scale calculated across all plates on a per-cell line basis.
blue = 0, white = median, red = max

Expected result: The pattern on each plate should look relatively random (depends on experimental design). Suspicious patterns include: death in the same negative control wells across every plate (potential drug plating issue); consistently lower values in the outside wells of the plate (potential edge effects); and two adjacent rows having dramatically lower/higher values than other rows (potential instrument issue when seeding cells).

Screen QC Metrics

The mean, stdev and %CV were calculated across all plates in the screen (using normalised values).

Expected result: In terms of variability, CVs > 24% are considered unacceptable. This does not apply to positive controls that cause an extremely high level of death, because a high degree of variability is unavoidable when working with very small numbers. Negative controls, on the other hand, should ideally have CVs < 15%.
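The %CV check can be sketched in a few lines of base R. This is a minimal illustration, not the code used for the report; the function name `percent_cv` and the example values are hypothetical.

```r
# Percent coefficient of variation: 100 * standard deviation / mean.
percent_cv <- function(x) {
  x <- x[!is.na(x)]
  100 * sd(x) / mean(x)
}

# Hypothetical normalised values for the negative-control wells of a screen.
neg_ctrl <- c(0.95, 1.02, 1.00, 1.08, 0.97, 0.98)
cv <- percent_cv(neg_ctrl)

# Compare against the thresholds quoted above (ideal < 15%, unacceptable > 24%).
cv < 15
```

The same function can be applied per plate (for example with `tapply`) to obtain the plate-level metrics.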

Plate QC Metrics

The mean, stdev and %CV were calculated across each plate separately (using raw values). The purpose of this section is to identify if particular plates are more variable than others.

The Z’ Factor was also calculated to assess the degree of overlap between each positive and negative control pair.
Formula: Z’ = 1 - ((3 * (negctrl_stdev + posctrl_stdev)) / (negctrl_mean - posctrl_mean))

Note: if the positive control is supposed to increase rather than decrease the value, the means in the denominator are swapped:
Formula: Z’ = 1 - ((3 * (posctrl_stdev + negctrl_stdev)) / (posctrl_mean - negctrl_mean))

Expected result: In terms of variability, CVs > 24% are considered unacceptable. This does not apply to positive controls that cause an extremely high level of death, because a high degree of variability is unavoidable when working with very small numbers. Negative controls, on the other hand, should ideally have CVs < 15%.

Z’ Factor values are categorised as follows: Z’ = 0.5-1 excellent, 0.3-0.5 good, 0-0.3 acceptable, < 0 unacceptable (there is too much overlap between positive and negative controls).
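As a rough sketch, the Z’ Factor and its category could be computed as follows. Taking the absolute difference of the control means is an assumption made here so that one function covers both directions of the positive control; the function name, the example counts and the handling of the category boundaries (lower bound inclusive) are also illustrative choices.

```r
# Z' = 1 - 3 * (sd_neg + sd_pos) / |mean_neg - mean_pos|
z_prime <- function(neg, pos) {
  1 - (3 * (sd(neg) + sd(pos))) / abs(mean(neg) - mean(pos))
}

# Hypothetical raw cell counts for one plate.
neg <- c(100, 105, 95, 102, 98)
pos <- c(10, 12, 8, 11, 9)
zp  <- z_prime(neg, pos)

# Map the value onto the categories listed above.
category <- cut(zp,
                breaks = c(-Inf, 0, 0.3, 0.5, Inf),
                labels = c("unacceptable", "acceptable", "good", "excellent"),
                right  = FALSE)
```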

Notched box plots

The controls were plotted in notched box plot format. The values shown are normalised to the negative control median on a per-plate basis. The notch displays a confidence interval around the median (based on the median +/- 1.58*IQR/sqrt(n)). Outliers are shown in black.
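The notch limits follow directly from the expression in the text. The helper below is a sketch with a hypothetical name; note that base R's `boxplot.stats()` computes the same interval from the hinges, which can differ slightly from `IQR()` for some sample sizes.

```r
# Approximate confidence interval around the median used for box plot notches.
notch_ci <- function(x) {
  x <- x[!is.na(x)]
  median(x) + c(-1.58, 1.58) * IQR(x) / sqrt(length(x))
}

# Hypothetical values; for this sample the hinges and quartiles coincide.
x <- 1:9
notch_ci(x)
boxplot.stats(x)$conf
```

If the notches of two boxes do not overlap, their medians are commonly read as differing.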

Dot plots

The raw and normalised values for all wells were plotted in dot plot format. Raw cell counts should be consistent across multiple plates for the same cell line. Normalised cell counts should be consistent across all plates, with values of ~1 for negative controls and consistently low values for positive controls.

Viability

Robust Z-Scoring

To the median of all sample wells

The normalised values can be Robust Z-Scored to the median of all of the samples (on a per-cell line basis) in order to assess the relative strength of each sample. The values must always be normalised on a per-plate basis prior to Z-Scoring to account for batch effects. This method can be used to select the strongest and most robust hits from a large dataset (Z-Scores of >= 2 are considered to be significant) and is often used as a method of hit identification in viability screens. Because this is a relative measure, a sample that had a moderate impact could result in a small Z-Score (if there are many hits in the dataset) or a large Z-Score (if there are very few hits in the dataset). For this reason, it is important to use Z-Scoring in combination with fold change (normalised value) when selecting hits.

Formula:
Robust Z-Score = (normalised sample value - median of all sample values) / median absolute deviation of all sample values
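A minimal sketch of this calculation in base R is shown below. The function name and example values are hypothetical. `mad()` multiplies by 1.4826 by default; the `constant` argument is set to 1 here on the assumption that the raw median absolute deviation is intended, and can be changed if the scaled version is preferred.

```r
# Robust Z-Score of x relative to the median and MAD of a reference set.
# By default the reference is the full set of sample values.
robust_z <- function(x, reference = x, constant = 1) {
  (x - median(reference, na.rm = TRUE)) /
    mad(reference, constant = constant, na.rm = TRUE)
}

# Hypothetical normalised values for five samples of one cell line.
samples <- c(0.9, 1.0, 1.1, 0.2, 1.05)
z <- robust_z(samples)
```

In this example the fourth sample sits eight MADs below the median, so it would stand out as a strong hit.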

To the median of the negative control wells

The normalised values can also be Robust Z-Scored to the median of the negative controls (on a per-plate basis) in order to assess the strength and robustness of each sample, while taking into account the amount of variation in the negative control. This method can be used to calculate Z-Scores in smaller viability datasets. It is also often used to centre and scale high content imaging data prior to morphological profiling. This ensures that values reflect the change in a particular sample compared to the negative control, while also converting all morphological features to the same scale, which prevents features with large fold changes (e.g. area) from obscuring features with small fold changes (e.g. eccentricity).

Formula:
Robust Z-Score = (normalised sample value - negative control median) / negative control median absolute deviation
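The per-plate version can be sketched by splitting the data by plate and scoring each plate against its own negative controls. The column names (`plate`, `type`, `value`), the data and the unscaled MAD (`constant = 1`) are assumptions made for illustration.

```r
# Hypothetical well-level data: two plates, each with controls and samples.
wells <- data.frame(
  plate = rep(c("P1", "P2"), each = 6),
  type  = rep(c("neg", "neg", "neg", "sample", "sample", "sample"), times = 2),
  value = c(1.00, 0.95, 1.05, 0.50, 0.90, 1.10,
            1.00, 0.90, 1.10, 0.40, 1.00, 1.20)
)

# Score every well on a plate against that plate's negative controls.
robust_z_to_neg <- function(df) {
  neg <- df$value[df$type == "neg"]
  df$robust_z <- (df$value - median(neg)) / mad(neg, constant = 1)
  df
}

scored <- do.call(rbind, lapply(split(wells, wells$plate), robust_z_to_neg))
```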

B-Score

Normalisation of raw data removes systematic plate-to-plate variation, making measurements comparable across plates. However, fold change and Robust Z-Score calculations assume a random error distribution that is common to all measurements within a single plate. Even with automation, systematic errors can sometimes occur within a plate, decreasing the validity of results by either over- or underestimating true values. These biases can affect all measurements equally or can depend on factors such as well location, liquid dispensing and signal intensity. The B-Score is a robust analog of the Z-Score, which is resistant to statistical outliers and can be used to minimise measurement bias due to positional effects by removing row, column, or well-level effects by iterative application of the Tukey median polish algorithm. Importantly, the B-Score assumes a low hit rate and random plate layout, so it is not usually suitable for compound screens in which drug dilution curves are present. See Malo et al. 2006 and Mpindi et al. 2015 for more information.
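One way to sketch the B-Score in base R is with `stats::medpolish()`, which implements Tukey's median polish: the residuals left after removing row and column effects are divided by their median absolute deviation. The simulated 8 x 12 plate, the function name and the use of `mad()` with its default scaling constant are assumptions for illustration only.

```r
# B-Score sketch: median-polish residuals scaled by their MAD.
b_score <- function(plate) {
  fit <- medpolish(plate, trace.iter = FALSE, na.rm = TRUE)
  fit$residuals / mad(fit$residuals, na.rm = TRUE)
}

# Simulated 96-well plate with additive row and column gradients plus noise.
set.seed(1)
row_effect <- seq(0, 7)            # 8 rows
col_effect <- seq(0, 22, by = 2)   # 12 columns
plate <- 100 + outer(row_effect, col_effect, "+") + matrix(rnorm(96), nrow = 8)
plate[3, 7] <- plate[3, 7] - 50    # one artificial hit

scores <- b_score(plate)
which(abs(scores) == max(abs(scores)), arr.ind = TRUE)
```

Because the gradients are absorbed by the row and column terms, the artificial hit is the well that stands out after scoring.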

Growth rate

Relative cell count at the end of treatment is commonly used to assess drug response. Across a range of concentrations, the measured cell count or a surrogate of it (e.g. measurement of ATP using CellTiter-Glo®) is normalized to the cell count of vehicle controls grown on the same plate under the same conditions. For each condition (combination of cell line, drug, and drug concentration), relative cell count is defined as x(c)/xctrl, where x(c) is the count in the presence of drug and xctrl is the 50%-trimmed mean of the count for control cells. Based on a sigmoidal curve fitted to the relative cell count across different concentrations, one can define:
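The normalisation step can be sketched as below. The 50%-trimmed mean is interpreted here as discarding 25% of the control values from each tail (`trim = 0.25`); that reading, the function name and the counts are assumptions.

```r
# Relative cell count x(c) / x_ctrl, with x_ctrl a 50%-trimmed mean.
relative_count <- function(treated, control) {
  treated / mean(control, trim = 0.25)
}

# Hypothetical control wells with one outlier, which the trimming ignores.
control <- c(800, 1000, 1000, 1000, 5000)
relative_count(c(500, 250), control)
```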

IC₅₀ (50% relative cell count), which is the most commonly used metric but one that can only be defined if the relative cell count decreases during the assay to below 50% of the starting value;
EC₅₀ (half-point of the sigmoidal curve);
E_inf (the asymptotic efficacy), which differs from E_max (the maximal measured efficacy);
AUC (area under the curve), which captures both IC₅₀ and E_max to some extent and is more robust to experimental noise than other metrics; and
h (Hill coefficient of the sigmoidal curve), which indicates how steep the dose-response curve is.

In addition, the GRmetrics package reports the r-squared of the fit and evaluates the significance of the sigmoidal fit based on an F-test.

Further information regarding dose-response assays and traditional metrics can be found in Sebaugh (2010).
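To make the relationship between the traditional metrics concrete, the sketch below uses a common three-parameter sigmoidal form that starts at 1 and falls to E_inf. This parameterisation and the function names are assumptions for illustration; they are not necessarily the exact model fitted by the analysis.

```r
# Relative cell count as a function of concentration.
hill_curve <- function(conc, e_inf, ec50, h) {
  e_inf + (1 - e_inf) / (1 + (conc / ec50)^h)
}

# Concentration at which the fitted curve crosses 0.5.
# Returns NA when the curve never reaches 0.5 (E_inf >= 0.5).
ic50_from_fit <- function(e_inf, ec50, h) {
  if (e_inf >= 0.5) return(NA_real_)
  ec50 * ((1 - e_inf) / (0.5 - e_inf) - 1)^(1 / h)
}

ic50 <- ic50_from_fit(e_inf = 0.2, ec50 = 1, h = 2)
hill_curve(ic50, e_inf = 0.2, ec50 = 1, h = 2)
```

When E_inf is 0 the IC₅₀ and EC₅₀ coincide; as E_inf rises towards 0.5 the IC₅₀ moves to higher concentrations and eventually becomes undefined.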

htmltools::img(src = knitr::image_uri(file.path("docs/GR_diagram1.png")), 
               style = 'padding-left:15px;height:340px;width:700px')

Examples of dose-response curves based on relative cell count. In the first example (left), a strong drug response results in an E_inf close to 0 and a well-defined IC₅₀. In the second (right), a partial response results in an E_inf above 0.5 and an undefined IC₅₀.

The following new metrics are also calculated:

GR₅₀: The concentration at which the effect reaches a growth rate (GR) value of 0.5 based on interpolation of the fitted curve.
GR_max: The effect at the highest tested concentration.
GR_inf: The effect of the drug at infinite concentration (GR_inf = GR(c→∞)). GR_inf lies between -1 and 1. Negative values correspond to cytotoxic responses (i.e., induction of cell death), and a value of 0 corresponds to a fully cytostatic response.
GEC₅₀: The drug concentration at half-maximal effect, which reflects the potency of the drug.
h_GR: Hill coefficient of the sigmoidal curve, which indicates how steep the dose-response curve is.
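For orientation, the GR value itself compares the growth rate under treatment with the growth rate of the control. The sketch below uses the standard definition GR(c) = 2^(log2(x(c)/x0) / log2(x_ctrl/x0)) - 1, where x0 (the cell count at the time of treatment) is a symbol not used in the text above. The curve form behind `gr50_from_fit` and all function names are assumptions for illustration.

```r
# GR value from cell counts: treated, control and at the time of treatment.
gr_value <- function(x_treated, x_ctrl, x0) {
  2^(log2(x_treated / x0) / log2(x_ctrl / x0)) - 1
}

# GR50 from a fitted curve GR(c) = GR_inf + (1 - GR_inf) / (1 + (c / GEC50)^h_GR).
# Set to Inf when the curve never reaches 0.5 (GR_inf >= 0.5).
gr50_from_fit <- function(gr_inf, gec50, h_gr) {
  if (gr_inf >= 0.5) return(Inf)
  gec50 * ((1 - gr_inf) / (0.5 - gr_inf) - 1)^(1 / h_gr)
}

gr_value(1000, 1000, 250)  # no effect
gr_value(250, 1000, 250)   # fully cytostatic
gr_value(125, 500, 250)    # cytotoxic
```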

Another common metric for quantifying dose response is the area under the response curve (AUC), which is based on integrating the dose-response curve over the range of tested concentrations. In the case of GR curves, which can have negative values, it is more intuitive to use the area over the curve.

GR_AOC has the benefit that, in the case of no response, it has a value of 0. It is important to note that GR_AOC values (like conventional AUC) can only be used to compare responses evaluated across the same drug concentration range. The GR_AOC value captures variation in potency and efficacy at the same time. The calculation of GR_AOC at discrete (experimentally determined) concentrations has the advantage that it does not require curve fitting and is therefore free of fitting artifacts. This is especially useful for assays where fewer than five concentrations are measured and curve fitting is unreliable. GR_AOC values are also more robust to experimental noise than metrics derived from curve fitting; e.g. GR_max is particularly sensitive to outlier values when directly obtained from data.
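Because GR_AOC is computed directly from the measured points, it can be sketched with a trapezoidal rule. The version below integrates 1 - GR over log10 concentration and divides by the concentration range; that normalisation, the function name and the example values are assumptions and may differ from the exact convention used by the GRmetrics package.

```r
# Area over the GR curve from discrete measurements (trapezoidal rule).
gr_aoc <- function(conc, gr) {
  ord     <- order(conc)
  log_c   <- log10(conc[ord])
  gr      <- gr[ord]
  widths  <- diff(log_c)
  heights <- 1 - (head(gr, -1) + tail(gr, -1)) / 2
  sum(widths * heights) / (max(log_c) - min(log_c))
}

conc <- c(0.01, 0.1, 1, 10)
gr_aoc(conc, gr = c(1, 1, 1, 1))       # no response
gr_aoc(conc, gr = c(1, 0.8, 0.2, -0.2))
```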

htmltools::img(src = knitr::image_uri(file.path("docs/GR_diagram2.png")), 
               style = 'padding-left:15px;height:700px;width:800px')

Examples of dose-response curves and fits. The upper panels depict strong responses to drugs for which all sensitivity parameters can be defined. In contrast, in the case shown in the lower left panel, GR_inf is above 0.5, so GR₅₀ cannot be defined (and thus is set to ∞). In the case shown in the lower right panel, the response is weak and noisy, so the sigmoidal fit is not significant, and a straight flat line is fitted. In this case, only GR_AOC and GR_inf can be defined.

In addition, the GRmetrics package reports the r-squared of the fit and evaluates the significance of the sigmoidal fit based on an F-test.

Synergy

To assess whether two drugs act synergistically in inducing cell death, we apply four popular classes of reference models:

Highest single agent (HSA) model. Reported by Berenbaum, 1989,
Loewe additivity model. Reported by Loewe, 1953,
Bliss independence model. Reported by Bliss, 1939, and
Zero interaction potency (ZIP) model. Reported by Yadav et al., 2015.

These reference models, together with many of their subsequent variants and extensions, have been developed based on different assumptions about the expected effect of non-interaction.
The HSA model, or Gaddum’s non-interaction model, assumes that the expected combination effect equals the higher of the individual drug effects at the doses used in the combination, representing the idea that a synergistic drug combination should produce additional benefits on top of what its components can achieve alone. In many preclinical drug combination studies, however, even a drug combined with itself can easily produce an excess over HSA.
For more stringent synergy classification, the Loewe additivity and Bliss independence models are widely used. The Loewe additivity model defines the expected effect as if a drug was combined with itself, while the Bliss independence model utilizes probabilistic theory to model the effects of individual drugs in a combination as independent yet competing events.
ZIP captures the drug interaction relationships by comparing the change in the potency of the dose-response curves between individual drugs and their combinations.
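The two simplest reference models can be written down directly. In the sketch below, effects are expressed as fractional inhibition between 0 and 1, and the synergy score is taken as the observed effect minus the expected effect; the function names and numbers are illustrative and this is not the synergyfinder implementation. Loewe and ZIP require fitted dose-response curves and are not shown.

```r
# Expected combination effect under each reference model.
expected_hsa   <- function(e_a, e_b) pmax(e_a, e_b)         # highest single agent
expected_bliss <- function(e_a, e_b) e_a + e_b - e_a * e_b  # independent events

# Hypothetical fractional inhibition for drug A, drug B and the combination.
e_a <- 0.3
e_b <- 0.4
observed <- 0.7

excess_hsa   <- observed - expected_hsa(e_a, e_b)
excess_bliss <- observed - expected_bliss(e_a, e_b)
```

The same combination scores higher against HSA than against Bliss, which illustrates why HSA is the less stringent reference.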

Due to the inherent differences in the model assumptions, there is a lack of consensus on which reference model to use in an unbiased and statistically robust manner. As pointed out by many others, there is still no standardized guideline on how to choose the optimal reference model.

The visualization of the synergy scores is conducted as a two-dimensional and a three-dimensional interaction surface over the dose matrix.
(R package: synergyfinder, DOI: 10.18129/B9.bioc.synergyfinder)

Negative synergy scores indicate Antagonism: The combination of two drugs is less efficient at reducing cell numbers than the sum of the treatments individually.
Synergy scores around zero indicate Additivity: The combination of two drugs acts like the sum of the treatments individually.
Positive synergy scores indicate Synergism: The combination of two drugs works better in killing the cells than the sum of the treatments individually.

Cell cycle

The following cell cycle analysis is based on the method used by the Cellomics Cell Cycle BioApplication, and the following information is taken from the BioApplication user manual.

Method

The Cell Cycle BioApplication takes an approach similar to flow cytometry in assessing cell cycle phase. By labelling a cell’s DNA with a fluorescent dye having an intensity proportional to the cell’s DNA content, the total fluorescence intensity of the dye from the nucleus of each cell typically exhibits a bimodal distribution. The first peak typically contains cells with 2N DNA content (that is, in the G0/G1 phase). The second peak is at an intensity which is double the first peak and contains cells with 4N DNA content (that is, in the G2/M phase). Under normal conditions, there are more cells in the G0/G1 versus the G2/M phase, resulting in the first peak being higher than the second. The cells that exist in between the two major peaks are currently doubling their DNA and are in S phase. The cells that exist below the G0/G1 lower intensity threshold are usually undergoing cell death. The cells that exist above the G2/M upper intensity threshold may be either polyploid cells or may just be unable to be resolved as single cells by the counting algorithm (i.e., segmentation error).

The BioApplication classifies each cell’s total nuclear intensity into one of five categories of DNA content, as shown in the above figure. The five categories are listed below:
1. DNA < 2N
2. DNA ~2N
3. 2N < DNA < 4N
4. DNA ~4N
5. DNA > 4N

Cells categorized as having DNA ~2N, 2N < DNA < 4N and DNA ~4N are assigned the cell cycle phases G0/G1, S and G2/M respectively. The other two DNA content categories are not assigned to specific cell cycle phases, but are classified as either below G0/G1 or above G2/M.

Cell cycle phase thresholds

The lower and upper intensity thresholds for each phase of the cell cycle were determined as follows. First, a density plot of total DAPI intensity was generated for each negative control cell population (all of the AAVS1 wells on a per-plate basis). The maximum peak was used to approximately identify the G0/G1 peak. This value was then doubled in order to approximately identify the G2/M peak (these values are indicated with pink dotted lines on the plot). The lower and upper limits of each phase of the cell cycle (indicated by black solid lines) were then calculated by first determining the length of the intensity range between the G0/G1 peak and G2/M peak and then defining the peak width fraction. If a value of 0.25 is used, the half-width of each DNA content distribution is set at 25% of the range between the two peaks. Increasing this value will widen the G0/G1 and G2/M phases.
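The thresholding steps described above can be sketched as follows. The function names, the simulated intensities and the use of `density()` with its default bandwidth are assumptions; the peak width fraction must stay below 0.5 so that the G0/G1 and G2/M windows do not overlap.

```r
# Derive phase thresholds from the negative-control intensity distribution.
cell_cycle_thresholds <- function(intensity, width_fraction = 0.25) {
  stopifnot(width_fraction > 0, width_fraction < 0.5)
  d          <- density(intensity)
  g1_peak    <- d$x[which.max(d$y)]  # highest peak approximates G0/G1
  g2m_peak   <- 2 * g1_peak          # doubled intensity approximates G2/M
  half_width <- width_fraction * (g2m_peak - g1_peak)
  c(g1_lower  = g1_peak - half_width,  g1_upper  = g1_peak + half_width,
    g2m_lower = g2m_peak - half_width, g2m_upper = g2m_peak + half_width)
}

# Assign each cell to one of the five DNA content categories.
classify_phase <- function(intensity, thresholds) {
  cut(intensity,
      breaks = c(-Inf, thresholds, Inf),
      labels = c("<2N", "G0/G1", "S", "G2/M", ">4N"))
}

# Simulated total DAPI intensities for a negative-control population.
set.seed(1)
dapi <- c(rnorm(700, 100, 5), rnorm(100, 150, 15), rnorm(200, 200, 8))
th   <- cell_cycle_thresholds(dapi)
table(classify_phase(dapi, th))
```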

Cell morphology

Image quality control

The powerloglog slope distribution for each well was plotted for the DAPI channel and poor quality (out of focus) images were removed. The threshold for removing out of focus images = 2.5 * IQR. Empty images (cell count = 0) were also removed. This strategy is based on the image quality control workflow described by Carpenter et al. 2012.

mp.value and M-dist

The mp.value (significance) and Mahalanobis distance (M-dist) were calculated for each sample. M-dist measures morphological changes in each well compared to the negative control wells, taking into account all features (including cell count). The higher the M-dist, the larger the change compared to the negative control. The M-dist and mp.value calculations were performed on the raw data using all features (after data cleaning but prior to feature reduction). See Hutz et al. 2012 for more information.
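As a simplified illustration of the distance itself, base R's `mahalanobis()` can measure how far a sample lies from the negative-control distribution. The simulated data are hypothetical, and the referenced method may include additional steps (and the mp.value calculation) that are not shown here.

```r
# Simulated negative-control wells: 200 wells, 3 features.
set.seed(1)
neg <- matrix(rnorm(200 * 3), ncol = 3)

# Two hypothetical samples: one like the controls, one strongly shifted.
samples <- rbind(unchanged = c(0, 0, 0), shifted = c(3, 3, 3))

# mahalanobis() returns the squared distance, so take the square root.
m_dist <- sqrt(mahalanobis(samples, center = colMeans(neg), cov = cov(neg)))
m_dist
```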

Phenotypic profiles

The median Robust Z-Scores were used to create a phenotypic profile of each sample. A phenotypic profile contains all of the data for a particular sample across all of the morphology features measured. The phenotypic profiles were displayed in a signature plot. Z-Scores of ≤ -2 or ≥ 2 are considered to be significantly different to the negative control.

Feature correlation

A correlation plot was generated to investigate the relationship between each pair of features. Positive correlations are displayed in blue and negative correlations are displayed in red. Both the colour intensity and the size of the circle are proportional to the correlation coefficients (Pearson’s).

Sample correlation

A correlation plot was generated to investigate the relationship between each pair of samples. Positive correlations are displayed in blue and negative correlations are displayed in red. Both the colour intensity and the size of the circle are proportional to the correlation coefficients (Pearson’s).

Hierarchical clustering

Hierarchical clustering was used to organise the samples (objects) into groups of similarity (clusters), using the final set of features (Robust Z-Scored).

Hierarchical clustering is an unsupervised clustering algorithm, which involves creating clusters that have predominant ordering from top to bottom. The algorithm begins by assigning all of the observations to a single cluster and then partitions that cluster into two less similar clusters. This process repeats until each observation belongs to an individual cluster. The results of hierarchical clustering are most often displayed in a dendrogram. It is then up to the user to decide where to ‘cut’ the dendrogram. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

The cophenetic correlation coefficient is used to measure how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points (must be > 0.75). A range of distance metrics and linkage methods were trialled. The cophenetic correlation coefficient was used in combination with the phenotypic profiles and viewing the images to determine the best method to use.
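The cophenetic check can be sketched with base R. Note that `hclust()` builds the tree bottom-up (agglomerative) rather than by repeated splitting, and that the Euclidean distance and average linkage used here are just one of the combinations that could be trialled; the simulated profiles are hypothetical.

```r
# Simulated Robust Z-Scored profiles: two groups of five samples, four features.
set.seed(1)
profiles <- rbind(matrix(rnorm(5 * 4, mean = 0), ncol = 4),
                  matrix(rnorm(5 * 4, mean = 5), ncol = 4))
rownames(profiles) <- paste0("sample", 1:10)

d  <- dist(profiles, method = "euclidean")
hc <- hclust(d, method = "average")

# Correlation between the original distances and the dendrogram distances.
coph_cor <- cor(d, cophenetic(hc))

# Cut the dendrogram into two clusters.
clusters <- cutree(hc, k = 2)
```

Repeating this for several distance metrics and linkage methods and comparing `coph_cor` gives a simple way to rank the candidate trees.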

Estimating “k”

The Elbow Method

The optimal number of clusters was estimated using the Elbow Method, in which the sum of squares at each number of clusters (k) is calculated and graphed. A change in slope from steep to shallow (an elbow) is then used to determine the optimal number of clusters. This method is somewhat inexact, but it does show how increasing the number of clusters contributes to separating the clusters in a meaningful way. The bend indicates the point at which additional clusters add little value. The selected number of clusters is indicated by the pink dashed line.
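A sketch of the elbow calculation for a hierarchical tree is given below: the tree is cut at each candidate k and the total within-cluster sum of squares is recorded. The simulated data, Ward linkage and the helper name `within_ss` are assumptions for illustration.

```r
# Simulated data with three well-separated groups.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
           matrix(rnorm(20, mean = 5),  ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))
hc <- hclust(dist(x), method = "ward.D2")

# Total within-cluster sum of squares for a given cluster assignment.
within_ss <- function(x, clusters) {
  groups <- split(as.data.frame(x), clusters)
  sum(sapply(groups, function(g) sum(scale(g, center = TRUE, scale = FALSE)^2)))
}

wss <- sapply(1:6, function(k) within_ss(x, cutree(hc, k = k)))
plot(1:6, wss, type = "b", xlab = "Number of clusters (k)",
     ylab = "Within-cluster sum of squares")
```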

Dendrogram

A dendrogram is a diagram that shows the hierarchical relationship between objects, and is the most common way to display the results of hierarchical clustering. The horizontal axis of the dendrogram represents the distance, or dissimilarity, between clusters and the vertical axis represents the objects (samples). The lower the distance (further to the right on the x-axis) at which two objects branch apart, the more similar their phenotypic profiles are. The dendrogram was plotted and used to allocate the samples into clusters of similar morphology.

PCA

Principal Component Analysis (PCA) was performed as an additional method of data exploration and for comparison to hierarchical clustering. PCA was performed on the normalised data (data is scaled in the process of PCA, so prior Z-Scoring is not necessary). Unreliable features, e.g. features with low correlation between replicates, low variance across samples and/or containing Inf, NA, or NaN values, were removed prior to PCA. Highly correlated features were not removed prior to PCA (PCA is well suited to datasets in which many features are highly correlated).

PCA is a linear dimensionality reduction method that is often used to simplify large datasets, by transforming a large set of variables into a smaller one that still contains most of the information present in the original dataset. When many variables are present, you cannot easily plot the data in its raw format, making it difficult to get a sense of the trends present within. PCA allows you to see the overall “shape” of the data, making it easier to identify which samples are similar to one another and which are very different and which variables (morphological features) make one sample different from another. As such, it is a very useful way of exploring data.

Principal components (PCs) are new variables that are constructed as linear combinations of the initial variables. These combinations are done in such a way that the new variables (i.e. PCs) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first few PCs. So, 10 PCs might contain almost all of the information contained in the original dataset, but PCA tries to put the maximum possible amount of information in the first PC, then the maximum possible amount of remaining information in the second PC and so on. Organising data this way allows you to reduce dimensionality without losing much information, by discarding the PCs with the least amount of information and considering the remaining PCs your new variables.

The PCs are less interpretable and don’t have any real meaning since they are constructed as linear combinations of the initial variables. If many variables correlate with one another, they will all contribute strongly to the same PC. Each PC sums up a certain percentage of the total variation (and therefore information) in the dataset. Where your initial variables are strongly correlated with one another, you will be able to approximate most of the complexity in your dataset with just a few PCs. As you add more PCs, you summarise more and more of the original dataset, but you also make the data more unwieldy. More information about PCA can be found in this review article.

Scree plot

Scree plots are used to determine which PCs capture the largest amount of variance, or the most information, in the data. A scree plot can, therefore, be used to select the PCs to keep. An ideal curve should be steep and then bend at an “elbow” (the cut-off point) before flattening out. The selected PCs should be able to describe at least 80% of the variance. A scree plot can also be used as a diagnostic tool to determine if PCA will work well on your data - if too many PCs (more than 3) are required, PCA may not be the best way to visualise your dataset.
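The quantities behind a scree plot can be obtained from `prcomp()`. The sketch below uses simulated features (three driven by one latent variable, one independent) and picks the smallest number of PCs whose cumulative variance reaches 80%; the data and variable names are hypothetical.

```r
# Simulated feature matrix: three correlated features and one independent one.
set.seed(1)
latent   <- rnorm(50)
features <- cbind(f1 = latent + rnorm(50, sd = 0.1),
                  f2 = latent + rnorm(50, sd = 0.1),
                  f3 = -latent + rnorm(50, sd = 0.1),
                  f4 = rnorm(50))

# Centre and scale each feature as part of the PCA.
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC, and the number needed for 80%.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
n_keep <- which(cumsum(var_explained) >= 0.80)[1]

plot(var_explained, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained")
```

Because three of the four features move together, most of the variance is concentrated in the first component, which is the situation in which PCA compresses the data well.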

t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE) was performed as a final method of data exploration and for comparison to hierarchical clustering. t-SNE was performed on the Robust Z-Scored data using the same set of features that were used for hierarchical clustering.

Like PCA, t-SNE gives you a feel for how the data is arranged in a high-dimensional space by projecting it into a low-dimensional space, aiming to preserve as much of the significant structure of the high-dimensional data as possible. However, while PCA is a linear technique that focuses on keeping the low-dimensional representations of dissimilar datapoints far apart, t-SNE is a non-linear technique that focuses more on keeping the low-dimensional representations of very similar datapoints close together. t-SNE preserves only small pairwise distances or local similarities, whereas PCA is concerned with preserving large pairwise distances to maximise variance.

The t-SNE algorithm starts by calculating the probability of similarity of points (samples) in high-dimensional space and then calculating the probability of similarity of points in the corresponding low-dimensional space. The similarity of points is calculated as the conditional probability that a point A would choose point B as its neighbour, if neighbours were picked in proportion to their probability density under a Gaussian (normal) distribution centered at A. It then tries to minimise the difference between these conditional probabilities (or similarities) in higher-dimensional and lower-dimensional space for a perfect representation of data points in lower-dimensional space.
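The high-dimensional similarities described above can be sketched in base R. This is only the first step of the algorithm and uses a single fixed Gaussian width `sigma` for every point, which is a simplifying assumption: t-SNE itself tunes a separate width per point from the perplexity setting and uses a Student t-distribution in the low-dimensional space.

```r
# Conditional probability that point i picks point j as its neighbour,
# using a Gaussian centred on i. Row i of the result holds p(j | i).
conditional_similarity <- function(x, sigma = 1) {
  d2 <- as.matrix(dist(x))^2
  w  <- exp(-d2 / (2 * sigma^2))
  diag(w) <- 0              # a point is never its own neighbour
  w / rowSums(w)
}

# Three hypothetical samples: the first two are close, the third is far away.
x <- rbind(c(0, 0), c(1, 0), c(5, 0))
p <- conditional_similarity(x)
round(p, 4)
```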

In this way, t-SNE maps the multi-dimensional data to a lower dimensional space and attempts to find patterns in the data by identifying observed clusters based on similarity of data points with multiple features. However, after this process, the input features are no longer identifiable, and you cannot make any inference based only on the output of t-SNE. Hence, it is mainly a data exploration and visualisation technique, rather than a clustering technique. More detailed information about t-SNE can be found in this research paper.

UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.

The biggest difference between the output of UMAP and that of t-SNE is the balance between local and global structure: UMAP is often better at preserving global structure in the final projection. This means that the inter-cluster relations are potentially more meaningful than in t-SNE. However, it’s important to note that, because UMAP and t-SNE both necessarily warp the high-dimensional shape of the data when projecting to lower dimensions, any given axis or distance in lower dimensions still isn’t directly interpretable in the way of techniques such as PCA. Its increased speed, better preservation of global structure, good scaling with increasing data dimensions and more understandable parameters make UMAP a more effective tool for visualizing high-dimensional data.

More detailed information about UMAP can be found in this research paper.

Random forest model

Random forest is a widely used machine learning algorithm that combines the output of multiple decision trees to reach a single result. It is commonly applied to classification and regression problems.

Decision trees

In brief, decision trees start with an overarching question, such as, “Is this cell dead?” Next, one can ask a series of questions to determine an answer, such as, “Is the cell nucleus condensed?” or “Is the cell membrane ruptured?”. These questions make up the decision nodes in the tree, acting as a means to split the data in order to reach the final answer, “Yes” or “No”. Metrics such as Gini impurity and mean square error (MSE) can be used to evaluate the quality of the split.
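Gini impurity is simple enough to compute by hand, which makes the idea of split quality concrete. In the sketch below the function names and the toy labels are hypothetical; a split is scored by the size-weighted impurity of the two child nodes, with lower values indicating a cleaner split.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions.
gini_impurity <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two child nodes produced by a yes/no question.
split_gini <- function(labels, goes_left) {
  n     <- length(labels)
  left  <- labels[goes_left]
  right <- labels[!goes_left]
  length(left) / n * gini_impurity(left) +
    length(right) / n * gini_impurity(right)
}

# Toy example: does "nucleus condensed" separate dead from live cells?
status            <- c("dead", "dead", "dead", "live", "live", "live")
nucleus_condensed <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)

gini_impurity(status)
split_gini(status, nucleus_condensed)
```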

Ensemble methods

While decision trees are popular supervised learning algorithms, they can be susceptible to issues like bias and overfitting. However, in the random forest algorithm, when multiple decision trees come together as an ensemble, they produce more precise results, especially when the individual trees are uncorrelated with each other.

One of the most renowned ensemble methods is bagging, also known as bootstrap aggregation. This technique involves selecting a random sample of data from a training set with replacement, allowing individual data points to be chosen multiple times. After generating several data samples, independent models are trained. Depending on the task type—whether regression or classification—the average or majority of these predictions provides a more accurate estimate. This approach is commonly employed to reduce variance in a noisy dataset.
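Bagging for a regression task can be sketched in base R as follows. To keep the example free of extra packages, a simple linear model (`lm`) stands in for the decision tree; a random forest would fit a tree to each bootstrap sample instead. The data and the function name `bagged_predict` are hypothetical.

```r
# Simulated training data with a noisy linear relationship.
set.seed(1)
train   <- data.frame(x = 1:30)
train$y <- 2 * train$x + rnorm(30, sd = 3)

# Fit one model per bootstrap sample and average the predictions.
bagged_predict <- function(train, newdata, n_models = 50) {
  preds <- replicate(n_models, {
    boot <- train[sample(nrow(train), replace = TRUE), ]  # sample with replacement
    predict(lm(y ~ x, data = boot), newdata = newdata)
  })
  # One row per new observation, one column per model.
  rowMeans(matrix(preds, nrow = nrow(newdata)))
}

bagged <- bagged_predict(train, data.frame(x = c(10, 20)))
```

For a classification task the averaging step would be replaced by a majority vote across the models.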

Random forest algorithm

The random forest algorithm builds upon the bagging technique by incorporating both bagging and feature randomness. This combination creates an ensemble of decision trees that are uncorrelated. Feature randomness, also referred to as feature bagging, involves generating a random subset of features. This ensures a low level of correlation among the decision trees. This distinction is crucial when comparing decision trees to random forests. Unlike decision trees, which consider all possible feature splits, random forests only utilize a subset of these features. By accounting for the full range of potential variability in the data, we can mitigate the risks of overfitting, bias, and overall variance, resulting in more accurate predictions.

More detailed information about Random forest can be found in this intuition.

Victorian Centre for Functional Genomics

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Rocky Linux 9.5 (Blue Onyx)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

time zone: Australia/Melbourne
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       httr_1.4.7        cli_3.6.4         knitr_1.49       
 [5] rlang_1.1.5       xfun_0.51         stringi_1.8.4     processx_3.8.6   
 [9] promises_1.3.2    jsonlite_1.9.1    glue_1.8.0        rprojroot_2.0.4  
[13] git2r_0.35.0      htmltools_0.5.8.1 httpuv_1.6.15     ps_1.9.0         
[17] sass_0.4.9        rmarkdown_2.29    jquerylib_0.1.4   tibble_3.2.1     
[21] evaluate_1.0.3    fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
[25] whisker_0.4.1     stringr_1.5.1     compiler_4.4.2    fs_1.6.5         
[29] pkgconfig_2.0.3   Rcpp_1.0.14       rstudioapi_0.17.1 later_1.4.1      
[33] digest_0.6.37     R6_2.6.1          pillar_1.10.1     callr_3.7.6      
[37] magrittr_2.0.3    bslib_0.9.0       tools_4.4.2       mime_0.12        
[41] cachem_1.1.0      getPass_0.2-4

VCFG Analysis Methods