Science News

Enabling sensitive and precise detection of ctDNA through somatic copy number aberrations in breast cancer

March 9, 2025

Selection of SNPs to include in the panel

Two sets of high gMAF SNPs were included in the panel: a first set, comprising loci distributed uniformly along the human genome to ensure the detection of SCNAs involving extensive genomic regions, and a second set, consisting of SNPs located across genes frequently and focally altered by SCNAs.

For the first set, to ensure greater experimental efficiency, priority in selection was assigned to SNPs included in the Infinium Omni2.5-8 Kit microarray and characterized by a MAF ≥ 0.45, as reported in the dbSNP version 151 catalog⁴³ considering all populations included in the 1000 Genomes Project⁴⁴. More specifically, for each chromosomal band, SNPs were selected to ensure a total density of 5 SNPs/Mb, a minimum of 20 loci for the least populated chromosomal arm (chr21p), and a spacing of at least 200 bases between two consecutive SNPs.
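The spacing and density constraints above can be illustrated with a minimal sketch. It assumes a simple greedy left-to-right pass over the candidate SNPs of one chromosomal band; the function names and the greedy strategy are illustrative assumptions, not the procedure actually used to build the panel.

```python
from typing import List, Tuple


def select_spaced_snps(
    candidates: List[Tuple[int, float]],
    min_maf: float = 0.45,
    min_spacing: int = 200,
) -> List[int]:
    """Greedy selection of SNP positions within one chromosomal band.

    candidates: (position, maf) pairs. A SNP is kept if its MAF passes the
    threshold and it lies at least `min_spacing` bases after the last kept SNP.
    The greedy pass is an assumption made for illustration.
    """
    kept: List[int] = []
    for pos, maf in sorted(candidates):
        if maf < min_maf:
            continue
        if not kept or pos - kept[-1] >= min_spacing:
            kept.append(pos)
    return kept


def snps_needed(band_length_bp: int, density_per_mb: float = 5.0) -> int:
    """Number of SNPs required to reach the target density in a band."""
    return round(band_length_bp / 1_000_000 * density_per_mb)
```

In practice the selected list would be truncated or extended per band to match `snps_needed`, subject to the minimum of 20 loci on the least populated arm.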

For the second set, SNPs with gMAF ≥0.35 were selected in genomic windows of approximately 500 kbp both upstream and downstream of 5 selected genes (CCND1, ERBB2, PTEN, TP53, ESR1) that are frequently affected by focal SCNAs.

In total, 17,012 SNP loci were selected (Supplementary Data 2).

Selection of genes to include in the panel for SNV analysis

Publicly available data from the cBioPortal for Cancer Genomics^45,46 and the Integrative Onco Genomics (IntOgen) database⁴⁷ were utilized to define a list of genes of interest in the clinical context of breast carcinoma. Specifically, mutational data from the METABRIC studies^28,29,30, TCGA PanCancer Atlas (www.cancer.gov/tcga), INSERM³¹, MSK-IMPACT³², and The Metastatic Breast Cancer Project (www.mbcproject.org) were analyzed to associate a mutation frequency with each gene. A gene was considered mutated if at least one of its coding regions carried at least one somatic non-synonymous point mutation.

To ensure the selection is as accurate and informative as possible, the data and their mutational frequencies were stratified based on the molecular subtype of the tumor (HER2+, HR+, Triple Negative) and its stage of evolution (primary or metastatic). This stratification allowed the selection of the most frequently mutated genes (N = 25) for each of the 6 classes. The unique set of genes identified from this initial selection (N = 68) was validated and further extended by analyzing tumor-specific aberration frequencies reported in the IntOgen database. The combined results of these two analyses identified 81 genes of interest, whose coding regions (exons, excluding UTRs) were selected for sequencing. The final selection included 2149 genomic regions, covering the exons of 81 genes (Supplementary Data 1 and Supplementary Data 2).
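The stratified selection can be sketched as taking the top-N genes per class and merging them into a unique set. The data layout (a dictionary of per-class mutation frequencies), the class labels, and the alphabetical tie-breaking are assumptions made for illustration only.

```python
from typing import Dict, Set


def top_genes_union(
    freq_by_class: Dict[str, Dict[str, float]], n: int = 25
) -> Set[str]:
    """Union of the n most frequently mutated genes of each class.

    freq_by_class maps a class label (subtype x stage) to a dictionary
    gene -> mutation frequency. Ties are broken alphabetically (assumption).
    """
    selected: Set[str] = set()
    for freqs in freq_by_class.values():
        ranked = sorted(freqs, key=lambda gene: (-freqs[gene], gene))
        selected.update(ranked[:n])
    return selected


# Toy frequencies, not real data.
toy = {
    "HR+_primary": {"PIK3CA": 0.40, "TP53": 0.20, "GATA3": 0.10},
    "TN_metastatic": {"TP53": 0.80, "BRCA1": 0.10, "PIK3CA": 0.05},
}
toy_selection = top_genes_union(toy, n=2)
```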

Panel design

All selected SNPs and coding regions of genes of interest were included in a single BED file. This file served as input for the Illumina HyperDesign software (hyperdesign.com), with the stringency parameter (a value defining the specificity of a probe for the target region; min = 1, max = 20, optimal interval for unique probes from 1 to 4) and the overhang parameter (a value defining the number of bases by which the probe deviates from the target region; min = 0, max = 120, optimal interval from 0 to 30) set to 4 and 30, respectively. The output produced by HyperDesign defined the optimal genomic coordinates for 17,040 sequencing probes, covering a total of approximately 2.3 Mbp and including 14,829 SNPs (around 90% of the initially selected SNPs) and 2075 exonic regions (about 95% of the initially selected ones).

Samples and patients’ characteristics

Plasma samples (n = 44) were collected from 15 patients with metastatic breast cancer. Clinical and pathological characteristics for each of the patients are reported in Supplementary Data 3 and Supplementary Data 5. In particular, samples were drawn at baseline (T0, n = 15), after one cycle of treatment (T1, n = 15) and at disease progression (T2, n = 14). Plasma samples were prepared within an hour from blood collection and stored at −80 °C until cfDNA extraction. Normal K2EDTA plasma samples from 20 healthy female donors were purchased from Precision for Medicine (Precision for Medicine, MA).

cfDNA extraction, library preparation and sequencing

Plasma cfDNA was isolated with the QIAamp Circulating Nucleic Acid Kit (Qiagen) and quantified by Qubit (Qubit dsDNA High Sensitivity Kit, Life Technologies). Libraries were prepared starting from 20 ng of DNA using the KAPA HyperPrep Kit and KAPA Universal UMI adapter (Roche), according to the manufacturer’s instructions. Library size distribution and concentration were analyzed by Bioanalyzer DNA High Sensitivity Kit (Agilent) and Qubit (Qubit dsDNA High Sensitivity Kit, Life Technologies), respectively, before target enrichment with the 17,040 customized KAPA HyperCap Target Enrichment Probes (Roche). For the enrichment step, four samples were combined using 500 ng of each amplified indexed library. The enriched DNA samples were amplified according to the manufacturer’s instructions with 16 cycles of Post-Capture PCR. The amplified enriched libraries were quantified by Qubit dsDNA High Sensitivity Kit and their size distribution was evaluated by Bioanalyzer DNA High Sensitivity Kit. Sequencing was performed on an Illumina NovaSeq 6000, generating 150 bp paired-end reads.

Sequencing data pre-processing

Reads were trimmed to remove adapters, and UMI extraction, consensus calling and sorting were performed using UMI_tools (version 1.1.2)⁴⁸, fgbio (version 2.0.2), and GATK (version 4.2)⁴⁹. Alignment to the human GRCh38 reference genome was performed using BWA-MEM. Realignment and recalibration were performed using GATK. MD tags were calculated using samtools calmd (version 1.7)⁵⁰ and overlapping read pairs were clipped using bamUtil (version 1.0.14). PaCBAM was used to generate pileup and depth of coverage statistics files⁵¹. In particular, we used rc files, reporting the average depth of coverage of all captured genomic regions along with their GC content, and snps files, reporting pileup statistics for all considered SNPs, such as total coverage, coverage supporting the reference allele, coverage supporting the alternative allele, and the variant allelic fraction (AF).

Reference mapping bias correction

Reference Mapping Bias (RMB), namely the presence of a main AF peak value different from the expected 0.5, was addressed to ensure proper downstream analysis of SNP AF data and comparison of AF distributions from independent samples, similarly to what was previously described in ref.⁵². The peak correction was applied separately to each cfDNA sample by applying a Kernel Density Estimation to the AF distribution of heterozygous SNPs and extracting peaks as the local maxima of the smoothed distribution²⁷; the central peak was extracted and the data were centered to the theoretical value of 0.5.
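A minimal sketch of this peak-centering step is given below. It assumes a Gaussian kernel with a fixed bandwidth, takes the "central" peak to be the local maximum closest to 0.5, and applies an additive shift; all three choices, as well as the function names, are assumptions made for illustration rather than the settings used in the study.

```python
import math
from typing import List


def gaussian_kde(values: List[float], grid: List[float], bandwidth: float) -> List[float]:
    """Evaluate a Gaussian kernel density estimate on a grid."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


def central_peak(het_afs: List[float], bandwidth: float = 0.02, step: float = 0.005) -> float:
    """Return the local maximum of the smoothed AF density closest to 0.5."""
    n = int(round(1 / step))
    grid = [i * step for i in range(n + 1)]
    dens = gaussian_kde(het_afs, grid, bandwidth)
    peaks = [
        grid[i]
        for i in range(1, len(grid) - 1)
        if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]
    ]
    if not peaks:
        raise ValueError("no peak found in the AF distribution")
    return min(peaks, key=lambda p: abs(p - 0.5))


def correct_rmb(het_afs: List[float], **kde_args) -> List[float]:
    """Shift heterozygous AFs so that the central peak sits at 0.5."""
    shift = 0.5 - central_peak(het_afs, **kde_args)
    return [af + shift for af in het_afs]
```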

Panel of controls

Sequencing data of control samples was used to generate a panel of controls, which characterizes the captured genomic regions of all genes of interest and all SNPs.

First, read depth of coverage of all captured regions across all control samples was normalized for both GC content and sample’s mean coverage. Normalized read depth (RD) was then used to compute, for each region across all control samples, mean and standard deviation values. Then, we performed a leave-one-out procedure in which, for each control sample i and each captured region c, a log2 ratio (log2R) value was calculated using the following formula:

$${\log 2R}_{{ci}}={\log }_{2}\left(\frac{{{cov}}_{{ci}}}{{mean}({{cov}}_{{cn}-i})}\right)$$

(1)

where ${{cov}}_{{ci}}$ is the RD of control sample i in region c and ${mean}({{cov}}_{{cn}-i})$ is the mean RD of region c across all control samples excluding sample i. In this way, a table of log2R values for each captured region across each control sample was obtained. While performing this operation, a mask ratio parameter (set at 0.5) was used. This value determines the minimum fraction of samples that must have finite values for a captured region in order to perform the log2R calculation. If the mask ratio was not reached, that region was not included in the panel of controls. Overall, read depth of coverage statistics were organized in two tables: the first, referred to as the rc table, contains the RD mean and standard deviation for each captured region across all controls; the second, referred to as the log2R table, contains, for each captured region, the log2R values of all controls.
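The leave-one-out computation with the mask ratio can be sketched as follows. The input layout (region name mapped to one normalized RD value per control, with `None` for missing values) and the treatment of non-positive values as non-finite are assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional


def leave_one_out_log2r(
    rd: Dict[str, List[Optional[float]]], mask_ratio: float = 0.5
) -> Dict[str, List[Optional[float]]]:
    """Build a log2R table from normalized read depth of control samples.

    rd maps each captured region to a list of normalized RD values, one per
    control. Regions whose fraction of usable values is below mask_ratio are
    dropped from the table.
    """
    table: Dict[str, List[Optional[float]]] = {}
    for region, values in rd.items():
        # A value is usable if present, finite and positive (log2 is defined).
        usable = [v is not None and math.isfinite(v) and v > 0 for v in values]
        if sum(usable) / len(values) < mask_ratio:
            continue
        row: List[Optional[float]] = []
        for i, v in enumerate(values):
            others = [values[j] for j in range(len(values)) if j != i and usable[j]]
            if not usable[i] or not others:
                row.append(None)
            else:
                row.append(math.log2(v / (sum(others) / len(others))))
        table[region] = row
    return table
```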

Then, for SNP AF data, two main data structures were built: (1) a collection of SNP summary statistics; (2) AF dispersion stratified by changes in local SNP coverage. Briefly, for each control sample, captured SNPs with a heterozygous genotype (0.2 < AF < 0.8) were kept. Summary statistics for each heterozygous SNP across all control samples were computed, including the mean of the AF distribution, its coefficient of variation, the proportion of samples out of N harboring the heterozygous genotype, and the mean coverage. AF dispersion was instead modeled by collecting AF standard deviations stratified by local coverage quantiles Q (min 0%, max 100%, step 10%).
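The per-SNP summary statistics can be sketched as below. The input layout (one `(AF, coverage)` pair per control sample) is hypothetical, and computing the mean coverage over heterozygous controls only, as well as using the sample standard deviation, are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Optional, Tuple


def summarize_snp(
    observations: List[Tuple[float, float]], low: float = 0.2, high: float = 0.8
) -> Optional[Dict[str, float]]:
    """Summary statistics of one SNP across control samples.

    observations: (AF, coverage) of the SNP in each control sample.
    Returns None if no control carries a heterozygous genotype.
    """
    het = [(af, cov) for af, cov in observations if low < af < high]
    if not het:
        return None
    afs = [af for af, _ in het]
    af_mean = mean(afs)
    af_sd = stdev(afs) if len(afs) > 1 else 0.0
    return {
        "af_mean": af_mean,
        "af_cv": af_sd / af_mean,
        "het_fraction": len(het) / len(observations),
        "mean_coverage": mean(cov for _, cov in het),
    }
```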

SCNA identification

For a given cfDNA plasma sample, raw read depth data was processed as mentioned above. Upon calculation of log2R for all captured regions, Circular Binary Segmentation (CBS)⁵³ was performed, using the R package DNAcopy (version 1.68)⁵⁴, to identify putative read depth distribution change points representing copy number variations. The segmentation analysis was performed considering single arms of each chromosome separately. Focal gene regions, enriched for high MAF SNPs in our panel, were smoothed to prevent over-segmentation; in other words, segmentation was not performed on these genes, and their SNP-enriched regions were considered in their entirety to obtain a stronger signal.

Then, for each identified segment, a second run of CBS was performed, this time on mirrored AF values (mAF), calculated as:

$${mAF}=\left\{\begin{array}{ll}1-{{SNP}}_{{af}}, & {{SNP}}_{{af}} \,>\, 0.5\\ {{SNP}}_{{af}}, & \text{otherwise}\end{array}\right.$$

(2)

mAF values were used to better identify possible SCNA events not detectable by the log2R signal. As in the read depth segmentation, focal regions were analyzed in their entirety without segmentation.
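Eq. 2 translates directly into code. The short sketch below (function name chosen for illustration) folds allelic fractions around 0.5 so that deviations in either direction are measured on the same scale.

```python
from typing import List


def mirrored_af(af: float) -> float:
    """Mirror an allelic fraction around 0.5 (Eq. 2)."""
    return 1.0 - af if af > 0.5 else af


def mirror_all(afs: List[float]) -> List[float]:
    """Apply the mirroring to every SNP of a segment."""
    return [mirrored_af(af) for af in afs]
```

For example, SNPs with AF 0.7 and 0.3 both map to an mAF of about 0.3, so a single downward shift of the mAF distribution captures imbalance regardless of which allele is over-represented.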

Computation of allelic imbalance per segmented region

For each of the identified segments, an allelic imbalance value was computed using a methodology that extends our work in refs. ^27,55. In detail, given a cfDNA sample and an identified SCNA segment S, the SNPs that are heterozygous in the cfDNA sample, contained in the segment, and also present in the panel of controls were selected. These SNPs were used to compute two values: the evidence of allelic imbalance for the segment S, and an estimate of ${\beta }_{S}$, which represents the proportion of local read depth signal that is not attributable to ctDNA⁵⁶. The evidence of allelic imbalance was computed with the formula:

$$E\left({\rm{S}}\right)=\frac{\mathop{\sum }\nolimits_{1}^{k}W\left(d \,>\, D\right)}{k}$$

(3)

where k = 5 (by default), $d$ is the observed mAF distribution in the cfDNA sample, $D$ is a simulated mAF distribution generated by sampling once for each SNP in S from a normal distribution with mean and standard deviation obtained from the panel of controls, and W is a function returning 1 if the difference between ${d}$ and $D$ is statistically significant according to a Wilcoxon signed-rank test with a significance cutoff of 1%, and 0 otherwise.
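The evidence score of Eq. 3 can be sketched as the fraction of k simulated control distributions that differ significantly from the observed one. The sketch below is illustrative only: the signed-rank test is a simple two-sided normal approximation written with the standard library (a stand-in, not the implementation used in the study), the simulated draws are mirrored to obtain mAF values (an assumption), and all function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def signed_rank_pvalue(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided Wilcoxon signed-rank p-value (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average rank shared by tied values
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


def imbalance_evidence(
    d: List[float],
    ctrl_mean: List[float],
    ctrl_sd: List[float],
    k: int = 5,
    alpha: float = 0.01,
    rng: Optional[random.Random] = None,
) -> float:
    """Fraction of k simulated control mAF vectors that differ from d."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(k):
        simulated = []
        for m, s in zip(ctrl_mean, ctrl_sd):
            af = rng.gauss(m, s)
            simulated.append(1.0 - af if af > 0.5 else af)  # mirror (assumption)
        if signed_rank_pvalue(d, simulated) < alpha:
            hits += 1
    return hits / k
```

With the default k = 5, the score can only take the values 0, 0.2, ..., 1, so a threshold of 0.8 on the evidence requires all five comparisons to be significant.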

The ${\beta }_{S}$ estimate was instead computed by comparing $d$ with simulated distributions mimicking different levels of β (representing different ctDNA levels) and searching for the most similar one. Formally:

$$\begin{array}{l}{\beta }_{S}=\min \left\{\beta \mid W\left(d \,>\, {D}_{\beta }\right)\right\}-\left(\min \left\{\beta \mid W\left(d \,>\, {D}_{\beta }\right)\right\}\right.\\\left.\qquad-\max \left\{\beta \mid W\left(d \,<\, {D}_{\beta }\right)\right\}\right)* P\end{array}$$

(4)

with

$$P=\frac{{median}(d-\min (d))}{\max (d)-\min (d)}\quad \text{and}\quad \beta \in \left\{0.01,0.02,\ldots ,0.99,1\right\}$$

(5)

and where $W(d \,>\, {D}_{\beta })$ is the outcome of the Wilcoxon signed-rank test (significance cutoff of 1%) comparing $d$ and ${D}_{\beta }$.

Assignment of copy number state

To assign a copy number state to each SCNA segment identified in a cfDNA sample, a computational approach combining allelic imbalance evidence and read-depth z-score thresholds is used. In detail, given a SCNA segment S identified in a cfDNA sample, the panel of controls log2R table was queried to obtain the set of captured genomic regions that are contained in the segment, denoted as ${R}_{S}$. Then, for each control sample, the median log2R of all ${R}_{S}$ genomic regions was computed resulting in a vector of reference log2R values for the segment S, denoted as ${\log 2R}_{S}$. The z-score for the segment S was then calculated as follows:

$${{zscore}}_{S}=\frac{({\log 2R}_{S}-{mean}({\log 2R}_{S}))}{{std}({\log 2R}_{S})}$$

(6)
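The z-score of Eq. 6 can be sketched as follows. The sketch assumes that the numerator is the log2R of segment S observed in the cfDNA sample, while the mean and standard deviation are taken over the per-control reference vector; it also assumes a complete log2R table (no missing values) and the sample standard deviation. Function names and input layout are hypothetical.

```python
from statistics import mean, median, stdev
from typing import Dict, List, Sequence


def reference_log2r(
    log2r_table: Dict[str, List[float]], regions: Sequence[str]
) -> List[float]:
    """Per-control median log2R over the captured regions of a segment.

    log2r_table maps each captured region to its control log2R values
    (one per control sample, same order for every region).
    """
    n_controls = len(log2r_table[regions[0]])
    return [median(log2r_table[r][i] for r in regions) for i in range(n_controls)]


def segment_zscore(sample_log2r: float, reference: Sequence[float]) -> float:
    """z-score of a sample segment log2R against the control reference vector."""
    return (sample_log2r - mean(reference)) / stdev(reference)
```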

A statistical read-depth z-score threshold ${Z}_{{thr}}$ (by default 2.58) was finally combined with an allelic imbalance evidence threshold ${E}_{{thr}}$ (by default 0.8) to assign a copy number state to each segment S in the following way:

$${SCNA}_{S}=\left\{\begin{array}{ll}\text{imbalanced GAIN (iGAIN)}, & {zscore}_{S} > {Z}_{thr}\wedge E(S) > {E}_{thr}\\ \text{monoallelic LOSS (mLOSS)}, & {zscore}_{S} < -{Z}_{thr}\wedge E(S) > {E}_{thr}\\ \text{allelic imbalance (AI)}, & {zscore}_{S}\in [-{Z}_{thr},{Z}_{thr}]\wedge E(S) > {E}_{thr}\\ \text{GAIN}, & {zscore}_{S} > {Z}_{thr}\wedge E(S)\le {E}_{thr}\\ \text{LOSS}, & {zscore}_{S} < -{Z}_{thr}\wedge E(S)\le {E}_{thr}\end{array}\right.$$

(7)

The state ${AI}$ is used to identify all segments that have evidence of allelic imbalance but for which there is no statistical evidence of SCNA from the read depth analysis. Of note, when the ctDNA level estimation (see below) for the sample is above 15% we can confidently assume that ${AI}$ events are ${LOH}$ events.
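The decision rule of Eq. 7 maps onto a small function. In this sketch the function name is illustrative, and returning `None` for segments with neither read depth nor allelic imbalance evidence is an assumption, since Eq. 7 does not assign a label to that case.

```python
from typing import Optional


def assign_state(
    zscore: float, evidence: float, z_thr: float = 2.58, e_thr: float = 0.8
) -> Optional[str]:
    """Assign a copy number state from the z-score and the evidence E(S)."""
    imbalanced = evidence > e_thr
    if zscore > z_thr:
        return "iGAIN" if imbalanced else "GAIN"
    if zscore < -z_thr:
        return "mLOSS" if imbalanced else "LOSS"
    # z-score within [-z_thr, z_thr]: no read depth evidence of SCNA.
    return "AI" if imbalanced else None
```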

ctDNA level and ploidy estimation

ctDNA detection in cfDNA plasma samples was assessed considering, for each sample, the presence of at least one segment having an allelic imbalance evidence value greater than ${E}_{{thr}}$. ctDNA level for a cfDNA sample was instead estimated considering the β values of all SCNA segments identified as ${mLOSS}$⁵⁶, which represent the set of mono-allelic deletions identified in the cfDNA sample. In detail, an integrated β value was calculated as a weighted mean of all peaks identified in the distribution of segment β values (each β was weighted by the magnitude of its peak). Then, as described in ref. ⁵⁶, the ctDNA level was estimated as:

$${ctDNA}=1-\frac{\beta }{\left(2-\beta \right)}$$

(8)
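Eq. 8 and the weighted integration of β peaks can be sketched as below. The representation of peaks as `(beta, magnitude)` pairs and the function names are assumptions made for illustration; peak detection itself is not shown.

```python
from typing import List, Tuple


def integrated_beta(peaks: List[Tuple[float, float]]) -> float:
    """Weighted mean of beta peaks, each weighted by its magnitude."""
    total = sum(weight for _, weight in peaks)
    return sum(beta * weight for beta, weight in peaks) / total


def ctdna_level(beta: float) -> float:
    """ctDNA level from the integrated beta of mLOSS segments (Eq. 8)."""
    return 1.0 - beta / (2.0 - beta)
```

As a sanity check, β = 1 (no signal attributable to ctDNA) gives a ctDNA level of 0, and β = 2/3 gives 0.5, which matches a mono-allelic deletion in a sample where half of the cells are tumor cells (2 × 0.5 reads from normal copies out of 2 × 0.5 + 1 × 0.5 total).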

For each cfDNA sample having evidence of tumor signal, a ploidy estimation was computed. In detail, sample segments were filtered for those with β equal to 1, indicating no allelic imbalance. Clustering of the log2R values of these segments was then performed using dbscan⁵⁷, and the leftmost cluster was identified as the one representing the balanced copy number 2 (one copy per allele) and used as the shift value to adjust the overall log2R distribution of the sample. More specifically, the adjustment was applied to all segments in order to center putative copy number neutral segments to zero:

$$\log 2R.{corrected}=\log 2R-{shift}$$

(9)

Of note, since ctDNA is typically low in cfDNA samples and ploidy values are extremely challenging to calculate at low ctDNA level, the ploidy adjustment was applied only when the ctDNA level estimation was greater than 15%. When ploidy adjustment was applied, a new ctDNA level estimation was calculated after the adjustment.

In-silico benchmarking

To assess the theoretical detection/estimation limits of our approach, an in-silico cohort of synthetic samples was generated. To this end, synggen, a computational tool we recently developed for the fast generation of large-scale realistic and heterogeneous cancer sequencing synthetic datasets, was used³⁵. To generate the in-silico cohort, profiles of germline SNPs and somatic allele-specific SCNA were collected and retrieved, respectively, from CEU individuals in the 1000 Genomes Project collection and from the TCGA dataset (cbioportal.org).

In detail, sequencing data (BAM files) of the control samples from the available cohort were provided in input to synggen using a specific execution mode that, from those files, extracts a series of statistical models that summarize platform specific data characteristics, such as the distribution of the read depth of coverage, the distributions of read and base qualities, and base-specific systematic errors. These models were then used, in conjunction with the collected SNPs and allele-specific SCNA profiles, to generate synthetic control and cfDNA samples at different levels of ctDNA and average depth of coverage.

More precisely, for a read depth of coverage of ~800x (representing the average coverage of the data we generated), 20 representative breast cancer samples were generated for each of the following decreasing ctDNA levels: [80%, 60%, 40%, 20%, 10%, 7.5%, 5%, 4%, 3%, 2%, 1%, 0.5%, 0.1%]. Then, for increasing read depth of coverage scenarios [1000x, 1500x, 2000x, 2500x], 20 representative breast cancer samples were generated for each of the following decreasing ctDNA levels: [5%, 4%, 3%, 2%, 1%, 0.5%, 0.1%]. In addition, 40 control samples were generated for each simulated depth of coverage scenario, half representing cfDNA samples with no tumor signal and half used to generate the panels of controls. All allele-specific SCNAs were included in the synthetic cfDNA data as clonal events.
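The simulated tumor conditions can be enumerated with a short sketch, which makes the size of the in-silico cohort explicit. The constant and function names are illustrative; the sketch only lists conditions and does not call synggen, and control samples are not included.

```python
from itertools import product
from typing import List, Tuple

# ctDNA levels (%) simulated at ~800x and at the higher coverage scenarios.
BASE_LEVELS = [80, 60, 40, 20, 10, 7.5, 5, 4, 3, 2, 1, 0.5, 0.1]
LOW_LEVELS = [5, 4, 3, 2, 1, 0.5, 0.1]
HIGH_COVERAGES = [1000, 1500, 2000, 2500]


def simulation_grid(n_per_condition: int = 20) -> List[Tuple[int, float, int]]:
    """List every (coverage, ctDNA level, replicate) of the tumor cohort."""
    conditions = [(800, level) for level in BASE_LEVELS]
    conditions += list(product(HIGH_COVERAGES, LOW_LEVELS))
    return [
        (coverage, level, replicate)
        for coverage, level in conditions
        for replicate in range(n_per_condition)
    ]
```

Under these settings the grid contains 41 coverage/ctDNA conditions, that is, 820 synthetic tumor-bearing cfDNA samples.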

The simulated cohort was used to assess the performance of our assay. In particular, we looked into the ability of our computational approach to identify presence of tumor signal in a synthetic cfDNA sample, namely a detection performance, and, if possible, into the estimation of the ctDNA level present in the sample analyzed. Detection performances were tested looking into the accuracy of identifying samples with positive ctDNA at each different simulated condition (coverage/ctDNA level) independently (N = 20).

SNV calls

To detect somatic single nucleotide variants (SNVs) we applied ABEMUS³⁸, a method we previously implemented that is specifically designed for SNV detection in cfDNA samples. ABEMUS was run using our control samples to create the ABEMUS reference model, which captures platform-specific characteristics that are used by the tool to improve the precision of SNV calling. ABEMUS analyses were then performed using the standard computational workflow across all cfDNA samples, considering only the exonic regions captured by the panel (i.e. we excluded all regions capturing assay SNPs). Since we had no matched control samples for our cfDNA samples, we implemented an ad-hoc post-processing strategy to reduce the number of false positives. First, we annotated all identified positions using the SNPnexus web server. Exploiting the sequential samples, we then excluded all identified SNVs that were annotated in the dbSNP database, had a MAF > 0.01 in gnomAD and had an AF > 0.2 in 2/3 of the sequential cfDNA samples. From the remaining calls, we then excluded those that were annotated in the dbSNP database, had a MAF > 0 in gnomAD and had an AF > 0.4 in all the sequential cfDNA samples. Finally, we retained only the SNV calls supported by more than one alternative read and annotated as present in the COSMIC database⁵⁸ or in breast cancer datasets available from the cBioPortal. The clonality of SNVs was determined by calculating the ratio of the SNV’s allele frequency to the sample’s ctDNA level. This calculation assumes the presence of a mono-allelic mutation and includes a correction for mLOSS. If the resulting ratio exceeds 1, it is set to 1. Values above 0.75 are considered indicative of clonal SNVs.
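The post-processing filters described above can be sketched as a single predicate. The record layout (`SnvCall`) and field names are hypothetical, and "2/3 of the sequential samples" is interpreted as at least two thirds of the patient's samples, which is an assumption. The clonality step is not included.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnvCall:
    in_dbsnp: bool
    gnomad_maf: float
    afs: List[float]  # AF of the call in each sequential cfDNA sample
    alt_reads: int
    in_cosmic_or_cbioportal: bool


def keep_call(call: SnvCall) -> bool:
    """Return True if the call survives the germline-like and support filters."""
    n = len(call.afs)
    n_above_02 = sum(af > 0.2 for af in call.afs)
    # Filter 1: common polymorphism with germline-like AF in >= 2/3 of samples.
    if call.in_dbsnp and call.gnomad_maf > 0.01 and 3 * n_above_02 >= 2 * n:
        return False
    # Filter 2: any known polymorphism with AF > 0.4 in all sequential samples.
    if call.in_dbsnp and call.gnomad_maf > 0 and all(af > 0.4 for af in call.afs):
        return False
    # Final retention: read support and presence in cancer databases.
    return call.alt_reads > 1 and call.in_cosmic_or_cbioportal
```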

Statistical analyses

Correlation of ctDNA levels’ estimations was performed using Pearson correlation statistics with significance level set at 5%. Univariate overall survival and progression-free survival analyses were performed using the Kaplan-Meier estimator (log-rank test).
