Introduction

The amplicon package provides tools for analyzing pair end amplicon sequencing data, with a focus on identifying and characterizing PCR artifacts such as primer dimers and off-target amplification products. This vignette will guide you through the main features and typical workflow of the package.

Basic Workflow

1. Loading and Parsing FASTQ Files

The first step is to load your paired-end FASTQ files using the parse_fastq_pairs function, as well as define the sequence of the used primers (both in the same reading direction). Feel free to use the provided (simulated) toy data! The primer sequences are defined accordingly.

# Example with FASTQ files
fp <- "ACGTACGTACGT"
rp <- "TGCATGCATGCA"
forward_reads_path <- system.file("extdata", "r1_test.fastq", package = "amplicon")
reverse_reads_path <- system.file("extdata", "r2_test.fastq", package = "amplicon")
fastq_files <- parse_fastq_pairs(
  fastq_file_r1 = forward_reads_path,
  fastq_file_r2 = reverse_reads_path,
  forward_primer = fp,
  reverse_primer = rp
)

2. Detecting Primer Dimers

After loading your data, you can identify potential primer dimers:

# Detect primer dimers with default parameters
primer_dimers <- detect_primer_dimer(
  reads_df = fastq_files,
  max_dimer_length = 100, # Maximum length for primer dimer classification
  forward_primer = fp,
  reverse_primer = rp
)

# View summary of primer dimers
summary(primer_dimers)

3. Identifying Off-target Amplification

Next, analyze your sequences for off-target amplification products:

# Detect off-target amplification
offtargets <- detect_offtargets(
  reads_df = fastq_files,
  expected_length = 400, # Expected amplicon length
  length_tolerance = 50, # Allowed deviation from expected length
  max_dimer_length = 100 # Maximum length for primer dimer classification
)

# Access the results
print(paste("Too short sequences:", sum(offtargets$too_short)))
print(paste("Too long sequences:", sum(offtargets$too_long)))

4. Validating Amplicons

Check if your amplicons meet the expected criteria:

# Validate amplicons
valid_amplicons <- detect_valid_amplicon(
  reads_df = fastq_files,
  min_quality = 30, # Minimum quality score
  expected_length = 400, # Expected amplicon length
  length_tolerance = 50 # Allowed deviation from expected length
)

5. Comprehensive Analysis

Perform a complete analysis of your amplicon sequencing data. You can choose between creating a combined plot with all samples (when sample number is limited) or separate plots for each sample:

# Run full analysis with combined plot (default)
results_combined <- analyze_amplicons(
  forward_reads_path,
  reverse_reads_path,
  expected_length = 400,
  length_tolerance = 50,
  min_quality = 30,
  max_dimer_length = 100,
  separate_plots = FALSE  # default behavior
)

# Run analysis with separate plots for each sample
results_separate <- analyze_amplicons(
  forward_reads_path,
  reverse_reads_path,
  expected_length = 400,
  length_tolerance = 50,
  min_quality = 30,
  max_dimer_length = 100,
  separate_plots = TRUE  # create individual plots
)

# Access different components of the analysis
print(results_combined$summary_stats)

Visualization

The package provides visualization options to help interpret your results. You can either create a combined plot with all samples using facets, or separate plots for each sample:

# For combined plot (default behavior)
plot(results_combined$length_distribution_plot)

# For separate plots
# results_separate$length_distribution_plot contains a list of plots, one per sample
# Plot a specific sample
plot(results_separate$length_distribution_plot[[1]])  # First sample

# Plot all samples in sequence
for (sample_plot in results_separate$length_distribution_plot) {
  plot(sample_plot)
}

When using separate_plots = TRUE, individual PDF files are automatically saved in your output directory with names like length_distribution_<sample_name>.pdf. With separate_plots = FALSE (default), a single combined plot is saved as length_distribution.pdf.

Advanced Usage

Custom Parameters

You can customize various parameters to suit your specific needs:

# Example with custom parameters
results <- analyze_amplicons(
  fastq_file_r1, # Pair end read paths
  fastq_file_r2,
  forward_primer, # The primers used in your experiment
  reverse_primer,
  output_dir, # Custom output directory
  expected_length = 400,
  length_tolerance = 30, # Stricter length tolerance
  min_quality_score = 35, # Higher quality threshold
  max_dimer_length = 80, # Stricter primer dimer definition
  separate_plots = TRUE # Generate individual plots per sample
)

Conclusion

This vignette covered the basic workflow and main features of the amplicon package. For more detailed information about specific functions, please refer to the function documentation using ?function_name in R.

Session Info

sessionInfo()
#> R version 4.5.0 (2025-04-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] amplicon_0.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.37     desc_1.4.3        R6_2.6.1          fastmap_1.2.0    
#>  [5] xfun_0.52         cachem_1.1.0      knitr_1.50        htmltools_0.5.8.1
#>  [9] rmarkdown_2.29    lifecycle_1.0.4   cli_3.6.5         pkgdown_2.1.2    
#> [13] sass_0.4.10       textshaping_1.0.1 jquerylib_0.1.4   systemfonts_1.2.3
#> [17] compiler_4.5.0    tools_4.5.0       ragg_1.4.0        evaluate_1.0.3   
#> [21] bslib_0.9.0       yaml_2.3.10       jsonlite_2.0.0    rlang_1.1.6      
#> [25] fs_1.6.6