WGS Analysis Report
This report summarizes data quality and data processing, and provides snapshots of results, for your WGS study. It should help you get a general picture of the study, spot any irregularities in the samples or data, and explore the most significant results.
General Statistics
Sample Name | ≥ 30X | Median | Change rate | Ts/Tv | M Variants | Vars | SNP | Indel | Ts/Tv | Duplication | Error rate | Non-primary | Reads mapped | % Mapped | % Proper pairs | Total seqs | GC content | % Adapter |
in4276_1 | 93.0% | 63.0X | 426 | 1.980 | 7.24M | 7138307 | 5575275 | 1569067 | 1.12 | 13.0% | 0.76% | 0.0M | 1467.8M | 99.9% | 99.1% | 1468.8M | 40.8% | 4.7% |
in4276_2 | 91.0% | 46.0X | 406 | 1.992 | 7.60M | 7499329 | 5762463 | 1742280 | 1.07 | 12.3% | 0.52% | 0.0M | 1106.0M | 99.7% | 90.3% | 1109.7M | 41.4% | 16.2% |
in4276_3 | 85.0% | 38.0X | 358 | 1.998 | 8.61M | 8510434 | 6175965 | 2339307 | 0.96 | 11.6% | 0.56% | 0.0M | 905.8M | 99.8% | 82.7% | 908.1M | 40.6% | 19.8% |
in4276_4 | 93.0% | 49.0X | 440 | 1.990 | 7.01M | 6895279 | 5398664 | 1501801 | 1.22 | 14.3% | 0.49% | 0.0M | 1165.2M | 99.9% | 98.4% | 1166.2M | 40.7% | 9.7% |
in4276_5 | 93.0% | 53.0X | 416 | 1.994 | 7.42M | 7307703 | 5539370 | 1773113 | 1.18 | 12.5% | 0.48% | 0.0M | 1429.7M | 99.9% | 98.8% | 1431.0M | 40.6% | 47.6% |
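As a rough illustration of how the table above can be screened for irregularities, the sketch below flags samples by two of its columns. The values are copied from the table; the cut-offs `MIN_PROPER_PAIRS` and `MAX_ADAPTER` and the helper `flag_samples` are hypothetical and chosen only for this example, not recommended thresholds.

```python
# Values copied from the General Statistics table (% Proper pairs, % Adapter).
samples = {
    "in4276_1": {"proper_pairs": 99.1, "adapter": 4.7},
    "in4276_2": {"proper_pairs": 90.3, "adapter": 16.2},
    "in4276_3": {"proper_pairs": 82.7, "adapter": 19.8},
    "in4276_4": {"proper_pairs": 98.4, "adapter": 9.7},
    "in4276_5": {"proper_pairs": 98.8, "adapter": 47.6},
}

# Hypothetical cut-offs, for illustration only.
MIN_PROPER_PAIRS = 95.0
MAX_ADAPTER = 20.0


def flag_samples(table):
    """Return {sample: [reasons]} for samples outside the illustrative cut-offs."""
    flags = {}
    for name, row in table.items():
        reasons = []
        if row["proper_pairs"] < MIN_PROPER_PAIRS:
            reasons.append("low % proper pairs")
        if row["adapter"] > MAX_ADAPTER:
            reasons.append("high % adapter")
        if reasons:
            flags[name] = reasons
    return flags


flagged = flag_samples(samples)
for name, reasons in flagged.items():
    print(name, ", ".join(reasons))
```

With these assumed cut-offs, a flagged sample is only a prompt to look at the corresponding plots further down, not a verdict on the data.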
Cumulative coverage distribution
Proportion of bases in the reference genome with at least a given depth of coverage. All 5 samples have both a global (whole-genome) report and a region report based on a provided BED file; the data shown here is calculated across the BED regions.
Coverage distribution
Proportion of bases in the reference genome with a given depth of coverage. All 5 samples have both a global (whole-genome) report and a region report based on a provided BED file; the data shown here is calculated across the BED regions.
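The cumulative view can be derived from the per-depth view by summing from the high-depth end. A minimal sketch, assuming a hypothetical list `depth_fraction` in which index `d` holds the fraction of bases covered at exactly depth `d` (the toy numbers are not from this study):

```python
def cumulative_coverage(depth_fraction):
    """Return a list where element d is the fraction of bases with depth >= d."""
    cumulative = []
    running = 0.0
    # Walk from the deepest bin down so each step adds one more depth level.
    for fraction in reversed(depth_fraction):
        running += fraction
        cumulative.append(running)
    cumulative.reverse()
    return cumulative


# Toy distribution: 10% of bases at depth 0, 20% at 1, 30% at 2, 40% at 3.
exact = [0.1, 0.2, 0.3, 0.4]
cum = cumulative_coverage(exact)
print(cum)
```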
Average coverage per contig
Average coverage per contig or chromosome
XY coverage
Variants by Genomic Region
The stacked bar plot shows locations of detected variants in the genome and the number of variants for each location.
The upstream and downstream interval size used to detect these genomic regions is 5000 bp by default.
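To make the role of the interval size concrete, here is a deliberately simplified sketch of classifying a position relative to a single gene. The function `classify_position` and its coordinate convention are hypothetical; the annotation tool itself works on transcripts and many more region types, so this is not its implementation.

```python
def classify_position(pos, gene_start, gene_end, strand="+", interval=5000):
    """Classify a 1-based position relative to one gene (simplified)."""
    if gene_start <= pos <= gene_end:
        return "gene"
    # Positions before the gene on the reference are upstream only on the + strand.
    if gene_start - interval <= pos < gene_start:
        return "upstream" if strand == "+" else "downstream"
    if gene_end < pos <= gene_end + interval:
        return "downstream" if strand == "+" else "upstream"
    return "intergenic"


print(classify_position(9000, 10000, 20000))              # within 5000 bp before the gene
print(classify_position(21000, 10000, 20000, strand="-"))  # after the gene, minus strand
print(classify_position(30000, 10000, 20000))              # beyond the interval
```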
Variant Effects by Impact
The stacked bar plot shows the putative impact of detected variants and the number of variants for each impact.
Variants by Effect Types
The stacked bar plot shows the effect of variants at protein level and the number of variants for each effect type.
This plot shows the effect of variants with respect to the mRNA.
Variants by Functional Class
The stacked bar plot shows the effect of variants and the number of variants for each effect type.
Variant Qualities
The line plot shows the number of variants as a function of the variant quality score.
The quality score corresponds to the QUAL column of the VCF file. This score is set by the variant caller.
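A plot of this kind can be approximated by binning the QUAL column directly. The sketch below assumes uncompressed, tab-separated VCF lines; the helper `qual_histogram`, the bin width, and the toy records are all illustrative and not taken from this run.

```python
from collections import Counter


def qual_histogram(vcf_lines, bin_width=10):
    """Count variants per QUAL bin; keys are the lower edge of each bin."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # QUAL is the sixth column of a VCF record
        if qual == ".":
            continue  # missing quality
        bin_start = int(float(qual) // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t35.2\tPASS\t.",
    "chr1\t200\t.\tC\tT\t38.0\tPASS\t.",
    "chr1\t300\t.\tG\tGA\t7.5\tPASS\t.",
    "chr1\t400\t.\tT\tC\t.\tPASS\t.",
]
hist = qual_histogram(toy_vcf)
print(hist)
```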
Variant Substitution Types
Variant Quality
Indel Distribution
GATK4 MarkDuplicates
GATK4 MarkDuplicates metrics generated either by GATK4 MarkDuplicates or EstimateLibraryComplexity (with --use_gatk_spark).
Mark Duplicates
Number of reads, categorised by duplication state. Pair counts are doubled - see help text for details.
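The doubling of pair counts matters when a duplication percentage is derived from the metrics, because a pair contributes two reads. A minimal sketch of that arithmetic, using parameter names that loosely follow the MarkDuplicates metrics columns (the toy counts are invented for illustration):

```python
def percent_duplication(unpaired_examined, pairs_examined,
                        unpaired_duplicates, pair_duplicates):
    """Duplication as a percentage of reads, counting each pair as two reads."""
    total_reads = unpaired_examined + 2 * pairs_examined
    duplicate_reads = unpaired_duplicates + 2 * pair_duplicates
    if total_reads == 0:
        return 0.0
    return 100.0 * duplicate_reads / total_reads


# Toy counts: 100 unpaired reads and 1000 pairs examined,
# of which 10 unpaired reads and 130 pairs are duplicates.
pct = percent_duplication(100, 1000, 10, 130)
print(f"{pct:.1f}%")
```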
Samtools Flagstat
Percent mapped
Alignment metrics from samtools stats; mapped vs. unmapped reads vs. reads mapped with MQ0.
Alignment stats
This module parses the output from samtools stats. All numbers in millions.
FastQC (raw)
FastQC (raw) is a quality control tool for high throughput sequence data, written by Simon Andrews at the Babraham Institute in Cambridge.
Sequence Counts
Sequence counts for each sample. Duplicate read counts are an estimate only.
This plot shows the total number of reads, broken down into unique and duplicate if possible (only more recent versions of FastQC give duplicate info).
Sequence Quality Histograms
The mean quality value across each base position in the read.
To enable multiple samples to be plotted on the same graph, only the mean quality scores are plotted (unlike the box plots seen in FastQC reports).
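For readers who want to see what a per-position mean quality is, the sketch below decodes FASTQ quality strings and averages them by position. It assumes Phred+33 encoding, which is an assumption of this example; `mean_quality_per_position` and the toy strings are illustrative only.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position across all reads."""
    sums, counts = [], []
    for quality in quality_strings:
        for i, char in enumerate(quality):
            if i == len(sums):
                # First read long enough to reach this position.
                sums.append(0)
                counts.append(0)
            sums[i] += ord(char) - offset  # Phred+33 decoding (assumed)
            counts[i] += 1
    return [total / n for total, n in zip(sums, counts)]


# 'I' decodes to Q40 and '5' to Q20 under Phred+33.
toy_qualities = ["IIII", "II5", "I5"]
means = mean_quality_per_position(toy_qualities)
print(means)
```

Reads of different lengths are handled by averaging only over the reads that reach a given position.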
Per Sequence Quality Scores
The number of reads at each average quality score. Shows if a subset of reads has poor quality.
Per Base Sequence Content
The proportion of each base position for which each of the four normal DNA bases has been called.
To enable multiple samples to be shown in a single plot, the base composition data is shown as a heatmap. The colours represent the balance between the four bases: an even distribution should give an even muddy brown colour. Hover over the plot to see the percentage of the four bases under the cursor.
To see the data as a line plot, as in the original FastQC graph, click on a sample track.
Per Sequence GC Content
The average GC content of reads. A normal random library typically has a roughly normal distribution of GC content.
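As a small illustration of the quantity being plotted, the sketch below computes the GC percentage of individual reads. Ignoring N bases in the denominator is an assumption of this example and may differ from how FastQC counts them; `gc_percent` is a hypothetical helper.

```python
def gc_percent(read):
    """GC percentage of one read, ignoring bases other than A, C, G, T."""
    called = [base for base in read.upper() if base in "ACGT"]
    if not called:
        return 0.0
    gc_count = sum(1 for base in called if base in "GC")
    return 100.0 * gc_count / len(called)


toy_reads = ["ACGT", "GGCC", "ATNN"]
gc_values = [gc_percent(read) for read in toy_reads]
print(gc_values)
```

Collecting these per-read values into a histogram gives the kind of distribution described above.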
Per Base N Content
The percentage of base calls at each position for which an N was called.
Sequence Length Distribution
Sequence Duplication Levels
The relative level of duplication found for every sequence.
Overrepresented sequences by sample
The total amount of overrepresented sequences found in each library.
FastQC calculates and lists overrepresented sequences in FastQ files. It would not be possible to show this for all samples in a MultiQC report, so instead this plot shows the number of sequences categorized as overrepresented.
Sometimes, a single sequence may account for a large number of reads in a dataset. To show this, the bars are split into two: the first shows the overrepresented reads that come from the single most common sequence. The second shows the total count from all remaining overrepresented sequences.
Top overrepresented sequences
Top overrepresented sequences across all samples. The table shows the 20 most overrepresented sequences across all samples, ranked by the number of samples they occur in.
Overrepresented sequence | Samples | Occurrences | % of all reads |
Adapter Content
The cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position.
Note that only samples with ≥ 0.1% adapter contamination are shown.
There may be several lines per sample, as one is shown for each adapter detected in the file.
Status Checks
Status for each FastQC section showing whether results seem entirely normal (green), slightly abnormal (orange) or very unusual (red).
FastP (Read preprocessing)
FastP (Read preprocessing) is an ultra-fast all-in-one FASTQ preprocessor (QC, adapters, trimming, filtering, splitting...). DOI: 10.1093/bioinformatics/bty560.
Filtered Reads
Filtering statistics of sampled reads.
Insert Sizes
Insert size estimation of sampled reads.
Sequence Quality
Average sequencing quality over each base of all reads.
GC Content
Average GC content over each base of all reads.
N content
Average N content over each base of all reads.
Observed Quality Scores
This plot shows the distribution of base quality scores in each sample before and after base quality score recalibration (BQSR). Applying BQSR should broaden the distribution of base quality scores.
Reported Quality vs. Empirical Quality
Plot shows the reported quality score vs the empirical quality score.
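An empirical quality score is a Phred-scaled version of the observed mismatch rate among bases that share a reported quality. The sketch below shows that conversion in its simplest form; the +1/+2 smoothing is an assumption made here to avoid taking the logarithm of zero, and the recalibration tool's exact estimator may differ.

```python
import math


def empirical_quality(n_observations, n_mismatches):
    """Phred-scaled empirical quality from mismatch counts (smoothed)."""
    # Smoothing keeps the error rate strictly between 0 and 1 (assumed here).
    error_rate = (n_mismatches + 1) / (n_observations + 2)
    return -10.0 * math.log10(error_rate)


# Toy bin: 998 bases observed with no mismatches gives an error rate of 1/1000.
q = empirical_quality(998, 0)
print(round(q, 1))
```

Bins whose empirical quality falls well below the reported quality are the ones where the reported scores were overconfident.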
Vcftools is a program for working with and reporting on VCF files. DOI: 10.1093/bioinformatics/btr330.
TsTv by Count
The transition to transversion ratio as a function of alternative allele count, from the output of vcftools TsTv-by-count.
TsTv by Qual
The transition to transversion ratio as a function of SNP quality, from the output of vcftools TsTv-by-qual.
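For reference, a transition is a purine-to-purine or pyrimidine-to-pyrimidine change (A<->G, C<->T) and every other single-base substitution is a transversion. A minimal sketch of the ratio on a toy list of (REF, ALT) pairs; `ts_tv_ratio` is a hypothetical helper, not part of vcftools:

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) SNP pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Skip indels, multi-base alleles, and non-variant records.
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return float("nan")
    return transitions / transversions


toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
ratio = ts_tv_ratio(toy_snps)
print(ratio)
```

Computing the same ratio separately for groups of SNPs binned by allele count or by quality gives curves of the kind plotted in these two sections.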
Software Versions
Software Versions lists versions of software tools extracted from file contents.
Group | Software | Version |
BCFTOOLS_STATS | bcftools | 1.18 |
BWAMEM1_MEM | bwa | 0.7.17.post1188 |
BWAMEM1_MEM | samtools | 1.19.2 |
CRAM_TO_BAM | samtools | 1.19.2 |
CRAM_TO_BAM_RECAL | samtools | 1.19.2 |
DEEPVARIANT | deepvariant | 1.5.0 |
FASTP | fastp | 0.23.4 |
FASTQC | fastqc | 0.12.1 |
GATK4 MarkDuplicates | gatk4 | |
GATK4 MarkDuplicates | samtools | 1.19.2 |
GATK4_APPLYBQSR | gatk4 | |
INDEX_CRAM | samtools | 1.19.2 |
INDEX_MERGE_BAM | samtools | 1.19.2 |
MERGE_BAM | samtools | 1.19.2 |
MERGE_CRAM | samtools | 1.19.2 |
Mosdepth | mosdepth | 0.3.6 |
SAMTOOLS_STATS | samtools | 1.19.2 |
SNPEFF_SNPEFF | snpeff | 5.1d |
TABIX_BGZIPTABIX | tabix | 1.19.1 |
VCFTOOLS_TSTV_COUNT | vcftools | 0.1.16 |
Workflow | Nextflow | 23.10.1 |
Workflow | nf-core/sarek | 3.4.1 |
nextflow run /home/hpatel/sarek/sarek_custom/sarek/main.nf --input /home/hpatel/in4276.csv --outdir 's3://zymo-filesystem/home/hpatel/in4276_logo_sarek_run/in4276/' -w /mnt/workdir/hpatel/ -profile slurm,apptainer --partition devel --genome null --igenomes_ignore --fasta 's3://zymo-filesystem/home/hpatel/reference_genomes/GIAB/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta' --trim_fastq true --save_trimmed true --save_mapped true --save_output_as_bam true --dbsnp 's3://zymo-filesystem/home/hpatel/reference_genomes/GRCh38/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz' --dbsnp_tbi 's3://zymo-filesystem/home/hpatel/reference_genomes/GRCh38/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz.tbi' --known_indels 's3://zymo-filesystem/home/hpatel/reference_genomes/GRCh38/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz' --known_indels_tbi 's3://zymo-filesystem/home/hpatel/reference_genomes/GRCh38/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi' --snpeff_cache 's3://zymo-filesystem/home/hpatel/sarek/Annotation/snpeff_cache/' --snpeff_db 105 --snpeff_genome GRCh38 --tools deepvariant,snpeff --multiqc_config /home/hpatel/sarek/multiqc_custom_config.yml --custom_config_base /home/hpatel/sarek/ -resume
nf-core/sarek Workflow Summary
Core Nextflow options
- runName: small_lavoisier
- containerEngine: apptainer
- launchDir: /mnt/home/hpatel/in4276
- workDir: /mnt/workdir/hpatel
- projectDir: /home/hpatel/sarek/sarek_custom/sarek
- userName: hpatel
- profile: slurm,apptainer
- configFiles: N/A
Input/output options
- input: /home/hpatel/in4276.csv
- outdir: s3://zymo-filesystem/home/hpatel/in4276_logo_sarek_run/in4276/
Main options
- tools: deepvariant,snpeff
FASTQ Preprocessing
- trim_fastq: true
- save_trimmed: true
- save_mapped: true
- save_output_as_bam: true
Reference genome options
- genome: null
- dbsnp: s3://zymo-filesystem/home/hpatel/reference_genomes/GRCh38/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz
- dbsnp_tbi: s3://zymo-filesystem/home/hpatel/reference_genomes/GRCh38/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz.tbi
- fasta: s3://zymo-filesystem/home/hpatel/reference_genomes/GIAB/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta
- known_indels: s3://zymo-filesystem/home/hpatel/reference_genomes/GRCh38/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
- known_indels_tbi: s3://zymo-filesystem/home/hpatel/reference_genomes/GRCh38/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
- snpeff_db: 105
- snpeff_genome: GRCh38
- igenomes_ignore: true
- snpeff_cache: s3://zymo-filesystem/home/hpatel/sarek/Annotation/snpeff_cache/
Institutional config options
- custom_config_base: /home/hpatel/sarek/
Generic options
- multiqc_config: /home/hpatel/sarek/multiqc_custom_config.yml
- validationLenientMode: true