The job of a geneticist or bioinformatician is not an easy one. The challenge lies in identifying clinically relevant variants within the large amounts of data that sequencing generates. Fortunately, technological advancements and the availability of whole exome and whole genome sequencing analysis software have made this feasible, even as reads get longer and longer.
Let us start with what whole exome sequencing (WES) and whole genome sequencing (WGS) are, both best known by their acronyms. Whole exome sequencing focuses on capturing and sequencing the protein-coding regions of an organism’s genome, which can influence numerous inheritable traits. Through target capture techniques, the exome is isolated and enriched. The enriched exome is then sequenced on high-throughput sequencing platforms. The resulting data is analyzed to identify genetic variants within the exome, which are relevant for understanding disease-causing mutations or genetic associations.
Whole genome sequencing involves sequencing the entire genome, including both the coding and non-coding regions. By analyzing the resulting data, we can uncover genetic variants that help us understand disease-causing mutations and genetic associations. Whole genome sequencing provides a comprehensive view of an organism’s genetic makeup, allowing researchers to explore both known and novel variants across the genome.
While exome sequencing is more cost-effective and efficient for targeted analysis of protein-coding regions, the choice between the two methods depends on your research objectives and the level of genomic coverage that you require. Both methods offer valuable insights into the complex world of genetics.
Essential Tools and Technology for Whole Exome and Whole Genome Sequencing
Let’s look at popular sequencing technology solutions that are used in research and clinical settings. These include:
- Illumina: The NovaSeq and HiSeq platforms are widely used for high-throughput sequencing. Using a sequencing-by-synthesis approach, they generate large amounts of sequencing data with high accuracy.
- Pacific Biosciences: Commonly known as PacBio, this platform is based on single-molecule real-time (SMRT) technology. It can deliver highly accurate long reads, which is advantageous for applications such as detecting structural variants.
- Oxford Nanopore: This platform is based on nanopore technology, in which DNA molecules pass through nanopores and generate changes in an electrical signal. Millions of these changes are measured to determine the sequence of each fragment. The technology enables long-read sequencing and real-time data analysis, and the sequencing is relatively fast and cost-effective.
- Ion Torrent: Ion Torrent sequencing uses semiconductor-based technology to detect changes in pH during DNA synthesis, generating thousands of megabases of sequence information per day. This technology is fast and cost-effective.
Your choice of sequencing technology depends largely on the specific scientific goals, budget, and sequencing needs of the research project, and each platform has its own read length, error rate, throughput, and cost.
Strategies for filtering and prioritizing whole genome sequencing data are essential to identify the relevant genetic variants that may contribute to disease or have functional significance. When analyzing large genomic data sets, this can be challenging due to the complexity of the data and the need to distinguish between common polymorphisms and rare genetic variants.
Let us explore the challenges associated with variant analysis and highlight some strategies to prioritize genetic variants:
An important challenge in variant analysis is distinguishing between pathogenic and benign variants, a distinction that can bring a patient either a huge sigh of relief or increased worry within seconds. It’s important to keep in mind that the sequencing data will contain numerous detected variants, including common polymorphisms that have no clinical significance. To overcome this challenge, we must employ a filtering strategy. Such strategies filter variants based on criteria such as allele frequency, functional impact, conservation scores, and presence in population databases. For example, variants with a high allele frequency in the general population are more likely to be benign, while variants with a low frequency or that are absent in healthy individuals may be considered more relevant.
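The allele frequency example above can be sketched in a few lines of Python. This is only an illustration: the record layout, the `population_af` field name, and the 1% cutoff are assumptions chosen for the example, not values taken from a specific tool or guideline.

```python
from typing import Optional

# Hypothetical variant records. "population_af" is the allele frequency reported
# by a population database, or None when the variant was not observed there.
variants = [
    {"id": "var1", "population_af": 0.32},    # common polymorphism
    {"id": "var2", "population_af": 0.0004},  # rare variant
    {"id": "var3", "population_af": None},    # absent from the database
]


def is_rare(population_af: Optional[float], max_af: float = 0.01) -> bool:
    """Return True for variants absent from the database or below the cutoff.

    The 1% default is an illustrative assumption, not a recommended value.
    """
    return population_af is None or population_af < max_af


# Keep only the variants that are rare or unobserved in the general population.
rare_variants = [v for v in variants if is_rare(v["population_af"])]
print([v["id"] for v in rare_variants])
```

Treating a missing frequency as "rare" keeps novel variants in the candidate list, which matches the idea that variants absent in healthy individuals may be more relevant.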
Important definitions:
Allele frequency: Allele frequency (also called gene frequency) is the term used to describe the fraction of gene copies that are of a particular allele in a defined population.
Functional impact: The functional impact of a variant describes how the genetic alteration affects gene or protein function. A variant may be described as deleterious, pathogenic, disease-causing, or predisposing when it increases an individual’s susceptibility or predisposition to a certain disease or disorder.
Conservation scores: Conservation scores are calculated per base, indicating how many species in each alignment match at each locus.
Population databases: In variant analysis, population databases (for example, gnomAD or the 1000 Genomes Project) record how often each variant is observed in large reference cohorts. They make it possible to separate common polymorphisms from rare variants.
Genetic database: A genetic database is one or more sets of genetic data (genes, gene products, variants, phenotypes) stored together with software that enables users to retrieve genetic data, add genetic data, and extract information from the data.
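To make the allele frequency definition above concrete, here is a minimal sketch of the calculation for a single biallelic site in diploid individuals. The function name and the genotype counts are hypothetical and serve only to illustrate the arithmetic.

```python
def allele_frequency(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Frequency of the alternate allele at a biallelic site in diploid individuals.

    Each individual carries two gene copies, so heterozygotes contribute one
    alternate allele and alternate homozygotes contribute two.
    """
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if total_alleles == 0:
        raise ValueError("no genotyped individuals")
    alt_alleles = 2 * n_hom_alt + n_het
    return alt_alleles / total_alleles


# Hypothetical cohort of 1,000 individuals:
# 900 homozygous reference, 90 heterozygous, 10 homozygous alternate.
af = allele_frequency(900, 90, 10)
print(f"{af:.3f}")  # (2 * 10 + 90) / 2000 = 0.055
```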
Another crucial aspect of variant prioritization is functional annotation. Annotation involves assessing a variant’s potential impact on protein function or gene regulation. Multiple computational tools and databases are available to analyze protein-coding regions, non-coding regions, and regulatory elements in order to predict and compare the functional consequences of genetic variants. Transcription factor binding sites, splicing alterations, and potential protein changes can all be assessed with these tools. These computational tools play a vital role in guiding researchers to make informed decisions.
Where can you look for information? Scientific journals and conferences are often helpful for exploring the most recent advancements in technology, and online databases are constantly updated with new genomic research and variant analysis data.
Best Practices for Filtering and Prioritizing Genetic Variants in Whole Genome Sequencing
Let’s dive deep into the challenges of analyzing large genomic datasets and identifying relevant rare or novel genetic variants. Let us also look at the bioinformatics tools and filtering strategies we can use to prioritize variants based on their functional impact and disease relevance. Ready to jump in? So are we.
Challenging issues in analyzing large genomic datasets
Geneticists know that the sheer size and volume of the data make it challenging to differentiate between pathogenic variants and benign polymorphisms when identifying variants relevant to the studied phenotype or disease. The complexity of the human genome can often leave a well-meaning researcher at a loss. Overcoming these challenges requires careful consideration and the application of advanced filtering techniques.
Strategies Anyone? Variant Filtering for Whole Genome Sequencing
Bioinformatic Tools for Variant Prioritization: Identifying relevant genetic variants can be tricky and time-consuming without an automated tool for variant prioritization. Algorithms and tools have been developed to help organize and streamline this process. A commonly used tool for variant prioritization is Geneyx Analysis:
- Smart filtering: Geneyx Analysis offers a powerful feature called Smart filtering, which allows users to define frequency and CADD score thresholds to filter out irrelevant variants prior to VCF annotation. This capability enables the prioritization of rare or low-frequency variants and variants with higher predicted pathogenicity scores, streamlining the analysis process. By leveraging Smart filtering, researchers and clinicians can efficiently narrow down the variants of interest, reducing workload and enhancing downstream interpretation accuracy. This feature helps focus investigation efforts, saving valuable time and resources. Additionally, Smart filtering in Geneyx Analysis can incorporate read depth thresholds and SpliceAI values for further customization and refinement of variant filtering.
Geneyx Analysis – Smart filtering in the version 5.10 release
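Threshold-based filtering of this kind can be sketched generically in Python. The sketch below is not the Geneyx Analysis implementation or API. The `Thresholds` and `Variant` classes, their field names, and every default value are hypothetical, and the rule that a high SpliceAI score can rescue a variant with a low CADD score is one possible design choice among many.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """Illustrative cutoffs; all values are assumptions for this example."""
    max_af: float = 0.01       # keep variants rarer than this frequency
    min_cadd: float = 20.0     # keep variants with at least this CADD score
    min_depth: int = 10        # discard calls supported by fewer reads
    min_spliceai: float = 0.5  # splice score that can rescue a low-CADD variant


@dataclass
class Variant:
    """Hypothetical, already-annotated variant record."""
    name: str
    af: Optional[float]        # population allele frequency, None if unobserved
    cadd: Optional[float]      # CADD score, None if unavailable
    depth: int                 # read depth at the site
    spliceai: Optional[float] = None


def passes(v: Variant, t: Thresholds) -> bool:
    """Apply depth, frequency, and predicted-impact thresholds in turn."""
    if v.depth < t.min_depth:
        return False
    if v.af is not None and v.af >= t.max_af:
        return False
    predicted_damaging = v.cadd is not None and v.cadd >= t.min_cadd
    predicted_splice = v.spliceai is not None and v.spliceai >= t.min_spliceai
    return predicted_damaging or predicted_splice


calls = [
    Variant("a", af=0.2, cadd=25.0, depth=40),                 # too common
    Variant("b", af=0.0001, cadd=28.0, depth=35),              # rare and high CADD
    Variant("c", af=None, cadd=5.0, depth=50, spliceai=0.8),   # kept via splice score
    Variant("d", af=0.0001, cadd=30.0, depth=4),               # too few reads
]
kept = [v.name for v in calls if passes(v, Thresholds())]
print(kept)
```

Keeping the thresholds in one small object makes it easy to rerun the same filter with stricter or looser settings, for example `Thresholds(max_af=0.001)`.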
Filtering strategies are employed to identify and prioritize those variants that have functional impact or disease relevance.
The strategies involve integrating various data sources that can be found in human genomic data repositories at the National Center for Biotechnology Information (NCBI).
Also, researchers and geneticists should consider allele frequency, conservation scores, predicted protein changes, and known disease associations.
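One way to combine these considerations is a simple additive score used to rank candidates. The following is a rough sketch under stated assumptions: the field names, the consequence labels, the 0 to 1 conservation scale, and all weights are invented for illustration and do not reflect any validated scoring scheme.

```python
def priority_score(variant: dict) -> float:
    """Add up evidence from frequency, conservation, protein change, and known disease links."""
    score = 0.0

    # Rarer variants score higher; unobserved variants count as very rare.
    af = variant.get("af")
    if af is None or af < 0.001:
        score += 2
    elif af < 0.01:
        score += 1

    # Conservation is assumed to be normalized to the range 0 to 1.
    if variant.get("conservation", 0.0) >= 0.9:
        score += 1

    # Predicted protein change, using hypothetical consequence labels.
    change = variant.get("protein_change")
    if change in {"stop_gained", "frameshift"}:
        score += 2
    elif change == "missense":
        score += 1

    # A previously reported disease association is weighted most heavily.
    if variant.get("known_disease_association"):
        score += 3

    return score


candidates = [
    {"id": "v1", "af": 0.005, "conservation": 0.95,
     "protein_change": "missense", "known_disease_association": False},
    {"id": "v2", "af": None, "conservation": 0.99,
     "protein_change": "stop_gained", "known_disease_association": True},
    {"id": "v3", "af": 0.0005, "conservation": 0.2,
     "protein_change": "synonymous", "known_disease_association": False},
]

# Highest score first.
ranked = sorted(candidates, key=priority_score, reverse=True)
print([(v["id"], priority_score(v)) for v in ranked])
```

A ranking like this does not replace expert review. It only orders the candidates so that the variants with the most supporting evidence are examined first.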
The analysis of large genomic datasets and the prioritization of genetic variants in whole genome sequences are complex tasks. However, researchers can identify variants with functional implications and disease relevance using bioinformatics technology, tools, and filtering strategies. The sequencing technologies we discussed include Illumina, Oxford Nanopore Technologies, PacBio, and Ion Torrent. Data from these technologies integrates into the AI-based Geneyx Analysis software to make variant analysis seamless. It is essential that we understand the challenges involved in analyzing genetic variation, as well as the appropriate tools and strategies, so that we can improve our understanding of how genetic variation impacts human health and disease.
All updates to Geneyx Analysis can be found in our video library here