Scientific article on the topic 'INTEGRATED ANALYSIS OF SINGLE NUCLEOTIDE POLYMORPHISMS (SNP) SITES AND MUTATIONS IN THE CYSTIC FIBROSIS TRANSMEMBRANE CONDUCTANCE REGULATOR (CFTR) GENE' (specialty: Biological Sciences)

CC BY

Scientific article in the biological sciences; author of the work: Ke Ophelia



https://doi.org/10.29013/ELBLS-20-4-15-28

Ke Ophelia, Cate School Class of 2022, advised by Dr. Pingzhang Wang of Peking University and Ivy Mind Analytics

E-mail: ophelia_ke@cate.org

INTEGRATED ANALYSIS OF SINGLE NUCLEOTIDE POLYMORPHISMS (SNP) SITES AND MUTATIONS IN THE CYSTIC FIBROSIS TRANSMEMBRANE CONDUCTANCE REGULATOR (CFTR) GENE

Abstract. Cystic fibrosis (CF) is a genetic health condition that affects the lungs and digestive system of more than 70,000 people worldwide. It is characterized by a faulty protein (CFTR) that affects the body's cells, tissues, and the glands that produce mucus and sweat. This research develops an integrated analysis of the single nucleotide polymorphism (SNP) sites found in the CFTR gene; an SNP is a type of genetic variation representing a difference in a single nucleotide.

Objectives: The purpose of this study is to identify SNP sites and mutations in the CFTR (Cystic Fibrosis Transmembrane Conductance Regulator) gene that might cause cystic fibrosis and various types of cancer. The study revealed potential SNPs that can help medical professionals with early diagnosis and risk evaluation for CF patients. It also examined the structure of the CFTR gene, the distribution of its exome variant functions, and its relationship with types of cancer. If these results are studied and analyzed further, they may reveal a new method for the diagnosis and treatment of CF and, hopefully, a cure.

Method: Genetic information on both afflicted and healthy individuals was downloaded from data collected by institutions such as the National Center for Biotechnology Information (NCBI). Then, using the Terminal application on the Macintosh operating system, SNP sites were extracted into a VCF file. Integrated and statistical analysis with the online tool wANNOVAR was then used to pinpoint how these sites affect the phenotypic variability of CF. The results were visualized with R, RStudio, and Microsoft Excel. Online resources such as the Genome Browser were also utilized.

Results: The integrated analysis identified key information on this genetic disorder across the downloaded and extracted exome dataset, including the distribution of SNP functional classes, the frequency of CF occurrences, the associated types of cancer, the structure of the CFTR gene and its gene product, and SIFT scores.

Conclusion: Through an accurate and thorough analysis of single nucleotide polymorphisms (SNPs) and mutations in the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene, the effects, influence, and types of SNP sites and somatic mutations were identified. This can further our understanding of health conditions associated with the gene and its product (the CFTR protein) and enable medical communities to take the necessary action to overcome these disorders.

Keywords: Catalogue of Somatic Mutations in Cancer (COSMIC), CFTR Modulator Therapy, ClinVar, Cystic Fibrosis (CF), Cystic Fibrosis Transmembrane Conductance Regulator (CFTR), Exome, Gene Therapy, Genome, National Center for Biotechnology Information (NCBI), Single Nucleotide Polymorphism (SNP), Single Nucleotide Polymorphism Database (dbSNP), SNP Functional Classes, Sorting Intolerant from Tolerant (SIFT Score and Prediction), Single Nucleotide Variant (SNV)

I. Introduction

Cystic Fibrosis (CF) is a rare and inherited genetic disorder whose symptoms often include chronic cough, lung infections, and shortness of breath. It is caused by defects in the cystic fibrosis transmembrane conductance regulator (CFTR) gene. Situated on the seventh chromosome, the gene consists of twenty-seven exons of DNA and codes for 1,480 amino acids. Its product is the CFTR protein, which regulates the chloride ion content of epithelial cells that line the nasal cavity, lungs, and stomach by acting as a channel across the membrane [1; 2; 3].

The CFTR protein acts as a channel carrying chloride ions into and out of human body cells, which aids the movement of water in tissues. This is crucial for the production of mucus, a substance that lubricates and protects the lining of the respiratory, digestive, and reproductive systems. The CFTR protein also controls the functions of other channels, such as the ones which move sodium ions across cell membranes. These are necessary for organs like the lungs and pancreas to work. When chloride ions cannot leave the cell, water is retained through osmosis, which causes the production of more viscous fluids [4].

The symptoms of CF depend on which organs are affected and the severity of the condition. The most serious and common complications of cystic fibrosis are pulmonary or respiratory problems, which may include serious lung infections. Patients diagnosed with CF often also have problems maintaining good nutrition, as they find it difficult to absorb the nutrients from food, which delays growth [5]. Until late into the twentieth century, few people diagnosed with CF lived beyond childhood. While improvements in medical care have succeeded in increasing life expectancy, there is still no cure.

More than 1,200 faults have been discovered in the CFTR gene. Of these, the most frequent mutation is the deletion of a single amino acid at position 508 of the CFTR protein. It is referred to as ΔF508 and accounts for approximately seventy percent of CF cases. Other mutations to the CFTR gene cause changes to the protein's structure, stability, or production, ultimately inhibiting the successful regulation of chloride ions in epithelial cells [6].

Single nucleotide polymorphisms, more commonly known as SNPs, are the most common type of genetic variation among people, each representing a difference in a nucleotide, or a single DNA building block. These variations are most frequently found in the DNA between genes. The scientific community uses SNPs as biological markers because they help pinpoint which genes are associated with a disease. However, when SNPs occur within a gene or in a regulatory region near the gene, they can themselves be the cause of the affliction by affecting the function of that gene [7].

II. Procedure

A. Materials

This study was based mainly on bioinformatics analysis and involved the use and integration of publicly available datasets, tools, software, and other online resources. The tools used to complete this study include the Macintosh operating system (MacOS); Microsoft Excel; dbSNP, an online database for single-nucleotide variations; and wANNOVAR, which was used to annotate the functional consequences of genetic variation from high-throughput sequencing data. Other online tools used were the UCSC Genome Browser, the National Cancer Institute GDC Data Portal, and cBioPortal.

B. General Overview

From the online genetic database dbSNP, data on Cystic Fibrosis patients and healthy controls from around the world were downloaded in the form of a GZ file. After decompressing the file with the Macintosh Operating System (MacOS), all of the SNP sites on the CFTR gene were extracted. Then, individual analysis was performed on this information with the web-based tool wANNOVAR, resulting in an analysis of the genome and exome of the CFTR gene. With this information gathered, a more integrated study was performed, furthered by the extensive use of visual representations as well as detailed explanations.

C. Online Gene Database (Single Nucleotide Polymorphism Database)

The Single Nucleotide Polymorphism Database (dbSNP) is a free public web-based archive that records genetic variation within and across different species. It was developed by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). It contains human single nucleotide variations (SNV), microsatellites, and small-scale insertions and deletions, along with information on publication, population frequency, molecular consequences, and genomic and RefSeq mapping for both common variations and clinical mutations.

D. Extraction of SNP Sites on MacOS

After decompressing the file with the Macintosh Operating System (MacOS), all of the SNP sites on the CFTR gene were extracted using the command: grep CFTR 00-All.vcf > CFTR.result.vcf.
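
As an illustration of the extraction step, the following Python sketch mirrors the grep filter and can read the compressed file directly. It is not the procedure used in the study: the function names, the sample records, and the choice to skip header lines are assumptions made for this example. Like grep, it matches any line containing the text CFTR rather than parsing the annotation fields.

```python
import gzip
import io


def extract_gene_records(lines, gene="CFTR"):
    """Yield VCF data lines that mention `gene`, skipping '#' header lines."""
    for line in lines:
        if line.startswith("#"):
            continue  # unlike plain grep, headers are never kept here
        if gene in line:
            yield line


def extract_from_gzip(path, gene="CFTR"):
    """Same filter applied to a gzip-compressed VCF file (path is supplied by the caller)."""
    with gzip.open(path, "rt") as handle:
        return list(extract_gene_records(handle, gene))


# Tiny in-memory stand-in for a decompressed VCF file; the records are made up.
sample_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "7\t117120149\trs0000001\tA\tG\t.\t.\tGENEINFO=CFTR:1080\n"
    "7\t55086725\trs0000002\tC\tT\t.\t.\tGENEINFO=EGFR:1956\n"
)

cftr_records = list(extract_gene_records(io.StringIO(sample_vcf)))
print(len(cftr_records), "matching record(s)")
```

Because the match is a plain substring test, a gene whose name merely contains "CFTR" would also be kept; a stricter version would parse the INFO column.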

E. Annotation of Genetic Variation on wANNOVAR

By default, wANNOVAR performs an "individual analysis" on the uploaded VCF file to help find disease-causing genes, drawing on various other online sources. The resulting files are split into information on the exome and on the genome. They can be read by Microsoft Excel and include a plethora of data on the CFTR gene [8; 9].

F. Using Terminal to Reveal SNP Exonic Functions

Using the Terminal application in the Macintosh Operating System (MacOS), the unique values were classified and sorted from the 2,481 individual records into 11 exonic variant classes: frameshift deletion, frameshift insertion, frameshift substitution, non-frameshift deletion, non-frameshift insertion, non-frameshift substitution, nonsynonymous SNV, start-loss, stop-gain, stop-loss, and synonymous SNV. The distribution was then represented visually as a pie chart.
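
The counting step can be sketched in Python as follows. This is an illustrative alternative to the Terminal commands, not the study's own script. The function name is hypothetical, the input is assumed to be a comma-separated export with an ExonicFunc.refGene column, and the sample rows are the ten rows shown in Table 1 rather than the full 2,481-record dataset.

```python
import csv
import io
from collections import Counter


def count_exonic_functions(rows, column="ExonicFunc.refGene"):
    """Count how many records fall into each exonic function class."""
    return Counter(row[column] for row in rows if row.get(column))


# The ten rows of Table 1, written as a small CSV for demonstration.
sample_csv = io.StringIO(
    "Start,End,Ref,Alt,ExonicFunc.refGene\n"
    "117120149,117120149,A,G,startloss\n"
    "117120150,117120150,T,A,startloss\n"
    "117120150,117120150,T,C,startloss\n"
    "117120151,117120151,G,A,startloss\n"
    "117120151,117120151,G,T,startloss\n"
    "117120152,117120152,C,T,stopgain\n"
    "117120158,117120158,T,G,nonsynonymous SNV\n"
    "117120159,117120159,C,A,stopgain\n"
    "117120159,117120159,C,T,nonsynonymous SNV\n"
    "117120160,117120160,G,T,synonymous SNV\n"
)

counts = count_exonic_functions(csv.DictReader(sample_csv))
total = sum(counts.values())
# Percentage share of each class, i.e. the numbers a pie chart would display.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, n in counts.most_common():
    print(f"{name}: {n} ({shares[name]}%)")
```

The same function could be pointed at the genome summary by passing a different column name, provided the export uses one column per annotation.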

G. Using Terminal to Reveal SNP Genomic Functions

Due to significant time constraints, most data analysis was restricted to information on the exome, but a genomic summary was still produced. Similar to the procedure above, genomic functions were sorted from the 33,355 individual records into 9 genomic variant classes: UTR3 (3 Prime Untranslated Region), UTR5 (5 Prime Untranslated Region), downstream, exonic, exonic splicing, intergenic, intronic, splicing, and upstream. The data was then represented visually as a pie chart.

H. UCSC Genome Browser: Human GRCh37/hg19

By entering a specific range of the CFTR gene onto the UCSC Genome Browser: Human GRCh37/hg19, a detailed picture of the gene's structure was created and downloaded.

I. Frequency of CF Occurrence (ClinVar_DIS)

Through the integrated analysis performed by wANNOVAR, information from ClinVar, a database which aggregates information about genomic variation and its relationship to human health, was accessed. With this link established, the frequency of CF occurrence in relation to SNP sites on the CFTR gene can be summarized in a chart.

J. SIFT Score Prediction

SIFT, short for Sorting Intolerant From Tolerant, is a tool that predicts whether or not an amino acid substitution will affect protein function. This way, users, particularly those in the medical community, can prioritize substitutions for further study. The SIFT score ranges from 0.0 (deleterious/harmful) to 1.0 (benign/tolerated). By examining these scores, a connection was drawn between each portion of the gene and its effect.

K. COSMICID, COSMICDIS

COSMIC, or the Catalogue of Somatic Mutations in Cancer, is the world's foremost online database for investigating the impact of somatic mutations in human cancer. After querying the database for the CFTR gene, figures and statistics were downloaded for this project.


L. NCI GDC Data Portal and cBioPortal

To venture further into the mutations of the CFTR gene and their connection to human cancer, two resources were accessed: the NCI GDC Data Portal, a data-driven platform, and cBioPortal for Cancer Genomics, a software tool. Both provided alteration frequencies as well as frequent somatic mutations.

M. UniProt Knowledgebase

The UniProtKB (abbrev. for Knowledgebase) is the world's central resource on the functional information on proteins, the information of which is derived from the current research literature. With this, information on the CFTR gene was gathered and clarified for the purpose of providing a more thorough analysis.

III. Results

N. Distribution of Exome Variant Functions

Table 1. - A portion of the resulting file (displayed on Microsoft Excel) that shows the location of the SNP site along with the exome variant associated with it. The highlighted column (ExonicFunc.refGene) includes the following: frameshift deletion, frameshift insertion, frameshift substitution, non-frameshift deletion, non-frameshift insertion, non-frameshift substitution, start-loss, stop-gain, stop-loss, nonsynonymous SNV, and synonymous SNV

Start      End        Ref  Alt  ExonicFunc.refGene
117120149  117120149  A    G    startloss
117120150  117120150  T    A    startloss
117120150  117120150  T    C    startloss
117120151  117120151  G    A    startloss
117120151  117120151  G    T    startloss
117120152  117120152  C    T    stopgain
117120158  117120158  T    G    nonsynonymous SNV
117120159  117120159  C    A    stopgain
117120159  117120159  C    T    nonsynonymous SNV
117120160  117120160  G    T    synonymous SNV

As referenced above in Table 1, wANNOVAR sorted the SNPs on the CFTR gene into eleven types. SNP sites can fall within the coding sequences, the non-coding regions of the gene, or the intergenic zones. Single-nucleotide changes located within the coding regions are either synonymous or nonsynonymous. Synonymous mutations are fairly common, but since they do not affect the amino acid sequence of a protein, they usually go unnoticed.

Insertions, deletions, and block substitutions are classified by their effect on the reading frame. When the number of nucleotides added or removed is a multiple of three, the reading frame is preserved and only whole amino acids are gained or lost (non-frameshift insertion, deletion, or substitution). Otherwise, for example when a single nucleotide is missing from or added to the coding sequence, the result is a frame-shift mutation (frameshift insertion, deletion, or substitution), which throws off the entire reading frame downstream and changes the subsequent codons.

A nonsynonymous SNV, or missense mutation, is a change in one base pair of the DNA that results in the substitution of one amino acid for another in the protein product. A nonsense mutation is also a change in one DNA base pair, but it produces a stop-gain: a premature termination codon that signals the end of translation too early. Stop-loss, in contrast, is a mutation in the original termination codon, resulting in an abnormal extension of the protein's carboxyl terminus. Start-loss is a point mutation in the transcript's AUG codon, which serves as the initiation site for the gene product.
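
Whether an insertion or deletion shifts the reading frame depends only on whether the change in length is a multiple of three. The Python sketch below illustrates that rule. The function name and the example alleles are hypothetical, and the labels reuse the class names listed in the procedure.

```python
def classify_length_change(ref: str, alt: str) -> str:
    """Label a variant by how the alternate allele changes the sequence length."""
    diff = len(alt) - len(ref)
    if diff == 0:
        return "no length change"
    kind = "insertion" if diff > 0 else "deletion"
    # Codons are three nucleotides long, so only multiples of three keep the frame.
    frame = "nonframeshift" if diff % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


print(classify_length_change("A", "AT"))    # one added nucleotide
print(classify_length_change("ACTT", "A"))  # three removed nucleotides
print(classify_length_change("AC", "A"))    # one removed nucleotide
```

A three-nucleotide deletion such as the second example removes one whole amino acid while leaving the rest of the protein sequence in frame.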

Figure 1. A pie chart depicting the distribution of different SNP exome variant functions on the CFTR gene in humans around the world. They include frameshift deletion, frameshift insertion, frameshift substitution, non-frameshift deletion, non-frameshift insertion, non-frameshift substitution, nonsynonymous SNV, start-loss, stop-gain, stop-loss, and synonymous SNV

Finally, a single-nucleotide variant (SNV, either nonsynonymous or synonymous) is a variation in a single nucleotide with no limitation on frequency. SNVs are not the same as single-nucleotide polymorphisms: an SNV detected in a single sample may potentially be an SNP, but this cannot be ascertained from the variation in one organism alone.

While all the types of SNPs on the CFTR gene bring about genetic variation in a human population, their effects and frequency certainly differ from each other. By using the Macintosh Operating System (MacOS) to determine the count of each unique type, this distribution can be visually represented in the form of a pie chart, enabling a clearer understanding of these SNP sites.

As shown in (Figure 1), nonsynonymous SNV remains the most common type of SNP on the CFTR gene with an overwhelming majority, while stop-loss exists as the rarest class with only one identified case. Nonetheless, understanding the distribution of the exome variants of SNPs can allow scientists to pinpoint the types of CF that are more common as well as the causes behind them.

O. Distribution of Genome Variant Functions

This research project focused on the exome summary produced by wANNOVAR instead of covering every single detail of the CFTR gene. However, as shown in Figure 2, the distribution of genomic variant functions was included to give a general overview of the information that was given in the genome summary. The most common form of genomic variant on the CFTR gene is intronic, where the variant overlaps with an intron.

Figure 2. A pie chart depicting the distribution of different SNP genomic variant functions on the CFTR gene in humans around the world. They include intronic, exonic, intergenic, UTR3, UTR5, downstream, upstream, exonic splicing, and splicing

It is closely followed by exonic, where the variant overlaps a coding region; UTR3, where it overlaps a 3' untranslated region; upstream, where it overlaps the 1-kb region upstream of the transcription start site; and intergenic, where the variant is within the intergenic region. The rest of the genomic variants are less common in the world.

P. Structure of the CFTR Gene and Protein

Figure 3. The structure and location of the CFTR gene on Chromosome 7, shown through the use of the UCSC Genome Browser (Human Feb. 2009; GRCh37/hg19 Assembly)

The CFTR (Cystic Fibrosis Transmembrane Conductance Regulator) gene is located on the seventh human chromosome. It provides instructions for making a protein called the cystic fibrosis transmembrane conductance regulator, which consists of 1,480 amino acids. When the gene is altered, health conditions such as Congenital Bilateral Absence of the Vas Deferens (CBAVD), Cystic Fibrosis (CF), Hereditary Pancreatitis, and others arise.

The cytogenetic location of this gene, shown in Figure 3, is q31.2, referring to the long (q) arm of the chromosome at position 31.2. Its molecular location is also depicted in the image, ranging from base pair 117,120,017 to base pair 117,308,718 (188,702 base pairs) [10].
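
The stated length follows from the two coordinates if both end positions are counted, which is the assumption made in this short Python check (the variable names are illustrative):

```python
# Coordinates of the CFTR gene on chromosome 7 as quoted in the text (GRCh37/hg19).
start = 117_120_017
end = 117_308_718

# Both the first and the last base belong to the gene, hence the + 1.
length = end - start + 1
print(f"{length:,} base pairs")
```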

The CFTR gene codes for an ATP-binding cassette (ABC) transporter-class ion channel protein, which conducts chloride ions across epithelial cell membranes. The protein comprises two six-span units, each of which is membrane-bound and attached to a nucleotide-binding domain for adenosine triphosphate (ATP). Between these two regions there is an R-domain, consisting of several charged amino acids. This is a unique feature of the CFTR protein within the ABC superfamily [11; 12].

Figure 4. A picture depicting the shape and structure of the CFTR protein

Q. Probability of SNP Sites in Relation to World Demographics:

Table 2. - The chart that shows the probability of SNP sites in the CFTR gene in relation to the demographics of the world

The source of the region highlighted yellow is 1000 Genomes; the region highlighted pink is the ExAC Browser; the region highlighted green is the Exome Variant Server (ALL: All; AFR: African; AMR: American; EAS: East Asian; EUR: European; SAS: South Asian; FIN: Finnish; NFE: Non-Finnish European; OTH: Other)

Table 2 depicts the result of wANNOVAR's filter-based annotation. It gathers information cataloging genetic variation among different ethnicities, races, and nationalities from around the world and displays it in the above fashion. Its purpose is to establish the frequency of variants in whole-genome data. The 1000 Genomes dataset (represented by 1000G) provides whole-genome allele frequencies in six populations. The Exome Aggregation Consortium (ExAC) is a group of investigators who collect and systematize exome-sequencing data from a variety of large-scale projects. The Exome Sequencing Project (ESP) is funded by the National Heart, Lung, and Blood Institute (NHLBI) and identifies genetic variants in exonic regions from over 6,000 individuals, including healthy controls as well as those with different health conditions.

Right now, more than 10 million Americans are known carriers of one mutation of the CFTR gene, and there are a total of about 30,000 CF patients. The chance of being a carrier of one CFTR mutation, or of being afflicted with CF, which is caused by two disease-causing CFTR mutations, depends on race and ethnicity. Although it is not shown in Figure 8, the most affected group includes Caucasians of northern European ancestry (British, Scandinavians, French, certain Eastern Europeans). The disease is considerably less frequent in other ethnicities, affecting about 1 in 100,000 Asian-Americans and 1 in 17,000 African-Americans.

According to a study by Johns Hopkins University, the risk of carrying the faulty CFTR gene is 1 in 29 for Caucasians; 1 in 46 for Hispanics; 1 in 65 for African Americans; and 1 in 90 for Asians. Thus, the available information on genetic variation in different ethnic populations indicates that Caucasians remain the most affected group.
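
Because CF requires two disease-causing copies of the gene, carrier rates translate into much smaller expected incidence figures. The Python sketch below is a simplified back-of-the-envelope calculation, not a result of the study. It assumes that both parents are drawn at random from the same group, that only carriers contribute faulty copies, and that two carrier parents have a 1 in 4 chance of an affected child. The function and variable names are illustrative.

```python
from fractions import Fraction

# Carrier rates quoted in the text (people with one faulty CFTR copy).
carrier_rates = {
    "Caucasian": Fraction(1, 29),
    "Hispanic": Fraction(1, 46),
    "African American": Fraction(1, 65),
    "Asian": Fraction(1, 90),
}


def expected_incidence(carrier_rate: Fraction) -> Fraction:
    """Chance that a child inherits two faulty copies under the stated assumptions."""
    both_parents_carriers = carrier_rate * carrier_rate
    # Two carrier parents pass on two faulty copies with probability 1/4.
    return both_parents_carriers * Fraction(1, 4)


incidence = {group: expected_incidence(rate) for group, rate in carrier_rates.items()}
for group, value in incidence.items():
    print(f"{group}: about 1 in {value.denominator:,}")
```

Estimates of this kind are sensitive to the carrier rate used and need not match observed incidence in every population.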

R. Health Conditions Caused by CFTR Defects (ClinVar_DIS):

As shown in the pie chart below, more than half of the faults in the CFTR gene have no effect or are not specified. However, the health condition most clearly associated with these faults is Cystic Fibrosis. Patients with CF experience issues with their respiratory, digestive, and reproductive systems. Although it is not shown in Figure 5, most men with CF also have a congenital bilateral absence of the vas deferens (CBAVD), a condition in which the vas deferens, or the tubes that carry sperm, are clogged with mucus, effectively sterilizing most patients. Other health conditions caused by the faulty CFTR gene include hereditary and idiopathic pancreatitis, as well as sweat chloride elevation without CF.

Figure 5. Frequency of CF Occurrence (ClinVar_DIS). A pie chart that depicts the frequency of CF occurrence in SNP sites on the CFTR gene, along with a plethora of other health conditions, which include Congenital Bilateral Absence of the Vas Deferens (CBAVD), Pancreatitis (both hereditary and idiopathic), and Sweat Chloride Elevation without Cystic Fibrosis

S. SIFT Score Prediction

Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions, and each substitution can potentially affect the function of the protein. For this reason, the SIFT score was also scrutinized during the research process. It predicts whether an amino acid substitution will affect protein function, making it an invaluable source for the classification of benign and deleterious effects. The SIFT score ranges from 0.0 (deleterious) to 1.0 (tolerated). In the range from 0.0 to 0.05, variants are considered deleterious, where those with scores closer to 0.0 are more confidently predicted to be deleterious. In the range from 0.05 to 1.0, variants are predicted to be tolerated (benign), where those with scores very close to 1.0 are more confidently predicted to be tolerated.
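
The score ranges described above amount to a simple threshold rule, sketched here in Python. The function name, the constant, and the example scores are assumptions for illustration. In particular, the text does not say which side a score of exactly 0.05 falls on, and this sketch treats it as deleterious.

```python
from collections import Counter

DELETERIOUS_CUTOFF = 0.05  # assumption: a score of exactly 0.05 counts as deleterious


def sift_prediction(score: float) -> str:
    """Return 'D' (deleterious) or 'T' (tolerated) for a SIFT score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT scores lie between 0.0 and 1.0, got {score}")
    return "D" if score <= DELETERIOUS_CUTOFF else "T"


# Made-up scores, used only to show how a tally for a pie chart could be built.
example_scores = [0.0, 0.004, 0.017, 0.05, 0.2, 1.0]
tally = Counter(sift_prediction(s) for s in example_scores)
print(tally)
```

Tallying the labels in this way gives the counts that a chart of score ranges would need.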

Table 3. - A chart showing the SIFT score associated with the given health condition, revealing whether or not it is pathogenic

ClinVar_SIG                      ClinVar_DIS                                SIFT score  SIFT converted rankscore  SIFT pred
Pathogenic                       Cystic fibrosis                            0           0.912                     D
not provided, Likely pathogenic  Cystic fibrosis, Cystic fibrosis           0           0.912                     D
not provided, Likely pathogenic  Cystic fibrosis, Cystic fibrosis           0           0.912                     D
not provided, not provided       Cystic fibrosis, Cystic fibrosis           0           0.912                     D
not provided, not provided       Cystic fibrosis, Cystic fibrosis           0           0.912                     D
not provided                     Cystic fibrosis
                                                                            0.017       0.512                     D
Pathogenic                       Cystic fibrosis
Pathogenic                       Cystic fibrosis                            0.004       0.654                     D
other                            Cystic fibrosis                            0           0.912                     D
                                                                            0.019       0.501                     D
Pathogenic | Pathogenic          Cystic fibrosis | Hereditary pancreatitis  0.005       0.632                     D

Figure 6. A pie chart showing the distribution of different SIFT score ranges associated with the CFTR gene

Referring to Table 3, SNP sites whose SIFT scores are closer to 0.0 are more likely to be considered "pathogenic" and are associated with Cystic Fibrosis, CBAVD, and pancreatitis, while those with SIFT scores closer to 1.0 tend to be labeled "not provided." According to Figure 6, the majority of records in the dataset have a SIFT score between 0 and 0.05, meaning that they are considered pathogenic or deleterious. Summarizing information in this way allows each fault in the CFTR gene to be accounted for, which supports more reliable predictions.

T. CFTR Gene and Human Cancer

Figure 7. Types of Cancers Associated with SNPs in the CFTR Gene. A pie chart showing the types of cancers associated with SNP sites in the CFTR gene, including those that cause multiple human cancers (two or more). "No Symptoms" was not included in the chart above

Figure 8. CNV Distribution (Left)

Although defective CFTR is commonly associated with CF, a common genetic disorder in the Caucasian population, there is accumulating evidence that suggests a role for CFTR faults in various cancers, particularly gastroenterological cancers such as pancreatic cancer and colon cancer (of the large intestine). Figure 7 (above) does not include "No Symptoms" because that category forms the overwhelming majority. Among the human cancers associated with the CFTR gene, skin cancer stands out, while the rest are similarly frequent.

Figure 9. Cancer Distribution (Right)

Figures 8 (Left) and 9 (Right): Two bar graphs created with data from the National Cancer Institute GDC Data Portal. (Left) CNV (Copy Number Variation) Distribution (OV: Ovarian; UCS: Uterine; ESCA: Esophageal; STAD: Stomach Adenocarcinoma; LUSC: Lung Squamous Cell Carcinoma; HNSC: Head-Neck Squamous Cell Carcinoma; SKCM: Skin Cutaneous Melanoma; LUAD: Lung Adenocarcinoma; ACC: Adrenocortical Carcinoma; SARC: Sarcoma; CHOL: Cholangiocarcinoma; UCEC: Uterine Corpus Endometrial Carcinoma; BRCA: Breast Invasive Carcinoma; LIHC: Liver Hepatocellular Carcinoma; TGCT: Testicular Germ Cell Tumors; BLCA: Bladder Urothelial Carcinoma; CESC: Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma; DLBC: Lymphoid Neoplasm Diffuse Large B-cell Lymphoma; PAAD: Pancreatic Adenocarcinoma; LGG: Low Grade Glioma). (Right) Cancer Distribution (SKCM: Skin Cutaneous Melanoma; UCEC: Uterine Corpus Endometrial Carcinoma; COAD: Colon Adenocarcinoma; LUSC: Lung Squamous Cell Carcinoma; STAD: Stomach Adenocarcinoma; BLCA: Bladder Urothelial Carcinoma; ACC: Adrenocortical Carcinoma; HNSC: Head-Neck Squamous Cell Carcinoma; READ: Rectum Adenocarcinoma; CESC: Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma; LUAD: Lung Adenocarcinoma; ESCA: Esophageal; BRCA: Breast Invasive Carcinoma; GBM: Glioblastoma; KIRC: Kidney Renal Clear Cell Carcinoma; UCS: Uterine; SARC: Sarcoma; LIHC: Liver Hepatocellular Carcinoma; PAAD: Pancreatic Adenocarcinoma).

Copy number variation (CNV), defined as large-scale gains and losses of DNA fragments, forms another one of the major classes of genetic variation. After the Human Genome Project, it became clear that the human genome goes through gains and losses of DNA material. The extent to which CNVs contribute to certain human afflictions remains unknown. However, it is an established fact that certain cancers are associated with heightened copy numbers of specific genes. According to Figure 8, ovarian cancer, uterine cancer, and esophageal cancer account for 25.81%, 25%, and 20.65% of the total CNV distribution, respectively. Figure 9 shows the distribution of human cancer types associated with SNPs in the CFTR gene. It reveals that Skin Cutaneous Melanoma (SKCM) is the most prevalent form of cancer, covering 21.75% of affected cases.

Figure 10. Transcript of the CFTR Protein provided by the National Cancer Institute GDC Data Portal showing the number of cases for each type of mutation.

Figure 10 shows the positions along the CFTR protein transcript at which cancer-associated mutations occur, along with the number of cases and the type of mutation at each position. According to the protein transcript, chr7:g.117642527C>T appears most often. It is a synonymous single-base substitution, making up approximately 0.62% of affected cases in CFTR (7 affected cases across the GDC). Other common variants include chr7:g.117603609C>T (0.44%), chr7:g.117530985G>A (0.35%), and chr7:g.117590426G>T (0.35%), all of which are substitution mutations. By accumulating and analyzing data on the specific type, effect, and frequency of somatic mutations, common causes can be identified, supporting more personalized and thorough health treatment.
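The notation and percentages above can be unpacked with a short Python sketch. It is only an illustration, not the workflow used in this study: the function names (`parse_snv`, `substitution_class`, `implied_cohort_size`) are hypothetical, the regular expression handles only simple single-base genomic substitutions of the form shown here, and the cohort size is merely back-calculated from the rounded figures quoted in the text (7 cases at 0.62%), so it should be read as approximate.

```python
import re

# Variants and percentages exactly as quoted in the paragraph above.
REPORTED = {
    "chr7:g.117642527C>T": 0.62,
    "chr7:g.117603609C>T": 0.44,
    "chr7:g.117530985G>A": 0.35,
    "chr7:g.117590426G>T": 0.35,
}

# Matches only simple single-base substitutions, e.g. chr7:g.117642527C>T.
HGVS_SNV = re.compile(r"^chr(\w+):g\.(\d+)([ACGT])>([ACGT])$")
PURINES = {"A", "G"}


def parse_snv(hgvs):
    """Split a genomic substitution into (chromosome, position, ref, alt)."""
    match = HGVS_SNV.match(hgvs.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a simple substitution: {hgvs!r}")
    chrom, pos, ref, alt = match.groups()
    return chrom, int(pos), ref, alt


def substitution_class(ref, alt):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; else transversion."""
    same_family = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_family else "transversion"


def implied_cohort_size(cases, percent):
    """Approximate denominator implied by a case count and a rounded percentage."""
    return round(cases / (percent / 100))


if __name__ == "__main__":
    cohort = implied_cohort_size(7, 0.62)
    print(f"implied number of cases in the denominator: about {cohort}")
    for variant, percent in REPORTED.items():
        chrom, pos, ref, alt = parse_snv(variant)
        kind = substitution_class(ref, alt)
        est_cases = round(percent / 100 * cohort)  # estimate, not a reported count
        print(f"chr{chrom}:{pos} {ref}>{alt}: {kind}, {percent}% (~{est_cases} cases)")
```

Under these assumptions, three of the four listed variants (C>T, C>T, and G>A) are transitions, and G>T is a transversion. The per-variant case counts printed for the last three variants are estimates derived from the percentages, not values reported by the GDC.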

Discussion

Ever since the discovery of the CFTR gene more than thirty years ago, the scientific and medical communities have been trying to find ways to alter and ultimately correct the mutations in the gene, specifically those that cause CF. Although progress was noticeably slower in the beginning, scientific breakthroughs in the past decade have accelerated advances in gene therapy, CFTR modulator therapies, and other treatments. Through integrated analysis such as this research project, the medical community can more accurately classify the specific instances, functional types, and detrimental effects of different SNPs. This will potentially aid in the development of treatments and therapies for disorders, cancers, and diseases associated with the gene.

Gene therapy is the process by which a correct version of the CFTR gene is placed in a person's cells. Although the faulty copies of the gene remain, the correct copy allows the cells to produce functioning CFTR proteins.

There are three main types of gene therapy: integrating, non-integrating, and RNA therapy. First, in integrating gene therapy, a portion of DNA with the correct version of the CFTR gene is delivered to a patient's cells and then remains within their genome.

Similarly, in non-integrating gene therapy, a correct version of the CFTR gene is delivered to a patient's cells. However, unlike in integrating gene therapy, this DNA remains separate from the person's genome. The cell can still use the new copy to make normal CFTR proteins.

Both of the therapies mentioned above involve the "donation" of DNA copies with the correct CFTR gene to a patient's cell, which allows the cell to make its own RNA copies through transcription. Recently, however, another approach has been advancing to the forefront of this field: giving the cell these RNA copies directly. This is known as RNA therapy.

Besides gene therapy, other possible paths for the treatment of CFTR-related diseases include CFTR modulator therapies. They are designed to correct the faulty protein produced by the CFTR gene. Because different mutations cause different faults in the resulting protein, these treatments work only for patients affected by specific mutations.

There are four CFTR modulators for people with specific mutations: ivacaftor (Kalydeco®), lumacaftor/ivacaftor (Orkambi®), tezacaftor/ivacaftor (Symdeko®), and elexacaftor/tezacaftor/ivacaftor (Trikafta™). With the rise of easily accessible datasets, user-friendly software programs, and a new generation of aspiring researchers, work such as this research project may help bring more CFTR modulators into circulation to address the underlying cause of the disease in people with other CF mutations.

Declaration

I hereby declare that the paper submitted is the research work and research results obtained under the guidance of my instructor/supervisor Dr. Ping-zhang Wang of Ivy Mind Analytics. As far as I am aware, the paper does not contain research results that have been published or written by others, except for the content specifically listed in the references and the acknowledgments. If there is anything wrong, I am willing to bear all related responsibilities.

Conclusion

Understanding, analyzing, and pinpointing SNP sites, along with their effects, types, and influence on the CFTR gene and protein, can show how they are correlated with diseases and disorders such as CF, CBAVD, pancreatitis, and cancer. Doing so can open up more possibilities for prediction, prevention, diagnosis, and treatment in the medical community. Such possibilities would be more specific to the needs of each individual, improving the accuracy and predictability of related health treatments.

Acknowledgment

I want to thank Dr. Pingzhang Wang for offering me his guidance and expertise while I worked on the S. T. Yau Biology Award. His classes in biostatistics and genetic biology have inspired the topic of my project. I would also like to thank Ms. Betty Wang for the opportunity to learn about this fascinating field through Dr. Wang's online classes.

