Metabolism goes GWAS: combining new GWAS data with existing information about metabolism in the human body to make a statement for the future

Bauml J.

tested in an individual, usually for single-nucleotide-polymorphisms (SNP) in the DNA [7, 8].

KORA (S4/F3/F4)

One of the best-known studies performed in Germany is the KORA study (formerly MONICA), conducted by the Helmholtz Zentrum München. It is a regional, population-based study in and around the city of Augsburg, Bavaria, Germany, and was set up in 1985. Data were acquired through medical examinations, such as taking blood samples and determining the BMI (body mass index) and blood pressure. In addition, interviews and questionnaires were used to record lifestyle habits, activities, medication, smoking and alcohol consumption. In the first pilot study, participants were between 25 and 75 years of age. The response rate was 68% in the baseline survey S4 and 80% in its follow-up F4, compared with 75% in the baseline survey S3 and 76% in its follow-up F3. The overall aim of the KORA study is to provide new approaches to the diagnosis, prevention and treatment of chronic diseases.

Metabolomics

In general, metabolomics is the study of metabolites at a global level or, more specifically, the study of small-molecule metabolite profiles in the human body. Here, small molecules are defined as organic compounds of low molecular weight (<800 Da) that bind with high affinity to a biopolymer such as a protein. Metabolite profiles depend on several factors, for instance genetic background, physiological status, environmental factors and lifestyle. Metabolites are the intermediates and end products of metabolism, which in turn is the set of chemical reactions that take place in living organisms to keep them alive. These reactions are organized into metabolic pathways, in which one chemical is transformed into another through a series of consecutive steps.

Diabetes

Diabetes mellitus is a complex disease that can be divided into two main subgroups: diabetes mellitus type I, the insulin-dependent type, and diabetes mellitus type II, the insulin-independent type. In the following, type II diabetes is considered in more detail. Type II diabetes shows stronger familial clustering than type I diabetes, but it also depends strongly on other factors such as lifestyle and environment. A major risk factor for diabetes is obesity, which often runs in families because of similar eating and lifestyle habits. Genetic variants, both oligogenic and polygenic, can also raise the risk of diabetes, especially if more than one factor occurs simultaneously [11].

In the following, four papers published in different scientific journals within the last three years are briefly summarized.

The first paper [1], “A genome-wide perspective of genetic variation in human metabolism” by Thomas Illig et al., published in Nature Genetics in 2010, set out to identify and describe genetic variants in metabolism-related genes that lead to specific and clearly distinguishable metabolic phenotypes.

The second paper [4], “Metabolic footprint of diabetes: A multiplatform metabolomics study in an epidemiological setting” by Karsten Suhre et al., published in PLoS One in 2010, aimed to identify a series of metabolites that are associated with diabetes. Moreover, the authors suggest metabolic markers for detecting diabetes-related complications under sub-clinical conditions in the general population. The overall aim is thus to identify markers that could help to find persons at risk of diabetes before symptoms develop, and to find a way of providing personalized medication and treatment. For detection and quantification, three different platforms were used, which together analyzed a total of 482 metabolite concentrations; the phenotypes were not known to the corresponding laboratories. In all, 423 unique metabolites were quantified on at least one platform.

The third paper [3], “Genetic determinants of circulating sphingolipid concentrations in European populations” by Andrew A. Hicks et al., published in PLoS Genetics in 2009, addresses the function of sphingolipids in the human body. Sphingolipids are components of plasma membranes and endosomes and play an important role in, for example, protein and lipid transport, cell surface protection and cellular signalling cascades. The aim is therefore to find common genetic variants that influence the balance of individual sphingolipid concentrations and to understand the contribution of sphingolipids to common diseases. Their function and the differences between them were analyzed within and across five different populations.

In the fourth paper [2], “Genetics meets metabolomics: A genome-wide association study of metabolite profiles in human serum” by Christian Gieger et al., published in PLoS Genetics in 2008, the authors wanted to show that genetic variants associated with changes in the homeostasis of key lipids, carbohydrates or amino acids can provide information about the biochemical context of these variations, especially if enzyme-coding genes are involved.

METHODS

Participants

In the first paper [1], the study population derives from the KORA F4 study, whose participants were selected from the KORA S4 survey, an independent population-based sample. First, 1809 participants were analyzed, followed by replication in 422 participants of the TwinsUK cohort. The participants were aged between 32 and 81 years.

In the second paper [4], the study population was limited to males above 54 years of age. In total, 100 individuals were analyzed: 40 self-reported cases of type II diabetes and 60 healthy age-matched controls from the KORA F3 population.

In the third paper [3], the study population derives from five different projects in five different countries. The first project is the family-based ERF study from The Netherlands, which includes 3000 participants. All of them filled out the questionnaire on risk factors, but only 800 participants were finally included in the lipidomics study. The second project is the MICROS study, which examined people in three isolated villages in South Tyrol, Italy, in the German-speaking region near Austria and Switzerland, between 2001 and 2003. After blood analysis, completion of the standardized questionnaire and data cleaning, 1334 participants were available for the lipidomics study. The third project comes from Sweden, where people from the north of the country were selected because immigration into this part of the country has been rare. The fourth study is the Orkney Complex Disease Study, an ongoing family-based, cross-sectional study in the isolated Scottish archipelago of Orkney. The fifth study is the Vis study from Croatia, in which 986 unselected participants between 18 and 93 years of age from a little village called Vis were selected and analyzed between 2003 and 2004. Finally, after blood and gene analysis, data cleaning and mathematical correction, a total of 4110 participants (South Tyrol = 1097, Sweden = 656, Orkney = 719, Croatia = 720, ERF = 918) were analyzed in the genetic study of Hicks et al.

In the fourth paper [2], a population-based sample was drawn from the KORA S3 survey, and 1644 participants between 25 and 74 years of age were followed up in the KORA F3 study. Finally, out of the 1644 followed-up participants genotyped in the KORA F3 500K study population, 284 males between 55 and 79 years of age were randomly selected for metabolic characterization.

Statistical methods

In the first paper [1], only SNPs with a minor allele frequency (MAF) of at least 10% were included. An additive genetic model was used to test the association between the genotypes and the 163 metabolites as well as the metabolite concentration ratios. The case-by-case analysis was performed with SPSS, and the linear regression calculations for the GWAS were carried out in R. In addition, the significance level was determined: after Bonferroni correction based on α = 0.05, a threshold of p = 5.93 × 10⁻¹⁰ was obtained.
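The Bonferroni threshold for the first paper can be reproduced with a short calculation. The sketch below assumes that the number of tests equals the number of SNPs multiplied by the number of metabolic traits; the SNP count (517,480) and the 26,406 ratio traits are the figures quoted in the Results section, and the function name is purely illustrative.

```python
def bonferroni_threshold(alpha: float, n_snps: int, n_traits: int) -> float:
    """Per-test significance level after Bonferroni correction.

    Assumption: every SNP is tested against every trait, so the
    number of tests is n_snps * n_traits.
    """
    return alpha / (n_snps * n_traits)


# 163 metabolite concentrations tested against 517,480 SNPs
single = bonferroni_threshold(0.05, 517_480, 163)
print(f"{single:.2e}")  # 5.93e-10

# Counting the 163 * 162 = 26,406 ratio traits together with the 163 concentrations
ratios = bonferroni_threshold(0.05, 517_480, 163 + 163 * 162)
print(f"{ratios:.2e}")  # 3.64e-12
```

The second value matches the 3.64 × 10⁻¹² reported for the ratio analysis only if the 163 single concentrations are counted alongside the 26,406 ratios; that reading is an assumption of this sketch, not something the summary states.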

In the second paper [4], R was used for the metabolome-wide analysis and SPSS for the statistical case-by-case analysis. A linear model was used to test the statistical association with the phenotype “diabetes” and the covariate “BMI” (body mass index). In addition, the effect size η² (eta squared), as used in ANOVA, was reported. Correction for multiple testing, to control false positives, was performed following Storey and Tibshirani.
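As an illustration of the effect size mentioned here, eta squared can be computed as the share of total variance that lies between groups. This is a minimal sketch for a one-way layout without the BMI covariate used in the study; the function name and the concentration values are hypothetical and not taken from the paper.

```python
from statistics import mean


def eta_squared(groups: list[list[float]]) -> float:
    """Eta squared = SS_between / SS_total for a one-way layout."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total


# Hypothetical metabolite concentrations (arbitrary units), not study data
controls = [4.8, 5.1, 5.0, 4.9, 5.2]
cases = [6.9, 7.4, 7.1, 7.6, 7.0]
print(round(eta_squared([controls, cases]), 3))
```

A value close to 1 means that group membership accounts for almost all of the variation in the metabolite; a value close to 0 means it accounts for almost none.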

In the third paper [3], the genome-wide association analysis was performed with R. To test for possible association with the age- and sex-adjusted residuals of the sphingolipid traits, a score test was used, and an additive model was used for the SNP genotypes. A threshold of 7.2 × 10⁻⁸ was chosen for the overall meta-analysis. A Bonferroni correction was not applied in this analysis, since it would lower the threshold to between 10⁻⁹ and 10⁻¹⁰; the age- and sex-corrected p-values were therefore reported separately. All reported highly significant variants are in Hardy-Weinberg equilibrium.
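A common way to check Hardy-Weinberg equilibrium is a chi-square test on genotype counts. The sketch below is a generic textbook version, not the procedure documented by Hicks et al.; the function name and the genotype counts are hypothetical.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic (1 degree of freedom) for deviation from HWE.

    Assumes both alleles are present; a monomorphic SNP would give
    expected counts of zero and cannot be tested this way.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical genotype counts, not taken from any of the studies
stat = hwe_chi_square(360, 480, 160)
print(stat < 3.84)  # 3.84 is the 5% critical value of chi-square with 1 df
```

A statistic below the critical value gives no evidence against equilibrium, which is the situation the summary describes for the reported variants.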

In the fourth paper [2], only SNPs with a MAF of at least 5% were included in the statistical analysis. In the genome-wide association study, an additive genetic model was used to describe the dependency of the metabolites on the genotypes. SPSS was used for the case-by-case analysis and R for the linear regression. After Bonferroni correction with α = 0.05, the significance threshold is p = 1.33 × 10⁻⁹. A linear regression model was used to show the association between the associated SNPs and the best hits, with the different metabolite concentrations serving as quantitative traits. After Bonferroni correction for metabolite pairs, a genome-wide significance level of p = 6.6 × 10⁻¹² was set.
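The additive genetic model used in the first and fourth papers can be sketched as a linear regression of a metabolite trait on the number of allele copies. The example below is a simplified illustration with hypothetical data and function names; the studies themselves used R and SPSS, and a real analysis would also compute a p-value for the slope and adjust for covariates.

```python
def minor_allele_frequency(dosages: list[int]) -> float:
    """MAF from genotypes coded as the number of copies (0, 1, 2) of one allele."""
    freq = sum(dosages) / (2 * len(dosages))
    return min(freq, 1.0 - freq)


def additive_fit(dosages: list[int], trait: list[float]) -> tuple[float, float]:
    """Least-squares fit of trait = intercept + beta * dosage.

    Under the additive model each extra allele copy shifts the
    expected trait value by the same amount beta.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


# Hypothetical data: six individuals, trait rises by 0.5 per allele copy
dosages = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.0, 1.5, 1.5, 2.0, 2.0]

if minor_allele_frequency(dosages) >= 0.05:  # MAF filter as in the fourth paper
    intercept, beta = additive_fit(dosages, trait)
    print(intercept, beta)
```

In a genome-wide scan this fit would be repeated for every SNP and every trait, which is why the multiple-testing thresholds described above become so small.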

Analytic methods

In the first article [1], Illig et al. genotyped the study population with the Affymetrix 6.0 GeneChip array and the TwinsUK population with the Illumina Hap317k chip. In addition, the fasting serum concentrations of 163 metabolites were determined, covering a biologically relevant panel of amino acids, sugars, acylcarnitines and phospholipids, by electrospray ionization tandem mass spectrometry (ESI-MS/MS) with the Biocrates AbsoluteIDQ targeted metabolomics technology. In the second article [4], Suhre et al., the metabolomics measurements were performed by three different providers: the Biocrates platform, the Chenomx platform and the Metabolon platform. To avoid bias, none of the three had access to the previously defined phenotypes. In the third article [3], Hicks et al., the lipids were quantified by electrospray ionization tandem mass spectrometry (ESI-MS/MS). Genotyping was performed on Illumina Infinium HumanHap300v2 (HumanHap300v1 for the Vis samples) or HumanCNV370v1 SNP bead microarrays. In the fourth article [2], Gieger et al., genotyping for KORA F3 500K was performed with the Affymetrix 500K Array Set. The metabolite measurements were made by electrospray ionization tandem mass spectrometry (ESI-MS/MS) on a quantitative metabolomics platform at Biocrates Life Sciences AG, Austria. In total, 363 different metabolites were detected and, in addition, 208 phospholipids were analyzed.

RESULTS

Driven by previous findings from Gieger et al. and Altmaier et al., all 163 metabolite concentrations and all metabolite concentration ratios were tested with a linear additive model for association with all single-nucleotide polymorphisms (SNPs), since the use of metabolite concentration ratios as proxies for enzymatic reaction rates has been shown to reduce the variance and to yield robust statistical associations [1]. After correction for testing 517,480 SNPs and 26,406 metabolic trait combinations, a significance threshold of p = 3.64 × 10⁻¹² was obtained. The aim of this kind of analysis is to identify ratios (pairs of metabolites) that are more likely to appear coupled. The study uses a two-step discovery design in the KORA F4 population, followed by a replication step in the TwinsUK population. In the KORA data, 15 loci with strong association were identified. After adding the data from the TwinsUK study and applying a Bonferroni correction, 9 of the 15 loci were replicated. Five of the remaining loci showed signals of association with similar effect-size estimates, but since their significance did not reach the threshold, these five loci should be considered unreplicated. Nevertheless, four of the five showed evidence of an association, namely CPS1, SCD, SLC22A4 and PHGDH, two of which (CPS1 and SLC22A4) are indirect replications of previous studies. These four loci are located in or near enzyme-coding or solute carrier-encoding genes for which the associated metabolic traits match the function of the proteins. For SLC16A9 and PLEKHH1, new hypotheses on the function of the gene can be drawn. In addition, three loci can be linked to clinical end points: SLC22A4 with Crohn's disease, FADS1 with hyperactivity and with cholesterol and triglyceride levels, and ACADS possibly with ethylmalonic aciduria. Some of the findings agree with those of previous GWA studies on kidney function, such as the UMOD locus, where loss of function of the corresponding gene leads to disorders. For ACADM, ACADL and ETFDH, the genetic variants identified in the study population, or variants in linkage disequilibrium with them, are suspected of producing more moderate phenotypes.

In the paper of Suhre et al. [4], a total of 482 distinct metabolite concentration values (423 unique) were analyzed on three different platforms, including 9 metabolites measured on all three. First, the differences between cases (diabetes) and controls were analyzed; second, the metabolite pairs that showed a stronger association when ratios were used were reported. The strongest positive associations with diabetes were found in several sugar metabolism pathways. Moreover, the concentrations of sugars (e.g. glucose, mannose, deoxyhexose) were significantly increased, by up to 90%, in the diabetes group compared with the healthy control group. Negative associations with diabetes were detected as well. Among the glycerophospholipids, the phosphatidylcholine PC_aa_C34:4 and the lysophosphocholine PC_a_20:4 showed a strong negative association with diabetes. However, phosphatidylethanolamines with similar lipid side-chain compositions, e.g. PE_aa_C34:2 and PE_aa_C36:2, were increased in the diabetes group. Some metabolites were detected in only a small number of people, for example salicyluric glucuronide. In addition, both pioglitazone and hydroxy-pioglitazone were detected in three diabetes patients and in none of the controls, which confirms the intake of diabetes-specific medication. Kynurenine levels were 14.6% to 21.8% higher in the diseased group than in the healthy control group.

In the paper of Hicks et al. [3], a total of 32 SNPs in five distinct loci reached genome-wide significance in the GWA study of single species and matched metabolite ratios. In addition, three chromosomal regions, 4p12, 14q23.2 and 19p13.2, reached genome-wide significance in the large cohorts. For these three loci, the strongest associations were found for sphingomyelins and dihydrosphingomyelins. Two loci, 11q12.3 and 20p12.1, were almost significant in all five populations and reached genome-wide significance in the meta-analysis. In line with other findings from the literature, the use of metabolite concentration ratios increased the power of association: in this case, 43 matched metabolite ratios increased the power by up to 10 orders of magnitude and revealed 10 additional SNPs with statistical significance. However, none of the new genes reached genome-wide significance. Among the 32 significant SNPs, some variants explain a certain percentage of the variance in the corresponding ratios. In more detail, variants in LASS4 explain up to 7.5% of the variance in the SM16:0/SM18:0 ratio, and the highest value is 12.7% of the variance in the SM14:0/SM16:0 ratio. When genes are combined, e.g. SPTLC3 and SGPP1, up to 14.2% of the variance can be explained. Three SNPs, rs10938494, rs2351791 and rs4695267, showed a genome-wide significant association with glucosylceramides. Furthermore, in the single-species analysis SNP rs10938494 showed its strongest association in South Tyrol, with p = 1.68 × 10⁻⁹, and in the joint analysis it reached p = 8.03 × 10⁻¹⁹. At the locus 19p13.2, the strongest associations with sphingolipids lie within LASS4, a gene that encodes LAG1 longevity assurance homologue 4.

To identify the most promising variants, 624 SNPs within or near 40 genes encoding enzymes involved in sphingolipid metabolism were investigated further. Of these 624 SNPs, 70 variants showed an association p-value of 10⁻⁴ or less. Three fatty-acid desaturase genes, FADS1, FADS2 and FADS3, lie next to one another at the 11q12.3 locus. In a GWAS, only the FADS1-FADS3 cluster overlaps with the joint meta-analysis of circulating serum lipoprotein levels, where the strongest association was measured with total and LDL cholesterol. Variants at the FADS1-FADS3 locus are associated with classical lipids and cardiovascular diseases and provide evidence for the role of sphingolipids in atherosclerotic plaque formation and lipotoxic cardiomyopathy. Several GWAS data sets were searched for evidence of an association between the major variants and the sphingolipid concentrations.

In the study of Gieger et al. [2], up to 363 metabolites were defined, 201 of which were obtained in more than 95% of the samples. One of the top-ranking association signals in the study was the SNP rs174548, which lies in the linkage disequilibrium block that also contains the FADS1 gene. The FADS1 gene is strongly associated with several glycerophospholipid concentrations: the SNP rs174548 explains up to 10% of the variation of certain glycerophospholipids. The FADS gene codes for the fatty acid delta-5 desaturase, and the minor allele of the corresponding SNP (MAF = 27.5%) reduces the efficiency of this enzyme. Arachidonic acid, a direct product of FADS, and its derivative lysophosphatidylcholine are significantly reduced with an increasing number of copies of the minor allele of the SNP. There are negative associations as well, e.g. for the sphingomyelin concentration, which can be interpreted as the result of a change in the homeostasis of phosphatidylcholines. Another consequence of the imbalance in glycerophospholipid metabolism is the negative association of lysophosphatidylethanolamine. The first conclusion of the paper concerning fatty acids is that the direction of the association is influenced, and can be explained, by the change in the efficiency of the fatty acid delta-5 desaturase reaction.

A further finding of this article is that the power of an association increases when metabolite concentration ratios are used. For the FADS1 polymorphism, the p-value decreased by up to fourteen orders of magnitude when metabolite concentration ratios were used. The ratio [PC aa C36:4] / [PC aa C36:3] is a strong indicator of the efficiency of the FADS1 reaction, since a reduction in the catalytic activity of FADS leads to more eicosatrienoyl-CoA and less arachidonyl-CoA, which in turn leads to increased PC aa C36:3 concentrations and decreased PC aa C36:4 concentrations. The next step in the study was to investigate the effect of variation in the FADS gene on biochemical variables related to medical outcomes. The test hypothesis was that, in large samples, the FADS polymorphism has a detectable effect on the corresponding serum parameters. This hypothesis was confirmed by a GWAS with 18,000 participants and the p-values for the association of SNP rs174548 with serum low-density lipoprotein (LDL), high-density lipoprotein (HDL) and total cholesterol. The significance ranges between p = 1.89 × 10⁻⁴ and p = 6.07 × 10⁻⁵. The same procedure was carried out for several other genetically determined metabotypes that are associated with medical phenotypes, such as LIPC, PARK and PLEK.
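One intuitive reason why ratios can sharpen an association is that variation shared by both metabolites cancels out in the ratio. The simulation below is only a toy illustration of that idea under stated assumptions (log-normal concentrations with a common multiplicative factor); all names and parameter values are hypothetical and do not come from Gieger et al.

```python
import math
import random
from statistics import variance

random.seed(0)  # fixed seed so the sketch is reproducible

log_a, log_b, log_ratio = [], [], []
for _ in range(1000):
    shared = random.gauss(0.0, 1.0)  # variation common to both metabolites
    a = math.exp(shared + random.gauss(0.0, 0.2))  # hypothetical substrate level
    b = math.exp(shared + random.gauss(0.0, 0.2))  # hypothetical product level
    log_a.append(math.log(a))
    log_b.append(math.log(b))
    log_ratio.append(math.log(a / b))

# The shared component drops out of log(a / b), so its variance is much smaller
print(variance(log_a), variance(log_b), variance(log_ratio))
```

With less unexplained variance in the trait, a genetic effect of the same size stands out more clearly, which is consistent with the gain in association strength reported for the ratios.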

CONCLUSION

The four papers show that combining different types of study is of great benefit to the scientific community. They connect single studies with genome-wide studies and with knowledge about metabolic traits in order to draw new conclusions about metabolic diseases. An additional step is to analyze not only single metabolite concentrations but also metabolite concentration ratios. Such ratios will become even more important in the future because they make it possible to identify interactions between metabolites and to see whether two metabolites tend to occur together and are therefore related to each other. Moreover, the variation in the data set can be reduced dramatically by using concentration ratios, and when the variation in a study goes down, the statistical power (1 − β) increases and the p-values of true associations decrease accordingly. The four studies follow an exemplary design, since combining GWA studies with metabolomic phenotypic traits led to successful new results that will hopefully lead to new milestones in genetics and personalized medicine. Personalized medicine is part of pharmacogenetics, a new branch of genetics that deals with the efficacy or intolerance of drugs in different people under the slogan “right person, right treatment”. The aim is to identify medical phenotype traits in order to find phenotypes that are more often associated with a medical or clinical outcome. Once such phenotypes have been defined for a specific disease, the treatment can be adjusted to the corresponding phenotype or person to make it more effective. A further goal is to reduce or even avoid side effects when the exact phenotype is known. One step towards these goals has already been taken by combining GWA studies with common knowledge about metabolites, and many other steps are needed to improve the current situation in medicine.

REFERENCES

1. Illig T, Gieger C, Zhai G et al. A genome-wide perspective of genetic variation in human metabolism // Nat Genet. - 2010. - 42(2):137-41.

2. Gieger C, Geistlinger L, Altmaier E et al. Genetics meets metabolomics: a genome-wide association study of metabolite profiles in human serum // PLoS Genet. - 2008. - 4(11):e1000282.

3. Hicks AA, Pramstaller PP, Johansson A et al. Genetic determinants of circulating sphingolipid concentrations in European populations // PLoS Genet. - 2009. - Oct;5(10):e1000672.

4. Suhre K, Meisinger C, Döring A et al. Metabolic footprint of diabetes: a multiplatform metabolomics study in an epidemiological setting // PLoS One. - 2010. - Nov 11;5(11):e13953.

5. http://www.bio.davidson.edu/people/kahales/301Genetics/timeline.html

6. http://www.dartmouth.edu/~bio70/

7. http://en.wikipedia.org/wiki/History_of_genetics

8. http://grants.nih.gov/grants/gwas/

9. http://en.wikipedia.org/wiki/Genome-wide_association_study

10. http://www.aokgesundheitspartner.de/imperia/md/gpp/bund/dmp/evaluation/konferenz_juni09/dmp_konf29_30_06_09_kora.pdf

11. http://www.diabetes.org/diabetes-basics/genetics-of-diabetes.html

UDC 61:575=111

APPLICATIONS OF NEXT GENERATION SEQUENCING TECHNOLOGY TO HUMAN DISEASE RESEARCH: AN EXPERIENCE OF BEIJING GENOMICS INSTITUTE

Li K.

Beijing Genomics Institute (BGI) Europe A/S, DK-1870 Frederiksberg, Denmark;

Beijing Genomics Institute, BGI-Shenzhen, Shenzhen, China

For decades, the Sanger method has been the dominant approach and the gold standard for DNA sequencing. The commercial launch of the first massively parallel pyrosequencing platform in 2005 ushered in a new era of high-throughput genomic analysis, now referred to as next-generation sequencing (NGS). Although the platforms differ in their engineering configurations and sequencing chemistries, they share a technical paradigm: sequencing of spatially separated, clonally amplified DNA templates or single DNA molecules is performed in a flow cell in a massively parallel manner. Through iterative cycles of polymerase-mediated nucleotide extensions or, in one approach, through successive oligonucleotide ligations, sequence outputs in the range of hundreds of megabases to gigabases are now obtained routinely. This review highlights the impact of NGS on basic research, bioinformatics considerations, and the translation of this technology into clinical diagnostics. I will introduce the NGS platforms used in studying human disease, together with illustrative cases that have been completed or are under way at the Beijing Genomics Institute (BGI).

Keywords: next-generation sequencing, genetic variation, disease susceptibility genes, Beijing Genomics Institute.

BGI (formerly known as Beijing Genomics Institute) was founded in Beijing in 1999 with the mission of supporting the development of biological science and technology, building up strong research teams, and promoting the development of commercial services. In 2007, the BGI headquarters was relocated to Shenzhen in southern China. BGI now has four main sub-institutes in mainland China and one in Hong Kong, which handles all international projects. BGI also opened BGI-Americas in San Francisco and BGI-Europe in Copenhagen in 2010.

BGI now owns 137 Illumina HiSeq 2000 instruments, regarded as a top-of-the-line tool for genomics, transcriptomics and epigenomics studies for both academic and industrial users; it also has 27 SOLiD 4 systems as well as the 3730xl DNA Analyzer from Applied Biosystems, and has newly deployed the Illumina iScan for genotyping studies. Daily production can reach 5 Tb, which is equivalent to sequencing about 50 human genomes at 30-fold coverage, given a human genome size of 3 Gb.
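The throughput figure can be checked with simple arithmetic. The short sketch below uses only the numbers stated in the text; the variable names are illustrative.

```python
daily_output_bases = 5e12  # 5 Tb per day, as stated in the text
genome_size_bases = 3e9    # 3 Gb human genome, as stated in the text
coverage = 30              # 30-fold coverage

genomes_per_day = daily_output_bases / (genome_size_bases * coverage)
print(round(genomes_per_day, 1))  # 55.6
```

The result of roughly 55 genomes per day is consistent with the rounded figure of 50 quoted by the author.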

Given such a large volume of data production, BGI has built a supercomputing system with matching capacity for data handling. The data centre now has a total of 1000 TFLOPS of CPU power, 200 TB of memory and 10000 Petabyte storage volumes. The supercomputing facility can store and simultaneously analyze large amounts of data.

The high-performance computational facilities and bioinformatics-knowledge platform provide the hardware and software tools to support bio-computational research for a variety of genomics research studies.

Bioinformatics Software: BGI develops and uses state-of-the-art software tools for large-scale biological analysis and large-scale genome sequencing data processing. SOAP has evolved from a single alignment tool into a tool package that provides
