Posters (including abstracts)
Dr Irina Abnizova
Genome Campus, Hinxton, Cambridge, UK
New method to improve error probability estimation applied to Illumina sequencing
Around 10 years ago the Phred (Ewing & Green, 1998) base-calling and error calibration algorithm was introduced. It is a threshold-dependent binning algorithm that estimates an error probability for every base call given a number of parameters computed from the trace data, and it is almost quadratic with respect to the number of parameter/threshold combinations. As far as we know, there have been no significant attempts since then (except Brockman et al., 2008) to improve the Phred error probability calibration. The new short-read sequencing techniques introduce new technological and computational challenges, and require well-known error estimation algorithms to be reconsidered for the different sequencing platforms. We revisited the well-established Phred algorithm and found a way to make it linear with respect to the number of parameter/threshold combinations, which greatly reduces computational time and memory. We also developed our own, conceptually new error calibration algorithm, which is essentially linear and combines information from different parameters in a one-dimensional space. In contrast, Phred operates in N-dimensional space, where N is the number of parameters used. Our algorithm is therefore computationally very fast, requires little memory and, most importantly, is very stable in the sense of small variability of the error rate. Another advantage is that, in contrast to machine-learning approaches, it allows us to see which parameter is most responsible for error rate variability. The algorithm is validated on massive human and phiX data from the Illumina GA1 and GA2 sequencing platforms.
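For readers unfamiliar with calibration by binning, the basic idea can be sketched as follows: group base calls into bins by predicted quality and compare the predicted score with the observed Phred score, -10*log10(error rate), per bin. This is an illustration only, not the Phred algorithm or the new method described above; the function name and the add-one smoothing are our assumptions.

```python
import math
from collections import defaultdict

def empirical_phred(predicted_q, is_error, bin_width=1):
    """Bin base calls by predicted quality and return the observed
    Phred score (-10*log10(error rate)) for each bin."""
    counts = defaultdict(lambda: [0, 0])  # bin index -> [errors, total]
    for q, err in zip(predicted_q, is_error):
        b = int(q // bin_width)
        counts[b][0] += err
        counts[b][1] += 1
    observed = {}
    for b, (errors, total) in counts.items():
        # add-one smoothing avoids log(0) in error-free bins
        rate = (errors + 1) / (total + 1)
        observed[b * bin_width] = -10 * math.log10(rate)
    return observed
```

A well-calibrated caller would show observed scores close to the predicted bin values.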
Dr Alexander V Alekseyenko
Statistics Department, 390 Serra Mall, Sequoia Hall, Stanford University, Stanford, California 94305-4065, USA
Improving Speed and Accuracy in Stochastic Simulation via Higher Order Leaping
Stochastic simulation methods are important in modeling chemical reactions and biological and physical stochastic processes describable as continuous-time discrete-state Markov chains with a finite number of reactant species and reactions. The current algorithm of choice, tau-leaping, achieves acceptable performance in stochastic simulation by assuming little change in reaction propensities and taking large time steps that leap over many individual reactions. During a leap interval (t, t + \tau) in tau-leaping, each reaction channel operates as a Poisson process with a constant intensity. We improve the ordinary tau-leaping by allowing for linear and quadratic changes in reaction intensities. This relaxes the constant intensities assumption and enables the simulation to accurately anticipate the change in them. The resulting step anticipation tau-leaping (SAL) algorithm is more accurate and faster than the ordinary tau-leaping. We demonstrate this through a number of applications: Kendall’s process, a two-type branching process, Ehrenfest’s model of diffusion, and Michaelis-Menten enzyme kinetics.
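The flavour of the linear (first-order) correction can be conveyed on a toy example: a pure death process with propensity a(x) = mu*x, where the deterministic drift dx/dt ≈ -a(x) gives da/dt ≈ -mu*a, so the integrated intensity over a leap becomes a*tau + 0.5*(da/dt)*tau^2. This is a hedged sketch under those assumptions, not the authors' SAL implementation.

```python
import math
import random

def poisson_knuth(lam, rng):
    # Knuth's method; adequate for the small means used in a single leap
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sal_death_process(x0, mu, tau, t_end, rng):
    """Step-anticipation tau-leaping (sketch) for the death process
    X -> X-1 with propensity a(x) = mu*x. The leap mean anticipates the
    linear change in intensity via da/dt ~ -mu*a(x)."""
    x, t = x0, 0.0
    while t < t_end and x > 0:
        a = mu * x
        # first-order anticipation of the integrated propensity
        mean_events = max(a * tau - 0.5 * mu * a * tau ** 2, 0.0)
        x = max(x - poisson_knuth(mean_events, rng), 0)
        t += tau
    return x
```

Averaged over many runs, the trajectory should track the exact mean x0*exp(-mu*t).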
Mr Ryan Ames
Michael Smith Building, University of Manchester, Oxford Road, Manchester, M13 9PL
Gene duplication is a key driver of adaptation within a species
Gene duplication is an important factor in genome evolution. While comparative genomics approaches have examined duplicate genes between species, there are no comparative studies for populations of the same species. Using data from the Saccharomyces Genome Resequencing Project we were able to identify and compare duplicate genes between 39 strains of S. cerevisiae and 28 strains of S. paradoxus. Our findings show an abundance of gene duplication in all populations with variation between the individual strains. Lineage-specific duplicates are identified and demonstrate the ongoing retention of new duplicates. We also demonstrate that these duplicates have experienced selection for function and location on the chromosome. Duplicate gain and loss events are mapped to the strain phylogenies using a bespoke weighted parsimony scheme to highlight areas of large duplicate gain or loss. Finally, we show that differences in retention of duplicate genes lead to phenotypic differences.
Dr Simon Anders
European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK
HilbertVis: A tool to visualize and explore genomic data with space-filling curves
In many genomic studies, one works with genome-position-dependent data, e.g., ChIP-chip or ChIP-Seq scores. With existing tools, it is hard to explore such extremely long data vectors in a comprehensive manner. Here, I present a new tool to visualise genomic data in a space-efficient two-dimensional manner inspired by the Hilbert curve. The tool allows the user to visually judge and compare global statistical properties of the data vectors as well as shape and distribution of individual features in an easy and powerful manner. It offers interactive functions for further data exploration or detailed comparisons and can be used as a stand-alone application or within the R/Bioconductor environment. It will hopefully prove to be a valuable tool for quality control and for interpretation of any kind of high-resolution, genome-wide data.
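The heart of such a visualisation is the mapping from a 1-D genome coordinate to a 2-D position on the Hilbert curve, which keeps neighbouring positions adjacent in the plot. A minimal sketch of the standard iterative index-to-coordinate conversion (not HilbertVis's own code) could be:

```python
def d2xy(order, d):
    """Map 1-D index d to (x, y) on a Hilbert curve filling a
    2^order x 2^order grid (standard iterative construction)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # reflect the quadrant
                x = s - 1 - x
                y = s - 1 - y
            # rotate the quadrant
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

A genome-wide score vector can then be painted onto the grid cell by cell; because consecutive indices map to adjacent cells, local features stay visually compact.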
Professor Maia Angelova
Intelligent Modelling Lab, School of CEIS, Northumbria University, Pandon Building, Camden Street, Newcastle upon Tyne NE2 1XE
Generalized Nets Models of Genetic Networks
New developments in post-genomic technology provide data necessary to study regulatory processes at multiple levels of biological organisation. The advancement of biotechnological sciences relies heavily on the design of accurate models that not only are consistent with the experimentally established correlations but are also capable of predicting and extrapolating expected results with a high degree of certainty. In this work we propose a model using Generalized Nets (GN) for modelling genetic networks. GN are extensions and modifications of Petri nets. An introduction to the basic properties of GN theory, with examples of applications, will be given. The GN model proposed in this work uses the natural structure and relations of the regulatory entities and provides more flexibility for different extensions. The advantages of the model compared with Boolean network models and Petri net models are discussed. A case study investigating the nutritional stress response of E. coli is presented.
Mr Maurice Berk
Room 540, Department of Mathematics, South Kensington Campus, Imperial College London, LONDON, SW7 2AZ
On functional data analysis approaches for detecting differentially expressed genes in microarray time series experiments
Microarray time series experiments are very short, typically consisting of no more than 10 time points, while the data is both very highly dimensional and noisy. We handle the paucity of data by framing the problem as functional data analysis: we assume our observed gene expression values have arisen from an underlying smooth function of time which we parameterize using splines - piecewise polynomials. Care must be taken with inference as the high dimensionality of the data means we must take the multiple testing problem into account. In this talk we address the various issues involved in the use of splines, including basis selection and the location and number of knots, and we review the existing models and software for practical data analysis. We highlight the shortcomings of these models and show how our methods deal with them. We then illustrate their application to an example data set investigating Mycobacterium tuberculosis infection and demonstrate our web-based interface for exploring these results.
Ms Juok Cho
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Effect size estimation meta-analysis of multiple microarray studies and its application to the curated corpus of data in ArrayExpress
We present a novel method for integrative meta-analysis of multiple microarray datasets. The method is an extension of the interstudy variation modelling framework proposed by Choi et al. (2003). We generalize the computation of effect size, a standardized index of gene expression, to multifactorial experiments with multiple treatments per factor, taking into account unstable sample variances at small sample sizes by shrinking them via empirical Bayesian moderation (Smyth et al., 2004). Effect size estimates from multiple datasets are combined into robust estimates of transcriptomic activity using mixed effects models. Homogeneity tests were used to choose appropriate models of interstudy variance. The developed method was applied to data in the ArrayExpress Atlas of Gene Expression (Parkinson et al., 2008), computing standardized expression levels on datasets from over 800 studies, comprising over 20,000 assays. For every biological condition curated in the ArrayExpress Atlas (these cover a variety of tissues, cell types, disease phenotypes, as well as drug response studies), we calculated a meta-analytical Z-score and identified significant genes, using permutation tests to estimate the FDR. A discussion of the initial results of using the combined effect size estimates for data analysis is presented, as well as a new R/Bioconductor package implementing the developed work.
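For orientation, the simplest form of effect-size combination is the fixed-effects inverse-variance average, with Cochran's Q as the homogeneity statistic that motivates switching to a random- or mixed-effects model. This sketch is not the moderated, mixed-effects model described above, and the function name is ours.

```python
import math

def combine_effect_sizes(effects, variances):
    """Fixed-effects inverse-variance meta-analysis of per-study effect
    sizes. Returns the pooled effect, its variance, a meta-analytical
    Z-score, and Cochran's Q homogeneity statistic."""
    weights = [1.0 / v for v in variances]
    wsum = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / wsum
    var_pooled = 1.0 / wsum
    z = pooled / math.sqrt(var_pooled)
    # large Q relative to (k - 1) d.o.f. suggests interstudy heterogeneity
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return pooled, var_pooled, z, q
```

When Q indicates heterogeneity, an interstudy variance component is added, which is the step the mixed-effects framework formalizes.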
Dr Lachlan James Coin
Department of Epidemiology and Public Health, Norfolk Place, Imperial College, London W2 1PG
High resolution multi-platform copy number variation genotyping and imputation using a haplotype hidden Markov model
There is a need to develop algorithms for accurate copy number variation (CNV) genotyping at directly measured loci, and imputation at unmeasured loci, to support genome-wide meta-analyses for identification of disease-associated CNVs where cohorts have been assayed on multiple platforms. We have developed `cnvHap', a tool for genotyping CNVs using multiple SNP genotyping and aCGH platforms either singly or jointly. Unlike existing algorithms, cnvHap jointly models SNP and CNV variation at the haplotype level, thus using LD information to improve sensitivity. We evaluated this algorithm using 50 individuals assayed on the Illumina Human1M BeadArray and a 244k Agilent CGH array, 35 of whom were also assayed on the Illumina Hap300 BeadArray and a 185k Agilent CGH array. To facilitate comparison at the probe level, as well as future CNV meta-analyses, we developed an imputation procedure for mapping copy number calls from one platform to unmeasured loci on another, which also estimates imputation uncertainty.
Ms Kathryn Cooper
Department of Bioengineering, Imperial College London, South Kensington Campus, London, SW7 2AZ
Role-Similarity Clustering in Directed Networks
We consider the problem of clustering on directed networks with an alternative approach. In contrast to traditional clustering, which is based upon density of connections, we choose to group vertices that play a similar role within the network. The method reveals interesting insights on data from ecology, world trade and metabolic networks. Most network research to date has focused on simple, unweighted, undirected graphs, yet including edge direction is crucial to understanding the structure in many cases. We speculate that the function of a node will be reflected in its pattern of connections, and that vertices with the same function may not in general be close in a network sense. The identification of functionally similar nodes has implications such as the ability to assign a function to uncharacterized nodes, or to simplify a network into a coarse-grained structure. We draw examples from world trade, ecology and metabolic networks as proof of application potential. Our method detects plausible trophic levels within example food webs. Initial results from world trade networks are consistent with previous core-periphery structure theories from the social sciences. Additionally, we shed light on a possible core-periphery structure of metabolic networks across species.
Ms Wei Dai
8th Floor, Cyclotron Building, Hammersmith Hospital, Du Cane Road, London W12 0NN
Implementation of methylation linear discriminant analysis (MLDA) on CpG Island microarray data of ovarian cancer
Differential Methylation Hybridisation (DMH) is used for analysing genomic DNA methylation. To account for the specific biological features of DNA methylation and the non-symmetrical distribution of DMH data, we have developed an algorithm, named Methylation Linear Discriminant Analysis (MLDA), to identify differential methylation based on linear regression models using non-normalised DMH data (Dai et al., 2008). We designed a focused oligonucleotide microarray covering 596 CpG islands, with on average 24 oligonucleotide probes per island, chosen based on previous studies implicating them as prognostic DNA methylation markers in ovarian cancer. Analysis of methylation of ovarian cell line DNA by DMH showed good reproducibility and allowed further optimisation of the methodology. MLDA was applied to these data and found 101 and 26 CpG islands differentially methylated between cisplatin-sensitive and resistant ovarian cancer cell lines generated in vitro and in vivo, respectively. Of 14 loci identified in a previous large-scale study (Dai et al., 2008), 13 were independently identified by MLDA in the current study. Analysis of the data showed high sensitivity of MLDA and reproducibility of DMH using these arrays. Currently we are conducting an analysis of ovarian tumour DNA to further evaluate these potential prognostic biomarkers.
Dr Ronan Daly
Room 320, Sir Alwyn Williams Building, Department of Computing Science, University of Glasgow, Glasgow G12 8RZ
A Probabilistic Analysis of Mispriming in PCR
Polymerase chain reaction (PCR) is a technique used to amplify specific regions of DNA and has become a common method in diverse biomolecular applications. Although the primers used in PCR are designed to amplify specific sections of a DNA sequence, it has been shown that amplification of other sections can occur if a close match to the primers occurs around those sections. In most cases this 'mispriming' simply causes the main product to be created with reduced efficiency and is not of overriding concern. Recently, however, techniques such as quantitative real-time PCR (qRT-PCR) have been developed that can measure the amount of product produced from a PCR reaction. Mispriming in this context could cause a difference in the amount of product produced and hence measured. The work introduced here presents a probabilistic model of mispriming and seeks to characterise the effect of parameters such as primer length and the source of the DNA to be amplified on this model. Experiments have been conducted matching random primers to known DNA databases using the given model. Preliminary results show primer length to be the main factor in mispriming, both in terms of the amount and the variability of mispriming that occurs.
Dr Simon J Furney
Institute of Psychiatry, King's College London, De Crespigny Park, SE5 8AF, London
Genome-wide analysis of neurodegeneration in Alzheimer's disease
A total of over two hundred controls, patients with mild cognitive impairment and patients with Alzheimer’s disease from centres in six different European locations were genotyped on Illumina 610K human SNP arrays. In addition, T1-weighted magnetic resonance imaging scans were taken for all subjects. These scans were processed to produce normalised whole brain and total hippocampal volumes for each subject. These two measures were used as quantitative traits in analyses combining the genome-wide SNP data with the neuroimaging data. The neuroimaging traits were corrected for age, sex, population stratification, and ApoE4 dosage and were applied in a regression analysis allowing for an additive SNP allele effect on the quantitative trait. To control for the risk of false positives, empirical p-values for association significance were calculated using permutation testing. These analyses produce panels of SNPs that could be implicated in Alzheimer’s disease-mediated neurodegeneration.
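The permutation step can be illustrated in miniature: permute the trait values to break any genotype-trait link, recompute the association statistic, and count how often the permuted statistic matches or exceeds the observed one. In this sketch a toy score (absolute covariance with allele dosage) stands in for the regression statistic used in the study; the function name and the add-one correction are our assumptions.

```python
import random

def permutation_pvalue(trait, genotype, n_perm=999, rng=None):
    """Empirical p-value for an additive allele effect, scored here by
    the absolute covariance between trait and allele dosage (0/1/2).
    Trait labels are permuted to build the null distribution."""
    rng = rng or random.Random(0)

    def score(t, g):
        n = len(t)
        mt = sum(t) / n
        mg = sum(g) / n
        return abs(sum((a - mt) * (b - mg) for a, b in zip(t, g)) / n)

    observed = score(trait, genotype)
    hits = 0
    t = list(trait)
    for _ in range(n_perm):
        rng.shuffle(t)
        if score(t, genotype) >= observed:
            hits += 1
    # add-one correction keeps the estimate away from exactly zero
    return (hits + 1) / (n_perm + 1)
```

The same counting logic extends to whole panels of SNPs, where the permutation distribution also controls for multiple testing.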
Professor Erol Gelenbe
Dept of EEE, Intelligent Systems and Networks Group, Imperial College London, SW7 2BT
Solving gene regulatory networks from master equations
Dynamical models of gene regulatory networks using Boolean networks or mass equations have been well studied. In this paper we show how gene regulatory networks can be studied from first principles using stochastic master equations, and how these master equations can then be solved to obtain the "higher level" mass equation and Boolean network representations. Examples will also illustrate this approach.
Dr Colin Gillespie
School of Mathematics & Statistics, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK
An analysis of the moment closure approximation
Stochastic population models have proved to be a powerful tool in the modelling of complex biological phenomena. However, all too often the associated mathematical development involves non-linear mathematics, which immediately raises difficult and challenging analytical and computational problems that need to be solved if useful progress is to be made. One approximation that is often employed to estimate the moments of a stochastic process is moment closure. This approximation essentially truncates the moment equations of the stochastic process by assuming an underlying distribution, such as the lognormal or Gaussian. In this talk we explore how moment closure techniques can be utilised in the systems biology context. By developing general expressions for the marginal and joint-moment equations, we show that the moment equations can be quickly recovered for large stochastic models. We will examine the role that the assumed closure distribution has on the moment estimates, and also where the approximation breaks down. Finally, we will illustrate how moment closure theory can be leveraged for parameter estimation, in particular where only a small subset of species is observed at discrete time points. Reference: Gillespie, C.S. Moment-closure approximations for mass-action models. IET Systems Biology, 2009.
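As a concrete instance of closing the moment equations, assuming a lognormal distribution lets the third moment be written in terms of the first two: matching a lognormal with E[X] = m1 and E[X^2] = m2 gives mu = 2 ln m1 - 0.5 ln m2 and sigma^2 = ln m2 - 2 ln m1, hence E[X^3] = exp(3 mu + 4.5 sigma^2) = m2^3 / m1^3. This worked example is our illustration of the general idea, not code from the referenced paper.

```python
import math

def lognormal_third_moment(m1, m2):
    """Lognormal moment closure: express E[X^3] using only E[X] = m1
    and E[X^2] = m2 via the lognormal moment formula
    E[X^n] = exp(n*mu + n^2 * sigma^2 / 2)."""
    mu = 2 * math.log(m1) - 0.5 * math.log(m2)
    s2 = math.log(m2) - 2 * math.log(m1)
    return math.exp(3 * mu + 4.5 * s2)  # simplifies to m2**3 / m1**3
```

Substituting this expression for E[X^3] into the moment equations removes their dependence on higher moments, which is exactly the truncation the abstract describes.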
Drs Marco Grzegorczyk and Dirk Husmeier
JCMB, The King's Buildings, Edinburgh , EH9 3JZ
A non-homogeneous dynamical Bayesian network for modelling non-stationary gene regulatory processes
Dynamical Bayesian networks have been extensively applied to the reconstruction of gene regulatory networks from gene expression time series. However, the standard approach is based on a homogeneous Markov chain, which fails to allow for changes in the regulatory processes with time. Moreover, the standard Bayesian score for network structures is either based on discretized data or on a linear model for continuous data. The objective of our poster is to discuss a non-linear, non-homogeneous generalization of this score. The method is based on a change-point process and a mixture model, using latent variables to assign individual measurements to different components. The practical inference follows the Bayesian paradigm and samples the network structure, the number of components and the assignment of latent variables from the posterior distribution with MCMC, using a variation of the recently proposed allocation sampler. We demonstrate that when applying this scheme to gene expression time series from Arabidopsis thaliana and infected macrophages, the Bayesian model selection is consistent with dichotomies intrinsic to the studied systems.
Dr Keith James Harris
Room 320, Sir Alwyn Williams Building, 18 Lilybank Gardens, Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK.
Definition of Valid Proteomic Biomarkers: Bayesian Solutions to a Currently Unmet Challenge.
Clinical proteomics is suffering from the high hopes generated by reports of apparent biomarkers, most of which could not later be substantiated by validation. This has brought into focus the need for improved methods of finding a panel of clearly defined biomarkers. To examine this problem, urinary proteome data were collected from healthy adult males and females and analysed to find biomarkers that differentiated between genders. We believe that models that incorporate sparsity in terms of variables are desirable for biomarker selection, as proteomics data typically contain a huge number of variables (peptides) and few samples, making the selection process potentially unstable. This suggested the application of the two-level hierarchical Bayesian probit regression model that Bae and Mallick (2004) proposed for variable selection, which uses three different priors for the variance of the regression coefficients (inverse Gamma, exponential and Jeffreys) to incorporate different levels of sparsity. We have also developed an alternative method for biomarker selection that combines model-based clustering and sparse binary classification. By averaging the features within the clusters obtained from model-based clustering, we define "superfeatures" and use them to build a sparse probit regression model, thereby selecting clusters of similarly behaving peptides and aiding interpretation.
Dr Andrew Harrison
Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, Essex, CO4 3SQ
On the causes of correlations seen in Affymetrix GeneChips
GeneChips are a powerful, and popular, technology for analysing gene expression across the whole genome. Their popularity has led to more than ten thousand articles in refereed journals. Much of this data is in the public domain, and is now available for meta-analysis. We are mining this data, in order to discover novel biological signals. We have identified a number of biases in the data, and have been able to identify a number of gene markers that cannot be trusted.
Mr Ulrich Dieter Kadolsky
Imperial College London, Department of Immunology, Medical School, St Mary’s Campus, Norfolk Place, London, W2 1PG
Quantifying the Impact of HIV-1 Escape from the Cytotoxic T-cell Response
HIV-1 escape from the cytotoxic T-lymphocyte (CTL) response leads to a weakening of viral control and is likely to be detrimental to the patient. Although HIV escape is of considerable interest and the subject of much research, the impact of escape on viral load and CD4+ T cell count has not been quantified, primarily because of sparse longitudinal data and the difficulty of separating cause and effect in cross-sectional studies. We use two independent methods to quantify the impact of HIV-1 escape from CTLs in chronic infection: mathematical modelling of escape and analysis of a cross-sectional cohort of 157 clade C infected individuals using multiple linear regression to adjust for confounding effects. Mathematical modelling revealed a modest increase in viral load of 0.056 log copies/ml per escape event (interquartile range: 0.036 - 0.082). Analysis of the cross-sectional cohort revealed a highly significant positive association between viral load and the number of "escape events" (HLA-associated polymorphisms per epitope), after correcting for length of infection and rate of replication. We estimate that a single CTL escape event leads to a viral load increase of 0.11 log copies/ml (95% confidence interval: 0.040 - 0.18), consistent with the predictions from the mathematical modelling. We find that polymorphisms in pol, but not gag, are the main drivers of this association. Overall, the number of escape events could only account for approximately 6% of the viral load variation in the cohort. Our findings, using two independent approaches, indicate that although the loss of the CTL response for a single epitope results in a highly statistically significant increase in viral load, the impact is modest. We suggest that this small increase in viral load is explained by the small growth advantage of the variant compared to the wildtype. Escape from CTLs had a measurable, but unexpectedly low, impact on viral load in chronic infection.
William P Kelly and John W Pinney
Centre for Bioinformatics, Imperial College London, South Kensington Campus, London, UK
Visualizing properties of protein interaction networks in functional space
The networks formed by interacting proteins are of increasing interest as abstractions of cellular processes. However, the implications of the stochastic and systematic errors introduced by different sources of network data, whether high-throughput experimental protocols or literature curation efforts, are only now starting to be understood in any detail. Using GLASS (Gene Layout by Semantic Similarity), a recently developed methodology for the visualization of genome-scale data, we demonstrate the striking differences between network data sets in terms of their biases towards different biological processes, and the knock-on effects of these biases on derived network statistics. These visualizations can help us to separate the truly biological aspects of network structure from artifacts created by incomplete and uneven sampling of the interactome.
Mr Haseong Kim
Intelligent Systems and Networks, Department of Electrical and Electronic Engineering, Imperial College London , South Kensington Campus, London, UK
Modeling Strategy of Gene Regulatory Networks via Queuing Networks
Modeling of gene regulatory networks (GRNs) has developed along with the study of the underlying mechanisms of gene and protein expression, and numerous mathematical models attempt to explain gene expression (Paulsson, 2005). In this study, we introduce a modeling strategy based on queuing networks. Queuing networks were first applied to a simple GRN by Arazi et al. (2004), and an analytical solution of a queuing network (G-network) describing gene regulatory networks was proposed by Gelenbe (2007). To apply the queuing networks, a stochastic GRN model (McAdams and Arkin, 1997; Thattai and van Oudenaarden, 2001) is constructed and simulation data are generated using the Gillespie algorithm (Gillespie, 1977). The GRNs are then modelled with the G-network. A node in our model represents a gene expression mechanism involving DNA-mRNA-protein production. The proposed method is applied to various simulated GRN models with different parameter values. Special importance was placed on the steady state of the GRNs under diverse external perturbations. This GRN modeling study may be extended to detecting candidate disease genes by applying congestion-detection measures from queuing network theory.
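For readers unfamiliar with the Gillespie (1977) direct method used to generate the simulation data, a minimal sketch for a birth-death expression model (production at constant rate, degradation proportional to copy number) is shown below; the model and function name are our illustrative assumptions, not the authors' GRN.

```python
import random

def gillespie_birth_death(x0, k_birth, k_death, t_end, rng=None):
    """Gillespie's direct method for the birth-death model
    0 -> X at rate k_birth, X -> 0 at rate k_death * x.
    Returns the copy number at time t_end."""
    rng = rng or random.Random(1)
    x, t = x0, 0.0
    while True:
        a_birth = k_birth
        a_death = k_death * x
        a_total = a_birth + a_death  # > 0 since k_birth > 0
        t += rng.expovariate(a_total)  # exponential waiting time
        if t > t_end:
            return x
        # choose the next reaction proportionally to its propensity
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1
```

At stationarity the copy number is Poisson with mean k_birth/k_death, a standard check for such simulators.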
Mr Paul Kirk
Imperial College London
Gaussian process regression bootstrapping: Exploring the effects of uncertainty in time course data
Details to follow
Mr Ricardo de Matos Simoes
Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), Dr. Bohr Gasse 9, A-1030 Vienna, Austria
Species-Specific Evolving Regions in the Human and Chimpanzee Genome
Human and chimpanzee DNA sequences are believed to evolve according to the same substitution model. Exceptions to this assumption have been described (Ebersberger and Meyer 2005), but a genome-wide analysis of this aspect has not been performed. We have developed a test statistic to assess and compare branch-specific substitution models. In total we identified 673 regions, 125 kb in size, that evolved with significantly different substitution models in humans and chimpanzees. A more detailed analysis revealed changes in the transition rates as the main cause of the model differences. For an analysis of strand-specific substitution rates, we subsequently focused on the transcribed fraction of the human genome. A transition bias favored A->G over T->C substitutions on the transcribed strand, which is interpreted as a signature of transcription-coupled repair (Green et al. 2003). Notably, the extent of the transition bias is markedly different between humans and chimpanzees in the significant regions. We propose that species-specific differences in the germline expression of the corresponding genes account for the different modes of evolution.
Mr Aidan MacNamara
Department of Immunology, Wright-Fleming Institute, Imperial College, Norfolk Place, London W2 1PG
The determination of the basis of HLA class I protection in HTLV-I infection
The question of why the same virus can cause varying degrees of pathogenicity in different hosts can be explained, in part, by differences in the host immune response. The Human Leucocyte Antigen (HLA) class I immune genotype, along with other host genetic factors, has been found to be a significant determinant of the susceptibility to disease in HTLV-I infection. However, why specific alleles of the HLA class I genotype afford greater protection than others is still not fully understood, but it is thought to be related to differences in the cytotoxic T lymphocyte (CTL) immune response that they restrict. , Using software to predict what parts of the virus bind to the MHC class I molecules and a variety of statistical techniques, we have shown that protective alleles target a specific HTLV-I protein called HBZ and that targeting this protein is also associated with a reduced proviral load. More generally, we have shown that asymptomatic carriers have alleles that preferentially target proteins associated with a reduction in proviral load. These results highlight the importance of the relatively unknown HBZ protein and increase our understanding of the immune response to HTLV-I.
Mr Jamie I MacPherson
B.1075 Michael Smith Building, University of Manchester, Oxford Road, Manchester, M13 9PL
IDENTIFYING AND INVESTIGATING CLUSTERS IN THE HIV-HUMAN PROTEIN INTERACTION NETWORK
Background: The recently published HIV-1, Human Protein Interaction Database has provided the HIV-research community with an extensive set of protein-protein interactions (PPIs). In this hand-curated dataset, a unique depth of protein interaction detail is recorded, including an interaction type: a short description of the interaction outcome, e.g. phosphorylates, inhibits, upregulated by. Methods: In this work we use biclustering, a method normally used in gene expression analysis, to identify clusters of human proteins in the HIV-human PPI network. We define a cluster to be a set of human proteins that are involved in the same set of multiple interactions with HIV-1, taking into account the HIV protein interactant and the interaction type. We then analyze these clusters to explore their significance and function using a variety of methods. Results: Using biclustering analysis, we identified 551 significant human protein clusters. We find that many of these clusters consist of closely related proteins that respond in similar ways during the course of HIV-1 infection. Conclusions: We show that biclustering can be used to identify clusters of related proteins in PPI networks. Using this method, we highlight ways through which HIV-1 perturbs the host cell.
Mr Inti Pedroso
SGDP Centre, PO 82, Room C0.10, Institute of Psychiatry, King's College London, De Crespigny Park #16, SE5 8AF, London, United Kingdom
Alternate Open Reading Frames in the human genome.
A DNA sequence contains six potential open reading frames (ORFs), three on one strand and three on the reverse strand. Typically only one of the six is actually expressed, because it is associated with appropriate genetic signals that specify the DNA strand and the reading frame to be transcribed and translated. Exceptions occur in which more than one open reading frame is translated into protein, e.g. certain viral (Fiddes, 1979) and prokaryotic genes (Veloso, 2006; Pedroso, 2008). We call these cases Alternate ORFs. Despite documented examples in eukaryotic genomes, their implications for eukaryotic genome annotation and proteome evolution have not been studied in detail. They are of evolutionary interest as a potential source of new protein domains (Veloso, 2006) and of biomedical interest, as aberrant expression of some Alternate ORFs has been associated with the development of autoimmune disorders in humans (McGuire, 2005). We carried out a bioinformatic study looking for Alternate ORFs in the human genome with supporting experimental evidence. Results from this scan are presented and the implications for genome annotation will be discussed in the context of particular case studies.
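The six-frame enumeration underlying such a scan is straightforward: three frame offsets on the given strand and three on its reverse complement. A minimal sketch (our illustration, not the study's pipeline) is:

```python
def six_frames(dna):
    """Return the six reading frames of a DNA sequence: three on the
    given strand and three on its reverse complement, each trimmed to
    a whole number of codons."""
    comp = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    rev = ''.join(comp[b] for b in reversed(dna))
    frames = []
    for strand in (dna, rev):
        for offset in (0, 1, 2):
            # trim so the frame contains only complete codons
            end = offset + 3 * ((len(strand) - offset) // 3)
            frames.append(strand[offset:end])
    return frames
```

A real ORF scan would then translate each frame and look for start/stop codon bounded stretches, with experimental evidence used to filter candidates.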
Dr Paola M.V. Rancoita
IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland
An integrated Bayesian model for genotyping and copy number data
SNP microarrays are able to measure both genotype and copy number (CN) simultaneously at several Single Nucleotide Polymorphism (SNP) positions. By combining the two types of data, it is possible to identify genomic aberrations more accurately. For example, a long sequence of homozygous SNPs may be due either to a uniparental disomy (UPD) event, i.e. each SNP has two identical alleles both derived from only one parent, or to the physical loss of one allele. In this situation, knowledge of the copy number value can help to distinguish between these two events.
To identify genomic aberrations better, we propose a Bayesian piecewise constant regression which infers the type of aberration that has occurred, taking into account all the possible influences on the microarray's genotype detection that result from an altered copy number level. Specifically, we model the distributions of the detected genotype given a specific genomic alteration, and we estimate the parameters involved on a public reference dataset. The prior distribution of the CN alterations is derived from the CN profile of the sample estimated with the method of Rancoita et al. (2009), while the probability of heterozygosity for each SNP is retrieved from the microarray annotation. The estimation is then performed similarly to mBPCR.
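The UPD-versus-loss disambiguation can be illustrated with a toy Bayes-rule calculation (not the authors' model; the prior, noise level and Gaussian CN likelihoods are illustrative assumptions): given a run of homozygous SNPs, a copy-number estimate near 2 favours UPD and one near 1 favours loss of an allele.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_upd_vs_loss(cn_obs, prior_upd=0.5, sigma=0.3):
    """Posterior probability that a run of homozygous SNPs reflects UPD
    (two copies, CN = 2) rather than loss of one allele (CN = 1), given a
    noisy copy-number estimate cn_obs. Prior and sigma are illustrative."""
    like_upd = normal_pdf(cn_obs, 2.0, sigma)
    like_loss = normal_pdf(cn_obs, 1.0, sigma)
    num = prior_upd * like_upd
    return num / (num + (1 - prior_upd) * like_loss)
```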
Ms Elisa Loza Reyes
University of Bath, Department of Mathematical Sciences, Claverton Down, BA2 7AY, Bath, UK
A Bayesian Phylogenetic Mixture Model for Detecting Heterogeneity in DNA Evolution
The evolution of DNA is intrinsically heterogeneous. Sites that encode crucial functional information are highly conserved, while less functionally constrained sites experience substitutions at high rates. Commonly used phylogenetic models that account for rate heterogeneity among sites [3,4] do not accommodate variability in branch lengths. In these models, the length of a branch represents the number of substitutions accumulated between two nodes. Whenever different sites accumulate different numbers of substitutions, the limitations of using a model that characterises all sites with a single set of branch lengths are obvious: a point estimate of this set is a compromise among the signals coming from differently evolving sites.
We propose a Bayesian mixture model that accounts for both branch length and rate heterogeneity among sites. Our model postulates a heterogeneous population consisting of evolutionary classes j=1,...,k, each class characterised by a tree topology common to all classes, and by a class-specific instantaneous rate matrix and set of branch lengths. Further, we adopt a 'missing data' formulation that enables us to infer the identity of the class from which a site is generated, providing interesting biological insights.
We have applied our model to two real data sets: the mitochondrial DNA of primates (mtDNA) and genes from the Borrelia burgdorferi bacterium. Results show that the usual four-class model a priori assumed for mtDNA (three codon positions plus a fourth tRNA class) is suitably replaced by an inferred two-class mixture. The highly conserved codon-position-two and tRNA sites are allocated to one class, while the hypervariable codon-position-three sites constitute the second class. This provides a more adequate profile of the mtDNA evolutionary process. Analysis of the Borrelia data has revealed the evolutionary differences between conserved housekeeping genes and plasmid-located hypervariable genes.
Although such genes are commonly taken to evolve homogeneously, we demonstrate that this is not the case. We also discuss the relationship of our model with the recently published method of Husmeier and Mantzaris [1].
References:
[1] Husmeier D. and A.V. Mantzaris, 2008. Statistical Applications in Genetics and Molecular Biology, 7(1), article 34.
[2] Margos, G. et al., 2008. Proceedings of the National Academy of Sciences of the USA, 105(25): 8730-8735.
[3] Pagel M. and A. Meade, 2004. Systematic Biology, 56(4): 571-581.
[4] Yang, Z., 1994. Journal of Molecular Evolution, 39: 306-314.
[5] Yang, Z., 1995. Genetics, 139: 993-1005.
Dr Simon Rogers
Department of Computing Science, 18 Lilybank Gardens, University of Glasgow, Glasgow, G12 8QQ
Non-parametric clustering of coupled mRNA and protein profiles
Previously, we have presented a coupled mixture model for the analysis of mRNA and protein time-series for approximately 500 genes. Using this method, we were able to discover biologically interesting patterns potentially corresponding to different modes of regulation. However, the model required significant manual tuning and manual analysis of the results. Here, we present an improved model based on factorising the 2D contingency table of mRNA and protein cluster assignments into layers. The model consists of two coupled hierarchical Dirichlet Processes. In our experiments, many of these layers are significantly enriched with gene ontology terms. Furthermore, the non-parametric nature of the model provides an elegant way to infer the plausible number of layers and the number of mRNA and protein clusters.
Dr Aylwyn Scally
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
Combined genome sequence assembly of gorilla
At present our knowledge of the genome sequences of many species is highly fragmentary, based on assemblies of low-coverage capillary-sequenced data. We present a method for improving the accuracy and completeness of such assemblies using additional new-technology sequencing (NTS) data. NTS reads are much shorter than capillary reads and cannot be assembled using the same techniques, so our method integrates several recently developed tools (such as Velvet, ABySS and Maq) to handle alignment and assembly.
In addition to de novo assembly, we are able to make use of an auxiliary reference sequence, such as that of a closely related species. This can assist in identifying reads for contig extension and in guiding the construction of large supercontigs in syntenic regions.
We apply our method to data for gorilla, for which we have capillary reads at 2.1x coverage and Solexa NTS reads at approximately 30x. The initial assembly of capillary data comprises 2.06 Gbp in contigs with an N50 of 2094 bp; using de novo methods alone we are able to extend this to 2.53 Gbp with N50 = 2867 bp. Using human as an auxiliary reference we can assemble bridging reads between contigs, typically producing 20% additional sequence and much longer contigs.
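The N50 statistic quoted above is computed as follows (the standard definition, shown here as a sketch):

```python
def n50(contig_lengths):
    """N50: the largest contig length L such that contigs of length >= L
    together cover at least half of the total assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly
```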
Ms Christiana Spyrou
Statistical Laboratory, Centre for Mathematical Sciences, Wilberforce Rd, Cambridge, CB3 0WB
BayesPeak: a peak finding algorithm for ChIP-Seq data.
High-throughput sequencing technology has become very popular and is widely used to study protein-DNA interactions. Chromatin immunoprecipitation followed by sequencing of the samples results in vast amounts of data that need to be analysed to map transcription factor binding sites and histone modifications.
Our proposed statistical algorithm, BayesPeak, models the data allowing for overdispersion in the abundance of reads in regions along the genome. A control sample can be incorporated in the analysis to account for post-alignment sequence biases. Markov chain Monte Carlo algorithms are applied to estimate the posterior distributions of the hidden Markov model parameters, and posterior probabilities are used to detect enrichment.
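The role of overdispersion can be illustrated by comparing Poisson and negative binomial log-likelihoods on a window of read counts (an illustration only; BayesPeak itself fits a hidden Markov model by MCMC, and the counts below are invented):

```python
import math

def poisson_loglik(counts, lam):
    """Poisson log-likelihood: variance forced to equal the mean."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

def negbin_loglik(counts, mean, size):
    """Negative binomial parameterised by mean and dispersion `size` (r);
    variance = mean + mean**2 / size, so smaller size = more overdispersion."""
    p = size / (size + mean)
    return sum(
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log(1 - p)
        for k in counts
    )
```

On bursty counts with a few large spikes, the negative binomial fits markedly better than a Poisson with the same mean.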
Mr James Swingland
Room 248, Cyclotron Building, MRC Clinical Sciences Centre, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, Du Cane Road, London W12 0NN, United Kingdom
Low Chromosome X Gene Expression in Male Parkinson's disease Subjects
Introduction: Parkinson's disease (PD) is a movement disorder of unknown aetiology that is more frequent in the male population. This gender bias is often attributed to the neuroprotective role of oestrogens. New evidence, however, suggests that sexual differentiation in the brain is not entirely due to sex steroids but is genetically imprinted. Consequently, we hypothesized that the transcription profile of chromosome X may be altered in the disease and that this disruption may differ between males and females.
Methods: We used microarray datasets of publicly available substantia nigra samples from PD subjects and controls from two independent groups. Raw data were normalized and transferred into Chromowave, a software package which uses a combination of wavelet denoising and SVD to extract spatially coherent patterns of expression.
Results: Chromosome-wide X expression in the nigra was found to be lower in male PD patients than in controls in all datasets. These datasets contained too few female subjects to apply the same method.
Conclusion: The low X expression may be a susceptibility factor for PD or a result of the disease process. In either case, this finding links for the first time the global control of X expression with a disease.
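The SVD step underlying the extraction of spatially coherent patterns can be sketched as follows (an illustration only, not the Chromowave implementation, which also applies wavelet denoising): the leading singular component of a probes-by-samples matrix gives the dominant spatial pattern and its per-sample loading.

```python
import numpy as np

def dominant_pattern(expr):
    """Leading spatially coherent component of a probes x samples expression
    matrix: returns (pattern over probes, weight per sample)."""
    centred = expr - expr.mean(axis=1, keepdims=True)  # centre each probe
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    return u[:, 0], s[0] * vt[0]
```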
Dr Frances Susan Turner
Centre for Bioinformatics, Imperial College, Exhibition Road, South Kensington, London SW7 2AZ
A resource for the study of stress response in Campylobacter jejuni and Mycobacterium tuberculosis.
Microarray studies are a popular means of examining a cell's or organism's response to a stress. Faced with a microarray data set, it is common to ask “which groups of functionally related genes show a significant change in expression?”. Gene Set Enrichment Analysis (GSEA) is a popular method used to identify sets of differentially expressed genes. However, a researcher focused on a particular pathway may be more interested in the reverse question: “under which conditions do genes belonging to this pathway show significantly altered expression?”. Identifying the datasets in which a pathway or other group of genes shows differential expression, and therefore the stresses to which it may respond, could identify useful expression data, further the understanding of the role of the pathway, and inform the design of further experiments studying the pathway.
Publicly available expression data for two important pathogens, Campylobacter jejuni and Mycobacterium tuberculosis, have been stored in a database along with related groups of genes (based on functional annotation, protein interaction, predicted operons and transcription networks). A graphical interface allows users to easily identify expression datasets relevant to groups of genes (either supplied by the user or stored in the database), using an algorithm based on GSEA.
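The reverse query, ranking datasets by how enriched a fixed gene set is in each, can be illustrated with a simple hypergeometric over-representation test (the actual resource uses an algorithm based on GSEA; the dataset names and gene identifiers below are invented):

```python
from math import comb

def hypergeom_pvalue(pathway, de_genes, background):
    """P(overlap >= observed) when drawing |de_genes| genes at random from
    `background`, which contains |pathway & background| pathway genes."""
    N = len(background)
    K = len(pathway & background)
    n = len(de_genes & background)
    k = len(pathway & de_genes & background)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def rank_datasets(pathway, datasets, background):
    """Return (dataset_name, p-value) pairs, most enriched first.
    datasets: {name: set of differentially expressed genes}."""
    scored = [(name, hypergeom_pvalue(pathway, de, background))
              for name, de in datasets.items()]
    return sorted(scored, key=lambda t: t[1])
```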
Mr Ernest Turro
Epidemiology & Public Health, St Mary's Campus, Imperial College, Norfolk Place, London, W2 1PG, United Kingdom
MMBGX: a method for estimating isoform- and gene-level expression from whole-transcript Affymetrix arrays that is unconfounded by multi-match probes
Probes on Affymetrix microarrays may cross-hybridise perfectly to multiple genomic regions, thus artificially inflating their intensities relative to the abundance of the intended mRNA targets. Current signal extraction methods ignore this effect and are therefore prone to bias, prompting researchers to discard measurements from multi-mapping probes prior to analysis. However, this workaround may reduce the precision of some estimates and even preclude the estimation of some probeset-level parameters completely. Probes may also cross-hybridise between alternative isoforms of the same gene, which is useful in detecting alternative splicing. However, current methods do not exploit available information on the structure of known isoforms and do not quantify the abundance of individual isoforms.
We characterise the different types of multi-mapping between probes and probesets targeting genes and known alternative isoforms respectively, for Affymetrix Gene and Exon arrays. We present a fully hierarchical Bayesian model, MMBGX, that intuitively partitions the probe-level signal into latent components for each matching gene or Ensembl transcript, allowing it to estimate the expression of individual isoforms and make use of all the data while avoiding the upward bias on probes mapping to multiple locations. We demonstrate the performance of the model using simulated and real data, including RT-PCR validations.
Ms Maria Vounou
Department of Mathematics, Imperial College London, South Kensington Campus, London, SW7 2AZ
Genome-wide Association Studies in Imaging Genomics: A Multivariate Approach
Authors: Maria Vounou,(1) Giovanni Montana,(1) Thomas E. Nichols,(2) and Brandon Whitcher(2)
(1) Department of Mathematics, Imperial College London and (2) Clinical Imaging Centre, GlaxoSmithKline
Genome-wide association studies in an imaging genetics framework aim to identify statistical associations between variations in the human genome, in a sample of individuals with a neurological disease, and variations in their brain expressed by imaging phenotypes. In such studies both the genotype and phenotype data are very high dimensional and present a complex correlation structure. A statistical challenge is to simultaneously localize a handful of SNPs and a handful of regions of interest (ROIs) in the brain that show a strong and significant dependence. In this work we employ a sparse Canonical Correlation Analysis (CCA) approach to address this problem. CCA performs dimensionality reduction by extracting, from each of the two paired sets of measurements, latent factors that are maximally correlated. We are able to identify highly correlated SNPs and ROIs by adopting a sparse solution. This is achieved by first re-expressing CCA as an ordinary regression problem and then exploiting penalization techniques. We report on the statistical properties of the proposed multivariate approach, which has been assessed using extensive Monte Carlo simulation studies.
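The sparse-CCA idea can be sketched as soft-thresholded power iterations on the SNP-ROI cross-covariance matrix (an illustration in the spirit of penalised matrix decomposition, not the authors' exact regression-based algorithm; the penalty values and the identity within-set covariance approximation are assumptions):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_cca(X, Y, penalty_u=0.1, penalty_v=0.1, n_iter=50):
    """First pair of sparse canonical vectors for column-centred X (n x p SNPs)
    and Y (n x q ROIs), via soft-thresholded power iterations on the
    cross-covariance matrix."""
    C = X.T @ Y / len(X)
    v = np.linalg.svd(C)[2][0]  # initialise at the leading right singular vector
    u = np.zeros(C.shape[0])
    for _ in range(n_iter):
        u = soft_threshold(C @ v, penalty_u)
        u /= np.linalg.norm(u) + 1e-12
        v = soft_threshold(C.T @ u, penalty_v)
        v /= np.linalg.norm(v) + 1e-12
    return u, v
```

Thresholding zeroes the weakly correlated variables, so the returned vectors point at the handful of SNPs and ROIs carrying the shared signal.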
Dr Vladislav Vyshemirsky
Sir Alwyn Williams Building, University of Glasgow, Glasgow, Scotland, G12 9HN
Using Gaussian Processes for Bayesian Hypotheses Testing in Raman Spectroscopy
Surface-enhanced resonance Raman spectroscopy (SERRS) can be used to detect a wide range of biochemical species by employing a specific set of nanoparticle probes. New data obtained using this technology will significantly improve our ability to understand biological systems by enabling high-throughput measurements of protein concentrations. Analysis of spectra produced by SERRS is often done manually, and a solid statistical approach to interpreting such results is very important for drawing valid conclusions.
We model data obtained using SERRS with Gaussian Processes (GPs). This modelling approach enables the computation of marginal likelihoods over different covariance functions of GPs, so that consistent hypothesis testing can be performed.
We investigate several important problems in analytical biochemistry:
* whether the spectroscopic response of analytes changes over time, or the observed variations can be explained by measurement errors;
* whether it is possible to measure differences in the concentration of an analyte given the practical variability of the measurement;
* which frequency bands are most informative for measuring the concentration of a given protein with high confidence.
We additionally develop a calibration procedure based on GP regression of the spectroscopic data, using Markov chain Monte Carlo to marginalise over the hyper-parameters of the covariance function.
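Hypothesis testing via GP marginal likelihoods can be sketched as follows (an illustration with arbitrary hyper-parameters, not the authors' implementation): two covariance functions are compared by the evidence they assign to the same data, here a smooth "spectral response" versus a structureless white-noise explanation.

```python
import numpy as np

def log_marginal_likelihood(x, y, kernel, noise=0.1):
    """Log p(y | x) for a zero-mean GP with the given covariance function."""
    K = kernel(x[:, None], x[None, :]) + noise**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()          # -0.5 * log det K
            - 0.5 * len(x) * np.log(2 * np.pi))

def rbf(a, b, lengthscale=0.5, variance=1.0):
    """Squared-exponential covariance: smooth, correlated responses."""
    return variance * np.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def white(a, b):
    """White-noise covariance: measurement error only."""
    return np.where(a == b, 1.0, 0.0)
```

A higher marginal likelihood for `rbf` than for `white` supports the hypothesis that the response varies systematically rather than by measurement error alone.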
Dr Mark Nicholas Wass
Structural Bioinformatics Group, Centre for Bioinformatics, Biochemistry Building, Imperial College London, London, SW7 2AZ
Ligand binding site prediction using homologous structures and conservation
Knowledge of ligand binding sites is important for identifying protein function. In the post-genomic era, with the accumulation of millions of sequences, functional characterisation has become an important task for bioinformatics. We present an approach for the prediction of binding sites which was among the best-performing methods at CASP8 (Critical Assessment of Structure Prediction).
Given a target sequence, sequence-based methods can be used to predict binding sites. However, although the availability of protein structures is limited, structural data can result in more successful predictions. Our method uses protein structure prediction to identify a structure for the target, which is used to investigate potential binding sites. Two different techniques are combined. First, structures homologous to a query sequence are identified, and those with bound ligands are superimposed onto the predicted structure of the query. Agreement of ligand binding sites in multiple homologous structures is used to identify the binding site on the target. Second, sequence-based conservation methods are used, and conserved residues are mapped onto the predicted target structure. Final predictions combine these two methods. At CASP8 our method obtained 82% coverage and 56% accuracy and was classed as the joint best method, together with the LEE group.
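The agreement step, calling a residue part of the binding site when enough superimposed homologues place a ligand next to it, can be sketched as a simple vote (the residue numbers are hypothetical; the real method works from superimposed 3D structures and also folds in conservation scores):

```python
from collections import Counter

def consensus_binding_site(homolog_contacts, min_votes=2):
    """Residues predicted to bind: those that contact a ligand in at least
    `min_votes` of the superimposed homologous structures.

    homolog_contacts: list of sets of ligand-contacting residue numbers,
    one set per homologue, already mapped onto the target's numbering."""
    votes = Counter()
    for contacts in homolog_contacts:
        votes.update(set(contacts))
    return {res for res, n in votes.items() if n >= min_votes}
```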
Mr Jon White
John Bingham Laboratory, NIAB, Huntingdon Road, Cambridge, CB3 0LE, UK
Applying Association Mapping Techniques to an Inbreeding Cereal Species
Scientific plant breeding and the systematic definition and evaluation of varieties over the past 40 years have given geneticists a potentially valuable resource: seed banks contain samples of genetic material, and historic databases contain corresponding phenotypic data. If genotype data can be obtained from historic samples, then all the ingredients are in place for association genetics.
Using a set of ~600 barley accessions genotyped at 1100 polymorphic loci, an association study was conducted on 33 botanical characters. Evidence of extreme population structure was found, which could be attributed to discrete breeding pools within the barley crop. Structured association, principal component analysis and linear mixed modeling could all be used to control for this confounding factor, albeit with differing degrees of success.
Some evidence of marker association was found in 11 of the 33 traits. Strong association with two trait classes was suggestive of a major gene effect, and co-linearity with the rice, Brachypodium and sorghum genomes highlighted candidate genes for further study.
[This work is part of the AGUWEB project supported by the LINK Sustainable Arable Programme through sponsorship by BBSRC, RERAD and HGCA. Studentship jointly supported by HGCA, CEL and the NIAB Trust.]
Mr Weldon Ward Whitener
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1HH, UK
A method for detecting human microsatellite length-polymorphism using Solexa/Illumina paired-end sequencing data
Microsatellites are common motifs in the human genome; however, little is known about the patterns of polymorphism in microsatellites. To rectify this, we have developed a method that uses Solexa/Illumina sequencing data to identify microsatellites that differ in length between an individual's genome and the human reference genome.
Solexa/Illumina sequencing technology produces paired-end reads separated by a predetermined distance (SD). Sequenced reads are mapped to the human reference using MAQ (http://maq.sourceforge.net). For each microsatellite in the reference genome, the mapping distances (MD) of spanning paired-end reads are stored. Homozygous reference alleles will have an MD equal to the SD, while homozygous deletion alleles will have an MD larger than the SD. Based on this, we have developed likelihood ratio tests to determine whether the alleles of a particular microsatellite locus are statistically different.
We used this approach to analyse publicly available whole-genome Solexa/Illumina sequencing data for a single individual. Our algorithm successfully detected repeat loci that are known (using an independent approach based on Sanger sequencing) to be longer or shorter than the reference genome: 47% of loci that are longer at one or both alleles compared to the reference genome, and 68% of those that are shorter.
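The abstract does not specify the form of the likelihood ratio tests; one minimal version, assuming the spanning-read mapping distances are Normal with known standard deviation sigma, compares H0: mean MD = SD against H1: mean MD free, giving a chi-square statistic with one degree of freedom:

```python
import math

def lr_test_length_diff(mds, sd, sigma):
    """Likelihood-ratio test of H0: mean mapping distance equals the expected
    separation `sd`, assuming MD ~ Normal(mu, sigma) with known sigma.
    -2 log Lambda = n * ((mean - sd) / sigma)**2 ~ chi-square(1) under H0.
    Returns (statistic, p-value)."""
    n = len(mds)
    mean = sum(mds) / n
    stat = n * ((mean - sd) / sigma) ** 2
    p = math.erfc(math.sqrt(stat / 2))  # chi-square(1) upper tail
    return stat, p
```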
Dr Simon Williams
Room B1074, Michael Smith Building, University of Manchester, Oxford Road, Manchester, M13 9PT
Identifying determinants of specificity in yeast protein complexes
Protein complexes play a critical role within the cell, having key functional roles in almost all biological processes. The evolution and specificity of interaction interfaces is of great importance in understanding individual complexes as well as the interaction network as a whole. We aim to characterise the determinants of specificity in protein-protein interactions. Using complexes from S. cerevisiae with known structure, we model orthologous proteins from closely related species. We then analyse sequence divergence at the interfaces, considering factors such as the location and likelihood of substitutions and the influence of intermolecular coevolution in determining specificity. To assess the consequences of this evolutionary divergence on organism fitness in vitro, we utilise natural and artificial hybrids of yeast species. Here the protein subunit from one parental species may bind to the subunit from another species, forming a chimeric complex whose fitness can be measured. We find that a large proportion of substitutions observed in the interfaces of closely related species are essentially evolutionarily neutral and likely to have little or no effect on binding specificity and fitness. This demonstrates that amino acids in protein interfaces have varying roles in specificity and in overall evolutionary constraint.