Bioinformatics (Overview)
Brijesh Singh Yadav ([email protected])

Bioinformatics is the application of information technology to the field of molecular biology. It entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. Over the past few decades, rapid developments in genomic and other molecular research technologies, combined with advances in information technology, have produced a tremendous amount of information related to molecular biology; bioinformatics is the name given to the mathematical and computing approaches used to glean understanding of the underlying biological processes. Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning different DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures. The primary goal of bioinformatics is to increase our understanding of biological processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques (e.g., data mining and machine learning algorithms) to achieve this goal. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution.
At the beginning of the "genomic revolution", bioinformatics was applied to the creation and maintenance of databases storing biological information such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but also the development of complex interfaces whereby researchers could both access existing data and submit new or revised data. In order to study how normal cellular activities are altered in different disease states, biological data must be combined to form a comprehensive picture of these activities. The field of bioinformatics has therefore evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:
a) the development and implementation of tools that enable efficient access to, and use and management of, various types of information;
b) the development of new algorithms and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences.
Major work areas:
• Sequence analysis
• Genome annotation
• Computational evolutionary biology
• Measuring biodiversity
• Analysis of gene expression
• Analysis of regulation
• Analysis of protein expression
• Analysis of mutations in cancer
• Prediction of protein structure
• Comparative genomics
• Modeling biological systems
• High-throughput image analysis
• Protein-protein docking
Software and tools: Software tools for bioinformatics range from simple command-line tools to more complex graphical programs and standalone web services available from various bioinformatics companies and public institutions. The computational biology tool best known among biologists is probably BLAST, an algorithm for searching arbitrary sequences for similarity against other sequences, often drawn from curated databases of protein or DNA sequences. The NCBI provides a popular web-based implementation that searches their databases. BLAST is one of a number of generally available programs for performing sequence alignment.
Web services in bioinformatics: SOAP- and REST-based interfaces have been developed for a wide variety of bioinformatics applications, allowing an application running on one computer in one part of the world to use algorithms, data, and computing resources on servers in other parts of the world. The main advantage is that the end user does not have to deal with software and database maintenance overheads. Basic bioinformatics services are classified by the EBI into three categories: SSS (Sequence Search Services), MSA (Multiple Sequence Alignment), and BSA (Biological Sequence Analysis). The availability of these service-oriented bioinformatics resources demonstrates the applicability of web-based bioinformatics solutions, which range from collections of standalone tools with a common data format under a single, standalone or web-based interface, to integrative, distributed, and extensible bioinformatics workflow management systems.
BIOLOGICAL DATABASES: A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record in a nucleotide sequence database typically contains information such as a contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence. For researchers to benefit from the data stored in a database, two additional requirements must be met:
1. Easy access to the information; and
2. A method for extracting only that information needed to answer a specific biological question.
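The record-plus-query structure described above can be sketched in a few lines. The records below are invented examples with hypothetical accession numbers, and the `query` helper is only an illustration of requirement 2 (extracting just the information that answers a specific question), not the interface of any real database system.

```python
# Minimal sketch of a flat-file style sequence database: each record is a
# dict holding the fields described above. All data here is hypothetical.
records = [
    {"accession": "X00001", "organism": "Homo sapiens",
     "molecule": "DNA", "sequence": "ATGGCCATTGTAATG"},
    {"accession": "X00002", "organism": "Mus musculus",
     "molecule": "mRNA", "sequence": "AUGGCGCUAAUG"},
    {"accession": "X00003", "organism": "Homo sapiens",
     "molecule": "mRNA", "sequence": "AUGCCGUUAUAA"},
]

def query(db, **criteria):
    """Return only the records matching every field/value criterion."""
    return [r for r in db if all(r.get(k) == v for k, v in criteria.items())]

# Extract only records answering one specific question: human mRNAs.
human_mrna = query(records, organism="Homo sapiens", molecule="mRNA")
print([r["accession"] for r in human_mrna])  # ['X00003']
```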
Currently, a lot of bioinformatics work is concerned with the technology of databases. These databases include both "public" repositories of gene data like GenBank or the Protein Data Bank (the PDB), and private databases like those used by research groups involved in gene mapping projects or those held by biotech companies. Making such databases accessible via open standards like the Web is very important, since consumers of bioinformatics data use a range of computer platforms: from the more powerful and forbidding UNIX boxes favoured by developers and curators to the far friendlier Macs often found populating the labs of computer-wary biologists. DNA and RNA are the nucleic acids that store the hereditary information about an organism. These macromolecules have a defined structure, which can be analyzed by biologists with the help of bioinformatics tools and databases.
Protein Structure Databases
1. Protein Data Bank (PDB)
2. Structural Classification of Proteins (SCOP)
3. Class, Architecture, Topology and Homologous superfamily (CATH)

Protein Sequence Databases
• Primary Databases
  1. Swiss-Prot
  2. Protein Information Resource (PIR)
• Secondary Databases
  1. PROSITE
  2. Protein Family (Pfam)

Nucleotide Sequence Databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL)

Composite Databases
1. National Center for Biotechnology Information (NCBI)
2. Sequence Retrieval System (SRS)
3. Catalogue of Databases (DBCAT)

Other Databases
1. Receptor-Ligand Database (ReliBase)
2. Restriction Enzyme Database (REBASE)
3. G-Protein Coupled Receptor Database (GPCRDB)
4. Nuclear Receptor Database (NucleaRDB)
5. Literature Database - PubMed
Genbank: GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of known genetic sequences. It has a flat-file structure: an ASCII text file, readable by both humans and computers. In addition to sequence data, GenBank files contain information such as accession numbers and gene names, phylogenetic classification, and references to published literature. There were approximately 191,400,000 bases and 183,000 sequences as of June 1994. EMBL: The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications, and directly submitted by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). The database currently doubles in size every 18 months and currently (June 1994) contains nearly 2 million bases from 182,615 sequence entries. SwissProt: This is a protein sequence database that provides a high level of integration with other databases and also has a very low level of redundancy (meaning fewer identical sequences are present in the database).
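The human- and machine-readable flat-file layout mentioned above can be illustrated with a short parser sketch. The entry below is a fabricated, heavily abbreviated record (real GenBank entries contain many more line types, such as SOURCE and FEATURES), and the parser only demonstrates the column-based keyword layout, not a full GenBank parser.

```python
# Hedged sketch: pulling a few common fields out of a GenBank-style flat
# file. Keywords sit in the first 12 columns; sequence lines follow ORIGIN.
sample_entry = """\
LOCUS       SYNGENE      15 bp    DNA     linear   SYN 01-JAN-1994
DEFINITION  Synthetic example sequence (hypothetical record).
ACCESSION   U00000
ORIGIN
        1 atggccattg taatg
//
"""

def parse_fields(entry):
    fields, seq_lines, in_origin = {}, [], False
    for line in entry.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):          # end-of-record marker
            in_origin = False
        elif in_origin:
            # sequence lines: a position number followed by base blocks
            seq_lines.append("".join(line.split()[1:]))
        elif line[:12].strip():              # keyword in the first 12 columns
            fields[line[:12].strip()] = line[12:].strip()
    fields["SEQUENCE"] = "".join(seq_lines).upper()
    return fields

entry = parse_fields(sample_entry)
print(entry["ACCESSION"], len(entry["SEQUENCE"]))  # U00000 15
```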
PROSITE: The PROSITE dictionary of sites and patterns in proteins was prepared by Amos Bairoch at the University of Geneva. EC-ENZYME: The 'ENZYME' data bank contains the following data for each type of characterized enzyme for which an EC number has been provided: EC number, recommended name, alternative names, catalytic activity, cofactors, pointers to the SWISS-PROT entry (or entries) that correspond to the enzyme, and pointers to disease(s) associated with a deficiency of the enzyme. RCSB-PDB: The RCSB PDB contains 3-D biological macromolecular structure data from X-ray crystallography, NMR, and Cryo-EM. It is operated by Rutgers, The State University of New Jersey, and the San Diego Supercomputer Center at the University of California, San Diego. GDB: The GDB Human Genome Data Base supports biomedical research, clinical medicine, and professional and scientific education by providing for the storage and dissemination of data about genes and other DNA markers, map location, genetic disease and locus information, and bibliographic information. OMIM: The Mendelian Inheritance in Man data bank (MIM) is prepared by Victor McKusick with the assistance of Claire A. Francomano and Stylianos E. Antonarakis at Johns Hopkins University. PIR-PSD: PIR (Protein Information Resource) produces and distributes the PIR-International Protein Sequence Database (PSD), the most comprehensive and expertly annotated protein sequence database. The PIR serves the scientific community through on-line access, distribution of magnetic tapes, and off-line sequence identification services for researchers. Release 40.00: March 31, 1994; 67,423 entries; 19,747,297 residues. Protein sequence databases are classified as primary, secondary, and composite depending upon the content stored in them. PIR and SwissProt are primary databases that contain protein sequences as 'raw' data. Secondary databases (like PROSITE) contain information derived from protein sequences.
Primary databases are combined and filtered to form non-redundant composite databases.
Genethon Genome Databases: PHYSICAL MAP: computation of the human physical map using DNA fragments in the form of YAC contigs. GENETIC MAP: production of micro-satellite probes and their localization on chromosomes, to create a genetic map to aid in the study of hereditary diseases. GENEXPRESS (cDNA): a catalogue of the transcripts required for protein synthesis obtained from specific tissues, for example neuromuscular tissues.
MGD: The Mouse Genome Databases: MGD is a comprehensive database of genetic information on the laboratory mouse. This initial release contains the following kinds of information: Loci (over 15,000 current and withdrawn symbols), Homologies (1300 mouse loci, 3500 loci from 40 mammalian species), Probes and
Clones (about 10,000), PCR primers (currently 500 primer pairs), Bibliography (over 18,000 references), Experimental data (from 2400 published articles). ACeDB (A Caenorhabditis elegans Database): ACeDB contains data from the Caenorhabditis Genetics Center (funded by the NIH National Center for Research Resources), the C. elegans genome project (funded by the MRC and NIH), and the worm community. ACeDB is also the name of the generic genome database software in use by an increasing number of genome projects. The software, as well as the C. elegans data, can be obtained via ftp. ACeDB databases are available for the following species: C. elegans, Human Chromosome 21, Human Chromosome X, Drosophila melanogaster, mycobacteria, Arabidopsis, soybeans, rice, maize, grains, forest trees, Solanaceae, Aspergillus nidulans, Bos taurus, Gossypium hirsutum, Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Sorghum bicolor. MEDLINE: MEDLINE is NLM's premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, and the preclinical sciences. Journal articles are indexed for MEDLINE, and their citations are searchable using NLM's controlled vocabulary, MeSH (Medical Subject Headings). MEDLINE contains all citations published in Index Medicus, and corresponds in part to the International Nursing Index and the Index to Dental Literature. Citations include the English abstract when published with the article (approximately 70% of the current file). For researchers to benefit from all this information, however, two additional things were required:
1) Ready access to the collected pool of sequence information, and
2) A way to extract from this pool only those sequences of interest to a given researcher.
Simply collecting, by hand, all necessary sequence information of interest to a given project from published journal articles quickly became a formidable task. After collection, the organization and analysis of this data still remained. It could take weeks to months for a researcher to search sequences by hand in order to find related genes or proteins. Computer technology has provided the obvious solution to this problem. Not only can computers be used to store and organize sequence information into databases, but they can also be used to analyze sequence data rapidly. The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence information being created. Theoretical scientists have derived new and sophisticated algorithms which allow sequences to be readily compared using probability theories. These comparisons become the basis for determining gene function, developing phylogenetic relationships, and simulating protein models. The physical linking of a vast array of computers in the 1970s provided a few biologists with ready access to the expanding pool of sequence information. This web of connections, now known as the Internet, has evolved and expanded so that nearly everyone has access to this information and the tools necessary to analyze it. Databases of existing sequencing data can be used to identify homologues of new molecules that have been amplified and sequenced in the lab. The property of sharing a common ancestor, homology, can be a very powerful indicator in bioinformatics.
Sequence Analysis In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
A sequence alignment produced by ClustalW between two human zinc finger proteins, identified by their GenBank accession numbers.
If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In protein sequence alignment, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties), in a particular region of the sequence suggests that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other
than to amino acids, the conservation of base pairing can indicate a similar functional or structural role. Sequence alignment can be used for non-biological sequences, such as those present in natural language or in financial data. Computational approaches to sequence alignment generally fall into two categories:
1. global alignments
2. local alignments
Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem, including slow but formally optimizing methods like dynamic programming and efficient heuristic or probabilistic methods designed for large-scale database search. Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a number of input and output formats, such as FASTA format and GenBank format; however, the use of specific tools authored by individual research laboratories can be complicated by limited file format compatibility. General conversion programs are available, such as DNA Baser or Readseq (for Readseq you must upload your files to a remote server and provide your email address).
Global and local alignments
Illustration of global and local alignments demonstrating the 'gappy' quality of global alignments that can occur if sequences are insufficiently similar.
Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot contain gaps.) A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm is a general local alignment
method also based on dynamic programming. With sufficiently similar sequences, there is no difference between local and global alignments.
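The dynamic-programming idea behind the Needleman-Wunsch algorithm can be sketched as follows. This is a minimal illustration, not a production implementation: the match/mismatch/gap scores are arbitrary example values rather than a real substitution matrix. The Smith-Waterman local variant differs mainly by clamping cell scores at zero and tracing back from the highest-scoring cell instead of the bottom-right corner.

```python
# Minimal Needleman-Wunsch global alignment by dynamic programming.
# Scoring is illustrative only: +1 match, -1 mismatch, -2 per gap.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # score matrix, initialised with gap penalties along the edges
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i-1][j-1] + s, F[i-1][j] + gap, F[i][j-1] + gap)
    # traceback from the bottom-right corner to recover one best alignment
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i-1] == b[j-1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i-1][j-1] + s:
            top.append(a[i-1]); bottom.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i-1][j] + gap:
            top.append(a[i-1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j-1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), F[n][m]

t, b, score = needleman_wunsch("GATTACA", "GCATGCU")
print(t); print(b); print(score)
```

Every residue of both sequences appears in the output rows, which is exactly the "forced" end-to-end span described above for global alignment.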
Pairwise alignment Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high homology to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods; however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low information content, especially where the number of repetitions differs between the two sequences to be aligned. One way of quantifying the utility of a given pairwise alignment is the 'maximum unique match' (MUM), the longest substring that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness.
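The 'maximum unique match' idea can be sketched as the longest substring shared by both sequences. This is a simplification: real MUM finders additionally require the match to be unique in each sequence and use suffix trees for efficiency, whereas this illustrative O(n*m) dynamic-programming version only finds the longest shared substring.

```python
# Illustrative sketch: longest substring common to both query sequences.
def longest_shared_substring(a, b):
    # L[i][j] = length of the common substring ending at a[i-1] and b[j-1]
    L = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, end = 0, 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i-1] == b[j-1]:
                L[i][j] = L[i-1][j-1] + 1
                if L[i][j] > best:
                    best, end = L[i][j], i
    return a[end - best:end]

print(longest_shared_substring("GATTACAGG", "TACAGGA"))  # TACAGG
```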
Multiple sequence alignment: Multiple sequence alignment is an extension of pairwise alignment that incorporates more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. Multiple sequence alignments are computationally difficult to produce, and most formulations of the problem lead to NP-complete combinatorial optimization problems. Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.
Sequence Similarity Tool: BLAST: BLAST, the Basic Local Alignment Search Tool, is a statistical pattern-matching algorithm. It was developed and published by Altschul et al. in 1990 and then as an enhanced version in 1997. It is one of the foundational algorithms for the study of comparative genomics. BLAST's impact on our understanding of biology is demonstrated by its ubiquity. BLAST is web-based and fast. It is used worldwide to compare DNA and protein sequences for similarity in structure and function and to infer evolutionary relationships between sequences. As an example of the volume of BLAST analyses conducted worldwide, in March 2003 the US National Center for Biotechnology Information (NCBI) was receiving 100,000 unique BLAST runs from 70,000 unique IP addresses daily, with usage increasing continually (personal communication, W. Matten, 2003). BLAST operates by cutting up sequences into smaller "words" and searching for each of the words in "target" sequences. It looks in both directions along target sequences to find longer pattern matches. BLAST scores matches according to experimental knowledge of homology, which accounts for some of the imperfect matches it generates. BLAST also matches and aligns sequences locally. It does not create global sequence alignments.
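The word-cutting step described above can be sketched as follows. This is only an illustration of the seeding idea, not the real BLAST implementation: actual BLAST also seeds on near-matching words scored with a substitution matrix, keeps every word position, and then extends each hit in both directions.

```python
# Illustrative sketch of BLAST-style "word" seeding.
def words(seq, w=3):
    # word -> a position in the query (one position kept for simplicity;
    # real BLAST keeps every occurrence and near-matching words too)
    return {seq[i:i+w]: i for i in range(len(seq) - w + 1)}

def seed_hits(query, target, w=3):
    """Find (query_pos, target_pos, word) seeds a full run would extend."""
    table = words(query, w)
    hits = []
    for j in range(len(target) - w + 1):
        word = target[j:j+w]
        if word in table:
            hits.append((table[word], j, word))
    return hits

print(seed_hits("ATGCGT", "GGATGCG", w=3))
# [(0, 2, 'ATG'), (1, 3, 'TGC'), (2, 4, 'GCG')]
```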
FASTA: FASTA (pronounced FAST-AYE) stands for FAST-ALL, reflecting the fact that it can be used for a fast protein comparison or a fast nucleotide comparison. This program achieves a high level of sensitivity for similarity searching at high speed. This is achieved by performing optimised searches for local alignments using a substitution matrix, in this case a DNA identity matrix. The high speed of this program is achieved by using the observed pattern of word hits to identify potential matches before attempting the more time-consuming optimised search. The trade-off between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the word. Increasing the ktup decreases the number of background hits. Not every word hit is investigated; instead, FASTA initially looks for segments containing several nearby hits. This program is much more sensitive than the BLAST programs, which is reflected in the length of time required to produce results. FASTA produces optimal local alignment scores for the comparison of the query sequence to every sequence in the database. The majority of these scores involve unrelated sequences, and therefore can be used to estimate lambda and K, statistical parameters estimated from the distribution of unrelated sequence similarity scores. This approach avoids the artificiality of a random sequence model by employing real sequences, with their natural correlations. FASTA uses four steps to calculate three scores that characterise sequence similarity. These steps are outlined below. A representation of these steps is given in a figure drawn from Barton (1994), Protein Sequence Alignment and Database Scanning. Step 1: Identify regions shared by the two sequences with the highest density of identities (ktup=1) or pairs of identities (ktup=2).
The first step uses a rapid technique for finding identities shared between two sequences; the method is similar to an earlier technique described by Wilbur and Lipman. FASTA achieves much of its speed and selectivity in this step by using a lookup table to locate all identities or groups of identities between two DNA or amino acid sequences. The ktup parameter determines how many consecutive identities are required in a match. A ktup value of 2 is frequently used for protein sequence comparison, which means that the program examines only those portions of the two sequences being compared that have at least two adjacent identical residues in both sequences. More sensitive searches can be done using ktup = 1. For DNA sequence comparisons, the ktup parameter can range from 1 to 6; values between 4 and 6 are recommended. When the query sequence is a short oligonucleotide or oligopeptide, ktup = 1 should be used. In conjunction with the lookup table, FASTA uses the "diagonal" method to find all regions of similarity between the two sequences, counting ktup matches and penalizing intervening mismatches. This method identifies regions of a diagonal that have the highest density of ktup matches. The term diagonal refers to the diagonal line that is seen on a dot-matrix plot when a sequence is compared with itself, and it denotes an alignment between two sequences without gaps. FASTA uses a formula for scoring ktup matches that incorporates the actual PAM250 values for the aligned residues. Thus, groups of identities with high similarity scores contribute more to the local diagonal score than identities with low similarity scores. This more sensitive formula is used for protein sequence comparisons; a constant value for ktup matches is used for DNA sequence comparisons. FASTA saves the 10 best local regions, regardless of whether they are on the same or different diagonals. Step 2: Rescan the 10 regions with the highest density of identities using the PAM250 matrix. Trim the ends of each region to include only those residues contributing to the highest score. Each region is a partial alignment without gaps. After the 10 best local regions are found in the first step, they are rescored using a scoring matrix that allows runs of identities shorter than ktup residues and conservative replacements to contribute to the similarity score. For protein sequences, this score is usually calculated using the PAM250 matrix, although scoring matrices based on the minimum number of base changes required for a specific replacement, on identities alone, or on an alternative measure of similarity can also be used with FASTA.
The PAM250 scoring matrix was derived from the analysis of amino acid replacements occurring among related proteins; it specifies a range of positive scores for replacements that commonly occur among related proteins and negative scores for unlikely replacements. FASTA can also be used for DNA sequence comparisons, and matrices can be constructed that allow separate penalties for transitions and transversions. For each of the best diagonal regions rescanned with the scoring matrix, a subregion with the maximal score is identified. These initial scores, referred to as init1 scores, are used to rank the library sequences. Step 3: If there are several initial regions with scores greater than the CUTOFF value, check whether the trimmed initial regions can be joined to form an approximate alignment with gaps. Calculate a similarity score that is the sum of the joined initial regions minus a penalty (usually 20) for each gap. This initial similarity score (initn) is used to rank the library sequences. The score of the single best initial region found in step 2 is reported (init1). FASTA checks, during a library search, whether several initial regions can be joined together in a single alignment to increase the initial score. FASTA calculates an optimal alignment of initial regions as a combination of compatible regions with maximal score. This optimal alignment of initial regions can be rapidly calculated using a dynamic programming algorithm. FASTA uses the resulting score, referred to as the initn score, to rank the library sequences. The third "joining" step in the computation of the initial score increases the sensitivity of the search method because it allows for insertions and deletions as well as conservative replacements. The modification does, however, decrease selectivity.
The degradation in selectivity is limited by including in the optimization step only those initial regions whose scores are above an empirically determined threshold: FASTA joins an initial region only if its similarity score is greater than the cutoff value, a value that is approximately one standard deviation above the average score expected from unrelated sequences in the library. For a 200-residue query sequence and ktup = 2, this value is 28. Step 4: Construct an NWS (Needleman-Wunsch-Sellers algorithm) optimal alignment of the query sequence and the library sequence, considering only those residues that lie in a band 32 residues wide centered on the best initial region found in Step 2. FASTA reports this score as the optimized (opt) score. After a complete search of the library, FASTA plots the initial scores of each library sequence in a histogram, calculates the mean similarity score for the query sequence against each sequence in the library, and determines the standard deviation of the distribution of initial scores. The initial scores are used to rank the library sequences, and, in the fourth and final step of the comparison, the highest-scoring library sequences are aligned using a modification of the standard NWS optimization method. The optimization employs the same scoring matrix used in determining the initial regions; the resulting optimized alignments are calculated for further analysis of potential relationships, and the optimized similarity score is reported.
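Step 1's lookup table and diagonal method can be sketched as follows. This is an illustrative toy, not the real FASTA code: it only counts ktup word hits per diagonal (i - j), without the PAM250-weighted scoring, mismatch penalties, or region trimming described above, and the sequences are made-up examples.

```python
# Hedged sketch of FASTA step 1: build a lookup table of ktup words in
# the query, then count word hits on each diagonal (i - j). Dense
# diagonals correspond to the ungapped similarity regions in the text.
from collections import defaultdict

def diagonal_hits(query, library_seq, ktup=2):
    table = defaultdict(list)                 # word -> positions in query
    for i in range(len(query) - ktup + 1):
        table[query[i:i+ktup]].append(i)
    diagonals = defaultdict(int)              # diagonal (i - j) -> hit count
    for j in range(len(library_seq) - ktup + 1):
        for i in table.get(library_seq[j:j+ktup], []):
            diagonals[i - j] += 1
    return dict(diagonals)

# The densest diagonal marks the best ungapped region to rescore (init1).
hits = diagonal_hits("FGKTWE", "AFGKTWEY", ktup=2)
print(max(hits, key=hits.get), hits)  # -1 {-1: 5}
```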
CLUSTAL W: ClustalW2 is a general-purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities, and differences can be seen. Evolutionary relationships can be seen by viewing cladograms or phylograms.
Phylogenetic analysis: A phylogenetic tree, also called an evolutionary tree, is a tree showing the evolutionary relationships among various biological species or other entities that are believed to have a common ancestor. In a phylogenetic tree, each node with descendants represents the most recent common ancestor of the descendants, and the edge lengths in some trees correspond to time estimates. Each node is called a taxonomic unit. Internal nodes are generally called hypothetical taxonomic units (HTUs) as they cannot be directly observed.
A speculatively rooted tree for rRNA genes
Tools for Phylogenetic analysis:
• BIONJ - Server for NJ phylogenetic analysis
• DendroUPGMA
• PHYLIP - Server for phylogenetic analysis using the PHYLIP package
• PhyML - Server for ML phylogenetic analysis
• Phylogeny.fr - Robust Phylogenetic Analysis For The Non-Specialist
• The PhylOgenetic Web Repeater (POWER) - perform phylogenetic analysis
• BlastO - Blast on orthologous groups
• Evolutionary Trace Server (TraceSuite II) - Maps evolutionary traces to structures
• Phylogenetic programs - List of phylogenetic packages and free servers (PHYLIP pages)
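As an illustration of how distance-based tree building works, here is a minimal sketch of UPGMA, the average-linkage clustering method behind servers such as DendroUPGMA. The distance matrix is invented example data, and real implementations also compute branch lengths, which this toy omits; it returns only the tree topology in Newick notation.

```python
# Minimal UPGMA sketch: repeatedly merge the two closest clusters,
# size-averaging distances, until a single tree (Newick string) remains.
def upgma(names, dist):
    clusters = list(names)          # each cluster labelled by its Newick subtree
    sizes = {n: 1 for n in names}
    # only the upper triangle (a < b) of the distance matrix is needed
    d = {frozenset((a, b)): dist[a][b] for a in names for b in names if a < b}
    while len(clusters) > 1:
        a, b = sorted(min(d, key=d.get))      # closest pair of clusters
        new = f"({a},{b})"
        for c in clusters:
            if c not in (a, b):
                # size-weighted average distance from the merged cluster
                d[frozenset((new, c))] = (sizes[a] * d[frozenset((a, c))] +
                                          sizes[b] * d[frozenset((b, c))]) \
                                         / (sizes[a] + sizes[b])
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        clusters = [c for c in clusters if c not in (a, b)] + [new]
    return clusters[0]

# Made-up distances: A and B are close; C and D are close.
dist = {"A": {"B": 2, "C": 6, "D": 6}, "B": {"C": 6, "D": 6}, "C": {"D": 4}}
print(upgma(["A", "B", "C", "D"], dist))  # ((A,B),(C,D))
```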
Homology modeling In protein structure prediction, homology modeling, also known as comparative modeling, is a class of methods for constructing an atomic-resolution model of a protein from its amino acid sequence (the "query sequence" or "target"). Almost all homology modeling techniques rely on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in the query sequence to residues in the template sequence. The sequence alignment and template structure are then used to produce a structural model of the target. Because protein structures are more conserved than DNA sequences, detectable levels of sequence similarity usually imply significant structural similarity. The quality of the homology model is dependent on the quality of the sequence alignment and template structure. The approach can be complicated by the presence of alignment gaps (commonly called indels) that indicate a structural region present in the target but not in the template, and by structure gaps in the template that arise from poor resolution in the experimental procedure (usually X-ray crystallography) used to solve the structure. Model quality declines with decreasing sequence identity; a typical model has ~2 Å agreement between the matched Cα atoms at 70% sequence identity but only 4-5 Å agreement at 25% sequence identity. Regions of the model that were constructed without a template, usually by loop modeling, are generally much less accurate than the rest of the model, particularly if the loop is long. Errors in side chain
packing and position also increase with decreasing identity, and variations in these packing configurations have been suggested as a major reason for poor model quality at low identity. Taken together, these various atomic-position errors are significant and impede the use of homology models for purposes that require atomic-resolution data, such as drug design and protein-protein interaction predictions; even the quaternary structure of a protein may be difficult to predict from homology models of its subunit(s). Nevertheless, homology models can be useful in reaching qualitative conclusions about the biochemistry of the query sequence, especially in formulating hypotheses about why certain residues are conserved, which may in turn lead to experiments to test those hypotheses. For example, the spatial arrangement of conserved residues may suggest whether a particular residue is conserved to stabilize the folding, to participate in binding some small molecule, or to foster association with another protein or nucleic acid. Homology modeling can produce high-quality structural models when the target and template are closely related, which has inspired the formation of a structural genomics consortium dedicated to the production of representative experimental structures for all classes of protein folds. The chief inaccuracies in homology modeling, which worsen with lower sequence identity, derive from errors in the initial sequence alignment and from improper template selection. Like other methods of structure prediction, current practice in homology modeling is assessed in a biennial large-scale experiment known as the Critical Assessment of Techniques for Protein Structure Prediction, or CASP.
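The first two computational steps described above — aligning the query to a candidate template and reading the percent sequence identity off that alignment — can be sketched in plain Python. This is a minimal illustration, not the method of any particular modeling server: the Needleman-Wunsch scoring values (match +1, mismatch -1, gap -2) and the toy sequences are assumptions chosen for brevity.

```python
def needleman_wunsch(query, template, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns the two aligned strings."""
    n, m = len(query), len(template)
    # score[i][j] = best score for aligning query[:i] against template[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if query[i - 1] == template[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback from the bottom-right corner to recover one optimal alignment.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if query[i - 1] == template[j - 1] else mismatch
        ):
            a.append(query[i - 1]); b.append(template[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(query[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(template[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

def percent_identity(aligned_a, aligned_b):
    """Identity over the aligned (non-gap) columns, the figure used to judge template quality."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

# Toy example: a query with one extra residue relative to the template,
# so the alignment must open a gap (an "indel" in the text's terms).
qa, ta = needleman_wunsch("HEAGAWGHEE", "HEAGWGHEE")
print(qa)
print(ta)
print(percent_identity(qa, ta))
```

In a real pipeline this identity value would be computed against every candidate template, and the 70%/25% identity figures quoted above give a rough sense of the model accuracy one could then expect; the alignment itself supplies the residue-to-residue mapping from which the structural model is built.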
BIOINFORMATICS CENTRE OF EXCELLENCE: Our Vision
1. To foster high-quality, innovative, and multi-disciplinary research and advanced training in Bioinformatics and programming languages.
2. To provide a forum for emerging Bioinformatics researchers and a platform for the development of commercial Bioinformatics capability.

Mission
1. To provide a national-level facility and expertise in Bioinformatics and Computational Biology.
2. To make individuals globally competitive by enhancing their skills in ERP and programming languages.
3. To provide integrated Bioinformatics solutions such as biological and chemical databases, data analysis, data mining, bio-medical text mining, and customized tool development, among others.
4. To build a strong base for the pharmaceutical research and development and IT services that are driving the offshoring of Bioinformatics services.
5. To work in the drug development, contract research, clinical trials, and contract manufacturing segments for other industries.

Our Current Work:
1. We own the world's largest Infectious Disease Databases.
2. We are working on a Neuroinformatics project.
Future scenario
The study finds that the focus of Bioinformatics in the drug discovery process has shifted from target identification to target validation. The market for laboratory information management system (LIMS) products in India is estimated at $23 million in 2008 and is expected to grow at above 30 per cent over 2008-11. Companies with a presence across various geographies require bioinformatics services on a global scale, and often seek a single vendor who can offer a comprehensive range of services worldwide on a long-term basis. The industry is expected to scale up its business through both organic and inorganic growth. As far as funding for expansion is concerned, rising demand for bioinformatics has increased venture-capital interest and presence in this sector. The bioinformatics outsourcing opportunity is projected to rise from $32 million in 2007 to $62 million in 2010. These opportunities range from LIMS enterprise solutions, improved database utilities, and data-management tools to exportable software that can be shared. To capture this opportunity, however, vendors need to establish credibility through the success of past projects, demonstrate strong technical capability, build strong overseas relationships, and train end-users.