Biological Databases Matthias König [05|05|10]
Outline biological databases
I. Introduction & Overview
II. Examples
III. Sequence alignment & fragment search
IV. Database tools and implementation
I Introduction and overview
Why databases?
Biology has turned into a data-rich science
Vast amounts of data are generated in experiments (like MS peptide fragments)
The need for storing and communicating large datasets has grown tremendously
High-throughput genomics, proteomics, metabolomics, ...
Archiving, curation, analysis and interpretation of all of these datasets are a challenge
Convenient methods for proper storing, searching & retrieving are necessary
Databases are the means to handle this data overload
What can databases do?
Make biological data available ...
1. to scientists.
2. in computer-readable form.
Handle and share large volumes of data
Interface for computer-based systems (algorithms, web interfaces)
Store data
Analysis (computer based)
Defined formats
Automated storage and retrieval of experimental data
Link knowledge with external resources
Database classification I
Type of data
Nucleotide or protein sequences
Protein sequence patterns and motifs
Macromolecular 3D structures
Gene expression data
Metabolic pathways
...
Data entry and quality control
Scientists deposit data directly
Appointed curators add and update
Type and degree of error checking
Consistency, redundancy, conflicts, updates
Database classification II
Primary or derived data
Primary: experimental results go directly into the database
Secondary: results of the analysis of primary databases
Technical design
Flat files
Relational database (SQL)
Object-oriented database
Exchange/publication technologies (FTP, HTML, CORBA, XML, SOAP)
Maintainer status
Large, public institution funded by government (EMBL, NCBI)
Academic group or scientist
How to find my database?
Database journals
Nucleic Acids Research offers a database issue every year
Database: The Journal of Biological Databases and Curation
Database portals
DBD (Database of Biological Databases)
Pathguide
Web search
http://lmgtfy.com/
How to access the data?
Human: Web interface (web based, small scale)
Common modes of search are keywords with modifiers, or identifiers
Cross-references link the information of different databases
Web service (SOAP, CORBA)
Flat files (script based, large scale)
Database dump (script based, large scale)
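Script-based access to flat files can be illustrated with a minimal Python sketch. It assumes a FASTA-formatted flat file (a ">" header line followed by sequence lines). The function name read_fasta and the two example records are made up for illustration and do not come from any particular database.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical flat-file content; a real dump would be read with open(path).
example = io.StringIO(
    ">seq1 example record\nggacactaag\ngcgcccagca\n"
    ">seq2 another record\natgctggacg\n"
)
records = dict(read_fasta(example))
print(records["seq1 example record"])
```

Because the parser is a generator, it can stream through a large dump record by record instead of loading the whole file into memory.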
II Examples of biological databases
Nucleotide sequence databases
GenBank
EMBL
DDBJ
Sequences are submitted directly by scientists and genome sequencing groups, or taken from literature and patents
Entries in the EMBL, GenBank and DDBJ databases are synchronized on a daily basis
Accession numbers are managed in a consistent manner
Comparatively little error checking and a fair amount of redundancy
Nucleotide sequence example
Glucokinase (hexokinase 4), mRNA [GenBank]
  1 gagcaggaaa tgccgagcgg cgcctgagcc ccagggaagc aggctaggat gtgagagaca
 61 cagtcacctg cagcctaatt actcaaaagc tgtccccagg tcacagaagg gagaggacat
121 ttcccactga atctgtctga aggacactaa gccccacagc tcaacacaac caggagagaa
181 agcgctgagg acgccaccca agcgcccagc aatggccctg cctggagaac atccaggctc
241 agtgaggaag ggtccagaag ggaatgcttg ccgactcgtt ggagaacaat gaaaaggagg
301 aaactgtgac tgaacctcaa accccaaacc agcccgagga gaaccacatt ctcccaggga
361 cccagggcgg gccgtgaccc ctgcggcgga gaagccttgg atatttccac ttcagaagcc
421 tactggggaa ggctgagggg tcccagctcc ccacgctggc tgctgtgcag atgctggacg
481 acagagccag gatggaggcc gccaagaagg agaaggtaga gcagatcctg gcagagttcc
Protein sequence databases
UniProt KB
Mission: to provide a comprehensive, high-quality and freely accessible resource of protein sequence and functional information
SWISS-PROT is a protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.
TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.
PIR
SWISS-PROT and PIR are different from the nucleotide databases in that they are both curated
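What "translation of a nucleotide sequence entry" means can be sketched in a few lines of Python. The example assumes the standard genetic code and takes as input the 30 nucleotides that start at the first atg in the last two rows of the glucokinase mRNA excerpt. The names CODON_TABLE and translate are illustrative only, and this is not how TrEMBL itself is produced.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Start of the coding region, copied from the glucokinase mRNA excerpt.
cds_start = "atgctggacgacagagccaggatggaggcc"
print(translate(cds_start))
```

The result is MLDDRARMEA, the first ten residues of the glucokinase protein sequence example.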
Protein sequence example
Glucokinase Homo sapiens [P35557 (HXK4_HUMAN)]
        10         20         30         40
MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE
        70         80         90        100
YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS
       130        140        150        160
MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID
       190        200        210        220
VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC
       250        260        270        280
VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES
       310        320        330        340
LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES
       370        380        390        400
TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED
       430        440        450        460
ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA
Peptide related information
MEROPS - Peptidase Database
Peptide Database (Cancer) [example]
PeptideMass
cleaves a protein sequence from the UniProt Knowledgebase (Swiss-Prot and TrEMBL) or a user-entered protein sequence with a chosen enzyme, and computes the masses of the generated peptides.
SYFPEITHI
SYFPEITHI is a database comprising more than 7000 peptide sequences known to bind class I and class II MHC molecules. The entries are compiled from published reports only.
PeptideAtlas
multi-organism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments
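The "cleave with a chosen enzyme, then compute peptide masses" idea can be sketched as follows. This is an illustration under stated assumptions, not the PeptideMass implementation: the enzyme is assumed to be trypsin with the common simplified rule (cut after K or R unless the next residue is P), and masses are monoisotopic residue masses plus one water. The function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values); one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def digest_trypsin(protein):
    """Split after K or R, except when the next residue is P (simplified rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


# First 15 residues of the glucokinase protein sequence example.
for peptide in digest_trypsin("MLDDRARMEAAKKEK"):
    print(f"{peptide:6s} {peptide_mass(peptide):10.4f}")
```

Missed cleavages, modifications and charge states are deliberately left out to keep the sketch short.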
Peptide databases
Enzymes
BRENDA [glucokinase 2.7.1.2]
Comprehensive enzyme information system
KEGG Enzymes [glucokinase 2.7.1.2]
Ensembl [GCK ENSG00000106633]
The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.
Structure databases
PDB [glucokinase 1SZ2]
PDBsum
Secondary Databases
Sometimes known as pattern databases
Contain results from the analysis of the sequences in the primary databases
Examples
PROSITE
Pfam
PRINTS
Motifs and secondary structure
PROSITE [HEXOKINASES PS00378]
Database of protein domains, families and functional sites
Hexokinases signature:
Pattern [LIVM]-G-F-[TN]-F-S-[FY]-P-x(5)-[LIVM]-[DNST]-x(3)-[LIVM]-x(2)-W-T-K-x-[LF].
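A PROSITE pattern is close to a regular expression, so the hexokinase signature can be tried out with a small converter. The sketch below supports only the syntax used in this pattern plus {..} exclusions. The function name prosite_to_regex is hypothetical, and the test fragment is the glucokinase region around residue 150 with one extra trailing residue (F) added as an assumption, so that the final [LF] position is covered.

```python
import re

# One pattern element: x, a residue, [allowed] or {excluded}, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


signature = ("[LIVM]-G-F-[TN]-F-S-[FY]-P-x(5)-[LIVM]-[DNST]-x(3)-"
             "[LIVM]-x(2)-W-T-K-x-[LF].")
regex = prosite_to_regex(signature)

# Glucokinase fragment; the final F is an assumed continuation of the sequence.
fragment = "HKKLPLGFTFSFPVRHEDIDKGILLNWTKGF"
hit = re.search(regex, fragment)
print(regex)
print(hit.start(), hit.group())
```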
Motifs and secondary structure
Pfam [Hexokinase_2 PF03727]
The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
Literature Databases
PubMed / MEDLINE
Database of citations and abstracts for biomedical literature
OMIM (Online Mendelian Inheritance in Man) [Glucokinase]
Catalog of human genes and genetic disorders with textual information and copious links to scientific literature
Google Scholar
CiteXplore
combines literature search with text mining tools for biology.
arXiv
Open access to 601,910 e-prints in Physics, Mathematics, Computer Science, ...
Taxonomy
UniProt taxonomy [Homo sapiens]
Organisms are classified in a hierarchical tree structure. In addition to manually verified organism names, external links, organism strains and viral host information are provided.
NCBI taxonomy [Homo sapiens]
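The hierarchical tree structure can be pictured as a parent-pointer table, where each taxon stores only its parent and a lineage is recovered by walking up to the root. The sketch below uses an abridged Homo sapiens lineage. The names PARENT and lineage are illustrative and do not reflect the schema of either taxonomy database.

```python
# Abridged lineage: each taxon points to its parent; the root has no entry.
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "Chordata",
    "Chordata": "Metazoa",
    "Metazoa": "Eukaryota",
}


def lineage(taxon):
    """Return the path from the root of the tree down to the given taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


print(" > ".join(lineage("Homo sapiens")))
```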
Chemical entities
ChEBI (Chemical Entities of Biological Interest), EBI
freely available dictionary of molecular entities focused on "small" chemical compounds
KEGG Compounds
KEGG COMPOUND is a chemical structure database for metabolic compounds and other chemical substances that are relevant to biological systems. Peptide entries in KEGG COMPOUND are designated with Peptide in the first Entry line
PubChem
α-D-glucose 6-phosphate CHEBI:17665 KEGG:C00668 PubChem:5958
Reactions
KEGG Reactions [R00299]
Rhea [17828]
Rhea is a freely available, manually annotated database of chemical reactions
Metabolic networks & pathways
KEGG Pathways [Glycolysis / Gluconeogenesis hsa]
MetaCyc (HumanCyc)
Reactome - a curated knowledgebase of biological pathways
III Sequence Alignment – Fragment search with BLAST
Sequence Alignment & BLAST
BLAST is an algorithm for comparing primary biological sequence information (amino-acid or nucleotide sequences)
BLAST is one of the most widely used bioinformatics programs
It compares a query sequence with a library or database of sequences, and identifies library sequences that resemble the query sequence above a certain threshold.
It addresses a fundamental problem, and the algorithm emphasizes speed over sensitivity, which makes it practical on the huge genome databases currently available
Variants
Nucleotide-nucleotide BLAST (blastn)
Protein-protein BLAST (blastp)
Nucleotide 6-frame translation-protein (blastx)
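The speed-over-sensitivity trade-off comes from looking up short exact words first instead of aligning everything against everything. The sketch below illustrates only that seeding idea and is not the BLAST algorithm itself: it skips neighbourhood words, hit extension and statistics. All function names, the word length k=3 and the "unrelated" library entry are assumptions, and the hexokinase 3 fragment is a reading of the BLAST result example.

```python
from collections import defaultdict


def build_word_index(library, k):
    """Map every length-k word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in library.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def count_word_hits(query, index, k):
    """Count exact word matches between the query and each library sequence."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for name, _ in index.get(query[pos:pos + k], ()):
            hits[name] += 1
    return dict(hits)


library = {
    "glucokinase_fragment": "LPLGFTFSFPVRHEDIDKGILLNWTKG",
    "hexokinase3_fragment": "LPLGFTFSFPCRQLGLDQGILLNWTKG",
    "unrelated": "MKKACDEEHHIIKKLLMMNNPPQQ",  # made-up sequence
}
index = build_word_index(library, k=3)
hits = count_word_hits("FTFSFPVRHEDID", index, k=3)

# Only sequences with enough word hits would go on to a full alignment.
candidates = sorted(name for name, count in hits.items() if count >= 4)
print(hits, candidates)
```

The index is built once per database, so each query only costs a number of dictionary lookups proportional to its own length.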
BLAST
To run, BLAST requires a query sequence to search for, and a sequence to search against (also called the target sequence) or a sequence database containing multiple such sequences.
Input: sequences in FASTA or GenBank format.
Output: graphical format showing the hits found, a table showing sequence identifiers for the hits with scoring data, as well as alignments for the sequence of interest and the hits received, with the corresponding BLAST scores.
NCBI - http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST Results
GENE ID: 3101 HK3 | hexokinase 3 (white cell) [Homo sapiens] (Over 10 PubMed links)
Sort alignments for this subject sequence by: E value, Score, Percent identity, Query start position, Subject start position
Score = 62.6 bits (140), Expect = 5e-09
Identities = 21/27 (77%), Positives = 23/27 (85%), Gaps = 0/27 (0%)
Query  4    LPLGFTFSFPVRHEDIDKGILLNWTKG  30
            LPLGFTFSFP R   +D+GILLNWTKG
Sbjct  602  LPLGFTFSFPCRQLGLDQGILLNWTKG  628
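The "Identities" figure in such a result can be recomputed directly from the two aligned rows. The sketch below assumes an ungapped alignment and takes the query and subject strings from the result above. The subject row is a reading of the hexokinase 3 hit, and the helper name identity is hypothetical.

```python
def identity(aligned_a, aligned_b):
    """Return (matches, alignment length, percent identity) for an ungapped alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return matches, len(aligned_a), 100.0 * matches / len(aligned_a)


query = "LPLGFTFSFPVRHEDIDKGILLNWTKG"  # glucokinase, residues 4-30 of the query
sbjct = "LPLGFTFSFPCRQLGLDQGILLNWTKG"  # hexokinase 3, residues 602-628

matches, length, percent = identity(query, sbjct)
# The percentage is truncated to a whole number, as in the report.
print(f"Identities = {matches}/{length} ({int(percent)}%)")
```

"Positives" additionally counts mismatches that score above zero in the substitution matrix, so it needs a scoring matrix and is not covered by this sketch.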
IV database design and implementation
Database Tools
Database design (Model building)
Determine the relationships between the different data elements.
Superimpose a logical structure upon the data on the basis of these relationships.
Scheme development (paper & pencil)
Scheme implementation and refinement (database designer like MicroOLAP DB Designer)
Relational database (Storage)
MySQL, PostgreSQL, SQLite
Interfaces (Access)
SQL queries
Administration tools (phpMyAdmin, phpPgAdmin)
Frameworks & web interfaces (Django (Python), Hibernate (Java))
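The step from data relationships to a relational scheme and an SQL query can be sketched with SQLite, which ships with Python. The two-table layout (a compound and its cross-references to other databases) is an assumed toy scheme for illustration only. The table and column names are hypothetical, and the identifiers are those listed for α-D-glucose 6-phosphate in the chemical entities example.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (
    compound_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE cross_reference (
    compound_id INTEGER NOT NULL REFERENCES compound(compound_id),
    db_name     TEXT NOT NULL,
    identifier  TEXT NOT NULL
);
""")

conn.execute("INSERT INTO compound VALUES (?, ?)", (1, "alpha-D-glucose 6-phosphate"))
conn.executemany(
    "INSERT INTO cross_reference VALUES (?, ?, ?)",
    [(1, "CHEBI", "17665"), (1, "KEGG", "C00668"), (1, "PubChem", "5958")],
)

# One-to-many relationship: a join lists every external identifier of a compound.
rows = conn.execute("""
    SELECT c.name, x.db_name, x.identifier
    FROM compound AS c
    JOIN cross_reference AS x ON x.compound_id = c.compound_id
    ORDER BY x.db_name
""").fetchall()
for row in rows:
    print(row)
```

Keeping cross-references in their own table lets a compound link to any number of external databases without changing the scheme.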
Thanks
Computational Systems Biochemistry group
Prof. Holzhütter & Michael Weidlich
Presentation available at http://www.charite.de/sysbio/people/koenig/
Sources
Biological databases
Nucleic Acids Research
2001 Per Kraulis - Databases in bioinformatics - Stockholm Bioinformatics Center, SBC, Lecture notes, http://www.avatar.se/molbioinfo2001/databases.html
Lim Yun Ping - Biological databases - National University of Singapore - www.s-star.org/downloads/tutorial/t1b.pdf
Klipp & Liebermeister - Systems Biology (Databases)
Wikipedia - http://en.wikipedia.org/wiki/Biological_database
Sequence Alignment & BLAST
Wikipedia - http://en.wikipedia.org/wiki/BLAST
2001 Per Kraulis - Sequence alignments - Stockholm Bioinformatics Center, SBC, Lecture notes, http://www.avatar.se/molbioinfo2001/multali.html