Ori-Finder2.0

Introduction to Ori-Finder 2


1. What is the replication origin?
2. The major features of replication origin in Archaea
3. Ori-Finder 2, identification of replication origins in archaea
4. What are the origin recognition boxes (ORB)?
5. What are the four disparity curves?
6. MEME, a tool for motif discovery and searching.
7. FIMO, scanning for occurrences of a given motif.
8. REPuter, fast computation of maximal repeats in complete genomes.
9. Performance evaluation.
1. What is the replication origin?
  • The replication origin (also called the origin of replication) is a specific sequence in the genome where DNA replication is initiated. This can either involve the replication of DNA in living organisms such as prokaryotes and eukaryotes, or that of DNA or RNA in viruses, such as double-stranded RNA viruses.
  • DNA replication may proceed from this point bidirectionally or unidirectionally. The specific structure of the origin of replication varies somewhat from species to species, but all share some common characteristics such as high AT content. The origin of replication binds the pre-replication complex, a protein complex that recognizes, unwinds, and begins to copy DNA. [From Wikipedia]
2. The major features of replication origin in Archaea
    Archaea are classified as a separate domain in the three-domain system and share features with both bacteria and eukaryotes. As in bacteria, archaeal oriCs are intergenic regions adjacent to genes encoding replication proteins, such as Cdc6, the origin recognition complex protein, WhiP and DNA primase. They contain an AT-rich unwinding element and several conserved repeats known as origin recognition boxes (ORBs). In some organisms, G-stretches are found at the ends of ORBs. In addition, some archaea use multiple oriCs to initiate DNA replication, as eukaryotes do.
3. Ori-Finder 2, identification of replication origins in archaea
    The online tool Ori-Finder 2 uses an integrated method to automatically predict replication origins in archaeal genomes. The method combines disparity analysis with the Z-curve method, the distribution of ORBs found with the FIMO tool, and the occurrence of genes that frequently lie close to replication origins. The web server can also analyze unannotated complete genomes using two embedded gene-finding programs, ZCURVE and Glimmer.

    Users can submit an annotated or unannotated genome sequence to the web server. For an annotated genome, we recommend submitting the sequence file in GenBank format, or uploading the sequence file in FASTA format together with its corresponding protein table (PTT) file. Unannotated genomes are analyzed by integrating two gene-finding programs, ZCURVE and Glimmer, for gene identification, and the BLAST program for functional annotation of the genes.

    All intergenic sequences are then scanned by FIMO to obtain the ORB sequences and by the REPuter program to identify repeats. Finally, all intergenic sequences that are adjacent to replication genes and contain ORB sequences are predicted as oriCs. Because this approach relies on prior knowledge of archaeal oriCs, it fails to identify oriCs adjacent to genes of unknown function that might nevertheless be involved in DNA replication. To overcome this drawback, intergenic sequences that contain more than three conserved motifs are also predicted as oriCs. BLAST searches are also performed against DoriC, a database of bacterial and archaeal replication origins, to find homologs.

    The conserved ORB motifs used by FIMO were obtained from the DoriC database: all DoriC records were organized into several clusters by taxonomy, and the conserved ORB motifs of each cluster were calculated with the MEME program. Each run of Ori-Finder 2 is assigned a job ID and takes several minutes to complete. Users can retrieve their job with the job ID, or be notified by email if an email address is specified on the submission page.

    More specifically, when users select the motif of Halobacteriaceae, Methanobacteriaceae, Methanomicrobia, or Methanococcaceae, intergenic sequences that are not adjacent to replication-related genes but contain more than two motifs are predicted as oriCs. For the motifs of Sulfolobaceae and Thermococcaceae, intergenic sequences that are not adjacent to replication-related genes but contain all three kinds of motifs are predicted as oriCs. This approach strikes a balance between accuracy and sensitivity.
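
    The selection rules described above can be sketched in a few lines of Python. This is only an illustration, not the server's implementation: the class `IntergenicRegion`, the function `is_predicted_oric`, and the motif identifiers are hypothetical names, and the sketch assumes that "more than two motifs" counts individual motif occurrences and that there are three motif kinds per taxonomic group. How the 'common' motif is treated for regions away from replication genes is not stated above, so the sketch simply rejects those.

```python
from dataclasses import dataclass, field

# Motif groups named in the text; the two rule sets below follow its wording.
COUNT_RULE_GROUPS = {"Halobacteriaceae", "Methanobacteriaceae",
                     "Methanomicrobia", "Methanococcaceae"}
ALL_MOTIFS_GROUPS = {"Sulfolobaceae", "Thermococcaceae"}


@dataclass
class IntergenicRegion:
    """Hypothetical record for one intergenic sequence."""
    start: int
    end: int
    next_to_replication_gene: bool
    motif_hits: list = field(default_factory=list)  # one motif ID per hit


def is_predicted_oric(region, motif_group, more_than=2, motif_kinds=3):
    """Apply the rules from the text to one region (assumptions in lead-in)."""
    if not region.motif_hits:
        return False                      # no ORB sequence, never an oriC
    if region.next_to_replication_gene:
        return True                       # adjacent to a replication gene + ORB
    if motif_group in COUNT_RULE_GROUPS:
        return len(region.motif_hits) > more_than
    if motif_group in ALL_MOTIFS_GROUPS:
        return len(set(region.motif_hits)) == motif_kinds
    return False                          # e.g. 'common': assumed, not sourced


near = IntergenicRegion(100, 900, True, ["m1"])
far_many = IntergenicRegion(5000, 5600, False, ["m1", "m1", "m2"])
far_all = IntergenicRegion(7000, 7800, False, ["m1", "m2", "m3"])

print(is_predicted_oric(near, "Halobacteriaceae"))
print(is_predicted_oric(far_many, "Halobacteriaceae"))
print(is_predicted_oric(far_many, "Sulfolobaceae"))
print(is_predicted_oric(far_all, "Sulfolobaceae"))
```

    Keeping the thresholds as parameters makes it easy to compare the "more than two" rule given here with the "more than three" rule mentioned earlier in this section.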

    The output webpage displays information about the genome, including its size, GC content, the locations of replication genes and the locations of oriCs, as an HTML table. It also presents detailed information on the repeats in oriCs identified by the REPuter program, the motifs recognized by FIMO, and the homologous oriCs in the DoriC database. The ORB sequences are listed in the corresponding oriC table with their matched sequence, location, strand, score and P-value. Finally, a PNG figure embedded in the output webpage is generated to display the RY, MK, GC or AT disparity curves, the replication proteins, and the oriCs.

    We recommend that users select the 'common' motif for the first prediction, because the common motif is the most conserved and most frequent sequence across all archaeal oriCs. Users can then rerun Ori-Finder 2 with the specific motif that matches the taxonomy of their archaeon. Comparing the two results helps users identify the appropriate oriCs.
4. What are the origin recognition boxes (ORB)?
    The origin recognition box (ORB) is a conserved DNA sequence found in multiple copies at oriCs. These sequences often act as enhancers for the oriC, and a G-rich inverted repeat is always located at the end of the ORB.

    The ORB motifs used in Ori-Finder 2 are obtained from the DoriC database. All DoriC records are classified into several clusters by taxonomy, such as Methanobacteriaceae, Methanomicrobia, Methanococcaceae, Sulfolobaceae and Thermococcaceae. The conserved motifs of each cluster are calculated with the MEME program. In addition, the most common motif is calculated from all the records in the DoriC database.

    FIMO uses a position-specific probability matrix (PSPM) to search for these motifs in oriCs. The motif logos shown on the webpage are generated by WebLogo. Users can download the PSPM file here.
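
    To give a feel for how a PSPM can be used to search a sequence, the following Python sketch converts a probability matrix into log-odds scores and slides it along a sequence. It illustrates the general idea only and is not FIMO's implementation: the function names, the toy three-column matrix, the uniform background of 0.25, the pseudocount and the score threshold are all assumptions, and the sketch scans one strand and does not compute the P-values that FIMO reports.

```python
import math

BASES = "ACGT"


def log_odds_matrix(pspm, background=0.25, pseudocount=0.01):
    """Turn per-position base probabilities into log2 odds scores."""
    matrix = []
    for column in pspm:
        total = sum(column[b] + pseudocount for b in BASES)
        matrix.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence, pspm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    pssm = log_odds_matrix(pspm)
    width = len(pssm)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows with ambiguous symbols such as N
        score = sum(pssm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Toy motif (not taken from DoriC) that strongly prefers the word GGA.
toy_pspm = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
]

for start, window, score in scan("TTGGATT", toy_pspm, threshold=3.0):
    print(start, window, round(score, 2))
```

    In practice the motifs are much longer than this toy example, and a match is judged by its P-value rather than by a fixed score cutoff.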
5. What are the four disparity curves?
  • The Z curve is a unique three-dimensional curve representation of a given DNA sequence, in the sense that the curve and the sequence can each be uniquely reconstructed from the other [Zhang and Zhang 1994].

  • The Z curve is composed of a series of nodes P0, P1, P2, ..., PN, whose coordinates xn, yn and zn (n=0,1,2,...,N, where N is the length of the DNA sequence being studied) are uniquely determined by the so-called Z-transform of the DNA sequence,

    \[
    \begin{aligned}
    x_n &= (A_n + G_n) - (C_n + T_n) \\
    y_n &= (A_n + C_n) - (G_n + T_n) \\
    z_n &= (A_n + T_n) - (C_n + G_n)
    \end{aligned}
    \]

    where An, Cn, Gn and Tn are the cumulative occurrence numbers of A, C, G and T, respectively, in the subsequence from the 1st base to the n-th base in the sequence. We define A0=C0=G0=T0=0.
    The Z curve is defined as the connection of the nodes P0, P1, P2, ..., PN one by one sequentially with straight lines. Note that we define x0=y0=z0=0 such that the Z curve always starts from the origin of the three-dimensional coordinate system.

  • The three components all have biological meanings.
    The x-component of a Z curve displays the distribution of purine/pyrimidine (R/Y) bases along the sequence;
    The y-component of a Z curve displays the distribution of amino/keto (M/K) bases along the sequence;
    The z-component of a Z curve displays the distribution of strong-H bond/weak-H bond (S/W) bases along the sequence.

  • The xn and yn components are termed the RY disparity and MK disparity curves, respectively. Similarly, the AT and GC disparity curves are defined by (xn + yn)/2 and (xn - yn)/2, which show the excess of A over T and of G over C, respectively, along the genome. The RY and MK disparity curves, as well as the AT and GC disparity curves, can be used to predict the oriCs of various genomes.

  • You can learn more about the Z curve and its applications by visiting the Z curve database.
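
    The four disparity curves can be computed directly from cumulative base counts. The Python sketch below assumes the standard Z-transform of Zhang and Zhang, x_n = (A_n + G_n) - (C_n + T_n), y_n = (A_n + C_n) - (G_n + T_n) and z_n = (A_n + T_n) - (C_n + G_n); the function names are hypothetical, and symbols other than A, C, G and T are assumed to leave the counts unchanged.

```python
def z_curve(sequence):
    """Return the x, y, z coordinate lists (length N + 1, starting at 0)."""
    a = c = g = t = 0
    xs, ys, zs = [0], [0], [0]            # node P0 sits at the origin
    for base in sequence.upper():
        if base == "A":
            a += 1
        elif base == "C":
            c += 1
        elif base == "G":
            g += 1
        elif base == "T":
            t += 1
        # any other symbol (e.g. N) is ignored in the counts
        xs.append((a + g) - (c + t))      # purine minus pyrimidine (R/Y)
        ys.append((a + c) - (g + t))      # amino minus keto (M/K)
        zs.append((a + t) - (c + g))      # weak minus strong H-bond (W/S)
    return xs, ys, zs


def disparity_curves(sequence):
    """Return the RY, MK, AT and GC disparity curves as lists."""
    xs, ys, _ = z_curve(sequence)
    return {
        "RY": xs,
        "MK": ys,
        # x + y = 2(A - T) and x - y = 2(G - C), so the division is exact
        "AT": [(x + y) // 2 for x, y in zip(xs, ys)],
        "GC": [(x - y) // 2 for x, y in zip(xs, ys)],
    }


curves = disparity_curves("AACGT")
for name, values in curves.items():
    print(name, values)
```

    For a whole genome these lists are plotted against the position n, and the extreme points of the curves are what is inspected when looking for candidate origins.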
6. MEME, a tool for motif discovery and searching.
    MEME is a tool for discovering motifs in a group of related DNA or protein sequences [Bailey, Boden et al. 2009]. In Ori-Finder 2, the ORB motifs are identified by the MEME program: we classified the records of the DoriC database into several clusters by taxonomy and used MEME to search for the conserved motifs in each cluster.
    Command: meme <File> -dna -mod anr -maxw 20 -minw 10 -revcomp -maxsize 200000 -oc all -nmotifs 3
    See http://meme.nbcr.net/meme/doc/meme.html for more information.
7. FIMO, a tool used to scan for occurrences of a given motif.
8. REPuter, a tool for fast computation of maximal repeats in complete genomes.
9. Performance evaluation.

    To evaluate the performance of Ori-Finder 2, we predicted the oriCs of 13 annotated chromosomes for which all or some of the oriCs have been confirmed experimentally. The tables below present the results.

    For the annotated genome sequences, the number of positive cases (i.e., oriCs confirmed by experimental methods) is 20, and the number of negative cases (i.e., intergenic sequences other than these experimentally confirmed oriCs) is 19,219. Consequently, the sensitivity, specificity, and precision of Ori-Finder 2 are 65.0%, 99.9%, and 44.8%, respectively.

    One reason for the low precision is that only experimentally confirmed oriCs were used as positive cases. Some archaea have multiple oriCs, but not all of them currently have experimental evidence, and oriCs that contain ORB sequences may well be able to initiate DNA replication. If oriCs identified in silico by other groups are also counted as positive cases, the precision increases to 62.1%.

    Another reason is that some of the predicted oriCs occur in pairs flanking a single replication gene, where only one member of the pair has been confirmed experimentally. The other oriC may also be able to initiate DNA replication.

Table I. Formulas for sensitivity, specificity and precision (from Wikipedia).
Condition is as determined by the "gold standard".
                      | Condition positive                               | Condition negative                               |
Test outcome positive | True positive                                    | False positive (Type I error)                    | Precision = Σ True positive / Σ Test outcome positive
Test outcome negative | False negative (Type II error)                   | True negative                                    | Negative predictive value = Σ True negative / Σ Test outcome negative
                      | Sensitivity = Σ True positive / Σ Condition positive | Specificity = Σ True negative / Σ Condition negative | Accuracy
Table II. Prediction results for 13 annotated archaeal genomes.
oriCs confirmed in vivo | DoriC result (columns 2-4) | Annotated genome prediction result (columns 5-9)
Organism | Link | Condition positive | Condition negative | Link | Test outcome positive | True positive | Test outcome negative | True negative
Aeropyrum pernix K1 | NC_000854 | 2 | 1378 | NC_000854 | 2 | 1 | 1378 | 1377
Pyrococcus abyssi GE5 | NC_000868 | 1 | 1145 | NC_000868 | 1 | 1 | 1145 | 1145
Methanothermobacter thermautotrophicus str. Delta H chromosome | NC_000916 | 1 | 1507 | NC_000916 | 2 | 1 | 1506 | 1506
Archaeoglobus fulgidus DSM 4304 | NC_000917 | 1 | 1501 | NC_000917 | 1 | 0 | 1501 | 1500
Pyrococcus horikoshii OT3 | NC_000961 | 1 | 1170 | NC_000961 | 1 | 1 | 1170 | 1170
Halobacterium sp. NRC-1 | NC_002607 | 1 | 1652 | NC_002607 | 4 | 1 | 1649 | 1649
Pyrococcus furiosus DSM 3638 | NC_003413 | 1 | 1367 | NC_003413 | 1 | 1 | 1367 | 1367
Hyperthermus butylicus DSM 5456 | NC_008818 | 2 | 1328 | NC_008818 | 1 | 1 | 1329 | 1328
Pyrobaculum calidifontis JCM 11548 | NC_009073 | 1 | 1373 | NC_009073 | 1 | 0 | 1373 | 1372
Haloferax volcanii DS2 | NC_013967 | 3 | 2496 | NC_013967 | 6 | 3 | 2492 | 2492
Haloarcula hispanica ATCC 33960 chromosome II | NC_015943 | 2 | 379 | NC_015943 | 7 | 1 | 374 | 373
Haloarcula hispanica ATCC 33960 chromosome I | NC_015948 | 3 | 2537 | NC_015948 | 1 | 1 | 2539 | 2537
Nitrosopumilus maritimus SCM1 | NC_010085 | 1 | 1386 | NC_010085 | 1 | 1 | 1386 | 1386
Total | - | 20 | 19219 | - | 29 | 13 | 19209 | 19202
Sensitivity = 13/20 = 65%
Specificity = 19202/19219 = 99.9%
Precision = 13/29 = 44.8%
Negative predictive value = 19202/19209 = 99.9%
 
oriCs confirmed in vivo or in silico | DoriC result (columns 2-4) | Annotated genome prediction result (columns 5-9)
Organism | Link | Condition positive | Condition negative | Link | Test outcome positive | True positive | Test outcome negative | True negative
Aeropyrum pernix K1 | NC_000854 | 2 | 1378 | NC_000854 | 2 | 1 | 1378 | 1377
Pyrococcus abyssi GE5 | NC_000868 | 1 | 1145 | NC_000868 | 1 | 1 | 1145 | 1145
Methanothermobacter thermautotrophicus str. Delta H chromosome | NC_000916 | 1 | 1507 | NC_000916 | 2 | 1 | 1506 | 1506
Archaeoglobus fulgidus DSM 4304 | NC_000917 | 1 | 1501 | NC_000917 | 1 | 0 | 1501 | 1500
Pyrococcus horikoshii OT3 | NC_000961 | 1 | 1170 | NC_000961 | 1 | 1 | 1170 | 1170
Halobacterium sp. NRC-1 | NC_002607 | 2 | 1651 | NC_002607 | 4 | 2 | 1649 | 1649
Pyrococcus furiosus DSM 3638 | NC_003413 | 1 | 1367 | NC_003413 | 1 | 1 | 1367 | 1367
Hyperthermus butylicus DSM 5456 | NC_008818 | 2 | 1328 | NC_008818 | 1 | 1 | 1329 | 1328
Pyrobaculum calidifontis JCM 11548 | NC_009073 | 1 | 1373 | NC_009073 | 1 | 0 | 1373 | 1372
Haloferax volcanii DS2 | NC_013967 | 5 | 2493 | NC_013967 | 6 | 5 | 2492 | 2492
Haloarcula hispanica ATCC 33960 chromosome II | NC_015943 | 4 | 377 | NC_015943 | 7 | 3 | 374 | 373
Haloarcula hispanica ATCC 33960 chromosome I | NC_015948 | 5 | 2535 | NC_015948 | 1 | 1 | 2539 | 2535
Nitrosopumilus maritimus SCM1 | NC_010085 | 1 | 1386 | NC_010085 | 1 | 1 | 1386 | 1386
Total | - | 27 | 19211 | - | 29 | 18 | 19209 | 19200
Sensitivity = 18/27 = 66.7%
Specificity = 19200/19211 = 99.9%
Precision = 18/29 = 62.1%
Negative predictive value = 19200/19209 = 99.9%
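
The summary figures under both parts of Table II follow from the formulas in Table I and can be rechecked with a short Python sketch. The function name `evaluation_metrics` is hypothetical; the inputs are the "Total" values of the two tables (using the true-negative totals 19202 and 19200 that appear in the formulas above).

```python
def evaluation_metrics(cond_pos, cond_neg, test_pos, true_pos,
                       test_neg, true_neg):
    """Compute the four ratios of Table I from the table totals."""
    return {
        "sensitivity": true_pos / cond_pos,
        "specificity": true_neg / cond_neg,
        "precision": true_pos / test_pos,
        "negative predictive value": true_neg / test_neg,
    }


# Totals for oriCs confirmed in vivo
in_vivo = evaluation_metrics(20, 19219, 29, 13, 19209, 19202)
# Totals for oriCs confirmed in vivo or in silico
in_vivo_or_in_silico = evaluation_metrics(27, 19211, 29, 18, 19209, 19200)

for label, metrics in [("in vivo", in_vivo),
                       ("in vivo or in silico", in_vivo_or_in_silico)]:
    print(label)
    for name, value in metrics.items():
        print(f"  {name}: {value:.2%}")
```

Printed to two decimals, 19202/19209 comes out at about 99.96%, which the table reports as 99.9%.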