The <mds_ies_db> is a dynamic, interactive, and mobile-friendly database that features an assortment of searches and visualizations. This document serves as a user's manual for searching the database and controlling the interactive displays.
This manual is organized as follows:
- Search - contains information about the navigation bar quick searches, and the search tab advanced search forms.
- Display - contains information about the contig display pages that show the association of the MAC and MIC genomes with MDS, IES, and pointer annotations and chord diagram visualizations.
- Data - contains information about data processing methods and internal naming convention for stored sequences.
- Technical Notes - contains information about the database architecture and the software libraries used to create the website.
This database showcases annotations for features of ciliate genome rearrangement, namely macronuclear destined sequences (MDSs), internal eliminated sequences (IESs), and pointer sequences. Additionally, the <mds_ies_db> contains genome and proteome assemblies, and their corresponding gene annotations, for the macronucleus (MAC) and micronucleus (MIC) of Oxytricha trifallax, Tetrahymena thermophila, and other ciliates.
The Quick Search navigation searches provide a method to quickly access the data of a contig by directly inputting its name or alias. The Contig Search, Gene Search, and Sequence Search advanced searches return contig and gene names that satisfy the criteria specified in the search form. The sections below describe the navigation and advanced searches in greater detail.
In addition to navigation searches, several custom searches are also available. The advanced search forms can be used without knowledge of contig and gene ID numbers, so we recommend these searches to our first-time visitors.
<mds_ies_db> contains multiple annotations for the different organism/assemblies that are hosted. In order to search for detailed annotation information about a contig you must first select the database to search against in the Database panel. The Database panel has three fields: Database drop down, Parameters dialog box, and Description text field. The Database drop down allows you to select one of the available organism databases. The Parameters button opens up a dialog box that shows you the parameters that were used to annotate the selected database. The Description text box gives you a brief description of the selected database.
The Contig Search can be used to browse through all stored contigs and scaffolds, and filter by chromosomal properties (e.g. nucleus type and length). The search form contains filtering options at the top of the page, and displays the search results at the bottom of the page. There are four filter panels: Contig, Contig Attributes, Feature Count, and Arrangement Count. By default, the Feature Count and Arrangement Count filter panels are initially hidden, but they can be shown by clicking on either the "Feature Count" or "Arrangement Count" buttons found at the top of the Contig Search panel.
Contig: The Contig filter panel consists of two fields: the Name field and the Nucleus checkboxes. The Name field allows you to filter for sequences by name. The Nucleus checkboxes allow you to filter for sequences that come from the MAC, the MIC, or both.
Contig Attributes: The contig attributes filter panel consists of two fields: Length field and Telomeres checkboxes. The Length field allows you to filter for sequences according to their length in base pairs. There is a text box for specifying a number and a drop down list to choose whether the length is at least (≥) the specified number, is at most (≤) the specified number, or is equal (=) to the specified number. The Telomeres checkboxes allow you to specify whether the desired sequences should have both 5′ and 3′ telomeres, only 5′ telomere, only 3′ telomere, or no telomeres at all.
Feature Count: The Feature Count panel consists of five count fields (Genes, Hits, MDSs, IESs, Pointers), each of which limits the number of those features on the contig. Using the drop down list, you can specify the exact (=), maximum (≤), or minimum (≥) number of features the contig can have. Genes is the number of genes on the contig being searched for; Hits is the number of contigs on the opposite nucleus with which the contig shares MDSs; MDSs is the total number of MDSs the contig shares with all contigs on the opposite nucleus; IESs is the total number of IESs the contig shares with all contigs on the opposite nucleus (non-zero only for MIC contigs); Pointers is the total number of pointers the contig shares with all contigs on the opposite nucleus.
Arrangement Count: The Arrangement Count panel is similar to the Feature Count panel. It allows you to specify the exact (=), maximum (≤), or minimum (≥) number of arrangements with a certain property that the contig being searched for has with all hits on the opposite nucleus. The "Yes/No" drop down allows you to limit either the number of arrangements with the property or the number of arrangements without the property. For a detailed description of the mathematical definition of each property refer to the SDRAP documentation's Augmenting annotation section here.
The filtered results can be ordered by one of the columns: Nucleus, Name, Length, Telomeres, Genes, Hits, MDS, etc. Clicking on a column title orders the results in ascending (or alphabetical) order according to the column selected, and clicking the column title again orders the results in descending (or reverse alphabetical) order. Users can also specify the number of entries shown by clicking on the "Show" drop down list, and selecting the number of displayed results per page. Clicking on the desired contig name links to the contig information page. Additional data columns can be made visible by clicking on the "Column visibility" button.
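The filter and ordering behavior described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not the code the website runs: the `Contig` record, its field names, the substring match on the name, the ASCII operator keys (standing in for the ≥, ≤, and = choices), and all sample lengths are assumptions made for the example.

```python
import operator
from dataclasses import dataclass

# ASCII stand-ins for the three choices in the Length drop down list.
COMPARATORS = {">=": operator.ge, "<=": operator.le, "=": operator.eq}


@dataclass(frozen=True)
class Contig:
    """Hypothetical record; field names are illustrative, not the real schema."""
    name: str
    nucleus: str          # "MAC" or "MIC"
    length: int           # length in base pairs
    telomeres: frozenset  # subset of {"5'", "3'"}


def filter_contigs(contigs, name=None, nuclei=("MAC", "MIC"),
                   length_op=">=", length=0, telomeres=None):
    """Return the contigs that pass every active filter."""
    compare = COMPARATORS[length_op]
    kept = []
    for contig in contigs:
        if name is not None and name not in contig.name:
            continue  # assumed: the Name field matches substrings
        if contig.nucleus not in nuclei:
            continue
        if not compare(contig.length, length):
            continue
        if telomeres is not None and contig.telomeres != frozenset(telomeres):
            continue  # telomere set must match exactly (both, one, or none)
        kept.append(contig)
    return kept


def sort_contigs(contigs, column="name", descending=False):
    """Order results by one column, as clicking a column title does."""
    return sorted(contigs, key=lambda c: getattr(c, column), reverse=descending)


# Made-up sample data for illustration only.
example_contigs = [
    Contig("OXYTRI_MAC_1001", "MAC", 3200, frozenset({"5'", "3'"})),
    Contig("OXYTRI_MAC_1002", "MAC", 900, frozenset({"5'"})),
    Contig("OXYTRI_MIC_2001", "MIC", 51000, frozenset()),
]
long_mac = filter_contigs(example_contigs, nuclei=("MAC",), length_op=">=", length=1000)
```

Here `long_mac` holds only the MAC contig of at least 1000 base pairs, which corresponds to checking the MAC box and choosing "≥ 1000" in the Length field.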
Click the "Search" button below to go to the Contig Search form.
The Gene Search is helpful for finding genes that are present on either a MAC or MIC contig. This search form contains all the filtering options that are included in the Contig Search, which allows you to filter by properties of the contig that contains the gene being searched for. Please refer to the Contig Search section for these options. There are two additional panels specific to the Gene Search: Gene and Gene Attributes.
Gene: The Gene filter panel consists of one field, Name, which allows you to filter for gene names containing a specific string.
Gene Attributes: The Gene Attributes panel consists of two fields: Description and Length. The Description field allows you to filter for genes containing a specific string in their description (e.g. "kinase"). The Length field allows you to filter for sequences according to their length in base pairs. There is a text box for specifying a number and a drop down list to choose whether the length is at least ("≥") the specified number, is at most ("≤") the specified number, or is equal ("=") to the specified number.
Search results can be ordered by one of the following columns: Gene Name, Gene Source, Gene Length, Gene Note, or Contig Nucleus. Click on the column name to order results in ascending (or alphabetical) order. Click on the same column name again to order results in the descending (or reverse alphabetical) order. You can also specify the number of displayed entries per page by clicking on the "Show" drop down list and selecting a value. Click on the desired gene to open a detailed information page about it.
Click the "Search" button below to go to the Gene Search window.
This search is designed to interface with BLAST to match a provided sequence of nucleotides or peptides against the sequences stored in the database. The search interface consists of the BLAST Database panel for selecting a database to search against, a text field for inserting the query sequence to search for, and an Advanced Settings panel. The latter panel is minimized by default. Click on the "Advanced settings" link to maximize the Advanced Settings panel.
BLAST Database: There are three drop down lists: Organism, Assembly, and Program. The Organism drop down allows you to select the organism being queried. The Assembly drop down allows you to specify the assembly (i.e. the nucleus and version) of the organism being queried. The Program drop down allows you to specify which BLAST program is to be used. Note that the choice of program determines whether the input will be regarded as an amino acid or DNA sequence, and whether the database being queried is the amino acid or DNA version of the assembly. To learn more about the different programs please refer to the BLAST website. Select the checkbox labelled "Only allow assemblies compatible with the selected annotation database" if, once the search results are returned, you wish to click on a contig/gene to open its detailed information page for the annotation database selected in the Database panel (described above).
Advanced settings: The Advanced settings panel allows you to specify additional BLAST parameters. For more details please refer to the BLAST website.
Sequence: The Sequence text box allows you to enter the query sequences that will be searched for in the database. A single sequence can be entered plainly, or multiple sequences can be entered in FASTA format.
Once all search parameters are specified and the sequence is inserted into the Sequence text box, press the "BLAST" button to run the search.
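To make the two accepted input styles concrete, the sketch below splits the content of the Sequence text box into named records. It is a minimal illustration under stated assumptions, not the parser used by the website: the function name `parse_query` and the placeholder name "query" for an unnamed sequence are hypothetical.

```python
def parse_query(text):
    """Split pasted text into (name, sequence) pairs.

    Plain input without a '>' header is treated as one unnamed sequence;
    in FASTA input every line starting with '>' opens a new record.
    """
    records = []             # finished (name, sequence) pairs
    name, chunks = None, []  # record currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None or chunks:
                records.append((name or "query", "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else "query"  # first word is the name
            chunks = []
        else:
            chunks.append(line.replace(" ", ""))
    if name is not None or chunks:
        records.append((name or "query", "".join(chunks)))
    return records


plain = parse_query("ACGT\nTTGA")
multi = parse_query(">seq1 first query\nACG\nT\n>seq2\nMKV")
```

A plain entry yields a single record whose lines are joined, while the FASTA entry yields one record per header line.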
Search Overview panel
The Search Overview panel gives you a graphical summary of the alignments. There are three options in this panel: Reference, Query, and Hit. The Reference checkbox can be set to "Query" or "Hit", which lets you see where the alignments occurred on either the query (the entered sequence) or the hit (the sequence found in the BLAST database). The Query dropdown selects which of the query sequences (if there are multiple) is used as the reference when the Reference checkbox is set to "Query", and controls which query sequences are displayed in the overview when the Reference checkbox is set to "Hit". The Hit dropdown selects which of the hit sequences (if there are multiple) is used as the reference when the Reference checkbox is set to "Hit", and controls which hit sequences are displayed in the overview when the Reference checkbox is set to "Query". Under the options is a color scale that indicates the score of each alignment as determined by BLAST: dark gray for < 40, purple for 40-49, blue for 50-79, green for 80-199, and red for ≥ 200. Under the color scale is the scale bar, which shows the base pair positions relative to the hit or query sequence being used as the reference. Finally, under the scale bar are thin segments, each representing a hit or query sequence that the BLAST search aligned with the reference sequence. Click a segment to navigate to the corresponding position in the results table described below.
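The color scale can be written as a small lookup function. The thresholds come from the description above; the function name `score_color` and the treatment of fractional scores (for instance, 49.5 falling in the 40-49 bin) are assumptions of this sketch.

```python
def score_color(score):
    """Return the overview color for a BLAST alignment score."""
    if score < 40:
        return "dark gray"
    if score < 50:   # the 40-49 bin
        return "purple"
    if score < 80:   # the 50-79 bin
        return "blue"
    if score < 200:  # the 80-199 bin
        return "green"
    return "red"     # 200 and above
```

Checking the bins from lowest to highest keeps each test a single comparison and makes the boundaries easy to compare against the legend.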
Search Results Table
The results panel appears at the bottom with all the contigs/scaffolds that matched the query. As with all other search results, you can sort them by one of the shown columns: Query, Hit, Length, Evalue, Bitscore, or % Identity. Click on the column name once to order results in ascending (or alphabetical) order; click on it again to order results in descending (or reverse alphabetical) order. Click on the green "+" sign to view additional details such as the visual alignment between the query and hit sequences. Click on the desired Hit name to open detailed information about it.
Click the "Search" button below to go to the Sequence Search window.
Once the desired sequence is found, the sequence page can be opened to read further details about it. The sequence information page contains the Genome Browser, Chord Diagram window, Properties / MDS / Pointer / Arrangements Tables, Downloads window, and Information Sections.
On the top of the sequence information page is the Genoverse genome viewer display. This display is divided into four tracks: Sequence, Transcript, MDS/IES/Pointer, and MDS/IES/Pointer Legend.
The Sequence track shows the nucleotide sequence of the reference contig.
The Transcript track shows each of the gene transcripts of the reference contig. Exons of the transcript are represented by a thick red bar. Introns are represented by a thin line connecting exons (or before the first exon and after the last exon). Each feature can be clicked to display more information about it in a popup.
The MDS/IES/Pointer track shows the MDSs, IESs, and pointers that the reference contig shares with the matching contigs on the opposite nucleus. Each row in this track contains the features that correspond to a single contig on the opposite nucleus. For example, if the reference contig is a MAC contig, each row will correspond to all the MDSs and pointers that the MAC contig shares with a single MIC contig. Each row has a different base color to make it easier to tell the rows apart. MDSs have a lighter shade and pointers have a darker shade of this color. On MIC contigs, IESs are always displayed in bright red regardless of the row. The MDS features are labelled numerically by their order on the corresponding MAC contig (from 5' to 3'). A negative index indicates that the mapping from MAC to MIC inverts the orientation of the segment. Each feature can be clicked to display more information about it in a popup.
The MDS/IES/Pointer Legend track shows the contig names on the opposite nucleus that are currently being viewed in the MDS/IES/Pointer track. The colored boxes next to the names are the same color used in the corresponding row of the MDS/IES/Pointer track. Each name or colored box can be clicked to display more information about the contig in a popup.
The dark grey vertical bars at the 5' or 3' ends of the viewer indicate a degenerate end (nucleotides beyond a telomere) and light grey vertical bars indicate a telomere. The gear-shaped icon on the right of each track can be clicked to reveal more controls that enable/disable the legend (MDS/IES/Pointer track only), enable/disable the labels (Transcript and MDS/IES/Pointer tracks only), display a short description of the track, and remove the track from the display. Tracks can also be hidden or shown by clicking on the "Tracks" button in the top left corner and using the opened menu. The viewer is hyperlinked with other pages on the website, so anywhere a contig or locus name occurs you can usually click on it to navigate to the corresponding information page. The mouse scroll wheel can be used to zoom in and out of nucleotide regions. The entire view can be scrolled by dragging in an empty area of the viewer or using the navigation bar at the top. Shift-click and drag to zoom in on a region. Additional navigation controls can be selected from the panel on the right. For more information on how to use the Genoverse genome viewer, please refer to this tutorial.
A chord diagram (also known as a Circos plot) is a way to visually represent sequence alignment information. The selected MAC/MIC sequence and all matching MIC/MAC sequences on the opposite nucleus are placed on a circle. The matching nucleotide segments of MAC and MIC are connected with arcs. Different MAC to MIC matches are connected with arcs of different colors.
This allows you to see what regions of the MIC are mapped to what regions of MAC and vice versa. To focus on only one set of MIC to MAC (or vice versa) arcs, hover your mouse over the circle segment belonging to the desired MIC/MAC arcs to decrease the opacity of all other arcs. To select which hits are displayed use the "Contigs" dropdown. To display the same visual information in the chord diagram in a linear format instead, select the "Line Diagram" tab at the top of the panel which has the same interface features as the "Chord Diagram" tab.
To see the chord diagrams, click on the "Chord Diagram" button under the genome browser.
Properties / MDS / Pointer / IES / Arrangements Tables
The Properties Table displays information about the mathematical properties of the rearrangement maps between the MAC/MIC contig being viewed and the contigs on the opposite nucleus that have matching MDSs. These properties have been computed using the annotation software SDRAP. Please refer to the SDRAP documentation here and the publication (currently in review) for the precise definitions. Click the "Properties Table" button under the genome viewer to view this table.
The MDS Table displays information about the MDSs (Macronuclear Destined Sequences) between the MAC/MIC contig being viewed and the matching contigs on the opposite nucleus. These MDSs have been obtained using the annotation software SDRAP. Please refer to the SDRAP documentation here and the publication (currently in review) for more details. Click the "MDS Table" button under the genome viewer to view this table.
The Pointer Table displays information about the pointers between the MAC/MIC contig being viewed and the contigs on the opposite nucleus that have matching MDSs. These pointers have been obtained using the annotation software SDRAP. Please refer to the SDRAP documentation here and the publication (currently in review) for more details. Click the "Pointer Table" button under the genome viewer to view this table.
The IES Table displays information about the IESs (Internal Eliminated Sequences) between the MIC contig being viewed and the contigs on the MAC nucleus that have matching MDSs. Note that this table is only available for MIC contigs, since IESs exist only on MIC contigs. These IESs have been obtained by post-processing the output of the annotation software SDRAP. The IESs displayed are sometimes referred to as Strict IESs, and are regions between a pair of consecutive MDSs from the same MAC/MIC pair that do not intersect any MDSs of any other MIC/MAC pair. Click the "IES Table" button under the genome viewer to view this table.
The Arrangements Table displays information about the rearrangement map between the MAC/MIC contig being viewed and the contigs on the opposite nucleus that have matching MDSs. For example, an arrangement map is a sequence such as: M3 M2 M1. The M3 indicates that the MDS in position 3 on the MAC contig appears in position 1 on the MIC contig, the M2 indicates that the MDS in position 2 on the MAC contig appears inverted in position 2 on the MIC contig, and the M1 indicates that the MDS in position 1 on the MAC contig appears in position 3 on the MIC contig. These arrangements have been obtained using the output of the annotation software SDRAP. Please refer to the SDRAP documentation here and the publication (currently in review) for more details. Click the "Arrangement Table" button under the genome viewer display to view this table.
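The reading of an arrangement map described above can be sketched as follows. The encoding is an assumption made for this example: the map is given as a list of signed integers in MIC order, with a negative sign marking an inverted MDS, mirroring the negative-index labels used in the genome viewer. The function name `describe_arrangement` is hypothetical and is not part of SDRAP or the website.

```python
def describe_arrangement(arrangement):
    """Map each MAC MDS index to (position on the MIC contig, inverted?).

    `arrangement` lists signed MDS indices in the order they occur on the
    MIC contig, e.g. [3, -2, 1] (assumed encoding, see the text above).
    """
    placement = {}
    for mic_position, signed_index in enumerate(arrangement, start=1):
        if signed_index == 0:
            raise ValueError("MDS indices start at 1")
        placement[abs(signed_index)] = (mic_position, signed_index < 0)
    return placement


# MDS 3 comes first on the MIC, MDS 2 is inverted in the middle, MDS 1 is last.
example = describe_arrangement([3, -2, 1])
```

For the example map, `example[2]` is `(2, True)`: the second MAC MDS sits in position 2 on the MIC contig and is inverted.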
Features common to all tables
There are seven buttons common to every table: Select All, Deselect All, Copy, Excel, CSV, PDF, and Column Visibility. Select All selects all the matching contigs on the opposite nucleus to be viewed in the browser. Deselect All selects none of the matching contigs on the opposite nucleus to be viewed in the browser. Copy copies the viewed data to the clipboard. Excel downloads the viewed data in XLSX format. CSV downloads the viewed data in CSV format. PDF downloads the viewed data in PDF format. Column Visibility allows you to select or deselect columns to be displayed in the table. Note that selecting or deselecting a contig using the checkboxes in the table causes the corresponding contig to be added to or removed from the genome viewer. To sort the table by a column, click on the up/down arrows next to the column name. To search for a matching contig by name, use the search bar at the top left. To view more rows of the table, use the page navigation buttons along the bottom right of the table. If not all columns are visible, use the scroll bar at the bottom to scroll horizontally. To close the window, click on the "Close" button at the bottom right corner of the window or the "x" at the top right corner of the window.
The Download Data window allows you to download sequences, annotations, and other information for the currently displayed contig. There are three download categories: Sequences, Annotations, and Other. Click on the "Downloads" button under the Genoverse display to open the download window.
Sequences: At the top of the download window, there is a Sequences field that consists of Nucleotide and Protein checkboxes and a Format field. Check Nucleotide and/or Protein to download the corresponding sequences. The only format for this category is FASTA, and it is selected by default.
Annotations: This category includes Genes, MDSs, and Telomeres check boxes. The data can be downloaded in GFF3 or BED formats.
Other: This category contains RNA Expressions and MIC Arrangements check boxes. The data can be downloaded in CSV or XLSX format.
Download: Once all desired information is checked, click the "Download" button to download the data as a zip archive.
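When choosing between the two annotation formats, keep in mind that they use different coordinate conventions: GFF3 positions are 1-based with an inclusive end, while BED positions are 0-based with an exclusive end. The sketch below converts one GFF3 feature line to a six-column BED line to show the difference. It is an illustration only and does not claim to reproduce the files produced by the download window; the function name `gff3_to_bed`, the use of the feature type as the BED name, the fixed score of 0, and the sample line are all assumptions.

```python
def gff3_to_bed(gff3_line):
    """Convert one GFF3 feature line to a six-column BED line."""
    cols = gff3_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has nine tab-separated columns")
    seqid, _source, feature_type, start, end, _score, strand, _phase, _attrs = cols
    bed_start = int(start) - 1  # GFF3 start is 1-based, BED start is 0-based
    bed_end = int(end)          # GFF3 end is inclusive, BED end is exclusive
    bed_strand = strand if strand in ("+", "-") else "."
    # Feature type stands in for the BED name; the score is a placeholder.
    return "\t".join([seqid, str(bed_start), str(bed_end),
                      feature_type, "0", bed_strand])


# Hypothetical feature line, for illustration only.
line = "OXYTRI_MAC_1001\texample\tMDS\t101\t250\t.\t+\t.\tID=mds1"
bed = gff3_to_bed(line)
```

The feature spans 150 bases in both representations: positions 101 to 250 in GFF3 become the interval from 100 up to but not including 250 in BED.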
Under the genome viewer and table buttons there are different information fields related to the displayed sequence.
The DNA Information field provides the length of the sequence (in nucleotides), information about telomeres (for MAC contigs), and the nucleotide sequence in text format (click the button).
The MDS Information field shows the number of MIC/MAC hits on the opposite nucleus, the MDS count, the pointer count, and the IES count (if viewing a MIC contig).
The Cross References section has two subsections: External Databases and Variants. The External Databases subsection provides links to other databases with information about the contig being viewed, such as OxyDB, GenBank, and the 2015 version of <mds_ies_db>. The Variants subsection contains links to other contigs on <mds_ies_db> that are known to be variants or isoforms of the contig being viewed.
The Gene Information section contains a table of all genes that are present on the displayed sequence. To view the transcripts of each gene click on the green "+" symbol on the corresponding gene row. To view additional features of each transcript such as exons, introns, CDS, etc. click on the green "+" symbol on the corresponding transcript row. It is possible to filter for a particular gene name or gene description by typing text into the Search field. To view the DNA and/or protein sequence of each feature in the table click on the button in the corresponding row.
This section describes the sources of data for the <mds_ies_db>.
The MDS-IES annotation comes from the MDS/IES annotation software SDRAP, a free and open source program developed by Jasper Braun and collaborators at the USF Math-Bio Lab. The annotation process consists of BLASTing MAC contigs/scaffolds against MIC contigs/scaffolds and using the high-scoring segment pair (HSP) information to identify MDSs on the MAC and MIC. Using the MIC's MDS information, additional processing is done to identify IESs for each MIC contig. Besides the MDS-IES annotation, SDRAP also produces MAC telomeric sequence information and MDS arrangement pattern information for the MIC. Both types of information are currently stored in the <mds_ies_db>. For more details on the SDRAP algorithm please refer to the GitHub page here.
Note: This section describes the contig naming schemes for the 2016 version of <mds_ies_db> which is no longer being used in the most recent update.
The <mds_ies_db> assigns its own name to every sequence that is stored in the database. The naming convention is described as follows:
- 6 uppercase letters that are related to the organism name (ex. Oxytricha trifallax - OXYTRI)
- Underscore symbol "_"
- "MAC" or "MIC" string to indicate whether the sequence belongs to the MAC or the MIC nucleus
- Underscore symbol "_"
- Unique number that is assigned to the sequence
An example of an assigned name for a MAC contig of Oxytricha trifallax is OXYTRI_MAC_1001, and for a MIC contig of Tetrahymena thermophila is TTHERM_MIC_1464.
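The naming convention above translates directly into a regular expression. The following sketch validates a name and splits it into its three parts; the names `NAME_PATTERN` and `parse_contig_name` are hypothetical helpers for this example, not part of the website.

```python
import re

# Six uppercase letters, "_", "MAC" or "MIC", "_", and a number.
NAME_PATTERN = re.compile(r"([A-Z]{6})_(MAC|MIC)_([0-9]+)")


def parse_contig_name(name):
    """Return (organism code, nucleus, number) for a name in this scheme."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid contig name: {name!r}")
    organism, nucleus, number = match.groups()
    return organism, nucleus, int(number)


parsed = parse_contig_name("OXYTRI_MAC_1001")
```

Using `fullmatch` rejects names with extra leading or trailing characters, so only names that follow the whole scheme are accepted.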
This section describes the database architecture and lists software and libraries used during the development of <mds_ies_db>.
Currently, there are 14 tables in the database, which are listed below.
- Alias - contains information about sequence alias names found in different databases.
- Contig - contains information about each MAC/MIC contig.
- Count - contains summary information about the genes, MDSs, pointers, and IESs for each MAC and MIC contig.
- Coverage - contains information about MDS mapping between each MAC/MIC pair.
- Gene - contains information about genes that are found on MAC/MIC contigs.
- IES_strict - contains information about the strict IESs that are found on MIC contigs. Strict IESs are segments of a MIC contig that are between two MDSs such that both MDSs correspond to the same MAC contig and the segment between the MDSs does not overlap any MDSs of either that MAC contig or any other.
- IES_weak - contains information about the weak IESs that are found on MIC contigs. Weak IESs are segments of a MIC contig that are between two MDSs such that both MDSs correspond to the same MAC contig and the segment between the MDSs does not overlap any MDSs of that MAC contig, but it may overlap MDSs of other MAC contigs.
- Match - contains information about MDSs that were identified during the MDS annotation process.
- Parameter - contains information about the SDRAP parameters that were used during the MDS annotation process.
- Pointer - contains information about pointers that were identified during the MDS annotation process.
- Properties - contains information about arrangement map properties that were identified during the MDS annotation process.
- Protein - contains information about MAC protein transcripts.
- Stats - contains summary information about each annotation database.
- Variant - contains information about the variants/isoforms of each contig.
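The difference between the strict and weak IESs stored in the IES_strict and IES_weak tables can be illustrated with a short sketch. It follows the definitions in the list above, but it is not the SDRAP post-processing code: the `MDS` record, the function names, the 1-based inclusive coordinates, and the sample intervals are assumptions. In this sketch every strict IES also satisfies the weak condition, so it appears in both returned lists.

```python
from collections import defaultdict
from typing import NamedTuple


class MDS(NamedTuple):
    """Hypothetical record for an MDS located on one MIC contig."""
    mac: str    # name of the MAC contig the MDS maps to
    start: int  # first MIC position of the MDS (1-based, inclusive)
    end: int    # last MIC position of the MDS (inclusive)


def overlaps(start, end, mds):
    """True if the closed interval [start, end] intersects the MDS."""
    return mds.start <= end and start <= mds.end


def find_ies(mds_list):
    """Return (weak, strict) lists of (mac, start, end) gaps on one MIC contig."""
    by_mac = defaultdict(list)
    for mds in mds_list:
        by_mac[mds.mac].append(mds)

    weak, strict = [], []
    for mac, group in by_mac.items():
        group.sort(key=lambda m: m.start)
        covered_to = group[0].end  # rightmost position covered so far
        for nxt in group[1:]:
            gap_start, gap_end = covered_to + 1, nxt.start - 1
            if gap_start <= gap_end:
                # The gap is free of MDSs of this MAC contig: weak condition.
                weak.append((mac, gap_start, gap_end))
                others = (m for m in mds_list if m.mac != mac)
                if not any(overlaps(gap_start, gap_end, m) for m in others):
                    # Also free of MDSs of every other MAC contig: strict.
                    strict.append((mac, gap_start, gap_end))
            covered_to = max(covered_to, nxt.end)
    return weak, strict


# Made-up intervals: an MDS of contig "B" lies inside the first gap of "A".
sample = [MDS("A", 1, 100), MDS("A", 201, 300), MDS("A", 401, 500), MDS("B", 120, 180)]
weak_ies, strict_ies = find_ies(sample)
```

In the sample, the gap from 101 to 200 meets only the weak condition because an MDS of contig "B" falls inside it, while the gap from 301 to 400 meets the strict condition as well.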
The <mds_ies_db> uses a number of open source software programs and libraries: