Glossary

Alignment

A sequence alignment is an arrangement of two or more sequences, highlighting their similarity. The sequences are padded with gaps (usually denoted by dashes) so that wherever possible, columns contain identical or similar characters from the sequences involved:

   51 DRAILYRYDVTEETDVKNAVKFTI---GKLDILFSN     83
      |:|.:||.|:|:||:|:||||||:   ||||:||||
   55 DKASFYRCDITDETEVENAVKFTVEKHGKLDVLFSN     90

It is usually used to study the evolution of the sequences from a common ancestor, especially biological sequences such as protein sequences or DNA sequences. Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions. Sequence alignment can also be used to study things like the evolution of languages and the similarity between texts.

bit-Score

The bit-Score is derived from the raw alignment score by taking into account the statistical properties of the scoring system used. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
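The normalization can be sketched with the standard Karlin-Altschul relation S' = (lambda * S - ln K) / ln 2. The function name and the lambda and K values below are illustrative assumptions; the real parameters depend on the substitution matrix and gap penalties, and they are not SIMAP's values.

```python
import math

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Hypothetical statistical parameters, chosen only for illustration.
LAMBDA = 0.267
K = 0.041

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Because lambda and K absorb the properties of the scoring system, two searches run with different matrices can be compared on the bit-score scale even though their raw scores cannot.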

BLOSUM50

The BLOSUM50 Blocks Substitution Matrix is a substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each BLOSUM matrix is tailored to a particular evolutionary distance. In the BLOSUM50 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 50% identity. Sequences sharing more than 50% identity are represented by a single sequence in the alignment to avoid overweighting closely related family members (Henikoff and Henikoff, Proc Natl Acad Sci U S A 89:10915-10919; 1992).
In SIMAP the BLOSUM50 matrix is used to compute alignments and scores to adjust the calculations to optimal sensitivity.

CRC64 checksum

The checksum allows identical sequences in SIMAP to be identified quickly. The value of the CRC64 checksum is derived from the plain amino acid sequence itself (all characters in upper case) using the 64-bit cyclic redundancy check function.

Database

A database in SIMAP is a collection of protein entries that have been imported into the SIMAP system. A database may represent all proteins of a genome (like PEDANT databases) or proteins from multiple species (like UNIPROT).

Description

The description of SIMAP protein entries is taken from the originating database entries. It can be searched using fulltext queries.

Domain-Architecture Similarity

The Domain-Blast method uses a similarity measure based on the presence or absence of domain signatures. This score is adapted from the paper "An initial strategy for comparing proteins at the domain architecture level" by Lin et al.

E-value

The E-value is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially with the raw score that is assigned to a match between two sequences. Essentially, the E-value describes the random background noise that exists for matches between sequences. For example, an E-value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size, one might expect to see one match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0", the higher the "significance" of the match. However, it is important to note that matches to short query sequences can be virtually identical and still have relatively high E-values. This is because the calculation of the E-value also takes into account the length of the query sequence, and shorter sequences have a higher probability of occurring in the database purely by chance.
In SIMAP the E-values are calculated using the fixed estimates for an average database and the BLOSUM50 substitution matrix as in BLAST (Altschul et al. 1990) in order to gain speed.
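The exponential dependence on the score can be illustrated with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. This is a minimal sketch: the function name and the lengths used below are hypothetical, and the fixed estimates that SIMAP uses for an average database are not reproduced here.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical search space: a 300-residue query against 10 million residues.
QUERY_LEN = 300
DB_LEN = 10_000_000

for score in (20, 30, 40, 50):
    print(score, expected_hits(score, QUERY_LEN, DB_LEN))
```

Each additional bit of score halves the E-value, which is why high-scoring matches quickly become statistically significant.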

FASTA

A straightforward way to compute similarities between protein sequences would be to compute the Smith-Waterman alignment for each pair of proteins and to keep high-scoring hits for further processing. Although efficient implementations exist (Rognes and Seeberg 2000), the computational costs (i.e. the CPU time needed) are still high. A number of heuristic approaches were therefore introduced, such as BLAST (Altschul et al. 1990) and FASTA (Pearson 2000). These heuristics significantly speed up the search for biologically meaningful hits in a database and are therefore widely used by bioinformaticians and biologists alike. The FASTA program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later, the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "ktup" variable, which specifies the size of a "word".
Because it was evaluated to be the best compromise between computational speed and sensitivity (Pearson 1991), we have chosen FASTA for finding all putative hits. The FASTA parameter ktup=1 is used to adjust the calculations to optimal sensitivity.

FASTA format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence (so-called multiple FASTA format).
An example of the single FASTA format as supported in the SIMAP sequence search:

>tr|Q988A4|Q988A4_RHILO Mll6828 protein - Rhizobium loti (Mesorhizobium loti).
MVTHPSLVLAATIALAVVLGAISIADFRRQIIPDGLNLALAGIGLSYQLAADADAMPQRL
LFAAATFAAAWLLRRGHFLMTGRIGLGLGDVKMLAAASCWISPLLLPVLLFIASASALLF
VGGQVVATGPAAARARVAFGPFIAIGLGASWALEQFAGLDMGLL
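
The format rules above can be sketched as a small parser. This is an illustrative reading of the format description, not SIMAP's own code; the function names and the two short example records are made up for the demonstration.

```python
def _record(header, chunks):
    # The first word after ">" is the identifier, the rest is the description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)

def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record (multiple FASTA format).
            if header is not None:
                yield _record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)

# Hypothetical two-record input.
example = ">seq1 example protein\nMVTHPS\nLVLAAT\n>seq2\nAISIAD\n"
for rec in parse_fasta(example):
    print(rec)
```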

Fulltext

SIMAP fulltext queries use a fulltext index that was built from all protein IDs and descriptions. The text information is split into words at every character that is not a letter or a digit. The minimum word length has been set to 3.
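
The word-splitting rule can be sketched as follows. This is only an approximation of the described behavior: the function name is hypothetical, and treating "letters and digits" as ASCII letters and digits is an assumption.

```python
import re

MIN_WORD_LENGTH = 3  # minimum word length stated in the glossary

def tokenize(text):
    """Split text at every non-alphanumeric character and drop short words."""
    # Assumption: only ASCII letters and digits count as word characters.
    words = re.split(r"[^A-Za-z0-9]+", text)
    return [w for w in words if len(w) >= MIN_WORD_LENGTH]

print(tokenize("tr|Q988A4|Q988A4_RHILO Mll6828 protein - Rhizobium loti"))
```

Under these assumptions the two-letter prefix "tr" is not indexed, and "Q988A4_RHILO" is split into two searchable words at the underscore.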

Homologs

The term homology refers to similarity attributable to descent from a common ancestor. Homologs are homologous proteins whose sequences are expected to share ancestry. Homology of sequences can be of two types: orthology or paralogy. Homologous sequences are orthologous if they were separated by a speciation event: if a gene exists in a species, and that species diverges into two species, then the copies of this gene in the resulting species are orthologous. Homologous sequences are paralogous if they were separated by a gene duplication event: if a gene in an organism is duplicated, then the two copies are paralogous. A pair of sequences that are orthologous to each other are called orthologs; a pair that are paralogous are called paralogs.

Identity

The percent identity value is an attribute of pairwise alignments that measures the number of identical residues ("matches") relative to the length of the alignment.
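
As a minimal sketch, percent identity can be computed from two aligned strings of equal length. The function name is hypothetical, and two details are assumptions: the alignment length counts every column including gap columns, and upper-case and lower-case residues are treated as equal.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1
        for a, b in zip(aligned_a, aligned_b)
        if a != gap and a.upper() == b.upper()
    )
    return 100.0 * matches / len(aligned_a)

# Short fragments in the style of the alignment example in this glossary.
print(percent_identity("GKLDILFSN", "GKLDVLFSN"))
print(percent_identity("TI---GK", "TVEKHGK"))
```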

Low-complexity regions

Low complexity regions in amino acid sequences are regions of biased composition including homopolymeric runs, short-period repeats, and more subtle overrepresentation of one or a few residues.
In SIMAP the seg program (Wootton 1994) is used to compute low complexity regions. To preserve the sequence information in SIMAP sequences, the residues in low complexity regions are not replaced by "X" but converted into lower-case characters.
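
Because the masking is encoded as lower case, the regions can be recovered from the sequence string itself. The sketch below uses hypothetical helper names and is not part of SIMAP; it lists the lower-case runs and shows how the conventional "X" masking could be derived when a tool requires it.

```python
import re

def low_complexity_regions(sequence):
    """Return 1-based inclusive (start, end) positions of lower-case runs."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[a-z]+", sequence)]

def mask_with_x(sequence):
    """Replace every lower-case residue by 'X' (conventional hard masking)."""
    return re.sub(r"[a-z]", "X", sequence)

# Start of the raw-format example sequence from this glossary.
seq = "MVTHPSlvlaatialavvlGAISIADFRRQ"
print(low_complexity_regions(seq))
print(mask_with_x(seq))
```

Converting lower case to "X" loses the residue identities, whereas calling `seq.upper()` restores the unmasked sequence, which is the advantage of the lower-case convention.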

MD5 checksum

The checksum allows identical sequences in SIMAP to be identified quickly. The value of the MD5 checksum is derived from the plain amino acid sequence itself (all characters in upper case) using the 128-bit Message-Digest algorithm 5 function.
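
A checksum of this kind can be sketched with Python's standard `hashlib` module. The function name is hypothetical, and the removal of whitespace and the hexadecimal output are assumptions; the exact normalization and encoding SIMAP applies are not specified beyond upper-casing.

```python
import hashlib

def sequence_md5(sequence):
    """MD5 hex digest of an amino acid sequence, normalized to upper case."""
    # Assumption: whitespace is not part of the sequence and is removed.
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()

# Lower-case (low complexity) marking does not change the checksum.
print(sequence_md5("MVTHPSlvla"))
print(sequence_md5("MVTHPSLVLA"))
```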

Overlap

The overlap value is an attribute of pairwise alignments that measures the length of the alignment, including matches, mismatches and gaps.

Protein

SIMAP protein entries correspond to the entries from the originating databases. They are associated with their corresponding SIMAP sequence.

Protein ID

The ID of SIMAP protein entries is taken from the identifier field of the originating database entries. It can be searched using fulltext queries.

raw Sequence format

A sequence in raw format contains lines of sequence data. The lines may be formatted by whitespaces and may contain index numbers. An example sequence in raw format as supported in the SIMAP sequence search:

     MVTHPSlvla atialavvlG AISIADFRRQ IIPDGLNLAL AGIGLSYQLA ADADAMPQRl 60
     lfaaatfaaa wllRRGHFLM TGRIGLGLGD VKMLAAASCW ISPLLLPVll fiasasallf 120
     VGGQVvatgp aaararvafg pfIAIGLGAS WALEQFAGLD MGLL                  164
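
Reading such input amounts to discarding whitespace and index numbers. A minimal sketch follows, with a hypothetical function name; keeping the case of each letter is a deliberate choice so that lower-case low complexity marking survives.

```python
def parse_raw(text):
    """Keep only letters, dropping whitespace and index numbers."""
    return "".join(ch for ch in text if ch.isalpha())

# Last line of the example above, including its index number.
line = "     VGGQVvatgp aaararvafg pfIAIGLGAS WALEQFAGLD MGLL                  164"
sequence = parse_raw(line)
print(sequence, len(sequence))
```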

Selfscore

The selfscore is the SW-score of the pairwise alignment of an amino acid sequence with itself. The selfscore represents the maximum value of the SW-score of the particular sequence in pairwise alignments with other sequences. Low complexity regions in sequences are marked by lower-case characters.

Sequence

SIMAP sequences represent amino acid sequences and are stored separately from proteins in a non-redundant database. Identical sequences are detected by their identical MD5 checksums. One sequence may be associated with many proteins in several databases.

Sequence ID

The sequence ID is a unique number for each SIMAP sequence. It is used in SIMAP itself and SIMAP related databases.

Smith-Waterman pairwise alignment method

A pairwise alignment is parameterized by its underlying substitution matrix, which models the exchange probabilities of the amino acids (such as the BLOSUM50 matrix, Henikoff and Henikoff 1992), by the costs of opening and extending gaps, and by the boundary condition of whether the alignment should be optimized locally or globally. The optimal solution for the local case is given by the Smith-Waterman algorithm (Smith and Waterman 1981).
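
The local alignment recurrence can be sketched in a few lines. This is a simplified illustration, not the SIMAP implementation: it uses a plain match/mismatch score instead of BLOSUM50 and a single linear gap penalty instead of separate gap opening and extension costs, and all parameter values are arbitrary.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (simplified scoring scheme)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local:
            # a poorly scoring prefix is dropped instead of carried along.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
print(smith_waterman_score("ACGT", "ACGT"))
```

Aligning a sequence with itself yields the highest score that sequence can reach under the chosen scoring scheme, which is the idea behind the selfscore.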

SW-score

The score value is an attribute of pairwise alignments that measures the similarity of the aligned sequences according to the underlying substitution matrix and gap penalties. The SW-score represents the score of Smith-Waterman alignments.