Copyright © 2001,2002 Warren R. Gish.
All Rights Reserved.
Last updated: 2002-09-26
All NCBI standard FASTA sequence identifiers (NSIDs) are supported for indexing. User-definable, uncontrolled identifiers (UCIDs) consisting of arbitrary text strings are also supported. The complete list of NSIDs is presented in Table 1. Note: the NSIDs include three user types denoted by the tags: lcl, gnl, and oth. In contrast, identifiers of the flexible UCID class do not use any tags. For a more complete description of UCIDs, see below.
Table 1. The complete NCBI standard FASTA sequence identifiers

Tag and Identifier Syntax       Identifier Source Description
bbm|integer                     NCBI GenInfo Backbone database identifier
bbs|integer                     NCBI GenInfo Backbone database identifier
dbj|coll-accession|locus        DNA Database of Japan
emb|coll-accession|entry        EBI EMBL Database
gb|coll-accession|locus         NCBI GenBank database
gi|integer                      NCBI GenInfo Integrated Database (“jee-aye”)
gim|integer                     NCBI GenInfo Import identifier
gnl|database|idstring           General (user-definable) database and identifier
gp|coll-accession|locus_cds#    GenPept (GenBank protein) identifier
lcl|integer                     Local (user-definable) identifier
oth|accession|name|release      Other (user-definable) identifier*
pat|country|patentid|serialno   Patent sequence identifier
pdb|entry|chainid               Brookhaven Protein Database
pir|accession|entry             Protein Information Resource International
prf|accession|name              Protein Research Foundation
ref|coll-accession|locus        NCBI RefSeq
sp|coll-accession|locus         SWISS-PROT database
tpd|coll-accession|name         Third party annotation, DDBJ
tpe|coll-accession|name         Third party annotation, EMBL
tpg|coll-accession|name         Third party annotation, GenBank
*The NCBI has discontinued support for “oth” identifiers, but support for them is maintained in xdformat/xdget.
Yes, while “accession” appears in several of the identifiers described above, Accessions assigned by the International Nucleotide Sequence Database Collaboration between the DDBJ, EBI (EMBL) and NCBI are guaranteed unique by these organizations. To reflect their special nature, the collaboration’s Accessions are labeled coll-accession in Table 1. These Accessions are all treated as being derived from the same identifier name space. Consequently, xdget can retrieve a sequence by Accession (or rather coll-accession) without having to know specifically which of the collaborating organizations assigned the identifier. Locus and Entry identifiers do not work this way, however, as the uniqueness of these identifiers is not controlled between the collaborators.
A compound identifier is a concatenation of multiple NCBI standard FASTA sequence identifiers (NSIDs) each separated from the next by a single vertical-bar character, ‘|’ (also known as the “logical-or”, “pipe”, “pling”, “gozinta” or “pipesinta” character). White space (e.g., one or more blank or tab characters) is used to delimit the identifier string from the accompanying sequence description.
Here is an example of a definition line containing a simple or atomic sequence identifier:
>gi|12346 hypothetical protein 185 – wheat chloroplast
Here is an example of a compound identifier, containing both a gi and a gp (GenPept) identifier:
>gi|12346|gp|CAA44030.1|CHTAHSRA_4 hypothetical protein 185 – wheat chloroplast
While the order of identifiers in a compound identifier is technically irrelevant, gi identifiers typically appear first.
A compound definition is a concatenation of multiple component definitions, each separated from the next by a single Control-A character (sometimes symbolized ^A; hex 0x01; or ASCII SOH [start of header]). Compound definitions are frequently seen in “nr” (quasi-non-redundant) databases, where multiple instances of the exact same sequence are replaced by a single instance of the sequence with a concatenated definition line. Note: each component of a compound definition begins with an identifier which itself may be compound.
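The two delimiters described above (Control-A between component definitions, white space between an identifier string and its description) can be illustrated with a minimal Python sketch. The function name split_definition is hypothetical, and the sample line simply joins two definition lines from this page with a Control-A; it is not taken from a real database.

```python
def split_definition(defline):
    """Split a FASTA definition line into (identifier string, description) pairs.

    Illustrative only: one pair is returned per component definition.
    """
    if defline.startswith(">"):
        defline = defline[1:]
    components = []
    for component in defline.split("\x01"):      # Control-A separates components
        parts = component.split(None, 1)          # first white space ends the identifier string
        id_string = parts[0] if parts else ""
        description = parts[1].strip() if len(parts) > 1 else ""
        components.append((id_string, description))
    return components


# Hypothetical compound definition built from two examples on this page.
sample = (">gi|12346 hypothetical protein 185"
          "\x01"
          "gi|5902966|gp|AAD55586.1|AF055084_1 very large GPCR-1 [Homo sapiens]")

for id_string, description in split_definition(sample):
    print(id_string, "->", description)
```

Each identifier string returned here may itself be compound and would still need to be split on vertical bars.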
Yes, xdformat can index uncontrolled identifiers of your choosing (UCIDs), either alone or in combination with NCBI standard FASTA sequence identifiers (NSIDs). A UCID consists merely of a non-blank string of text, lacking any identifier tag that would be required of an NSID.
UCIDs are subject to a few restrictions:
The purpose of imposing the above restrictions on UCIDs is to aid in the detection of syntax errors on input.
When an error is encountered in the left-to-right parsing of a string of identifiers, parsing stops and all subsequent identifiers in the current identifier string are ignored. Any identifiers parsed correctly prior to the error are indexed. In the case of a compound definition line, parsing and indexing resume at the identifier string in the next component definition. Regardless of whether any syntax errors are detected in the identifiers, the entire definition line will be stored in the XDF database “as is”.
Here are a few examples of definition lines whose identifiers will all be completely parsed and indexed. All but the first two examples contain a compound identifier.
>gi|12346
>MYID001 my first sequence (NOTE: UCID is acceptable as the first identifier, iff it is the only identifier in the string)
>gi|5902966|gp|AAD55586.1|AF055084_1 very large GPCR-1 [Homo sapiens]
>gp|AAD55586.1|AF055084_1|gi|5902966 (NOTE: order of NSIDs is unimportant)
>gp|AAD55586.1|AF055084_1| very large GPCR-1 [Homo sapiens] (NOTE: vertical-bar is acceptable at end of identifier string)
>gp|AAD55586.1|AF055084_1|gi|5902966|MYID001 my first sequence (NOTE: UCID at end of identifier string will be properly indexed)
Here are a few examples of improperly constructed strings that will cause an identifier (or the entire string of identifiers) to be omitted from the index.
>gi|5902966|gp|AAD55586.1 very large GPCR-1 [Homo sapiens] (NOTE: gp identifier is missing the locus token and will be skipped)
>fb|AAD55586.1|AF055084_1|gi|5902966 (NOTE: unrecognized tag “fb”; none of the identifiers will be indexed)
>gi|5902966|MYID001|gp|AAD55586|AF055084_1 (NOTE: UCID not listed last; gp identifier will not be indexed)
>MYID001|gp|AAD55586.1|AF055084_1|gi|5902966 (NOTE: UCID not listed last; none of the subsequent identifiers will be indexed)
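The left-to-right parsing behavior illustrated by these examples can be approximated in a few lines of Python. This is only a sketch, not xdformat's actual parser: the field counts are read off Table 1, the name parse_id_string and the ("user", ...) labelling of UCIDs are hypothetical, and the sketch assumes that a misplaced UCID is itself left out of the index, which the text does not state explicitly.

```python
# Number of fields that follow each tag, as read off Table 1.
FIELDS = {
    "bbm": 1, "bbs": 1, "dbj": 2, "emb": 2, "gb": 2, "gi": 1, "gim": 1,
    "gnl": 2, "gp": 2, "lcl": 1, "oth": 3, "pat": 3, "pdb": 2, "pir": 2,
    "prf": 2, "ref": 2, "sp": 2, "tpd": 2, "tpe": 2, "tpg": 2,
}


def parse_id_string(id_string):
    """Return the identifiers parsed before the first syntax error, if any."""
    tokens = id_string.split("|")
    parsed = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        last = (i == len(tokens) - 1)
        if tag == "" and last:
            break                               # trailing vertical bar is acceptable
        if tag in FIELDS:
            n = FIELDS[tag]
            fields = tokens[i + 1:i + 1 + n]
            if len(fields) < n:
                break                           # too few fields: stop parsing here
            parsed.append((tag, tuple(fields)))
            i += 1 + n
        elif last:
            parsed.append(("user", (tag,)))     # untagged text in last position: a UCID
            i += 1
        else:
            break                               # unrecognized tag or UCID not listed last
    return parsed


print(parse_id_string("gp|AAD55586.1|AF055084_1|gi|5902966|MYID001"))
print(parse_id_string("gi|5902966|gp|AAD55586.1"))
print(parse_id_string("fb|AAD55586.1|AF055084_1|gi|5902966"))
```

Stopping at the first error and keeping what was already parsed mirrors the rule that identifiers parsed correctly prior to an error are still indexed.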
Yes, assuming no parse errors are encountered in any of the identifier strings among all component definition lines, all of the identifiers are indexed by default. If only a subset of identifier types needs to be indexed for later use in retrieval, indexing can be restricted to a subset of types with one or more -T specifications on the xdformat command line. Similarly, indexed retrieval can be restricted to a subset of identifier types by specifying one or more -T specifications on the xdget command line. Of course, -T restrictions are only effective if the corresponding identifiers actually appear in the database.
Any -T index restrictions imposed during database creation on the xdformat command line automatically (and unconditionally) remain in effect during appends of additional data to the same database; the restrictions need not be replicated on the xdget command line unless even tighter restrictions are desired during retrieval. Tighter restrictions upon retrieval can be obtained by specifying a subset of the -T restrictions originally indicated on the xdformat command line.
The size of the index and the speed of index creation and retrieval will be improved by limiting the index to those identifiers of interest.
NOTE: The left-to-right order of multiple -T specifications may be important in future versions of xdformat and xdget.
Just as the -T<tag> option can be used to restrict indexing and retrieval to a subset of NSIDs, the special tag specification -Tuser will restrict indexing to UCIDs. NSID and UCID restrictions can be combined on the same command line. For example, “xdformat -Tuser -Tgi …” will restrict indexing to UCIDs and NCBI gi identifiers.
When the definition line for a single sequence record contains multiple instances of the same identifier within the same name space, each instance following the first is called redundant. Redundant identifiers may appear in the same or different components of a compound definition line. Depending on circumstances, redundant identifiers may or may not be problematic, because they all refer to (are associated with) the same sequence record.
The xdformat program reports redundant identifiers.
When a database contains instances of the same identifier in a name space in different sequence records, the identifiers are called duplicate. Duplicate identifiers are more prone to being problematic than redundant identifiers, because the association between database records (sequences) and duplicate identifiers is not unique. An identifier can be both redundant and duplicate.
The xdformat program reports duplicate identifiers.
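The distinction between redundant and duplicate identifiers can be made concrete with a small Python sketch. It is not how xdformat detects them; the function name classify, the record numbers, and the in-memory layout (record number mapped to a list of name space and identifier pairs) are all assumptions made for illustration.

```python
from collections import defaultdict


def classify(records):
    """Return (redundant, duplicate) sets of (name_space, identifier) pairs.

    records maps a record number to the identifiers found on its definition line.
    """
    redundant = set()
    owners = defaultdict(set)
    for recno, identifiers in records.items():
        seen = set()
        for key in identifiers:
            if key in seen:
                redundant.add(key)      # repeated within one sequence record
            seen.add(key)
            owners[key].add(recno)
    # Present in more than one sequence record.
    duplicate = {key for key, recs in owners.items() if len(recs) > 1}
    return redundant, duplicate


# Hypothetical database of three records.
records = {
    1: [("gi", "12346"), ("gi", "12346")],
    2: [("gi", "5902966"), ("user", "MYID001")],
    3: [("user", "MYID001"), ("gi", "12346")],
}
redundant, duplicate = classify(records)
print(sorted(redundant))
print(sorted(duplicate))
```

In this made-up data, gi 12346 is both redundant (twice in record 1) and duplicate (records 1 and 3), while MYID001 is duplicate only.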
A qualified identifier is one which conforms to the NCBI standard FASTA identifier (NSID) syntax outlined in Table 1. An unqualified identifier is just a bare word, lacking any indication of its database domain or name space in which it was assigned. For instance, while “U38670” could represent a GenBank Accession, it might also be an uncontrolled identifier (UCID). The string “gb|U38670|” tells us unambiguously that the identifier is a GenBank Accession.
Table 2. Examples of unqualified and qualified identifiers

Unqualified ID   Qualified ID      Interpretation
U85245           gb|U85245|        U85245 is a GenBank ACCESSION
1857636          gi|1857636        1857636 is a GenBank gi identifier
HSU85245         gb||HSU85245      HSU85245 is a GenBank LOCUS
AF218085.2       gb|AF218085.2|    AF218085.2 is a GenBank ACCESSION.VERSION
P18646           sp|P18646|        P18646 is a SWISS-PROT ACCESSION
11S3_HELAN       sp||11S3_HELAN    11S3_HELAN is a SWISS-PROT ENTRY name
A00008           pir|A00008|       A00008 is a PIR accession
Note that all fields in a qualified identifier must be accounted for by vertical-bars, but all fields need not contain data. A field can be left empty if its value is unset or unknown. Furthermore, retrieval of the corresponding database entry will succeed if one or more fields in a qualified identifier are instantiated.
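As a small illustration of the rule that every field must be accounted for by a vertical bar, the following Python sketch pads unset trailing fields with empty strings. The helper name qualify is hypothetical, and only a subset of the field counts from Table 1 is included.

```python
# Field counts for a few tags, as read off Table 1 (subset only).
FIELD_COUNT = {"gi": 1, "gb": 2, "pir": 2}


def qualify(tag, *fields):
    """Build a qualified identifier, leaving unset trailing fields empty."""
    n = FIELD_COUNT[tag]
    if len(fields) > n:
        raise ValueError("too many fields for tag " + tag)
    padded = list(fields) + [""] * (n - len(fields))
    return "|".join([tag] + padded)


print(qualify("gb", "U85245"))        # accession only
print(qualify("gb", "", "HSU85245"))  # locus only
print(qualify("gi", "1857636"))
```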
First of all, it is important to know that when indexing, all identifiers are assigned to a specific name space, with unqualified or uncontrolled identifiers in the UCID class being assigned to an ad hoc “user” name space. The xdformat and xdget programs maintain an internal priority list of the possible name spaces. When provided with an unqualified identifier, xdget works its way down the priority list, successively looking for the requested identifier in each name space. The program stops at the name space in which the first matching identifier is found; any further work the program must do (e.g., to identify the earliest appearance of the identifier in the database) will be performed in this one name space.
Name spaces are examined in the decreasing priority order shown in Table 3. The qualifiers 1 and 2 on any given tag correspond respectively to the 1st and 2nd fields in the tag’s full identifier syntax. Note that the nonstandard “accession” tag may be used with the -T option as a synonym for the unified name space of Accessions assigned by the DDBJ/EBI/NCBI collaboration. The nonstandard tags “locus” and “entry” are both synonyms for the 2nd field in all dbj, emb, gb, gp, ref, and sp identifiers, although xdformat actually stores these identifiers in 4 distinct name spaces; xdget then looks up unqualified identifiers using the priority list in Table 3.
Table 3. Priority order of identifier name spaces, from highest to lowest

-T<tag>                           Description                    Synonyms
user                              Uncontrolled UCID class
lcl
gi
dbj1, emb1, gb1, gp1, sp1, ref1   DDBJ/EMBL/GenBank Accession*   -Taccession
gb2, gp2, ref2                    GenBank locus                  -Tlocus, -Tentry
emb2                              EMBL ID                        -Tlocus, -Tentry
dbj2                              DDBJ ID                        -Tlocus, -Tentry
sp2                               SWISS-PROT entry               -Tlocus, -Tentry
pdb                               PDB entry|chain
pir1                              PIR accession
pir2                              PIR entry
prf1                              PRF accession
prf2                              PRF entry
pat                               country|number|seqno
gnl                               database|idstring
oth                               database|accession|release
NOTE: the priority list of Table 3 is currently used both in the presence and the absence of any -T options when xdget looks up unqualified identifiers. Future versions of xdformat and xdget will likely use the left-to-right order of -T specifications as the priority order for lookups; in the absence of any -T specifications on either program’s command line, the order shown in Table 3 will be used by default.
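The priority-ordered lookup of an unqualified identifier can be sketched in Python as follows. The name-space labels, the lookup function, and the nested-dictionary index are hypothetical stand-ins chosen for illustration; the real XDF index format is not described here.

```python
# Name spaces in the priority order of Table 3 (labels are illustrative).
PRIORITY = [
    "user", "lcl", "gi", "accession", "gb2/gp2/ref2", "emb2", "dbj2", "sp2",
    "pdb", "pir1", "pir2", "prf1", "prf2", "pat", "gnl", "oth",
]


def lookup(index, identifier):
    """Return (name_space, record_numbers) from the first name space that matches."""
    for name_space in PRIORITY:
        hits = index.get(name_space, {}).get(identifier)
        if hits:
            return name_space, hits     # stop at the first matching name space
    return None


# Hypothetical index: the same string exists as a UCID and as an Accession.
index = {
    "accession": {"U38670": [7]},
    "user": {"U38670": [3]},
}
print(lookup(index, "U38670"))
```

Because the “user” name space has the highest priority, the unqualified string resolves to the UCID in this made-up index; a qualified identifier such as gb|U38670| would be needed to reach the Accession.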
Tag specifications similar to those shown in Table 3 can be used to suppress indexing of certain classes of identifiers, while permitting all others to be indexed. If a tag specification simply ends with a 0 (zero), then that tag will be suppressed. For instance, to suppress indexing of identifiers appearing in the 2nd field of GenBank, EMBL, and DDBJ identifiers, one would specify -Tlocus0. Or to suppress indexing of gi identifiers, use -Tgi0. Such tag specifications may also be provided on the xdget command line to suppress the use of particular classes of identifiers during retrieval.
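For scripts that assemble xdformat or xdget command lines, the -T syntax described on this page can be generated with a tiny helper. The sketch below is hypothetical (the name t_options is not part of either program) and produces only the -T arguments; the rest of the command line is left to the caller. Whether restricting and suppressing specifications may be mixed in one invocation is not stated here, so the example keeps them separate.

```python
def t_options(restrict=(), suppress=()):
    """Return -T arguments: -T<tag> to restrict, -T<tag>0 to suppress."""
    return (["-T" + tag for tag in restrict]
            + ["-T" + tag + "0" for tag in suppress])


print(t_options(restrict=["user", "gi"]))   # restrict to UCIDs and gi identifiers
print(t_options(suppress=["locus"]))        # suppress 2nd-field (locus) identifiers
```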
Yes, the rapid append mode (-a option) of xdformat is available for indexed databases; appends are only marginally slower when an index is being maintained. Appended sequences will have their identifiers indexed using the same -T restrictions (if any) that were specified when the database was first created. Indexing of identifiers occurs automatically and unconditionally during appends to a previously indexed database, without the need to specify the -I or -X option when appending.
The numerical .VERSION extension that frequently accompanies Accessions assigned by the NCBI/EBI/DDBJ collaboration is automatically included in the index created by xdformat. Version information can then be used by xdget to identify the latest version of a sequence, when keyed by its Accession alone. Specific versions can also be retrieved if xdget is provided with an identifier of the form ACCESSION.VERSION (e.g., AAB33294.2). The -N option of xdget can be used to report instead the first (-N0) or last (-Nn) instance of an Accession in the database; the -A0 option can be used to report the lowest-numbered Version present in the database rather than the highest (the default or -An). All instances of an accession will be reported by xdget if the -d option is specified.
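The version-selection behavior described above can be approximated with a short Python sketch. It is an illustration under stated assumptions, not xdget's implementation: the function names and the flat list of indexed ACCESSION.VERSION strings are hypothetical, and AAB33294.1 is an invented earlier version of the AAB33294.2 example.

```python
def split_accession(identifier):
    """Split ACCESSION.VERSION into (accession, version); version is None if absent."""
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot and version.isdigit() else None)


def select_version(indexed, query, lowest=False):
    """Pick the highest version by default, or the lowest if requested.

    A query that already carries a version must match exactly.
    """
    accession, version = split_accession(query)
    candidates = [v for a, v in map(split_accession, indexed)
                  if a == accession and v is not None]
    if version is not None:
        return query if version in candidates else None
    if not candidates:
        return None
    chosen = min(candidates) if lowest else max(candidates)
    return f"{accession}.{chosen}"


indexed = ["AAB33294.1", "AAB33294.2", "AF218085.2"]   # hypothetical index contents
print(select_version(indexed, "AAB33294"))               # highest version
print(select_version(indexed, "AAB33294", lowest=True))  # lowest version
print(select_version(indexed, "AAB33294.1"))             # one specific version
```

The lowest flag loosely mirrors the difference between the default behavior and the -A0 option.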
Indexing and retrieval can be restricted to Accessions assigned by the NCBI/EBI/DDBJ collaboration using the special option -Taccession (or -Tacc for short).
Remember: Version numbers assigned by NCBI/EBI/DDBJ are only tied to changes in the sequence data, not the associated annotation. The annotation of a database record may change greatly, while the Version will remain the same if the sequence itself has not changed.
Assuming the underlying computer operating system and hardware have the capacity, index files produced by xdformat are currently limited to 8 TB (8,192 GB) in size, a limit that can be readily increased to 256 TB in the program if necessary. With its current configuration, however, an index of 50+ million entries requires less than 3 GB of storage; and because storage requirements for the index increase only marginally faster than linearly with the number of entries, the current limit seems likely to suffice for some time. If the size of an index is problematic, or if faster retrieval is required, indexing can be restricted to the most important classes of identifiers using the -T option.
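A rough back-of-the-envelope check of these figures is sketched below in Python. It assumes binary units (the text equates 8 TB with 8,192 GB), takes the quoted 3 GB as an upper bound for 50 million entries, and extrapolates linearly. Since the text says growth is slightly faster than linear, the resulting entry count should be read as an order-of-magnitude figure, not a guarantee.

```python
GB = 2 ** 30                      # binary gigabyte (assumption)

entries = 50_000_000              # entries in the quoted index
index_bytes = 3 * GB              # quoted upper bound on its size
limit_bytes = 8192 * GB           # current 8 TB index-file limit

bytes_per_entry = index_bytes / entries
max_entries = limit_bytes * entries // index_bytes   # linear extrapolation

print(f"at most about {bytes_per_entry:.1f} bytes per entry")
print(f"roughly {max_entries:,} entries at the 8 TB limit")
```

Under these assumptions the index costs at most about 64 bytes per entry, and the 8 TB limit corresponds to something on the order of a hundred billion entries.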