Update to "BlastReport: A Perl Script to Facilitate the Use of Sequence Databases for Mapping and Clustering"

BioTechniques 29: 1272-1276, 2000

Jeanette McClintick1 and Howard J. Edenberg1,2

Departments of 1Medical and Molecular Genetics and

2Biochemistry and Molecular Biology

Indiana University School of Medicine

Indianapolis, Indiana 46202-5122

 

 

 

We have updated BlastReport to support format changes in newer versions of NCBI's Blastcl3. BlastReport has been tested with the latest version of Blastcl3 (network blast from NCBI; modules are dated April 2001 to July 2001 depending on the operating system for which the module was compiled). Changes were also made to accommodate sequence position numbers of 7 digits, necessary for positions in genomic contigs of 1 million bases or more.
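To illustrate why the width of position numbers matters when reading BLAST text output, the following is a minimal sketch in Python. It is not the BlastReport code itself (which is Perl). The assumed line layout ("Query:" or "Sbjct:", a start position, the aligned sequence, an end position) and the name parse_alignment_line are assumptions made for this example. The point is only that matching positions with a variable-width pattern, rather than a fixed column width, accepts 7-digit positions without change.

```python
import re

# Assumed layout of a pairwise alignment line in BLAST text output, e.g.
#   "Sbjct: 1234567 acgtacgt 1234574"
# \d+ accepts any number of digits, so positions of 1,000,000 or more parse
# the same way as short ones.
ALIGN_LINE = re.compile(r"^(Query|Sbjct):\s*(\d+)\s+([A-Za-z*-]+)\s+(\d+)\s*$")


def parse_alignment_line(line):
    """Return (label, start, sequence, end), or None if the line does not match."""
    m = ALIGN_LINE.match(line)
    if m is None:
        return None
    label, start, seq, end = m.groups()
    return label, int(start), seq, int(end)


print(parse_alignment_line("Query: 1       acgtacgt 8"))
print(parse_alignment_line("Sbjct: 1234567 acgtacgt 1234574"))
```

A parser that instead sliced a fixed number of columns for the position would truncate or misalign such lines once the position grew past the expected width.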

 

The program and a readme file explaining the parameters that must be changed for the program to run on your computer (e.g. file/directory locations) can be downloaded from:

http://bmbmac.biochemistry.iupui.edu/blastrep/Home.html

 

There were two aspects to our initial report: 1) suggestions for parameters that return reasonable numbers of hits that are likely to be interesting, and 2) the Perl script itself, BlastReport, which trims and organizes the often voluminous output of such searches.

 

NCBI has added functions to Blastcl3 that can be used to refine a BLAST search. Using the -u parameter, a search can be restricted to the results of an Entrez2 lookup, in the same manner as on NCBI's interactive BLAST web page. For example, to limit a search of the nr or htgs database to human sequences, use: -u human[ORGN]. NCBI has also made available genome databases that can be searched with Blastcl3. Table 1 gives the parameters for three sample queries. All three examples use the program blastn to search one of NCBI's nucleotide databases with a nucleotide sequence as the query. [Contact NCBI (blast-help@ncbi.nlm.nih.gov) for the names of other genomic databases, which are being added.]

 

The first example searches the human genome database, which is the NCBI draft assembly of the public human genome sequence. The aim of this database is to represent the entire genome once, so multiple hits usually reflect a repetitive sequence. This database is excellent for determining the position of markers on the genome, within assembled contigs.

 

The second example searches the nr (non-redundant) database, but restricts the search to mouse sequences. Searches can be restricted to any organism by a similar term in the Entrez2 lookup parameter. This is very useful if your question is targeted to a single organism (e.g. Rattus norvegicus[ORGN]) or a group of organisms (e.g. Primate[ORGN] or Eukaryota[ORGN]). You may also limit the search by modification date: 2002[Modification Date] will limit the search to sequences that were added to the database or modified in 2002. Any of the restrictions found on the standard BLAST web page should work.
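The restriction terms described above are plain strings, so they are easy to generate. The sketch below, in Python, formats the two kinds of term mentioned in the text. The function names are hypothetical. The shell quoting step is an added caution rather than something stated in the original note: the terms contain square brackets and sometimes spaces, which most Unix shells would otherwise interpret.

```python
import shlex


def organism_term(name):
    """Format an organism restriction, e.g. 'mouse' -> 'mouse[ORGN]'."""
    return f"{name.strip()}[ORGN]"


def modification_date_term(year):
    """Format a modification-date restriction, e.g. 2002 -> '2002[Modification Date]'."""
    return f"{int(year)}[Modification Date]"


# Print each term as it would need to be typed after -u on a Unix command line.
for term in (
    organism_term("mouse"),
    organism_term("Rattus norvegicus"),
    modification_date_term(2002),
):
    print("-u", shlex.quote(term))
```

If the arguments are passed to the program as a list (for example through Python's subprocess module) rather than typed into a shell, no quoting is needed and the raw term can be used directly.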

 

The third example searches the mouse genome contigs database, which contains curated NT contigs (non-redundant finished sequence for the mouse). At this time (March 2002) this database is still quite small, covering less than 5 percent of the genome. For phase 0, 1, 2 and 3 BACs, continue to search the HTGS database, using the -u parameter to limit the search to mouse sequences.

 

In addition to these parameters, you must supply the name of the input file (-i) and the name of the output file (-o). [Note for Mac OS users: enter the parameter values in the Blastcl3 form, omitting the parameter keywords.]

 

 

Table 1: Parameters for restriction searches

 

 

 

Parameter               Keyword  Human genome 1     Mouse nr 2     Mouse contigs 3
program                 -p       blastn             blastn         blastn
database                -d       hs_genome/genome   nr             mouse_contig/contigs
expectation             -e       1e-35              1e-35          1e-35
number of descriptions  -v       10                 10             10
number of alignments    -b       10                 10             10
Entrez2 lookup          -u       (none)             mouse[ORGN]    (none)

 

 

1 Search human genome database. 2 Search the nr database, restricted to mouse sequences. 3 Search the mouse genome contigs database.
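As a worked example, the sketch below assembles the three sample queries of Table 1 into complete argument lists, adding the required -i and -o parameters. It is written in Python for illustration only. The executable name blastcl3, the input file name query.fasta, and the output file names are assumptions; substitute the names used on your system. The commands are only printed, not run.

```python
import shlex

# Parameters shared by all three sample queries in Table 1.
COMMON = {"-p": "blastn", "-e": "1e-35", "-v": "10", "-b": "10"}

# Per-query values from Table 1; queries without an Entrez2 lookup omit -u.
SAMPLE_QUERIES = {
    "human_genome": {**COMMON, "-d": "hs_genome/genome"},
    "mouse_nr": {**COMMON, "-d": "nr", "-u": "mouse[ORGN]"},
    "mouse_contigs": {**COMMON, "-d": "mouse_contig/contigs"},
}


def build_command(params, infile, outfile, executable="blastcl3"):
    """Return the argument list for one search (executable name is an assumption)."""
    args = [executable]
    for keyword, value in params.items():
        args += [keyword, value]
    # Input and output file names must always be supplied.
    args += ["-i", infile, "-o", outfile]
    return args


for name, params in SAMPLE_QUERIES.items():
    # shlex.join quotes arguments such as mouse[ORGN] for display in a Unix shell.
    print(shlex.join(build_command(params, "query.fasta", f"{name}.out")))
```

Keeping the shared values in one place makes it harder for the three queries to drift apart, for instance if the expectation cutoff is later changed.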