Screencast: Download unfinished genomic sequences at NCBI Entrez

Over at RRResearch, Dr. Redfield is looking to download lots of incomplete H. influenzae genomes, so I left a comment describing how I would solve the problem. Here, I expand that comment with a screencast (5 minutes) that shows the procedure I use to download 288 nucleotide records from NCBI Entrez.

Updated comment

How to download a concatenated file of all H. influenzae genomes from NCBI Entrez Nucleotide

Note: I don’t know if this is the best way, just a way.

1. Go to NCBI Entrez Nucleotide.

2. Search for “Haemophilus influenzae [Organism] WGS” and get 576 results, 288 each from GenBank and RefSeq. (“WGS” stands for whole genome shotgun, I think, but the search does not include completed genomes, for some reason.)

3. On the “Limits” tab, select either “GenBank” or “RefSeq” from the pop-up menu titled “Only from:” to get the 288 GenBank records or the 288 RefSeq records. I am not sure which you should use, but I lean towards GenBank over RefSeq, as it is the submitted form of the record, so my example will proceed using the GenBank results.

4. From the “Display” pop-up, choose “FASTA”, and it will give you the FASTA of the first five results (Sorry, no static link available).

5. Next, choose to display a maximum of 500 results per page, which gives us the FASTA of all 288 results in a web page format (Sorry, no static link available).

6. Finally, from the “Send to” pop-up, choose “file”, and your web browser should start downloading a text file of the results (.zip archive of the resulting download, 5 MB).
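For anyone who would rather script the steps above than click through them, here is a rough Python sketch using NCBI's E-utilities (esearch followed by efetch) instead of the web pages. This is not what I did in the screencast, and I have not checked that it returns exactly the same 288 records. The function names and output path are my own invention, the search term is the one from step 2, and the GenBank-only limit from step 3 is not reproduced, so expect both GenBank and RefSeq records unless you narrow the term yourself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's E-utilities web service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide"):
    """Build an esearch URL that stores the hits on NCBI's history server."""
    params = {"db": db, "term": term, "usehistory": "y", "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(webenv, query_key, db="nucleotide", retmax=500):
    """Build an efetch URL that returns the stored hits as FASTA text."""
    params = {
        "db": db,
        "query_key": query_key,
        "WebEnv": webenv,
        "rettype": "fasta",
        "retmode": "text",
        "retmax": retmax,  # mirrors the "maximum of 500" choice in step 5
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def download_fasta(term, out_path):
    """Search, then save every hit as one multi-FASTA file (hypothetical helper).

    Returns the number of records the search reported.
    """
    with urlopen(esearch_url(term)) as resp:
        result = json.load(resp)["esearchresult"]
    url = efetch_url(result["webenv"], result["querykey"])
    with urlopen(url) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
    return int(result["count"])


# Example call (needs network access, so it is left commented out):
# download_fasta("Haemophilus influenzae [Organism] WGS", "hin_wgs.fasta")
```

If this works as I expect, it should also sidestep the browser slowdown, since nothing has to be rendered as a web page.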

From here, you could probably use this file to make a BLAST database. However, there are 9 nearly empty FASTA records that should be deleted first. You can do this many ways, but I like to do so graphically using the free TextWrangler, by the makers of BBEdit. To locate the empty records, I did a “find all” for “00000” (five zeros). Once you delete these records, you end up with a multi-FASTA file containing genomic sequences from an additional 9 Hin strains:

22.4-21 (44 contigs), 22.1-21 (18 contigs), 3655 (23 contigs), PittAA (40 contigs), PittHH (59 contigs), PittII (25 contigs), R2846 (20 contigs), R2866 (4 contigs), and R3021 (46 contigs). I certainly can’t speak to the quality or completeness of these sequences, but you can download my results (.zip, 5 MB), if they would be useful.
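If you would rather not delete the nearly empty records by hand, the cleanup could also be scripted. The sketch below is only one guess at how to do it: it treats "nearly empty" as "sequence shorter than some cutoff" rather than searching for “00000” as I did, and the function names and the 100-base default are made up for illustration. Check what it drops before trusting it.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of multi-FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def drop_short_records(lines, min_length=100):
    """Keep only records whose sequence has at least min_length characters.

    min_length is an assumed cutoff, not a value taken from the download.
    """
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_length]


def write_fasta(records, out_path, width=70):
    """Write (header, sequence) pairs back out, wrapping sequence lines."""
    with open(out_path, "w") as out:
        for header, seq in records:
            out.write(header + "\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")


# Example use (file names are placeholders):
# with open("hin_wgs.fasta") as handle:
#     kept = drop_short_records(handle)
# write_fasta(kept, "hin_wgs_clean.fasta")
```

Comparing the number of records before and after should tell you whether the cutoff removed the same 9 records I deleted in TextWrangler.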

I am sure there is some better way to do this, but I haven’t been able to find an FTP server where I can locate these files. A particular problem with this method is that it tends to slow your web browser to a crawl. I accomplished steps 4, 5, and 6 with liberal use of the Safari “stop loading page” button. Basically, I let the intermediate pages begin to load, then stopped them before they completed so I could choose the next setting. With Firefox, I was unable to complete this tutorial, because the browser would become completely non-responsive for at least 10 minutes. Be careful!

Is there a better way to do this?