Over at RRResearch, Dr. Redfield is looking to download lots of incomplete H. influenzae genomes, so I left a comment describing how I would solve the problem. Here, I expand the comment with a screencast (5 minutes) that shows the procedure I use to download 288 nucleotide records from NCBI Entrez.
Updated comment
How to download a concatenated file of all H. influenzae genomes from NCBI Entrez Nucleotide
Note: I don’t know if this is the best way, just a way.
1. Go to NCBI Entrez Nucleotide.
2. Search for “Haemophilus influenzae [Organism] WGS” and get 576 results, 288 each from GenBank and RefSeq. (“WGS” stands for whole genome shotgun; for some reason, the search does not return completed genomes.)
3. On the “Limits” tab, select either “GenBank” or “RefSeq” from the pop-up menu titled “Only from:” to get the 288 GenBank records or the 288 RefSeq records. I am not sure which you should use, but I lean towards GenBank over RefSeq, as it is the submitted form of the record, so my example will proceed using the GenBank results.
4. From the “Display” pop-up, choose “FASTA”, and it will give you the FASTA of the first five results (Sorry, no static link available).
5. Next, choose to display a maximum of 500 results, which gives us the FASTA of all 288 results in a web page format (Sorry, no static link available).
6. Finally, from the “Send to” pop-up, choose “File”, and your web browser should start downloading a text file of the results (.zip archive of the resulting download, 5 MB).
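The browser steps above can also be approximated from a script. The following is a minimal sketch, not the method used in the screencast: it assumes NCBI's E-utilities (`esearch.fcgi` and `efetch.fcgi`) and passes the same search term verbatim. The function names are hypothetical, the “Limits” step (GenBank vs. RefSeq) is not reproduced, and the number of matching records today will likely differ from the 576 reported here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=500):
    """Build an ESearch URL that stores the result set on the history server."""
    params = {"db": db, "term": term, "retmax": retmax, "usehistory": "y"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_history(esearch_xml):
    """Extract (WebEnv, QueryKey, Count) from an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return (
        root.findtext("WebEnv"),
        root.findtext("QueryKey"),
        int(root.findtext("Count")),
    )


def efetch_url(web_env, query_key, db="nucleotide", retmax=500):
    """Build an EFetch URL that returns the stored results as FASTA text."""
    params = {
        "db": db,
        "WebEnv": web_env,
        "query_key": query_key,
        "rettype": "fasta",
        "retmode": "text",
        "retmax": retmax,  # mirrors the "maximum of 500 results" in step 5
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def download_fasta(term, out_path):
    """Search, then save up to 500 matching records as one multi-FASTA file.

    Makes two network requests; returns the total number of matches reported.
    """
    with urlopen(esearch_url(term)) as response:
        web_env, query_key, count = parse_history(response.read())
    with urlopen(efetch_url(web_env, query_key)) as response:
        with open(out_path, "wb") as out:
            out.write(response.read())
    return count
```

Calling `download_fasta("Haemophilus influenzae [Organism] WGS", "hin_wgs.fasta")` (the output file name is just an example) would write the concatenated FASTA to disk without loading the results in a browser. If more than 500 records match, only the first 500 are fetched.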
From here, you could probably use this file to make a BLAST database; however, there are 9 nearly empty FASTA records that should be deleted first. You can do this many ways, but I like to do so graphically using the free TextWrangler, by the makers of BBEdit. To locate the empty records, I did a “find all” for “00000” (five zeros). Once you delete these records, you end up with a multi-FASTA file containing genomic sequences from an additional 9 Hin strains:
22.4-21 (44 contigs), 22.1-21 (18 contigs), 3655 (23 contigs), PittAA (40 contigs), PittHH (59 contigs), PittII (25 contigs), R2846 (20 contigs), R2866 (4 contigs), and R3021 (46 contigs). I certainly can’t speak to the quality or completeness of these sequences, but you can download my results (.zip, 5 MB), if they would be useful.
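For anyone who would rather script the cleanup than do it in a text editor, here is a minimal sketch. It swaps in a different criterion from the “find all” search: it assumes a “nearly empty” record is one whose sequence is shorter than a threshold, and both the function names and the `min_length` parameter are hypothetical, so the records it drops should be checked against the nine found by hand.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_near_empty(records, min_length=1):
    """Keep only records whose sequence has at least min_length characters."""
    return [(h, s) for h, s in records if len(s) >= min_length]


def format_fasta(records, width=70):
    """Render records as multi-FASTA text, wrapping sequences at width."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n" if lines else ""
```

To use it, read the downloaded file into a string, pass the result of `read_fasta` through `drop_near_empty` with a threshold that suits your data, and write the output of `format_fasta` to a new file. Reading the whole file into memory is fine at this size (about 5 MB zipped).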
I am sure there is some better way to do this, but I haven’t been able to find an FTP server where I can locate these files. A particular problem with this method is that it tends to slow your web browser to a crawl. I accomplished steps 4, 5, and 6 with liberal use of the Safari “stop loading page” button. Basically, I let the intermediate step pages begin to load, then stop them before they complete, so I can choose the next setting. With Firefox, I was unable to complete this tutorial, because the browser would become completely non-responsive for at least 10 minutes. Be careful!
Is there a better way to do this?
nuin | 18-Jul-07 at 10:43 am | Permalink
Hi
You can use Geneious with the same search pattern you used:
Open Geneious
Go to NCBI-> Nucleotide
On the search tab click more options
On the drop down list select “Organism”
Enter Haemophilus influenzae
Click on the plus sign and enter “WGS” in the new search box
Click search and you get the same 576 results.
All the resulting sequences will be displayed on the tab below the search. The download takes a couple of minutes.
Go to Edit->Select all and then on Tools->Concatenate …
That’s it. The free version has the same features.
Hari Jayaram | 02-Oct-07 at 6:22 am | Permalink
Hi Brian,
There is another nice way of doing batch downloads on the beta.uniprot.org site.
I think screencasts are a great way of communicating this kind of information. A bunch of us started a site called bioscreencast.com to organize a community of people involved in the life sciences that creates, shares and consumes screencasts.
Could you please upload your screencast created with Jing on that site…
I have been meaning to check out Geneious for a while; didn’t know the free version could do what nuin talks about.
Hari