In the Letters section of the May issue of Microbe, Tatiana Tatusova at NCBI writes a great summary and comparison of GenBank, RefSeq, UniProt, and Swiss-Prot in an article titled “GenBank, RefSeq, TPA, and UniProt: What’s in a Name?”. It is certainly a useful introduction to these resources.
“While nearly every aspect of the modern research enterprise changes quickly, the lab notebook hasn’t changed in over 100 years”,
which he means as an indictment of the paper-based notebook. I would take the opposite tack: if it has worked for 100 years or more, maybe we shouldn’t be so quick to throw paper-based notebooks out.
The problem with ELNs is that they are inconvenient compared to paper notebooks. ELNs require access to a computer to read or write. Also, it is easier to lose digital information than pen-and-paper information (I’ve had far more hard disk crashes than I care to think about).
Most laboratory computers aren’t at the bench; they are at the desk. Since they aren’t located where the work is occurring, there are three main options for writing up experiments:
1. Write up experiments completely before the experiment occurs, then perform the experiment just as planned.
2. Try to remember everything and type it up afterwards.
3. Write things down with pen and paper, then transcribe them into the ELN.
The problem with each of these, respectively, is:
1. Things rarely go exactly as planned.
2. The longer the gap between observing something and writing it down, the greater the possibility of errors.
3. Transcribing information is twice the work, and also increases the possibility of errors.
In the real world, a mixture of all of these would be the way to go, but I think it is unlikely to be readily adopted by academic scientists, because of the hassle. Instead, I think we should try to leverage the work people put into their venerable old paper and pen lab notebooks with digital technologies.
One easy path to searchable, shareable notebooks might be Optical Character Recognition, where one scans in the lab notebook pages and lets the computer figure out what’s what, but this approach will be limited by how often a researcher drags the notebook over to the scanner.
An interesting alternative would be the recently announced pen-top computer from Livescribe. Basically, this is a pen that can “remember” what you write, and upload what you have written to a computer. I envision writing in my notebook, then when I (and the pen) get back to the computer, the recorded text is uploaded to the computer and “auto-blogged” into my ELN. Sure, it’s not perfect: any images I paste into my paper notebook wouldn’t be saved, and the formatting won’t be perfect. However, I could have a date- and keyword-searchable archive of the things I have written, which I could then cross-reference against my actual notebook.
If ELNs are going to come into common use, there are only two ways it will happen: either they are forced on researchers by the industry/academic hierarchy, or they are so easy and simple to adopt that researchers would be fools not to start using them.
Over at RRResearch, Dr. Redfield is looking to download lots of incomplete H. influenzae genomes. So, I left a comment describing how I would solve the problem. Here, I expand the comment with a screencast (5 minutes) that shows the steps I take to download 288 nucleotide records from NCBI Entrez.
Updated comment
How to download a concatenated file of all H. influenzae genomes from NCBI Entrez Nucleotide
Note: I don’t know if this is the best way, just a way.
2. Search for “Haemophilus influenzae [Organism] WGS” and get 576 results, 288 each from GenBank and RefSeq. (“WGS” stands for whole genome shotgun, I think, but does not include completed genomes, for some reason.)
3. On the “Limits” tab, select either the “GenBank” or “RefSeq” results from the pop up menu titled “Only from:” to get the 288 GenBank records or the 288 RefSeq records. I am not sure which you should use, but I lean towards GenBank over RefSeq, as it is the submitted form of the record, so my example will proceed using the GenBank results.
4. From the “Display” pop-up, choose “FASTA”, and it will give you the FASTA of the first five results (Sorry, no static link available).
5. Next, choose to display a maximum of 500 results, which gives us the FASTA of all 288 results in a web page format (Sorry, no static link available).
From here, you could probably use this file to make a BLAST database. However, there are 9 nearly empty FASTA records that should be deleted. You can do this many ways, but I like to do so graphically using the free TextWrangler, by the makers of BBEdit. To locate the empty records, I did a “find all” for “00000” (that is, 5 zeros). Once you delete these records, you end up with a multi-FASTA file containing genomic sequences from an additional 9 Hin strains:
22.4-21 (44 contigs), 22.1-21 (18 contigs), 3655 (23 contigs), PittAA (40 contigs), PittHH (59 contigs), PittII (25 contigs), R2846 (20 contigs), R2866 (4 contigs), and R3021 (46 contigs). I certainly can’t speak to the quality or completeness of these sequences, but you can download my results (.zip, 5 MB), if they would be useful.
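The manual cleanup step could also be scripted. Below is a minimal Python sketch, assuming that “nearly empty” means a record whose sequence is shorter than some cutoff. The function names and the `min_length=50` default are my own placeholders, not anything from NCBI or TextWrangler, so adjust the cutoff after looking at your own file.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_nearly_empty(records, min_length=50):
    """Keep only records whose sequence has at least min_length characters.

    min_length is an assumed cutoff for "nearly empty", not an NCBI rule.
    """
    return [(h, s) for h, s in records if len(s) >= min_length]


def format_fasta(records, width=70):
    """Write records back out as FASTA text, wrapping sequences at width."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "".join(line + "\n" for line in lines)
```

To use it, read the saved file into a string, pass it through `parse_fasta` and `drop_nearly_empty`, and write the output of `format_fasta` to a new file. It is worth comparing the number of dropped records against the 9 found by hand before trusting the cutoff.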
I am sure there is some better way to do this, but I haven’t been able to find an FTP server where I can locate these files. A particular problem with this method is that it tends to slow your web browser to a crawl. I accomplished steps 4, 5, and 6 with a liberal use of the Safari “stop loading page” button. Basically, I let the intermediate step pages begin to load, then stop them before they complete, so I can choose the next setting. With Firefox, I was unable to complete this tutorial, because the browser would become completely non-responsive for at least 10 minutes. Be careful!
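One way to avoid the browser altogether might be NCBI’s E-utilities, which expose Entrez searches over plain HTTP. The sketch below is untested against the live service. It assumes the `nuccore` database name and the same search term used in step 2, and it does not reproduce the GenBank/RefSeq limit from step 3. The function names are mine. NCBI also asks heavy users to identify themselves and to pace their requests, which this sketch does not handle.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore"):
    """Build an ESearch URL that stores the result set on NCBI's history server."""
    params = {"db": db, "term": term, "usehistory": "y", "retmax": 0}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_esearch(xml_text):
    """Pull the hit count and history-server keys out of an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count")), root.findtext("WebEnv"), root.findtext("QueryKey")


def efetch_url(webenv, query_key, db="nuccore", retstart=0, retmax=500):
    """Build an EFetch URL that returns one batch of the stored results as FASTA."""
    params = {
        "db": db,
        "query_key": query_key,
        "WebEnv": webenv,
        "rettype": "fasta",
        "retmode": "text",
        "retstart": retstart,
        "retmax": retmax,
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def download_fasta(term, out_path, batch=500):
    """Search, then append every batch of FASTA records to out_path."""
    with urlopen(esearch_url(term)) as resp:
        count, webenv, query_key = parse_esearch(resp.read().decode())
    with open(out_path, "w") as out:
        for start in range(0, count, batch):
            url = efetch_url(webenv, query_key, retstart=start, retmax=batch)
            with urlopen(url) as resp:
                out.write(resp.read().decode())
    return count
```

Calling `download_fasta("Haemophilus influenzae [Organism] WGS", "hin_wgs.fasta")` would then write everything the search returns to one file (the output filename is just an example). The search may well return a different number of records today than the 576 I saw.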