Molecular Systematics Project Tips

By Peter Unmack

When you first start working in a molecular lab, there is a bamboozling variety of details to keep track of. Here I outline my strategies for keeping track of different projects. The key point is not that you do it my way, but that you develop your own scheme that works for you.

The first point is to start off with everything electronic. Do not hand-write PCR sheets or simply write everything into a lab notebook. At some stage in the process you will need to enter things into electronic format. Do it at step 1 and save yourself some duplication.

I start off by entering the list of individuals I extract directly into Excel (ideally with their locality and other important details recorded too). Then when I set up my first PCR on those samples I simply copy and paste the names into a PCR sheet (save it with the name of the group you ran the reaction on, the gene and the date). I usually print the PCR sheet so that I can attach a printout of the picture to it. When I am sorting through different PCR reactions and figuring out what worked and what didn't, I find a paper copy crucial. When I set up my sequencing reaction I copy and paste the sample names and primer details from my electronic PCR sheets to an electronic sheet that lists the cleaned PCR samples. Then I simply copy and paste that list into a sheet that lists the samples for sequencing. When I submit the cleaned sequenced samples I copy and paste into a sheet formatted for submission and I also paste it into a master list of sequences. Everything is set up to simplify that process. Different labs will have a different workflow depending on what stage samples get sent off for sequencing and how the sample details get transmitted to those doing the sequencing.

You can download examples of these Excel files: Extraction file, PCR file, Clean PCR file, Sequencing file, Master list.

Come up with a consistent naming scheme for your DNA samples. I usually try to limit names to fewer than 10 characters, as that is all I can write on the lid of a 0.6 ml tube. When I make collections in the field I assign each one a unique field code, typically like PU0935, where PU = my initials, 09 = year, 35 = collection within that year. When I extract DNA from this collection I replace my initials with the species initials, so that Galaxias occidentalis becomes GO0935.1, GO0935.2, etc. for each individual of that species from that locality. If I don't have a field number I use the first four letters of the creek name (making sure it is unique). Don't use dashes to separate things; use periods instead (some analysis programs will give you grief if you include - in the sample name, as a dash is the character for a gap).
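A naming scheme like the one above is easy to check mechanically. The following is a minimal sketch in Python, not a tool I use: the function names `make_sample_name` and `check_sample_name` are hypothetical, and the default limit of 9 characters simply reflects the "fewer than 10" tube-lid rule above.

```python
import string


def make_sample_name(species_initials, field_code, individual):
    """Build a DNA sample name such as GO0935.1.

    species_initials: e.g. "GO" for Galaxias occidentalis
    field_code: field code such as "PU0935" (initials + year + collection)
    individual: running number of the individual within that collection
    """
    # Strip the collector's initials (leading letters), keep the numeric part.
    digits = field_code.lstrip(string.ascii_letters)
    return f"{species_initials}{digits}.{individual}"


def check_sample_name(name, max_len=9):
    """Return a list of problems with a sample name (empty list = OK)."""
    problems = []
    if len(name) > max_len:
        problems.append(f"longer than {max_len} characters")
    if "-" in name:
        problems.append("contains a dash (use a period instead)")
    return problems


print(make_sample_name("GO", "PU0935", 1))
print(check_sample_name("GO-0935-12"))
```

Running the check over a whole column of names before you label tubes will catch names that are too long or contain dashes while they are still cheap to fix.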

I use a unique number for every sequence I do, e.g., PU24354, and when I submit samples I give each one the sequence number, DNA name and primer name. I also have all of this recorded in Excel, along with notes on whether the sequence worked or failed. I organize my sequences by year and by gel number so that I can easily find any specific chromatogram.

Keep an Excel sheet with all of your samples in it, with one line per individual. Have a column for each gene, and when you get a good complete sequence for an individual, mark it as done. Without some formal tracking system you will waste lots of time trying to figure out what still needs to be done. An updated Excel sheet will save you lots of time. Using Excel's filter function will quickly and easily show you what you are missing, which can then be pasted into your next PCR sheet. Try to put your samples in some logical order and/or match that order to your DNA samples in the freezer to make them easier to find. Here is a link to an example Excel file for a phylogenetic project with many genes.
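If you prefer a script to Excel's filter, the same "what is still missing" question can be asked of a CSV export of the tracking sheet. This is only a sketch under assumptions: the column names (`sample`, `cb`, `s7`), the word "done" as the marker, and the function name `missing_samples` are all made up for illustration.

```python
import csv
import io

# Hypothetical tracking sheet saved from Excel as CSV: one line per individual,
# one column per gene, marked "done" once a good complete sequence exists.
TRACKING_CSV = """sample,cb,s7
GO0935.1,done,done
GO0935.2,done,
GO0935.3,,done
"""


def missing_samples(csv_text, gene):
    """Return the sample names not yet marked done for one gene column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["sample"]
        for row in reader
        if (row[gene] or "").strip().lower() != "done"
    ]


# These lists are what you would paste into the next PCR sheet.
print("cb still needed:", missing_samples(TRACKING_CSV, "cb"))
print("s7 still needed:", missing_samples(TRACKING_CSV, "s7"))
```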

As you collect new data, run quick and dirty analyses as you go. Don't wait until the end. It only takes a couple of minutes to take your data from BioEdit into a program like MEGA and run a quick neighbor-joining tree to see that things are coming out "correctly". Some folks don't like NJ, but it is a great tool for quick and dirty data checking/exploration.
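An even cruder check than a tree is to look at uncorrected pairwise differences (p-distances) in an aligned FASTA file and see whether any sample is unexpectedly different from its neighbors. To be clear, the sketch below is not neighbor joining and not what MEGA does; it is a simplified stand-in, the sequences are invented, and the function names `read_fasta` and `p_distance` are hypothetical.

```python
from itertools import combinations

# Made-up aligned sequences, for illustration only.
ALIGNMENT = """>GO0935.1
ACGTACGTAC
>GO0935.2
ACGTACGTAT
>XX0101.1
ACG-ACGAAT
"""


def read_fasta(text):
    """Parse FASTA text into a dict of name -> sequence."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line.upper()
    return seqs


def p_distance(a, b):
    """Proportion of differing sites, skipping sites with a gap, N or ?."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x in "-N?" or y in "-N?":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else float("nan")


seqs = read_fasta(ALIGNMENT)
for first, second in combinations(seqs, 2):
    print(first, second, round(p_distance(seqs[first], seqs[second]), 3))
```

A pair that is far more divergent than expected for its species is worth a second look at the chromatogram or the tube label before you go any further.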

It is good to develop a standard naming system for your various files. The list below is typical of what I have for my data analysis files, although I usually end up with many more. I start with the taxon or group of interest, then the gene, and then what the file contains. You will potentially end up with a great many files for the different analyses that you do, and it is critical to be able to identify which is which. Inevitably you will find errors and have to go back and update/change files as well, so add something to the name to make that clear (e.g., final, fixed.july.7, etc.). Simply sorting by date may not work if you are actively working with several of the files.

birdshead.cb.final.meg	birdshead = group of interest, cb = gene abbreviation
birdshead.cb.meg	mega format
birdshead.cb.fas	fasta format
birdshead.cb.phy	phylip format
birdshead.cb.nex	nexus format
birdshead.cb.mt.nex	Modeltest generation file
birdshead.cb.model.scores	output from paup for obtaining model scores
birdshead.cb.model.scores.out	output from modeltest containing the model scores
birdshead.cb.ml.nex	ML analysis file
birdshead.cb.ml.tree.nex	ML tree file
birdshead.cb.mlb.nex	ML bootstrap analysis file
birdshead.cb.mlb.tree.nex	ML bootstrap tree file
birdshead.cb.mp.nex	MP analysis file
birdshead.cb.mpb.nex	MP bootstrap file
birdshead.cb.mpb.tree.nex	MP bootstrap tree file
birdshead.cb.each.spp.group.meg	mega file with species groupings
birdshead.s7.meg	S7 = gene abbreviation
birdshead.comb.fas	combined file with both genes
birdshead.comb.phy	combined file with both genes
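If you generate many of these files from scripts, it can help to build the names in one place so that they always follow the same group.gene.content.extension pattern as the list above. A minimal sketch, assuming a hypothetical helper called `analysis_filename`:

```python
def analysis_filename(group, gene, *content, ext):
    """Join group, gene, optional content tags and extension with periods.

    group:   taxon or group of interest, e.g. "birdshead"
    gene:    gene abbreviation, e.g. "cb", "s7" or "comb"
    content: zero or more tags describing the file, e.g. "mlb", "tree"
    ext:     file format extension, e.g. "nex", "meg", "fas", "phy"
    """
    parts = [group, gene, *content, ext]
    for part in parts:
        # Spaces and dashes tend to cause trouble, so refuse them up front.
        if not part or " " in part or "-" in part:
            raise ValueError(f"bad name part: {part!r}")
    return ".".join(part.lower() for part in parts)


print(analysis_filename("birdshead", "cb", ext="meg"))
print(analysis_filename("birdshead", "cb", "mlb", "tree", ext="nex"))
print(analysis_filename("birdshead", "cb", "final", ext="meg"))
```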

You may find that some labs have a fairly fixed way of doing things. It is often a good idea to speak to folks in other labs about how they do things, as many labs do things differently. The key is to find a workflow that works best for you.
