We will use the conda
software installation tool to install all of our software locally.
Run 'source ~/.bashrc' after installation.
## example argument entry to conda
# conda install -c <channel> <package-name>
## install software from my channels
conda install -c ipyrad ipyrad
conda install -c eaton-lab toytree
## install entrez-direct and sra-tools from the commonly used channel bioconda
conda install -c bioconda entrez-direct sra-tools
We'll return to conda later to install more software.
An introduction to ipyrad setup and parameter settings.
Create an empty params file with default values by calling ipyrad with the -n option:
ipyrad -n example
Use a text editor, such as nano, to enter new values into the params file.
Select the params file with -p and the assembly steps to run with -s:
ipyrad -p params-example.txt -s 1234567
View a summary of results with -r:
ipyrad -p params-example.txt -r
Call ipyrad with -h to see several other options. See a great PyCon keynote address on this topic here (I mean, watch it later if you're interested.)
An example we will reproduce today:
http://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/tests/cookbook-empirical-API-1-pedicularis.ipynb
Published examples from my own work:
https://github.com/dereneaton/RADmissing
https://github.com/dereneaton/virentes
When you call jupyter from a terminal it starts a notebook server, which you can access through your browser to connect to computing kernels (e.g., Python, R, Julia, or bash) that run code and display its output.
## call this command from a terminal on your laptop
jupyter-notebook
This can be a little tricky, but once you figure it out it is very rewarding. You can leave the notebook running on a cluster and connect to it when you want to run analyses, then disconnect and reconnect later. This link provides a more in-depth description, with a short video tutorial (with sound), for doing this same thing through a job submission script on a cluster. But for now, just run the commands below on the MBL cluster.
This should be run on the cluster (remote)
## first, let's make a password for our notebook
jupyter-notebook password
## Next, launch a notebook server with this command
jupyter-notebook --no-browser --ip=$(hostname -i) --port=$(shuf -i8000-9999 -n1)
Look at the output of the notebook server: it will list a port number that was randomly selected (this is to avoid y'all using the same port) and it will show an IP address (e.g., 127.0.1.1).
This should be run on your laptop (local), substituting the port number and IP address printed by your own notebook server, and your own user and host.
## on your laptop (local) call this command
## ssh -N -L port:ip:port user@login
ssh -N -L 8888:127.0.1.1:8888 deaton@class05.jbpc-np.mbl.edu
ipyrad makes use of the ipyparallel library for large-scale parallel computing, a library designed particularly for interactive work in jupyter notebooks.
## start a cluster with all available cores
ipcluster start
## start a cluster with N cores
ipcluster start --n=4
## start a cluster with a specific name and settings
ipcluster start --n=4 --profile='deren'
## start a cluster that will connect to all cores on multiple hosts
ipcluster start --n=80 --engines=MPI --ip=*
When ipyrad connects to the ipcluster instance you can see details of the connection.
import ipyrad as ip
import ipyparallel as ipp
## connect to client
ipyclient = ipp.Client()
## print client info
ip.cluster_info(ipyclient)
host compute node: [4 cores] on oud
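If you started the cluster under a named profile (as in the --profile='deren' example above), the client needs the same name to find it; a minimal sketch:
## connect to an ipcluster instance started with --profile='deren'
ipyclient = ipp.Client(profile="deren")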
The general approach is to distribute work among the kernels (engines), wait for them to finish, and then collect the results. All of this happens under the hood in ipyrad, but a simple understanding of it helps to understand how to easily parallelize any type of code within a jupyter-notebook.
def f(x):
    return x + 10

## send a job to the cluster
job = ipyclient[0].apply(f, 10)

## wait for the job to finish before leaving the loop
while 1:
    if job.ready():
        break

## print the result
print('result is', job.result())
result is 20
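The same pattern scales to many jobs at once, similar to what ipyrad does under the hood. A minimal sketch using ipyparallel's load-balanced view to map the function f from above across all engines:
## distribute jobs across all engines and block until the results return
lview = ipyclient.load_balanced_view()
results = lview.map_async(f, range(10))
print(results.get())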
## an Assembly object
data = ip.Assembly("example")
## the parameters
data.set_params("project_dir", "tutorial")
data.set_params("sorted_fastq_data",
"/class/molevol-shared/ipyrad-tutorial/fastqs-Ped/*.gz")
## run the analysis on a specific cluster
data.run('1', ipyclient=ipyclient)
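Once a step finishes, you can inspect the results summary. The Assembly's stats attribute holds a table of per-sample results; a quick sketch:
## print a summary table of per-sample results so far
print(data.stats)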
## branch and subsample
subsamples = [i for i in data.samples if "prz" not in i]
sub = data.branch("sub", subsamples=subsamples)
## branch and change params over a range of values
assemblies = {}
for ct in [0.80, 0.85, 0.90, 0.95]:
    newname = "clust_{}".format(ct)
    newdata = data.branch(newname)
    newdata.set_params("clust_threshold", ct)
    assemblies[newname] = newdata
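A sketch of how the branched assemblies might then be run against the same client; this assumes step 3 (within-sample clustering) is the next step for these data:
## run the next step (3, clustering) on each branched assembly
for name, bdata in assemblies.items():
    bdata.run("3", ipyclient=ipyclient)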
## split an Assembly into two groups
allsamples = list(data.samples.keys())
data1 = data.branch("data1", allsamples[:5])
data2 = data.branch("data2", allsamples[5:])
## set different params on each assembly
data1.set_params("mindepth_statistical", 10)
data2.set_params("mindepth_statistical", 5)
## run each Assembly
data1.run("5")
data2.run("5")
## merge the samples back into a single Assembly
full = ip.merge("full", [data1, data2])
full.run("67")
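A quick way to confirm the merge is to view the combined stats table, which should now list the samples from both branches (a sketch, assuming the runs above completed):
## the merged Assembly contains the samples from both inputs
print(full.stats)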
Open the link below for one of the following notebooks. They're very similar: the first is a group of plants, the second a group of birds, so choose your favorite. We will then each create a new notebook from scratch, copy/pasting code from these notebooks to produce our own new analysis notebook (run either on your laptop or on the cluster).
The ipyrad analysis toolkit provides additional downstream formatting and filtering of data sets, as well as wrapper tools that allow running external software in efficient and reproducible ways.
## the general design of analysis tools
import ipyrad.analysis as ipa
## initiate an analysis object with a name, data file, and other args
r = ipa.raxml(
    name="example",
    data="analysis-ipyrad/output.phy",
    ## ... any additional arguments
)
## see and change the object.params
r.params
## run command to execute the code or send to client
r.run(ipyclient)
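For instance, a sketch of adjusting a setting before running; this assumes the raxml wrapper exposes raxml's own flags (such as N, the number of bootstrap replicates) as params, so check r.params for the names it actually uses:
## assumes 'N' (bootstrap replicates) is an available param; see r.params
r.params.N = 100
r.run(ipyclient)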
You are now ready to go forth and assemble RAD-seq data sets to your very particular needs, and to keep a detailed record of what you did to share with your friends and cruel reviewers.