Wednesday, February 24, 2010

Using Aspera instead of FTP to download from NCBI

If you often download large amounts of data from NCBI's FTP site, you might be interested to know that NCBI has recently started using the commercial software Aspera to improve download speeds. This was announced in their August newsletter and at first applied only to the Short Read Archive (SRA). However, I recently found out that they are now making all of their data available through Aspera.

How to use it (web browser)
  1. Download and install the Aspera browser plugin software.
  2. Browse the Aspera NCBI archives.
  3. Click on the file or folder you want to download and choose a place to save it.
  4. The Aspera download manager should open (see the critique below) and show the download progress.
How to use it (command line)
  1. The browser plugin also includes the command line program ascp (on Linux it is in ~/.aspera/connect/bin).
  2. There are many options but the standard method is:
ascp -QT -i ../etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:/source_directory /destination_directory/

e.g.:
ascp -QT -i ../etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:/genomes/Bacteria/all.faa.tar.gz ~/

Critique
  • Windows machine with Firefox worked with no problems, and download speeds at my institution were much faster than with FTP (~0.5-4.0 Mbps vs 50-300 kbps).
  • Browser plugin with Firefox on Linux would not work! The plugin seemed to be loaded properly, but the Aspera download manager would not start. Update: This was because I tried to install the plugin as root, which caused a permission error. The plugin is installed in your home directory and must not be installed as root.
  • Download with command line in Linux was unreliable. This was a huge disappointment as this was the primary method I was hoping to use. Files would start to download correctly with very fast transfer speeds (1-4Mbps), but connection would drop with error: "Session Stop (Error: Connection lost in midst of data session)". Unfortunately, there is no way to resume the download so each time I had to start over. On about the 8th try it downloaded the file (6889MB) correctly. Update: see below
Personal Opinion
Although I was excited to see NCBI trying to improve data transfer speeds I was not very impressed with the Aspera solution. Hopefully, it will become more reliable in the future.
Of course, my personal solution would be for NCBI to embrace BitTorrent technology and make use of BioTorrents, but I will save that discussion for another day.


Update:
All ascp options are shown below (obtained by running ascp without arguments). However, I can't find any further documentation on these options. As noted in the comments below, -k2 is supposed to resume a download, but this didn't work for me when I tested it.
usage: ascp [-{ATdpqv}] [-{Q|QQ}] ...
[-l rate-limit[K|M|G|P(%)]] [-m minlimit[K|M|G|P(%)]]
[-M mgmt-port] [-u user-string] [-i private-key-file.ppk]
[-w{f|r} [-K probe-rate]] [-k {0|1|2|3}] [-Z datagram-size]
[-X rexmsg-size] [-g read-block-size[K|M]] [-G write-block-size[K|M]]
[-L log-dir] [-R remote-log-dir] [-S remote-cmd] [-e pre-post-cmd]
[-O udp-port] [-P ssh-port] [-C node-id:num-nodes]
[-o Option1=value1[,Option2=value2...] ]
[-E exclude-pattern1 -E exclude-pattern2...]
[-U priority] [-f config-file.conf] [-W token string]
[[user@]host1:]file1 ... [[user@]host2:]file2

-A: report version; -Q: adapt rate; -T: no encryption
-d: make destination directory; -p: preserve file timestamp
-q: no progress meter; -v: verbose; -L-: log to stderr
-o: SkipSpecialFiles=yes,RemoveAfterTransfer=yes,RemoveEmptyDirectories=yes,
PreCalculateJobSize={yes|no},Overwrite={always|never|diff|older},
FileManifest={none|text},FileManifestPath=filepath,
FileCrypt={encrypt|decrypt},RetryTimeout=secs

HTTP Fallback only options:
[-y 0/1] 1 = Allow HTTP fallback (default = 0)
[-j 0/1] 1 = Encode all HTTP transfers as JPEG files
[-Y filename] HTTPS key file name
[-I filename] HTTPS certificate file name
[-t port number] HTTP fallback server port #
[-x ]]
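
Since dropped connections forced me to restart by hand, one workaround is to wrap the transfer in a retry loop. The sketch below is only that: retry_cmd and RETRY_DELAY are names I made up for illustration, and the commented example adds -k 2 from the usage text above, which is supposed to resume but did not work in my own test.

```shell
#!/bin/sh
# retry_cmd MAX_TRIES COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits 0 or MAX_TRIES attempts
# have been made. RETRY_DELAY (seconds between attempts, default 10) is an
# assumed knob, not an ascp setting.
retry_cmd() {
    max_tries="$1"
    shift
    attempt=1
    while :; do
        # Success: stop retrying.
        "$@" && return 0
        echo "attempt $attempt of $max_tries failed" >&2
        # Give up once the last allowed attempt has failed.
        [ "$attempt" -ge "$max_tries" ] && return 1
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-10}"
    done
}

# Example (commented out so nothing is transferred when the file is sourced):
# retry_cmd 8 ascp -QT -k 2 -i ../etc/asperaweb_id_dsa.putty \
#     anonftp@ftp-private.ncbi.nlm.nih.gov:/genomes/Bacteria/all.faa.tar.gz ~/
```

If resume does not work, each retry simply starts the file over, which is no better than what I did by hand but at least does not need babysitting.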

Update 2:
After spending an afternoon with Aspera Support, I have some answers to my connection and resume issues with ascp. The problem was that I was not using the -l option to limit the speed at which ascp sends data. I had thought this limit would only be relevant if 1) I did not want to use all of my available bandwidth or 2) my computer hardware could not handle the bandwidth of the file transfer. Surprisingly, the reason for my disconnects was that NCBI was trying to send more data than my bandwidth allowed, which caused my connection to drop. I would have thought that ascp would handle these types of bandwidth differences, considering that all other data transfer protocols that I know of can control their rate of data flow. If this is the case, it suggests that my connection may still break if my available bandwidth drops for some reason (which would happen often due to network fluctuations at a large institution), even if I set the limit appropriately. Hopefully, Aspera can make their data transfer method a little more robust in the future. I don't think I will be replacing ftp with ascp in my download scripts quite yet.
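
As a rough sketch of what Aspera Support suggested, the example command from earlier in this post can be given an explicit cap with -l. The function name ncbi_fetch_capped and the ASCP override variable are my own inventions for illustration, and the 2M value in the example is only a guess based on the speeds I saw, not a recommended setting.

```shell
#!/bin/sh
# ncbi_fetch_capped RATE REMOTE_PATH LOCAL_DIR
# Hypothetical wrapper: runs the ascp command from earlier in this post with
# an added -l rate cap. RATE uses the rate-limit[K|M|G] form from the usage
# text. ASCP is an assumed override for the binary name, handy for dry runs.
ncbi_fetch_capped() {
    "${ASCP:-ascp}" -QT -l "$1" \
        -i ../etc/asperaweb_id_dsa.putty \
        "anonftp@ftp-private.ncbi.nlm.nih.gov:$2" "$3"
}

# Example (commented out so nothing is transferred when the file is sourced):
# ncbi_fetch_capped 2M /genomes/Bacteria/all.faa.tar.gz ~/
```

As in the original example, the key path is relative, so this assumes it is run from the directory containing ascp.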

Update 3:
Michelle from Aspera finally let me know that -Q is the default option I should be using to allow adaptive rate control. Now I am trying to download an entire directory, but I am still having connection issues. Here is a screenshot of my terminal showing that the directory resume is not working and I am losing my connection:

