Tuesday, September 2, 2014

GridFTP-HDFS Corruption Issue Workaround

Some sites in OSG have observed data corruption when transferring files with GridFTP-HDFS. The problem arises when pthreads is enabled (by setting GLOBUS_THREAD_MODEL="pthread") and GridFTP performs a *single stream* transfer that spans multiple HDFS blocks. Under these conditions, blocks may be written to the destination file in the wrong order. A few sites using GridFTP-HDFS have reported failures, including Fermilab and GLOW.

This issue affects the OSG gridftp-hdfs package, versions 0.5.4-14 and newer, because they have pthreads enabled by default.
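To check whether a given server falls in the affected range, something like the following sketch may help. The helper name is_affected is hypothetical, and the comparison relies on GNU sort -V, which only approximates RPM version ordering; treat the result as a hint rather than a definitive answer.

```shell
#!/bin/sh
# Sketch only: is_affected is a hypothetical helper, not part of any OSG tool.
# It succeeds when the given version-release string sorts at or after 0.5.4-14.
# Assumption: GNU sort -V is close enough to RPM ordering for these versions.
is_affected() {
    first=$(printf '%s\n' "0.5.4-14" "$1" | sort -V | head -n 1)
    [ "$first" = "0.5.4-14" ]
}

# Example use on an RPM-based host (not run here):
# installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' gridftp-hdfs)
# is_affected "$installed" && echo "pthreads enabled by default; see WORKAROUNDS"
```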



DETAILS

Transfers that use parallelism (we tested 2 to 10 streams) and single stream transfers that span only a single HDFS block appear to be unaffected.

Transfers using a single stream but spanning multiple (3 or more) HDFS blocks produce a destination file of the correct size, but usually with the wrong checksum. The issue was originally reported by a remote user transferring via srm-copy, and OSG testing has reproduced the same failures using local globus-url-copy tools.
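Because the corrupted file has the correct size, a size check alone will not catch the problem; comparing checksums will. A minimal sketch, assuming a trusted reference copy of the file is available locally and that md5sum is installed (same_checksum is a hypothetical helper, and the paths are placeholders):

```shell
#!/bin/sh
# Sketch only: same_checksum is a hypothetical helper, not a Globus tool.
# It succeeds when both files have the same MD5 digest.
same_checksum() {
    src_sum=$(md5sum < "$1" | cut -d' ' -f1)
    dst_sum=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$src_sum" = "$dst_sum" ]
}

# Example use after a single stream transfer (not run here):
# globus-url-copy gsiftp://$host:2811/path/file.in file:///path/file.out
# same_checksum /path/reference.copy /path/file.out || echo "checksum mismatch"
```

Two files with the same bytes in a different order have the same size but different digests, which is the failure mode described above.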

The issue has been reported to the Globus GridFTP developers, but there is no fix yet. In the meantime, the workarounds below avoid the problem.



WORKAROUNDS

On the server side, the problem can be avoided by disabling pthreads. To do so, comment out the GLOBUS_THREAD_MODEL line in /etc/gridftp.d/gridftp-hdfs.conf so that it reads:

# $GLOBUS_THREAD_MODEL pthread
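On sites with many servers, the same edit can be scripted. The following is a sketch rather than a tested procedure: disable_pthreads is a hypothetical helper, and it assumes the active line has the form "$GLOBUS_THREAD_MODEL pthread" as shown above. It keeps a .bak copy of the original file.

```shell
#!/bin/sh
# Sketch only: disable_pthreads is a hypothetical helper.
# It prefixes any active "$GLOBUS_THREAD_MODEL pthread" line with "# "
# and leaves lines that are already commented out untouched.
disable_pthreads() {
    sed -i.bak 's/^[[:space:]]*\$GLOBUS_THREAD_MODEL[[:space:]][[:space:]]*pthread/# &/' "$1"
}

# Example use as root (not run here):
# disable_pthreads /etc/gridftp.d/gridftp-hdfs.conf
```

Depending on how the GridFTP server is launched at your site, it may need to be restarted before the change takes effect.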

Until the bug is fixed, we recommend making this change on all servers running gridftp-hdfs 0.5.4-14 or newer.

Alternatively, if pthreads cannot be disabled on the GridFTP server, the problem can be avoided on the client side by running globus-url-copy with parallelism greater than 1. For example:

globus-url-copy -p 2 gsiftp://$host:2811/path/file.in file:///path/file.out

FOR MORE INFO

https://globus.atlassian.net/browse/GT-547
https://jira.opensciencegrid.org/browse/SOFTWARE-1495
https://ticket.grid.iu.edu/21825
https://ticket.grid.iu.edu/21157