1. Introduction
---------------

In the context of the NSF ExTENCI project, our group is exploring job submission paradigms that allow application scientists to transparently use XSEDE and Open Science Grid (OSG) infrastructure resources side by side as well as concurrently, based on their application requirements and compute and data characteristics. The high-level idea of our approach is to add a thin API layer on top of the heterogeneous job-submission and data-management interfaces of XSEDE (Globus, PBS, GridFTP) and OSG (Condor, Condor-G, Glidein-WMS, SRM) to provide a unified access layer to both worlds.

Our group has extensive experience with implementing interoperability in High-Performance Computing (HPC) environments: we have successfully provided computational scientists with the tools and interfaces to use multiple XSEDE and FutureGrid resources in a coordinated fashion in a production research environment. Two key technologies have been conceptualized and developed over the past five years and are continuously fostered as strategic core components for implementing infrastructure interoperability:

(1) SAGA: The Simple API for Grid Applications (SAGA) is an Open Grid Forum standard (OGF GFD-90) that provides a unified interface for job submission, file and data management, as well as coordination and communication. We are developing two implementations: (a) a C++ implementation (called SAGA-C++), and (b) a Python implementation (called Bliss). Both implementations are similar in that they implement the adaptor (plug-in) pattern to connect the API with various distributed computing middleware systems and services. SAGA supports a wide range of backends, the more prominent ones being PBS, LSF, Globus, Torque, SSH, Condor and Amazon EC2. SAGA is deployed as part of the 'Community Software Areas' (CSA) on most XSEDE and FutureGrid resources, where it is used by multiple computational science projects as well as computer science research experiments.

(2) BigJob: BigJob is a user-space Pilot-Job implementation written in Python. BigJob uses SAGA as the access mechanism for distributed middleware, such as PBS and LSF batch queuing systems. It uses light-weight 'agents' as resource proxies on the target systems to create its own resource overlay network in which arbitrary compute- and data-scheduling decisions can be made, based on application requirements. Our group has shown that BigJob can significantly reduce the overhead for high-throughput (HTC) and many-task computing (MTC) applications by bypassing per-job resource-level submission and management overhead and by employing compute and data co-allocation strategies. BigJob exposes a programming interface called the 'Pilot-API', which provides application developers with high-level handles to 'resources' (to instantiate resource overlays) and 'work-units' (jobs/tasks and data).

The combination of SAGA and BigJob has been used with great success in multiple interoperability projects, hence the decision to use the same technology stack to bridge the interoperability gap between XSEDE and OSG.
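For illustration, the short sketch below shows the application-facing side of this stack at the SAGA level, i.e., what a simple job submission looks like through the Bliss implementation. The endpoint URL is a placeholder, and the attribute and method names follow the Bliss examples as we use them and may differ slightly between releases; BigJob's Pilot-API builds its resource-overlay and work-unit handles on top of this interface.

    # Minimal SAGA (Bliss) job submission sketch -- the 'pbs+ssh' endpoint is
    # a placeholder, and attribute/method names may differ between releases.
    import bliss.saga as saga

    js = saga.job.Service("pbs+ssh://some.hpc.resource.example.org")

    jd = saga.job.Description()
    jd.executable = "/bin/hostname"
    jd.output     = "job.out"
    jd.error      = "job.err"

    job = js.create_job(jd)
    job.run()      # submit through the adaptor to the remote batch system
    job.wait()     # block until the job has finished
    print("job finished in state: %s" % job.get_state())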
In this report, we describe the different approaches we have explored in order to provide SAGA/BigJob capabilities on OSG (Section 2). We will focus especially on the misconceptions and problems that we have encountered in the process. We will distinguish between 'general problems' that we have encountered as 'novice' OSG users/developers (Section 3) and 'data-related' problems (Section 4) that seem to be more intrinsic to the overall OSG architecture and present a non-negligible obstacle to the successful execution of data-intensive HTC/MTC applications. In the last section, we discuss the implications that our findings may have for OSG interoperability, as well as possible solutions and strategies to overcome current limitations.

2. BigJob and SAGA on OSG
-------------------------

We have explored three scenarios: (1) the Pilot-API on top of Glidein-WMS, (2) BigJob using Condor-G, and (3) BigJob using Condor (vanilla). (Open question: should a Condor/DIANE combination be considered as well?)

3. General Problems
-------------------

Before we describe our problems in this and the following section, it is important to mention that nobody in our group has ever worked with the OSG infrastructure before. Most of our group members have an extensive track record of working with applications and application frameworks on HPC grids, and it is fair to say that we are obviously biased towards the XSEDE and FutureGrid mode of operation. However, we do have some experience with the European Grid Initiative (EGI) infrastructure, which seems to be conceptually similar to OSG. During the first phase of the ExTENCI project, we worked on several different prototypes (see Section 2) in order to explore how to integrate our Pilot-API with OSG. During this endeavor, we touched many different parts of OSG -- these are our observations from the perspective of a software developer:

(A) The most apparent problem that we have encountered is that OSG doesn't provide a mechanism to submit jobs remotely from a non-OSG machine to OSG resources. This implies that all interoperability experiments between OSG and other infrastructures have to be run from within the OSG domain and can't be hosted on a third-party host or a user's machine. It also implies that most of our development had to take place remotely on the command line of one of the OSG 'gateway' machines (engage-submit3.renci.org), which is more than tedious, especially over slow network connections. In XSEDE, it is common practice for users and developers alike to install the Globus client tools (globus.org/toolkit/) on third-party machines / laptops and initiate job submission and data transfers remotely (SAGA provides a plug-in for this) -- this doesn't seem to be possible with the Condor software stack. We have implemented a 'condor-over-ssh' tunneling mechanism in SAGA to circumvent these limitations; however, this strategy only works satisfactorily for very simple use-cases.
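To illustrate what such a tunneling workaround amounts to in its simplest form, the sketch below is a hypothetical stand-in (not the actual SAGA adaptor code): a trivial submit description is copied to the gateway node and condor_submit is invoked there over ssh. Key-based ssh login to the gateway and a working Condor installation there are assumed.

    # Hypothetical, simplified stand-in for the 'condor-over-ssh' idea
    # described above; the real SAGA adaptor is considerably more involved.
    import os
    import subprocess
    import tempfile

    GATEWAY = "engage-submit3.renci.org"   # gateway host mentioned above

    SUBMIT_DESCRIPTION = "\n".join([
        "universe   = vanilla",
        "executable = /bin/hostname",
        "output     = job.out",
        "error      = job.err",
        "log        = job.log",
        "queue",
    ]) + "\n"

    def submit_over_ssh(gateway, description):
        # Write the submit description to a local temporary file ...
        tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".submit", delete=False)
        tmp.write(description)
        tmp.close()
        remote_name = os.path.basename(tmp.name)
        # ... copy it to the gateway and run condor_submit there.
        subprocess.check_call(["scp", tmp.name, "%s:%s" % (gateway, remote_name)])
        return subprocess.check_output(["ssh", gateway, "condor_submit", remote_name])

    print(submit_over_ssh(GATEWAY, SUBMIT_DESCRIPTION))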
(B) Another peculiarity we have encountered is that the official support channels for OSG don't seem to be well defined. Support links on the website often lead to nothing (e.g., "Find you support center contact information here" at http://bit.ly/Iyy6kD), and the general support structure is not quite clear to us. Statements like "Interaction at resource level is determined by the VO. Please contact your VO representative for more information" are not very helpful either. While this is probably not too much of a concern for people who are deeply involved in larger OSG projects (help can probably be found 'down the hallway'), it is a problem for people engaging with OSG for the first time on a deeper, technical level. It must be mentioned that direct contact with the Engagement VO core team (esp. Mats Rynge) was crucial and very helpful. Without this unofficial support channel, we wouldn't have been able to develop our OSG interoperability prototypes.

(C) The unclear support structures go hand in hand with a lack of detailed technical documentation. The current OSG user (i.e., non-admin) documentation hub is available online at https://www.opensciencegrid.org/bin/view/Documentation/Release3/NavUserMain. An unofficial blog at http://osglog.wordpress.com/ also provides some useful information. The official documentation provides an overview of first steps, the available job submission methods and the different storage systems. The problem with this tool-centric documentation approach is that the user / developer is largely left alone with the task of figuring out which of the available technologies in the complex OSG ecosystem make the most sense for a specific application scenario. We would have found a guide that maps application requirements (i.e., high-level metrics such as processing and storage requirements, data transfer volume, and communication) to the capabilities of the available tools and software (including their limitations) most useful. We have also noticed that not all of the advertised tools are necessarily available and functional in all VOs and on all CEs.

(D) The general concept of, and especially the resource selection for, Condor-G jobs strikes us as moderately obscure. While the Glide-In WMS (vanilla Condor) pool for single processors (up to 4? cores) conveniently doesn't require (or allow?) explicit resource selection, jobs that require more than 4(?) cores per task have to switch to a different job submission paradigm: Condor-G. This paradigm requires the explicit selection of a ("grid") resource and exposes aspects of the Globus RSL (Resource Specification Language) at the Condor level. We see a problem in this approach, since it introduces (enforces) two different paradigms and interfaces for the same fundamental task: submitting jobs. This makes the practical implementation of applications and frameworks that need to support both types of workload non-trivial -- it struck us as a case for interoperability within OSG itself. The resource selection process for Condor-G jobs seems highly non-trivial: resources that are available to a specific VO can be discovered using an LDAP client. However, there doesn't seem to be any guarantee that the resources returned by a query will actually accept Condor-G jobs. Through a trial-and-error process, we have established a list of resources that 'tend to work' and that have all required software tools (e.g., SRM) installed.
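To make the 'two paradigms' issue tangible, the sketch below shows, in strongly simplified form, what a framework has to do to express the same task for the Glide-In pool and for Condor-G. The gatekeeper host, the jobmanager, the 4-core threshold and the submit attributes are placeholders based on the observations above, not a definitive recipe.

    # Illustration of the 'two paradigms' problem: the same task is expressed
    # either as a vanilla job (Glide-In pool, no resource selection) or as a
    # Condor-G grid-universe job with an explicitly named gatekeeper.
    VANILLA_TEMPLATE = "\n".join([
        "universe   = vanilla",
        "executable = %(exe)s",
        "queue",
    ]) + "\n"

    CONDOR_G_TEMPLATE = "\n".join([
        "universe      = grid",
        "grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs",
        "globus_rsl    = (count=%(cores)d)",
        "executable    = %(exe)s",
        "queue",
    ]) + "\n"

    def submit_description(exe, cores):
        # A framework has to branch on the core count (and carry its own
        # resource selection logic) to express the same fundamental request.
        template = VANILLA_TEMPLATE if cores <= 4 else CONDOR_G_TEMPLATE
        return template % {"exe": exe, "cores": cores}

    print(submit_description("/bin/hostname", 1))    # single-core: Glide-In pool
    print(submit_description("/path/to/solver", 8))  # multi-core: Condor-G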
(E) The Globus Toolkit seems to be installed on most OSG resources, presumably because both the Condor-G and the Glide-In mechanisms rely heavily on it. As mentioned at the beginning, our group is very familiar with the Globus software stack, and our own software stack consisting of SAGA and BigJob would not have required much adaptation, since it already provides support for Globus. However, on more than one occasion we were discouraged from interfacing with individual OSG resources via Globus directly, since this would potentially interfere with and/or override internal OSG accounting and scheduling policies / mechanisms.

(F) Lastly, another big source of frustration was the overall stability and the error-reporting mechanisms in case of failures. On many occasions, tasks would simply stay in the 'hold' state for days, and further investigation (i.e., parsing Condor .log files) would reveal in the best case an unavailable resource and in the worst case a far more cryptic error passed on from the Globus subsystem -- or nothing at all. It appears that one has to be familiar with the details of the otherwise encapsulated OSG architecture in order to track down and understand possible causes of failure. Another problem with this non-uniform way of error handling is that it makes it highly non-trivial to react appropriately to errors programmatically at the (user-space) application level, since errors often seem to require human interpretation (a minimal sketch of the kind of status probing this requires is appended at the end of this report).

We conclude that the OSG environment seems to enforce a very specific usage mode, dictated by the Condor software stack. While the OSG environment provides a convenient 'fire-and-forget' methodology for the types of applications originally envisioned by the system designers (applications that can be decomposed into compute-centric HTC / many-task workloads), it poses a serious limitation for more complex applications and user-space application frameworks. While HPC grids like XSEDE or FutureGrid also favor a certain usage mode (tightly-coupled HPC applications), they generally don't try to encapsulate system details, which makes it much easier to implement user-space software frameworks that support different usage modes. Furthermore, the OSG software stack carries a significantly higher amount of complexity than, e.g., the XSEDE and FutureGrid stacks. For this reason, OSG requires dedicated, pre-installed 'submission' or 'gateway' nodes to which application users and developers have to log in in order to access OSG resources. This, together with the fact that it doesn't seem possible to submit jobs 'from the outside' (remotely, from a non-gateway system) to OSG, makes the design and implementation of simple (grid-) interoperability solutions if not impossible, then at least unnecessarily complex.

4. Data-Related Problems
------------------------

(A) SRM et al. are decoupled from Condor.

(B) Co-allocation is not possible.

5. Implications
---------------

What does this mean for further development? BigJob on OSG? How? Useful at all?
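Appended sketch (referenced from Section 3, item F): a minimal example of the kind of status probing a user-space framework has to fall back on in order to detect held jobs and surface their (often cryptic) hold reasons. The job ID is a placeholder; condor_q's -format option is assumed to be available on the submit host.

    # Minimal sketch of probing a Condor job's state and hold reason from
    # user space; condor_q's -format option extracts single ClassAd attributes.
    import subprocess

    def classad(job_id, attr, fmt="%s\n"):
        # Query one ClassAd attribute of a queued job via condor_q.
        out = subprocess.check_output(["condor_q", "-format", fmt, attr, job_id])
        return out.decode().strip()

    job_id = "123.0"                                    # placeholder cluster.proc
    status = classad(job_id, "JobStatus", fmt="%d\n")   # JobStatus 5 == held
    if status == "5":
        # The HoldReason string frequently requires human interpretation, which
        # is exactly what makes automated error handling so difficult.
        print("job %s is held: %s" % (job_id, classad(job_id, "HoldReason")))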