
Using OGSA-DAI with binary large objects (BLOBs)

OGSA-DAI has several activities that are useful when reading BLOBs from, and writing BLOBs to, relational databases. This page presents the recommended way of using these activities and also raises some database-specific issues that can be important when manipulating BLOBs.

Reading BLOBs

To read BLOBs from relational data service resources using OGSA-DAI, the sqlQueryStatement activity must be used to specify an SQL query that selects the BLOBs from the database. Once a BLOB has been obtained from the database there are several ways of delivering it back to the client. This section discusses the delivery options.

Return in the response document using base 64 encoding

The simplest way to deliver BLOBs back to the client is to pass the output from the sqlQueryStatement activity to either the sqlResultsToXML or the sqlResultsToCSV activity as shown:

The sqlResultsToXML and sqlResultsToCSV activities convert an SQL result set into a textual format that can be returned to the client within the response document. Within these textual formats binary data is represented using base64 encoding, a way of encoding arbitrary binary data using only printable ASCII characters. The size in bytes of the encoded data will be approximately 135% of that of the original data.
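The overhead is easy to see in a small standalone sketch (plain Python, used here only to illustrate the encoding; in OGSA-DAI the encoding happens on the server):

```python
import base64

# base64 maps every 3 input bytes to 4 output characters, so the encoded
# form is at least 4/3 (roughly 133%) of the original size; the line breaks
# that some encoders add push this towards the ~135% figure quoted above.
blob = bytes(range(256)) * 12              # 3072 bytes of sample binary data
encoded = base64.b64encode(blob)

print(len(blob), len(encoded))             # 3072 4096
print(round(len(encoded) / len(blob), 3))  # 1.333
```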

The advantages of this approach are:

  • Supports multiple BLOB columns in the result set.
  • Supports multiple rows in the result set.

The disadvantages of this approach are:

  • Returning the result within the response document requires the textual encoding of the whole result set to be held in memory at one time. This can lead to the JVM requiring large amounts of memory, which can cause the server or client to run very slowly or even run out of memory.
  • It is only really suitable for small BLOBs.
  • The use of base 64 encoding produces larger messages.

Return as an attachment built using byte arrays

The expense of base 64 encoding can be avoided by returning the binary data to the client as an attachment to the response message. In order to deliver the binary data as an attachment we must first extract the binary data from the SQL result set. This can be done using the sqlResultToBytes activity, the output of which can then be sent to the deliverToAttachment activity as shown:

The sqlResultToBytes activity can extract a single BLOB from the result set. It supports neither multiple BLOBs within a row nor multiple rows.

The sqlResultToBytes activity allows the client to specify the number of bytes in each block and hence allows the binary data to be streamed. The deliverToAttachment activity reassembles these blocks into a single large byte array that is then passed to the SOAP engine for delivery as an attachment. The deliverToAttachment activity therefore does not support streaming and must hold the whole BLOB in memory at one time. For large BLOBs this is an undesirable approach to take.
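The memory implication can be sketched in a few lines of standalone Python (the function and variable names here are illustrative and are not part of the OGSA-DAI API):

```python
import io

BLOCK_SIZE = 8192  # bytes per block, as a client might request from sqlResultToBytes

def read_in_blocks(stream, block_size=BLOCK_SIZE):
    """Yield a stream's contents as fixed-size blocks (the last may be short)."""
    while True:
        block = stream.read(block_size)
        if not block:
            return
        yield block

# Reassembling the blocks into one byte array, as deliverToAttachment does,
# puts the whole BLOB in memory at once, no matter how small the blocks are.
source = io.BytesIO(b"\x00" * 100_000)  # stand-in for a 100 kB BLOB
attachment = b"".join(read_in_blocks(source))
print(len(attachment))  # 100000
```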

The advantages of this approach are:

  • Data is passed from server to client in binary format.

The disadvantages of this approach are:

  • Can extract only one BLOB from a single column of the first row of the result set.
  • Requires the full BLOB to be held in memory by the deliverToAttachment activity.

Return as an attachment built using a file

The deliverToAttachment activity also supports inputs of type java.io.File. When the activity gets such an input it does not need to hold the whole contents of the file in memory in order to send it as an attachment. It therefore has a much smaller memory requirement than if the same data were passed as a sequence of byte array blocks. In order to make use of this functionality the output from the sqlResultToBytes activity can be passed to a bytesToTempFile activity and then on to the deliverToAttachment activity as shown:
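The memory behaviour of the file-based route can be sketched as follows (standalone Python, not the OGSA-DAI API; the block size and file handling are illustrative assumptions):

```python
import io
import os
import tempfile

BLOCK_SIZE = 8192  # illustrative block size

def blocks_to_temp_file(stream, block_size=BLOCK_SIZE):
    """Copy a stream into a temporary file one block at a time.

    At most one block is held in memory, which is the property that makes
    the file-based attachment route suitable for large BLOBs.
    """
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with tmp:
        for block in iter(lambda: stream.read(block_size), b""):
            tmp.write(block)
    return tmp.name

source = io.BytesIO(b"\x7f" * 1_000_000)  # stand-in for a 1 MB BLOB
path = blocks_to_temp_file(source)
print(os.path.getsize(path))  # 1000000
```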

The advantages of this approach are:

  • Data is passed from server to client in binary format.
  • The full BLOB need never be held in memory at a single time.

The disadvantages of this approach are:

  • Can extract only one BLOB from a single column of the first row of the result set.

Return using FTP or GridFTP

If the client wishes the BLOB data to appear as a file on their system then you can deliver the file via FTP with the deliverToURL activity or via GridFTP using the deliverToGFTP activity. Both of these approaches require the intermediate sqlResultToBytes activity as shown:

Both of these delivery activities can stream the data to the destination in blocks. It is therefore good practice to keep the memory usage small by selecting a suitable block size to use with the sqlResultToBytes activity.

The advantages of this approach are:

  • The data can be delivered directly to the client's file system.
  • The data can be delivered to a third party.
  • The data is passed from server to client in binary format.
  • The full BLOB need never be held in memory at a single time.

The disadvantages of this approach are:

  • Can extract only one BLOB from a single column of the first row of the result set.

Streaming using Data Transport functionality

Another delivery mechanism that may be useful when moving BLOBs is the OGSA-DAI Data Transport functionality. This has the advantage of allowing data to be delivered from one OGSA-DAI perform document to another. The chain of activities is shown:

OGSA-DAI's Data Transport functionality also allows clients to obtain the data in streamed blocks, so it can be useful for more than service-to-service data transport.

The advantages of this approach are:

  • The data can be pulled in chunks by the client or third party.
  • The data is passed from server to client in binary format.
  • The full BLOB need never be held in memory at a single time.
  • Data can be passed between OGSA-DAI servers.

The disadvantages of this approach are:

  • Can extract only one BLOB from a single column of the first row of the result set.
  • Too many repeated calls to the server to pull the next block of data can result in a slow data transfer.

Issues

All of the approaches recommended for handling all but the smallest BLOBs use the sqlResultToBytes activity. This activity can only handle a single BLOB in the result set. It should be easy to extend the activity so that it can have multiple outputs and so can handle multiple BLOBs in each row.

Handling multiple rows is a bit more tricky. Doing so would require outputting special markers to separate one BLOB's data stream from the next. The various other activities to which this output is delivered would also need to handle these new markers and behave accordingly. This is something the OGSA-DAI team hope to look into for future releases.
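One possible marker scheme is length-prefix framing, sketched below in standalone Python. This is purely a hypothetical illustration of the idea and not a format that OGSA-DAI defines:

```python
import io
import struct

# Hypothetical framing: prefix each BLOB's data stream with its length as a
# big-endian 8-byte integer, so a consumer knows where one BLOB ends and the
# next begins.

def write_framed(out, blobs):
    for blob in blobs:
        out.write(struct.pack(">Q", len(blob)))  # 8-byte length header
        out.write(blob)

def read_framed(inp):
    while True:
        header = inp.read(8)
        if not header:
            return
        (length,) = struct.unpack(">Q", header)
        yield inp.read(length)

buf = io.BytesIO()
write_framed(buf, [b"first blob", b"second, longer blob"])
buf.seek(0)
print([len(b) for b in read_framed(buf)])  # [10, 19]
```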

Writing BLOBs

To write BLOBs to relational data service resources using OGSA-DAI, the sqlUpdateStatement activity must be used. This activity specifies a parameterized query, together with inputs to which the parameter values must be sent.

The sqlUpdateStatement activity expects each input block to contain a whole column value. To pass BLOB data you have two options:

  • Pass a single byte array containing the whole BLOB.
  • Pass a reference to a file containing the BLOB data.

In keeping with OGSA-DAI's aim of streaming data and avoiding large amounts of data being held in memory at one time, the second of these approaches is by far the better. The bytesToTempFile activity allows binary data to be streamed into a temporary file that can then be passed to the sqlUpdateStatement activity.
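The pattern can be illustrated with standalone Python, using SQLite as a stand-in for the real data resource (sqlite3 here is purely illustrative; OGSA-DAI issues the equivalent parameterized update through JDBC):

```python
import sqlite3
import tempfile

# Write some sample BLOB data to a temporary file, standing in for the
# output of the bytesToTempFile activity.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00\x01\x02" * 1000)  # 3000 bytes of fake binary data
    blob_path = tmp.name

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, data BLOB)")

# The parameterized update: the file's contents are bound to the ? parameter,
# mirroring how sqlUpdateStatement receives a parameter value for the BLOB.
with open(blob_path, "rb") as f:
    conn.execute("INSERT INTO images (data) VALUES (?)", (f.read(),))
conn.commit()

(stored,) = conn.execute("SELECT data FROM images WHERE id = 1").fetchone()
print(len(stored))  # 3000
```

Note that `f.read()` above still pulls the whole file into memory for the sake of a short example; a JDBC driver can avoid this by binding the parameter with `PreparedStatement.setBinaryStream`, which is the kind of streaming that a file-based input makes possible.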

There are various options for creating the byte stream that is piped into the bytesToTempFile activity. This section discusses these options.

Get data using FTP or GridFTP

Remote files can be delivered as blocks of byte arrays into OGSA-DAI using the deliverFromURL or deliverFromGFTP activities. The output of these activities can be piped into the bytesToTempFile activity and on to the sqlUpdateStatement activity to insert the data in the remote file into the database as shown:

The advantages of this approach are:

  • The data can be pulled directly from the client's file system.
  • The data can be pulled from a third party's file system.
  • The data is passed to the server in binary format.

The disadvantages of this approach are:

  • Only a single row can be inserted into the database.

Get data using Data Transport functionality

OGSA-DAI's Data Transport functionality can be used to provide the input to the bytesToTempFile activity. This has the advantage of allowing data to be delivered from one OGSA-DAI perform document to another. The chain of activities is shown:

OGSA-DAI's Data Transport functionality also allows clients to send the data in streamed blocks, so it can be useful for more than service-to-service data transport.

The advantages of this approach are:

  • The data can be pushed in chunks by the client or third party.
  • The data is passed from client to server in binary format.
  • The full BLOB need never be held in memory at a single time.
  • Data can be passed between OGSA-DAI servers.

The disadvantages of this approach are:

  • Only a single row can be inserted into the database.
  • Too many repeated calls to the server to push the next block of data can result in a slow data transfer.

Issues

Unfortunately OGSA-DAI does not yet provide a deliverFromAttachment activity. Such an activity would allow BLOBs to be delivered to the OGSA-DAI server as an attachment. We hope to provide this in the future.

Similarly we do not provide a mechanism for BLOB data to be sent to OGSA-DAI within the perform document using base 64 encoding. We would like to provide this functionality in the future.

Database and operating system specific issues

Processing large BLOBs can require that service administrators and clients have a good understanding of the database and operating system they are using. Any database-specific issues we have discovered are logged here.

MySQL

MySQL has a variable called max_allowed_packet that controls the maximum size of the communication buffer between the database and a client. This variable limits the size of a row in a result set. In order to return rows that contain large BLOBs this variable may have to be set to a value higher than the default.
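For example, the limit can be raised in the MySQL server configuration file. The 64M value below is only an example; size it to the largest row (including BLOB columns) that your queries are expected to return:

```ini
# my.cnf / my.ini - example value only
[mysqld]
max_allowed_packet=64M
```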

Windows operating system

Too many repeated calls to the server within a short period of time can cause clients running on Windows to run out of reserved ports. This can happen when using the Data Transport functionality to transfer a large BLOB in lots of small chunks. When this happens clients will see the following error:

java.net.BindException: Address already in use: connect

A detailed explanation of the problem and suggested solutions can be found here. In many cases the easiest solution is simply to transfer the data in a smaller number of larger chunks.
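A back-of-envelope calculation shows why chunk size matters. The port-range figure below is an assumption based on the default ephemeral port range of older Windows versions (ports 1025 to 5000):

```python
# Each pulled block can cost one TCP connection, and each closed connection
# ties up an ephemeral port in TIME_WAIT for a while afterwards.
EPHEMERAL_PORTS = 5000 - 1025   # assumed older-Windows default range

blob_size = 100 * 1024 * 1024   # a 100 MB BLOB

small_chunks = blob_size // (8 * 1024)      # 8 kB blocks
large_chunks = blob_size // (1024 * 1024)   # 1 MB blocks

print(small_chunks)  # 12800 connections - more than the available ports
print(large_chunks)  # 100 connections - comfortably within the range
```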