
HDF5DataHandler

Background information sources

Data type mappings

James, please help with this:

  1. Does DAP support 64-bit and 128-bit integers? No. We might if many users request it, but it is better to leave out types that are infrequently used than to make every client support them. Of course, clients can choose not to support them, but making clients hard(er) to write seems to limit use more than not supporting some data does. [jhrg 5/10/07]
  2. Is the datatype size in DAP fixed? For example, is byte always 8-bit and Int32 always 32-bit? [ky 2/22/2007] Yes. [jhrg 5/10/07]
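
To make the two answers above concrete, here is a minimal Python sketch of a fixed-width type lookup. The DAP2 type names are the standard ones; the function name `dap2_type_for` and the choice to promote signed 8-bit integers to Int16 (DAP2 Byte is unsigned) are assumptions for illustration, not a description of what the handler does.

```python
# DAP2 atomic types have fixed sizes: Byte is always 8 bits, Int32 always 32.
# Keys are (kind, size_in_bits); kind is "int", "uint" or "float".
DAP2_TYPES = {
    ("uint", 8): "Byte",
    ("int", 8): "Int16",     # assumption: promote, since DAP2 Byte is unsigned
    ("int", 16): "Int16",
    ("uint", 16): "UInt16",
    ("int", 32): "Int32",
    ("uint", 32): "UInt32",
    ("float", 32): "Float32",
    ("float", 64): "Float64",
}


def dap2_type_for(kind, bits):
    """Return the DAP2 type name, or None when DAP2 has no matching type."""
    return DAP2_TYPES.get((kind, bits))


if __name__ == "__main__":
    for kind, bits in [("int", 32), ("uint", 8), ("int", 64), ("int", 128)]:
        print(kind, bits, "->", dap2_type_for(kind, bits))
```

With this table, 64-bit and 128-bit integers come back as None, which is where a handler would have to decide whether to skip the variable or report an error.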

The following table summarizes the difficulty of mapping between HDF5 and DAP. It is based on the HDF5-DODS Data Model and Mapping paper and DAP 2.0. -- JoeLee - 15 Feb 2007

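| HDF5 Objects | | DODS Objects | |
| Dataset | Integer | Integer | Atomic |
| | Float | Float | |
| | String | String | |
| | Reference | ? | |
| | Date Time | ? | |
| | Bit Field | ? | |
| | Compound | Array? | Constructor |
| Group | Group | Structure? | |
| | ? | Grid | |
| | ? | Sequence | |
| Attribute | Attribute | Attribute | Attribute |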

Add Group by subclassing Structure and maybe adding some new operators. This would show up in a DDX as a new type distinct from Structure, but in the DDS/DAS it would show up as a Structure and the client would not be able to use the new operations.

The Group has to be different than the general Structure in DDS/DAS since HDF5 supports a compound datatype which should map to Structure.

If you are thinking of adding new operators, the HDF5 link operations may give you some hints. They basically express the relations among different groups, and between groups and datasets.

Since group is a pretty important concept in HDF5, it should be supported gracefully.

[ky]

Todo: Define an abstract data type for Group. [jhrg]

James will define an abstract data type for Group in DAP4. In DAP2, Group will be mapped to Structure with a special attribute. [ky 2007-3-8]

Reference/Link: Map to URL?

This needs to be thought through carefully. Again, if it uses a URL, it must have special tags to distinguish the reference/link from a general URL.

[ky]

Todo: Define an abstract data type for Link. [jhrg] I think you mean both reference and link. Reference has a high priority. [ky] James will define abstract data types for object reference and region reference in DAP4. In DAP2, object references and region references won't be mapped. [ky 2007-3-8]

Reference to DAP

HDF5 references include object references and region references. To represent an object reference, a permanent HDF5 object ID needs to be stored in DAP. To represent a region reference, a permanent HDF5 object ID as well as the selection shape needs to be stored in DAP. [ky]
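
As a rough illustration of what would have to be carried for each kind of reference, here is a small Python sketch. The class and field names are hypothetical, the "permanent ID" is assumed to be the object's absolute path, and only a single rectangular block selection is modelled (HDF5 region references can also hold point selections). The bracket text borrows the DAP `[start:stop]` index notation purely as one possible encoding.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ObjectReference:
    # Hypothetical permanent ID; here assumed to be the absolute HDF5 path.
    object_id: str


@dataclass(frozen=True)
class RegionReference:
    object_id: str            # same kind of ID as ObjectReference
    start: Tuple[int, ...]    # first index of the selection in each dimension
    count: Tuple[int, ...]    # number of elements selected in each dimension

    def to_text(self):
        """Encode the target and selection shape as one string."""
        ranges = "".join(
            f"[{s}:{s + c - 1}]" for s, c in zip(self.start, self.count)
        )
        return f"{self.object_id}{ranges}"


if __name__ == "__main__":
    print(ObjectReference("/DS2").object_id)
    print(RegionReference("/DS2", (2, 3), (1, 6)).to_text())
```

Whatever encoding is finally chosen, the point is the same as in the note above: the object reference needs only the ID, while the region reference needs the ID plus the shape.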

More about Datatype Mapping

We definitely need to resolve this part.

HDF5 Group must be distinguished from the compound datatype when mapping to DAP. In Pydap, HDF5 Group is mapped to DAP Structure, which leaves no way to map the HDF5 compound datatype to DAP.

Object reference needs to be mapped appropriately to DAP with the suggestions above. [ky]

Several sample HDF5 files that help in understanding the mapping of group, object reference and data region reference can be found under ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/opendap/Samples-for-dap-enhancement

A readme file that describes these files can be found on the parent directory: ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/opendap/

The h5dump header output of h5group.h5, h5_objref.h5 and h5_regref.h5 can be found under:

ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/opendap/Samples-for-dap-enhancement/group/h5group.txt
ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/opendap/Samples-for-dap-enhancement/references/h5_objref.txt
ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/opendap/Samples-for-dap-enhancement/references/h5_regref.txt

There are two important reasons to map HDF5 group to DAP:

1) We want to preserve the attribute information of the HDF5 Group in this way. Otherwise, the attribute information inside a group may not be easily recognized by the client.

2) It will be easy for a future HDF5 client to retrieve the information and rebuild the HDF5 file.

Based on the discussion today, use the example at ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/opendap/Samples-for-dap-enhancement/group/h5group.txt

We propose to map the group hierarchy with the following information:

Dataset {
    Structure {
        Structure {
            Structure {
                string h5_comp;
            } foo2;
            string h5_array_1;     
        } foo_1;
        Structure {
            string h5_array_1;
        } foo1_2;
    } "/";
} h5_group.h5;

Each string stores the absolute path name of the HDF5 dataset (the DAP variable).
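
The proposed mapping can be sketched in a few lines of Python. This is only an illustration: the group hierarchy is written by hand as a nested dict (a dict is a group, `None` marks a dataset), and the function names `group_to_dds`, `file_to_dds` and `dataset_paths` are hypothetical. A real handler would walk the file with the HDF5 library instead.

```python
def group_to_dds(name, members, depth=1):
    """Return DDS lines for one group, rendered as a Structure."""
    pad = "    " * depth
    lines = [pad + "Structure {"]
    for child, value in members.items():
        if isinstance(value, dict):          # subgroup -> nested Structure
            lines += group_to_dds(child, value, depth + 1)
        else:                                # dataset -> string holding its path
            lines.append(pad + "    string " + child + ";")
    label = '"/"' if name == "/" else name   # the root group is named "/"
    lines.append(pad + "} " + label + ";")
    return lines


def file_to_dds(filename, root):
    """Wrap the root group's Structure in a Dataset declaration."""
    return "\n".join(["Dataset {"] + group_to_dds("/", root) + ["} " + filename + ";"])


def dataset_paths(members, prefix=""):
    """Yield the absolute path that each string variable would store."""
    for child, value in members.items():
        path = prefix + "/" + child
        if isinstance(value, dict):
            yield from dataset_paths(value, path)
        else:
            yield path


# Hand-written stand-in for the hierarchy in h5group.txt.
hierarchy = {
    "foo_1": {"foo2": {"h5_comp": None}, "h5_array_1": None},
    "foo1_2": {"h5_array_1": None},
}

if __name__ == "__main__":
    print(file_to_dds("h5_group.h5", hierarchy))
    for path in dataset_paths(hierarchy):
        print(path)
```

Note that an empty dict passed as `members` produces a Structure with no members, which is exactly the empty-group case whose legality in DAP is in question.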

In this way, the attribute information of an HDF5 group can be preserved nicely, and the ambiguity between an HDF5 compound datatype dataset and an HDF5 group can be avoided, since no HDF5 dataset can have the name "/". However, we do need to document this carefully in the new HDF5 to DAP mapping. Please review this, James. One thing we would like to ask is:

Is an empty structure legal? If it is not legal in DAP, then we have a big problem! It is perfectly fine to have an empty group inside HDF5.

No one ever asked that, but I don't see why it would be a problem. However, there may be a better way to handle this case. While every variable must have an attribute container there can be other attribute containers which are bound to no particular variable. [jhrg 5/10/07]

Dataset { 
    Structure {
    }foo3;
};

If an empty DAP structure is legal, I think this problem is solved. Otherwise, you may need to think about creating a new object inside DAP to support this.

[ky 2007-5-3]

What's left

Need to obtain typical sample NASA files and opinions from NASA users. [ky] This has been done; see above. [ky]

Implementation language

C++: Build on the existing HDF5 handler

Python: Use PyDAP and PyTables. What types of HDF5 files could this not read? What capabilities would it support that the regular HDF5 library does not? How do we integrate Python-based handlers into the Server4 BES framework?

For the answer to the first question, see the pros and cons of the Python server below. Here's some information about using Python from C++:

PyTables has a good FAQ.

PyDAP provides an interface to the DAP.

But more importantly, the PyDAP folks have already done this! Look at the info for the HDF5 plugin. I think one approach we might take is to look at how much work it will take to modify/extend this code to provide a better HDF5 server and then feed those changes back into PyDAP. This would include:

  1. Interfacing the PyDAP/PyTables HDF5 plugin to the BES
  2. Extending PyTables to read more of HDF5
  3. Improving efficiency in PyDAP

I really think we should at least help build a Python HDF5 handler, since Python is such a promising language.

I've spent some time learning and installing the Pydap HDF5 plugin in the past few days. The following are my thoughts on this issue:

Pros:

Cons:

My suggestions:

The performance of Pydap and the maintenance of Pydap and Pytables are my big concerns. We are taking great risks if we decide to use Python rather than the original C++ handler. However, I also like the Pros. So maybe we can focus on implementing the C++ handler but help Roberto improve the HDF5 plugin. If possible, we could persuade the Pytables people, or encourage other Python fans, to provide a little funding to improve Pytables. I will spend the next few days reading Roberto's HDF5 plug-in and possibly provide help. James, feel free to persuade me to change my mind.

[ky]

I really like the PyDAP/PyTables solution, but based on our requirements, I think we should stick with the original plan to extend the HDF5 Handler written in C++. Here are my reasons:

  1. We know that Python is a language targeted at rapid development but which is not as fast as C/C++ when dealing with large amounts of array data. We also know that's exactly the type of data the handler will have to read/process. (Handler performance with large amounts of data)
  2. There are unknown issues in interfacing our C++ server framework (aka Server4, aka the BES) to python and moving DAP objects across that interface. So there's significant risk and significant development costs for OPeNDAP (since OPeNDAP is the right group to do that work). However, OPeNDAP has dropped its level of participation and I think it's best to save as much of that money as we can for modifications to the DAP and to provide general support to THG (see plan for some potential risks the project will need to address). (Unworkable cost distribution).

I'm still really excited about Python support for Server4 because of the ease with which Dan was able to write a data handler for a one-off format, but the above leads me to the conclusion that even though it's a great idea, it's not right for this project.

[jhrg]

I totally agree with you. The unknown performance and interfacing issues make it too risky to go with the Python approach. I think I will send the suggestions to Roberto. If he wants to improve the Pydap HDF5 handler, it will be good for NASA Python users. [ky]

Client support

Sample Data

Data files: We're initially thinking of making the handler work with AURA and NPOESS data files. Is this a reasonable place to start? If so, we need sample files. If we're going after other groups of files, then let's get samples of those too.

I have several Aura sample files and one NPOESS sample file. Kent will also contact potential NASA users for typical sample files. [ky]

Kent contacted several NASA people and received one reply, from Bruce Vollmer at NASA GSFC GES DISC. See the attachment.

The sample HDF5 files through OPeNDAP can be found at ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/opendap/. Please read the readme file for more information.

Performance testing

Kent will contact several potential NASA user groups to ask their opinions. [ky]

Testsuite development

The "how" part of the testsuite can follow the same methodology HDF4 used. With 'make check', the outputs from the h5 handler are compared against pre-written expected outputs. Another verification method is to use visualization tools (e.g. ferret) that can act as OPeNDAP clients. This lets users who are familiar with h5 data visually check the correctness of the output quickly.
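
The baseline comparison that 'make check' would perform amounts to a text diff between the handler's output and a stored expected file. Below is a minimal Python sketch of that idea using the standard `difflib` module. The function name `compare_with_baseline` and the sample DDS strings are assumptions for illustration, not part of the existing HDF4 testsuite.

```python
import difflib


def compare_with_baseline(actual, expected):
    """Return unified-diff lines; an empty list means the test passes."""
    diff = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile="expected",
        tofile="actual",
        lineterm="",
    )
    return list(diff)


if __name__ == "__main__":
    expected = "Dataset {\n    Int32 x;\n} sample.h5;"
    actual = "Dataset {\n    Int16 x;\n} sample.h5;"
    for line in compare_with_baseline(actual, expected):
        print(line)
```

Returning the diff rather than a bare pass/fail makes it easier to see at a glance which declaration changed when a test breaks.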

The "what" part of the testsuite is still under investigation. A good starting point is to reuse the hdf4 testsuites by converting them into h5 format. With this approach, the first question is how extensive and valid the hdf4 testsuite is. The second issue is that the expected output files under the hdf-testsuites directory cannot be used as is, because of the h4toh5 conversion program. I ran the conversion program on the hdf4 testsuite files and compared the results of the hdf4 and hdf5 handlers; they are quite different.

Here is the URL for quick comparisons: http://hdfdap.hdfgroup.uiuc.edu:8080/

Aura EOS5 data support

Aura data support DAP grid data, so we MUST FIND a WAY to map Aura HDF-EOS5 (Grid, Swath, Point) data to DAP (Grid, Swath, Point) data using only HDF5 APIs. The problem can be described as follows:

HDF-EOS5 data → use the HDF5 library to retrieve all information → map to DAP correctly.

For example: (HDF-EOS5 grid) → (HDF5 without geolocation APIs) → DAP Grid, correctly.

All HDF-EOS5 grid geolocation information is put inside an internal HDF5 group called structMetadata. One unclear point is how different projections can be accepted by DAP.

Step to tackle this problem: FOR JAMES, PLEASE HELP: How does the HDF4 handler work with HDF-EOS2 data, since swath, grid and point are not new concepts? Please give us some hints on this; specifically, which part of the code and which related documents we should read. [ky 2/22/07]
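
As a starting point for the "HDF5 APIs only" route, the sketch below assumes the structMetadata text has already been read into a Python string with the HDF5 library and that it consists of KEY=VALUE lines. The sample text, its key names and the function name `parse_key_values` are illustrative assumptions, not taken from a real Aura file. The parse is deliberately flat, so a file with several grids would need the GROUP nesting to be tracked as well.

```python
import re

# Illustrative stand-in for the metadata text; not copied from a real file.
SAMPLE_METADATA = """
GROUP=GridStructure
    GROUP=GRID_1
        GridName="ColumnAmountO3"
        XDim=1440
        YDim=720
        Projection=GCTP_GEO
    END_GROUP=GRID_1
END_GROUP=GridStructure
END
"""


def parse_key_values(text):
    """Collect KEY=VALUE pairs, skipping the GROUP/END_GROUP markers."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+)\s*=\s*(.+?)\s*$", line)
        if match and match.group(1) not in ("GROUP", "END_GROUP"):
            fields[match.group(1)] = match.group(2).strip('"')
    return fields


if __name__ == "__main__":
    grid = parse_key_values(SAMPLE_METADATA)
    print(grid["GridName"], grid["XDim"], grid["YDim"], grid["Projection"])
```

The dimension sizes and projection name recovered this way are the pieces a handler would need before it could build the map vectors of a DAP Grid; how non-geographic projections should be represented is still the open question noted above.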


-- Main.muqun - 08 Mar 2007

-- JoeLee - 08 Feb 2007

-- JamesGallagher - 17 Nov 2006

-- JamesGallagher - 21 Dec 2006

| Attachment | Size | Date | Who | Comment |
| bruce-vollmer-email | 2.0 K | 12 Feb 2007 - 17:24 | Main.muqun | email from NASA GSFC DAAC that are interested in HDF5-OPeNDAP |