A better way to pass text data via the API

Added by Paul about 1 year ago

The problem

Users of the API from external environments (MATLAB, Python, Julia) need to pass data to modules such as pstext. The pstext input is complicated but can be thought of as records of (x, y, string). Internally, pstext expects entire text records and parses them for the coordinates and strings (and optionally other attributes). This worked fine for command-line, file-based input. At the moment, the only way to pass such data via the API is as a character matrix in a GMT_MATRIX container. This means fixed-length records (since a matrix is rectangular) and requires the developer to first convert coordinates and strings into text records stored in such a matrix. In the MATLAB toolbox, we do much of this in a C/mex wrapper (gmtmex_parser.c) that directly creates a GMT_TEXTSET container, but we still expect the user to have given us strings with leading coordinates. This is neither pretty nor elegant. At a minimum ("First Stage"), we need to provide a mechanism that allows API users to pass sets of coordinates (as a matrix or a set of column vectors) and a string array without having to reformat their data. In the long run ("Second Stage"), we may need to reconsider the GMT_DATASET vs GMT_TEXTSET containers.

1. First Stage

Most external API users maintain their own data organization and are not using the standard GMT containers. Instead, the two custom containers GMT_MATRIX and GMT_VECTOR are best suited for passing their data in and out of GMT modules. I propose that both of these containers add a new variable

char **text;
which can be used to hold a user-supplied array of strings. Because this pointer is simply added to the structure there are no backwards compatibility issues. All we need to do is to provide a new API function that allows this pointer to be set and obtained (similar to how we set and get the numerical arrays). On input, users will pass the address of their array of strings. On output, GMT will provide a pointer to a GMT-allocated array of strings. These are deallocated when the session dies, like other GMT container data. When a GMT_MATRIX or GMT_VECTOR is passed to GMT and the text pointer is NULL, we continue to parse it as a GMT_DATASET. However, if text is not NULL, we instead route through the parsing for GMT_TEXTSET and do the conversion from coordinates and strings to internal text records. This will invariably entail wasted CPU cycles converting coordinates to text and then back again to coordinates, but most modules expecting text (such as pstext) do not typically deal with very large files, so the double conversion should not be noticeable. I note that MATLAB, when using importdata on a mixed data/text file, returns a structure that contains a data matrix and a cell array with text, which is similar to what I envision.
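To make the routing rule and the coordinate-to-text conversion concrete, here is a minimal sketch. Everything in it is an assumption for illustration: DEMO_VECTOR is a simplified stand-in and not the real GMT_VECTOR layout, and demo_select_route and demo_make_record are hypothetical helpers, not existing API functions.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for GMT_VECTOR; NOT the real GMT structure. */
struct DEMO_VECTOR {
    size_t n_rows;     /* Number of records */
    size_t n_columns;  /* Number of numerical columns */
    double **data;     /* data[col][row], one array per column */
    char **text;       /* Proposed addition: one string per row, or NULL */
};

enum demo_route { DEMO_AS_DATASET, DEMO_AS_TEXTSET };

/* Routing rule from the proposal: a NULL text pointer means purely
 * numerical data; otherwise take the text-record path. */
enum demo_route demo_select_route(const struct DEMO_VECTOR *V) {
    return (V->text == NULL) ? DEMO_AS_DATASET : DEMO_AS_TEXTSET;
}

/* Build the text record for one row: all coordinates, then the string.
 * Returns the record length, or -1 if there is no text array, the row
 * is out of range, or the buffer is too small. */
int demo_make_record(const struct DEMO_VECTOR *V, size_t row,
                     char *buf, size_t size) {
    size_t col, used = 0;
    int n;
    if (V->text == NULL || row >= V->n_rows) return -1;
    for (col = 0; col < V->n_columns; col++) {
        n = snprintf(buf + used, size - used, "%.16g ", V->data[col][row]);
        if (n < 0 || (size_t)n >= size - used) return -1;
        used += (size_t)n;
    }
    n = snprintf(buf + used, size - used, "%s", V->text[row]);
    if (n < 0 || (size_t)n >= size - used) return -1;
    return (int)(used + (size_t)n);
}
```

With columns x = {1.5, 2.0}, y = {-3.0, 4.25} and text = {"Hello", "World"}, the second row would become the record "2 4.25 World", which a module like pstext could then parse as before. This is exactly the double conversion mentioned above.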

2. Second Stage

Some GMT modules expect pure numerical data while others expect a mix. The genesis of these modules dictated how they were translated from stand-alone programs to modules (almost 7 years ago). An early API decision was to support two GMT containers for table data: GMT_DATASET (just numerical) and GMT_TEXTSET (arrays of strings) that matched the expectation of the modules. However, as we develop more external connections the downsides to this decision are mounting:

  1. The API user needs to know which modules expect what type of table and pass the right table container.
  2. Depending on options, some modules are chameleons and change from expecting numerical data to text strings.
  3. Some modules behave the same way but for output, meaning the user has to pass the correct table knowing what the module produces with given options.
  4. We have GMT_Encode_Options in the API to help deal with this mess, i.e., tell calling applications what a module expects. This is as close to black magic as it gets. Even I (who wrote it) cannot really explain what it does...

I believe the best solution to this dilemma is to introduce a new container that will replace the GMT_DATASET and GMT_TEXTSET containers. I am calling this container GMT_RECORD (for now). It will basically duplicate the GMT_DATASET structure but also have a pointer to a parallel GMT_TEXTSET "light" structure. The light structure will not replicate all the counters and dimensions kept in the GMT_DATASET but only offer the table and segment hierarchy with the text arrays. For numerical data, the top-level textset pointer is NULL so we only waste 8 bytes. However, when there is text as part of the data records, it is stored as text strings in the light GMT_TEXTSET structure. So coordinates and text are in separate sub-structures of GMT_RECORD but share the same book-keeping (i.e., same table, segment, row), making it easy to use them together.
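One possible shape for such a container is sketched below. This is only an illustration of the shared book-keeping idea: all DEMO_ names, fields, and the demo_row_text helper are hypothetical, and the real GMT_DATASET carries many more counters and dimensions than shown here.

```c
#include <stddef.h>

/* Hypothetical "light" text hierarchy: no counters of its own. */
struct DEMO_TEXTSEGMENT   { char **text; };                     /* text[row] */
struct DEMO_TEXTTABLE     { struct DEMO_TEXTSEGMENT **segment; };
struct DEMO_TEXTSET_LIGHT { struct DEMO_TEXTTABLE **table; };

/* Hypothetical, heavily simplified numerical hierarchy. */
struct DEMO_DATASEGMENT {
    size_t n_rows, n_columns;
    double **data;                      /* data[col][row] */
};
struct DEMO_DATATABLE {
    size_t n_segments;
    struct DEMO_DATASEGMENT **segment;
};

/* Stand-in for the proposed GMT_RECORD container. */
struct DEMO_RECORDSET {
    size_t n_tables;
    struct DEMO_DATATABLE **table;
    struct DEMO_TEXTSET_LIGHT *textset; /* NULL for purely numerical data */
};

/* Look up the text for (table, segment, row), using the numerical
 * hierarchy's counters for bounds checking since the light structure
 * has none. Returns NULL when there is no text or an index is out of
 * range. */
const char *demo_row_text(const struct DEMO_RECORDSET *R,
                          size_t tbl, size_t seg, size_t row) {
    if (R->textset == NULL || tbl >= R->n_tables) return NULL;
    if (seg >= R->table[tbl]->n_segments) return NULL;
    if (row >= R->table[tbl]->segment[seg]->n_rows) return NULL;
    return R->textset->table[tbl]->segment[seg]->text[row];
}
```

The point of the design is that the same (table, segment, row) indices address both the coordinates and the text, so a module never has to reconcile two independently dimensioned containers.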

The benefits of switching from the GMT_DATASET and GMT_TEXTSET to a single more flexible GMT_RECORD are:

  1. All modules that deal with table data will expect GMT_RECORD as their container - no more guessing.
  2. Simplifies the API usage for developers depending on the API.

However, this is not a simple undertaking, hence this is presently just a proposal to consider. If the UNAVCO funding for GMT comes through then this will move to the front burner. An overview of changes that would be required:

  1. Every table-reading module will need revisions since the fundamental current record model changes from either a row-array of doubles or a string to both (with the string often being NULL).
  2. There will be much work to do in gmt_io.c to generalize the import and export of the new GMT_RECORD entities. We have all the bits and pieces but there is still quite a bit of work to do.
  3. Modules doing record-by-record i/o via GMT_Get_Record and GMT_Put_Record will obtain and pass a structure pointer (instead of array or string) that holds both an array and a string (often NULL).
  4. gmt_api.c will also need a fair amount of rewriting to support i/o for the new GMT_RECORD container.
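To illustrate the record model in items 1 and 3, here is a rough sketch of a per-record structure holding both an array and a string. DEMO_RECORD and demo_parse_record are hypothetical names used only for this example; they are not part of the current API, and the real reader would handle many more input variations.

```c
#include <stdlib.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical per-record structure: numbers plus optional text. */
struct DEMO_RECORD {
    double *data;      /* Leading numerical columns */
    size_t n_columns;  /* How many numbers were found */
    char *text;        /* Trailing text, or NULL if there is none */
};

/* Parse up to max_cols leading numbers from line into out, then point
 * rec->text at whatever non-blank text remains (NULL if nothing).
 * The text pointer refers into line, so line must outlive rec.
 * Returns the number of columns parsed. */
size_t demo_parse_record(char *line, double *out, size_t max_cols,
                         struct DEMO_RECORD *rec) {
    char *p = line, *end;
    size_t n = 0;
    while (n < max_cols) {
        double v = strtod(p, &end);
        if (end == p) break;  /* Not a number: text starts here */
        /* A number must be followed by whitespace or end of line */
        if (*end != '\0' && !isspace((unsigned char)*end)) break;
        out[n++] = v;
        p = end;
    }
    while (isspace((unsigned char)*p)) p++;  /* Skip to start of text */
    rec->data = out;
    rec->n_columns = n;
    rec->text = (*p != '\0') ? p : NULL;
    return n;
}
```

A pstext-like line such as "1.5 -3 Hello World" read with max_cols = 2 would yield two coordinates and the text "Hello World", while a purely numerical line such as "10 20" would yield two coordinates and a NULL text pointer.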

I believe the benefits clearly outweigh the work involved, but we will go slowly on this for now, hence the First Stage first.