Feature #290

pshistogram should also permit not to store all the input data in memory

Added by Alessandro over 4 years ago. Updated about 4 years ago.

Status: In Progress
Start date: 2013-05-24
Priority: Normal
Due date:
Assignee: -
% Done: 50%
Category: -
Target version: Candidate for next minor release
Platform:

Description

Currently, pshistogram stores all the input data (from stdin or a file) in memory. However, when submitting large datasets (several GBytes), memory is exhausted on standard desktop computers. In principle, a histogram can be built without holding all the input data in memory.

A suggestion would be to add a command-line argument so that the program bins the input data directly as it is read line by line (see the sketch after this list). However, several issues would arise in this "mode":
- no more automatic x range and tick marks (because x_min and x_max are only known after the last line of input data has been read)
- no more L2, L1, LMS statistics (the input data cannot be sorted to extract the median, etc., unless a streaming algorithm exists?)
- messier code, because the body of the "for" loop in the function "fill_boxes" needs to be moved into the loop that reads the input data line by line
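To illustrate the line-by-line idea, here is a minimal sketch, assuming the user supplies the range and bin width up front. It is not the actual pshistogram code; the program and variable names (binstream, x_min, width, n_bins) are made up for this example.

    /*
     * Sketch of line-by-line binning: only the bin counters are kept in memory.
     * Assumes the user supplies x_min, x_max and the bin width up front, so no
     * automatic range is possible.  All names here are illustrative.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main (int argc, char **argv) {
        double x_min, x_max, width, x;
        char line[BUFSIZ];

        if (argc != 4) {
            fprintf (stderr, "usage: binstream x_min x_max bin_width < data.txt\n");
            return EXIT_FAILURE;
        }
        x_min = atof (argv[1]);  x_max = atof (argv[2]);  width = atof (argv[3]);
        int n_bins = (int)((x_max - x_min) / width) + 1;
        unsigned long *count = calloc ((size_t)n_bins, sizeof (unsigned long));
        if (count == NULL) return EXIT_FAILURE;

        while (fgets (line, BUFSIZ, stdin)) {              /* One record at a time */
            if (sscanf (line, "%lf", &x) != 1) continue;   /* Skip non-numeric lines */
            if (x < x_min || x > x_max) continue;          /* Outside the given range */
            int bin = (int)((x - x_min) / width);
            if (bin >= n_bins) bin = n_bins - 1;           /* Put x == x_max in the last bin */
            count[bin]++;                                  /* Nothing else is stored */
        }
        for (int bin = 0; bin < n_bins; bin++)             /* Emit bin center and count */
            printf ("%g\t%lu\n", x_min + (bin + 0.5) * width, count[bin]);

        free (count);
        return EXIT_SUCCESS;
    }

Everything except the fixed array of counters is discarded as soon as a line has been binned, which is why the automatic range and the order statistics listed above are lost in this mode.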

The attached file is a modified "pshistogram.c" that supports this new feature.

pshistogram.c - Suggested modifications in pshistogram.c (25.2 KB) Alessandro, 2013-05-24 07:03

History

#1 Updated by Paul over 4 years ago

  • Status changed from New to Feedback
  • Target version changed from 4.5.11 to 5.1.0

Hi Alessandro-

FYI, GMT4 is frozen so nothing will happen there, and your code changes are not directly portable to GMT 5. Hence, we will consider your request for GMT5 only.

#2 Updated by Paul over 4 years ago

  • Status changed from Feedback to In Progress

Alessandro, could you tell me what datasets you have in mind that are several Gb, and be more specific about what "several" means (2-3, 400, ?).

Your question and suggestion go to the core of the GMT philosophy: We are aware that at any given time someone's computer will not have enough RAM for a task. Rather than spend our precious time working around this problem with more complicated algorithms, space-saving tricks, or dual modes (either read everything into memory or offer a line-by-line option with limited functionality), we have for over 25 years simply ignored these problems, and within a year or two they have been solved by Moore's law as RAM has gotten cheaper and larger.

I am sure there are exceptions, when someone's data set is simply way too large for a GMT task, but in 99.9999% of situations this is never a problem. While a typical workstation might have a few Gb of RAM, more serious workstations might have tens of Gb, and tomorrow's workstations might push towards hundreds of Gb of RAM. One way to put this is that if you need to analyze several Gb of data, then get a machine with sufficient RAM. And if, at the end of the day, you still have 1 Tb of data and wish to make a histogram, then you just have to write a short program to do the binning and plot the result with psxy -Sb.
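As a rough illustration of that last suggestion: using the hypothetical binstream helper sketched under the description (which writes one "bin-center count" pair per line), the external binning plus psxy -Sb step might look roughly like the following. All ranges, sizes and intervals here are made up, and the exact -Sb syntax should be checked against the psxy documentation for your GMT version.

    # Bin a huge text file outside of GMT, then plot the pre-binned counts as bars.
    # binstream is the hypothetical line-by-line binning program sketched above.
    binstream 0 100 2 < huge_dataset.txt > bins.txt
    psxy bins.txt -R0/100/0/1000000 -JX15c/10c -Sb2u -Ggray -B20/200000 > histogram.ps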

Cheers, Paul

#3 Updated by Alessandro over 4 years ago

Hi,

I have to admit that this is the first time I have hit the memory limit of my desktop computer (4 GB of RAM) with a GMT task. Basically, I wanted to plot histograms of the same quantity associated with two different parameters from the processing of one week of hourly visible Meteosat images (3712 x 3712 pixels). I extracted the data into a temporary text file of about 25 GB (6 GB compressed). I know that is lazy, but flat text files are so convenient: I never know beforehand whether the results will be pertinent, and I do not want to spend a lot of time on them if they are not.

I am aware that I could simply buy more RAM or wait and rely on Moore's law, but eventually any limit will be reached (it always comes down to limited funding...), because we keep wanting to consider bigger datasets and new imagers with much higher spatial and temporal resolutions keep being launched.

Regarding the core of the GMT philosophy, I agree that not every utility in the suite should be able to process datasets of any size, because that would increase the complexity of the algorithms. Indeed, I would not even dare to perform a least-squares fit on the whole dataset. But I was expecting that building a histogram would not require holding the whole dataset in memory (the basic binning algorithm does not require it), just as I expect that "minmax" and the gridding utilities ("xy(z)2grd") work line by line (I have not checked the code yet). From my point of view, "pshistogram" only keeps everything in memory to allow automatic guessing of the boundaries (whereas almost all other utilities rely on the user for that task, which is, I guess, the GMT philosophy) and to compute some statistics (maybe these are important and cannot be dropped). But I may be wrong...

Finally, I must say that I am a great fan of GMT (almost all the figures for my PhD and publications were made with it). I suggested this new feature because I just want to enlarge the scope of GMT's usability (not totally sure of my English ;o) ) to very large datasets. Again, I am not expecting that all utilities should work with datasets of any size, only core programs such as pshistogram, minmax and xy(z)2grd that already allow a first look at the data for further investigation.

Regards,

Alessandro.

#4 Updated by Florian about 4 years ago

  • Target version changed from 5.1.0 to Candidate for next minor release
