What would be the main definitions of perfect data for Earth Observation? For example, it should have perfect interoperability: we can take data coming from different satellites and still use it without additional preparation just to harmonize and homogenize it somehow. Or should it be fast? Should it be smaller? Should it be better? Should it have fewer artefacts? What would be the three main characteristics that would drastically change the situation?
If you allow me, I would use another metaphor. There has been a debate about whether algorithms such as super-resolution add information to the data. You have an algorithm, you have input data, and then some magic happens. How is it possible that the output contains more information than the input? This has been a big dilemma.
In my view, it's very simple. You go to the train station, and you see a big clock. And you see it's 11am. What does it tell you? It tells you very different things depending on whether you know the schedule or not. If you have the timetable for the trains, it tells you a lot, right? If you know that at 11:10am there is a train, or even if you know the train was at 10:55am and you missed it, right? This is a big deal. If you don't know the timetable, then it only tells you the time of day, not that much.
So this is exactly the same; it's more than a metaphor, it's an exact situation. It's a situation where you see a data point, and depending on your prior knowledge, you can extract a different amount of information from the same data point.
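The clock example can be made quantitative with a little information theory. The following is an illustrative sketch, not from the interview: the numbers and the two prior distributions are invented, but they show how the same observation (the clock reading 11am) leaves very different amounts of uncertainty about the next train depending on the prior you hold.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Without a timetable: after seeing the clock, the next departure is
# equally likely in any of 12 five-minute slots over the next hour.
no_prior = [1 / 12] * 12

# With a timetable: seeing 11:00, the 11:10 train is almost certain,
# with a little probability mass left for delays (made-up numbers).
with_prior = [0.9, 0.08, 0.02]

print(entropy(no_prior))    # log2(12) ≈ 3.58 bits of uncertainty remain
print(entropy(with_prior))  # ≈ 0.54 bits: the prior did most of the work
```

The observation is identical in both cases; the difference in remaining uncertainty is entirely due to the prior, which is the speaker's point about what training data contributes.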
So now the question is: would it help you if you had a timetable from France and you're in Switzerland? Not that much, right? This is an exact situation, not a metaphor; it is exactly the situation with any machine learning.
If your prior doesn't correspond to the data point that you observe, then your prior is useless. Your machine learning is useless if the statistics of your model, of your training data, do not match the data point. The only difference between the timetable and machine learning is that here you would have to actually spend several days standing on the platform recording the arrivals of the trains. This is the training process.
However, if you did it in Paris, and then you are standing in a train station in Zurich, it's not very helpful. This is exactly the situation with Earth Observation data, or any data in machine learning for that matter.
So having the statistical distribution exactly correspond to the data that you are about to observe is absolutely key to the performance of any machine learning algorithm. You can absolutely add information, you can make magic with these machine learning algorithms, if your training data perfectly matches your observed scenario.
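This distribution-shift effect is easy to demonstrate numerically. The sketch below is a hypothetical stand-in (the data generator, slopes, and noise level are all invented for illustration): a model fitted on "Paris" statistics performs near the noise floor on matched data, degrades somewhat on moderately shifted statistics, and fails badly on strongly shifted ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(slope, n=1000):
    """Toy task: predict a target from one feature. The true slope plays
    the role of region-specific statistics (Paris vs Zurich)."""
    x = rng.normal(size=n)
    y = slope * x + rng.normal(scale=0.1, size=n)
    return x, y

# Train on one region's statistics ("Paris"): ordinary least squares.
x_train, y_train = make_data(slope=2.0)
w = (x_train @ y_train) / (x_train @ x_train)

def mse(slope_true):
    """Mean squared error of the fitted model on data with new statistics."""
    x, y = make_data(slope_true)
    return float(np.mean((w * x - y) ** 2))

print(mse(2.0))  # matched statistics: error near the noise floor (~0.01)
print(mse(1.5))  # somewhat shifted (last year's timetable): degraded (~0.26)
print(mse(0.5))  # strongly shifted (wrong country): far worse (~2.3)
```

The middle case mirrors the outdated-timetable point that follows: stale statistics still carry some usable information, but nowhere near as much as a perfectly matched prior.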
Now, particularly in Earth Observation, there is another challenge: what if you have a timetable for Zurich, but it's from last year? Then maybe it can still help, because you know that in the morning the trains arrive once an hour, it's somewhat similar, and you can derive some information.

But it's still nowhere near as good as having an up-to-date timetable. This is exactly the situation in Earth Observation when you trained in the right region, but on data collected several years ago. So it's not an exact answer to your question, but I think this is one thing that is absolutely key.
Your training data has to be perfectly matched to your observed data, which is sometimes an enormous challenge. Particularly in Earth Observation, because in most cases, if you already have data for what you observe, you don't need to observe it. It is a real challenge to find training data that perfectly fits your observed data. But if you have it, you can do real magic.