Luminous Landscape Forum
Raw & Post Processing, Printing => Colour Management => Topic started by: Guillermo Luijk on May 07, 2019, 05:53:16 pm

In order to practice with NNs I have thought of doing a camera calibration exercise, to find out if a simple NN can perform similarly to or even better than classical ICC profiles based on LUTs. I don't have much idea about how ICC profiles and calibration in general work, so here are some questions:
1. In a regular calibrated pipeline, I guess white balance is applied BEFORE the ICC profiling conversions take place. Is that correct?
2. To measure the performance of a calibration workflow, a standard procedure is to measure L*a*b deviations (deltaE) over different patches using some colour card, correct?
3. Let's assume we use an IT8 card, since it's the card with the most patches (which is good for training the NN): are the L*a*b values with which we need to compare our calibrated output standard (synthetic) expected values, or Lab values measured (with a spectrophotometer) over the specific IT8 card used?
4. To do a proper WB before training the network, I guess we just need to use the gray patches on the card, correct? Later on, the white-balanced RAW RGB values will be used to train the network.
This is the structure of the NN I plan to implement. I will try different numbers of nodes, since overfitting could lead to very precise colour reproduction for the patches but undesired behaviour for colours unseen by the NN. From an ML point of view we should preserve some patches of the card as a validation set, but on the other hand even an IT8 has very few samples to properly train a NN, so I'm not sure what's the best approach here to prevent overfitting.
(http://guillermoluijk.com/misc/nnprofiling.png)
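For reference, the kind of MLP sketched above could be prototyped in a few lines with scikit-learn (an assumption on my side; the patch values below are random placeholders, not real IT8 measurements):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder data standing in for measured chart patches:
# white-balanced RAW RGB inputs and reference Lab outputs.
rng = np.random.default_rng(0)
rgb = rng.random((288, 3))          # e.g. 288 IT8 patches (hypothetical)
lab = rng.random((288, 3))          # reference values (hypothetical)

# A deliberately small MLP, to keep the interpolation smooth
# and limit overfitting on so few samples.
nn = MLPRegressor(hidden_layer_sizes=(16,), activation='relu',
                  solver='lbfgs', max_iter=2000, random_state=0)
nn.fit(rgb, lab)
pred = nn.predict(rgb)
print(pred.shape)                   # (288, 3)
```

The number of hidden nodes is exactly the hyperparameter under discussion; nothing in this sketch is tuned.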
Regards

In order to practice with NNs I have thought of doing a camera calibration exercise, to find out if a simple NN can perform similarly to or even better than classical ICC profiles based on LUTs. I don't have much idea about how ICC profiles and calibration in general work, so here are some questions:
I wouldn't recommend camera profiling as a good choice for such an exercise. The images captured by a camera of a test chart are subject to a lot of interfering factors that take skill and effort to minimize, such as uneven lighting and flare. Commercially available test charts have a limited number of test patches, and a true validation requires an independent test set, so you really have to be prepared to manufacture and spectrally measure your own test charts. A profile is typically never a perfect fit, due to the spectral differences between the camera sensor and the standard observer. These are all reasons why serious camera profiling takes the approach of measuring the sensor spectral sensitivity curves, rather than taking photos of test charts.
You could use printer profiling as a test exercise, since the repeatability is much better, and it's relatively easy to generate different test sets. (There have been a paper or two in the CIC proceedings on using neural nets for this type of thing.)
The basic nature of the task is scattered data interpolation. There are many possible techniques that can be used for this, and it's fair to say that I'm not much of a fan of NN's, although you may get acceptable results out of them with such a low dimensional model as printer profiling, and at least you can comprehensively explore continuity behavior etc., something that's basically impossible at higher dimensions. For regularization, check out "neural network dropout regularization", which seems to be a note of sanity amongst the craziness of sparsely trained, overfitted NN-based modelling systems that are currently all the rage (turtles being recognized as guns is a classic).
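For anyone curious, dropout itself is a very small mechanism. A minimal numpy sketch of (inverted) dropout applied to a batch of hidden-layer activations, purely illustrative:

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Inverted dropout: randomly zero a fraction p_drop of the
    activations during training, and rescale the survivors so the
    expected activation is unchanged."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(42)
h = np.ones((4, 8))                # a hidden-layer activation batch
h_train = dropout(h, 0.25, rng)    # ~25% of units zeroed, rest scaled by 4/3
print(h_train)
```

At inference time the mask is simply not applied; the rescaling during training is what keeps train/test activations on the same scale.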

I will try different numbers of nodes, since overfitting could lead to very precise colour reproduction for the patches but undesired behaviour for colours unseen by the NN. From an ML point of view we should preserve some patches of the card as a validation set, but on the other hand even an IT8 has very few samples to properly train a NN, so I'm not sure what's the best approach here to prevent overfitting.
Just to be sure: "number of nodes" and "overfitting" are not related. Nonuniformly distributed learning samples are a problem. In linear equations we need as many equations as there are unknowns to solve the system. In ML we just need representative equations (samples) from the entire set while learning, and preferably a different set of samples while verifying.
The trick in this case is to not bias or overfit a certain color because that color sample happened to be overrepresented during the learning phase.
Your assessment of WB before color matching is indeed correct, since you generally want to apply a single color response in different WB situations.
WB is interesting in this case, because what NN configuration would you design to solve just the WB multipliers?

If you do go ahead, please keep us updated. I'd be very interested in the results of such an exercise.

I wouldn't recommend camera profiling as a good choice for such an exercise. The images captured by a camera of a test chart are subject to a lot of interfering factors
(...)
The basic nature of the task is scattered data interpolation. There are many possible techniques that can be used for this, and it's fair to say that I'm not much of a fan of NN's, although you may get acceptable results out of them with such a low dimensional model as printer profiling, and at least you can comprehensively explore continuity behavior etc., something that's basically impossible at higher dimensions. For regularization, check out "neural network dropout regularization"
I understand the limitations of camera profiling using colour charts, but the goal here is not profiling any camera for future use; it's just to practice with NN and find out if a NN (defined in a much simpler way than LUTs) can be as good as or better than the classical ICC profiles. I chose a camera because I know a person whose company makes IT8 cards and will provide a high-quality shot, as well as measured Lab values over the chart.
My concern is how the NN will perform at interpolating unseen colours. If it behaves smoothly between seen patches the result should be good, but if it does this:
(http://www.guillermoluijk.com/datosimagensonido/mlpwithmissingdata.png)
colours will be unpredictable. That is why I plan to start by using simple structures (few nodes), and try dropout in the training process if the interpolation starts to display undesired responses.
Regards

Just to be sure: "number of nodes" and "overfitting" are not related. Nonuniformly distributed learning samples are a problem. In linear equations we need as many equations as there are unknowns to solve the system. In ML we just need representative equations (samples) from the entire set while learning, and preferably a different set of samples while verifying.
The trick in this case is to not bias or overfit a certain color because that color sample happened to be overrepresented during the learning phase.
Your assessment of WB before color matching is indeed correct, since you generally want to apply a single color response in different WB situations.
WB is interesting in this case, because what NN configuration would you design to solve just the WB multipliers?
I expect many nodes could more easily lead to overfitting because they provide the NN with more nonlinear mapping capability, while I'm looking for smooth interpolations. But I'll try an increasing number of nodes. The IT8 chart tries to have a good representation of colours evenly distributed in the CIELAB space:
(http://www.imatest.com/wp-content/uploads/2011/11/Multicharts_IT8_ab.jpg)
Of course these 200-300 patches are a minimal set for NN training, but I want to give it a try anyway. My input (RGB) / output (Lab) correspondences will be nearly noise-free thanks to patch averaging; that's an advantage here.
Regarding the WB, I'll just apply a linear scaling to the input RAW RGB values to make a middle-gray patch on the IT8 chart become neutral (R=G=B), just like many RAW developers do. It's what I tried to represent in the previous scheme by connecting the RGB to R'G'B' values with a single scaling. It will not be part of the NN itself, to make the process more flexible and suitable for using the same profiling with an arbitrary WB.
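That gray-patch scaling can be sketched in a few lines (the gray patch values below are hypothetical, and normalising to the green channel is just the common RAW-developer convention):

```python
import numpy as np

# Hypothetical mean RAW RGB of a middle-gray patch (linear values).
gray_raw = np.array([0.182, 0.310, 0.245])

# Per-channel multipliers that make the gray patch neutral (R=G=B),
# normalised so the green multiplier is exactly 1.0.
wb_mult = gray_raw[1] / gray_raw
print(wb_mult)

raw = np.array([[0.10, 0.20, 0.15],
                [0.182, 0.310, 0.245]])
balanced = raw * wb_mult
print(balanced[1])                 # the gray patch is now R=G=B
```
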
If you do go ahead, please keep us updated. I'd be very interested in the results of such an exercise.
Sure! ;)
Regards

I expect many nodes could more easily lead to overfitting because they provide the NN with more nonlinear mapping capability, while I'm looking for smooth interpolations. But I'll try an increasing number of nodes.
Yes, "less is more" in the case of NN.
Of course these 200-300 patches are a minimal set for NN training, but I want to give it a try anyway. My input (RGB) / output (Lab) correspondences will be nearly noise-free thanks to patch averaging; that's an advantage here.
The number of patches is not so much the problem: each run may consist of only 3 patches, as long as those patches are either an (R,G,B) variant or a (C,M,Y) variant. That is unfortunately some prior pattern logic you need to feed it to ensure a solution that converges.
Regarding the WB, I'll just apply a linear scaling to the input RAW RGB values to make a middle-gray patch on the IT8 chart become neutral (R=G=B), just like many RAW developers do. It's what I tried to represent in the previous scheme by connecting the RGB to R'G'B' values with a single scaling. It will not be part of the NN itself, to make the process more flexible and suitable for using the same profiling with an arbitrary WB.
That's fine of course, I merely mentioned it as a perhaps simpler "exercise" to design and learn NN.
As for your output: you might want to consider XYZ as output, or normal conversion from that to Lab. Otherwise you might want to design a separate NN first to convert XYZ to Lab. (Another interesting exercise...).
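For what it's worth, the XYZ to Lab conversion is compact enough to write down directly (a sketch using the standard CIE formulas and the D50 PCS white point; Lindbloom's site tabulates the same constants):

```python
import numpy as np

def xyz_to_lab(xyz, white=(0.96422, 1.0, 0.82521)):   # D50 PCS white
    """CIE XYZ -> L*a*b* using the standard CIE formulas."""
    eps, kappa = 216/24389, 24389/27
    r = np.asarray(xyz, dtype=float) / np.asarray(white)
    f = np.where(r > eps, np.cbrt(r), (kappa * r + 16) / 116)
    L = 116 * f[1] - 16
    a = 500 * (f[0] - f[1])
    b = 200 * (f[1] - f[2])
    return np.array([L, a, b])

print(xyz_to_lab([0.96422, 1.0, 0.82521]))   # white -> [100, 0, 0]
```
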

I merely mentioned it as a perhaps simpler "exercise" to design and learn NN.
As for your output: you might want to consider XYZ as output, or normal conversion from that to Lab. Otherwise you might want to design a separate NN first to convert XYZ to Lab. (Another interesting exercise...).
Actually this is already a second NN exercise. In the first one (IMAGE PROCESSING REVERSE ENGINEERING USING NEURAL NETWORKS (https://www.datosimagensonido.com/2019/04/ingenieriainversadeprocesadode.html)) I brute-force trained a NN to mimic an arbitrary image processing (including nonlinear curves, desaturation and hue rotation), and the result was amazing:
http://guillermoluijk.com/datosimagensonido/fotografia_mlp_64_64.jpg
Here the thing is more complicated, since the IT8 card provides just a few samples, so a huge number of mapping correspondences have to be interpolated. I plan to do some deltaE calculations to compare the performance of the NN vs the ICC profile obtained using the same IT8, and a "standard" best-effort RAW developer such as DCRAW; that's why Lab makes things simpler. What would be the advantage of using XYZ?
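As a sketch of the planned deltaE comparison, here is the simplest variant (CIE 1976, plain Euclidean distance in Lab; DE94/DE2000 would need considerably more code):

```python
import numpy as np

def delta_e76(lab1, lab2):
    """CIE 1976 colour difference: Euclidean distance in L*a*b*."""
    diff = np.asarray(lab1, float) - np.asarray(lab2, float)
    return np.linalg.norm(diff, axis=-1)

ref  = np.array([[50.0, 10.0, -5.0]])   # reference patch Lab
pred = np.array([[51.0, 12.0, -5.0]])   # profiled/NN output Lab
print(delta_e76(ref, pred))             # sqrt(1 + 4 + 0) ≈ 2.236
```
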
Regards

What would be the advantage of using XYZ?
Regards
It might make things simpler or more insightful since it gets closer to known relations, but perhaps you're right and you might as well go directly to Lab since NN logic is based on nonlinear relations anyway.

Actually this is already a second NN exercise. In the first one (IMAGE PROCESSING REVERSE ENGINEERING USING NEURAL NETWORKS (https://www.datosimagensonido.com/2019/04/ingenieriainversadeprocesadode.html)) I brute-force trained a NN to mimic an arbitrary image processing (including nonlinear curves, desaturation and hue rotation), and the result was amazing:
http://guillermoluijk.com/datosimagensonido/fotografia_mlp_64_64.jpg
There's nothing magical about NN's: like all models they have to exist within the constraints of logic and mathematics.
Take ICC profile models for example. In the context of modelling a device's RGB->XYZ behavior, the simplest realistic model would be a gamma curve common to the device values followed by a 3x3 matrix. Such a model has 10 free parameters. To fit ("train") the model to test values gathered from the real world logically requires at least 10 patches. In practice it is not that simple though, since real-world samples have uncertainty, and the fitting function from samples to model parameters may well be ill-conditioned, which means that uncertainty in the test values could result in wildly erroneous model behavior in areas of color space that are not near the test values. So you either have to increase the number of test samples and their coverage to the point where the fit is not ill-conditioned, and/or add regularization constraints that push the poorly constrained parameters in the direction of realism.
At the other end of ICC profile complexity would be using a cLUT. This is a model that is basically unconstrained except for the chosen resolution of the table. For instance, a 33x33x33 cLUT has 107,811 free parameters, but can model any function that has continuity at a scale of less than 3% of the input value. Given that it's generally unrealistic to expect a test set containing on the order of 33,000 test points uniformly distributed through the device space, an approach has to be taken to make it work with a lot fewer test points. Typically a regularization constraint of some form of continuity is applied, such as continuity of value or slope. This is effectively making assumptions about typical device behavior.
So given a realistic number of test points, there is always a tradeoff between how closely an unknown device behavior can be modeled, and how well behaved it is at all the points in the gamut that are not at test points.
In color profiling, many other types of models have been applied that pick some other point of tradeoff between assumption about how a device behaves, and freedom to fit to the actual test patch values.
Exactly the same constraints apply to a NN model. Depending on its nonlinearity and construction, the fitting function could be ill-conditioned. Depending on its size, it may have more free parameters than test points. To really know its final performance, you need to be able to check the model against the ground truth in fine detail throughout the whole of the gamut. And you haven't shown how you intend to obtain the ground truth of your camera behavior to carry out such a performance verification. Splitting your test chart points and using some for training and some for verification will give you an indication of how well your model is working, but having a ground truth is far better. Comparing against other modelling approaches may tell you whether you are in the ball park, but doesn't give you any indication as to whether the differences are in the direction of better or worse compared to the ground truth.
If I were attempting to work on developing models for camera profiling, then the approach I would take would be to first construct a realistic mathematical model of camera behavior based on known devices. This would involve spectral response, channel linearity, sensor noise characteristics, channel cross talk etc. If the intention was to fit the model from test charts, then lens distortions of all sorts would have to be added to the model too. The photographing of any sort of test chart can then be realistically simulated, and the resulting NN model compared in fine detail against the ground truth.

(...)
which means that uncertainty in the test values could result in wildly erroneous model behavior in areas of color space that are not near the test values.
(...)
given a realistic number of test points, there is always a tradeoff between how closely an unknown device behavior can be modeled, and how well behaved it is at all the points in the gamut that are not at test points.
(...)
the same constraints apply to a NN model. (...) Depending on its size, it may have more free parameters than test points. To really know its final performance, you need to be able to check the model against the ground truth (...). Splitting your test chart points and using some for training and some for verification will give you an indication of how well your model is working, but having a ground truth is far better.
Thanks for your helpful insights GWGill. You're totally right: no matter how few patches the IT8 chart has, some of them need to be preserved as a validation set, at least until finding out the complexity of the NN that can be used, as well as the degree of training, to avoid loss of accuracy in unseen colours. This leads me to think about methodological flaws in the usual workflow chart makers sell: shoot the chart, make an ICC profile, and if the patches get small deltaE's you're doing a good profiling. Later on the ICC profile is applied to unseen images (test sets), and colours are assumed to be correct without bearing in mind the strong interpolation taking place.
I could preserve half the patches as a validation set, and even interchange the training and validation sets to compare the results. My concern is to find a NN that smoothly interpolates unseen colours between seen patches, something that can be checked as well.
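Holding out patches is a one-liner if scikit-learn is assumed (the patch arrays here are random placeholders standing in for the chart data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 288 hypothetical IT8 patches: white-balanced RGB in, reference Lab out.
rng = np.random.default_rng(1)
rgb = rng.random((288, 3))
lab = rng.random((288, 3))

# Hold out ~20% of the patches as a validation set, as discussed.
rgb_tr, rgb_val, lab_tr, lab_val = train_test_split(
    rgb, lab, test_size=0.2, random_state=0)
print(len(rgb_tr), len(rgb_val))   # 230 58
```

Swapping the two sets (or doing k-fold cross-validation) only needs a different `random_state` or `KFold` from the same module.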
(http://guillermoluijk.com/misc/it8cielab.png)
I think the ratio between network complexity (no. of layers and nodes) and the number of patches in the training set is a very important hyperparameter to tune. And another one will be the number of iterations. The good thing about these patches is that they will be noise-free, so there are no outliers or wrong samples that could fool the model.
(https://i.stack.imgur.com/V9r3G.jpg)
Regarding the more rigorous ways to proceed, I agree this is not ideal, but I'm precisely trying to compare the NN with the also far-from-ideal procedure supported by so many colour consultants: shoot the IT8, build the ICC profile, apply it to all your images.
Regards

Making some progress this week:
- Hugo Rodríguez (https://www.hugorodriguez.com/blog/) kindly provided me with an optimal RAW file shot of his personally produced IT8 card
- I have decided that the NN won't output Lab values but XYZ values (you were right about this, 32BT! I had no idea XYZ was the best PCS to convert to Lab or to any RGB afterwards)
- Maths come from the Colour Bible (http://www.brucelindbloom.com/) and have been checked against Lindbloom's and EasyRGB's colour calculators, working fine. I don't need Photoshop for any stage of the process: RAW development + WB + profiling + sRGB conversion
This would be the scheme (sorry for the Spanish):
(http://guillermoluijk.com/misc/esquema.jpg)
Colour distributions from Hugo's IT8: ab for colour patches and L for gray patches:
(https://www.hugorodriguez.com/articulos/carta_it8egm/partes_it8.gif)
(http://guillermoluijk.com/misc/abpatches.png)
(http://guillermoluijk.com/misc/lpatches.png)
I'll use the median rather than the mean over the patches to avoid the bias introduced by some scratches.
I'll preserve around 20% of the patches as a validation set. If the NN performs fine, I'll probably use all the patches in a final training round.
Regards

I have decided that the NN won't output Lab values but XYZ values (you were right about this, 32BT! I had no idea XYZ was the best PCS to convert to Lab or to any RGB afterwards)
A camera typically has an additive characteristic (the three channels don't interact much), so of course if inputs and outputs of the model are proportional to light level (as RAW input and XYZ output are), the model tends to look highly linear (closest to a 3x3 matrix).

A camera typically has an additive characteristic (the three channels don't interact much), so of course if inputs and outputs of the model are proportional to light level (as RAW input and XYZ output are), the model tends to look highly linear (closest to a 3x3 matrix).
Begging the question of course: what advantage should one expect from the hidden layers?

Begging the question of course: what advantage should one expect from the hidden layers?
A camera matrix is generally not a perfect fit due to the spectral sensitivity difference to the standard observer. So a better fit is possible using something like a 2 dimensional LUT for the chromatic plane based on weighting the mapping by the typical scene spectral statistics, plus 1D curves to model any sensor luminance nonlinearity. [ As already stated, I'm not a fan of NN and wouldn't use them for something like this. In fact, having wrestled with the pitfalls of attempting to model higher dimensional functions for some time now, I don't think I'd choose to use NN for anything, much less call it "AI". ]

A camera matrix is generally not a perfect fit due to the spectral sensitivity difference to the standard observer. So a better fit is possible using something like a 2 dimensional LUT for the chromatic plane based on weighting the mapping by the typical scene spectral statistics, plus 1D curves to model any sensor luminance nonlinearity. [ As already stated, I'm not a fan of NN and wouldn't use them for something like this. In fact, having wrestled with the pitfalls of attempting to model higher dimensional functions for some time now, I don't think I'd choose to use NN for anything, much less call it "AI". ]
In that respect it would probably be a better experiment to let the NN solve the entire path from camera RGB to perceptual Lab.
An additional experiment which might be far more relevant and interesting: design the smallest NN config to convert XYZ to Lab, and then teach this config on Munsell samples. Then see if it results in an actual hue-constant perceptual space instead of an error-based version. Then see if it can better predict color compression for out-of-gamut color. (NNs and extrapolation not exactly being best friends...)

In that respect it would probably be a better experiment to let the NN solve the entire path from camera RGB to perceptual Lab.
An additional experiment which might be far more relevant and interesting: design the smallest NN config to convert XYZ to Lab, and then teach this config on Munsell samples. Then see if it results in an actual hue-constant perceptual space instead of an error-based version. Then see if it can better predict color compression for out-of-gamut color. (NNs and extrapolation not exactly being best friends...)
Since the XYZ to Lab conversion (or to any RGB output colour space) is a well-known deterministic formula (much like WB is), I don't see the point in making it part of the NN.
I know this is not a typical usage of NNs: first, we have very few samples, and secondly, we are modelling subtle nonlinearities, whereas NNs' stronghold is modelling strongly nonlinear behaviours.
I still want to try it. An advantage here is that the samples can be considered noiseless, and by using a simple NN I think we can still get smooth interpolations between the patches.
A question for GWGill: in LUT-based ICC profiles, who decides how many samples those LUTs are made of? And secondly: how/where is the way to interpolate between the LUT values decided? For example, when opening profiled image data in Photoshop and assigning it a given ICC profile, what kind of interpolation is used: just linear, or something more sophisticated?
Regards

Since the XYZ to Lab conversion (or to any RGB output colour space) is a well-known deterministic formula (much like WB is), I don't see the point in making it part of the NN.
It's an extremely simple, nonlinear deterministic formula, which makes it ideal for NN experiments.
You only need a simple NN, and once it seems to work for the standard Lab conversion, you could reset it and perhaps train it on Munsell samples, at which point it becomes less deterministic, but all the more relevant. You could design a small NN that actually spits out a truly perceptually uniform space with constant hue characteristics.
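32BT's smallest-NN-for-XYZ-to-Lab idea could be prototyped like this (a sketch assuming scikit-learn, with synthetic XYZ samples under the D50 PCS white; the layer sizes and solver are arbitrary choices of mine):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def xyz_to_lab(xyz, white=(0.96422, 1.0, 0.82521)):   # D50 PCS white
    eps, kappa = 216/24389, 24389/27
    r = np.asarray(xyz, float) / np.asarray(white)
    f = np.where(r > eps, np.cbrt(r), (kappa * r + 16) / 116)
    return np.stack([116 * f[:, 1] - 16,
                     500 * (f[:, 0] - f[:, 1]),
                     200 * (f[:, 1] - f[:, 2])], axis=1)

# Synthetic training set: random XYZ triplets and their exact Lab values.
rng = np.random.default_rng(0)
xyz = rng.random((5000, 3))
lab = xyz_to_lab(xyz)

nn = MLPRegressor(hidden_layer_sizes=(32, 32), activation='tanh',
                  solver='lbfgs', max_iter=1000, random_state=0)
nn.fit(xyz, lab / 100.0)                  # scale targets for training
err = np.abs(nn.predict(xyz) * 100.0 - lab).mean()
print(round(err, 2))                      # mean absolute error, Lab units
```

Retraining the same architecture on Munsell samples instead of the analytic formula is then just a matter of swapping the training pairs.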
I know this is not a typical usage of NNs: first, we have very few samples, and secondly, we are modelling subtle nonlinearities, whereas NNs' stronghold is modelling strongly nonlinear behaviours.
I still want to try it. An advantage here is that the samples can be considered noiseless, and by using a simple NN I think we can still get smooth interpolations between the patches.
Absolutely, that is exactly why I also mentioned that going directly to Lab may have an advantage over XYZ because of the nonlinearities. So you might as well train it to do linear camera input to perceptual output (Lab, or even sRGB directly). Note that I am primarily making suggestions for potential experiments. I'm certainly not trying to dismiss this particular experiment in its current form, au contraire.

Note that I am primarily making suggestions for potential experiments.
Understood! And your suggestions are very much appreciated, thanks! Looking at the formulation, I think the conversion from XYZ to Lab is in fact quite trivial for a NN.
Regards

Little progress over the weekend:
- Played with an image library to read/save 16-bit images
- White-balanced RAW extraction (DCRAW)
- Read the chart, cropped, and calculated the median over each patch
(http://guillermoluijk.com/misc/IT8_crop_median.png)
The code here (https://github.com/gluijk/cameraprofilingmlp/blob/master/cameraprofilingmlp.ipynb).
Regards

A question for GWGill: in LUT-based ICC profiles, who decides how many samples those LUTs are made of?
I can't speak for other profiling software, but in ArgyllCMS it is set by the "quality" parameter, or by an explicit override parameter. (I've often considered changing the name of "quality" to "speed" or "slowness", since people often misinterpret the tradeoffs being made.)
And secondly: how/where is the way to interpolate between the LUT values decided? For example, when opening profiled image data in Photoshop and assigning it a given ICC profile, what kind of interpolation is used: just linear, or something more sophisticated?
By convention it is linear interpolation, although the implementation has discretion on exactly how this is done. Typical choices are multilinear or simplex interpolation. The latter is faster, and will have better accuracy when the colorspace has the neutral axis along the diagonal (i.e. device spaces). The former is often better when the output space has the neutral parallel to an axis (i.e. L*a*b* output space). I guess if an implementation wanted to use higher order interpolation it could, but speed would suffer greatly, and memory consumption may get high. (Although it's not usual, consider that the interpolation could be in up to 15-dimensional space. The higher order terms will be very numerous.)
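A minimal numpy sketch of the multilinear (trilinear) case, verified on an identity cLUT where interpolation must return the input unchanged (my own illustration, not ArgyllCMS code):

```python
import numpy as np

def trilinear_lookup(lut, rgb):
    """Multilinear (trilinear) interpolation in a cLUT of shape
    (N, N, N, 3), with device input rgb in [0, 1]^3."""
    n = lut.shape[0]
    pos = np.asarray(rgb, float) * (n - 1)
    i0 = np.minimum(pos.astype(int), n - 2)   # lower grid corner
    f = pos - i0                              # fractional position
    out = np.zeros(3)
    for dr in (0, 1):                         # accumulate the 8 corners
        for dg in (0, 1):
            for db in (0, 1):
                w = ((f[0] if dr else 1 - f[0]) *
                     (f[1] if dg else 1 - f[1]) *
                     (f[2] if db else 1 - f[2]))
                out += w * lut[i0[0] + dr, i0[1] + dg, i0[2] + db]
    return out

# Identity cLUT: each grid node stores its own coordinates, so
# interpolation should return the input unchanged.
n = 9
g = np.linspace(0, 1, n)
lut = np.stack(np.meshgrid(g, g, g, indexing='ij'), axis=-1)
print(trilinear_lookup(lut, [0.3, 0.55, 0.8]))   # ≈ [0.3, 0.55, 0.8]
```

Simplex interpolation would instead split each cube into tetrahedra and blend only 4 corners, which is why it is faster.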

In that respect it would probably be a better experiment to let the NN solve the entire path from camera RGB to perceptual Lab.
Reading through the thread, this was also my first thought. Possibly stopping at XYZ because of the rudimentary way that Lab deals with adaptation. Although Lab would be useful for a perceptual DE2000-like cost function.
Thinking out loud: get the spectral reflectances of each patch and generate corresponding XYZ/Lab values for a large set of synthetic illuminants (e.g. P1800, P1900, ..., A, P3000, ..., D4000, D4200, D4400, ..., D10000, or maybe just concentrate on daylight or whatever). Then take it from there.
Jack

Since the XYZ to Lab conversion (or to any RGB output colour space) is a well-known deterministic formula (much like WB is), I don't see the point in making it part of the NN.
One of the problems with Lab is that it is not bad but also not good: it is not very perceptually 'accurate' and not very good at adapting to different illuminants (the values are normalized to the white point of the illuminant, i.e. XYZ scaling, which is considered to be the worst kind of adaptation). When you solve for a matrix (https://www.strollswithmydog.com/determiningforwardcolormatrix/) using the normal equation with reference values in XYZ, the results are not nearly as good as when using DE2000 as a cost function, which gives more weight to perceptual effects.
And so-called 'WB' (area under the SSFs normalized to 1) is a completely arbitrary convention: only one matrix is needed to project to wherever one wants to go, though it may be broken down as the product of 2 or more (for instance M = diag(WB)*rgb_2_xyzW1*xyzW1_2_xyzW2*xyzW2_2_sRGB, shown in reversed order for clarity).
So it would be good if the learning algorithm were able to incorporate the latter while dealing with the nonlinear perceptual issues of the former.
Jack

in LUT-based ICC profiles, who decides how many samples those LUTs are made of? And secondly: how/where is the way to interpolate between the LUT values decided? For example, when opening profiled image data in Photoshop and assigning it a given ICC profile, what kind of interpolation is used: just linear, or something more sophisticated?
FYI, in the DNG/dcp world the lookup tables can be any size but are often 90x30x30 (Hue, Saturation, Value, with Value often gamma-encoded). HSV is a cylinder reached from XYZ via ProPhotoRGB. Resulting values are interpolated trilinearly.
Jack

FYI, in the DNG/dcp world the lookup tables can be any size but are often 90x30x30 (Hue, Saturation, Value, with Value often gamma-encoded). HSV is a cylinder reached from XYZ via ProPhotoRGB. Resulting values are interpolated trilinearly.
Jack
That's quite a compact definition Jack! Thanks for all your comments.
I began to try some NNs for the white-balanced RAW RGB to XYZ conversion. Brute force for now, not caring about reserving a validation set or about overfitting effects.
First I tried a simple NN with no hidden layers. This is equivalent to a 3x3 matrix plus three bias terms (I want to investigate whether the bias terms can be set to 0 before training to get a real 3x3 camera matrix conversion):
(http://guillermoluijk.com/misc/nn3x3.png)
(http://guillermoluijk.com/misc/nn_matrix3x3.png)
Correlations show the good intentions of the linear transform but are far from perfect. I have no clear interpretation of the fact that the largest errors take place along the gray patches. Any idea?
These are the weights of the NN (i.e. the 3x3 RGB to XYZ matrix):
[array([[ 0.79471269,  0.35876139,  0.06018896],
       [ 0.27784208,  0.99636603, -0.26390342],
       [ 0.12707532, -0.11484679,  1.21830966]])]
and here are the bias terms (0..1 range), very close to 0 as expected:
[array([0.01081357, 0.01022717, 0.00390697])]
Then I tried a dense NN with 2 hidden layers of 200 neurons each. The result improves a lot (gray patches get their colour right), but I must confess I expected a perfect fit from such a dense NN and there are still some deviations:
(http://guillermoluijk.com/misc/nn_dense200.png)
Will work on the XYZ to Lab conversion to measure DeltaE's before training more NN structures.
The code here (https://github.com/gluijk/cameraprofilingmlp/blob/master/cameraprofilingmlp.ipynb).
Regards

Correlations show the good intentions of the linear transform but are far from perfect. I have no clear interpretation of the fact that the largest errors take place along the gray patches. Any idea?
These are the weights of the NN (i.e. the 3x3 RGB to XYZ matrix):
[array([[ 0.79471269,  0.35876139,  0.06018896],
       [ 0.27784208,  0.99636603, -0.26390342],
       [ 0.12707532, -0.11484679,  1.21830966]])]
and here are the bias terms (0..1 range), very close to 0 as expected:
[array([0.01081357, 0.01022717, 0.00390697])]
Assuming that the input to the matrix is white balanced data and that the matrix is in the form shown in Figure 1 here (https://www.strollswithmydog.com/determiningforwardcolormatrix/), when the rgb input is a neutral tone the xyz output will be proportional to the sum of the matrix rows. For instance, with rgb = [1,1,1] you should get the illuminant white point in xyz; rgb = [0.18,0.18,0.18] should result in 0.18 of those coordinates, etc. If you know the white point of the illuminant in XYZ, you only have to solve for 6 variables (vs 9).
The matrix is a compromise, which means that some tones will be better achieved through it than others. If you use the normal equation to solve for the matrix you get the maximum likelihood solution. Your network found another solution, one where the vertex of the cube (white point) is not very accurate: with an overdetermined system like this one, if you don't specify a criterion you can end up with an infinite number of solutions.
In your matrix the white point in XYZ (the sum of rows, normalized so that green is 1) is [1.2013 1.0000 1.2180], which corresponds to a CCT of 4327K with a large Duv of 0.036. Does this sound plausible? In other words, what was the illuminant at the time of capture? Have you tried the normal equation as a reference?
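For anyone wanting to try the normal-equation reference Jack mentions: with the patch data arranged as N×3 arrays it is a three-line least-squares fit. A sketch with synthetic stand-in data (the matrix and arrays are hypothetical, not the thread's actual measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 288 IT8 patches
true_M = np.array([[0.57, 0.25, 0.04],
                   [0.20, 0.72, -0.19],
                   [0.08, -0.09, 0.84]])
raw_wb = rng.uniform(0.0, 1.0, size=(288, 3))   # white-balanced raw RGB
xyz_ref = raw_wb @ true_M.T                     # reference XYZ per patch

# Least-squares (normal equation) solution of xyz = M @ rgb per patch
M, *_ = np.linalg.lstsq(raw_wb, xyz_ref, rcond=None)
M = M.T   # rows of M now map rgb to X, Y, Z
```

With noiseless synthetic data the fit recovers the generating matrix exactly; on real patch data it returns the least-squares compromise Jack describes.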
If you make the rgb data before white balance available with their reference values I have routines to easily calculate a 'normal' matrix and one based on a de2000 cost function. I am not used to python but use Matlab instead.
Jack

First I tried a simple NN with no hidden layers. This is equivalent to a 3x3 matrix plus three bias terms (I want to investigate whether the bias terms can be fixed at 0 during training to obtain a true 3x3 camera matrix conversion):
Are the output nodes linear functions or sigmoid functions? In the latter case the NN tries to overcome the curve, which it likely can't in such a small NN. Clearly, you don't want that, unless you output to Lab.

Also: are you comparing your specific sample of the IT8 with an average of several IT8? Or with a well measured version of your specific sample?

If you make the rgb data before white balance available with their reference values I have routines to easily calculate a 'normal' matrix and one based on a de2000 cost function. I am not used to python but use Matlab instead.
Thanks for such valuable information Jack, all this is new to me, will look at it closely. I'll send you the RGB file over the weekend in case you want to have a look at it.
Are the output nodes linear functions or sigmoid functions? In the latter case the NN tries to overcome the curve, which it likely can't in such a small NN. Clearly, you don't want that, unless you output to Lab.
Also: are you comparing your specific sample of the IT8 with an average of several IT8? Or with a well measured version of your specific sample?
For regression with MLP I always use linear output (identity function). In the hidden layers of the second NN, after trying different activation functions I stayed with ReLU (it's also linear in the positive range but clips negative values).
I am comparing the IT8 shot vs an accurate measurement of that precise chart (the author of the chart and of the measurement claims a measurement error below ΔE = 0.1).
I calculated the deltaE for the two NN, and the results seem nice. This is the Delta E distribution over the 288 patches for the linear NN and the dense NN (2 hidden layers with 200 neurons each):
(http://guillermoluijk.com/misc/histdeltaE.png)
If this is not wrong:
""" ΔE Quality:
<1 = Excellent (imperceptible)
12 = Good
24 = Normal
45 = Sufficient
>5 = Bad
"""
Even the linear solution produces good results for all 288 patches. The deep NN has Max(DeltaE) = 0.064, which is considered really good, right?
Regards

I just realised I wrote a superb XYZ to Lab routine, but didn't use it before calculating the Delta E values :P
So the previous histogram represents euclidean XYZ, not Lab, distances. Will calculate it right tomorrow.
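For reference, once both sides are in Lab the ΔE76 value is just the Euclidean distance; a minimal sketch (the example values are hypothetical):

```python
import numpy as np

def delta_e76(lab1, lab2):
    """CIE76 colour difference: Euclidean distance in L*a*b*."""
    return np.linalg.norm(np.asarray(lab1) - np.asarray(lab2), axis=-1)

# Hypothetical predicted vs reference patches
lab_pred = np.array([[50.0, 10.0, -10.0], [70.0, 0.0, 0.0]])
lab_ref  = np.array([[51.0, 10.0, -10.0], [70.0, 3.0, 4.0]])
dE = delta_e76(lab_pred, lab_ref)   # -> [1.0, 5.0]
```

The same function applied to XYZ values gives the (incorrect) Euclidean-XYZ histogram above, which is why the scale changes so much once the conversion to Lab is done first.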
Regards

My advice would be to stop using the 2x200 NN. It really obscures a lot of potentially interesting results. Use instead something like 2x4 with a sigmoid function. (Or at least add it as an additional configuration.)

Back to reality: when properly calculating Delta E over Lab values, the scale gets around 2 orders of magnitude higher:
() NN:
ΔE_max = 29.579874621089253 and ΔE_mean = 3.4460693382616014
(200, 200) NN:
ΔE_max = 7.035319348235826 and ΔE_mean = 1.035222151090219
(http://guillermoluijk.com/misc/histdeltaELab.png)
Yes, I want to try sigmoids and also less complex NN structures. And I'll train the NN to produce Lab output directly; I guess this makes sense for powerful NNs that can model the nonlinear transformations, since it minimizes the loss (error) in the final space where the Delta E values are measured. Ideally training would minimize the Delta E formula itself, but unfortunately the loss function can't be set arbitrarily.
Regards

Tested (200, 200) Lab output NN with sigmoid function in hidden layers and identity output:
(http://guillermoluijk.com/misc/mlp200200_lab.png)
All three variables (L, a, b) show very low errors, although colour accuracy (a, b) is higher than luminance (L), which shows more variance.
(200, 200) NN:
ΔE_max = 5.659828256166601 and ΔE_mean = 0.8287066247766278
(http://guillermoluijk.com/misc/histdeltaELab2.png)
It definitely improves the result. The mean error is now under 1, which I've been told is an excellent result. The max error is also reduced, to less than 6.
I also tested the 3x3 matrix model for RGB to Lab conversion, but as expected the linear model is unusable for approximating such a nonlinear transformation.
Regards

Ha, now it would be interesting to know the results for 2x3, then 2x4, 2x5, 2x6 etc and see whether it yields an optimum.
Additionally it remains interesting to know how a 1x3 or 1x4 sigmoid would solve rgb to lab...

I have nested several NNs in a loop to save training time. I don't much like this way of doing things (just trying a grid search of hyperparameter combinations and seeing which one performs best, without knowing why), but here it is. The format is: XYZ/Lab output, NN hidden layers, hidden-layer activation function, output activation function:
MLP_XYZ_()_relu_identity : ΔE_max = 29.5782 , ΔE_mean = 3.4459 , ΔE_median = 2.4032
MLP_Lab_()_relu_identity : ΔE_max = 82.2025 , ΔE_mean = 28.3684 , ΔE_median = 21.2329
MLP_XYZ_()_logistic_identity : ΔE_max = 29.5782 , ΔE_mean = 3.4459 , ΔE_median = 2.4032
MLP_Lab_()_logistic_identity : ΔE_max = 82.2578 , ΔE_mean = 28.3821 , ΔE_median = 21.2219
MLP_XYZ_(3, 3)_relu_identity : ΔE_max = 108.3430 , ΔE_mean = 41.8331 , ΔE_median = 37.0897
MLP_Lab_(3, 3)_relu_identity : ΔE_max = 112.5269 , ΔE_mean = 42.6017 , ΔE_median = 39.5577
MLP_XYZ_(3, 3)_logistic_identity : ΔE_max = 23.9131 , ΔE_mean = 4.5645 , ΔE_median = 2.9977
MLP_Lab_(3, 3)_logistic_identity : ΔE_max = 77.6024 , ΔE_mean = 25.7275 , ΔE_median = 20.5809
MLP_XYZ_(50, 50)_relu_identity : ΔE_max = 13.7616 , ΔE_mean = 2.1762 , ΔE_median = 1.6024
MLP_Lab_(50, 50)_relu_identity : ΔE_max = 12.9015 , ΔE_mean = 3.6170 , ΔE_median = 3.1430
MLP_XYZ_(50, 50)_logistic_identity : ΔE_max = 22.5918 , ΔE_mean = 4.0708 , ΔE_median = 2.6891
MLP_Lab_(50, 50)_logistic_identity : ΔE_max = 6.0237 , ΔE_mean = 0.9943 , ΔE_median = 0.6923
MLP_XYZ_(200, 200)_relu_identity : ΔE_max = 7.0373 , ΔE_mean = 1.0364 , ΔE_median = 0.6827
MLP_Lab_(200, 200)_relu_identity : ΔE_max = 7.4150 , ΔE_mean = 1.1333 , ΔE_median = 0.8822
MLP_XYZ_(200, 200)_logistic_identity : ΔE_max = 14.8826 , ΔE_mean = 2.7814 , ΔE_median = 1.8480
MLP_Lab_(200, 200)_logistic_identity : ΔE_max = 5.6598 , ΔE_mean = 0.8287 , ΔE_median = 0.4912
MLP_XYZ_(200, 200, 200)_relu_identity : ΔE_max = 6.3270 , ΔE_mean = 1.2530 , ΔE_median = 0.7609
MLP_Lab_(200, 200, 200)_relu_identity : ΔE_max = 7.3421 , ΔE_mean = 0.9603 , ΔE_median = 0.7042
MLP_XYZ_(200, 200, 200)_logistic_identity : ΔE_max = 14.4747 , ΔE_mean = 2.7297 , ΔE_median = 1.9047
MLP_Lab_(200, 200, 200)_logistic_identity : ΔE_max = 5.6715 , ΔE_mean = 0.7346 , ΔE_median = 0.3988
I find that the best tradeoff between complexity and performance is:
MLP_Lab_(50, 50)_logistic_identity : ΔE_max = 6.0237 , ΔE_mean = 0.9943 , ΔE_median = 0.6923
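The grid-search loop behind these numbers can be sketched roughly as follows. I'm assuming scikit-learn's MLPRegressor (which the MLP naming suggests, but that is an assumption); the data arrays are random stand-ins, not the IT8 measurements:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(288, 3))   # stand-in for white-balanced raw RGB
Y = np.cbrt(X)                             # stand-in nonlinear target (XYZ or Lab)

results = {}
for hidden in [(3, 3), (16, 16)]:
    for act in ['relu', 'logistic']:
        mlp = MLPRegressor(hidden_layer_sizes=hidden, activation=act,
                           max_iter=500, random_state=0)
        mlp.fit(X, Y)
        err = np.linalg.norm(mlp.predict(X) - Y, axis=1)  # per-patch distance
        results[(hidden, act)] = (err.max(), err.mean(), np.median(err))
```

Each results entry then corresponds to one ΔE_max / ΔE_mean / ΔE_median line of the table above.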
Training loss:
(http://guillermoluijk.com/misc/MLP_Lab_(50,%2050)_logistic_identity_LOSS.png)
Prediction vs Real correlation:
(http://guillermoluijk.com/misc/MLP_Lab_(50,%2050)_logistic_identity_CORR.png)
Again L seems to contain more errors than colour (a, b). I still haven't checked which patches worked best and worst.
Delta E distribution:
(http://guillermoluijk.com/misc/MLP_Lab_(50,%2050)_logistic_identity_HIST.png)
I also defined and checked some needed conversion functions:
- XYZ (D50) to Lab conversion
- Lab to XYZ (D50) conversion
- XYZ (D50) to sRGB (D65) conversion
- XYZ (D50) to ProPhoto RGB (D50) conversion
- Delta E calculation
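As a concrete sketch of the first function in that list, the XYZ (D50) to Lab conversion can be written as below. The D50 white point values are the standard ICC ones; this is my own implementation, not the notebook's:

```python
import numpy as np

D50 = np.array([0.9642, 1.0000, 0.8249])   # XYZ of the D50 white point

def xyz_to_lab(xyz, white=D50):
    """CIE XYZ -> L*a*b* with the piecewise cube-root compression."""
    t = np.asarray(xyz, dtype=float) / white
    eps, kappa = (6 / 29) ** 3, (29 / 3) ** 3   # CIE constants
    f = np.where(t > eps, np.cbrt(t), (kappa * t + 16) / 116)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)
```

As a sanity check, the white point itself should map to L* = 100, a* = b* = 0, and black to all zeros.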
(https://1.bp.blogspot.com/GnHsVTykfqs/XPRLTWpB2KI/AAAAAAAAGnw/wY79pps1SAcqUcKeEAoDblPtASdlfxVggCLcBGAs/s1600/colour_spaces.png)
I did a complete prediction with the NN over the input RAW values, and compared it to the expected theoretical values (right half rectangle on each patch). Obviously something went wrong, because large errors are clearly visible. I need to check where the fault is.
(http://guillermoluijk.com/misc/IT8_pred_ProPhotoRGB_comp.jpg)
Regards

I did a complete prediction with the NN over the input RAW values, and compared it to the expected theoretical values (right half rectangle on each patch). Obviously something went wrong, because large errors are clearly visible. I need to check where the fault is.
Good progress, and well done.
A useful observation is that gray at least remains neutral. If you figure out why it is too bright, it may show you why the colors are off. Something as simple as clipping may be the culprit.
What surprises me though is that (3, 3) doesn't give better results already. Is the logistic node's output positive only? I'm pretty sure that something like (3, 4, 4, 3) should absolutely suffice. The additional layers may be necessary for the NN to scale the values internally. The 4-node layers should be optimal in the same way that affine transforms are 4x4 matrices. If you allow the NN to contort the color cube using 4 degrees of freedom (if you will) then it should absolutely be able to come close to perfect.

Back to reality: when properly calculating Delta E over Lab values, the scale gets around 2 orders of magnitude higher:
() NN: ΔE_max = 29.579874621089253 and ΔE_mean = 3.4460693382616014
Thanks for the iT8 capture and relative XYZ reference values Guillermo. The matrix that takes whitebalanced raw data to XYZ suggested by the Normal Equation is
0.5740 0.2535 0.0404
0.2023 0.7195 -0.1887
0.0769 -0.0907 0.8439
Note that feeding it white-balanced raw white [1,1,1] results in values lower than expected (if everything is balanced there should be at least one entry near 1): [0.8680 0.7331 0.8301]. The CCT of the matrix is 4141K, 0.032 Duv, not far from what your NN found. RawTherapee and FRV white balance readings report about 4400K/14 CCT, suggesting that the lighting at the time of capture was most likely not a good approximation of D50.
If I feed the raw data, white balanced on the most neutral patch GS11 around mid-gray and normalized to the same value as the relative reference Y, to the dE76-optimizing routine using the XYZ spectral measurements you provided (but with the white point of a blackbody at 4400K, as good a guess as any at this point), we get the first attachment, which is comparable to what you did. If I change the WP to a blackbody at 5000K, little changes. Most of the action is at less than 10 dE76. I have never played with a target with so many patches, so I don't know whether this is good or poor performance: max dE76 21.63, mean dE76 3.59, SMI 80.5.
I prefer dE2000 as a metric, so using it and a P5000K WP results in the second attachment. This last one (or one with the correct WP) is the one I would use in practice, and this is the resulting matrix from wbraw to sRGB(D65) under this lighting:
1.6054 -0.5544 -0.0510
-0.1025 1.3965 -0.2940
0.0109 -0.4450 1.4341
But back to the point about the raw matrix values being low: roughly 85% of expected from the normal equation, and a k of about 86% in the attachments, suggests uneven lighting. I did not adjust the image for light gradients, did you? There is apparently about a 6% gradient from the white square near A1 (9770 DN) to the one near L22 (10370 DN). Also, when the image fills so much of the frame it is likely that there is some light falloff that could mess with the captured values compared to the well-controlled spectro measurements, even at f/16. Lack of linearity in the capture may be a hint as to the errors you are seeing in your side-by-side comparison.
Jack

Comparison of the Adobe DNG Forward Matrix + HSV table rendition vs the raw data rendered by the wbraw->XYZ(D50) matrix above only, this time to Adobe RGB (so you have to look at it with a properly color-managed viewer). Some differences, mostly in the darker tones; mine are the darker ones.

Thanks for the good stuff Jack. I definitely need to better understand the implications of white balance and capture lighting in the whole process. Will read your article and posts carefully.
I didn't correct for light gradients, but I checked the four gray squares in the RAW data and they seemed OK to me (top-left, top-right / bottom-left, bottom-right values):
(35,35,34) (35,35,35)
(36,36,35) (35,35,35)
Regarding the appropriate patches for calculating WB, I averaged them all, discarding GS0 and GS1, which according to the measurements were by far the least accurate (GS7 seems best):
(http://guillermoluijk.com/misc/it8_wb_patches.png)
Surprisingly, I ran the prediction again from scratch and got this:
(http://guillermoluijk.com/misc/IT8_pred_ProPhotoRGB_comp2.jpg)
I think in the former wrong prediction I applied the NN to a version already converted to ProPhoto RGB (DCRAW output). Anyway, the neutral patches are clearly the weak point of the prediction. I need to understand this, especially why only the gray patches seem to have large L errors. I could understand it if all patches did (L has a different scale than a/b after all, and I didn't normalise the Lab data to train the NN), but only the neutral ones?
Maybe the key is that the predicted GS patches are darker than the exact ones, while the prediction for column 16 is lighter than the exact values, and this could be fooling the NN if the measured chart values are mismatched. I.e., for some reason (spectrophotometer vs camera behaviour differences over the printed inks) lower captured RAW values correspond to brighter patches and vice versa:
Patch L16 has higher L values than GS23 in RAW_WB, but a lower L value in the theoretical chart:
- RAW_WB: L16=12.94 GS23=11.38
- Lab: L16=3.66 GS23=6.72
There is no continuous solution for such a crossover. I will train the NN alternately dropping column 16 and then the GS patches.
Regarding the Delta E calculation, I looked at the dE2000 metric and got lost in the formulation. It would be great if it could be used as a loss function for the NN training, but for testing purposes I'll stick with the primitive dE76.
Regards!

Could it be an indexing problem? Some array index off by 1?

Could it be an indexing problem? Some array index off by 1?
Would make sense: it is training with incorrect reference in the GS, but with correct samples in column 16. There is very likely an indexing problem in the GS patches, somewhere during training.

Regarding the Delta E calculation, I looked at the dE2000 metric and got lost in the formulation. It would be great if it could be used as a loss function for the NN training, but for testing purposes I'll stick with the primitive dE76.
Regards!
You can safely stick with dE76. It's fast and fine for the purposes here. The later dE variations are more interesting for other purposes: describing the perceptual differences between our grayscale perception and our colour perception, for example. That does not apply to anything here.

Note that once you get the bug resolved, I do expect a renewed comparison between (3, 3), (4, 4), etc...
No need to compare linear, we already know that won't work for perceptual output.
If you need to speed things up: you could also check whether you really need 20000 epochs each time, if your loss curve shows that 5000 epochs do the trick.
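On that epoch-count point: if this is scikit-learn's MLPRegressor (my assumption), `tol` and `n_iter_no_change` already stop training once the loss plateaus, so a 20000-epoch budget is only spent when needed. Sketch with stand-in data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(288, 3))   # stand-in for white-balanced raw RGB
Y = np.cbrt(X)                             # stand-in nonlinear target

# Training stops early once the loss improves by less than tol for
# n_iter_no_change consecutive epochs; max_iter is just an upper bound.
mlp = MLPRegressor(hidden_layer_sizes=(16, 16), activation='tanh',
                   tol=1e-6, n_iter_no_change=50,
                   max_iter=20000, random_state=0)
mlp.fit(X, Y)
# mlp.n_iter_ reports the epochs actually run; mlp.loss_curve_ the loss history
```

Inspecting `mlp.n_iter_` after a few runs shows how many epochs the problem actually needs.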
In order to check the code, you could perhaps separate the output logs from the code in your GitHub. Currently there are a couple of lines of code and then thousands of lines of output, which kind of makes it unreadable.

I think in the former wrong prediction I applied the NN over and already converted to ProPhoto RGB version (DCRAW output). Anyway, clearly neutral patches are the weak point of the prediction, I need to understand this, specially why only gray patches seem to have large L errors. I could understand all patches would (L has a different scale as a/b after all, and I didn't normalise the Lab data to train the NN), but only the neutral ones?
That looks much better. I think remaining differences could be due to the different processing in the two cases.
Just thinking aloud: I believe DCRAW applies a standard tone curve. Do you? In my example I did not, and processed the two renderings exactly the same other than for color: the left portion of each color square is white balanced on GS11 and has just the dE2k P5000 matrix adjustment, while the right portion uses GS11 and Adobe's interpolated forward matrix followed by HSV corrections with no 'look' or 'tone' applied (that goes top and bottom in the neutral patches). DCRAW uses in-camera multipliers by default (corresponding to about 5100K as I remember). Etc.
Jack
PS For the general public: many colors are outside of sRGB, so differences may be more difficult to spot on non-wide-gamut monitors.

I need to better understand the implications of white balance and capture lighting in the whole process.
One last thing on the lighting: Flare really messes up this type of calibration by making tones look lighter than the reference, therefore affecting how lightness is modeled. I don't know this target so I don't know whether it exhibits flare, though it looks like it might in some areas.
Jack

PS For the general public: many colors are outside of sRGB, so differences may be more difficult to spot on non-wide-gamut monitors.
Just a thought: in the past I have used a diagonal split for comparison (and ended up using an S-shaped diagonal for the most noticeable difference). It made sense at the time, considering that the vast number of horizontal and vertical patterns may obscure the differences in a rectangular comparison, especially if they happen to coincide with the L steps.

Thanks for your feedback guys. I'm pretty sure I didn't offset the patches when training, but I will check that. I firmly believe the problem is in the chart itself as you suggest, Jack, or rather in the discrepancy between the measured lightness values (spectrophotometer) and the camera (sensor + optics) for column 16 vs the gray row. The lightness crossover between column 16 and the gray patches is real. I think it can even be seen in the L real vs prediction plot, where errors alternately fell below and above the expected value:
(http://guillermoluijk.com/misc/MLP_Lab_(50,%2050)_logistic_identity_CORR.png)
If this is true, it confirms how difficult it is to do a proper capture of these glossy charts. Years ago I had to give up because I couldn't eliminate unwanted reflections. My bet is that as long as column 16 is dropped from the training set, the result for the gray patches below will be good.
Regarding DCRAW, Jack: I used it in such a way that it applies no tone curve (option -4, linear) and no colour conversion (-o 0). This means it only performs black point subtraction, linear scaling to the 16-bit range, and per-channel white balance scaling (the chosen WB pushes R by 2.299 and B by 1.805), giving 100% linear scaled (WB) RAW data. This is the DCRAW command used:
dcraw -v -r 2.299 1 1.805 1 -t 0 -o 0 -4 -T IT8.NEF
The GitHub repository (https://github.com/gluijk/cameraprofilingmlp/blob/master/cameraprofilingmlp.ipynb) is now clean of training data so it's easier to follow.
Regards

My bet is that as long as column 16 is dropped from the training set, the result for the gray patches below will be good.
Nope.
I'm willing to take on your bet.
The more I think about it, the more I'm convinced it is an indexing problem. First of all: you can't have both lighter and darker results in gray from any conversion. There is a minute difference in source colors there, but it is so insignificant that an otherwise reasonable conversion can not in any way result in both lighter and darker patches that still look neutral. Second: the reflections mentioned do not seem to be anywhere near significant, proven by Jack's own conversion.
Find the indexing problem, it is there 99,999999% sure.

See, more proof there is an indexing problem. This is NOT some fluke coincidence...

The more I think about it, the more I'm convinced it is an indexing problem. First of all: you can't have both lighter and darker results in gray from any conversion. There is a minute difference in source colors there, but it is so insignificant that an otherwise reasonable conversion can not in any way result in both lighter and darker patches that still look neutral. Second: the reflections mentioned do not seem to be anywhere near significant, proven by Jack's own conversion.
Find the indexing problem, it is there 99,999999% sure.
Can you elaborate on what you mean by 'indexing' problem 32BT?

Can you elaborate on what you mean by 'indexing' problem 32BT?
An array index problem.
If we look closely at the GS result patches then the GS1 result matches the GS0 source, the GS2 result matches the GS1 source, etc...
Since this doesn't seem to be a drawing problem, it's likely that the NN is being trained to match GS1 to GS0, and so on. This would explain the original curved deviation in L and the slightly dark column 16. Whether this is a source index problem or a reference index problem, I don't know.
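A quick numerical way to test this hypothesis: compare the prediction error against the reference as-is versus the reference shifted by one patch; a large drop after shifting flags the off-by-one. Sketch with made-up gray-scale L* values:

```python
import numpy as np

# Hypothetical L* values for 24 gray patches, light to dark
true_L = np.linspace(95.0, 5.0, 24)
pred_L = true_L + 0.2            # a good prediction of the real patches
ref_L = np.roll(true_L, 1)       # reference list accidentally offset by one

err_asis = np.abs(pred_L - ref_L).mean()
err_shifted = np.abs(pred_L[:-1] - ref_L[1:]).mean()   # realigned by one
# err_shifted far below err_asis would flag an indexing problem
```

The same comparison run on the actual GS predictions and references would confirm or rule out the bug in a couple of lines.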

Nope.
I'm willing to take on your bet.
The more I think about it, the more I'm convinced it is an indexing problem. First of all: you can't have both lighter and darker results in gray from any conversion. There is a minute difference in source colors there, but it is so insignificant that an otherwise reasonable conversion can not in any way result in both lighter and darker patches that still look neutral. Second: the reflections mentioned do not seem to be anywhere near significant, proven by Jack's own conversion.
Find the indexing problem, it is there 99,999999% sure.
Not me, not you ;) it was the measurement that caused the anomalous lightness crossover. The author provided me with another measurement of the same card taken on a previous day, and that one doesn't show the anomaly. However, I am not at all convinced by this measurement, because the gray patches are measured as strongly bluish.
However, regarding the NN: once trained it works really well. Mean delta E increases a bit, but max delta E gets reduced, and L on the gray patches seems much more adequate (except for the two darkest patches, L16 and GS23; I bet patch 22A is responsible, because its prediction remains darker than the target, again a measurement anomaly).
The NN makes the gray patches remain neutral (white balance surely has an influence on this), so in some way it fixes the colour tint measured in the gray patches:
MLP_XYZ_()_relu_identity : ΔE_max = 31.2143 , ΔE_mean = 3.5602 , ΔE_median = 2.3695
MLP_Lab_()_relu_identity : ΔE_max = 80.7121 , ΔE_mean = 28.4108 , ΔE_median = 21.4503
MLP_XYZ_()_logistic_identity : ΔE_max = 31.2143 , ΔE_mean = 3.5602 , ΔE_median = 2.3695
MLP_Lab_()_logistic_identity : ΔE_max = 80.7667 , ΔE_mean = 28.4247 , ΔE_median = 21.4171
MLP_XYZ_(3, 3)_relu_identity : ΔE_max = 108.2972 , ΔE_mean = 41.9445 , ΔE_median = 37.2037
MLP_Lab_(3, 3)_relu_identity : ΔE_max = 112.5166 , ΔE_mean = 42.6690 , ΔE_median = 39.2641
MLP_XYZ_(3, 3)_logistic_identity : ΔE_max = 18.9269 , ΔE_mean = 3.8357 , ΔE_median = 2.7222
MLP_Lab_(3, 3)_logistic_identity : ΔE_max = 76.0081 , ΔE_mean = 25.5393 , ΔE_median = 20.7859
MLP_XYZ_(4, 4)_relu_identity : ΔE_max = 89.5776 , ΔE_mean = 22.5197 , ΔE_median = 17.3923
MLP_Lab_(4, 4)_relu_identity : ΔE_max = 83.8646 , ΔE_mean = 32.4117 , ΔE_median = 31.2353
MLP_XYZ_(4, 4)_logistic_identity : ΔE_max = 14.0973 , ΔE_mean = 2.6132 , ΔE_median = 1.9514
MLP_Lab_(4, 4)_logistic_identity : ΔE_max = 67.2113 , ΔE_mean = 17.6860 , ΔE_median = 12.3790
MLP_XYZ_(16, 16)_relu_identity : ΔE_max = 13.3266 , ΔE_mean = 2.4412 , ΔE_median = 1.6316
MLP_Lab_(16, 16)_relu_identity : ΔE_max = 18.4030 , ΔE_mean = 5.1635 , ΔE_median = 4.3559
MLP_XYZ_(16, 16)_logistic_identity : ΔE_max = 13.5694 , ΔE_mean = 2.2652 , ΔE_median = 1.5868
MLP_Lab_(16, 16)_logistic_identity : ΔE_max = 7.6640 , ΔE_mean = 1.6545 , ΔE_median = 1.3866
MLP_XYZ_(50, 50)_relu_identity : ΔE_max = 11.1737 , ΔE_mean = 2.1659 , ΔE_median = 1.5782
MLP_Lab_(50, 50)_relu_identity : ΔE_max = 10.4095 , ΔE_mean = 2.8398 , ΔE_median = 2.3855
MLP_XYZ_(50, 50)_logistic_identity : ΔE_max = 21.1758 , ΔE_mean = 3.6650 , ΔE_median = 2.5787
MLP_Lab_(50, 50)_logistic_identity : ΔE_max = 4.9034 , ΔE_mean = 1.1084 , ΔE_median = 0.8211
MLP_XYZ_(100, 100)_relu_identity : ΔE_max = 14.9724 , ΔE_mean = 1.8887 , ΔE_median = 1.5196
MLP_Lab_(100, 100)_relu_identity : ΔE_max = 6.9355 , ΔE_mean = 1.7798 , ΔE_median = 1.5225
MLP_XYZ_(100, 100)_logistic_identity : ΔE_max = 31.8153 , ΔE_mean = 5.3645 , ΔE_median = 3.2542
MLP_Lab_(100, 100)_logistic_identity : ΔE_max = 4.6104 , ΔE_mean = 1.0028 , ΔE_median = 0.6827
MLP_XYZ_(200, 200)_relu_identity : ΔE_max = 4.1306 , ΔE_mean = 0.8433 , ΔE_median = 0.5115
MLP_Lab_(200, 200)_relu_identity : ΔE_max = 3.9394 , ΔE_mean = 1.1816 , ΔE_median = 0.9588
MLP_XYZ_(200, 200)_logistic_identity : ΔE_max = 24.1265 , ΔE_mean = 3.3331 , ΔE_median = 2.4439
MLP_Lab_(200, 200)_logistic_identity : ΔE_max = 5.0772 , ΔE_mean = 0.8826 , ΔE_median = 0.5278
Simple NNs perform really badly at predicting Lab values. From 16 neurons/layer upwards, Lab models start to perform better than XYZ models, although I'd rather call it Lab/XYZ convergence.
(http://guillermoluijk.com/misc/deltae_vs_complexity.png)
MLP_Lab_(50, 50)_logistic_identity : ΔE_max = 4.9034 , ΔE_mean = 1.1084 , ΔE_median = 0.8211
(http://guillermoluijk.com/misc/MLP_Lab_(50,%2050)_logistic_identity_LOSS2.png)
Look at the L correlation: the alternating ups & downs are gone:
(http://guillermoluijk.com/misc/MLP_Lab_(50,%2050)_logistic_identity_CORR2.png)
(http://guillermoluijk.com/misc/MLP_Lab_(50,%2050)_logistic_identity_HIST2.png)
(http://guillermoluijk.com/misc/IT8_pred_ProPhotoRGB_comp_2.jpg)
Previous vs current measurements on gray patches:
(http://guillermoluijk.com/misc/gray_patches_ab.png)
Regards

Do you have the original it8.txt and the new it8.txt available?

MLP_XYZ_(3, 3)_logistic_identity : ΔE_max = 18.9269 , ΔE_mean = 3.8357 , ΔE_median = 2.7222
MLP_Lab_(3, 3)_logistic_identity : ΔE_max = 76.0081 , ΔE_mean = 25.5393 , ΔE_median = 20.7859
Good show Guillermo!
If I understand correctly, 3x3 represents a 5 layer network: input, output and 3 hidden layers with 3 activation units each. The output layer explodes back out to the same size as the input, it is then compared to the given reference data, feeding the result into the minimization algorithm. Correct?
The logistic activation function seems to produce similar performance to classical linear methods with an XYZ reference. I am curious as to why it does not work nearly as well with a Lab reference. I suspect the linear output layer. What do you guys think?
Jack

The logistic activation function seems to produce similar performance to classical linear methods with an XYZ reference. I am curious as to why it does not work nearly as well with a Lab reference. I suspect the linear output layer. What do you guys think?
Yes, you already gave the key to understanding this: there is a nearly linear relation between RGB_WB and XYZ, but Lab is totally nonlinear vs RGB_WB or XYZ. This means a (3,3) NN (2 hidden layers with 3 neurons each) is not capable of dealing with such a level of nonlinearity. As soon as we introduce 16 neurons/layer we have more degrees of freedom and the nonlinear abilities of the NN start to shine, and since the loss function is computed in the Lab Euclidean space, Lab models perform better than XYZ. (Surely an XYZ NN with a custom-defined loss function would be the best model, because the XYZ to Lab transformation is deterministic, so RGB_WB -> XYZ would be the logical mapping; but I cannot define the loss function.)
MLP_XYZ_(3, 3)_logistic_identity_CORR:
(http://guillermoluijk.com/misc/MLP_XYZ_(3, 3)_logistic_identity_CORR.png)
MLP_Lab_(3, 3)_logistic_identity_CORR:
(http://guillermoluijk.com/misc/MLP_Lab_(3, 3)_logistic_identity_CORR.png)
Regards

Yes, you already gave the key to understanding this: there is a nearly linear relation between RGB_WB and XYZ, but Lab is totally nonlinear vs RGB_WB or XYZ. This means a (3,3) NN (2 hidden layers with 3 neurons each) is not capable of dealing with such a level of nonlinearity.
But the problem might be caused by the early assessment selecting relu and logistic as the best performers. Now that you know that the entire experiment works properly, you might retry tanh for the Lab case. tanh allows the NN to produce smooth transitions between negative and positive values within one node. Otherwise it needs increased complexity to achieve the same.

Maybe we can introduce a new activation function! We might finally have found a potentially useful application for this formula: https://forum.luminouslandscape.com/index.php?topic=58257.0
:)

But the problem might be caused by the early assessment selecting relu and logistic as the best performers. Now that you know that the entire experiment works properly, you might retry tanh for the Lab case. tanh allows the NN to produce smooth transitions between negative and positive values within one node. Otherwise it needs increased complexity to achieve the same.
First of all I have to say you were right: the gray patches were offset by one position, because the author made a mistake, measuring GS0 twice and never recording a measurement for patch GS23.
He has measured the GS0-GS23 patches again and merged the measurements. I don't much like the idea of mixing measurements taken at different times and possibly under different conditions, but OK.
I dropped the relu activation function and introduced tanh, thanks for the suggestion. It converges faster than the sigmoid and performs better, so my best tradeoff candidate this time is MLP_Lab_(50, 50)_tanh_identity:
MLP_XYZ_()_tanh_identity : ΔE_max = 33.6562 , ΔE_mean = 3.2097 , ΔE_median = 1.8020
MLP_Lab_()_tanh_identity : ΔE_max = 82.1350 , ΔE_mean = 28.3336 , ΔE_median = 21.5002
MLP_XYZ_()_logistic_identity : ΔE_max = 33.6562 , ΔE_mean = 3.2097 , ΔE_median = 1.8020
MLP_Lab_()_logistic_identity : ΔE_max = 82.1906 , ΔE_mean = 28.3471 , ΔE_median = 21.4991
MLP_XYZ_(3, 3)_tanh_identity : ΔE_max = 15.3840 , ΔE_mean = 2.4645 , ΔE_median = 1.8418
MLP_Lab_(3, 3)_tanh_identity : ΔE_max = 43.0895 , ΔE_mean = 11.6182 , ΔE_median = 9.2091
MLP_XYZ_(3, 3)_logistic_identity : ΔE_max = 18.1560 , ΔE_mean = 3.8059 , ΔE_median = 2.2673
MLP_Lab_(3, 3)_logistic_identity : ΔE_max = 77.2547 , ΔE_mean = 25.7759 , ΔE_median = 20.7278
MLP_XYZ_(4, 4)_tanh_identity : ΔE_max = 11.2728 , ΔE_mean = 2.0082 , ΔE_median = 1.5386
MLP_Lab_(4, 4)_tanh_identity : ΔE_max = 37.5632 , ΔE_mean = 8.2402 , ΔE_median = 6.7438
MLP_XYZ_(4, 4)_logistic_identity : ΔE_max = 16.7437 , ΔE_mean = 2.3609 , ΔE_median = 1.6683
MLP_Lab_(4, 4)_logistic_identity : ΔE_max = 68.0552 , ΔE_mean = 17.6988 , ΔE_median = 12.4104
MLP_XYZ_(16, 16)_tanh_identity : ΔE_max = 13.7044 , ΔE_mean = 2.0632 , ΔE_median = 1.2234
MLP_Lab_(16, 16)_tanh_identity : ΔE_max = 3.9206 , ΔE_mean = 1.0084 , ΔE_median = 0.8693
MLP_XYZ_(16, 16)_logistic_identity : ΔE_max = 13.1919 , ΔE_mean = 1.9344 , ΔE_median = 1.2701
MLP_Lab_(16, 16)_logistic_identity : ΔE_max = 6.1042 , ΔE_mean = 1.3190 , ΔE_median = 1.0214
MLP_XYZ_(50, 50)_tanh_identity : ΔE_max = 14.7397 , ΔE_mean = 2.4049 , ΔE_median = 1.6943
MLP_Lab_(50, 50)_tanh_identity : ΔE_max = 3.9451 , ΔE_mean = 0.6966 , ΔE_median = 0.5313
MLP_XYZ_(50, 50)_logistic_identity : ΔE_max = 26.3273 , ΔE_mean = 3.5872 , ΔE_median = 2.1889
MLP_Lab_(50, 50)_logistic_identity : ΔE_max = 4.0815 , ΔE_mean = 0.7272 , ΔE_median = 0.5018
MLP_XYZ_(100, 100)_tanh_identity : ΔE_max = 9.9610 , ΔE_mean = 1.5248 , ΔE_median = 1.0915
MLP_Lab_(100, 100)_tanh_identity : ΔE_max = 3.3160 , ΔE_mean = 0.5003 , ΔE_median = 0.3757
MLP_XYZ_(100, 100)_logistic_identity : ΔE_max = 29.8789 , ΔE_mean = 3.7231 , ΔE_median = 2.1818
MLP_Lab_(100, 100)_logistic_identity : ΔE_max = 3.7744 , ΔE_mean = 0.6649 , ΔE_median = 0.4671
MLP_XYZ_(200, 200)_tanh_identity : ΔE_max = 8.4625 , ΔE_mean = 1.4752 , ΔE_median = 0.9700
MLP_Lab_(200, 200)_tanh_identity : ΔE_max = 3.3737 , ΔE_mean = 0.4059 , ΔE_median = 0.2591
MLP_XYZ_(200, 200)_logistic_identity : ΔE_max = 16.8308 , ΔE_mean = 1.8797 , ΔE_median = 1.1590
MLP_Lab_(200, 200)_logistic_identity : ΔE_max = 4.2315 , ΔE_mean = 0.6553 , ΔE_median = 0.4166
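For anyone wanting to reproduce a grid like the one above, here is a minimal sketch using scikit-learn's MLPRegressor (whose output layer is always identity, matching the `_identity` suffix in the naming scheme). The array names `rgb_wb` and `lab_ref` are assumptions for the white-balanced RAW values and the reference Lab measurements; ΔE here is plain CIE76:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def delta_e76(lab1, lab2):
    """CIE76 colour difference: Euclidean distance in L*a*b*."""
    return np.linalg.norm(np.asarray(lab1) - np.asarray(lab2), axis=-1)

def fit_and_score(rgb_wb, lab_ref, hidden=(50, 50), activation='tanh', seed=0):
    """Fit an MLP from white-balanced RAW RGB to Lab and report ΔE stats.
    MLPRegressor's identity output activation matches the '_identity'
    suffix used in the result names above."""
    net = MLPRegressor(hidden_layer_sizes=hidden, activation=activation,
                       solver='lbfgs', max_iter=5000, random_state=seed)
    net.fit(rgb_wb, lab_ref)
    de = delta_e76(net.predict(rgb_wb), lab_ref)
    return de.max(), de.mean(), np.median(de)
```

Looping `fit_and_score` over `hidden` in `(), (3, 3), (4, 4), (16, 16), (50, 50), …` and `activation` in `('tanh', 'logistic')` would produce a table of the same shape as the results listed here.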
Mean delta E is lower than 1, with max delta E below 4!
Quick and smooth convergence:
(http://guillermoluijk.com/misc/MLP_Lab_(50, 50)_tanh_identity_LOSS.png)
Nice correlation (again lightness in dark and very bright patches gets the least accurate fit):
(http://guillermoluijk.com/misc/MLP_Lab_(50, 50)_tanh_identity_CORR.png)
Good Delta E distribution gathered below 1:
(http://guillermoluijk.com/misc/MLP_Lab_(50, 50)_tanh_identity_HIST.png)
Left half=prediction vs Right half=exact value (target):
(http://guillermoluijk.com/misc/MLP_Lab_(50, 50)_tanh_identity_comp.jpg)
Pretty impressive, right? The NN complexity is 2800 weights + 103 biases = 2903 numbers (stored as 32-bit floats, that means about 11.3 KB is needed to store the profile's definition)
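The bookkeeping behind those numbers: a fully connected (3, 50, 50, 3) net has 3·50 + 50·50 + 50·3 = 2800 weights and 50 + 50 + 3 = 103 biases. A quick helper to check (the function name is just illustrative):

```python
def n_params(layers):
    """Weight and bias count of a fully connected net,
    e.g. layers = (3, 50, 50, 3) for a (50, 50) hidden configuration."""
    weights = sum(a * b for a, b in zip(layers[:-1], layers[1:]))
    biases = sum(layers[1:])
    return weights, biases

# n_params((3, 50, 50, 3)) -> (2800, 103): 2903 numbers in total
# n_params((3, 4, 4, 3))   -> (40, 11):    the 51 numbers of the (4, 4) net
```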
I also predicted one of the simplest models: MLP_XYZ_(4, 4)_tanh_identity : ΔE_max = 11.2728 , ΔE_mean = 2.0082 , ΔE_median = 1.5386
(http://guillermoluijk.com/misc/MLP_Lab_(4, 4)_tanh_identity_comp.jpg)
Not bad for a NN defined by 51 numbers:
(http://guillermoluijk.com/misc/MLP_(4, 4).png)
Now I need to check how the NN behaves for unseen colours, i.e., interpolating colours that are not in the chart. I expect the interpolations to be smooth; any ringing behaviour would be bad news. I can also validate it on real images. I think tanh could help in providing smooth transitions?
Regards

Excellent, even the 16x16 Lab network looks good. Based on Torger's comments I have a feeling that, as long as results are 'good enough', the smaller the network the better in terms of avoiding overfitting problems.
I am curious as to how such a network would perform with a nonlinear output activation function, say tanh, since it seems to work well. I am asking because the neutrals are still not quite right, perhaps because L is not linear (identity)?
Also, in order not to have to do backward somersaults with the testing setup, I think for proper validation of the performance of a chosen network one would ideally grab Spectral Sensitivity Functions for the sample camera (from here (https://www.imageengineering.de/content/library/diploma_thesis/christian_mauer_spectral_response.pdf) for instance), choose an illuminant SPD and generate properly spaced training and cross validation sets. Then change the illuminant and determine how far one can go before it falls apart. Next add some subset of the illuminant SPD as an input, all the way down to just the wb multipliers.
What is the name of the correct XYZ reference file now Guillermo? I'll see if it makes much of a difference for the linear fits.
Jack

Very interesting results indeed, Guillermo.
I don't know about anyone else, but this turns out to be a very insightful experiment.
Some thoughts and suggestions: the (50, 50) result looks remarkably like the best results you'd expect from normal profiling where the card is either slightly unevenly lit or slightly bent. Under normal matrix conversion this has no effect on actual profile performance, because it simply reproduces the camera capture correctly.
However:
In this case I think something else is happening and this is very important to understand. It gets to the core of NN design.
What we might be seeing is a combination of overfitting and the inability of the NN to properly represent the Lab gamma curve.
1. Overfitting
If you look at the attached annotation on your L graph, you can see we have outliers (the arrows) rather than what looks like random deviation.
2. Gamma curve
In the same attachment in the circle you can see something that looks like ringing. I suspect that this is a result of an inability to properly represent Lab gamma. The tanh activation curve looks somewhat like the gamma curve, but isn't. (Nor is it a linear transition in case of XYZ).
Now, in my never humble opinion I would assess the results as follows:
(50, 50) allows too much variation in curves and fitting. There are several reasons you should NOT want to make the layers that large. One vitally important reason is that NN is supposed to encode patterns compactly that are either too large for us to comprehend or too hard for us to understand, or both. By applying large NN layers for what is essentially a really simple linear matrix conversion, we are not making the solution elegantly small and succinct.
So, in this case I would ask myself what would be necessary for the NN to better match the gamma curve (or the linear curve) which I suspect will better match the overall model without overfitting? Keeping it elegantly small?
My answer would be: add another hidden layer. The NN probably just needs another step for better matching the gamma curves. And, to keep it as small as possible, I would first try (4, 4, 4) and then if it confirms the suspicion, reduce to (4, 4, 3), (3, 4, 3), and maybe (3, 3, 3).
Another important and interesting approach would be culling: remove the connections that fall below a certain threshold, rinse, and repeat. Until you reach the optimal compactness in your NN design.
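The culling idea could be prototyped by simply thresholding a trained network's weight matrices (in scikit-learn terms these would be `net.coefs_`; the `prune` helper below is hypothetical, a sketch of one iteration of the prune-and-retrain loop):

```python
import numpy as np

def prune(weight_matrices, threshold):
    """Zero out connections whose magnitude falls below threshold.
    The full culling procedure would alternate prune -> retrain and
    stop once quality degrades."""
    return [np.where(np.abs(W) < threshold, 0.0, W) for W in weight_matrices]
```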
Please note, this is in no way criticism. I think you did a brilliant job implementing this and sharing the results. It is absolutely insightful.

Thanks for the new reference data Guillermo. For completeness the linear 3x3 matrix fit to the V3 reference spectro measurements results in the following:
dE76 mean = 3.569, dE76 median = 2.461, dE76 max = 21.709
The k value is now 0.961, meaning that we no longer have the earlier lightness red flag. SMI is 80.4, low by CC24 standards. The matrix from wbraw to D50 is
0.6756 0.2306 0.0514
0.2928 0.8662 0.1590
0.0270 0.1711 0.9726
The result of the Normal Equation is
0.6245 0.2767 0.0452
0.2213 0.8011 0.2165
0.0851 0.1089 0.9724
which yields a CCT of 4441K with a Duv of 0.03186.
The matrix from wbraw>sRGB is
1.6410 0.5781 0.0629
0.0998 1.4205 0.3207
0.0191 0.4207 1.4015
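For anyone wanting to reproduce these fits, the least-squares 3x3 matrix (the Normal Equation mentioned above) is one line of linear algebra. A sketch with hypothetical array names, using `lstsq` rather than forming (AᵀA)⁻¹Aᵀ explicitly for numerical stability:

```python
import numpy as np

def fit_matrix(rgb_wb, xyz_ref):
    """Least-squares 3x3 matrix mapping white-balanced RAW RGB rows to
    XYZ rows: solves min ||rgb_wb @ X - xyz_ref|| and returns M = X.T,
    so that a single colour converts as xyz = M @ rgb."""
    X, *_ = np.linalg.lstsq(rgb_wb, xyz_ref, rcond=None)
    return X.T

# usage: xyz_pred = rgb_wb @ fit_matrix(rgb_wb, xyz_ref).T
```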
Jack

Hi guys, I did a quick test to check how well the NN interpolates unseen colours. I just took the 24 gray patches and linearly interpolated 20 values between every pair of consecutive patches in their RGB_WB values. Then I predicted the final extended data set (i.e. original gray patches + their interpolations) and plotted everything (in this case I took the RAW_WB G values vs L):
- blue is the exact L value
- black is the NN prediction for seen patches (predictions are a bit lower than the exact L for the brightest patches, and a tiny bit higher for the darkest patches, something we already noticed in the comparison chart)
- red are the NN predictions for the interpolated patches
(http://guillermoluijk.com/misc/validation_gvsl.png)
If I did it right, this is very good news. I cannot actually say whether the L output follows the linearly interpolated patches linearly, but one can say there is no ringing or other undesired unstable behaviour between seen patches.
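The interpolation test described above can be sketched as follows (a hypothetical `densify` helper; `patches` would be the (N, 3) array of gray-patch RGB_WB values, and the densified set would then be fed to the trained network's `predict`):

```python
import numpy as np

def densify(patches, n=20):
    """Insert n linearly interpolated samples between each pair of
    consecutive patches (rows of an (N, 3) RGB_WB array)."""
    out = []
    for a, b in zip(patches[:-1], patches[1:]):
        # t runs from 0 up to (but excluding) 1 to avoid duplicate endpoints
        t = np.linspace(0.0, 1.0, n + 2)[:-1, None]
        out.append(a + t * (b - a))
    out.append(patches[-1:])          # close with the final patch itself
    return np.vstack(out)
```

With 24 patches and n = 20 this yields 23·21 + 1 = 484 samples to predict and plot.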
Tomorrow I will do the same with some more colorful patches, and will answer your comments.
Regards

I am curious as to how such a network would perform with a nonlinear output activation function, say tanh, since it seems to work well. I am asking because the neutrals are still not quite right, perhaps because L is not linear (identity)?
As far as I know, the preferred output activation function (this is the function used in the last layer, the one that provides the output values; L, a and b in our case) for numerical regression is the identity. In other words, it is the rest of the network which is in charge of modelling the nonlinearities, the output layer just performs a linear combination over its inputs. Nonlinear output activation functions (like sigmoid or tanh) are preferred for logistic regression problems (i.e. classification problems). But I can do a quick test using sigmoid and tanh in the output layer.
For completeness the linear 3x3 matrix fit to the V3 reference spectro measurements results in the following:
dE76 mean = 3.569, dE76 median = 2.461, dE76 max = 21.709
That is clearly better, but in the same order of magnitude as the linear NN (MLP_XYZ_()).
With a fairly simple NN (MLP_XYZ_(3, 3)_tanh_identity) the NN beats the optimum linear fit.
MLP_XYZ_()_tanh_identity : ΔE_max = 33.6562 , ΔE_mean = 3.2097 , ΔE_median = 1.8020
MLP_XYZ_(3, 3)_tanh_identity : ΔE_max = 15.3840 , ΔE_mean = 2.4645 , ΔE_median = 1.8418
MLP_XYZ_(4, 4)_tanh_identity : ΔE_max = 11.2728 , ΔE_mean = 2.0082 , ΔE_median = 1.5386
What we might be seeing is a combination of overfitting and the inability of the NN to properly represent the Lab gamma curve.
1. Overfitting
If you look at the attached annotation on your L graph, you can see we have outliers (the arrows) rather than what looks like random deviation.
2. Gamma curve
In the same attachment in the circle you can see something that looks like ringing. I suspect that this is a result of an inability to properly represent Lab gamma. The tanh activation curve looks somewhat like the gamma curve, but isn't. (Nor is it a linear transition in case of XYZ).
Now, in my never humble opinion I would assess the results as follows:
(50, 50) allows too much variation in curves and fitting. There are several reasons you should NOT want to make the layers that large. One vitally important reason is that NN is supposed to encode patterns compactly that are either too large for us to comprehend or too hard for us to understand, or both. By applying large NN layers for what is essentially a really simple linear matrix conversion, we are not making the solution elegantly small and succinct.
So, in this case I would ask myself what would be necessary for the NN to better match the gamma curve (or the linear curve) which I suspect will better match the overall model without overfitting? Keeping it elegantly small?
My answer would be: add another hidden layer. The NN probably just needs another step for better matching the gamma curves. And, to keep it as small as possible, I would first try (4, 4, 4) and then if it confirms the suspicion, reduce to (4, 4, 3), (3, 4, 3), and maybe (3, 3, 3).
Those are very interesting insights, I will give them a try. But with a deep NN like (200, 200, 200) there was no improvement at all, so surely the complexity of the NN was far beyond the complexity of the problem.
What we might be seeing here is just an inaccurate gamma-curve fit in the low end, and the undesired overfitting may be caused by 'noise': samples with less accuracy because of noise and undesired reflections in the IT8 capture. If we look at the somewhat gamma-like curve I plotted in my previous post (output L values vs input RAW_WB G), the curve doesn't converge smoothly to (G=0, L=0), and it should. Instead, low G values correspond to even lower-than-expected L values, so the NN seems to be clipping the shadows. This makes me think the RAW file could be contaminated in the dark shadows by some degree of reflection on the chart. A possible solution would be to drop the darkest patches from the training set (they are not respecting the sensor's linear response), and synthetically introduce (R=0, G=0, B=0) > (L=0, a=0, b=0) examples in the training set, because we really need L=0 in the absence of light, but not before that.
A similar kind of issue may be taking place in the highlights: the NN has not been trained with a (R=255, G=255, B=255) > (L=100, a=0, b=0) example, nor with partial saturations (some channel clipped while the others are fine). This may explain the undesired behaviour in partially clipped highlights (this is the NN RAW RGB_WB to Lab output, later to ProPhoto RGB):
(http://guillermoluijk.com/misc/bodegon_crop.jpg)
Just using input values exposed 0.5EV lower, the problem is not there:
(http://guillermoluijk.com/misc/bodegon_crop2.jpg)
Anyway, I think this is a harder problem to fix than the low-end one, and solving all possible cases of clipped highlights is out of the scope of this exercise. In fact, RAW developers need to implement complex highlight-recovery strategies to deal with this problem.
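The training-set surgery proposed above (dropping the suspect darkest patches and adding a synthetic black anchor) could look like the sketch below; the helper name, the `n_dark_drop` parameter, and judging "darkest" by the G channel are all assumptions:

```python
import numpy as np

def add_anchor_samples(X, Y, n_dark_drop=2):
    """Drop the n darkest training patches (suspected flare/reflection
    contamination) and append a synthetic black anchor:
    RGB (0,0,0) -> Lab (0,0,0). Rows of X are white-balanced RAW RGB;
    darkness is judged here by the G channel."""
    order = np.argsort(X[:, 1])
    keep = order[n_dark_drop:]
    X2 = np.vstack([X[keep], [[0.0, 0.0, 0.0]]])
    Y2 = np.vstack([Y[keep], [[0.0, 0.0, 0.0]]])
    return X2, Y2
```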

Before doing more simulations or picking some patch pairs to predict the interpolated colours between them, I did a brute force exercise feeding the NN with all possible RGB 8bit combinations in a synthetic image by Bruce Lindbloom, which shows smooth gradients:
(http://guillermoluijk.com/misc/lindbloom_original.jpg)
After being transformed by the NN, we get smooth gradients again in the output, which makes me think the NN is not oscillating because of overfitting when predicting in-between colours:
(http://guillermoluijk.com/misc/lindbloom_predicted.jpg)
Maybe I'm oversimplifying my conclusions here, but if the NN were generating unstable outputs for unseen colours, I think we should see that behaviour here. Do you agree?
Regards

Regarding the white and black point definition, reading here (https://ninedegreesbelow.com/photography/wellbehavedcameraprofile.html) it seems common when creating profiles to synthetically introduce black and white points:
Add perfect synthetic D50 white and black color patches to the ti3 file
We don't want colprof to use any of the target chart color patches to set the media white and black points. Rather we want colprof to use D50 white as the media white point, and solid black as the media black point. When using a scale from 0 to 100, D50 white has the XYZ values (96.4200, 100.000, 82.4910), and solid black has the XYZ values (0.0, 0.0, 0.0). So we'll add two lines to the ti3 file:
00W 96.4200 100.000 82.4910 100.000 100.000 100.000 0.000000 0.000000 0.000000
00B 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.000000 0.000000 0.000000
I am a bit confused about which XYZ values correspond to the exact white point using the D50 illuminant:
D50 0.96422 1.00000 0.82521
or
ICC 0.9642 1.0000 0.8249
With my conversion formula taken from Bruce Lindbloom:
Lab=(100, 0, 0) > XYZ=(0.96422, 1, 0.82521)
But on many sites I read the D50 illuminant is: 0.9642, 1.0000, 0.8249
They are very close, but which one is the genuine XYZ (D50) reference white?
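For what it's worth, the Lab > XYZ conversion (Bruce Lindbloom's equations) maps Lab = (100, 0, 0) exactly onto whatever reference white is plugged in, so the question is purely about which white to use. A sketch, defaulting to the (0.96422, 1.0, 0.82521) candidate:

```python
def lab_to_xyz(L, a, b, wp=(0.96422, 1.0, 0.82521)):
    """CIE Lab -> XYZ using Bruce Lindbloom's formulation.
    The default white point is one of the two candidates discussed here;
    swap in (0.9642, 1.0, 0.8249) to see the (tiny) difference."""
    eps, kappa = 216 / 24389, 24389 / 27
    fy = (L + 16) / 116
    fx, fz = fy + a / 500, fy - b / 200
    def finv(f):
        return f**3 if f**3 > eps else (116 * f - 16) / kappa
    return tuple(w * c for w, c in zip(wp, (finv(fx), finv(fy), finv(fz))))
```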
Regards

The ICC white is the correct version, because all other data will be using it, i.e. the reference data will have been measured and stored with the ICC version.

I'm not really sure what the ti3 file does, but for your purposes here you can safely ignore the pure black and "pure white" references. First of all, your experiment is not about those details, and secondly it won't make a difference in training, considering the relative contribution of one sample in the entire training set.
To mitigate the clipping effect, simply apply channel clipping after conversion.
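That post-conversion channel clipping is one line with NumPy; a sketch, assuming the converted output is scaled to [0, 1]:

```python
import numpy as np

def clip_channels(img, lo=0.0, hi=1.0):
    """Clamp converted output channels into range after the NN/matrix
    conversion, rather than teaching the network about clipped inputs."""
    return np.clip(img, lo, hi)
```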
I have been fooling around with TensorFlow and Colab to create a sheet for a simple linear-to-perceptual test case. We might be able to see what minimum complexity is required for a Y-to-L match. Will post later today.

As far as I know, the preferred output activation function (this is the function used in the last layer, the one that provides the output values; L, a and b in our case) for numerical regression is the identity. In other words, it is the rest of the network which is in charge of modelling the nonlinearities, the output layer just performs a linear combination over its inputs. Nonlinear output activation functions (like sigmoid or tanh) are preferred for logistic regression problems (i.e. classification problems). But I can do a quick test using sigmoid and tanh in the output layer.
The output layer is usually meant to move and scale the result back to desired output range. Introducing nonlinearities limited to unity is mostly not helpful. (In classification problems it can act as an additional filter slope, in which case it is helpful.)

Maybe I'm oversimplifying my conclusions here, but if the NN would be generating unstable outputs for unseen colours, I think we should see that behaviour here, do you agree?.
Yes, agree. But we have to define what undesirable output means. The ringing in our case here is not excessive. What it might generate is small bands of slightly off colour steps in a gradient. So you could perhaps try a Granger rainbow to see what it does in that case.
You may be right that what appears as ringing occurs as a result of overfitting noise or unstable dark patches, at least for L. For the experiment it is interesting to note: IF we don't know the actual model, how do we assess our results? It is precisely because we try to model a smooth curve that we know the fluttering is undesirable. But, if we want to additionally model nonlinearities that may occur in camera capture (nonlinearities that would ordinarily not be covered by the normal matrix conversions), how do we know our measure of smoothness?

They are very close but which one is the geunine XYZ (D50) reference white?
As always with color the answer is less than obvious. The White Point in XYZ is given by the Spectral Power Distribution of the illuminant times the XYZ Color Matching Functions. So one has a few choices:
1) What range of wavelengths should this be limited to? (normally 380:780nm)
2) How frequent is the sampling of the SPD/CMF? (normally 1 or 5nm, but 10 is also used)
3) What XYZ CMF should be used (1931 2 deg, vs more recent/accurate (http://www.cvrl.org/ciexyzpr.htm), vs ...)?
4) What SPD should be used? (normally the standard (http://www.brucelindbloom.com/index.html?Eqn_DIlluminant.html), related to the chromaticity of the WP, which depends on the CMF. It is sampled every 10nm)
Normally I use the 1931 2 deg CMFs for consistency with other published data (e.g. xy coordinates), interestingly they also provide slightly better fits than the CIE2006 version. Therefore the biggest variation comes from 1), the range of wavelengths used. Since the spectrometer I use provides data in the 400:730nm range, oversampled every 3.33333nm, it only makes sense to me to calculate all values (including recalculating Lab references) in that range only for best results, interpolating all curves linearly down to 1nm. In this case
XYZ_D50 = [0.9638 1.0000 0.8229]
But wait, now xy WP has changed, resulting in a slightly different D50 SPD... So as you can see there is a certain amount of wiggle room and perhaps it is not worthwhile to worry too much about the last couple decimal places.
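The computation described above is just a weighted sum over wavelength. A schematic version, assuming uniformly sampled SPD and CMF tables already loaded as arrays (the synthetic curves in the test are placeholders, not real CIE data):

```python
import numpy as np

def white_point(wl, spd, cmfs):
    """XYZ white point: integrate the illuminant SPD against the
    x̄, ȳ, z̄ colour matching functions (columns of cmfs) over the chosen
    wavelength range, then normalise so that Y = 1. As noted above, the
    result depends on the range, the sampling, and the CMF chosen."""
    dlam = wl[1] - wl[0]                       # assumes uniform sampling
    XYZ = (spd[:, None] * cmfs).sum(axis=0) * dlam
    return XYZ / XYZ[1]
```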
Jack

The output layer is usually meant to move and scale the result back to desired output range. Introducing nonlinearities limited to unity is mostly not helpful. (In classification problems it can act as an additional filter slope, in which case it is helpful.)
You obviously know NNs, Oscar, while I am merely an interested observer. My suggestion of using a nonlinear output layer stems from having seen that the visibly worst offenders seem to be in the brighter neutral tones. Perhaps this is due to the fact that the network has learned to deal with the much more numerous changes further down the curve, but not up there. Since the identity activation function in the output layer is a fixed weighted sum of learned features, perhaps it does not have enough oomph to make it up there (and possibly not enough unoomph to make it down to the deepest shadows, Guillermo), hence the suggestion of a nonlinear activation function to help it along. Does this make sense?
It would be interesting to see whether some of the other networks that did not do so well by the generic metrics actually generalize better with extreme tones.
Jack

But, if we want to additionally model nonlinearities that may occur in camera capture (nonlinearities that would ordinarily not be covered by the normal matrix conversions), how do we know our measure of smoothness?
Well, thinking aloud, current imaging systems can pretty well be considered to be linear. Any resulting nonlinearities are due to the fact that the camera's SSFs typically are not a simple linear transformation away from the standard observer's eye SSFs (or CMFs in XYZ if one wants to stick with CIE conventions). So our measure of smoothness is by definition smoothness as perceived by the standard observer, which brings us back to known metrics of just noticeable color differences (like MacAdam Ellipses or similar (https://www.strollswithmydog.com/justnoticeabledifferencecolor/)).
Jack

As always with color the answer is less than obvious. The White Point in XYZ is given by the Spectral Power Distribution of the illuminant times the XYZ Color Matching Functions. So one has a few choices:
1) What range of wavelengths should this be limited to? (normally 380:780nm)
2) How frequent is the sampling of the SPD/CMF? (normally 1 or 5nm, but 10 is also used)
3) What XYZ CMF should be used (1931 2 deg, vs more recent/accurate (http://www.cvrl.org/ciexyzpr.htm), vs ...)?
4) What SPD should be used? (normally the standard (http://www.brucelindbloom.com/index.html?Eqn_DIlluminant.html), related to the chromaticity of the WP, which depends on the CMF. It is sampled every 10nm)
Normally I use the 1931 2 deg CMFs for consistency with other published data (e.g. xy coordinates), interestingly they also provide slightly better fits than the CIE2006 version. Therefore the biggest variation comes from 1), the range of wavelengths used. Since the spectrometer I use provides data in the 400:730nm range, oversampled every 3.33333nm, it only makes sense to me to calculate all values (including recalculating Lab references) in that range only for best results, interpolating all curves linearly down to 1nm. In this case
XYZ_D50 = [0.9638 1.0000 0.8229]
But wait, now xy WP has changed, resulting in a slightly different D50 SPD... So as you can see there is a certain amount of wiggle room and perhaps it is not worthwhile to worry too much about the last couple decimal places.
Jack
That's not the issue. The spectral distribution of D50 is very precisely defined by the CIE. The difference is a result of emissive vs reflective conversion.
However, with nr. 3 you are touching on an interesting point: I always thought that white point issues are very obviously not a 2-degree observer problem, so you could say that perhaps the 10-degree matching functions are better suited, but then we neither have the corresponding tricolour models, nor do we know whether the results can be mixed indiscriminately with 2-degree observer logic.
Isn't colour management a wonderful swamp of uncertainties? It's a small wonder that it works at all...

Well, thinking aloud, current imaging systems can pretty well be considered to be linear. Any resulting nonlinearities are due to the fact that the camera's SSFs typically are not a simple linear transformation away from the standard observer's eye SSFs (or CMFs in XYZ if one wants to stick with CIE conventions). So our measure of smoothness is by definition smoothness as perceived by the standard observer, which brings us back to known metrics of just noticeable color differences (like MacAdam Ellipses or similar (https://www.strollswithmydog.com/justnoticeabledifferencecolor/)).
Jack
Yes, but I mean: how do we know we have achieved the proper smoothness relative to the noise of observations and capture? It's easy to define linearity and gamma, and then provide a stable match. We know the desired smoothness (mathematically). Once achieved, it is transferable to other cases. However, with NNs, once we exceed a certain complexity, this is no longer guaranteed.
It's a bit like interpolation with polynomials. We know cubic works really well and is very stable. Higher order polynomials are not.
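The polynomial analogy is easy to demonstrate numerically (Runge's phenomenon, a standard illustration, not anything from this thread): take equispaced samples of a smooth curve and compare a cubic least-squares fit with a degree-10 polynomial through the same points.

```python
import numpy as np

# Runge's phenomenon: a degree-10 polynomial through 11 equispaced
# samples of a smooth curve oscillates wildly near the edges, while a
# humble cubic least-squares fit stays tame -- the analogue of an
# oversized NN vs a compact one.
x = np.linspace(-1, 1, 11)
f = lambda t: 1.0 / (1.0 + 25.0 * t**2)
xf = np.linspace(-1, 1, 201)
err3 = np.abs(np.polyval(np.polyfit(x, f(x), 3), xf) - f(xf)).max()
err10 = np.abs(np.polyval(np.polyfit(x, f(x), 10), xf) - f(xf)).max()
```

The degree-10 error between the sample points dwarfs the cubic's, even though the degree-10 polynomial passes exactly through every sample, much like an overfitted NN nails every patch.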

Before doing more simulations or picking some patch pairs to predict the interpolated colours between them, I did a brute force exercise feeding the NN with all possible RGB 8bit combinations in a synthetic image by Bruce Lindbloom, which shows smooth gradients:
After being transformed by the NN, we get smooth gradients again in the output, which makes me think the NN is not oscillating because of overfitting when predicting in-between colours:
Maybe I'm oversimplifying my conclusions here, but if the NN were generating unstable outputs for unseen colours, I think we should see that behaviour here. Do you agree?
I may be oversimplifying, but since the perceivable issues seem to be with the deepest shadows and brightest highlights, isn't one of the symptoms of overfitting the fact that it tends to busily (though perhaps not perceptibly) stick fairly close to the curve in the middle of the range but goes wild towards the extremes, especially beyond the range of the training set?
(https://upload.wikimedia.org/wikipedia/commons/6/68/Overfitted_Data.png)

I may be oversimplifying, but since the perceivable issues seem to be with the deepest shadows and brightest highlights, isn't one of the symptoms of overfitting the fact that it tends to busily (though perhaps not perceptibly) stick fairly close to the curve in the middle of the range but goes wild towards the extremes, especially beyond the range of the training set?
(https://upload.wikimedia.org/wikipedia/commons/6/68/Overfitted_Data.png)
Yes, and, no. NNs are not particularly suited for extrapolation. So the edges may exhibit more problematic behavior. However, what happens in case of overfitting is that the NN can exactly reproduce all patches in the chart, because it's complex enough and because we are feeding it all patches. However, the chart is both noisy and possibly inconsistently lit. So, we are actually reproducing those problems.
The captured chart should be checked for lightness against the measured Y values. Plot G vs Y in perceptual space and you might see whether the captured data already exhibits any lightness issues.

Okay, seems to work.
I created a small test sheet for testing the interpolation capabilities of a NN. For a reasonably programming-savvy person this should be relatively easy to follow. (If you can do MATLAB, you can do this.)
1. Open a google colab sheet: https://colab.research.google.com
2. Choose File > upload notebook (remove the .txt extension from the attached file)
3. On the right in the toolbar choose Connect
4. Runtime > run all
Play around with the last two code blocks for different NN configurations. When you change something in the configuration you only have to rerun the last two blocks.
Note that we are trying to interpolate samples from a curve with a limited NN complexity. There is no issue with training size vs validation size etc. because under and overfitting, or ringing or any other effect is exactly what we are trying to see.
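Roughly the same linear-to-perceptual test case can be sketched offline with scikit-learn instead of the Keras/Colab sheet (this is a hypothetical re-implementation, not the attached notebook): fit a tiny tanh MLP from linear Y to L* and look at the residual.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def L_star(Y):
    """CIE lightness L* as a function of linear luminance Y in [0, 1]."""
    eps, kappa = 216 / 24389, 24389 / 27
    Y = np.asarray(Y, dtype=float)
    return np.where(Y > eps, 116 * np.cbrt(Y) - 16, kappa * Y)

Y = np.linspace(0, 1, 200)[:, None]
net = MLPRegressor(hidden_layer_sizes=(3,), activation='tanh',
                   solver='lbfgs', max_iter=5000, random_state=0)
net.fit(Y, L_star(Y).ravel())
rmse = np.sqrt(np.mean((net.predict(Y) - L_star(Y).ravel()) ** 2))
```

Varying `hidden_layer_sizes` here, e.g. `(1,)` vs `(3,)` vs `(1, 1, 1)`, is the single-curve analogue of the layer/node questions being discussed.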

Okay, so that turns out interesting.
Apparently a single layer with several nodes is better able to approximate the perceptual curve than several layers with a single node.
In the picture below, the leftmost graph is a single layer with a single node, tanh activation function.
Incidentally, use the Nadam optimizer for better results:
model.compile(optimizer='Nadam', loss='mean_squared_error')

OK, I have performed the 'rainbow test' to look for overfitting indications. I picked the following 8 patches (plus the dark patch for begin/end reference):
(http://guillermoluijk.com/misc/rainbow_path.jpg)
Taking the RAW RGB_WB values on them, I interpolated linearly the transitions between those patches (100 interpolated samples). Then I predicted the ProPhoto RGB output values for the whole range:
This is how the Lab models evolve with increasing complexity (blue cross = exact value, red ball = prediction over the patch, red line = gradient prediction):
(http://guillermoluijk.com/misc/lab_models.gif)
This is how the XYZ models evolve with increasing complexity (blue cross = exact value, red ball = prediction over the patch, red line = gradient prediction):
(http://guillermoluijk.com/misc/xyz_models.gif)
To me it is clear now that, even if the Lab models performed better on DeltaE for the seen patches when using high-complexity NN's (50, 50), it was simply because the loss function was much closer to the output space where DeltaE is measured. And there was clear overfitting (low error on seen patches, but at the cost of oscillations for unseen colours).
So if I were to decide, I'd choose a much lower-complexity NN and use the XYZ model. To summarise:
- Linear optimum fit (Jack's): ΔE_max = 21.709, ΔE_mean = 3.569, ΔE_median = 2.461
- MLP_XYZ_(4, 4)_tanh_identity: ΔE_max = 11.2728, ΔE_mean = 2.0082, ΔE_median = 1.5386 > APPARENTLY NO OVERFITTING
- MLP_Lab_(50, 50)_tanh_identity: ΔE_max = 3.9451, ΔE_mean = 0.6966, ΔE_median = 0.5313 > BUT WITH OVERFITTING
So the MLP_XYZ_(4, 4)_tanh_identity would be a good candidate to improve the linear fit keeping compactness.
This is how MLP_XYZ_(4, 4)_tanh_identity compares to MLP_Lab_(50, 50)_tanh_identity:
(http://guillermoluijk.com/misc/overfitting.png)
The Lab (50,50) model gets closer to the exact values (blue crosses), but at the cost of oscillations (=variance) for unseen colours.
And this is how differences are rendered in the ProPhotoRGB rainbow (the white lines are the precise patch median RAW values):
(http://guillermoluijk.com/misc/rainbow.jpg)
The ringing on the overfitted model is not noticeable, but we know it's there.
Regards

Very revealing. You aced this!!!

And here the same animations, but for the output values represented in Lab.
This is how the Lab models evolve with increasing complexity (blue cross = exact value, red ball = prediction over the patch, red line = gradient prediction):
(http://guillermoluijk.com/misc/lab_models_laboutput.gif)
This is how the XYZ models evolve with increasing complexity (blue cross = exact value, red ball = prediction over the patch, red line = gradient prediction):
(http://guillermoluijk.com/misc/xyz_models_laboutput.gif)
For me it's clear that going beyond (16, 16) NN's for the Lab models and beyond (4, 4) for the XYZ models doesn't add anything, so it's just invoking "overfitting, come to me". This makes sense looking at the DeltaE evolution we already saw, which was actually very insightful in finding out where to stop regarding NN complexity:
(http://guillermoluijk.com/misc/deltae_vs_complexity.png)
Regards

For me it's clear that going beyond (16, 16) NN's for the Lab models and beyond (4, 4) for the XYZ models doesn't add anything, so it's just invoking "overfitting, come to me". This makes sense looking at the DeltaE evolution we already saw, which was actually very insightful in finding out where to stop regarding NN complexity:
Regards
It's also interesting to mention perhaps that it takes at least 3 nodes to produce a perceptual curve, while it takes only 1 node to properly represent a linear one, despite the tanh activation curve. Which probably explains why XYZ performs so much better as far as complexity is concerned.
This entire exercise raises one additional question that might be useful as amendment: how does a one layer config perform?
And, for some future exercise, it would be interesting to know what happens if you could define a custom activation function.

Excellent job Guillermo, starting to make sense now.

Doing some infographics to decorate the story:
(http://guillermoluijk.com/misc/it83d.gif)
Regards

Nice!

Hugo Rodríguez (the person who shared with me the capture and measurements over his IT8) is launching tomorrow a new calibration card intended for higher-precision colour rendering. It has nearly 1000 patches:
(http://guillermoluijk.com/misc/superchroma.jpg)
I'll suggest that he share a RAW file and a set of measurements over it, as well as over the IT8 with the same camera and shooting conditions. That way the IT8 would serve as a validation set to prevent overfitting and to measure the delta E's more correctly.
Find a description here (Spanish):
https://www.hugorodriguez.com/blog/lahr1superchroma/
Regards

Hugo Rodríguez (the person who shared with me the capture and measurements over his IT8) is launching tomorrow a new calibration card intended for higher-precision colour rendering. It has nearly 1000 patches:
printed with inks consisting of maybe ~8 pigments at best, so what is the point of that many patches, other than marketing to make money? and it looks at least semi-gloss, which is hell for all but a few to shoot properly... just asking, what is the point?

Some truth to that. Looks like it comes with a black gloss reference card to help hunt down reflections and flare. It mentions a USB key with files and software. I wonder what the capabilities of the software are and if it also comes with 'factory' spectral readings.

printed with inks consisting of maybe ~8 pigments at best, so what is the point of that many patches, other than marketing to make money? and it looks at least semi-gloss, which is hell for all but a few to shoot properly... just asking, what is the point?
Quoting Hugo: "The HR1 is printed in short batches on a latest-generation plotter with 12 pigments, on professional glossy paper [Epson UltraChrome] suited for certified proofs, of maximum quality and low in optical brighteners. This allows for the widest available gamut [wider than Adobe RGB, reaching ProPhoto RGB in some patches]. Not only that: also the longest lifespan, which is approximately 60 years (vs 22 for chemical prints, according to the Wilhelm Imaging Institute)".
I can confirm shooting these cards is a pain if you want to avoid any kind of reflection. I failed to shoot a glossy IT8 and had to give up (I even tried surrounding the camera with a black sheet to keep myself from reflecting any light). I'd say you need studio equipment and skills to do a proper shot.
Listening to another colour expert, he doesn't agree with glossy, saturated colour cards because of the difficulty of getting a proper shot, so your point makes sense. However, I'm mainly interested in the NN training and validation exercise. Camera colour calibration is in fact irrelevant to me, but I see a good opportunity here to do a more complete and rigorous test. With 1000 patches we can safely go to a more complex (3.5x) and hence more precise NN, and use the IT8 set for validation.
Regards

Hi all!
My name is Hugo Rodriguez and I'm the guy who helped Guillermo with IT8WG captures and launched the new colour chart just today.
I registered here years ago, but my English was very poor then and I barely posted.
I'll explain:
my new chart is aimed at professional use. It's not easy to shoot, due to its glossy finish. That's because it focuses on quality and precision, not on ease of use.
This means it's not for shooting outdoors with the chart in one hand and the camera in the other: it needs careful reflection control, and this is usually best done in a studio.
Anyway, it's not that hard to shoot; you just have to do some very particular things that don't require highly difficult procedures, and the reflections will all be eliminated.
The chart comes also with lots of helpful patches around and inside the chart for controlling those reflections.
It's aimed at achieving the best precision out there, for highly demanding purposes: art reproduction, commercials, e-commerce, high-quality LUTs...
I'll be publishing an English version of my webpage next week so you can all read it. I'm also in talks with Kevin Raber about a future article on camera profiling here.
The chart is printed with an Epson 7900. Having 12 inks doesn't really make a big difference to the overall gamut compared with 8 or 10. In fact, I'd say that 95% of the gamut is achieved with the main 4. The rest just improve the smoothness of the highlights or some particular colors.
It's glossy; this is the only way to achieve highly saturated colors and a wide dynamic range.
It has a glossy black card on the back to help remove any reflection. The procedure and some recommendations are carefully explained in the manual, step by step (in English, about 35 pages long).
It comes with a dual reference file: one without the perimeter patches for lighting control and the central cross, and the other with all of them. Both will be available as XYZ, Lab, LCH and spectral data, measured by averaging 5 charts.
Optionally it can come with custom-read spectral data.
It also comes with free profiling software for both Win & Mac (CoCa for PC, RoughProfiler for Mac). I'm in talks with Graeme and the authors of these tools to get the SuperChroma added to their software.
I hope I clarified the questions :)
Best regards from Barcelona
Hugo

Seems unnecessary emphasis on gamut. Just offer matte versions as well. The slight decrease in gamut is offset by the infinite increase in capture consistency & stability, and thereby accuracy.
The majority of this is used in linear matrix conversions. There is not much to be gained by providing wider gamut patches. The patches merely need to be reasonably distributed over a reasonable gamut where the capture device operates distinctly linear (i.e. not deep blacks for example).
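The "linear matrix conversions" mentioned here can be sketched as a least-squares fit of a 3x3 camera-RGB-to-XYZ matrix over the patch values. The numbers below are synthetic stand-ins, not measurements from any real chart:

```python
# Sketch of linear-matrix profiling: fit a 3x3 matrix mapping camera RGB
# to XYZ by least squares over the patches (synthetic data for illustration).
import numpy as np

rng = np.random.default_rng(2)
cam_rgb = rng.random((24, 3))              # stand-in patch RGB values
true_M = np.array([[0.6, 0.3, 0.1],        # hypothetical "ground-truth" matrix
                   [0.2, 0.7, 0.1],
                   [0.0, 0.1, 0.9]])
xyz = cam_rgb @ true_M.T                   # stand-in measured XYZ values

M, *_ = np.linalg.lstsq(cam_rgb, xyz, rcond=None)
M = M.T                                    # rows now map RGB -> X, Y, Z
print(np.round(M, 3))                      # recovers true_M on clean data
```

Since only 9 coefficients are being fitted, adding hundreds of extra patches refines the fit only marginally, which is the point being made above.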
Also, there may be merit in randomizing the chart somewhat.

with 12 pigments
those will be inks, not pigments ;D ... now deduct all those "blacks"

The chart is printed with an Epson 7900. Having 12 inks doesn't really make a big difference to the overall gamut compared with 8 or 10. In fact, I'd say that 95% of the gamut is achieved with the main 4. The rest just improve the smoothness of the highlights or some particular colors.
and pigment-wise you are not better (probably worse) than X-Rite charts
It's glossy; this is the only way to achieve highly saturated colors and a wide dynamic range.
a wide dynamic range you can achieve by using different exposures and creating synthetic target data with RawDigger, if you are so inclined ...
It has a glossy black card on the back to help remove any reflection. The procedure and some recommendations are carefully explained in the manual, step by step (in English, about 35 pages long).
is there a link (URL) to the manual ?

Some truth to that. Looks like it comes with a black gloss reference card to help hunt down reflections and flare. It mentions a USB key with files and software. I wonder what the capabilities of the software are and if it also comes with 'factory' spectral readings.
it also needs to come with a matte grey card for flat fielding ... as for software, the author says it "comes also with free software for both Win & Mac for profiling (CoCa for PC, RoughProfiler for Mac)"... I'd take dcamprof over those ...

and once you have 45678 pigments, that's it... all that wide DR & making it glossy to achieve saturation is snake oil... you have only so many different spectral reflectances ... now if you are so inclined to sell snake oil, make it 2 charts, matte and glossy; RawDigger perfectly allows combining synthetic data from several shots, and you don't need all patches glossy ;)

I believe a wider gamut gives more precise profiles, and my testing has proven it to me.
Matte printing does indeed help avoid problems with reflections, but it significantly worsens the dynamic range and the dark tones, which are as important as the rest.
I've done hundreds of profiles from many different charts and the very best precision came from this one. I'm preparing an in-depth article explaining all that.
Seems unnecessary emphasis on gamut. Just offer matte versions as well. The slight decrease in gamut is offset by the infinite increase in capture consistency & stability, and thereby accuracy.
The majority of this is used in linear matrix conversions. There is not much to be gained by providing wider gamut patches. The patches merely need to be reasonably distributed over a reasonable gamut where the capture device operates distinctly linear (i.e. not deep blacks for example).
Also, there may be merit in randomizing the chart somewhat.

and pigment-wise you are not better (probably worse) than X-Rite charts
I am not 'better' than X-Rite, and X-Rite is not better than me. 'Better' can describe a product, not a person or a company. Well, in my case it's just me: a one-person job. :)
a wide dynamic range you can achieve by using different exposures and creating synthetic target data with RawDigger, if you are so inclined ...
I don't want my customers to have to mess with RawDigger. Yes, it can help in building synthetic targets by blending raw exposures (though I've never tried that with that software), but the target customer for my chart isn't a scientist, but a professional photographer.
is there a link (URL) to the manual ?
Not yet. It will probably only be released in printed form.

it also needs to come with a matte grey card for flat fielding ... as for software, the author says it "comes also with free software for both Win & Mac for profiling (CoCa for PC, RoughProfiler for Mac)"... I'd take dcamprof over those ...
Regarding flat fielding, I recommend using C1's LCC system.
Dcamprof is command-line software, and I've never seen a photographer in my entire life use a command-line tool.
BTW: their site is down, isn't it?
https://www.ludd.ltu.se/~torger/dcamprof.html
Bear in mind that my chart is aimed at photographers, who will be able to build an excellent profile by themselves or with my help. They are not scientists or programmers, so they stick to Capture One or Adobe software all the time.
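For readers wondering what flat fielding actually does (whether via a grey card or C1's LCC): conceptually it is a per-pixel division by a normalized shot of a uniform grey target, which cancels vignetting and uneven illumination before reading patch values. A toy numeric sketch with made-up values:

```python
# Toy sketch of flat-field correction: divide the chart shot by the
# normalized grey-card frame to cancel lens/illumination falloff.
import numpy as np

chart = np.full((4, 4), 0.5)                   # ideal uniform patch values
falloff = np.linspace(1.0, 0.6, 4)[None, :]    # fake horizontal falloff
shot = chart * falloff                         # what the camera records
flat = 1.0 * falloff                           # grey-card shot, same falloff

corrected = shot / (flat / flat.max())         # flat-field division
print(np.round(corrected, 3))                  # uniform 0.5 everywhere again
```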

and once you have 45678 pigments that's it...
Same thing I said and think. The main gamut is achieved with the main 4 inks; the rest are for improving details here and there.
all that wide DR & making it glossy to achieve saturation is snake oil... you have only so many different spectral reflectances ...
I find your statement bold and a bit aggressive towards me. Given that you have never seen one of these charts, tested it, or even had access to the reference file, how can you state that?
So many different spectral reflectances? If there are 4 main inks, there are just 4 spectral reflectances. Then there are many mixes.
How many different spectral reflectances does a CC Classic have? 24? How many a CC SG? (answer: 140)
But anyway, I didn't know that many different spectral reflectances pose a problem for photographs that will always be taken under high-quality sources with high CRI: Profoto, Broncolor...
I really don't get the point. But I detect some pushback, and also a poor understanding of what my chart is aimed at, although I've explained it clearly: art reproduction, e-commerce...
now if you are so inclined to sell snake oil, make it 2 charts, matte and glossy; RawDigger perfectly allows combining synthetic data from several shots, and you don't need all patches glossy ;)
I don't sell snake oil, and I won't push my customers to buy and use software that will make their work much more complicated, thus making them reject my chart. Is that so hard to understand?
I repeat: it's aimed at photographers, not scientists or tech geeks.
Regards,
Hugo

Why is the X-Rite/Gretag 24-patch chart good? Because all patches are targeted to have interesting colors AND spectral data close to objects of the same color in real life (skin, greenery, etc.) to avoid metamerism.

Why is the X-Rite/Gretag 24-patch chart good? Because all patches are targeted to have interesting colors AND spectral data close to objects of the same color in real life (skin, greenery, etc.) to avoid metamerism.
The CC24 is a good chart indeed, but especially for checking results. Don't forget it was designed 43 years ago (!) for visual checks with chemical photography.