OK, still puzzled though, and trying to understand the chain of custody of color as it makes its way from the original scene to output device (monitor or print), and its implications for color rendering. Let us ignore the output side of the problem, and concentrate on capture. For that aspect of the problem, one has the following:
The original scene has some light source with some spectral power distribution (SPD), and contains objects with pigments of given spectral reflectivity. Light bounces around, passes through the lens, and arrives at the sensor with some spectral power distribution -- the intensities of the various constituent frequency components of the light. Let us for present purposes set aside issues of demosaicing, e.g. assume a scene without fine detail at the pixel level that would distinguish different interpolation algorithms.
The color filters in the CFA have a transmissivity that depends on frequency. Underneath each filter, the sensel responds to the transmitted light, and so integrates the incident spectral power distribution weighted by the filter's spectral transmissivity and the spectral response of the photodetector. The integration averages over the spectrum, and spectral information is therefore lost.
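To make that integration concrete, here is a minimal numerical sketch. Every curve in it (illuminant, reflectance, CFA transmissivities, photodetector response) is a made-up Gaussian stand-in rather than measured data, so the numbers only illustrate the structure of the calculation:

```python
import numpy as np

# Wavelength grid in nm; 5 nm spacing over the visible range.
wl = np.arange(400.0, 701.0, 5.0)
dw = wl[1] - wl[0]

def gauss(center, width):
    """Gaussian bump used as a stand-in for a measured spectral curve."""
    return np.exp(-0.5 * ((wl - center) / width) ** 2)

illuminant  = np.ones_like(wl)             # hypothetical equal-energy light source
reflectance = 0.2 + 0.6 * gauss(620, 40)   # hypothetical reddish pigment
stimulus    = illuminant * reflectance     # SPD of the light reaching the sensor

qe  = 0.5 * np.ones_like(wl)               # hypothetical flat photodetector response
cfa = {"R": gauss(600, 35),                # hypothetical CFA transmissivities
       "G": gauss(540, 35),
       "B": gauss(460, 35)}

# Each sensel reduces the whole stimulus spectrum to one number: the sum over
# wavelength of stimulus x filter transmissivity x photodetector response.
camera_rgb = {ch: float(np.sum(stimulus * t * qe) * dw) for ch, t in cfa.items()}
print(camera_rgb)   # the camera color data R_c, G_c, B_c for this pixel
```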
1. Because I am assuming that one can set aside interpolation issues, one has color data for each pixel in the form of raw values from the three different types of filtered pixel in the CFA. Let us call this camera color data, the three numbers R_c, G_c, B_c. As some have said in this thread, the camera defines its own notion of color space (or, if that is too charged a terminology, call it "color data") through the sensor response to various SPD's. We now wish to use that color data to reconstruct "color" according to some other measure(s).
2. Human vision has its own set of spectral response functions (SRF's) of the cones in the eye, measured quantitatively (if imperfectly) by researchers of decades past. An observer looking through the camera viewfinder sees colors that are characterized by three numbers, again the SPD integrated against these SRF's of human vision. Again information is lost, since the true SPD of each component of the scene is a function of wavelength; a generic function of one variable cannot be fully characterized by the three numbers, call them R_e, G_e, B_e, that one gets by integrating the SPD against the SRF's of the three kinds of receptor. Nevertheless our brain takes this information and somehow interprets it as color. Let us call this eye color data (ECD).
3. Yet another characterization of the scene is given by the tristimulus values XYZ, which are the SPD integrated against the CIE's color-matching functions (distinct from the cone response functions of human vision, but meant to model them). Let us call this CIE color data, the values XYZ. (A small sketch of both of these reductions appears just below this list.)
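The eye and the CIE standard observer perform the same kind of reduction, each with its own response curves. Purely as illustration, and again with Gaussian stand-ins in place of the tabulated cone fundamentals (e.g. Stockman-Sharpe) and the tabulated CIE 1931 color-matching functions, the two sets of three numbers for the same stimulus might be computed like this:

```python
import numpy as np

wl = np.arange(400.0, 701.0, 5.0)
dw = wl[1] - wl[0]
gauss = lambda c, w: np.exp(-0.5 * ((wl - c) / w) ** 2)

stimulus = 0.2 + 0.6 * gauss(620, 40)      # same hypothetical stimulus SPD as above

# Stand-ins for the L, M, S cone response functions of human vision.
cones = np.vstack([gauss(565, 40), gauss(540, 40), gauss(445, 30)])

# Stand-ins for the CIE color-matching functions x-bar, y-bar, z-bar
# (the real x-bar has a secondary blue lobe, crudely mimicked here).
cmfs = np.vstack([gauss(600, 40) + 0.3 * gauss(445, 25),
                  gauss(555, 45),
                  gauss(450, 30)])

eye_lms = cones @ stimulus * dw   # "eye color data": three numbers
cie_xyz = cmfs @ stimulus * dw    # "CIE color data": X, Y, Z
print(eye_lms, cie_xyz)
```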
Because three numbers do not specify a function, none of the camera color data, the eye color data, or the CIE color data is sufficient to reconstruct the original SPD of the scene. That, in a nutshell, seems to be the source of the problem of mapping between the three sorts of color data laid out above. Ideally, of course, full reconstruction is what one would like, so that one can convey to someone else, somewhere and sometime else, the experience of the scene as it was recorded. The best one can hope for is to reconstruct a reasonable approximation to the SPD using three numbers. Unless, of course, the SRF's of the camera are the same as those of human vision or of the CIE, in which case the corresponding sets of color data will correspond for any input SPD.
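That degeneracy can be made explicit: for any fixed set of three response curves, any spectral perturbation lying in the null space of the 3 x N response matrix (a "metameric black") changes the spectrum but not the three integrated numbers. A small sketch, again with stand-in curves rather than real CMF data:

```python
import numpy as np

wl = np.arange(400.0, 701.0, 5.0)
gauss = lambda c, w: np.exp(-0.5 * ((wl - c) / w) ** 2)

cmfs = np.vstack([gauss(600, 40), gauss(555, 45), gauss(450, 30)])   # 3 x N stand-ins
spd1 = 0.2 + 0.6 * gauss(620, 40)

# Take an arbitrary perturbation and subtract its projection onto the row space
# of the response matrix; what remains is invisible to this observer.
bump = gauss(500, 10) - gauss(650, 10)
proj = cmfs.T @ np.linalg.solve(cmfs @ cmfs.T, cmfs @ bump)
metameric_black = 0.1 * (bump - proj)

spd2 = spd1 + metameric_black                   # a physically different spectrum...
print(np.allclose(cmfs @ spd1, cmfs @ spd2))    # ...with identical tristimulus values: True
```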
There is by now a whole industry built around the CIE convention using XYZ. It seems reasonable to use that as a starting point, and not to concern ourselves with the map between CIE color data and eye color data (leaving that to the CIE to refine; note, though, that since the SRF's of the CIE standard for XYZ are not the same as the SRF's of human vision, the relation between CIE color data and eye color data is as fraught with ambiguity as the relation between camera color data and CIE color data, or between camera color data and eye color data).
Setting aside the relation to visual perception, and concentrating on relating the camera's color data to the CIE conventions, one wants a map from camera data R_c, G_c, B_c to X, Y, Z that is bijective (maps in both directions unambiguously). Of course, that is a bit of a non-starter, since, for example, the degeneracies (metamers) of the two sets of data -- the sets of SPD functions that yield the same XYZ or the same R_c, G_c, B_c -- are in general quite different. However, it seems reasonable that one could set up an optimization problem, averaging over various SPD's, to construct an "optimal" map between the different representations of color data (a sketch of one such fit is given below).
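One concrete way to pose such an optimization, sketched under the same stand-in assumptions (hypothetical camera SRF's, hypothetical CIE curves, and an arbitrary ensemble of smooth training spectra): compute R_c, G_c, B_c and X, Y, Z for each training SPD, then solve in the least-squares sense for the 3x3 matrix that best maps one to the other.

```python
import numpy as np

rng = np.random.default_rng(0)
wl = np.arange(400.0, 701.0, 5.0)
gauss = lambda c, w: np.exp(-0.5 * ((wl - c) / w) ** 2)

cam = np.vstack([gauss(600, 35), gauss(540, 35), gauss(460, 35)])   # camera SRF stand-ins
cmf = np.vstack([gauss(600, 40), gauss(555, 45), gauss(450, 30)])   # CIE CMF stand-ins

# Arbitrary ensemble of smooth, reflectance-like training spectra.
spds = np.array([0.2 + 0.6 * gauss(rng.uniform(420, 680), rng.uniform(20, 80))
                 for _ in range(200)])                               # 200 x N

rgb = spds @ cam.T    # camera color data for each training SPD  (200 x 3)
xyz = spds @ cmf.T    # CIE color data for each training SPD     (200 x 3)

# Least-squares 3x3 matrix M such that rgb @ M is as close as possible to xyz.
M, *_ = np.linalg.lstsq(rgb, xyz, rcond=None)
rms_error = np.sqrt(np.mean((rgb @ M - xyz) ** 2))
print(M)
print(rms_error)      # how far the best linear map misses, averaged over the ensemble
```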
I suppose what I am trying to get at is that the camera raw data is color data that is no more or less valid than CIE color data insofar as it is related to the color data of human vision; it is just less standardized. Because the three numbers comprising that color data represent integrals of SPD's weighted by SRF's, one cannot map one set of color data uniquely to the other. Constructing a map amounts to choosing a convention rather than deriving a rigorous relation; one tries to construct the map so that it is roughly accurate over a wide variety of SPD's (the aforementioned optimization problem). There is absolutely no reason why the map need be a linear (matrix) transform; a linear map is simply the crudest and simplest approximation one could make, and a lookup table is a more general way of encoding such a map (a sketch comparing a linear and a slightly more general fit is below).
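To illustrate that last point, here is the same fit repeated with a map slightly more general than a matrix: the camera triplet augmented with quadratic and cross terms before the least-squares solve. This is only a crude stand-in for a lookup table (a LUT being the fully general, sampled version of such a map), and all curves and training spectra are the same illustrative stand-ins as above.

```python
import numpy as np

rng = np.random.default_rng(0)
wl = np.arange(400.0, 701.0, 5.0)
gauss = lambda c, w: np.exp(-0.5 * ((wl - c) / w) ** 2)

cam = np.vstack([gauss(600, 35), gauss(540, 35), gauss(460, 35)])
cmf = np.vstack([gauss(600, 40), gauss(555, 45), gauss(450, 30)])
spds = np.array([0.2 + 0.6 * gauss(rng.uniform(420, 680), rng.uniform(20, 80))
                 for _ in range(200)])
rgb, xyz = spds @ cam.T, spds @ cmf.T

def poly_features(t):
    """Augment R, G, B with quadratic and cross terms for a nonlinear fit."""
    r, g, b = t[:, 0], t[:, 1], t[:, 2]
    return np.column_stack([r, g, b, r*g, r*b, g*b, r*r, g*g, b*b])

# Linear 3x3 fit vs. polynomial fit; since the linear terms are a subset of the
# polynomial terms, the training residual of the second fit cannot be larger.
M_lin, *_ = np.linalg.lstsq(rgb, xyz, rcond=None)
M_poly, *_ = np.linalg.lstsq(poly_features(rgb), xyz, rcond=None)
print(np.sqrt(np.mean((rgb @ M_lin - xyz) ** 2)),
      np.sqrt(np.mean((poly_features(rgb) @ M_poly - xyz) ** 2)))
```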
Anyway, a long-winded exposition of a few thoughts about which I'd be happy to hear comments.