Optics, OLPF, micro lenses and the spatial integration carried out by pixel sites can probably be described by a linear space-variant filter/PSF reference system. The actual pixel sampling grid cannot, and seems to make the problem harder to describe in a simple manner.
Indeed, it is not easy to model the system MTF with several poorly characterized components. However, one can approximate the effect on overall performance when one varies a single variable, e.g. the sensel pitch.
I assume most professional photographers will only invest in higher MP sensor arrays when they feel the need to produce larger output. After all, there is not much use for huge MP (and associated storage) solutions for web-publishing. Because most camera platforms pose physical limitations on the dimensions of the sensor array, the common approach is to increase sampling density, i.e. using a smaller sensel pitch, for a sensor array of a given size.
I've made a simulation of the sensel pitch effect on the MTF curve of a system with an imaginary perfect lens (no residual aberrations), with a fixed (perfectly circular) aperture of f/8 which causes diffraction, and a sensor array (square sensel grid, 100% fill factor) without OLPF. I've varied the sensel pitch between 1 micron and 9 micron. That will also have an effect on dynamic range, but I've only focused on the effect the sensel pitch has on resolution (due to diffraction).
Without diffraction, the square exposure aperture of the sensel basically performs like a box filter with the size of a sensel. That produces a predictable MTF roll-off whose shape, expressed in cycles/pixel, does not change with sampling density (because each sensel is filtered the same). However, the diffraction pattern diameter for a given wavelength and aperture value has a given dimension and spans a variable number of sensels, depending on the sensel pitch.
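To make that argument a bit more concrete, here is a minimal analytical sketch. It is not the simulation I ran (that was evaluated with Imatest); it only multiplies two textbook MTF formulas: the diffraction MTF of an aberration-free circular aperture, and the sinc-shaped MTF of a square, 100% fill-factor sensel. The function names and the choice to evaluate at the Nyquist frequency are my own assumptions for illustration.

```python
import numpy as np

WAVELENGTH_MM = 555e-6   # 555 nm expressed in mm
F_NUMBER = 8.0           # f/8, as in the simulation described above

def diffraction_mtf(f_cyc_mm, wavelength_mm=WAVELENGTH_MM, n=F_NUMBER):
    """MTF of an aberration-free lens with a circular aperture (incoherent light)."""
    cutoff = 1.0 / (wavelength_mm * n)  # diffraction cutoff in cycles/mm
    s = np.clip(np.asarray(f_cyc_mm, dtype=float) / cutoff, 0.0, 1.0)
    return (2.0 / np.pi) * (np.arccos(s) - s * np.sqrt(1.0 - s * s))

def sensel_mtf(f_cyc_mm, pitch_um):
    """MTF of a square, 100% fill-factor sensel aperture (a box filter)."""
    pitch_mm = pitch_um * 1e-3
    # np.sinc(x) is sin(pi*x)/(pi*x), which is the box-filter response.
    return np.abs(np.sinc(np.asarray(f_cyc_mm, dtype=float) * pitch_mm))

def system_mtf_at_nyquist(pitch_um):
    """Combined (lens x sensel) MTF at the Nyquist frequency of the grid."""
    nyquist = 1.0 / (2.0 * pitch_um * 1e-3)  # cycles/mm
    return float(diffraction_mtf(nyquist) * sensel_mtf(nyquist, pitch_um))

if __name__ == "__main__":
    for pitch in (9, 6, 4, 2, 1):
        print(f"{pitch} micron pitch: MTF at Nyquist = "
              f"{system_mtf_at_nyquist(pitch):.3f}")
```

The sensel term at Nyquist is always 2/pi (about 0.64), whatever the pitch, which is the "shape does not change" part. The diffraction term is what drops as the pitch shrinks, and it reaches zero once the Nyquist frequency exceeds the diffraction cutoff of 1/(wavelength x f-number).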
In the attached file, I've shown the same f/8 diffraction pattern for 555 nm wavelength, but overlaid with grids with a different pitch. The 9 micron pitch grid shows that a single sensel is almost the same size as the diffraction pattern, but at a 1 micron pitch the same diffraction pattern is sampled much more finely, spread over many sensels. Again, the diffraction patterns are the same, it's the sampling pitch that's different. This represents the effect of viewing each pixel at the same size (100% zoom on screen, or the same PPI in output). The denser sampling will produce more output pixels for the same image detail, so larger output at a given PPI but with lower per pixel micro-contrast.
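For those who want to check the sizes involved, here is a small sketch that computes the diameter of the Airy disk (to the first dark ring, using the usual 2.44 x wavelength x f-number approximation) and how many sensels it spans at each pitch. The helper names are just my own, chosen for illustration.

```python
WAVELENGTH_UM = 0.555   # 555 nm expressed in micron
F_NUMBER = 8.0          # f/8

def airy_disk_diameter_um(wavelength_um=WAVELENGTH_UM, n=F_NUMBER):
    """Diameter to the first dark ring of the Airy pattern: 2.44 * lambda * N."""
    return 2.44 * wavelength_um * n

def sensels_across_disk(pitch_um):
    """Number of sensel widths that fit across the Airy disk diameter."""
    return airy_disk_diameter_um() / pitch_um

if __name__ == "__main__":
    print(f"Airy disk diameter: {airy_disk_diameter_um():.2f} micron")
    for pitch in (9, 6, 4, 2, 1):
        print(f"{pitch} micron pitch: {sensels_across_disk(pitch):.1f} "
              f"sensels across the disk")
```

At f/8 and 555 nm the diameter works out to roughly 10.8 micron, so it covers a bit more than one sensel at a 9 micron pitch and almost 11 sensels at a 1 micron pitch.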
I've used a simulated crop from a 24x36mm sensor array, and convolved the various sensel pitch versions with the f/8 diffraction pattern of 555 nm light. The results were evaluated with Imatest, and I'll attach several graphical outputs of some relevant MTF results in follow-up posts (due to the file number/size limitations). First I'll present a summary of various key numbers from the Imatest output in the second attachment. For an explanation of their meaning you can read about it on the Imatest website (http://www.imatest.com/docs/sharpness.html#optimum_aper).
Cheers,
Bart