Thank you Bart for a very insightful answer. I'm certainly not an expert (unlike most of you here), I'm just an IT guy with an interest in digital cameras, who wants to do things right.
I did some additional reading on the subject and it appears that you are correct – the EXIF data seems to be the most accurate way to get the pixel pitch.
I searched for EXIF data for about 15 different cameras and then compared my calculations with the pixel pitch I got from EXIF data. I found that if I use the simplest formula (sensor width/max. image resolution width) I get the most accurate results. In most cases the difference is between 0 and 0.3% compared to the EXIF pixel pitch.
I will probably make a compromise. I will use EXIF data for DSLRs, and the sensor width/max. image resolution width formula for the rest. This way at least high end cameras will have the most accurate numbers possible.
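The comparison described above could be sketched roughly as follows. The function names are hypothetical and chosen only for illustration; the inputs are whatever the manufacturer's specification sheet and the EXIF data provide for a given camera.

```python
def pitch_from_spec_um(sensor_width_mm: float, image_width_px: int) -> float:
    """Simplest estimate: published sensor width divided by max image width.

    Returns the pixel pitch in micrometres.
    """
    if image_width_px <= 0:
        raise ValueError("image_width_px must be positive")
    return sensor_width_mm * 1000.0 / image_width_px


def percent_difference(estimate_um: float, exif_um: float) -> float:
    """Absolute difference of the estimate relative to the EXIF-derived pitch, in percent."""
    return abs(estimate_um - exif_um) / exif_um * 100.0


# Example: a nominal 36 mm wide sensor with 5616 px wide images.
estimate = pitch_from_spec_um(36.0, 5616)
print(f"{estimate:.3f} um")
```

With a pitch derived from EXIF for the same camera, `percent_difference(estimate, exif_pitch)` gives the kind of figure quoted above.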
Given the huge task, you indeed want to use something that can preferably be automated. DSLRs can be used by various types of users, so it makes sense to get those numbers as accurate as humanly possible.
The whole thing, however, still seems a bit illogical to me. I try to imagine pixels on a sensor in a physical way. You have a sensor of a certain size and a certain (total) number of pixels on it. If you want to know the horizontal pixel pitch, you take the full width of the sensor and simply divide it by the total number of horizontal pixels. Simple and logical.
The problem is with the size specifications given by the manufacturers. They are not necessarily very accurate, whereas the information in the EXIF data is more accurate because it is used internally by the camera, and later for postprocessing (e.g. lens corrections). Therefore we should be able to rely on the EXIF data, although we need to understand exactly what it means. The data tagged with FocalPlaneXResolution and FocalPlaneYResolution is the actual sensel density per linear (X/Y) dimension. How many of those sensels are fitted in the total image doesn't change their sampling density; it just determines the total dimensions of the effective sensor area.
Part of the sensor area (at one or more edges) holds technical sensels which do not contribute directly to the image formation, e.g. masked sensels to determine the noise/black level without exposure. Then there are edge sensels that are required for demosaicing, which can require a wider area around the principal pixel that is being demosaiced. And then there is a crop applied to get the demosaiced image pixel dimensions to a multiple of 16, for the generation of JPEGs and full size preview thumbnails. The question then becomes, when a manufacturer says the sensor dimension is 36x24mm, which sensels are included in those dimensions? That's why it is so tricky to start with the total dimensions. It's much safer to start with the EXIF FocalPlane Resolutions, and that info is available from all those sample files made available by sites such as Imaging Resource, DPreview, etc.
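Since the focal plane resolution is a density (sensels per unit length), the pitch is simply its reciprocal. A minimal sketch, assuming the value has already been read from the file with some EXIF tool; the function name is hypothetical, and only the two unit codes defined by the EXIF standard for FocalPlaneResolutionUnit (2 = inch, 3 = centimetre) are handled:

```python
# Length of one resolution unit in micrometres, keyed by the
# FocalPlaneResolutionUnit code (2 = inch, 3 = centimetre).
UNIT_LENGTH_UM = {2: 25400.0, 3: 10000.0}


def pitch_from_exif_um(focal_plane_resolution: float, resolution_unit: int = 2) -> float:
    """Convert a focal plane resolution (sensels per unit) to a pitch in micrometres.

    resolution_unit defaults to 2 (inch), the default the EXIF standard
    gives for FocalPlaneResolutionUnit.
    """
    if resolution_unit not in UNIT_LENGTH_UM:
        raise ValueError(f"unsupported FocalPlaneResolutionUnit: {resolution_unit}")
    if focal_plane_resolution <= 0:
        raise ValueError("focal_plane_resolution must be positive")
    return UNIT_LENGTH_UM[resolution_unit] / focal_plane_resolution


# Made-up value for illustration only: 2540 sensels per inch is a 10 um pitch.
print(pitch_from_exif_um(2540.0, 2))
```

Some cameras write other unit codes in practice, so real files may need extra entries in the table.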
See also this explanation
But if you use max image resolution width, you only use like 95% of the pixels that are on the sensor. You use full sensor width, but only 95% of the pixels and you get more accurate pixel pitch (according to EXIF) than if using all pixels (sensor resolution). I still can't get my head around this.
It's the difference between effective pixels, total pixels, and border sensels. Not all input sensor elements are counted for the published output MegaPixel specifications. The EXIF 'Maker notes' for my 1Ds3 specify the different sensel/pixel dimensions in the metadata fields:
CroppedImageWidth: 5616 (as output to a JPEG, cropped to a 3:2 aspect ratio and a multiple of 16 pixels, and corresponds to 36mm)
CroppedImageHeight: 3744 (as output to a JPEG, cropped to a 3:2 aspect ratio and a multiple of 16 pixels, and corresponds to 24mm)
SensorWidth: 5712 (as used to demosaic, and output by some Rawconverters)
SensorHeight: 3774 (as used to demosaic, and output by some Rawconverters)
And then there are additional border pixels specified, but I'm not sure that the Tag meaning has been reverse engineered correctly.
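To put numbers on it, here is the arithmetic for the widths listed above, as a small sketch. It assumes, as noted with the CroppedImageWidth value, that the nominal 36 mm corresponds to the cropped image width; the variable names are made up for illustration.

```python
NOMINAL_WIDTH_MM = 36.0   # manufacturer's stated sensor width
CROPPED_WIDTH_PX = 5616   # CroppedImageWidth (JPEG output)
SENSOR_WIDTH_PX = 5712    # SensorWidth (area used for demosaicing)

# Pitch if the 36 mm is spread over the cropped output width.
pitch_cropped_um = NOMINAL_WIDTH_MM * 1000.0 / CROPPED_WIDTH_PX

# Pitch if the same 36 mm is (wrongly) spread over the wider sensel count.
pitch_full_um = NOMINAL_WIDTH_MM * 1000.0 / SENSOR_WIDTH_PX

# How far the second estimate falls below the first, in percent.
gap_percent = (pitch_cropped_um - pitch_full_um) / pitch_cropped_um * 100.0

# Physical width the 5712-sensel area would span at the cropped-width pitch.
full_area_width_mm = SENSOR_WIDTH_PX * pitch_cropped_um / 1000.0

print(f"pitch over cropped width: {pitch_cropped_um:.3f} um")
print(f"pitch over sensor width:  {pitch_full_um:.3f} um")
print(f"gap: {gap_percent:.2f} %")
print(f"width of 5712-sensel area: {full_area_width_mm:.2f} mm")
```

Dividing 36 mm by the larger sensel count shrinks the pitch by roughly 1.7%, which is far more than the 0 to 0.3% spread seen when the image width is used. Under the stated assumption, the extra sensels simply sit outside the nominal 36 mm rather than being squeezed into it.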