I think the basic premise of basing this calculation on pixel size comparisons is wrong. How about considering it from an optical resolution or image blur perspective instead? After all, the only purpose served by pixels is to adequately sample the intrinsic resolution of the optics.
Setup #1) 4x5 camera (100mm x 125mm sensor) with 180mm lens at f/16. Assuming the lens is diffraction limited, the point spread blur diameter at the sensor is 20um. The sensor then captures (100K)(125K)/(20)(20) = 30 million resolution "spots". You assume this setup has 12um pixels, which samples the 20um blur function moderately well, but apart from that observation, pixel size does not enter the calculation.
Setup #2) Canon G11 camera (5.7mm x 7.6mm sensor) with 180mm lens at f/16. Yes, that's right: let's use exactly the same lens, just switch sensors. Then to capture the same image, with the same composition and the same number of resolution "spots", you need the same effective sensor area, which requires (100)(125)/(5.7)(7.6) = 288 frames. This setup will clearly give equivalent image quality, since every important variable is the same. Of course the image in this case will have vastly more pixels, since each pixel is smaller, but that is irrelevant to the calculation. You could downsample quite a bit to reduce the pixel count, without any loss of image quality, since sampling a 20um blur function with 2um pixels is overkill.
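The frame count is just the ratio of the two sensor areas. A minimal sketch (ignoring the stitching overlap you would need in practice):

```python
# Frames needed to tile the 4x5 capture area with the G11 sensor.
# Assumed G11 sensor dimensions: 5.7 mm x 7.6 mm, no overlap between frames.
large_area_mm2 = 100 * 125      # 4x5 sensor area
g11_area_mm2 = 5.7 * 7.6        # G11 sensor area
frames = large_area_mm2 / g11_area_mm2
print(round(frames, 1))         # 288.6, i.e. the ~288 frames quoted above
```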
Setup #3) Canon G11 camera (5.7mm x 7.6mm sensor) with 90mm lens at f/8. This time let's try something different -- use a shorter lens at a wider aperture. With a lens of 1/2 the focal length, each frame covers four times the subject area, so to get the same image composition you need only 1/4 as many frames: 72 instead of 288. By opening the lens up two stops, diffraction is reduced by a factor of two, so each blur spot is half the diameter and you pack four times as many "spots" into each frame. That exactly offsets the reduced frame count, giving the same total image resolution (still assuming diffraction limited optics). Depth of field will also be similar. Capturing the 10um blur spots with 2um pixels is still overkill.
Setup #4) Canon G11 camera (5.7mm x 7.6mm sensor) with 45mm lens at f/4. The logic is the same as the previous example. Now you need only 18 frames to capture the same image composition, and you will again capture the same number of resolution "spots". This still assumes diffraction limited optics, which may be pretty difficult to achieve at f/4, so there is probably no point in extrapolating further than this. Digitizing 5um blur spots with 2um pixels is still good sampling.
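The invariance across Setups #2-#4 can be verified directly: halving the focal length quarters the frame count, while halving the f-number quarters the blur-spot area, so the total spot count never changes. A quick check using the blur diameters and frame counts from the setups above:

```python
# Total resolution "spots" for Setups #2, #3, #4.
# Assumed G11 sensor: 5700 um x 7600 um; blur diameters and frame
# counts taken from the setups above.
sensor_area_um2 = 5700 * 7600
setups = [
    # (f-number, blur diameter in um, frames stitched)
    (16, 20, 288),
    (8, 10, 72),
    (4, 5, 18),
]
for n, blur_um, frames in setups:
    spots = frames * sensor_area_um2 / blur_um**2
    print(f"f/{n}: {spots / 1e6:.1f}M spots")
# f/16: 31.2M spots
# f/8: 31.2M spots
# f/4: 31.2M spots
```

All three land on the same ~31M spots as the 4x5 reference in Setup #1, which is the whole point: the pixel count never entered the calculation.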