I think you have a point, and I can elaborate on it a little further.

1. Lens resolution can be expressed as a ray angle. In its simplest form, it is

wave_length/lens_diameter.

We can convert it to lp/mm if we assume the image plane is very near the focal plane. In that case it will be

lens_diameter/(focal_length x wave_length).

We can further convert it to the number of pixels per frame, assuming each frame is W x L.

That is, the total number of pixels per frame is

W x L x lens_diameter^2 /(focal_length x wave_length)^2
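
Here is a minimal Python sketch of the three expressions in point 1, just to make the units concrete. The function names and the example numbers (550 nm light, a 25 mm lens opening, a 50 mm focal length, a 36 x 24 mm frame) are my own assumptions for illustration, and everything is kept in millimetres.

```python
def angular_resolution(wavelength_mm, lens_diameter_mm):
    """Smallest resolvable ray angle in radians: wave_length / lens_diameter."""
    return wavelength_mm / lens_diameter_mm


def spatial_resolution(wavelength_mm, lens_diameter_mm, focal_length_mm):
    """Resolvable elements per mm at the focal plane:
    lens_diameter / (focal_length x wave_length)."""
    return lens_diameter_mm / (focal_length_mm * wavelength_mm)


def pixels_per_frame(width_mm, length_mm, wavelength_mm,
                     lens_diameter_mm, focal_length_mm):
    """Total resolvable pixels in a W x L frame:
    W x L x lens_diameter^2 / (focal_length x wave_length)^2."""
    per_mm = spatial_resolution(wavelength_mm, lens_diameter_mm, focal_length_mm)
    return width_mm * length_mm * per_mm ** 2


# Assumed example values, all in mm (550 nm = 0.00055 mm).
WAVELENGTH_MM = 0.00055
print(angular_resolution(WAVELENGTH_MM, 25.0))
print(spatial_resolution(WAVELENGTH_MM, 25.0, 50.0))
print(pixels_per_frame(36.0, 24.0, WAVELENGTH_MM, 25.0, 50.0))
```

These are idealized limits in the same simplest form used above, so real lenses will deliver less.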

2. The above equation shows that, to achieve the same number of pixels per frame with the focal length scaled to match the angle of view, we need the same lens diameter.

That is, when W x L is scaled down, so is the focal length; the two cancel out and leave the lens diameter unscaled. This implies the physical size of the lens will be about the same as for the large format.

3. This means,

-- if two formats have the same total number of pixels, and the lens focal length is scaled with the format, the lenses will have the same diameter.

-- on the other hand, if the two formats have the same pixel pitch, so that the larger format has more pixels per frame, the lens diameter of the larger format needs to be proportionally larger.
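
The two cases in point 3 can be checked numerically by solving the pixel-count expression for the lens diameter. This is only a sketch: the function name, the 24-megapixel target, the 36 x 24 mm and 24 x 16 mm frame sizes, and the 550 nm wavelength are assumptions I picked for the example.

```python
import math


def required_diameter(total_pixels, width_mm, length_mm,
                      focal_length_mm, wavelength_mm=0.00055):
    """Lens diameter (mm) needed to resolve total_pixels in a W x L frame.

    Obtained by solving
        total_pixels = W x L x D^2 / (focal_length x wave_length)^2
    for D.
    """
    return (math.sqrt(total_pixels / (width_mm * length_mm))
            * focal_length_mm * wavelength_mm)


CROP = 1.5          # assumed crop factor
PIXELS = 24e6       # assumed pixel count for the smaller format

# Case 1: same total pixels, focal length scaled to keep the angle of view.
d_large = required_diameter(PIXELS, 36.0, 24.0, 50.0)
d_small = required_diameter(PIXELS, 24.0, 16.0, 50.0 / CROP)
print(d_large, d_small)   # the two diameters agree

# Case 2: same pixel pitch, so the larger frame holds CROP^2 times the pixels.
d_large_same_pitch = required_diameter(PIXELS * CROP ** 2, 36.0, 24.0, 50.0)
print(d_large_same_pitch / d_small)   # ratio equals the crop factor
```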

4. However, point 3 also indicates an advantage of the smaller format. When two formats have the same total number of pixels and the same lens diameter, the smaller format will be faster because the f-number (focal length / lens diameter) will be smaller.
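
A quick numeric check of the f-number comparison in point 4, with assumed values: a 25 mm lens diameter for both formats, a 50 mm focal length on the larger format, and that focal length divided by an assumed 1.5 crop factor on the smaller one. A lower f-number means a faster lens.

```python
def f_number(focal_length_mm, lens_diameter_mm):
    """f-number = focal length / lens diameter (lower means faster)."""
    return focal_length_mm / lens_diameter_mm


LENS_DIAMETER_MM = 25.0   # assumed, identical for both formats
CROP = 1.5                # assumed crop factor

n_large = f_number(50.0, LENS_DIAMETER_MM)          # larger format
n_small = f_number(50.0 / CROP, LENS_DIAMETER_MM)   # smaller format, same angle of view
print(n_large, n_small)
```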

5. So when more pixels per frame are needed, the choice is the larger format.

But when the total number of pixels is enough, the smaller format is the choice. You get either a smaller lens with the same speed or a faster lens with the same size.

Well, the larger format has a fundamental advantage when it comes to sharpness. The larger the sensor, the less optical reduction there is for a given field of view. For example, consider shooting a billboard painted with closely spaced alternating black and white lines with an FF sensor and a 1.5x crop sensor, and assume you just fill the frame with the billboard in each case and use the same FF lens in each case. If the lines are spaced at, say, 50 lppmm on the FF sensor, they will be spaced at 75 lppmm on the crop sensor. The MTF of the lens will be lower at 75 lppmm than at 50 lppmm, assuming the lens has the same characteristics at the two different focal lengths used. Thus there is a fundamental advantage with FF in terms of sharpness of the captured image, and it is for this reason that the quantity lw/ph (line widths per picture height) is used for comparison purposes when different sensor sizes are involved. Counteracting that to some extent is the fact that the lens will be less sharp at the edges of the FF sensor than at the edges of the crop sensor.
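
To put numbers on the billboard example, here is a small sketch. The function names and the sensor heights (24 mm for FF, 16 mm for the 1.5x crop) are my assumptions. It uses the usual conversion lw/ph = 2 x lp/mm x picture height in mm, since each line pair is two line widths.

```python
def lp_per_mm_on_sensor(lp_per_mm_full_frame, crop_factor):
    """Line-pair density on a smaller sensor when the same scene fills the frame."""
    return lp_per_mm_full_frame * crop_factor


def lw_per_ph(lp_per_mm, sensor_height_mm):
    """Line widths per picture height: two line widths per line pair."""
    return 2 * lp_per_mm * sensor_height_mm


FF_HEIGHT_MM = 24.0     # assumed full-frame sensor height
CROP_HEIGHT_MM = 16.0   # assumed 1.5x crop sensor height

ff_density = 50.0
crop_density = lp_per_mm_on_sensor(ff_density, 1.5)

print(crop_density)                             # density the lens must resolve on the crop sensor
print(lw_per_ph(ff_density, FF_HEIGHT_MM))      # detail per picture height, FF
print(lw_per_ph(crop_density, CROP_HEIGHT_MM))  # detail per picture height, crop
```

Both sensors record the same amount of detail per picture height, but the crop sensor needs the lens to deliver it at the higher lppmm figure, where the MTF is lower.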

Dave