Isn't the key to understanding this the linearity of the sensor's response? Each additional stop of dynamic range requires double the number of photons to impinge upon the photoreceptor.
There will be a minimum number of photons required to produce a recognisable image that is not completely, or almost completely, obscured by noise. Having decided what that number is, you can keep doubling it until the photoreceptor well is full, e.g. 50, 100, 200, 400, 800, ..., 102,400, 204,800, 409,600.
Once the well is full, no further increase in the DR of the captured image is possible. Clearly, sensors with bigger pixels have a DR advantage, all else being equal.
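To make the doubling concrete, here is a minimal sketch. The noise floor of 50 photoelectrons and the full-well figure are just the illustrative numbers from the progression above, not measurements of any real sensor:

```python
import math

# Hypothetical values from the example above: a noise floor of ~50
# photoelectrons and a full-well capacity of 409,600.
noise_floor = 50        # minimum signal distinguishable from noise
full_well = 409_600     # photoelectrons the pixel holds before clipping

# Each stop of DR is one doubling of the photon count, so the number
# of stops is the base-2 log of the full-well-to-noise-floor ratio.
stops = math.log2(full_well / noise_floor)
print(f"{stops:.1f} stops of dynamic range")  # 13.0 stops

# The doubling sequence itself, from noise floor up to the full well:
level = noise_floor
while level <= full_well:
    print(level)
    level *= 2
```

Doubling the full well (with the same noise floor) moves the log ratio up by exactly one, which is why bigger pixels buy DR directly.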
It seems to me the technological challenge here is to devise a non-linear progression that more closely matches the eye's roughly logarithmic response. For example, the brightest stop of DR in the above progression (204,800 to 409,600) contains far more levels than the eye can discern (what a waste), whereas the darkest stop (50 to 100) contains far too few levels for good tonality.
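The skew is easy to show numerically. Assuming a linear sensor feeding a 12-bit converter over a 13-stop range (both figures illustrative), each brighter stop receives twice as many output codes as the one below it:

```python
# A linear ADC spreads its codes evenly over photon counts, so each
# brighter stop gets twice the codes of the stop below it.
bits = 12
total_codes = 2 ** bits   # 4096 codes for a 12-bit converter
stops = 13

for stop in range(stops, 0, -1):
    # The top stop covers the upper half of all codes, the next stop
    # half of the remainder, and so on down into the shadows.
    codes = total_codes / 2 ** (stops - stop + 1)
    print(f"stop {stop:2d}: ~{codes:.1f} codes")
```

The top stop soaks up 2048 codes while the darkest stop is left with half a code, which is exactly the "waste at the top, starvation at the bottom" pattern described above.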
One might expect 14-bit conversion to improve tonality in the darkest parts of the image, so that what was previously unacceptable with 12-bit processing becomes almost acceptable. One thus gets a marginal increase in DR through slightly cleaner shadows, but to get a substantial increase in DR (say, one full stop) one would need photoreceptors that can accommodate at least double the number of photons: 819,200 in the above example. That is, to gain one additional stop of DR, each photoreceptor needs double the capacity, all else being equal. (It's always necessary to add "all else being equal".)
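The two effects can be separated with the same toy numbers (again assuming a linear sensor and a 13-stop range): extra ADC bits multiply the codes available in the darkest stop, while only a bigger well adds a stop of DR.

```python
# Codes available in the darkest of 13 stops at 12-bit vs 14-bit
# conversion (linear sensor assumed; figures are illustrative).
stops = 13
for bits in (12, 14):
    codes_in_darkest_stop = 2 ** bits / 2 ** stops
    print(f"{bits}-bit: {codes_in_darkest_stop} codes in darkest stop")
# 12-bit gives 0.5 codes, 14-bit gives 2.0 -- cleaner shadows,
# but still far too coarse for good tonality.

# A genuine extra stop of DR instead requires doubling the well:
full_well = 409_600
print("one more stop needs a well of", full_well * 2)  # 819,200
```

This is the distinction drawn above: 14 bits refine how the existing range is quantised, whereas well capacity sets the range itself.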