120MP basically means P&S-sized pixels (2.2µm) in a DSLR-sized sensor. Please note that noise is not a fixed quantity; it varies with scale in the image. There are two components to noise in most images -- read noise (noise contributed by the camera electronics) and photon shot noise (quantum fluctuations in the light signal itself). At low to moderate ISO, photon noise tends to be the more visible of the two. At a fixed image scale, photon noise depends only on how much light is collected, and is therefore independent of pixel size. High ISO might be compromised somewhat: with per-pixel read noise unchanged from current production DSLRs, read noise at a fixed scale would be about twice that of the 7D (since the pixels are about half the linear dimension). Low ISO might not be so compromised; for instance, the 40D has about 10-15% less read noise than the 1D3 at base ISO, so smaller pixels might even have a slight advantage, unless Canon decides that cleaner low-ISO shadows aren't a priority (they've been going backwards in this department recently). They could also mitigate the high-ISO read noise issue by cutting down on the pattern noise that plagues their cameras at all ISOs (though it's worse at low ISO).
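To put a number on that read-noise scaling (a minimal sketch; the function name and the 2.5 e- figure are just illustrative): at a fixed output scale, an n×n block of smaller pixels contributes its read noise in quadrature, so halving the linear pixel dimension doubles the read noise per unit of image area.

```python
import math

def read_noise_at_fixed_scale(per_pixel_read_noise_e, linear_scale_factor):
    """Read noise per output 'super-pixel' when combining n x n smaller pixels.

    linear_scale_factor: how many small pixels fit across one reference pixel
    (e.g. 2 if the new pixels are half the linear dimension).
    Independent noise sources add in quadrature, so n*n pixels contribute
    sqrt(n*n) = n times the per-pixel read noise.
    """
    n_pixels = linear_scale_factor ** 2
    return per_pixel_read_noise_e * math.sqrt(n_pixels)

# Hypothetical 2.5 e- read noise per pixel, pixels half the linear size:
print(read_noise_at_fixed_scale(2.5, 2))  # -> 5.0 (twice the per-area noise)
```

Photon shot noise, by contrast, is set by the total light collected over that same area, which is why it drops out of the pixel-size comparison entirely.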
And since diffraction is often discussed in the context of pixel size: diffraction is a property of the optics, not the sensor. Having smaller pixels does not increase diffraction -- on the contrary, all it does is narrow the range of f-stops over which sensor resolution, rather than the optics, is the limiting factor in system resolution. A side benefit is that the AA filter's blur radius, being tied to the pixel size, will be smaller; and demosaic artifacts will be pushed off to finer image scales where they are less noticeable.
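For reference, the underlying numbers (a rough sketch; the "spans ~2 pixels" criterion and the example pitches are illustrative assumptions, not a definitive cutoff): the Airy disk diameter is about 2.44·λ·N regardless of the sensor, and smaller pixels only lower the f-number at which that disk starts to cover multiple pixels.

```python
def airy_disk_diameter_um(f_number, wavelength_um=0.55):
    # First-minimum diameter of the Airy pattern: 2.44 * lambda * N.
    # Depends only on the optics (f-number and wavelength), not the sensor.
    return 2.44 * wavelength_um * f_number

def diffraction_limited_f_number(pixel_pitch_um, wavelength_um=0.55):
    # Rough f-number where the Airy disk spans ~2 pixels; the exact
    # criterion is debatable, this just shows the trend with pixel size.
    return 2 * pixel_pitch_um / (2.44 * wavelength_um)

for pitch in (6.4, 4.3, 2.2):  # example pixel pitches in microns
    f = diffraction_limited_f_number(pitch)
    print(f"{pitch} um pixels: sensor-limited up to ~f/{f:.1f}")
```

Stopping down past that f-number doesn't get any worse with small pixels than with big ones; the small-pixel sensor simply stops adding resolution sooner.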
What Canon really needs to do, if they're serious about heading in this direction, is concentrate some development resources on compression technology -- clearly one doesn't want to have to deal with the huge files that high-MP cameras produce, since most of that information is redundant. sRAW is the lamest possible image compression method one can imagine. Much better would be the sort of compression RED uses, which preserves nearly full resolution while reducing file sizes substantially and allows continuous shooting without maxing out the image buffer. I did a very kludgy exercise some time ago: take a raw file, separate out the RGGB planes, apply JPEG2000 (wavelet-based) compression by about a factor of 10, then decompress, reassemble the raw, and demosaic. The result was not bad for such heavy compression. I'm given to understand that RED does something along these lines, but of course they've spent more than an afternoon on it and it works much better.
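A toy version of that exercise can be sketched as follows (assumptions: a one-level Haar transform with coefficient thresholding stands in for real JPEG2000, which uses multi-level biorthogonal wavelets plus quantization and entropy coding; plane dimensions must be even):

```python
import numpy as np

def split_bayer_planes(raw):
    """Separate an RGGB mosaic into its four color planes."""
    return {
        "R":  raw[0::2, 0::2],
        "G1": raw[0::2, 1::2],
        "G2": raw[1::2, 0::2],
        "B":  raw[1::2, 1::2],
    }

def haar_compress(plane, keep_fraction=0.1):
    """One-level 2D Haar transform, keep the largest coefficients, invert.

    Toy stand-in for wavelet compression: smooth regions concentrate
    their energy in a few coefficients, so discarding the rest costs
    little, which is why ~10x compression can look surprisingly good.
    """
    x = plane.astype(np.float64)
    # Forward Haar along rows: pairwise averages and differences
    a = (x[0::2] + x[1::2]) / 2
    d = (x[0::2] - x[1::2]) / 2
    rows = np.concatenate([a, d], axis=0)
    # ... and along columns
    a2 = (rows[:, 0::2] + rows[:, 1::2]) / 2
    d2 = (rows[:, 0::2] - rows[:, 1::2]) / 2
    coeffs = np.concatenate([a2, d2], axis=1)
    # Zero out all but the largest coefficients (the "compression")
    thresh = np.quantile(np.abs(coeffs), 1 - keep_fraction)
    coeffs = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
    # Inverse Haar: columns, then rows
    h = coeffs.shape[1] // 2
    a2, d2 = coeffs[:, :h], coeffs[:, h:]
    rows = np.empty_like(coeffs)
    rows[:, 0::2] = a2 + d2
    rows[:, 1::2] = a2 - d2
    v = rows.shape[0] // 2
    a, d = rows[:v], rows[v:]
    out = np.empty_like(coeffs)
    out[0::2] = a + d
    out[1::2] = a - d
    return out
```

Running each of the four planes through `haar_compress` and re-interleaving them reproduces the raw mosaic, ready for demosaicing; a constant or smooth plane survives even aggressive thresholding essentially unchanged.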
One could also wring a little more savings out of recognizing that 14 bits is wasteful at current output DR levels; 12 is sufficient for every Canon DSLR ever made. Level-thinning methods such as the one Nikon uses in its lossy compression are another way to shave off a bit or two per pixel. But to me, the real issue is that one doesn't want to keep all the image info at high res (e.g., not in skies and other smooth regions), so the question becomes what data to keep and what to discard. Low-res cameras make that decision for you by omitting all fine-scale image information; but there are certainly more intelligent methods one can imagine (and some that exist already), where compression technology keeps high-res information where it is desired and discards it where it is not.
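The level-thinning idea can be sketched generically (the square-root level spacing here is a standard illustration keyed to shot noise; Nikon's actual curve differs):

```python
import numpy as np

def build_companding_lut(in_bits=14, out_bits=10):
    """Map linear sensor values onto fewer levels spaced ~sqrt(signal).

    Rationale: photon shot noise grows as sqrt(N), so adjacent linear
    levels at high signal differ by far less than the noise and carry
    no extra information. Thinning levels along a square-root curve
    discards only those redundant distinctions.
    """
    in_max = (1 << in_bits) - 1
    out_max = (1 << out_bits) - 1
    x = np.arange(in_max + 1)
    encode = np.round(np.sqrt(x / in_max) * out_max).astype(np.uint16)
    # Decode table: linear value represented by each output code
    decode = np.round((np.arange(out_max + 1) / out_max) ** 2 * in_max).astype(np.uint16)
    return encode, decode

encode, decode = build_companding_lut()
value = 12000  # a bright 14-bit sample
roundtrip = int(decode[encode[value]])
# The round-trip error stays well below the shot noise (~sqrt(signal)
# in electron-ish units) at that brightness, so nothing visible is lost.
```

That squeezes 14-bit data into 10-bit codes; pairing it with the "12 bits is enough" observation and entropy coding is where the extra bit or two per pixel comes from.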
BTW, as I understand it, Canon is not "obsessed" with APS-H; rather, it's the largest sensor size they can make without stitching on their fab line.