I have to say that my understanding of a pixel, long before I bought my first DSLR, was that it is a basic 'picture' element consisting of a red, a green and a blue (sub-)element.
When I discovered that digital camera sensors are essentially monochrome as far as the pixel count is concerned, and that the 'real' pixels as I understood them were in fact interpolations, I was quite dismayed. I thought, this is an almost fraudulent sleight of hand.
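To make the point concrete, here is a toy sketch of that interpolation, assuming an RGGB Bayer layout and plain bilinear averaging; real cameras use far more sophisticated demosaicing, so this is only illustrative:

```python
# Toy bilinear demosaic of one interior photosite, assuming an RGGB
# Bayer pattern. Illustrative only, not any camera's real pipeline.

def neighbors4(mosaic, r, c):
    """Average the 4 edge-adjacent samples (interior sites only)."""
    return (mosaic[r-1][c] + mosaic[r+1][c] +
            mosaic[r][c-1] + mosaic[r][c+1]) / 4.0

def neighbors_diag(mosaic, r, c):
    """Average the 4 diagonal samples (interior sites only)."""
    return (mosaic[r-1][c-1] + mosaic[r-1][c+1] +
            mosaic[r+1][c-1] + mosaic[r+1][c+1]) / 4.0

# 4x4 raw mosaic: each number is ONE colour sample, laid out RGGB:
#   R G R G
#   G B G B
#   R G R G
#   G B G B
raw = [
    [10, 20, 12, 22],
    [30, 40, 32, 42],
    [11, 21, 13, 23],
    [31, 41, 33, 43],
]

# Site (1, 1) is a blue photosite. Its green and red values were never
# measured; they are estimated from neighbouring samples.
b = raw[1][1]                     # measured blue: 40
g = neighbors4(raw, 1, 1)        # green guessed from 4 edge neighbours
r = neighbors_diag(raw, 1, 1)    # red guessed from 4 diagonal neighbours
print((r, g, b))                 # -> (11.5, 25.75, 40)
```

So two of the three colour values at every photosite are estimates, which is exactly the 'sleight of hand' described above.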
The Foveon concept conforms to my general understanding of what constitutes a pixel. If my CRT monitor is set to a resolution of 1280x1024, that means there are 1280 groups of 'red, green and blue' phosphors in the horizontal dimension, and 1024 such groups in the vertical dimension.
As I understand it (although I'm not entirely sure), one of the difficulties of producing a high-resolution Foveon sensor (say 10 real megapixels) is signal-to-noise. The green and red wavelengths have to pass through one or more layers of silicon before being detected, so some degree of diffusion and loss of sharpness can be expected.
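The depth separation itself follows from the fact that silicon absorbs shorter wavelengths closer to the surface. A rough Beer-Lambert sketch shows the effect; the absorption lengths below are order-of-magnitude illustrative values I've assumed, not datasheet figures for any actual sensor:

```python
import math

# Beer-Lambert sketch of why a stacked (Foveon-style) sensor can
# separate colours by depth: shorter wavelengths are absorbed nearer
# the silicon surface. Absorption lengths are rough illustrative
# values (assumed for this sketch), not measured data.
absorption_length_um = {
    "blue  (~450 nm)": 0.4,
    "green (~550 nm)": 1.5,
    "red   (~650 nm)": 3.5,
}

def fraction_absorbed(depth_um, length_um):
    """Fraction of light absorbed within depth_um of the surface."""
    return 1.0 - math.exp(-depth_um / length_um)

for colour, length in absorption_length_um.items():
    pct = fraction_absorbed(0.5, length)
    print(f"{colour}: {pct:.0%} absorbed in the top 0.5 um")
```

With these assumed numbers, most of the blue is captured near the surface while most of the red survives to the deeper layers, which is the sorting principle; it also shows why the deeper (red) signal is weaker and noisier.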
On the other hand, a conventional Bayer-type sensor has its own built-in inefficiency. For example, the colour filter over a 'green' photosite passes only green light. What happens to the red and blue components of the light falling on it? Down the drain, it appears.
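A back-of-envelope tally makes the waste explicit. Assuming idealised filters (each passing exactly one of three equal-energy bands) and the common RGGB layout, each photosite keeps about a third of the white light hitting it:

```python
# Back-of-envelope tally of what a Bayer colour filter array discards,
# assuming idealised filters (each passes exactly one of three
# equal-energy bands) and the common 2x2 RGGB block.

sites_per_block = {"red": 1, "green": 2, "blue": 1}
total_sites = sum(sites_per_block.values())   # 4 photosites per block

# Each site measures one band out of three, so under equal-energy
# white light it keeps 1/3 of the incident light and discards 2/3.
kept = 1 / 3
lost = 1 - kept

# Likewise, only 1 of the 3 colour values per pixel is measured;
# the other 2 are interpolated from neighbours.
measured_fraction = 1 / 3

print(f"light kept per site:   {kept:.0%}")
print(f"light discarded:       {lost:.0%}")
print(f"colour data measured:  {measured_fraction:.0%}")
```

Under these idealised assumptions roughly two thirds of the incoming light is filtered away, which is the 'down the drain' loss; real filters have broader, overlapping passbands, so the true figure differs.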