Surely this is code dependent though?
I mean, tight-looped code that contains few cache misses won't benefit from this sort of memory interface change.
What? Not sure if you are talking about code fetches/misses or data fetches/misses, but either way it is incorrect, for exactly the reason cmaier pointed out earlier. Perhaps another stab at it won't be pointless.
If you are going to need the code/data at address X and will likely next request the code/data at X+1 (where 1 is the 'word' size), then fetching and transferring the data at X and X+1 all at once will cut your requests in half.
You seem to be looking at what happens after the data is in the cache. There isn't a zero penalty to getting the code/data into the cache in the first place. Similarly, you also need to take into account what you are fetching. For a tight code loop you may get the code into the cache. However, the data is a quite different matter. If you have a large (relative to the L1 cache size) matrix, it probably will not fit in the L1 cache, even more so if the L1 cache has some set associativity present. That is, unless you are doing some relatively trivial matrix math (e.g., multiplying two 10x10 matrices), as opposed to dealing with large blocks of pixels on the screen or thousands of elements in a mesh.
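To put rough numbers on that, here is a minimal sketch. The 32 KB L1 size, the 1000x1000 matrix, and the fetch widths are assumptions picked for illustration, not anything measured on the A4: the matrix works out to roughly 8 MB, nowhere near fitting in a typical L1, and for a straight sequential scan, doubling the fetch width halves the number of bus transfers.

#include <stdio.h>

/* Back-of-the-envelope sketch: does a "large" matrix fit in L1, and how
 * many bus transfers does a sequential scan take at different fetch
 * widths?  The 32 KB L1 and the widths are assumptions for illustration. */
int main(void)
{
    const size_t rows = 1000, cols = 1000;
    const size_t bytes = rows * cols * sizeof(double);   /* ~8 MB */
    const size_t l1_size = 32 * 1024;                    /* assumed L1 size */

    printf("matrix working set: %zu bytes, L1: %zu bytes -> fits? %s\n",
           bytes, l1_size, bytes <= l1_size ? "yes" : "no");

    /* Sequential scan: every byte crosses the bus once, so transfers =
     * size / fetch width.  Doubling the width halves the requests --
     * the "fetch X and X+1 together" argument above. */
    for (size_t width = 4; width <= 16; width *= 2)
        printf("%2zu-byte fetches: %zu transfers\n", width, bytes / width);
    return 0;
}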
The other factor folks seem to be skipping over is that the GPU is hooked to the same memory. Go look at the trend line of GPUs and see if "wider" isn't a long-term trend.
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
http://en.wikipedia.org/wiki/Comparison_of_ATI_Graphics_Processing_Units
And yet pixel representation has remained largely constant over time (8 bits per primary color). Likewise, the native register word size is independent of the bus width to memory.
Likewise, code that constantly takes long branches and contains many cache misses won't benefit much either...
Again, once you take a long branch, are you not going to need the next instruction word after the one you just branched to? The same "X is very likely followed by X+1" effect applies.
Even in data. Let's say you've just finished multiplying row X from matrix A by column Y from matrix B. The next pass of the tight loop will be row X and column Y+1 from B. Sure, column Y+1 from B is a series of extra pulls (if matrix B is stored in row-major order), but striding through row X will be in order (again, assuming it's stored in row-major order) until the end of the inner column loop, when you jump back to the beginning elements of the row. Similarly, when you finish all of the columns, row X is followed by row X+1 (which, in row-major storage, is a sequential read).
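A minimal sketch of that loop, a plain textbook triple loop over row-major C arrays (the size N = 512 is arbitrary; nothing here is A4-specific):

#include <stdio.h>

#define N 512

/* All three matrices stored row-major, as C arrays are. */
static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    /* Naive textbook multiply, C = A * B -- a sketch of the access
     * pattern described above. */
    for (int x = 0; x < N; x++) {             /* row X of A               */
        for (int y = 0; y < N; y++) {         /* column Y of B            */
            double sum = 0.0;
            for (int k = 0; k < N; k++) {
                /* A[x][k] strides along row X in order: consecutive k means
                 * consecutive addresses, the "X then X+1" case.           */
                /* B[k][y] jumps a full row per step: each element of
                 * column Y is one of the "extra pulls".                   */
                sum += A[x][k] * B[k][y];
            }
            C[x][y] = sum;                    /* next pass: column Y+1     */
        }
        /* once every column is done we move on to row X+1 of A, which in
         * row-major storage is again a purely sequential read            */
    }
    printf("C[0][0] = %f\n", C[0][0]);        /* keep the work observable  */
    return 0;
}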
Similarly, it doesn't have to be vectors and matrices. If you declare commonly co-occurring variables together:
int x, y, z;
double a, b, c;
z = x + y;
then if, when you read x, you also pull y into the cache line, you are more ahead of the game than with:
int x;
double a;
int y;
double b;
int z;
z = x + y;
The first example is more representative of what you'd see in real code. The second jams up the localized, sequential allocation.
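A compiler is free to place locals wherever it likes, so to make the layout point concrete here is a minimal sketch using structs, where member order is defined. The struct names and offsets are purely illustrative; exact numbers depend on the ABI:

#include <stdio.h>
#include <stddef.h>

/* Grouped: x, y, z sit next to each other, so reading x very likely pulls
 * y (and z) into the same cache line. */
struct grouped {
    int x, y, z;
    double a, b, c;
};

/* Interleaved: the doubles push x, y and z apart (and force padding), so x
 * and y are less likely to share a cache line. */
struct interleaved {
    int x;
    double a;
    int y;
    double b;
    int z;
    double c;
};

int main(void)
{
    printf("grouped:     x@%zu y@%zu z@%zu (size %zu)\n",
           offsetof(struct grouped, x), offsetof(struct grouped, y),
           offsetof(struct grouped, z), sizeof(struct grouped));
    printf("interleaved: x@%zu y@%zu z@%zu (size %zu)\n",
           offsetof(struct interleaved, x), offsetof(struct interleaved, y),
           offsetof(struct interleaved, z), sizeof(struct interleaved));
    return 0;
}

On a typical compiler with 8-byte-aligned doubles, the grouped version reports x, y, z at offsets 0, 4, 8, while the interleaved one puts them at roughly 0, 16, 32 with padding in between, which is exactly the locality difference described above.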
In answer to somebody's question about whether you can buy bare memory die: yes, you can... flash, SRAM, DRAM, obviously in quantity.
OK, thanks. However, that would seem to make you more dependent on the vendor and the specific memory implementation, rather than less, as the notion in iFixit's write-up suggested. Pinouts and packages can be standardized. What is inside the package is not likely to be standardized, especially as connection tolerances get smaller and smaller. Similarly, the thermal validation for a bundled package gets more complicated. You are also exposed to supply problems if the memory vendor decides to tweak the design/implementation or just straight up discontinues the part to chase a "hotter" segment of the memory market.
If Samsung is already building millions of other 2-RAM-die + ARM-die packages with Samsung memory, then a different ARM die that fits inside the same bounds as the other Samsung ones can just slide into place, using exactly the same supply chain already in place, just with a slightly different ARM die. Reusable designs are generally much cheaper because the common design-element costs are spread out over more products. If the A4 ARM die is more power efficient than the common Samsung ARM dies, it will fit the thermal constraints; you just need to tweak the connections.
It is also likely that other ARM setups need that wider bus width too, compared to the current iPhone's package era: bigger screens (more GPU bandwidth needed) and faster clock rates (a CPU pipeline's bandwidth to fill). I doubt the A4 is the only one rolling out of the factories this year with that feature.