The size of our dataset and the use of 1.5 T MRI images are, of course, limitations. Some anatomical variations depend on age (in adults) and would be better captured with a larger sample spanning a wider age range and with higher-resolution images. This is particularly true for gyrification (the process and the extent of folding), which varies with age and can thus affect the identification of anatomical branches and borders. The current dataset was nevertheless variable enough to highlight issues in automated packages. For what is reported here, i.e. that the observed differences mainly stem from how anatomical variability in additional gyri and branching is handled, neither aging nor image resolution has an impact. For instance, the presence or absence of double gyri is established once the brain is fully formed, does not change across adulthood, and can be observed even at coarse image resolution.
With volume being (in theory) the product of thickness and surface area, and thickness being generally stable within each package, larger surface areas are expected to accompany larger volumes and vice versa, and this is what we saw. We also observed that the inability to fully capture anatomical variability has knock-on effects on neighbouring regions. This was the case in FreeSurfer, BrainSuite, and BrainGyrusMapping, where SFG GMvol and WMsa are proportional to CG GMvol and WMsa, whilst no effect, or the reverse effect, was observed when segmenting regions manually (Fig. 4).
Although our work highlights differences between parcellation protocols, the corresponding outputs of image analysis tools most likely vary due to a combination of factors, not just the parcellation phase. One step prior to parcellation in automated and semi-automated tools is the pre-processing phase. In FreeSurfer, for example, that phase is used, amongst other things, to derive white and grey matter masks. These are subsequently split in the processing stage, as per a parcellation protocol, to form parcels. Such mask effects were not investigated in this manuscript, although they could contribute to differences, especially for thickness. Package inconsistency across sites and operating systems is another aspect to consider, although it was not a contributing factor in our study, as each package was run on only one computer and one operating system. Finally, and most relevant here, differences in algorithms can also account for observed differences. Volume is simply derived by counting the number of voxels in each parcel and thus directly reflects differences in parcellation protocols. Cortical thickness, however, is specific to grey matter in FreeSurfer, while in BrainSuite it refers to the thickness of the gyrus, all the way down to the fundus, and therefore captures the combined grey and white matter thicknesses. The combination of parcel definition and the use of the sulcal fundus to mark the border of a gyrus also explains inconsistencies in surface area measurements.
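To make concrete the point that voxel-count volume directly reflects where a protocol places its borders, the following is a minimal sketch and not the implementation used by any of the packages discussed. The function name `parcel_volumes`, the integer labels (1 and 2 standing in for two neighbouring parcels), the toy label volumes, and the voxel dimensions are all hypothetical.

```python
from collections import Counter


def parcel_volumes(labels, voxel_dims_mm=(1.0, 1.0, 1.0), background=0):
    """Return {label: volume in mm^3} by counting voxels per label.

    `labels` is a nested list (planes > rows > voxels) of integer parcel
    labels; `background` voxels are ignored. Hypothetical helper.
    """
    voxel_volume = voxel_dims_mm[0] * voxel_dims_mm[1] * voxel_dims_mm[2]
    counts = Counter(
        value
        for plane in labels
        for row in plane
        for value in row
        if value != background
    )
    return {label: n * voxel_volume for label, n in counts.items()}


# Toy 2 x 2 x 3 label volume: 1 and 2 are two neighbouring parcels.
labels_a = [
    [[1, 1, 2], [0, 2, 2]],
    [[1, 0, 2], [0, 0, 2]],
]

# Same anatomy, but a second (assumed) protocol assigns one border voxel
# to parcel 1 instead of parcel 2.
labels_b = [
    [[1, 1, 1], [0, 2, 2]],
    [[1, 0, 2], [0, 0, 2]],
]

dims = (1.0, 1.0, 2.0)  # assumed voxel size in mm
protocol_a = parcel_volumes(labels_a, dims)
protocol_b = parcel_volumes(labels_b, dims)

for label in sorted(protocol_a):
    print(label, protocol_a[label], protocol_b[label])
```

In this toy case the total labelled volume is identical under both protocols; only its division between the two neighbouring parcels changes, which is the kind of knock-on effect on adjacent regions described above.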