### Prudent & Knowledgeable Analytic Insight

#### A Scale-Free “Meaningful Difference”

Past entries in the P&K series have provided detail regarding use of meaningful differences (and, by extension, effect sizes). A meaningful difference was defined through data mining the extensive warehouse of P&K projects and calculating the size of a difference associated with a variety of percentiles. A meaningful difference, with specific reference to use of 9-point scales, could be defined as that difference associated with the 90th percentile such that any difference greater than that would be taken as significant with at least 90% confidence.

As is true for all statistical tools, the use of a meaningful difference criterion, to identify truly interesting and realistically significant results, must take account of the nature of the data input to the calculation. For meaningful differences, rating data are primary inputs from which the averages used in the difference calculation are derived. Ratings have a specific feature that can seriously affect and distort otherwise simple statistical summaries, such as the averages: scales on which ratings are based are double-bounded. Re-expressed, ratings provided by respondents can be no lower than the lowest scale rating nor can they be any higher than the highest rating. This bounded nature of scales causes a compression in ratings, best seen as the bunching of ratings at either the low or high end of the scale. The bunching can be statistically characterized by increased skewness (i.e., data tailing away from the bunched end), more peaked (leptokurtic) distributions and, of greatest effect on typical statistical summaries, reduced variability. (The diminished variability also can have a dampening effect on correlations.) The bounding effect is easily seen when plotting averages against variability (specifically, the standard deviations). For example, the scatterplot below shows the strong negative relationship: as averages increase toward the upper bound of the 9-point scale, the standard deviation diminishes. This makes sense: with less room to rate, the variability in the ratings declines.

As a statistical side note, theory underlying the t-test assumes independence between the averages and standard deviation, independence between what’s in the numerator and what’s in the denominator of the t-test. Clearly, with the bounding effect, this assumption is violated: averages and variability are correlated, negatively. However, given the relatively small differences between averages typically encountered (e.g., less than 1 point difference), the change in variability is sufficiently small to not cause concern. In addition, an assumption underlying the use of t- (and F-) tests is the equality of variances associated with the averages. However, for rating data, the larger the differences between averages then the larger the differences between variances. Such is the strength of relationship created by the bounded scale. But, again, the difference in variances in a typical application will usually be sufficiently small (as long as the ratio of largest to smallest variance is roughly smaller than 3) so as to not present a problem.

At issue, for this note, is the ability to compare differences across a wider range of results found in a normative database.

Of interest is the result of bounding on meaningful differences. A crucial point to be made is that differences are not scale-free, but rather dependent or conditional on the absolute value of the averages used in their calculation. While, for example, the difference of .4 between 6.2 and 6.6 (on a 9-point scale) is numerically equal to the .4 difference between 7.8 and 8.2 (on the same 9-point scale), the absolute position of these averages along the 9-point scale suggests these differences don’t hold the same meaning. The simplest way to think about this is to consider distance to the upper bound. The .4 movement from 7.8 should be more difficult given the nearer constraint of the upper bound. The effect of this bound constraint can be quantified and, along with that, a re-expression of “meaningful differences” is developed that reduces or removes the effect of the bound.

The bound adjustment works in two steps. The first step removes the bound effect, specifically the upper bound, from each average: Average / (Bound – Average). For the application here, the “Bound” value is 9, taken from use of a 9-point scale. Using an average of 7.2 as an example, the bound-adjusted result is 4, 7.2 / (9 – 7.2). The table below provides more examples. The opportunity is taken to also show comparability across scale sizes, for 9-, 7- and 5-point scales. Pretending for a moment that rating scales have ratio scale properties, the scale values reported for each scale are the same ratio relative to their upper bound (e.g., a rating of 7 on a 9-point scale has the same ratio as 5.4 on a 7-point scale and 3.9 on a 5-point scale) and all share roughly the same adjusted value (e.g., around 3.5). In this sense, the bound adjustment acts to equate rating values across scales of varying length.

| 9-Point: Scale Value | 9-Point: Bound Adjusted | 7-Point: Scale Value | 7-Point: Bound Adjusted | 5-Point: Scale Value | 5-Point: Bound Adjusted |
| --- | --- | --- | --- | --- | --- |
| 5 | 1.25 | 3.9 | 1.26 | 2.8 | 1.27 |
| 6 | 2.00 | 4.7 | 2.04 | 3.3 | 1.94 |
| 7 | 3.50 | 5.4 | 3.46 | 3.9 | 3.55 |
| 8 | 8.00 | 6.2 | 7.78 | 4.44 | 7.93 |
| 8.5 | 17.00 | 6.6 | 16.50 | 4.72 | 16.80 |
| 8.8 | 44.00 | 6.84 | 43.00 | 4.88 | 40.74 |
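The first adjustment step can be sketched in a few lines of Python. This is only an illustration of the Average / (Bound – Average) formula described above; the function name `bound_adjust` is a hypothetical label, not something taken from the P&K materials.

```python
def bound_adjust(average: float, bound: float) -> float:
    """Step 1 of the adjustment: Average / (Bound - Average)."""
    if not 0 <= average < bound:
        raise ValueError("average must lie in [0, bound)")
    return average / (bound - average)


# Worked example from the text: an average of 7.2 on a 9-point scale.
print(round(bound_adjust(7.2, 9), 2))  # 4.0

# A rating of 7 on a 9-point scale, rescaled so that it sits at the same
# ratio of the upper bound on 7- and 5-point scales.
for bound in (9, 7, 5):
    value = 7 / 9 * bound
    print(bound, round(value, 2), round(bound_adjust(value, bound), 2))
```

When the scale value is held at exactly the same ratio of the upper bound, the adjusted value is identical across scale lengths (3.5 in this example). The tabled values are close to, rather than exactly equal to, this figure.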

An examination of the table, toward the larger rating values, shows that the bound adjustments “take off” as the rating nears the bound. The adjusted values appear to grow exponentially. In an attempt to rein in these values, the second step in the adjustment applies a logarithmic (base 10) transformation^{*}. The table below shows the log adjusted values for the 9-point scale. Aside from taming the soaring adjustments, the log (base 10) transformation also returns a value between 0 and 1 for any adjusted value between 1 and 10 (that is, for averages between 4.5 and roughly 8.2 on the original 9-point scale), which should encompass most average ratings encountered. (The zero point of this transformed scale is at 4.5.)^{**}

| 9-Point Scale Value | Bound Adjusted | Log Adjustment |
| --- | --- | --- |
| 5 | 1.25 | 0.097 |
| 6 | 2 | 0.301 |
| 7 | 3.5 | 0.544 |
| 8 | 8 | 0.903 |
| 8.5 | 17 | 1.230 |
| 8.8 | 44 | 1.643 |

With the averages bound-adjusted and behaving, thanks to a log transformation, differences between pairs of averages, regardless of absolute position on the scale (or length of scale), can be compared. As an example of application, the differences (6.4 – 6.12), (7.4 – 7.2) and (8.4 – 8.31), although of different numeric size, all translate to a difference on the bound-adjusted log scale of roughly .064 (between .063 and .065). In words, a difference of .3 when averages are in the 6-point range is equivalent to a difference of .2 when averages are within the 7-point range. These, in turn, are equivalent to a smaller .09 difference when averages are at the high end (around 8.4) of the scale.^{***}
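The equivalence of these three differences can be checked with a short sketch. The helper names `log_bound_adjust` and `adjusted_difference` are hypothetical; the calculation itself is just the two steps described above (bound adjustment followed by a base-10 log).

```python
import math


def log_bound_adjust(average: float, bound: float = 9.0) -> float:
    """Both steps of the adjustment: log10(Average / (Bound - Average))."""
    if not 0 < average < bound:
        raise ValueError("average must lie strictly between 0 and the bound")
    return math.log10(average / (bound - average))


def adjusted_difference(high: float, low: float, bound: float = 9.0) -> float:
    """Difference between two averages on the bound-adjusted log scale."""
    return log_bound_adjust(high, bound) - log_bound_adjust(low, bound)


# The three pairs of averages cited in the text.
for high, low in [(6.4, 6.12), (7.4, 7.2), (8.4, 8.31)]:
    raw = high - low
    adj = adjusted_difference(high, low)
    print(f"{high} vs {low}: raw {raw:.2f}, adjusted {adj:.3f}")
```

Raw differences of .28, .20 and .09 all land between .063 and .065 on the adjusted scale.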

The table below provides guidelines for defining “meaningful differences” in the bound-adjusted scale, and then translating back using a 9-point scale example. The first two columns provide a selection of percentiles and the difference between bound-adjusted averages associated with that percentile (i.e., the “meaningful differences” at each percentile). The last three columns provide examples of differences between averages along a 9-point scale, from 1 of 3 starting points (6.1, 7.1 or 8.1) to the difference, on a 9-point scale, corresponding to the bound-adjusted meaningful difference.

9-Point Scale Translation

| Percentile (Database Results) | Meaningful Difference | From 6.1, To: | From 7.1, To: | From 8.1, To: |
| --- | --- | --- | --- | --- |
| 10 | 0.02 | 6.2 | 7.2 | 8.2 |
| 50 | 0.10 | 6.6 | 7.4 | 8.3 |
| 80 | 0.23 | 7.1 | 7.7 | 8.4 |
| 90 | 0.31 | 7.3 | 8.0 | 8.5 |
| 95 | 0.36 | 7.5 | 8.1 | 8.6 |

The median (50th percentile) bound-adjusted difference between averages is .10. Using the above table as a reference frame, .10 would back-translate on a 9-point scale as a .5 difference when ratings are in the vicinity of 6.1 (i.e., from 6.1 to 6.6). (Referring to the earlier “Meaningful Difference” P&K entry, the median difference cited there, with no bound adjustment, was .38. This is about the size of median differences midway between 6 and 7.) Returning to the table above, the bound-adjusted difference of .36 associated with the 95th percentile (treating a difference at this percentile as that required to achieve, empirically derived, 95% confidence) back-translates to a 1-point difference when the averages hover between 7 and 8. However, the difference grows as averages recede to more moderate rating levels. The 1.4 difference, from 6.1 to 7.5, closely aligns with the 1.41 difference reported for the 95th percentile in the earlier “Meaningful Difference” entry. Yet, a difference of a half-point is at the 95th percentile level when averages involved in the difference calculation exceed 8. The difference in the size of the meaningful difference again points to the value of appreciating where the averages used to calculate a difference fall along the scale. The .36 difference is associated with larger differences as the averages involved move away from the upper bound of the scale.
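The back-translation from the bound-adjusted log scale to the original rating scale is the inverse of the two adjustment steps. A minimal sketch follows; the function names are hypothetical and the inverse formula is derived here from the adjustment definition rather than quoted from the P&K materials.

```python
import math


def log_bound_adjust(average: float, bound: float = 9.0) -> float:
    """Forward transformation: log10(Average / (Bound - Average))."""
    return math.log10(average / (bound - average))


def back_translate(adjusted: float, bound: float = 9.0) -> float:
    """Inverse transformation: recover the average from its adjusted value."""
    ratio = 10 ** adjusted
    return bound * ratio / (1 + ratio)


def shifted_average(start: float, adjusted_diff: float, bound: float = 9.0) -> float:
    """Average reached from `start` after moving `adjusted_diff` on the adjusted scale."""
    return back_translate(log_bound_adjust(start, bound) + adjusted_diff, bound)


# 95th percentile meaningful difference (.36) from the three starting points.
for start in (6.1, 7.1, 8.1):
    print(start, "->", round(shifted_average(start, 0.36), 2))
```

The three results round to 7.5, 8.1 and 8.6, matching the 95th percentile row of the table. Because the tabled meaningful differences are rounded to two decimals, other rows may differ from this calculation in the last digit.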

As written in the earlier P&K entry, a database-derived empirical measure of product differentiation should be used to augment statistical significance results when evaluating differences in product performance. Differences should be large relative to historical norms, as well as statistically significant, before being qualified as “meaningful”. Consider, for example, assessing the superiority of product performance. The superior product should not only be significantly better (e.g., with 95% confidence) but also show a difference larger, historically, than typically encountered (e.g., a difference larger than 95% of those encountered previously). The information and detail provided in this note suggest that the definition, the magnitude, of a “meaningful difference” depends on where on the scale the averages involved in the difference fall. As noted above, differences calculated from averages in the middle of the 9-point scale are expected to be larger than those associated with averages that are closer to the upper bound of the scale. This note provides guidelines on how to calculate an adjustment to account for the effect of the scale bounds, providing a scale-free measure of difference.

^{*}This same adjustment has been used in analyzing quantal or percentage data which follow a sigmoidal distribution caused by bounding at 0 and 1.

^{**}The use of this adjustment, including the log10 transformation, is a bit opportunistic, taking advantage of the nature of 9-point rating data. There tend to be very few, if any, ratings below 4.5 and so negative adjustments, and non-existent log values, are avoided.

^{***}Typically, the effect size would be employed to provide this sort of equivalence. Examination of the P&K database shows that averages in the vicinity of:

6.12 to 6.4 have a standard deviation of 2.2,

7.2 to 7.4 have a standard deviation of 1.7, and

8.31 to 8.4 have a standard deviation of 1.0.

The 3 effect sizes, respectively, are .127, .118 and .09. While all three effect sizes are judged small by Cohen’s assessment, the effect size calculation fails to equate the differences in averages because the standard deviations decline at a rate different from that for difference in the averages.
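For reference, the three effect sizes are simply the raw difference divided by the standard deviation quoted for that region of the scale. A small sketch (the function name `effect_size` is a hypothetical label):

```python
def effect_size(high: float, low: float, sd: float) -> float:
    """Difference between two averages expressed in standard deviation units."""
    return (high - low) / sd


# (pair of averages, standard deviation quoted for that region of the scale)
cases = [((6.4, 6.12), 2.2), ((7.4, 7.2), 1.7), ((8.4, 8.31), 1.0)]
for (high, low), sd in cases:
    print(f"{low} to {high}: effect size {effect_size(high, low, sd):.3f}")
```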

### Prudent & Knowledgeable Analytic Insight

#### An Alternative Design for Discrimination Testing

When a discrimination test is requested (e.g., cost reduction or ingredient substitution tested using a triangle or 3-AFC test), the usual design is a single test in which a lone test product is compared to a standard or benchmark. To re-express, it is typically the case that one test product, e.g., a cost reduction, is provided by a client to be tested against a current or benchmark product, with the goal of assessing respondents’ ability to tell the difference. This one-cell design turns the test into a go/no-go process. If there is no significant discrimination then the test product is introduced to market (or at least proceeds to a next step in the evaluation process). If the test product fails (i.e., significant discrimination) then, perhaps, a second variant is prepared and tested at some future time. Because only one test product is evaluated at a time, additional tests are needed to determine whether some variation in the ingredients leads to a nonsignificant result.

There is an alternative design that can be more efficient, in time and money, which leads to the determination of an effective test product. The idea is to test variations of the test product. For simplicity of exposition for this note, assume that the test product differs from the current product along a single, dominant dimension. For example, to cost reduce a product, the amount of a specific ingredient (e.g., sugar) is reduced or replaced by some alternative ingredient. (The design used here as example is a simple one-factor arrangement. More complex designs can also be used with the same desired increase in information.) So, the idea of variation is to test, for example, 3 ingredient levels rather than one. The variation, experimental in nature, then allows for the estimation of the ingredient level that leads to significant discrimination (akin to a JND), even if that level is not one that has been explicitly tested (i.e., found via interpolation). As such, more information regarding discrimination can be provided by the manipulation and testing of multiple ingredient levels.

Regarding design, respondents may evaluate all variations (rotated in accordance with a Latin Square arrangement to accommodate the assessment of order effects), so there’s not much increase in cost when testing more than a single ingredient level. Alternatively, respondents may test only one variation as per a monadic design. With k variations or ingredient levels, 1/kth of the total sample is allocated to each level. (There are design and cost trade-offs to consider, related to number of variations to test, the presence of order effects, total sample size and project costs.) An opinion offered here is that testing, for example, 3 variations with fewer respondents per variation would garner greater information than testing a single level, without much of an increase in project costs. For example, if a sample of 100 respondents is typically used to evaluate the single test product then slightly lowering the sample to 90 and having 30 respondents each evaluate one of three test variations will provide great detail regarding the relationship between ingredient change and discrimination perception without grossly affecting the research cost.^{*}

Consider a small example: a 3-cell monadic evaluation using a triangle (3-AFC) test. Each cell had a sample of 30 respondents (total sample of 90). Three solutions or concentrations of a liquid were tested, at 1%, 2% and 3%. Respondents were assigned to one of the three cells and asked to complete a single triangle test. The percentages “discriminating” per solution were 1%: 32; 2%: 36; 3%: 41. A simple linear regression (calibration) equation was calculated to summarize the relationship predicting percentage discrimination from the solution: 27.33 + 4.5(solution)^{**}. The data and equation allow for forward and reverse prediction, estimating what solution level corresponds to what level of discrimination or what level of discrimination is achieved for a specific level of solution. Confidence bounds can also be provided for any prediction.

An interesting application is to ask what level of solution yields 37% discrimination, where 37% is taken as an upper bound such that any solution stimulating a greater amount of discrimination would be considered different (unacceptably) from the control. The answer is a 2.15% solution (with standard error of .026). It’s unlikely that a solution level of 2.15 would have been directly tested in the usual one-shot discrimination test design (unless prior calibration such as described above was performed first). Having a good statistical sense that 2.15 is the upper bound on acceptable concentrations can assist in determining the viability of the required cost reduction. Any cost-reduced solution below 2.15 will be acceptable. Anything above will not be.
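The calibration line and the reverse prediction can be reproduced with ordinary least squares on the three cell percentages. The sketch below uses only the figures given in the example; the function names are hypothetical, and the standard error of the reverse prediction is not computed here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


solution_levels = [1.0, 2.0, 3.0]          # concentration, in percent
pct_discriminating = [32.0, 36.0, 41.0]    # percentage discriminating per cell

intercept, slope = fit_line(solution_levels, pct_discriminating)
print(round(intercept, 2), round(slope, 2))  # 27.33 4.5


def forward_predict(level: float) -> float:
    """Predicted percentage discriminating at a given solution level."""
    return intercept + slope * level


def reverse_predict(target_pct: float) -> float:
    """Solution level predicted to yield the target percentage discriminating."""
    return (target_pct - intercept) / slope


print(round(reverse_predict(37.0), 2))  # 2.15
```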

As noted at the beginning of this note, the typical discrimination test evaluates one solution. Either the test solution discriminates or not… hit or miss. Greater information is obtained if those same sample resources used for the typical test are split into at least 3 portions, to test 3 levels of whatever ingredient is being manipulated.

The resulting statistical model provides a good deal of additional detail about the relationship between consumer perception and the manipulated ingredient, with statistical precision belying the relatively small sample.

^{*}Respondents may also continue with the evaluation of the two remaining variations, in accordance with a complete block design with serving order following a Latin Square arrangement. However, for this example, only the first position data are required.

^{**}Testing 3 levels allows for an assessment of curvature, a test of validity when wishing to believe that the true relationship between ingredient change and perception is linear. More levels can be tested, to accommodate the presumption of more complex relationships.

### Prudent & Knowledgeable Analytic Insight

#### Another Look at 60/40

The preference testing regimen called “60/40” has been around for many years, popularized by a large FMCG company and then copied by many others. The intent of a “60/40” test was to demonstrate that a company’s products weren’t just preferred, but preferred by a wide margin… a definition of superiority. In contrast, a typical preference test would be based on achieving at least a 55/45 preference split. A winning preference can be characterized as any difference in excess of 50% that is statistically significant with at least 95% confidence (with a sample base of at least 150), although pure reliance on significance testing allows relatively small levels of preference to pass. To clarify the action standard specifications of a 60/40 test, the test product (from the company wishing to claim product superiority) must achieve 60% preference vs. 40% preference for the competitor (after applying the typical even splitting of “No preference” votes), with 95% confidence (one-tailed). The test is configured for 80% power, which requires a sample of 150 respondents. (Power is used here as an insurance policy. Its effect is to increase the sample size sufficiently such that there is an 80% chance of calling as significant with 95% confidence a difference of at least ten percentage points greater than parity.) One way to apply the results of this test, to be consistent with the specificity of the action standard, is that if 60% or more of the sample state a preference for the test product then it’s a “win”. While the severity of this application is noble, it suffers from two faults.
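The sample size of roughly 150 can be approximated with the usual normal-approximation formula for a one-sided test of a single proportion. This is a sketch under that assumption (the original calculation method is not stated); the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_one_sided(p0: float, p1: float, alpha: float = 0.05,
                          power: float = 0.80) -> int:
    """Normal-approximation sample size to detect p1 against a null of p0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-tailed confidence multiplier
    z_beta = NormalDist().inv_cdf(power)        # power multiplier
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((numerator / (p1 - p0)) ** 2)


# Parity (50%) versus the 60% action standard, 95% confidence, 80% power.
n_required = sample_size_one_sided(0.50, 0.60)
print(n_required)
```

Under these assumptions the result falls within a few respondents of the 150 quoted above.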

The goal of superiority is undermined by the statistical testing procedure. The statistical test described above is not constructed, statistically, to necessarily identify superiority but only to signal when preference is large enough versus parity (a 50/50 split in preference) to offer statistical significance. One way to see this is to build a confidence bound around the desired 60% superiority level, using the specifications provided above: 95% confidence (one-tailed) and a base of 150. Of specific interest is the lower bound of the confidence interval, calculated to be 52.2% preference. That’s pretty close to parity. A sample result of 60% preference is indeed significantly greater than 50% but is not inconsistent with the possibility that true preference sits closer to 53%. Relying solely on statistical significance allows less than superior results to be considered as such. Preventing weak preference performances from being misconstrued as “superior” can be alleviated by tightening the definition of superiority. An approach is setting a tighter confidence bound around 60% with, for example, a lower bound at around 57% rather than 52.2%. The sample required is huge, around 1,650 (calculated with 80% power and 95% confidence). Perhaps the biggest issue here is that a sample of 150 is inadequate for the real purpose of the test… to reject products (or send them back to the plant for redevelopment) that were not superior. The issue becomes a difficulty because few companies are willing to allocate budget resources to an increased sample size.
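Both figures in this paragraph can be approximated with normal-theory calculations. The sketch below assumes the normal approximation to the binomial, and the function names are hypothetical. One observation from running it: the 52.2% lower bound is reproduced with a 1.96 multiplier, whereas the strict one-tailed 1.645 multiplier gives about 53.4%, which is nearer the “closer to 53%” remark.

```python
import math
from statistics import NormalDist


def lower_bound(p_hat: float, n: int, z: float) -> float:
    """Lower confidence bound for a proportion (normal approximation)."""
    return p_hat - z * math.sqrt(p_hat * (1 - p_hat) / n)


def n_for_margin(p: float, margin: float, alpha: float = 0.05,
                 power: float = 0.80) -> int:
    """Sample size placing the lower bound `margin` below p, with stated power."""
    z_total = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return math.ceil(z_total ** 2 * p * (1 - p) / margin ** 2)


print(round(100 * lower_bound(0.60, 150, 1.96), 1))   # 52.2
print(round(100 * lower_bound(0.60, 150, 1.645), 1))  # 53.4

# Sample needed for a lower bound near 57% (a 3-point margin below 60%).
print(n_for_margin(0.60, 0.03))
```

The last line returns a value within a few respondents of the 1,650 cited.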

Alternatively, 60% can serve as a rigid action standard, such that if preference for the test product surpasses 60% then the product is deemed superior. While this logic appears to work, especially when preference greater than 60% is achieved, the underlying statistical variability (based on a sample of 150) suggests that results that truly (in terms of the population value) match or exceed 60% preference may not be observed in the actual sampling of consumers. Essentially, this hard and fast 60% cut-off is a very conservative test, rejecting as inferior some products that are truly superior. The test can be liberalized a bit by accommodating variability that could lead to a truly superior product falling short during the sampling process.

Consider a statistical evaluation where the test product must receive, minimally, 60% preference. In this situation, 60% preference can be taken as a lower bound of the kind of statistical distribution that truly represents superiority. For example, with a sample of 150, 60% represents the 5th percentile of a preference distribution centered at 66%. With 60% sitting at the 5th percentile, the risk of calling as inferior a truly better product is restricted to 5%. A next step is to provide a sense of variability around 60%, to accommodate those superior products that, through sampling, fall below that hurdle. A solution is to create a lower bound, drawn with 95% confidence, below the 5th percentile (60%). This lower bound is then taken as the minimal level of preference allowable. The answer is 58.7… any preference result lower than 58.7% would be rejected^{*}.
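The two numbers in this paragraph can be traced with a short calculation. The sketch assumes a normal approximation for the preference distribution and takes the bootstrapped standard error of .76 for the 5th percentile from the footnote; variable names are hypothetical.

```python
import math
from statistics import NormalDist

n = 150
center = 0.66  # preference distribution assumed to represent superiority

# 5th percentile of the sampling distribution centered at 66%.
standard_error = math.sqrt(center * (1 - center) / n)
fifth_percentile = center + NormalDist().inv_cdf(0.05) * standard_error
print(round(100 * fifth_percentile, 1))  # 59.6, i.e. roughly 60%

# Lower bound drawn below the 60% hurdle, using the footnote's figures.
se_of_fifth_percentile = 0.76   # bootstrapped standard error, in percentage points
one_tailed_multiplier = 1.65    # 95% one-tailed multiplier as given in the footnote
minimum_preference = 60.0 - one_tailed_multiplier * se_of_fifth_percentile
print(round(minimum_preference, 1))  # 58.7
```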

It seems then that the 60/40 test meant to indicate superiority can indeed be used for that purpose, with the prescribed sample of 150.

^{*}The bound is drawn around the 5th percentile value and was derived from tables created based on the assumption of normality and with a sample of 150. The results were confirmed using bootstrapping. A bootstrapped standard error for the 5th percentile was estimated to be .76. When multiplied by 1.65, the 95% one-tailed confidence multiplier, the lower bound on the 5th percentile was calculated to be 58.7 as well.

### Prudent & Knowledgeable Analytic Insight

#### Response Surface Modeling Issues

One of the foundations of product marketing is the development of an “optimal” product. Here, “optimal” can be characterized as the combination of highest consumer acceptance amongst a well-defined consumer segment tempered by manufacturing considerations (e.g., production feasibility and cost). Then, the fundamental idea of product development research is to profile such a product in terms of its physical characteristics. There is perhaps no more efficient process for doing this than through experimentation: the systematic variation of product features (e.g., ingredients) in accordance with an experimental plan to study the impact on consumer acceptance while also obtaining associated production costs.

A tremendous amount of literature on aspects of experimental design, and optimization, already exists. The note below only wishes to add a few comments about application, with the hope of improving the results of a product optimization project. There are four points. To be very specific regarding the context for these points, they are offered with regard to consumer tests, where a set of products, developed in accordance with an experimental plan, are presented to consumers for their evaluation. The study design can be of any variety, from monadic to complete block.

1) Reliance on r-squared values as measures of model fit or performance should be reduced, or at least reported in a less optimistic context. A first, small, step is to adjust the r-squared value for the number of predictors used in the model. (The word “predictor” is used to cover factors and their assorted combinations, as interactions, and extensions, such as polynomial terms.) The adjustment, more of a shrinkage, is 1 – (1 – r-squared) x [(n-1) / (n-p)], where “n” is the number of products being modeled and “p” is the number of predictors used in the modeling. As an example, a response surface model was provided with a reported r-squared value of .9. The model, using 6 predictors, was fit to data with 10 products (modeling average overall liking ratings obtained from each of 10 products). The adjusted r-squared value is .78, a little less impressive. From a different perspective, if an r-squared value is truly zero, meaning that none of the experimental factors or predictors had any effect, then the r-squared value has an expectation equal to (number of predictors-1) divided by (number of products-1). If there were 10 products tested and 6 predictors used to model average overall liking then an expected r-squared value, even when there are no statistically significant predictors, is 5/9, or .56. Given this baseline then, the question becomes: is an r-squared value of .78, while large by traditional standards, really that good?
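The two r-squared calculations in point 1 can be written out directly. The sketch follows the formulas exactly as stated in this note (with “p” counted the way the note counts it); the function names are hypothetical.

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Shrunken r-squared: 1 - (1 - r2) * (n - 1) / (n - p)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)


def expected_null_r_squared(n: int, p: int) -> float:
    """Expected r-squared when no predictor has any true effect: (p - 1) / (n - 1)."""
    return (p - 1) / (n - 1)


# Example from the text: r-squared of .9, 10 products, 6 predictors.
print(f"{adjusted_r_squared(0.9, 10, 6):.3f}")   # 0.775, reported as .78
print(f"{expected_null_r_squared(10, 6):.2f}")   # 0.56
```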

To address goodness, the statistical model fit, as summarized by the r-squared value, can be provided to researchers wanting a statement about statistical significance. The r-squared value can be converted to an F-value…

F = [r-squared / (1-r-squared)] x [(n-p) / (n-1)]

…and tested for significance. An r-squared value of .9 (not adjusted) based on n=10 products and p=6 predictors yields an F-value of 4.49. The degrees of freedom associated with this F statistic are 4 and 9. The result is significant with 97% confidence. As a hunch, few researchers would have guessed at such a low level of confidence for such a large r-squared value. A large r-squared value is no guarantee of good model fit.

Perhaps the greater issue involving use of an r-squared value is the underlying reliance on statistical significance to assess model goodness. Statistical significance is of some value but perhaps as a lower bound on just how good the resulting model is. The central issue for model building should not be significance but rather model stability and the ability to provide adequate prediction for future model use (e.g., interpolations to optimal products within the tested product space)… definitions of model goodness. Greater reliance should be placed on what is called the “interpretability multiplier”. As a rough rule of thumb, an effective model is not just significant with some pre-set level of confidence but one that obtains a model fit which exceeds the required significance (e.g., the F-value) by 3 to 5 times.^{*} The fear, of course, is that few, if any, models will show any sort of significance. Reverting to the small example, with 10 products and 6 predictors, the F-value set at 95% confidence is 3.6. Three times that value is 10.8, achieved when the r-squared value is in the vicinity of .96. Perhaps the rule is suggesting that 6 predictors were too many to model, that no model of 6 predictors based on 10 data points would ever really reach stability.^{**}
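The conversion between r-squared and F, and the multiplier rule, can be sketched as follows. The code uses the conversion exactly as written in this note and takes the critical F of 3.6 from the text rather than computing it; function names are hypothetical. One observation: with r-squared entered as exactly .9 the formula returns 4.0, while the quoted 4.49 is reproduced by an r-squared of about .91, so the reported .9 may be a rounded value (an inference, not something the note states).

```python
def f_from_r_squared(r2: float, n: int, p: int) -> float:
    """F = [r2 / (1 - r2)] x [(n - p) / (n - 1)], as given in the note."""
    return (r2 / (1.0 - r2)) * (n - p) / (n - 1)


def r_squared_for_f(f_value: float, n: int, p: int) -> float:
    """Invert the conversion: the r-squared needed to reach a target F."""
    scale = (n - p) / (n - 1)
    return f_value / (f_value + scale)


critical_f = 3.6   # F-value at 95% confidence, taken from the text
multiplier = 3     # lower end of the 3-to-5 rule of thumb
target_f = multiplier * critical_f

print(round(r_squared_for_f(target_f, 10, 6), 2))  # 0.96
print(round(f_from_r_squared(0.90, 10, 6), 2))     # 4.0
print(round(f_from_r_squared(0.91, 10, 6), 2))     # 4.49
```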

2) Having a good, reliable estimate of variability, often referred to as error variance, is an integral part of effective modeling, whether using an experimental plan or not. With the assumption that no better estimate of error exists, for example from prior research, two approaches to estimating model error have been encountered, both of which are problematic. One approach is based on using degrees of freedom otherwise assigned to more complex or insignificant predictors. For example, predictors meant to capture the effect of higher order interactions are often candidates. These interactions are usually considered benign, with variance accounted for that is assumed (or hoped) to be due to random variability or noise. The variance, and degrees of freedom, accounted for by these predictors are accumulated and used as the basis for an initial estimate of model error. This error is then used to test significance for other predictors. Difficulties with this approach to error estimation are that variability is defined by very few degrees of freedom that were really meant for another purpose, i.e., to capture more complex aspects of the relationship between experimental effects and consumer responses. There’s no guarantee, other than small size, that the predictors assigned to error are truly reflective of error.

A second approach advises replication of some product that is central to the design (e.g., a mid-point), with the assurance that the replication is “true”, e.g., created from scratch. Quite often, the replication is not “true”: replicates are often from a single batch of product split into two and then tested as two products. While this approach avoids the subjective assignment of predictor effects to error variance, the statistical models are still left with few degrees of freedom upon which that error variance is based. In addition, the estimated error may be under-estimated if replicates are drawn from the same batch.

While the two approaches just described may indeed provide reasonable, if unstable (due to too few degrees of freedom assigned), measures of error that can be used to further test experimental effects, there is another source of variability, always obtained in consumer product tests, which represents the variability of greatest interest when marketing a product: consumer variability in perceptions or reactions to the products. The best measure of variability is obtained from the consumer measurement of product performance. Consider a fundamental statistical question: do experimental effects on average consumer acceptance transcend respondent to respondent variability in the assessment of the products upon which the effects are based? The logic here does not at all differ from that used in two-product tests, performed to assess the significance of difference between, say, a current product and its cost-reduction alternative. Each respondent represents at least a quasi-independent replication of the product test and respondent variability provides an estimate of “pure” error, unaffected by the experimental manipulation of the products. As such, the best representation of error variability is that provided by respondents. This estimate of error should then be used to assess significance of the model components.

However, the focus on a model error, with which to test significance, does not go far enough. There needs to be an assessment of model stability, especially with reference to error variability surrounding the profiling of an optimal product (both in terms of the ingredient profile and confidence bounds around the optimal level of liking). A simple suggestion, useful when respondents are asked to evaluate all products, is to model data for each respondent separately, to treat each respondent as the replicate they are. One of the most compelling results is to calculate the percentage of respondents for whom a stable (stationary point) optimum has been obtained. It is a telling comment on the modeling process and model stability when, for example, 40% or more of respondents did not have a stationary point (e.g., optimal level of liking) even when one was obtained when modeling the data in aggregate. Also, the experimental effects calculated for each respondent can then be used as input to a cluster analysis, to identify segments of respondents for whom the product manipulations affected product acceptance differently.

With the goal of providing confidence bounds for important model summaries (e.g., ingredient levels associated with an optimal product or the optimal level of overall liking), a better recommendation (better than modeling data from each respondent separately) is to use bootstrapping and bootstrap aggregation (bagging). The idea here is to create a large number of bootstrap samples (usually 1,000 samples which are balanced to keep intact the experimental design structure) and from each obtain an estimate of the model parameters of interest. Essentially, the complete analysis of the experimental design is run 1,000 times, once for each bootstrap sample, and from each model effects, optima, optimal product profiles, etc. are retained. These bootstrapped calculations are then used to provide at least non-parametric bounds around all pertinent model estimates. The availability of these bounds provides manufacturers with greater flexibility in product ingredient specifications, especially when less expensive ingredient levels are found to fall well within confidence bounds drawn around optimum levels.
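The bootstrap procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the P&K implementation: the ratings are synthetic, a single factor is tested at three coded levels (-1, 0, +1), every respondent rates all three products, and whole respondents are resampled so that each bootstrap sample keeps the design intact. The function names `quadratic_optimum` and `bootstrap_optimum` are hypothetical.

```python
import random
import statistics

def quadratic_optimum(mean_low, mean_mid, mean_high):
    """Fit y = b0 + b1*x + b2*x^2 through product means at x = -1, 0, +1.

    Returns the stationary point -b1 / (2*b2) when it is a maximum (b2 < 0),
    otherwise None (no optimum within this simple model).
    """
    b1 = (mean_high - mean_low) / 2.0
    b2 = (mean_high + mean_low) / 2.0 - mean_mid
    if b2 >= 0:
        return None
    return -b1 / (2.0 * b2)

def bootstrap_optimum(ratings, n_boot=1000, seed=0):
    """Percentile bounds for the optimal factor level.

    ratings: one (low, mid, high) tuple of ratings per respondent.
    Respondents are resampled with replacement as complete rows, so every
    bootstrap sample still contains all three products.
    """
    rng = random.Random(seed)
    n = len(ratings)
    optima = []
    for _ in range(n_boot):
        sample = [ratings[rng.randrange(n)] for _ in range(n)]
        means = [statistics.fmean(row[j] for row in sample) for j in range(3)]
        x_star = quadratic_optimum(*means)
        if x_star is not None:
            optima.append(x_star)
    if not optima:
        return None, None, 0.0
    optima.sort()
    lower = optima[int(0.025 * len(optima))]
    upper = optima[int(0.975 * len(optima)) - 1]
    return lower, upper, len(optima) / n_boot

# Synthetic 9-point liking ratings for 100 respondents (made up for illustration).
data_rng = random.Random(42)
ratings = []
for _ in range(100):
    shift = data_rng.gauss(0, 1)  # respondent-level scale usage
    row = tuple(
        min(9, max(1, round(centre + shift + data_rng.gauss(0, 1))))
        for centre in (6.0, 7.0, 6.5)
    )
    ratings.append(row)

lower, upper, share_with_maximum = bootstrap_optimum(ratings)
print(lower, upper, share_with_maximum)
```

The third value returned is the share of bootstrap samples that produced a maximum at all, which serves the same diagnostic purpose as the percentage of respondents with a stationary point. A real analysis would rerun the full design model, with all factors, inside the loop.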

3) Experimental design models, where the product is the unit of analysis or “observation”, typically use aggregate measures of consumer ratings as dependent variables. Specifically, averages are analyzed. Factorial effects are assessed with regard to changes in average consumer liking. Because experimental design modeling uses aggregate consumer ratings, there is an opportunity to use other summary performance measures as well, providing greater sensitivity and insight into consumer evaluations. One example is use of an accepter-rejecter ratio. Using a 9-point overall liking scale as an example, the ratio is calculated by dividing the top two box percentage by the bottom four box percentage. The ratio can provide greater discrimination when compared to differences in average ratings (but is also more variable). Other summary measures can focus on performance at specific ends of the response distribution. One example is a summary of product rejection or disappointment. With reference again to a 9-point scale, bottom four box percentages serve this purpose. At the other extreme, modeling top or top two box percentages provides a tighter focus on positive consumer acceptance. If these alternate summary measures are used, it is best to use them all since the factorial effects may change for different levels of consumer acceptance. Re-expressed, it is quite possible that ingredient effects are not the same in the two tails of the distribution of overall liking ratings. What drives consumer acceptance at the lower end of the scale may differ from what drives it at the upper end. Response surfaces obtained from models of each of these dependent measures can be overlaid to find a region or set of factor levels that best meets the manufacturer’s needs.
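As a small illustration of the accepter-rejecter ratio on a 9-point scale, the sketch below counts top two box ratings (8 or 9) and bottom four box ratings (1 to 4) and divides one by the other. The two sets of ratings are hypothetical, as are the function names.

```python
def box_percentages(ratings):
    """Return (top two box, bottom four box) proportions for 9-point ratings."""
    n = len(ratings)
    top_two = sum(1 for r in ratings if r >= 8) / n
    bottom_four = sum(1 for r in ratings if r <= 4) / n
    return top_two, bottom_four

def accepter_rejecter_ratio(ratings):
    """Top two box count divided by bottom four box count.

    The ratio of counts equals the ratio of percentages because both share
    the same base. Returns infinity when no respondent rejects the product.
    """
    accepters = sum(1 for r in ratings if r >= 8)
    rejecters = sum(1 for r in ratings if r <= 4)
    if rejecters == 0:
        return float("inf")
    return accepters / rejecters

# Hypothetical overall liking ratings from 10 respondents per product.
product_x = [9, 8, 8, 7, 7, 6, 5, 4, 3, 2]
product_y = [9, 9, 8, 8, 7, 7, 6, 6, 5, 4]

mean_x = sum(product_x) / len(product_x)
mean_y = sum(product_y) / len(product_y)
ratio_x = accepter_rejecter_ratio(product_x)
ratio_y = accepter_rejecter_ratio(product_y)
print(mean_x, mean_y, ratio_x, ratio_y)
```

In this made-up case the averages differ by one scale point (5.9 vs. 6.9) while the ratio moves from 1.0 to 4.0, which is the kind of added discrimination, and added volatility, the text refers to.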

Note that production costs for each product tested can be modeled as well, with the resulting response surface added to the overlay. A definition of “optimal product” may then be one which maximizes acceptance (e.g., top box percentages) and minimizes rejection (e.g., bottom four box percentages), all at a reasonably low production cost.

4) It is often the case with product development experiments, especially when 3 or more factors (e.g., ingredients) have been manipulated, that only a subset of factors has a significant effect on consumer perceptions. This is referred to as effect sparsity (and is especially true for screening experiments). The insignificant factors can be set to some benign level (e.g., in the center of the design and consistent with efficient production and lower product costs), allowing the researcher to focus on the fewer, more effective, factors. The remaining factors can be cleanly and clearly analyzed, and their effect on consumer liking best summarized, if the product subspace in which they fall is sampled in accordance with an effective experimental design plan. Experimental designs which afford clean estimation of a subset of factors are said to have good projective properties.

Box Behnken designs offer good examples of this. Consider an example with 3 factors...

| Product | Factor a | Factor b | Factor c |
|---|---|---|---|
| 1 | -1 | -1 | 0 |
| 2 | 1 | -1 | 0 |
| 3 | -1 | 1 | 0 |
| 4 | 1 | 1 | 0 |
| 5 | -1 | 0 | -1 |
| 6 | 1 | 0 | -1 |
| 7 | -1 | 0 | 1 |
| 8 | 1 | 0 | 1 |
| 9 | 0 | -1 | -1 |
| 10 | 0 | 1 | -1 |
| 11 | 0 | -1 | 1 |
| 12 | 0 | 1 | 1 |
| 13 | 0 | 0 | 0 |

When Factor c is found to be ineffective, the effects of Factors a and b can be more clearly studied by setting c to a specific level (say, at the mid-point, 0). The resulting design for assessing the effects of Factors a and b looks like this…

| | b = -1 | b = 0 | b = 1 |
|---|---|---|---|
| a = -1 | x | | x |
| a = 0 | | x | |
| a = 1 | x | | x |

...which provides estimates of main effects, interaction and a general sense of curvature. The resulting design for a and b, nested or constrained when the ineffective Factor c is set to its mid-point, is balanced, a good sampling of the (a,b) subspace.^{***}
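The projection just described can be checked numerically. The sketch below, offered only as an illustration, encodes the 13-run design from the table, keeps the runs where Factor c sits at its mid-point, and confirms that the remaining (a, b) design is balanced and that a and b are uncorrelated. The function name `project_at_midpoint` is hypothetical.

```python
# The 13-run, 3-factor Box Behnken design: (product, a, b, c).
box_behnken = [
    (1, -1, -1, 0), (2, 1, -1, 0), (3, -1, 1, 0), (4, 1, 1, 0),
    (5, -1, 0, -1), (6, 1, 0, -1), (7, -1, 0, 1), (8, 1, 0, 1),
    (9, 0, -1, -1), (10, 0, 1, -1), (11, 0, -1, 1), (12, 0, 1, 1),
    (13, 0, 0, 0),
]

def project_at_midpoint(design):
    """Keep the runs with c = 0 and drop the c column."""
    return [(product, a, b) for product, a, b, c in design if c == 0]

projected = project_at_midpoint(box_behnken)

# Balance checks: each factor sums to zero, and the cross-product sums to
# zero, so a and b are orthogonal in the projected design.
sum_a = sum(a for _product, a, b in projected)
sum_b = sum(b for _product, a, b in projected)
sum_ab = sum(a * b for _product, a, b in projected)

print(projected)
print(sum_a, sum_b, sum_ab)
```

The five retained runs are the four corners of a 2 by 2 factorial plus the center point, which is what supplies the single degree of freedom for curvature.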

| Product | a | b | c | d |
|---|---|---|---|---|
| 1 | 15.5 | 16.7 | 0.20 | 67.6 |
| 2 | 14.3 | 13.8 | 0.15 | 71.8 |
| 3 | 18.0 | 18.2 | 0.17 | 63.6 |
| 4 | 18.0 | 16.0 | 0.10 | 65.9 |
| 5 | 10.5 | 13.8 | 0.20 | 75.5 |
| 6 | 10.5 | 18.2 | 0.20 | 71.1 |
| 7 | 10.5 | 13.8 | 0.10 | 75.6 |
| 8 | 14.3 | 18.2 | 0.10 | 67.5 |
| 9 | 10.5 | 16.0 | 0.15 | 73.4 |
| 10 | 18.0 | 13.8 | 0.20 | 68.0 |

The 4-component mixture design is correctly displayed as a 3-dimensional triangle (i.e., a pyramid). For purposes here, once these data points were modeled against overall liking, the last 2 of the 4 components, c and d, were found to be ineffective predictors. The two ineffective components can be removed from modeling consideration (i.e., set to middling, cost-effective, levels). The levels for the two remaining components, when plotted, below, resemble (reasonably closely^{****}) a 3 by 3 factorial design and can be effectively modeled using a traditional response surface analysis (without regard for the usual mixture constraints).

To summarize, experimental designs with good projective properties allow weaker experimental effects to be set aside, yet still leave an effective design used as the basis for analyzing the remaining factors that require closer examination. What may be called traditional designs (e.g., Central Composite and Box Behnken) generally provide this projective capability. Other, more ad-hoc, computer generated designs (e.g., Alphabet Optimal) may not, and the projective property should be investigated before design selection. The investigation can be as simple as creating scatterplots of pairs of factor settings.

^{*}The “interpretability multiplier”, developed by George Box, is described in Response Surfaces, Mixtures and Ridge Analyses (by Box and Draper), page 277. That and other references embrace models that were not only statistically significant but had a model F statistic exceeding the tabled F-value by a factor of 10. The 3 to 5 times “rule” offered is a more lenient criterion based on experience, perhaps better suited to earlier stage, exploratory research. This more lenient guideline jibes nicely with a simple rule that has been applied to 1-degree of freedom (t) tests. The rule is based on translating t-values to correlations and on the view that correlations greater than .4 are at least borderline indicative of a meaningful relationship (or, in the case of averages, indicative of a meaningful difference). The rule suggests that t-values be at least 4 times the required tabled value. So, with 95% confidence (two-tailed) and more than 50 degrees of freedom, the rule says that t-values of about 8 (1.96 x 4) or more are indicative of not only statistical significance but are also large enough to exceed a minimum correlational standard.

^{**}To be sure, there are many other statistical criteria for assessing general model goodness. The Bayesian Information Criterion (BIC), Mallows’ Cp and the Akaike Information Criterion (AIC) are three examples. The Interpretability Multiplier is featured here because it is a direct extension of the commonly used and reported F-value.

^{***}There are two advantages to examining the effects of two factors when nested within the mid-level (0) of the third. The first is that data for an additional product is available that provides a 1-degree of freedom test of curvature. The second is the cleaner design which assesses main effects at the high (+1) and low (-1) levels of the two nested factors.

^{****}The off-setting of the center point creates some correlation in the otherwise orthogonal design. However, the simplicity of analyzing this quasi-factorial design data counters nicely the design flaw.

### Prudent & Knowledgeable Analytic Insight

#### Positive Predictive Value

In an earlier P&K entry, reference was made to the variability in p-values. The fundamental idea there was that while the obtained p-value, resulting from a specific statistical test, may signal a statistically significant result, variability in the p-value may cast doubt on the true significance, or at least the stability, of the result. This doubt is amplified when the statistical test result is compared to a pre-set and rigidly applied action standard (e.g., the result must be statistically significant with 95% confidence). Indeed, even though a test result is significant with 95% confidence, and so passes the action standard, the lower bound on a confidence interval drawn around that result may be closer to 80% or 85%.

This notion of uncertainty in statistical test results can also be expressed in terms of whether the same statistical conclusion would be reached if the test were redone (independently replicated). A statistical measure called the Positive Predictive Value (PPV) can be used to quantify this. Specifically, PPV is a probability of whether a specific statistical test result is indeed true or correct. PPV reassesses the results of a specific significance test by taking account of the error rates (alpha, the probability of incorrectly calling a statistical result significant when it is not, and beta, the probability of incorrectly calling a statistical result insignificant when it should have been) as well as the historical tendency to find statistical significance when having run similar tests (e.g., product tests within the same category performed within the past year) in the past. In formula form, PPV is:

PPV = (R – R*beta) / (R – R*beta + alpha).

R, in the formula, represents the rate of finding statistically significant results in past research. Specifically, it is the number of significant results divided by the number of insignificant results. This rate, R, can be quantified using the P&K database. For example, with reference to product tests in the edibles category for 2012, 498 statistical tests out of 1,448 had differences large enough to be taken as significant with 95% confidence. R is then 498 / (1,448 – 498), or .52. In a sense, R represents an expectation, or Bayesian prior, regarding the likelihood of obtaining significance for the next test.^{*}

As written above, alpha and beta are error rates associated with the incorrect statistical inferences. For purposes here, alpha is set to .05, consistent with the typical desire to test with 95% confidence; the researcher runs a 5% risk of making a mistaken judgment by calling the result significant when it’s not. Setting the beta risk, or conversely power, is a bit more problematic. Power is not often taken into account when setting sample size needs for the typical product test, especially those required at early stages of product development. Budget constraints hold sway and smaller samples, testing with 30 to 100 respondents, are usually employed. As such, it is not unusual to find power, and the beta error, to be hovering at around 50% (i.e., statistical tests are as likely as not to detect and call as significant a difference of a specific magnitude).

Returning to the formula for PPV and using as inputs R=.5, beta=.5 and alpha=.05, then PPV is equal to .83. In words, within the confines of 50% power and an anticipation (Bayesian prior, R) that a test is as likely to be significant as not, then when a statistical test result offers evidence of significance with 95% confidence, there is an 83% chance that the conclusion of significance is correct.

How sensitive is PPV to its parts? The most dominating effect is held by R. Numerically, this makes sense because of the restrictions placed on use of alpha (usually between .2, for 80% confidence, and .05, for 95% confidence) and beta (usually in the vicinity of .2, for 80% power, to .5, for 50% power) levels. Focusing specifically on levels of alpha, historically a lower rate of achieving statistical significance (i.e., smaller values of R) leads to greater influence of alpha levels on PPV. Re-expressed, reducing the alpha risk (e.g., from 95% confidence to 99%) matters most to increasing PPV when there is weak history of significance. Consider, for example, if historically significance was infrequently attained, e.g., R=.1, and beta errors were contained at around .2 (i.e., 80% power), then with alpha set to .05 the resulting value of PPV is .62. However, when increasing the desired level of confidence to 99%, and so reducing alpha to .01, PPV then becomes .89, a great improvement in the chance of attaining truth.^{**}

Conversely, strong evidence of past success, a large value of R, increases the probability of future truth and reduces the influence of statistical testing errors. Insert into the PPV formula a value for R of .8. With minimal power (at 50%) and alpha set to .05, PPV is .89. Reducing alpha to .01 then increases PPV to .98. The range of PPV values shrinks, as does the effect of alpha, when there is a stronger history of significance.
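The PPV figures quoted in this entry can be reproduced with a short function. The sketch reads the formula as (R – R*beta) divided by the whole of (R – R*beta + alpha), which is the grouping that returns the values given in the text. The function name `ppv` is an illustrative choice.

```python
def ppv(R, beta, alpha):
    """Positive Predictive Value.

    R     : odds of a significant result in past research
            (significant results / insignificant results)
    beta  : probability of missing a real difference (1 - power)
    alpha : probability of calling a null difference significant
    """
    true_positives = R * (1.0 - beta)
    return true_positives / (true_positives + alpha)

# R from the 2012 edibles example: 498 significant tests out of 1,448.
R_edibles = 498 / (1448 - 498)

baseline = ppv(R=0.5, beta=0.5, alpha=0.05)      # 50% power, even-ish odds
weak_95 = ppv(R=0.1, beta=0.2, alpha=0.05)       # weak history, 95% confidence
weak_99 = ppv(R=0.1, beta=0.2, alpha=0.01)       # weak history, 99% confidence
strong_95 = ppv(R=0.8, beta=0.5, alpha=0.05)     # strong history, 95% confidence
strong_99 = ppv(R=0.8, beta=0.5, alpha=0.01)     # strong history, 99% confidence

for label, value in [("R (edibles)", R_edibles), ("baseline", baseline),
                     ("weak, 95%", weak_95), ("weak, 99%", weak_99),
                     ("strong, 95%", strong_95), ("strong, 99%", strong_99)]:
    print(f"{label}: {value:.2f}")
```

Rounded to two decimals these come out as .52, .83, .62, .89, .89 and .98, matching the values in the text: the gain from tightening alpha is .27 with a weak history and only .09 with a strong one.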

Perhaps the greatest practical purpose of PPV is, in the same spirit as reporting confidence bounds around obtained p-values, to alert researchers that having achieved statistical significance for one specific test does not at all guarantee correct inference regarding the difference just tested. PPV can represent a lower bound on confidence. So, statistical test results are indicative of having met an action standard only when both the level of confidence and the PPV exceed a pre-set value. The calculation of PPV also reminds the researcher that current test results can be strengthened, or put into better context, by reference to tests performed in the past. This is the utility of R. The level of confidence adopted by the researcher should take account of this history. For example, to better prove the point that a new prototype is significantly better when previous attempts have failed, use a stricter confidence level, say 99%.

^{*}Use of the term “rate” to label R is a bit of a misnomer. R is a measure of the odds of finding a significant result compared to insignificant results.

^{**}The same result of a large range of PPV values occurs when setting beta, and power, to .5. The range of PPV extends from .5 to .83 as alpha declines from .05 to .01.

### Prudent & Knowledgeable Analytic Insight

#### Using the Effect Size to Better Frame Business Decisions

Consider a two-product test where a prototype (A) has been developed, as an improvement, to replace the product currently in market. The improvement was the addition of a more expensive ingredient, increasing unit production cost. A monadic design in-home use test was employed, such that respondents used one of the two products for a 5-day period. Each product was tested by 150 respondents. One of the evaluative criteria asked of respondents was purchase interest. The top-box purchase interest percentage for Prototype A was significantly greater, with 90% confidence (one-tailed test, to assess superiority of the improvement) than that for the current product: 23% vs. 17%. Based solely on statistical significance testing, this result could be construed as a “go” decision for replacing the current product.

However, the significant improvement in purchase interest for Prototype A does not take into account marketing issues, for example the increased production cost associated with the product modification. Without charging more at retail for the new product, the greater production cost (i.e., reduced profitability) can be overcome by an increase in sales. The research issue then becomes finding ways to couple statistical test results with marketing issues. Approaches proposed here are based on a modification of the effect size.

First, recall that the use of an effect size has been discussed in a few of the past P&K notes. The effect size is a standardized difference, between averages for example, that allows for easier comparison of these differences across results from different studies where different scales or data collection methods may have been used. Standardization is achieved by dividing the difference of interest by a measure of data variability, usually a pooled or average standard deviation. Reference had also been made to using historical norms to get a better sense of just how large or meaningful (from a marketing or product development perspective) an effect size is, with the purpose of relieving statistical significance from this task. For example, work with Cohen’s d suggests that a moderate effect size is around .5, when the difference is half the size of data variability. Reliance can also be placed in past data, such as from the P&K product testing database.

However, reliance on external databases, as well as significance testing, doesn’t capture the research nuances or address the marketing / product development reality of a specific test of interest. In addition, effect sizes don’t have a clear interpretation, other than that the difference in the numerator is a certain fraction of the variability in the denominator. The change or difference in performance between two products, as reflected by an effect size measure, should be tied to explicit R&D or marketing metrics to help make better sense of the product test results. An approach described below is to translate or convert the effect size into a percentage rate of change and to then apply this to a metric more closely associated with business decision making criteria, a two-step process. The result can then be compared to changes required to achieve a specific research or marketing goal.

First, converting an effect size to a rate of change is accomplished by exponentiation, e^{r}, where e represents the base for the natural logarithm, 2.718, and r is the effect size. (“r” is used to represent rate.) For example, with an effect size of .5, the rate of change is 1.65, or a 65% increase.

Next, the rate of change is multiplied by some quantity, Q, the change of which serves as the basis for decision making. Often, the quantity of interest is sales or some other financial performance metric. The example below makes this explicit.

Reverting to the two-product test example, the increase in purchase interest for Prototype A can be measured by an effect size, the standardized difference between the two top-box purchase interest percentages: (.23-.17)/0.40, or 0.15. (The difference in percentages is divided by an average standard deviation for .23 and .17, which is .40.^{*}) The rate of change, or growth, is obtained from exponentiation of the effect size, e^{.15}, which equals 1.16 or a 16% increase. Current product sales, the quantity Q above, are 250,000 units a year. When multiplied by 1.16, unit sales volume for Prototype A is estimated to be 290,000 units/year (assuming the prototype receives the same category management and marketing support as did the current product, e.g., distribution, pricing, awareness, etc.). The significant increase in top-box purchase interest ratings translates to an increase of 40,000 units or 16% over current sales. The marketing issue is then whether this is sufficient growth to justify the product change.

The above approach can be reversed and used as an action standard: estimate the difference in the top box purchase interest percentage required for a desired increase in unit sales volume. For example, management may dictate that a 25% growth in sales is required to pay for the increased cost for producing Prototype A. Using the same standard deviation (.40) as above, a top-box purchase interest percentage for Prototype A of 26% leads to roughly the desired 25% increase in sales^{**}. As such, a nine percentage point improvement in purchase interest, from 17% to 26%, is needed for Prototype A to surpass the financially based action standard. While the actual increase in Prototype A’s purchase interest, to 23%, was statistically significant, it failed to achieve the volume requirement set by management. Alternatively, if the management goal had been a 10% growth in sales then the move to Prototype A would have been acceptable regardless of significance.
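The forward and reverse calculations can be laid out as a short sketch. It assumes, as an interpretation of the footnote on the average standard deviation, that the variance of a top-box proportion p is p(1 – p). Intermediate values are rounded to two decimals to mirror the figures in the text. All function names are hypothetical.

```python
import math

def average_sd(p1, p2):
    """Square root of the average of the two binomial variances p(1 - p)."""
    return math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2.0)

def growth_rate(p_new, p_current, sd):
    """Exponentiated effect size: e^((p_new - p_current) / sd)."""
    return math.exp((p_new - p_current) / sd)

def required_top_box(p_current, sd, target_growth):
    """Reverse calculation: top-box proportion needed for a target growth."""
    return p_current + sd * math.log(target_growth)

# Forward: Prototype A at 23% vs. the current product at 17%.
sd = round(average_sd(0.23, 0.17), 2)          # rounded as in the text
growth = round(growth_rate(0.23, 0.17, sd), 2)
projected_units = 250_000 * growth             # current sales Q = 250,000

# Reverse: top-box score needed for 25% sales growth.
needed = required_top_box(0.17, sd, 1.25)

print(sd, growth, round(projected_units), round(needed, 2))
```

With these inputs the sketch returns a standard deviation of .40, a growth multiplier of 1.16, about 290,000 projected units, and a required top-box score of about 26%.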

In summary, traditional consumer survey statistical results can be re-expressed as rates of change and then used with financial metrics to provide outcomes more relevant to business decision making and less reliant on significance testing.

^{*}The average standard deviation is obtained by taking the square root of the average variances for the two products’ top box percentages.

^{**}The effect size, (.26 – .17)/.40, or .225, when exponentiated is 1.25, or a 25% increase.

### Prudent & Knowledgeable Analytic Insight

#### Acceptance Voting

A research goal in product screening is the identification of a most preferred product. For example, consider a test where three or more prototypes are evaluated. The prototypes are competitive amongst themselves such that only one of them will be taken forward for more extensive development and market introduction. As is typical in early stage screening, where efficient designs and lower cost consumer tests are used, a group of respondents is asked to evaluate all prototypes. Upon completion of all evaluations, these respondents may be asked for a preference.

The simplest preference measure is obtained when respondents are asked to identify, and vote for, the single product they most preferred. Each respondent selects one and only one choice. The most preferred product is that which garners the greatest number or percentage of these preference votes. This is referred to as Plurality Voting. A difficulty with this simple approach arises when two or more of the products are perceived as similar, having similar sensory profiles, fulfilling similar needs or providing comparable benefits. The similarity leads to vote splitting.

Vote splitting is encountered when two or more products are considered equally preferable, forcing (because respondents can only vote for one) preference to be split among them. This splitting leaves open the possibility of some other less preferred and different product receiving the greatest amount of preference. As an example, two of three products tested are equally preferable and together account for 60% of preference votes. Re-expressed, if either one of the two products, but not both, were tested, whichever was present would receive 60% of the preference votes. The third, weaker, product would receive only 40% of the preference votes. The presence of the two similar products together in one test creates vote-splitting. With no strongly perceived difference between the two similar products, their preference would be randomly split, each of the two products receiving around 30% of the preference vote. The least preferred product, having no other product available to dilute its preference, wins with 40% of the preference vote.

The problem of vote splitting is readily detected by use of additional information. For example, it’s very likely that the products were rated on some measures of overall acceptability (e.g., overall liking, purchase interest). With reference to the example above, the two similar and equally preferable products would have received roughly equal overall liking ratings that would have been greater than those received by the less preferred product. The inconsistency of results between preference votes and the overall liking ratings focuses attention on the vote-splitting problem. However, there still needs to be an assessment of preference to identify that best product.

At the risk of over-simplification, the cause of vote splitting is restricting voters to one and only one vote for their most preferred product. Moving to the opposite extreme, all products can be provided with a preference evaluation in the form of a complete ranking. Noting that a product ranked first is most preferred, total sample percentages of first-ranked results should provide preference results similar to plurality voting. So, first ranked votes alone do not provide a solution to vote splitting. Scoring systems based on ranked data, such as Borda counts, can be used to get more quantitative information from the ranks. With the Borda approach, ranks are converted to points and the product with the most points wins. With 4 products, and 4 ranked positions, for example, the top ranked product for a respondent receives 3 points, second ranked receives 2 points, third ranked receives 1 point and last ranked product receives no points.^{*} Returning to the 3-product preference example above, it is expected that the two most preferable yet indistinguishable products would evenly split first and second place ranks. The third less preferred and different product would be relegated to last position and receive no points.^{**} Ranking provides some progress over preference alone, though, by providing the recognition that the weaker product is indeed weak.

Regardless of utility, the ranking task may become less effective as more products are considered, as when several or more products are evaluated and the research desire is to have respondents provide a preference ranking across them all. A complete ranking may be cognitively difficult and time consuming. The complete ranking task can be made easier by working hierarchically, also referred to as chunking, placing products into smaller and smaller piles until the complete ranking is achieved. But, again, this is a time consuming and cognitively taxing task, especially if only a small subset of products is actually liked by a respondent. An alternative to the limitations of, at one extreme, plurality voting and, at the other extreme, complete ranking, is acceptance voting.

With acceptance voting, respondents vote for or select just those things (e.g., products) that are preferable or of interest. For use in a product test, respondents are asked, “Thinking about the products you’ve just tried, please identify which, if any, you would consider purchasing (or using) if they were available to you at the store where you shop.” Respondents are free to select as many or as few, or none, of the products, as desired, with the number selected itself being a very useful summary statistic. (The number selected is referred to as the size of the respondent’s preference consideration set.) For the three product example above, if two of the products were equally preferred then by acceptance voting both would receive a vote. (The exclusion of the third, less preferred, product is taken as a form of negative acceptance.) Perhaps the biggest advantage of acceptance voting over a complete ranking (e.g., Borda counts) is that respondents can provide clear delineation between what is and is not acceptable by virtue of having selected a subset of products and not selecting others.
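To make the contrast among the three voting schemes concrete, the sketch below tallies plurality votes, Borda points and acceptance votes for a made-up panel of 100 respondents and three products, where A and B are similar and C is different. The ballots are hypothetical and chosen to echo the 30/30/40 split in the vote-splitting example: fans of A or B accept both, while fans of C accept only C.

```python
from collections import Counter

# Each ballot: (complete ranking from best to worst, set of accepted products).
ballots = (
    [(("A", "B", "C"), frozenset({"A", "B"}))] * 30
    + [(("B", "A", "C"), frozenset({"A", "B"}))] * 30
    + [(("C", "A", "B"), frozenset({"C"}))] * 20
    + [(("C", "B", "A"), frozenset({"C"}))] * 20
)

def plurality_votes(ballots):
    """One vote per respondent, for the first-ranked product."""
    return Counter(ranking[0] for ranking, _accepted in ballots)

def borda_points(ballots):
    """Borda count: with k products, first place earns k-1 points, last earns 0."""
    points = Counter()
    for ranking, _accepted in ballots:
        top = len(ranking) - 1
        for position, product in enumerate(ranking):
            points[product] += top - position
    return points

def acceptance_votes(ballots):
    """One vote for every product a respondent marks as acceptable."""
    votes = Counter()
    for _ranking, accepted in ballots:
        votes.update(accepted)
    return votes

plurality = plurality_votes(ballots)
borda = borda_points(ballots)
acceptance = acceptance_votes(ballots)
print(plurality, borda, acceptance)
```

Under plurality voting C wins with 40 votes against 30 each for A and B. Under Borda scoring and acceptance voting A and B both finish ahead of C (110 vs. 80 points, and 60 vs. 40 acceptance votes). The tie between A and B remains, as expected when the two are indistinguishable.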

The percentage selecting each product serves as a simple summary measure of acceptance. In addition, the voting data allow for more detailed analysis of co-incidence of selection, the extent to which selections are correlated. Statistical tools such as hierarchical clustering or TURF analysis can be used to summarize these relationships.

Acceptance voting can be augmented by asking for a ranking of just the things selected. For the respondent who selects or votes for 3 of, say, 10 products that were presented, the 3 are then ranked in terms of preference. The rank data can be used to create a more sensitive acceptance measure. Rather than adopting a Borda-type counting measure, though, the ranks are re-expressed by use of an exponential (e.g., Zipf) transformation. The Borda count procedure treats ranked positions linearly, with one unit changes as ranks increase or decrease. The change in rank from, say, first to second is treated the same way as a change in rank from third to fourth or from ninth to tenth. However, there is good reason to suspect that the change in preference perception underlying the ranks, or ranked positions, is not linear but decelerates quickly as ranking declines. Reliance on an exponential distribution is meant to reflect this non-linear change associated with change in ranked positions. As such, the ranked data are changed or transformed in accordance with an exponential distribution.^{***} (Preference behavior is said to follow a power law.) The results of this transformation are rescaled to reflect a share of preference which can be calculated separately for each respondent as well as averaged to provide a total sample summary. Use of this transformation also highlights differences in results as the size of respondents’ preference consideration set varies. Essentially, for a given ranked position, a product is rewarded with greater share amongst respondents who have smaller sets, i.e., are more selective. For example, the product ranked first in a smaller consideration set receives greater share of preference than the product ranked first within a larger set. Generally, the acceptance voting approach with partial ranks converted to shares will not necessarily match results obtained from a Borda scoring system.

More generally, acceptance voting, augmented by ranking, is a very effective measurement tool for understanding brand usage. Resulting aggregate share estimates have been shown to be very highly correlated with market share data. Also, respondent level share data correlates well with share of wallet data. In addition, the voting procedure is easily applied to ideas or benefits for purposes of screening or assessing importance. Respondents are presented with a list of ideas or benefits, asked to read through the list and then select just those of greatest interest or relevance to them in their decision to purchase or use a product. The percentage of times ideas or benefits are selected has been shown to be highly correlated with more technically exhaustive analysis results provided by statistical techniques such as Shapley Value.

Lastly, measurement approaches such as acceptance voting work well (e.g., provide a valid and stable estimate of preference and increase the likelihood, over use of plurality voting, of providing a winning product) because the measurement process of selecting relatively few products, brands or benefits parallels the cognitive process used by consumers in decision making. In general, evaluations are provided in a “good enough” or satisficing manner based on simple lexicographic heuristics such as “take a few” or “elimination by aspects”. It is instructive to know, for example, that the typical brand consideration set for existing fast moving consumer product categories contains around 4 brands. Consumers do not hold in memory extensive rankings of all brands nor do they contemplate the performance of products in a massively multidimensional space implied by the 30 or more product characteristics used in product test questionnaires. Rather, consumers simplify and approximate to achieve greater efficiency. Perhaps the best data collection methods and analyses are those that mimic this heuristic process.

^{*}There are other vote counting methods that rely on complete rankings, Single Transferable Voting (STV) and Majoritarian Compromise (MC) being two examples. With STV (also known as the Hare System), the product that receives fewest first-choice votes is removed, with its votes transferred to whatever product was ranked second. The product elimination continues until a single product garners the most votes. For MC, votes are accumulated, from ranked first, to second, etc., until one product receives a majority. Complete rankings can also be converted to pairwise preference votes, where a product is declared most preferred (a Condorcet winner) if it gathers more preference votes in all pairs in which it appears. While all use ranked data, these voting procedures differ from the Borda approach which is a quantitative score system, using ranks as weights.

^{**}Clearly, though, if indeed the two best products are perceived similarly and are equally preferred, there is no way to separate them from preference data, including ranking data, alone. Additional detail must be provided, perhaps in the form of other product performance measures or consideration of production cost.

^{***}Preference shares obtained from partial ranking can be estimated using Zipf’s Law (or the Zipfian distribution). The law states that the share associated with a specific product (in this application) is inversely proportional to its rank order in the respondent’s consideration set. Preference share can be calculated from 1 / (r × ln(1.78R)), where:

- r is the rank of the product for which share is to be estimated,
- R is the total number of products in the consideration set,
- “ln” represents the natural logarithm.

As an example, the 10th ranked product of 10 total products in the consideration set is estimated to obtain a share of roughly 3.5 percent. Both r and R are set equal to 10.
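The share formula in this footnote can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `zipf_share` is a hypothetical label, not something taken from the P&K materials.

```python
import math


def zipf_share(r: int, R: int) -> float:
    """Estimated preference share for the product ranked r
    in a consideration set of R products: 1 / (r * ln(1.78 * R))."""
    if not 1 <= r <= R:
        raise ValueError("rank r must lie between 1 and R")
    return 1.0 / (r * math.log(1.78 * R))


# Shares for every rank in a 10-product consideration set.
shares = [zipf_share(r, 10) for r in range(1, 11)]

# The 10th of 10 products: roughly 3.5 percent, as in the example above.
print(f"10th of 10: {shares[-1]:.1%}")

# The shares are an approximation, so they sum to roughly (not exactly) 1.
print(f"sum of shares: {sum(shares):.3f}")
```

Because the formula is an approximation, the estimated shares across all ranks add to slightly more than 1 for a set of 10 products; they could be rescaled if exact shares are needed.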

### Prudent & Knowledgeable Analytic Insight

#### Meaningful Differences

All statistical tests are based or focused on some sort of difference between things, either explicitly or implied. A difference between two averages, for example, sits in the numerator of the test statistic (e.g., t-test) which is then divided by a quantification of noise, the numeric equivalent of the “null hypothesis”. If the difference is substantially greater than the noise, say twice the size, then the difference is deemed statistically significant with some accepted level of confidence (e.g., 95%).

But is the difference, found to be statistically significant, meaningful? Is the difference the right size, either small enough or large enough, to serve as the basis for a business decision worth millions of dollars? To come at this from a different perspective, all serious sample size calculations, including considerations of power, are based on a notion of some difference being large (or small) enough to be detected and called as significant with some level of confidence. But how is that difference set and is that difference meaningful from a decision making perspective?

Too often, the definition of “meaningful” has been set, without debate, as that which is statistically significant with some accepted level of confidence… if the difference was significant with 95% confidence then it must have been meaningful. However, the statistical test, and resulting significance, is often held hostage by research budgets which directly affect sample size. Base sizes upon which statistical testing is to be performed may either be too small… making it difficult for even large differences to be considered statistically significant with enough confidence… or too large… making it too easy for otherwise small differences to be taken as statistically significant with a possibly exaggerated level of confidence.

As such, alternative approaches can be used to identify differences that are relatively larger (or smaller) than those typically encountered. A qualitative, expert elicitation, approach is to survey researchers who have done product testing and ask them for a sense of what is a “meaningful difference”. A more quantitative, database driven, approach makes use of normative data from past research, using historical differences to define “meaningful”.

As an example of this application, consider use of the P&K database for edible products tested in 2012, through June. The table below reports the magnitude of differences based on use of a 9-point overall liking rating scale that are associated with various percentiles. (The differences were obtained from data for 65 projects, 353 products, with over 1,400 differences calculated between products, only within the same project. All products were evaluated on samples between 50 and 150 respondents. The argument for using this database to assess significance relies on sheer volume of differences across a variety of products and conditions, creating an empirical distribution of differences that have occurred historically.)

| Percentile | Difference |
| --- | --- |
| 20 | .13 |
| 50 | .38 |
| 80 | .88 |
| 90 | 1.18 |
| 95 | 1.41 |

As per the table, the typical or median difference is .38. Also, a difference of .13 or smaller occurs so infrequently (20% of the time or less) that a difference of this size can be used to define “equivalence” between products.

Of greater interest here, a large meaningful difference should be one that occurs sufficiently infrequently, exceeded by relatively few other differences. Consider the .88 difference associated with the 80th percentile (a two-tailed perspective). A difference of this size is exceeded only 20% of the time. Differences greater than .88 can be considered significant with at least 80% confidence.

In sharp contrast, though, a difference of only .32 is required for an equivalent amount of confidence based on traditional statistical significance testing and a sample of 100 evaluations per product. (A .43 difference is required with 50 evaluations per product. Both sample sizes are based on comparing averages around 7 on a 9-point scale, with standard deviations of 1.7.) Why the great disparity? Are the statistical tests too sensitive or is the database bloated with strange products that differ greatly? (Recall that the differences in the database are calculated only within a project, eliminating comparison of very different products taken from different tests.)

For one perspective, to provide an independent reference point, consider use of the effect size (consistent with what has been referred to as Cohen’s d). The effect size is a standardized difference, a difference divided by an estimate of the standard deviation of ratings^{*}. For example, a difference between 7 and 7.88, or .88 points, is divided by the average standard deviation typically obtained from ratings of this magnitude. An average standard deviation taken from the P&K database is 1.7. The effect size is then (7.88 – 7)/1.7, or .52, what historically (via Cohen’s work) would be considered a “moderate” effect size. Conversely, the difference referenced above, .32, required for 80% confidence using traditional statistical testing, has an effect size of .19, small by Cohen’s reference standards.
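The effect size arithmetic can be checked with a short sketch. The helper name `effect_size` is hypothetical; the inputs (.88, .32 and the standard deviation of 1.7) are the figures quoted in the text.

```python
def effect_size(difference: float, sd: float) -> float:
    """Standardized difference (in the spirit of Cohen's d):
    the raw difference divided by a standard deviation."""
    return difference / sd


SD = 1.7  # average standard deviation quoted from the P&K database

# Database-based 80th percentile difference vs. the difference
# required by a traditional test with 100 evaluations per product.
print(round(effect_size(0.88, SD), 2))  # moderate by Cohen's guidelines
print(round(effect_size(0.32, SD), 2))  # small by Cohen's guidelines
```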

As a second perspective, perhaps the discrepancy is caused by standard deviations, calculated from variability in respondent ratings, which are too small. The median difference, .38, shown in the table above holds more meaning than just the typical difference. It also serves as the basis for an estimate of the standard error of the difference between two averages. (The estimate has a robust flavor as the median absolute difference of all differences in the database.) Multiplied by 1.5 (taken from Lenth’s Method), this median difference provides an estimate of the standard error of the difference between two averages, .57. From this perspective, the database differences aren’t so large. However, from the traditional statistical perspective, this is a lot of variability. Assuming, for example, that a typical sample size is 100, the implied standard deviation is 4, more than twice the size of the respondent-based estimate of 1.7.
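A minimal sketch of this back-calculation follows. It assumes the standard error of a difference between two independent averages, each based on n ratings with a common standard deviation, is sd × √(2/n); that formula and the function name `implied_sd` are assumptions made for illustration, since the text does not state how the value of 4 was derived.

```python
import math


def implied_sd(median_abs_diff: float, n_per_product: int,
               multiplier: float = 1.5) -> tuple[float, float]:
    """Return (standard error of the difference, implied rating SD).

    The standard error is the median absolute difference times the
    Lenth multiplier (1.5). The implied SD assumes two independent
    samples of equal size: se = sd * sqrt(2 / n).
    """
    se_diff = multiplier * median_abs_diff
    sd = se_diff / math.sqrt(2.0 / n_per_product)
    return se_diff, sd


se, sd = implied_sd(0.38, 100)
print(f"standard error of difference: {se:.2f}")
print(f"implied standard deviation:   {sd:.1f}")
```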

A way out of this inconsistency when interpreting statistical significance tests is to think in terms of bounds. The lower bound is represented by traditional statistical tests; the upper bound takes account of the database results. Differences can be statistically significant, exceeding the lower bound yet not be “meaningful” in light of an empirical database of historical differences. Perhaps then, truly meaningful differences are both statistically significant and historically large… and meet Cohen’s criterion of at least “moderate”.

^{*}The standardization allows for comparison of differences across data from different studies which may have used different scales and summary statistics or different methodologies (e.g., acceptance testing vs. triangle discrimination testing), or which may come from different countries or from different points in time. Effect sizes form the basis for a meta-analysis, an analysis of the statistical results accumulated over many studies or separate pieces of research. The accumulated evidence, in the form of effect sizes taken from the different studies or pieces of research, is used to increase the statistical power, sensitivity and generalizability of conclusions that could not be obtained from any one study alone.

### Prudent & Knowledgeable Analytic Insight

#### Acceptance Testing

A marketer's concern over product differentiation is amplified when one product is meant to replace the other... and consumers aren't supposed to notice. The concern grows exponentially with size of franchise. Either for cost or ingredient supply reasons, a test product is developed as substitute for another that is currently in the marketplace. Typically, from the R&D perspective, some sort of discrimination trial is executed, either with sensory panels or consumers. The marketing concern is whether the replacement product is different such that it affects acceptance, and sales, to the point that product users, especially heavy users, would object if their current product was "replaced" (i.e., alienation). So, for consumers of the products, is discrimination the best research objective?^{*} Are better questions: "Is the test product good enough?"; "Are the two products that were tested similar enough?" Could we simply acknowledge that products might differ (and often, the difference is physically obvious to consumers) but that their performance was "close enough"? Does discrimination matter if there is no significant decline in acceptance?

It is quite possible to build a statistically sensitive acceptance test based on a sequential monadic design.^{**} The focus, with regard to an action standard, is on overall liking ratings. The sensitivity of the design can be expressed by the sample size required to detect a difference of interest. For sample size determination, as an example consider overall liking ratings in the range of 7.0, on a 9-point scale. A typical standard deviation is 1.7 for averages in this range. The next step is to establish a "null" difference against which obtained product differences in averages should be compared. The “null” value is the difference between the test and current product that must be exceeded before the test product is considered different, in this case significantly worse than current. The decision of which “null” value to use is pretty subjective and may depend on past experience. One perspective is to rely on academic research to provide a typical or meaningful difference. For example, Cohen’s effect size (i.e., Cohen’s d) guidelines can be applied, where an effect size (the difference of interest divided by a standard deviation) of .5 is considered moderate. Translating that effect size to the example here suggests that a difference of .85 is appropriate. (The difference of .85 is half the standard deviation value of 1.7.) Coincidently, a difference of this magnitude falls at the 80th percentile in the P&K “meaningful difference” database, larger than 80% of differences between products tested previously. Then, with 95% confidence (one-tailed, as we wish to statistically focus on the significant decline in acceptance for the test product) and 80% power, a sample of 50 respondents is required to call as significant a difference of .85 or larger.

While the difference of .85 used above may strike researchers as too large (although when expressed as an effect size, .5, it is moderate), the goal was to offer a hurdle that would safely separate smaller, perhaps overly sensitive differences, from what historically is seen as “large enough to matter”. Differences that exceed 80% of historic differences define that. However, for those wishing to use smaller differences as benchmarks, a difference of .6 requires a sample of 100. The median difference in the P&K database is .38, requiring a sample of 250 respondents.
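The sample sizes quoted above can be approximated with the usual normal-approximation formula for comparing two averages, n = 2 × ((z_confidence + z_power) × sd / difference)². The sketch below is offered under that assumption; the text does not say which formula was actually used, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_product(difference: float, sd: float,
                            confidence: float = 0.95,
                            power: float = 0.80) -> int:
    """Approximate ratings needed per product to detect `difference`
    with a one-tailed test (normal approximation, equal SDs)."""
    z = NormalDist()                   # standard normal distribution
    z_alpha = z.inv_cdf(confidence)    # one-tailed critical value
    z_beta = z.inv_cdf(power)
    n = 2.0 * ((z_alpha + z_beta) * sd / difference) ** 2
    return math.ceil(n)


for diff in (0.85, 0.60, 0.38):
    print(diff, sample_size_per_product(diff, sd=1.7))
```

Under these assumptions the calculation gives 50 and 100 respondents for differences of .85 and .6, and about 248 for the median difference of .38, which the text rounds to 250.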

There are design and analysis embellishments to the acceptance test that can be used to enhance product differentiation. For example, the analysis, and action standard, should take account of order effects, especially as second-position ratings tend to show greater discrimination. Next, a preference measure may be added, after evaluation of both products. Evidence of product acceptance from both the overall liking ratings and preference can be combined to provide a more stable statistical result. (See the earlier Prudent & Knowledgeable entry on "Combining Evidence".) Also, with the objective of assessing similarity, a direct question asked of respondents at the end of the interview is: "Are the two products you just tried similar enough that you would be satisfied using either?"

This short note was written to serve a couple of purposes. The first was to offer what might be a surprising result that acceptance tests provide useful consumer perspective with relatively small samples. The second purpose was to invigorate use of acceptance testing, especially when coupled with discrimination test results obtained from sensory panels. The cognitive notion of acceptance, as a form of affect, differs from the mental task required for discrimination. A merging of results, discrimination and acceptance, would offer a more complete product evaluation.

^{*}While researchers closer to R&D favor discrimination tests, those closer to marketing have favored the use of preference testing: differentiation vs. acceptability. The best course of action includes a quantitative assessment of both perspectives, along with consideration of franchise risk and possible production cost savings, to provide a complete risk analysis.

^{**}Consumers are screened for heavier usage of the product to be substituted. The test is "branded", in that respondents know the brand being evaluated, although the specific products tested are blinded (i.e., the respondent does not know which product is current and which is the alternative).

### Prudent & Knowledgeable Analytic Insight

#### Variability in Statistical Test Results

When reviewing statistical results from a product test, we may find that the significance of a difference between ratings for two products falls just short of the pre-set action standard. But is that shortfall a true indication of “no difference”? How often do we find inconsistencies in the KPI (key performance indicators), where, say, the difference in preference is significant but the difference between overall liking averages is not?^{*} When we find that two different data collection methods (or when testing the same products at two different times) seemed to offer differing statistical sensitivity, and come to different conclusions, are they truly different? Underlying these questions is the fact that the statistical summary measure we rely on most, the p-value (or level of confidence), has variability. The p-values obtained from statistical tests are not fixed, even though we treat them as such.

Just like all other statistics (e.g., averages, percentages), p-values obtained from statistical tests (e.g., t-tests, z-tests) have a standard deviation which can be used to put a confidence bound around them. For example, a test of the difference between average overall liking ratings for two products, based on 150 ratings for each product, was significant with 93% confidence (a p-value of .07). With a pre-set action standard requiring significance with at least 95% confidence, the two products’ ratings were not considered different. However, a 95% confidence bound was drawn around the p-value. It ranged from .15 (85% confidence) to .02 (98% confidence). Technically then, the difference between average product ratings could be considered significant as the required confidence level, 95%, was encompassed within the p-value confidence interval.^{**}

Consider another example. Respondents were asked to perform two discrimination tests, using the triangle test methodology. Results of the two tests were cross-referenced. With an assumption of randomness, or guessing, 11% (1/9th) of respondents should provide consistent discriminant ability. Samples of three different sizes were tested: 40, 100 and 200 respondents. Each of the three samples showed 20% discrimination. With 40 respondents, the difference between 11% and 20% was significant with about 94% confidence (a p-value of .06). However, a 95% confidence bound drawn around the p-value extended from .43 (significant with 57% confidence) to .0001 (significant with 99.9% confidence). This huge range should give the researcher immediate cause for concern.

p-value variability will be closely related to power, the ability to detect a difference, but may offer a more direct measure of uncertainty in test results. With regard to the sample of 40, this provides the statistical ability to call as significant a 9 percentage point difference (from 11% to 20%) with 95% confidence, but with only about 50% power. So, p-value variability makes it very apparent to the researcher just how wobbly the test results are, and may be a more telling indicator than poor power.

With 100 respondents, significance ranged from 78% to 99.99% around a result of 99.4% (with 80% power). With 200 respondents, significance ranged from 96% to 99.99% around a result of 99.9% (with 97% power).

In light of this variability, the confidence bounds drawn around p-values can be used as the basis for constructing action standards.

- Liberal application: the upper bound on a 95% confidence interval reaches at least the desired level of confidence. Used for exploratory research.
- Conservative application: the lower bound on a 95% confidence interval must be at least the desired level of confidence. Used for confirmation, to guarantee significance.

^{*}That preference (as a comparative evaluation) and ratings of liking (intended to be an “absolute” measure) differ should not surprise. The respondent cognitive process and the context within which these two measures are obtained can differ greatly.

^{**}Confidence intervals for p-values are best calculated using simulation. Intervals are more stable when a log transformation of the p-values is used. Also, because of the bounded nature of probabilities (can’t be smaller than 0 or greater than 1), the p-value confidence interval is not symmetric.

### Prudent & Knowledgeable Analytic Insight

#### Combining Evidence

The action standard for a product test will often be based on more than one overall performance measure. For example, after evaluation of all products, respondents may be asked to rate each on an overall liking scale as well as state a preference. These product performance criteria are consistent with an action standard that is framed in terms of both statistically significant liking as well as preference. But often, these results conflict, specifically when one result shows statistical significance, at some pre-set level of confidence, and the other does not. To reach a decision about whether the products do indeed meet the action standard… and differ significantly…, statistical sensitivity as well as stability (to say nothing about a researcher’s sanity) can be enhanced by combining results from different statistical tests.

Consider a two-product test example where both preference and overall liking ratings (amongst usually several other product attributes to be evaluated) were obtained from a sample of consumers. The action standard called for significance with at least 95% confidence before the action of going to market with Product A would be taken. The difference in overall preference favored Product A but was significant with only 84% confidence. A t-test of the difference in overall liking averages also showed Product A’s superiority, with a greater level of liking significant with 98% confidence. The two results do not offer a consistent conclusion.

The statistical suggestion here is to combine these two results… these two pieces of product performance evidence. The test results can be combined to create a single summary level of confidence regarding the better performance of Product A. Using the Logit Method for Combining p-values^{*}, the p-values for the two tests (.17 for the preference test and .02 for the t-test of the difference between overall liking averages) were combined to yield a p-value of .035, or significant with 96.5% confidence, significant enough to exceed the action standard and recommend moving forward with Product A.
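As a rough illustration of the idea, the sketch below combines p-values on the logit scale, using one common formulation of the logit method: the test statistic is the sum of ln(p / (1 − p)) across tests, and its null distribution is approximated here by simulating uniform p-values. It assumes the tests are independent, which is a simplification; the result quoted above was obtained with bootstrapping that accounts for the correlation between the two tests, so this sketch is not expected to reproduce the .035 figure. The function name, simulation size and seed are all illustrative choices.

```python
import math
import random


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def combine_pvalues_logit(p_values, n_sim: int = 100_000,
                          seed: int = 12345) -> float:
    """Combined p-value from the sum of logit-transformed p-values.

    ASSUMPTION: the individual tests are independent, so under the
    null hypothesis each p-value is uniform on (0, 1). The null
    distribution of the sum is approximated by simulation.
    """
    rng = random.Random(seed)
    observed = sum(logit(p) for p in p_values)
    k = len(p_values)
    hits = 0
    for _ in range(n_sim):
        total = 0.0
        for _ in range(k):
            u = rng.random()
            while u == 0.0:          # avoid log(0) at the boundary
                u = rng.random()
            total += logit(u)
        # Small p-values give large negative logits, so count the
        # simulated sums that are at least as negative as observed.
        if total <= observed:
            hits += 1
    return hits / n_sim


combined = combine_pvalues_logit([0.17, 0.02])
print(f"combined p-value (independence assumed): {combined:.3f}")
```

Treating the tests as independent generally overstates the combined evidence when the two measures are positively correlated, as preference and liking usually are, which is one reason to prefer an approach that models the correlation.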

Combining test results is an effective approach to accumulating and summarizing all evidence about product performance into a single result. Combining results also effectively deflects arguments favoring one result over another. Rather than debate the goodness of any single statistical result (e.g., preference vs. liking), their combination provides better use of all data, with greater statistical stability. The combined result can be used effectively to address the “action standard”.

As a final comment here, it should not be surprising that preference results differ from ratings of liking. The two pieces of product evidence are collected from respondents at different times and in different contexts during the product evaluation process. In addition, both measures can be strongly affected by the order of product presentation. These order effects should also be considered as evidence of product performance and be taken into explicit account when summarizing product performance.

^{*}Bootstrapping was used for estimation and takes account of the correlation between the two statistical test results. Additional calculation details are available upon request.

### Prudent & Knowledgeable Analytic Insight

#### Taking Account of Order Effects in Two Product Tests

Sometimes it’s difficult to shake the idea that order effects are a nuisance. In a product test where respondents have been asked to evaluate more than one product, the ratings for a product can change substantially depending on what other products are included in the test and the order in which respondents are exposed to them. Order effects make interpretation difficult and their presence leads to questions.

How can product performance be effectively summarized, especially when a one-number summary is required (as per an action standard) to judge overall product performance? With regard to overall liking ratings, for example, is it best to report an average based only on first-position ratings? Or are total sample averages the better summary? Are there other statistics that would better summarize a product’s performance and more effectively take account of the order effects?

Before a solid answer can be given to these questions, there needs to be a review of the data. Also, whatever summary is ultimately provided must address, and be consistent with, the purpose of the research. At the end of this process, it’s a good bet that order effects won’t be seen as nuisance, but rather as valuable detail that can be used to create a very effective performance summary. It will also hopefully be clear that sequential monadic designs are capable of providing substantial detail about product performance beyond that provided by monadic designs.

Consider a specific two-product test scenario. Respondents evaluated both products in accordance with a sequential monadic design. Rotations were imposed so that each product was seen equally often in first and second position. The marketing goal is to identify whether the client’s product is “best” such that if “best” status is achieved then the client will market the product. So, the research goal is to define “best” and use that as the basis for the action standard.

Taking account of the sequential design, a definition of “best” products^{*} is:

(1) The best product is rated strongly when evaluated in first position. This can be quantified by comparing ratings of both products when each was seen in first (the monadic position).

(2) Ratings of best products don’t decline significantly when evaluated after another product (i.e., when seen second). Any decline should be considered a penalty. This can be quantified for each product by comparing ratings from first position to ratings in the second position, (1st - 2nd).

(3) When best products are seen first, and act as context for the other product seen second, the ratings of the second product should decline. The best product negatively affects perceptions of products that follow. To quantify this for each product, compare ratings when one product is seen first vs. when the other product is seen second.

Results from these three steps can be used to provide a summary measure: (1) - (2) + (3). (This summary works well for netting the effects of best performing products. It is assumed that the values of (1) and (2) are positive. The value of (2) acts as a penalty, reducing the value of (1).) The standard error of this difference (which takes account of the correlation between the three components) can be used to assess significance.

The table below provides an example. A sample of 150 respondents evaluated the two products, A and B. A 9-point overall liking rating was used to capture overall performance judgments. Product A will be the focus of the example calculations.

| Order of appearance | Product A | Product B |
| --- | --- | --- |
| 1st | 6.2 | 5.8 |
| 2nd | 6.0 | 5.2 |

(1) The monadic or first-position difference comparing Products A and B is (6.2 – 5.8), or .4, favoring Product A.

(2) For Product A, there is a decline in ratings between when the product is seen first or second, from 6.2 to 6.0. While ratings of Product A don’t drop considerably when seen in second position, after exposure to Product B, there is still a penalty of .2. Note that the truly good product will not falter when seen second.

(3) The third difference is the effect or impact of Product A’s performance on ratings of B, when B is evaluated in second position. Specifically, the calculation is: (A’s average when A is rated first – B’s average when B is rated second). This should be a large positive number if Product A dominates. The difference, taken from the table above, is 1.0, (6.2 – 5.2).

Combined as (1) – (2) + (3), the three differences yield a net of +1.2 (.4 – .2 + 1.0), which can be tested for significance (vs. a “null hypothesis” of no difference). The truly superior product has a positive value here. While Product A appears to be marginally superior when evaluated just based on first position results, A’s dominance is more the result of its effect on perceptions of Product B. Imagine that the client’s interest was in introducing to market a strong competitor to Product B, which is manufactured by another company. Product A then provides strong evidence for successful introduction (in conjunction with an effective marketing program), especially amongst consumers who currently use Product B. The context of this third difference matches the “real world” situation where many consumers will have already tried Product B. Indeed, the utility of the approach described above is that it provides three different views of a product’s perceptions, each framed by a different context. The truly good product will succeed across these contexts.
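The three components and their net can be computed directly from the four cell averages in the table. The sketch below only reproduces the point estimate; the significance test would also need the standard error described earlier, which requires respondent-level data. The function and argument names are hypothetical.

```python
def best_product_summary(focal_first: float, focal_second: float,
                         other_first: float, other_second: float) -> dict:
    """Order-effect summary for the focal product in a two-product
    sequential monadic test, from the four cell averages."""
    monadic = focal_first - other_first    # (1) first-position advantage
    penalty = focal_first - focal_second   # (2) decline when seen second
    context = focal_first - other_second   # (3) effect on the product that follows
    return {
        "monadic": monadic,
        "penalty": penalty,
        "context": context,
        "net": monadic - penalty + context,  # (1) - (2) + (3)
    }


# Product A as the focal product, using the averages in the table.
result = best_product_summary(focal_first=6.2, focal_second=6.0,
                              other_first=5.8, other_second=5.2)
for name, value in result.items():
    print(f"{name}: {value:+.1f}")
```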

^{*}A definition of “best” can also be created for larger sets of products, when respondents are asked to evaluate 3 or more products.

### Prudent & Knowledgeable Analytic Insight

#### Reverse Causality

At the risk of overstatement, it’s possible that almost every statistical model constructed from respondent rating data obtained from a product test has supposed that some attribute performance, as measured by a set of ratings, affected or influenced overall acceptance. Causality, often implied, always flowed from perceptions of specific attributes, as measured by ratings, to some overall assessment of product performance, e.g., an overall liking rating. This modeling assumed that respondents are introspective enough that researchers could derive from those attribute ratings what motivated acceptance. While the performance of specific attributes may indeed influence a respondent’s perceptions, during consumption or usage, the fear is that the rating process itself betrays the causality.

What if respondents’ ratings aren't that insightful and are incapable of providing more than a one-dimensional evaluation consistent with a halo effect: liked or disliked? What if having asked overall liking first, as is often done, leads to attribute ratings offered solely in defense of the liking rating, as per use of heuristics such as assimilation and satisficing? The result would most likely be the data typically obtained from such studies.

Rating data from product tests have two telling characteristics which belie the inferred causality. The first is the one-dimensional, redundant nature of the ratings, as indicated by the large amount of variability accounted for by the first principal component obtained from principal components analysis (performed without rotation). Typically, anywhere between 40% and 60% of variability is accounted for by that first component. The one-dimensional nature of these data suggests that attribute ratings are tightly anchored to the preceding overall liking rating, with very little distinction or discrimination actually provided by the attribute ratings.

The second characteristic relates to the distribution of ratings, the frequencies with which specific rating scale points are used, for overall liking. With reference to “direction of dependence” research^{*}, the distribution of ratings for a dependent measure should be more “normal” (i.e., bell shaped) than the distributions of attribute ratings used as predictors. This assertion is based on the notion that a true overall measure (e.g., overall liking meant to obtain a total product evaluation) should be the compilation of effects from many sources or dimensions of product performance, such that their combination yields ratings that are normally distributed. Two statistical measures often used to assess normality are skewness and kurtosis. A review of data from several product tests shows that overall liking ratings have greater skewness and more peaked (i.e., leptokurtic) distributions, and so are less normal, than many of the attribute ratings. By this criterion then, overall liking may be more a “cause” of attribute ratings, at least from the perspective of predictive modeling. Either we are modeling in the wrong causal direction or the questionnaires used, the way ratings are acquired, create a backwards causal flow. (A form of modeling, referred to as a linear mixture model, posits that overall acceptance "drives" the specific attribute ratings. This model fits data well, often as well as traditional modeling used with product test data, e.g., various forms of regression modeling. The weights estimated from the linear mixture model, which best relate the overall acceptance measure with each of the attribute ratings, are simple correlations.)

Three suggestions are offered here as approaches to help bring measurements in line with implied causality. The first is to ask for the overall rating last, after respondents have rated all attributes: “Now that you have completed your evaluation, please rate your overall opinion of this product”. Evaluating the attributes first acts as a funneling process, a review that leads respondents to a conclusion or overall performance evaluation.

A second suggestion is to not ask for an overall evaluation but rather create one as a composite of the attributes. Various weighting schemes can be employed to assign numeric importance to the attributes but perhaps the simplest and most robust is to simply add the ratings, giving each rating a weight of 1.0.

The third suggestion is based on Paul Slovic’s Affect Score. After having completed the attribute ratings, respondents are asked to list all the positive features of the product. Each feature is then weighted to reflect importance or value to the respondent. (A 3-point scale has been used to rate this importance.) The same question sequence is then applied for negative features. The affect score is then the difference between the sum of the positive features and the sum of the negative features. For example, a respondent may list 5 positive features of the product they just tried, and assigned to each a value rating of 2, for a positive affect score of 10. The same respondent cited 3 negatives and to each assigned a value of 3, for a negative affect score of 9. The difference, 1, is this respondent’s affect score. These scores, obtained from all respondents, can then be used as the dependent measure in subsequent modeling.
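The affect score arithmetic in the example can be sketched as follows. The function name and the representation of a respondent's answers as lists of importance weights are assumptions made for illustration.

```python
def affect_score(positive_weights, negative_weights) -> int:
    """Affect score for one respondent: the sum of the importance
    weights given to positive features minus the sum of the weights
    given to negative features."""
    return sum(positive_weights) - sum(negative_weights)


# The respondent in the example: 5 positives each valued at 2,
# and 3 negatives each valued at 3.
positives = [2, 2, 2, 2, 2]
negatives = [3, 3, 3]
print(affect_score(positives, negatives))  # 10 - 9 = 1
```

Applied to every respondent, the resulting scores form the dependent measure for subsequent modeling.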

^{*}“Direction of dependence” research notes that the correlation between two measures is related to the ratio of their skewness (or kurtosis) measures: skewness (kurtosis) for the dependent measure divided by skewness (kurtosis) for the attribute predictor. Since correlations must, by definition, be less than 1, the skewness (kurtosis) for the dependent measure must be smaller than that for the attribute.