Calibration, weighting and post-stratification in audience measurement
Over the last few months I’ve been working on two very demanding projects in the field of media measurement. One is nearly finished; the other will keep me busy for the next few months. Both involve calibration. Calibration is one of those polysemes that create a lot of confusion in statistics and data science. Here are several distinct but related ways in which the term calibration is used in statistics:
A probabilistic model is said to be well calibrated if the predicted probabilities from the model reflect actual observed frequencies. For instance, suppose a model predicts that a particular viewer segment has a 60% chance of watching a certain TV channel during prime time. If, over many such predictions, that segment actually ends up watching the channel about 60% of the time, then the model is considered well-calibrated.
Adjusting a measurement device to ensure its output matches a known standard or reference is also called calibration. For example, a TV audience measurement device (also known as a peoplemeter, audience meter, or TV meter), used to measure TV ratings, sometimes needs to be calibrated to ensure that it is accurate and functioning correctly.
Tuning the parameters of a model (e.g., coefficients, neural network weights, variable weights, hyperparameters, etc.) to minimize prediction error on training or validation data is sometimes called calibration. For example, if you use a statistical model to assess the extent to which migration background determines listening behavior, you will likely need to set some hyperparameters; or, if you use a neural network, the algorithm will update the weights to minimize prediction error. Both processes are sometimes referred to as calibration.
Adjusting decision thresholds to align with desired performance criteria is also sometimes called calibration. For example, if you use a random forest model to predict whether people are watching a certain program in a specific daypart based on socio-demographic characteristics, the default threshold of 0.5 might underestimate or overestimate certain segments. Adjusting that threshold to accommodate for this is sometimes called calibration as well.
Post-hoc probability calibration refers to adjusting a model’s predicted probabilities after training to ensure they better match actual outcomes. Even well-performing models can produce overconfident or underconfident probability scores. Techniques like Platt scaling and isotonic regression are used to calibrate these probabilities, especially in classification tasks where accurate confidence estimates are important (e.g., predicting audience likelihoods for different programs).
The type of calibration I’m referring to in this post is survey-weight calibration, also known as calibration estimation or Deville-Särndal calibration. The latter term comes from the influential 1992 paper “Calibration Estimators in Survey Sampling” by Jean-Claude Deville and Carl-Erik Särndal (Journal of the American Statistical Association, 87, 376–382). I used to own a printed copy of that paper, but unfortunately, it was lost when part of our roof was torn off during a storm — a story for another time.
Back to calibration: Loosely defined, Deville-Särndal calibration refers to the process of adjusting sampling weights so that survey totals align with known population totals (e.g., from a census or administrative register). I say “loosely defined” because this type of calibration is not restricted to surveys, nor must the calibration targets necessarily come from census data. For example, one might adjust the weights of a sample from a print readership study so that they align with externally sourced benchmarks. In some applications, it's not the weights that are adjusted directly, but rather the viewing or reading probabilities derived from them — again, another reason why the definition should be interpreted with some flexibility.
Iterative Proportional Fitting
Many practitioners will immediately turn to Iterative Proportional Fitting (IPF) as their go-to method for calibration. We will see later that IPF can be viewed as a special case of Deville–Särndal calibration, particularly when using a Kullback-Leibler (entropy) distance function and targeting marginal constraints.
Before we get to that, it's worth noting that IPF is known under several other names and is used across a variety of applications.
Terms like raking, Random Iterative Method (RIM) weighting, and RAS are often used interchangeably with IPF, although they may carry slightly different connotations depending on the field.
Conversely, the IPF algorithm has many applications beyond calibration. For example, some implementations of log-linear models — a class of models used to analyze relationships between categorical variables — use IPF to fit the model parameters. In economics, IPF (often referred to as RAS) is used to update input-output matrices while preserving their row and column sums. While researching for this post, I was surprised by how widespread IPF’s use is across different fields. Some machine learning methods also employ IPF; for instance, discrete Markov Random Fields under marginal constraints are sometimes fitted using IPF. It’s even applied in image reconstruction tasks.
That said, the most common application in survey statistics remains adjusting a joint distribution so its marginal totals match known census or population totals—typically when only marginal distributions are available. Of course, you could argue that this last application is essentially a form of calibration. To add to the confusion, the part of weighting that follows IPF is sometimes referred to as post-stratification.
Mechanically, IPF is straightforward to explain. Consider the two-dimensional case where we observe the joint distribution of two categorical variables, A and B, in a sample. Additionally, we know the population totals (marginals) for both A and B. Our goal is to adjust the cell weights in the joint distribution so that, after adjustment, the marginal distributions match the known population values.
Let’s write the joint cell counts or probabilities as:

\(P_{ij}\)

where i refers to the categories of the first variable in our two-dimensional table (for example, the categories man and woman of the variable gender) and j refers to the categories of the second variable (for example, age categories). Let’s denote the known marginal population totals as:

\(A_i\) (for variable A) and \(B_j\) (for variable B)
The IPF algorithm then consists of the following steps:
Initialization: Start with an initial estimate of the full table:

\(P_{ij}^{(0)} \leftarrow P_{ij}\)

Fit row marginals: For each row i, scale the cells in that row so the row sum matches category i of variable A:

\(P_{ij}^{(t+1)} \leftarrow P_{ij}^{(t)} \cdot \frac{A_i}{\sum_j P_{ij}^{(t)}}\)

Fit column marginals: For each column j, scale the cells in that column so the column sum matches category j of variable B:

\(P_{ij}^{(t+1)} \leftarrow P_{ij}^{(t)} \cdot \frac{B_j}{\sum_i P_{ij}^{(t)}}\)

Convergence: Repeat steps 2 and 3 until the marginals of the adjusted table match the known marginals within a specified tolerance.
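The steps above are easy to express directly in code. Below is a minimal NumPy sketch of the two-variable case; the sample table and the marginal targets are made-up illustration data (note that both sets of targets must share the same grand total for the procedure to converge).

```python
import numpy as np

def ipf(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Adjust `table` so its marginals match `row_targets` and `col_targets`."""
    P = np.asarray(table, dtype=float).copy()
    for _ in range(max_iter):
        # Step 2 -- fit row marginals: scale row i by A_i / current row sum
        P *= (row_targets / P.sum(axis=1))[:, None]
        # Step 3 -- fit column marginals: scale column j by B_j / current column sum
        P *= col_targets / P.sum(axis=0)
        # Step 4 -- stop once the row sums no longer drift after the column step
        if np.abs(P.sum(axis=1) - row_targets).max() < tol:
            break
    return P

# Made-up sample: joint counts of gender (rows) by age group (columns),
# with known population totals for each variable.
sample = np.array([[30., 20., 10.],
                   [25., 10.,  5.]])
A = np.array([55., 45.])         # population totals for gender
B = np.array([40., 35., 25.])    # population totals for age (same grand total)

adjusted = ipf(sample, A, B)
```

After convergence, the adjusted table reproduces both sets of marginals while preserving the interaction structure of the sample as much as possible.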
IPF as EM
Another thing worth noticing is that IPF can be seen as an application of the Expectation-Maximization (EM) algorithm. The EM algorithm is mostly attributed to Dempster, Laird, and Rubin, even though the algorithm had been used by others earlier. In their 1977 paper "Maximum Likelihood from Incomplete Data via the EM Algorithm", published in the Journal of the Royal Statistical Society, they were the first to formalize the general framework, prove its convergence properties, and demonstrate its wide applicability. When I was studying — back in the second half of the eighties and early nineties of the previous century — about half of the articles I was reading referred to this paper.
In short, the EM algorithm iterates over two steps:
Expectation-step: Fill in missing/latent data based on current parameter estimates.
Maximization-step: Re-estimate parameters based on the “complete” data from the Expectation-step.
In IPF, the marginal totals are observed, while the true joint cell counts for the full population are unobserved. We can treat these joint cell counts as missing data. The IPF procedure can then be interpreted as an EM algorithm: the E-step computes the expected joint distribution given the current estimates and the known marginals; the M-step adjusts the cell estimates proportionally so that they match the observed marginal constraints.
A more formalized framing of calibration
Now that we have explained what the IPF algorithm does, it’s time to formalize calibration a bit further. This will allow us to see why IPF indeed solves (some) calibration problems. More importantly, this more formalized framing of calibration opens the door for other methods that can be very helpful in practical calibration situations.
In general, calibration as described by Deville and Särndal can be formulated as the following constrained minimization problem:

\(\min_{g_1, \dots, g_n} \; \sum_{i=1}^{n} w_i^{proj} \, \varphi(g_i)\)
where gᵢ denotes the calibration factor for unit i, wᵖʳᵒʲᵢ is the projected base weight (e.g., after expansion or design weighting), n is the sample size, and φ(·) is a convex distance function that penalizes deviations from 1 (e.g., quadratic or KL-divergence).
The objective expresses the idea that the calibration-adjusted weights should remain as close as possible to the original projected weights. The resulting calibrated weights are defined as:

\(w_i^{cal} = w_i^{proj} \cdot g_i\)
These weights must satisfy the following calibration constraint:

\(\sum_{i=1}^{n} w_i^{cal} \, \mathbf{x}_i = \mathbf{X}\)
where 𝐱ᵢ is the auxiliary variable vector for unit i, and 𝐗 is the known total of these auxiliary variables in the population.
Let’s consider the case of IPF. Take as distance function:

\(\varphi(g) = g \log g - g + 1\)
This is the Bregman divergence (a convex function) that corresponds to minimizing the Kullback-Leibler (KL) divergence between the calibrated weights wᶜᵃˡᵢ = wᵖʳᵒʲᵢ · gᵢ and the original weights wᵖʳᵒʲᵢ. This can easily be shown by writing out the (generalized) Kullback-Leibler divergence between the calibrated and the original weights:

\(D_{KL} = \sum_{i=1}^{n} \left( w_i^{cal} \log \frac{w_i^{cal}}{w_i^{proj}} - w_i^{cal} + w_i^{proj} \right)\)

and by noticing that with the distance function mentioned above we get:

\(\sum_{i=1}^{n} w_i^{proj} \, \varphi(g_i) = \sum_{i=1}^{n} w_i^{proj} \left( g_i \log g_i - g_i + 1 \right) = \sum_{i=1}^{n} \left( w_i^{cal} \log \frac{w_i^{cal}}{w_i^{proj}} - w_i^{cal} + w_i^{proj} \right)\)

which matches exactly the KL formulation.
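This connection can also be checked numerically. The KL-minimizing table adjustment must factor into a row effect times a column effect, i.e. \(\log(P^{cal}_{ij}/P_{ij}) = \alpha_i + \beta_j\), and a table produced by IPF has exactly this structure. A small sketch with made-up data:

```python
import numpy as np

# Made-up table and marginal targets for illustration.
P0 = np.array([[30., 20., 10.],
               [25., 10.,  5.]])
A = np.array([55., 45.])
B = np.array([40., 35., 25.])

# Plain raking loop
P = P0.copy()
for _ in range(500):
    P *= (A / P.sum(axis=1))[:, None]
    P *= B / P.sum(axis=0)

# The entropy/KL solution has the form P[i,j] = P0[i,j] * exp(a_i + b_j),
# so the log-adjustments must be additive in row and column effects.
# Double-centering removes such effects; the residual should vanish.
M = np.log(P / P0)
resid = M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()
```

The residual matrix is zero up to floating-point noise, confirming the multiplicative row-times-column structure of the raking solution.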
The advantage of the more general formulation of calibration is that you can easily relax some of the constraints by introducing a slack variable s and a tolerance ε:

\(\min_{g, \mathbf{s}} \; \sum_{i=1}^{n} w_i^{proj} \, \varphi(g_i) + \lambda \, \psi(\mathbf{s}) \quad \text{subject to} \quad \sum_{i=1}^{n} w_i^{cal} \, \mathbf{x}_i = \mathbf{X} + \mathbf{s}, \quad \|\mathbf{s}\| \le \epsilon\)

where you can choose

\(\psi(\mathbf{s}) = \|\mathbf{s}\|_1, \quad \|\mathbf{s}\|_2^2, \quad \text{or} \quad \|\mathbf{s}\|_\infty\)

depending on the penalty you want. Further, we can impose bounds on how much the weights can change by introducing a delta:

\(1 - \delta \le g_i \le 1 + \delta\)
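As a sketch of such a relaxed problem: with a quadratic distance function and a quadratic (L2) slack penalty, the problem has a closed-form solution, which keeps the example self-contained. Everything here (the data, the targets, the penalty weight `lam`) is made up for illustration; in practice, hard bounds on the weights and the tolerance ε are easiest to handle with a convex solver.

```python
import numpy as np

# Made-up illustration data: binary auxiliary indicators x, flat projected
# weights w, population targets X, and a slack penalty weight lam.
rng = np.random.default_rng(0)
n, p = 200, 3
x = rng.integers(0, 2, size=(n, p)).astype(float)  # auxiliary indicator vectors
w = np.full(n, 5.0)                                # projected base weights
X = np.array([520., 480., 510.])                   # population targets

# Relaxed calibration with a quadratic distance and an L2 slack penalty:
#   min (1/2) sum_i w_i (g_i - 1)^2 + (lam/2) ||s||^2,
#   with s = sum_i w_i g_i x_i - X  (slack left in the constraints).
# Writing t = g - 1, the first-order condition is the linear system
#   (W + lam * B'B) t = -lam * B'd.
lam = 10.0
B = x.T * w                    # p x n matrix whose columns are w_i * x_i
d = B.sum(axis=1) - X          # misfit of the uncalibrated weights
t = np.linalg.solve(np.diag(w) + lam * (B.T @ B), -lam * (B.T @ d))
g = 1.0 + t                    # calibration factors
w_cal = w * g                  # calibrated weights
s = x.T @ w_cal - X            # remaining slack, shrinks as lam grows
```

Raising `lam` trades weight stability for a tighter match to the targets; the remaining slack is always smaller than the initial misfit.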
So you could see IPF as a special case of entropy minimization with:

No bounds on the changes the weights can make: \(\delta = \infty\)

No slack: \(s = 0\)

No tolerance: \(\epsilon = 0\)

Categorical (binary indicator) data representation: the auxiliary variables are represented as indicator vectors (typically binary), representing categorical distributions.
Other types of calibration
Beyond IPF and entropy-based methods, there are several other widely used calibration techniques.
Generalized Regression Estimation (GREG)
When the distance function is chosen as:

\(\varphi(g) = \frac{1}{2} (g - 1)^2\)
the resulting calibration method is known as Generalized Regression Estimation (GREG). This corresponds to a chi-squared divergence, and the calibration amounts to solving a weighted least squares regression problem. The name “GREG” reflects the fact that this method effectively fits a weighted linear regression model between the auxiliary variables and the survey variable of interest. While many variants exist, we do not explore them in detail here.
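Because the quadratic distance leads to a linear system, GREG calibration can be written in closed form. The sketch below uses made-up data; note that nothing in the quadratic distance prevents some calibrated weights from becoming negative, a well-known property of GREG.

```python
import numpy as np

# Made-up data: an intercept column (so the weight total is calibrated too)
# plus two binary auxiliaries, random projected weights, and targets X.
rng = np.random.default_rng(1)
n = 150
x = np.column_stack([np.ones(n),
                     rng.integers(0, 2, size=(n, 2)).astype(float)])
w = rng.uniform(2.0, 8.0, size=n)      # projected base weights
X = np.array([800., 390., 410.])       # population targets

# With phi(g) = (g - 1)^2 / 2, the Lagrange conditions give
#   g_i = 1 + x_i' lambda,  where lambda solves
#   (sum_i w_i x_i x_i') lambda = X - sum_i w_i x_i.
M = (x.T * w) @ x                      # p x p matrix: sum_i w_i x_i x_i'
lam = np.linalg.solve(M, X - x.T @ w)  # Lagrange multipliers
g = 1.0 + x @ lam                      # calibration factors
w_cal = w * g                          # GREG calibrated weights
```

After this single linear solve, the calibrated weights reproduce the targets exactly; no iteration is needed.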
Hierarchical Calibration
Hierarchical calibration (also known as multilevel calibration or nested calibration) adjusts weights to match known population totals at multiple levels of aggregation, for instance:
Lowest level: Calibration adjusts weights to match known totals within individual countries (or other base-level units).
Intermediate level: Calibration also matches totals aggregated over subregions, where subregions are groups of countries.
Highest level: Calibration further ensures totals match at the region level, which is an aggregation of multiple subregions.
This is typically implemented through sequential calibration: weights are first adjusted at the lowest level (e.g., within countries), and then recalibrated at higher levels, preserving previously achieved constraints. This ensures coherence across levels, meaning that lower-level totals sum to the corresponding higher-level targets.
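A minimal sketch of the nested setup, with made-up units, countries, and targets: when the targets are coherent across levels (each subregion target equals the sum of its country targets), calibrating at the lowest level automatically satisfies the higher level; sequential recalibration becomes necessary when higher-level targets come from a different source or involve additional variables.

```python
import numpy as np

# Made-up nesting: 10 panel units -> 4 countries -> 2 subregions, with
# coherent targets (each subregion target is the sum of its country targets).
country = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])
subregion_of_country = np.array([0, 0, 1, 1])   # countries 0,1 -> subregion 0
w = np.full(10, 10.0)                           # starting weights

country_targets = np.array([25., 35., 20., 30.])
subregion_targets = np.array([60., 50.])        # 25+35 and 20+30

# Lowest level: post-stratify within each country
for c, target in enumerate(country_targets):
    mask = country == c
    w[mask] *= target / w[mask].sum()

# Higher level: aggregate and check coherence
subregion = subregion_of_country[country]       # subregion of each unit
subregion_totals = np.array([w[subregion == s].sum() for s in range(2)])
```

In this coherent case the subregion totals fall out of the country-level adjustment by construction, which is exactly the coherence property sequential calibration is designed to preserve.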
Calibration at scale
In one of the projects I mentioned earlier, I simply used the uniroot function in the statistical programming language R to perform calibration. For another project, we employed Python, using CVXPY, a powerful library for convex optimization. A colleague of mine is working with CVXR, an R equivalent of CVXPY, to solve problems that are not directly related to calibration but rely on similar optimization techniques.
He also mentioned Gurobi, a state-of-the-art commercial optimization solver widely used for large-scale mathematical programming problems. For very large and complex optimization tasks, such solvers can be highly effective and practical alternatives. An older but still widely respected solver is IBM’s CPLEX optimization suite. I am aware of at least one market research company that leverages these solvers to implement large-scale calibration exercises—specifically, calibrating consumer panel data to retail measurement results. A key aspect of their implementation is that the consumer panel results are not calibrated to match the retail measurement results exactly; instead, some slack is allowed.
Conclusion
As you can see, calibration is an important topic in market research in general and audience measurement in particular. It suffers from the conceptual confusion I mentioned at the beginning of this post. It is worth noting that, strictly speaking, in some applications post-stratification, weighting, and calibration are the same thing; yet in some industries, such as media research, weighting and calibration are used as distinct terms, probably because they happen at different moments in the workflow. Another reason could be that in some media measurement applications the weights are left untouched, and instead probabilities (for example, reading probabilities) are adjusted to match a population total, using the same calibration techniques discussed in this post.
I hope that with this post I have helped lift some of the conceptual confusion around calibration and weighting, and at the same time pointed out the relationships between some well-known algorithms.