
Model Extraction via Query-Based Functional Stealing

Query-based model stealing attacks can recover a functionally equivalent model from API access alone. The economics matter more than the technique: here's when extraction is worth doing.

By Marcus Reyes · 8 min read

Model extraction is the attack where a black-box adversary recovers a functionally equivalent model by querying the target API. The threat isn’t hypothetical: Tramer et al. extracted logistic regression and decision-tree models from BigML and Amazon ML with a few thousand queries in 2016, and the techniques have only improved since. But the cost of the attack and the fidelity ceiling you can realistically achieve are frequently misrepresented.

What model extraction actually means

There are two distinct goals that get conflated:

Functional extraction. Produce a substitute model f' that agrees with the target f on most inputs. You don’t need to recover the exact architecture or weights. If f' makes the same predictions as f on the relevant input distribution, you’ve extracted the functionality. This is the economically motivated threat — a competitor can reproduce your model without paying to train it.

Exact parameter recovery. Recover the weights and architecture of the target model up to numerical precision. This is possible for small, simple models with known architecture; it’s not viable for large neural networks through API queries alone. Papers claiming “exact recovery” of neural networks typically assume knowledge of the architecture and use equation-solving over a sufficient query set.

For the rest of this piece, “extraction” means functional extraction.

The Tramer et al. baseline

Tramer et al.’s 2016 paper (arXiv:1609.02943) is the foundational reference. They showed that with access to the full prediction vector (confidence scores across all classes), a linear model can be extracted by solving a system of equations, and decision-tree models can be extracted by path-finding with structured queries.

For neural networks, they demonstrated that logistic regression and shallow networks are extractable from confidence outputs with a number of queries proportional to the number of parameters. The key resource is the confidence vector, not just the label.
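To make that concrete, here is a minimal numpy sketch of the equation-solving idea for a binary logistic regression target. The `oracle` function stands in for the prediction API, and every name and number is illustrative rather than code from the paper: each query returns a confidence p = sigmoid(w·x + b), so logit(p) gives one linear equation in the unknown parameters, and d+1 well-chosen queries pin them down.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_true, b_true = rng.normal(size=d), 0.3           # hidden target parameters

def oracle(x):
    """Stand-in for the prediction API: returns the positive-class confidence."""
    return 1.0 / (1.0 + np.exp(-(x @ w_true + b_true)))

X = rng.normal(size=(d + 1, d))                    # d+1 probe inputs
p = np.array([oracle(x) for x in X])
logits = np.log(p / (1.0 - p))                     # invert the sigmoid

A = np.hstack([X, np.ones((d + 1, 1))])            # unknowns are (w, b)
w_b = np.linalg.solve(A, logits)                   # exact up to float precision

print(np.allclose(w_b[:d], w_true), np.isclose(w_b[-1], b_true))
```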

Distillation-based extraction: the current state of practice

For large neural networks, the practical attack is knowledge distillation against a black-box oracle. Conceptually:

  1. Sample or generate inputs from the domain the target model is designed for.
  2. Query the target API with those inputs, collect the confidence (or soft label) responses.
  3. Train a substitute model on the resulting (input, confidence-vector) pairs, using the oracle’s soft labels as supervision.
  4. Repeat until the substitute’s agreement with the oracle plateaus.

This is exactly the distillation process from Hinton et al. (2015), except the teacher is a closed-box API instead of an internally accessible model.
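A minimal sketch of that loop, assuming a PyTorch substitute, a hypothetical `query_api` wrapper that returns soft-label probability vectors, and a `sample_inputs` generator for in-domain data (all three names are illustrative, not from any of the cited papers):

```python
import torch
import torch.nn.functional as F

def extract(substitute, query_api, sample_inputs, rounds=10, lr=1e-3):
    """Distillation-style functional extraction against a black-box oracle.

    substitute    -- any torch.nn.Module producing class logits
    query_api     -- hypothetical wrapper around the target API; maps a
                     batch of inputs to soft-label probability vectors
    sample_inputs -- callable returning a batch of in-domain inputs
    """
    opt = torch.optim.Adam(substitute.parameters(), lr=lr)
    for _ in range(rounds):
        x = sample_inputs()                        # step 1: draw in-domain inputs
        with torch.no_grad():
            soft_labels = query_api(x)             # step 2: collect oracle confidences
        logits = substitute(x)                     # step 3: train on (input, confidence) pairs
        loss = F.kl_div(F.log_softmax(logits, dim=1), soft_labels,
                        reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return substitute                              # step 4: repeat until agreement plateaus
```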

What determines the fidelity ceiling:

Query budget. The substitute model’s accuracy on the target distribution increases with the number of query-label pairs, but with diminishing returns. For an ImageNet-scale classifier, agreement above ~85% typically requires tens to hundreds of thousands of queries; above 92% requires millions. The relationship is roughly logarithmic after a threshold.

Query distribution quality. Random queries from the correct domain are inefficient. Active learning strategies that query on uncertain or decision-boundary-adjacent inputs get more information per query. Papernot et al.’s Jacobian-based dataset augmentation (JBDA, arXiv:1602.02697) generates synthetic inputs by taking steps along the substitute model’s Jacobian. This significantly improves extraction fidelity for the same query budget.
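A sketch of one augmentation round in that spirit (a simplified reading of the JBDA idea with illustrative names, not the authors’ released code):

```python
import torch

def jbda_augment(substitute, x, oracle_labels, lam=0.1):
    """Jacobian-based dataset augmentation, simplified: push each input
    along the sign of the substitute's gradient for the class the oracle
    assigned, producing synthetic points near its decision boundaries."""
    x = x.clone().detach().requires_grad_(True)
    logits = substitute(x)
    # gradient of the oracle-assigned class score with respect to the input
    selected = logits.gather(1, oracle_labels.view(-1, 1)).sum()
    grad, = torch.autograd.grad(selected, x)
    return (x + lam * grad.sign()).detach()
```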

Target model output richness. Soft confidence vectors are much more informative than hard labels. If the target API returns only the top-1 prediction, distillation-based extraction requires more queries to achieve the same fidelity. If the API returns probabilities over all classes, each query contains much more training signal.

Architectural match. A substitute model that matches the target’s capacity can achieve higher fidelity. If the target is a ResNet-50 and you’re training a ResNet-18, you’ve built in a fidelity ceiling below 100% regardless of query budget.

The cost-of-attack tradeoff

This is where vendor threat models go wrong. Extraction fidelity is quoted in absolute terms (“we achieved 90% fidelity”). The relevant metric is the cost to reach that fidelity relative to the cost of training the target model from scratch.

For a typical commercial classifier, the extraction attack is not free: millions of API queries at $0.001-0.01 per query add up, on top of the compute to train the substitute. For models where training is cheap and the data is commodity, extraction is not worth the attack cost. For models where the training data is the moat — proprietary annotations, expensive clinical data, massive user behavioral data — extraction can be the cheaper path to the functionality, even though the substitute never reproduces the data moat itself.
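As a back-of-the-envelope illustration only, with every number assumed for the sake of the arithmetic rather than measured:

```python
# Illustrative break-even arithmetic; all figures are assumptions.
queries_needed = 2_000_000          # queries to reach the target fidelity
price_per_query = 0.005             # within the $0.001-0.01 range quoted above
attack_compute = 5_000              # GPU cost to train the substitute (assumed)

extraction_cost = queries_needed * price_per_query + attack_compute
print(f"extraction: ${extraction_cost:,.0f}")                        # $15,000

train_from_scratch = 8_000          # compute to train your own model (assumed)
data_acquisition = 250_000          # cost of the proprietary data moat (assumed)
print(f"own model:  ${train_from_scratch + data_acquisition:,.0f}")  # $258,000
# Extraction only wins when the data-acquisition term dominates.
```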

So when is extraction worth doing, adversarially?

For competitors trying to replicate a product end to end, usually not: the data is the moat, extraction doesn’t reproduce the training data, and a frozen functional copy can’t be retrained or extended. The cases that do justify the query cost are narrower, and the most important one is the downstream attack below.

Transfer attacks as the downstream threat

The most dangerous downstream use of an extracted model is transfer-based adversarial attacks. If your substitute agrees with the target at 85%, adversarial examples crafted against the substitute transfer to the target at roughly 40-70% success rates (varies widely by model family and attack type).
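A sketch of the offline step, using plain FGSM against the substitute as a stand-in for whatever stronger attack the adversary actually runs (the function and its defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def fgsm_on_substitute(substitute, x, y, eps=8 / 255):
    """Craft adversarial examples offline against the extracted substitute;
    the transfer step is simply submitting the result through the target's
    normal API path."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(substitute(x), y)
    loss.backward()
    x_adv = (x + eps * x.grad.sign()).clamp(0, 1)  # one signed-gradient step
    return x_adv.detach()
```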

This matters for image classifiers deployed in security-sensitive contexts. An attacker who can’t access the target directly crafts adversarial examples offline against the extracted substitute and submits them through the normal API path. The attack evades any input filter trained on natural queries.

Defenses

Rate limiting and query-distribution monitoring. An extraction attacker sends many structured queries. Distribution-based detection that flags query sets unlike any natural user distribution is effective. The challenge is that legitimate heavy usage from a large ML pipeline can look similar to extraction.
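One toy version of the distributional idea, assuming you log each client’s predicted-class histogram and compare it against a baseline built from organic traffic (a sketch, not a production detector):

```python
import numpy as np

def flag_client(client_pred_classes, natural_class_freq, n_classes, threshold=0.5):
    """Flag a client whose predicted-class histogram diverges sharply from
    the natural-traffic baseline (natural_class_freq must be a strictly
    positive probability vector). Extraction query streams often cover the
    label space far more uniformly than organic usage does."""
    counts = np.bincount(client_pred_classes, minlength=n_classes) + 1e-6
    client_freq = counts / counts.sum()
    kl = float(np.sum(client_freq * np.log(client_freq / natural_class_freq)))
    return kl > threshold
```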

Hard label outputs. Returning only the top-1 class prediction significantly degrades extraction fidelity per query. The distillation attack needs more queries to reach the same fidelity. This has a tradeoff: some downstream applications need confidence scores.

Output perturbation. Add small, calibrated perturbations to the returned confidence values, enough to degrade extraction fidelity without affecting downstream accuracy. Lee et al. showed that adding noise calibrated to the downstream loss function can nearly halve extraction fidelity while keeping end-task accuracy within 1-2%. Harder to deploy than it sounds; it requires careful calibration.
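A simplified version of the idea, not Lee et al.’s calibrated scheme: perturb the confidence vector while preserving the top-1 label, so end-task behavior is unchanged but the soft labels become a noisier training signal for an extractor.

```python
import numpy as np

def perturb_confidences(probs, sigma=0.05, rng=np.random.default_rng()):
    """Add small noise to a probability vector, renormalize, and swap back
    the original argmax if the noise displaced it."""
    top = int(np.argmax(probs))
    noisy = np.clip(probs + rng.normal(0, sigma, size=probs.shape), 1e-6, None)
    noisy /= noisy.sum()
    j = int(np.argmax(noisy))
    if j != top:                                   # preserve the returned label
        noisy[top], noisy[j] = noisy[j], noisy[top]
    return noisy
```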

Watermarking. Embed a verifiable watermark in the model’s behavior on a secret set of inputs. If the substitute inherits the watermark behavior, you can demonstrate with high statistical confidence that the substitute was extracted from your model. This is more about legal recourse than prevention, but it deters economically motivated extraction.
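A sketch of the verification side, assuming the owner holds a secret trigger set on which the original model was trained to emit unusual labels, and can query the suspect model (names are illustrative):

```python
import numpy as np

def watermark_match_rate(suspect_predict, trigger_inputs, trigger_labels):
    """Fraction of secret trigger inputs on which the suspect model emits
    the watermark labels; agreement far above chance is evidence the
    suspect was extracted from the watermarked model."""
    preds = np.array([suspect_predict(x) for x in trigger_inputs])
    return float(np.mean(preds == np.asarray(trigger_labels)))
```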

What the threat actually is

Model extraction is a real threat for high-value proprietary models where the training cost and data moat are large. For the median production ML classifier, it’s not a practical threat — training your own model is cheaper than mounting a high-fidelity extraction attack, and extraction doesn’t reproduce the training data anyway.

The threat becomes concrete in three scenarios: (1) models over expensive proprietary data that can’t be reconstructed; (2) models that are inputs to downstream adversarial attacks; (3) models where offline operation has high value to the adversary. Outside those scenarios, rate limiting and confidence rounding are sufficient mitigations for the realistic threat.

References

Tramèr, Zhang, Juels, Reiter, Ristenpart. Stealing Machine Learning Models via Prediction APIs. USENIX Security 2016. arXiv:1609.02943.
Papernot, McDaniel, Goodfellow, Jha, Celik, Swami. Practical Black-Box Attacks against Machine Learning. arXiv:1602.02697.
Hinton, Vinyals, Dean. Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
Lee, Edwards, Molloy, Su. Defending Against Neural Network Model Stealing Attacks Using Deceptive Perturbations. 2018.

#model-extraction #model-stealing #ml-security #adversarial-ml #api-security