I spent the last few weeks on two related projects. The first was reproducing a 2024 Scientific Reports paper by Lee et al., “Social signals predict contemporary art prices better than visual features, particularly in emerging markets”. It trains an XGBoost model on 34,200 auction records from 590 living artists. The second was building Art Evaluator, a Django and Expo platform with a knowledge graph of artists, owners, and auction records. It’s basically the data plumbing you’d need around a model like that one if you wanted to do anything useful with it.
Both projects kept circling back to the same question: can you actually predict what an artwork sells for? The answer is yes, but not from looking at the artwork. The features that matter are almost entirely social.
## Why Art Is Hard to Value
Before writing any code, I spent a couple of days reading about how appraisers and auction houses price work. The list of factors is long and most of them don’t quantify cleanly:
- Provenance: who has owned it, where it has been shown, what catalogs reference it.
- Attribution and authenticity: a piece confirmed by the artist’s foundation is worth orders of magnitude more than one that’s only “attributed to” them.
- Artist reputation: solo shows, biennial inclusions, museum acquisitions, critical reception.
- Condition: preservation state, restoration history, conservator reports.
- Rarity: edition size, format, whether the subject is unusual within the artist’s body of work.
- Medium: oil on canvas tends to outperform works on paper, which outperform prints.
- Institutional recognition: a piece held by MoMA, Tate, or the Louvre carries a stamp that the secondary market trusts.
- Market timing: recent comparables matter; five-year-old comps are stale.
Sotheby’s specialists also talk about “emotional value”: collector desire, bidding-war dynamics, personal resonance. It can push hammer prices well past any rational estimate. You can’t model that directly, but you can pick up its shape indirectly through auction-house estimates.
The dominant valuation method in practice is comparable sales analysis. Find recent auction results for similar works by the same artist and adjust for differences. An XGBoost model is doing the same thing, just at scale and with more features than any human appraiser would track by hand.
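For intuition, here’s what that looks like as code. A minimal pandas sketch of a comps-based estimate; the frame and column names (artist_id, sale_date, price_usd, area_sq_in) are illustrative, not from the paper:

```python
import pandas as pd

def comps_estimate(sales: pd.DataFrame, artist_id: str, area_sq_in: float,
                   as_of: pd.Timestamp, window_years: int = 5) -> float:
    """Estimate a price from recent comparable sales by the same artist.

    Assumes a `sales` frame with hypothetical columns: artist_id,
    sale_date, price_usd, area_sq_in.
    """
    comps = sales[
        (sales["artist_id"] == artist_id)
        & (sales["sale_date"] < as_of)
        & (sales["sale_date"] >= as_of - pd.DateOffset(years=window_years))
    ]
    if comps.empty:
        return float("nan")
    # Adjust for size: price the work at the comps' median dollars per square inch.
    median_rate = (comps["price_usd"] / comps["area_sq_in"]).median()
    return median_rate * area_sq_in
```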
## The Paper’s Central Claim
Lee et al. make a counterintuitive argument: the visual content of an artwork explains very little of its price. Their visual-only model, using color, composition, edge density, and ResNet18 embeddings, gets an R² of about 0.055. That’s barely better than guessing the mean.
A model using only artist-level “social” features (career stage, exhibition history, past auction prices, ranking on ArtFacts) hits R² ≈ 0.73. Add the auction house’s pre-sale estimate as a feature and it climbs to 0.92.
In other words, the artwork barely matters. What matters is who made it and what the market has already said about them.
The paper’s secondary claim is that this gap widens further in emerging markets, by which they mean anything outside the established core of the USA, UK, France, and Germany. In those markets, social signals matter more and expert estimates are less accurate, leaving more room for algorithmic prediction to add value.
## Reproducing the Paper
The brief was straightforward: reproduce the paper as a single-developer MVP. No production deployment, no live data pipeline. Faithful reimplementation, prototype-quality code, and a written report on findings.
### The Data
The paper provides cleaned CSVs as supplementary material:
| File | Rows | What it is |
|---|---|---|
| Df_mloutfull.csv | 86,221 | Full raw dataset, 500+ columns |
| df_for_ml_improved_up_to_2012.csv | 34,200 | Main cleaned dataset, 1996–2012 |
| df_for_ml_improved_old_market.csv | 29,853 | Established markets only |
| df_for_ml_improved_new_market.csv | 4,346 | Emerging markets only |
| transactions.csv | 114,283 | Raw auction records with image links |
The whole thing is about 5 GB. It fits comfortably in RAM, and XGBoost training on a dataset this size takes under five minutes on a laptop CPU. No GPU needed for the core model. Gradient-boosted trees on tabular data at this scale are nearly free to train, which is easy to forget if you’ve spent the last few years in neural-net land.
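Loading it is as boring as it sounds. A minimal sketch, assuming the main CSV from the table above and a price_usd column (the actual schema may name things differently):

```python
import numpy as np
import pandas as pd

# Main cleaned dataset from the paper's supplementary material.
df = pd.read_csv("df_for_ml_improved_up_to_2012.csv",
                 parse_dates=["sale_date"])  # assumed date column name

# The target is log10 of the hammer price in USD.
df["log_price"] = np.log10(df["price_usd"])
```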
### Features
The model uses 38 features split across three categories.
About the artist (30 features):
- Demographics: age, gender, education level, top-school flag.
- Career: ArtFacts ranking, solo shows, group shows, biennial inclusions, awards.
- Collections: counts of private and public collections.
- Price history: mean, median, max, and min from the artist’s last 5 and last 10 sales (construction sketched below).
- Size-adjusted price history (price per square inch).
- Geography: where the artist works and lives, encoded as country flags.
- Match flags: does the artwork’s genre match the artist’s typical genre? Does the sale country match?
About the market (8 features):
- Auction house tier (1–4).
- Country and region price levels for that year (min, mean, median, max).
About the artwork itself (non-visual):
- Width, height, area in square inches.
- Medium category (painting, print, photo, sculpture, other).
What’s missing from that list is anything about the artwork’s actual visual content. That comes in only as a separate ablation study.
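The price-history features are the fiddliest part to get right, because each sale’s features may only look backward. A sketch of how I’d build them, with assumed column names:

```python
import pandas as pd

def add_price_history(df: pd.DataFrame) -> pd.DataFrame:
    """Add rolling price-history features over each artist's prior sales.

    Column names (artist_id, sale_date, price_usd) are assumptions,
    not the paper's actual schema.
    """
    df = df.sort_values(["artist_id", "sale_date"]).copy()
    # shift(1) so a sale never sees its own price in its features.
    prior = df.groupby("artist_id")["price_usd"].shift(1)
    for w in (5, 10):
        roll = prior.groupby(df["artist_id"]).rolling(w, min_periods=1)
        for stat in ("mean", "median", "max", "min"):
            col = getattr(roll, stat)().reset_index(level=0, drop=True)
            df[f"price_last{w}_{stat}"] = col
    return df
```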
### The Train/Test Split
Time-based, not random. Everything before May 2011 goes into training (~80%, ~27,360 records); everything after goes into the test set (~20%, ~6,840 records). Random splits would leak future information through the artist’s price-history features. If you scatter one artist’s records randomly across train and test, the test rows have already “seen” each other’s prices through the rolling windows, and your R² gets artificially inflated.
It’s the kind of thing that’s obvious once you see it but easy to miss on the first build.
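In code the split is one line each; the only trick is that it must be on the date, never a random shuffle (sale_date is an assumed column name):

```python
import pandas as pd

CUTOFF = pd.Timestamp("2011-05-01")

# Train on everything before May 2011, test on everything after.
# A random split would leak future prices through the rolling
# price-history features built above.
train = df[df["sale_date"] < CUTOFF]
test = df[df["sale_date"] >= CUTOFF]
```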
### The Model
XGBoost regressor, predicting log10(price_usd). Hyperparameter search over max_depth and learning_rate against a held-out validation slice. Two model variants:
- Without expert estimates. Target R² ≈ 0.73.
- With expert estimates. Target R² ≈ 0.92.
The “with estimates” variant uses the auction house’s pre-sale low and high estimates as additional features. Those two estimates alone get R² ≈ 0.90. Adding the social features on top picks up another 0.02. That’s small in absolute terms, but it’s the part of the model doing real predictive work beyond what the experts already know.
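The training loop itself is small. A sketch under assumptions: the grid values, validation cutoff, and feature-column selection are mine, not the paper’s:

```python
from itertools import product

from sklearn.metrics import r2_score
from xgboost import XGBRegressor

# Stand-in for the 38 feature columns described above.
FEATURES = [c for c in train.columns
            if c not in ("price_usd", "log_price", "sale_date")]

# Hyperparameter search against a held-out validation slice carved
# from the end of the training period.
fit = train[train["sale_date"] < "2010-05-01"]
val = train[train["sale_date"] >= "2010-05-01"]

best_score, best_model = -float("inf"), None
for depth, lr in product([4, 6, 8], [0.05, 0.1, 0.3]):
    model = XGBRegressor(max_depth=depth, learning_rate=lr,
                         n_estimators=500, random_state=0)
    model.fit(fit[FEATURES], fit["log_price"])
    score = r2_score(val["log_price"], model.predict(val[FEATURES]))
    if score > best_score:
        best_score, best_model = score, model
```

With a fixed random_state the whole run is deterministic, which is what makes the ±0.03 reproducibility target below a fair bar.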
### What I Aimed For
I wasn’t trying to beat the paper. The goal was to land within ±0.03 of their R² values across all conditions. For the headline numbers (~0.73 metadata-only, ~0.92 with estimates), reproducibility is good. XGBoost is deterministic with a fixed seed, and the features are well-specified. I expected drift in two places: the visual-feature ablation, where small differences in PCA dimensions or image preprocessing can move things, and the emerging-market split, where there are only 4,346 records and the noise floor is higher.
## Why Visual Features Barely Help
This was the most interesting result for me. The paper’s visual pipeline extracts 8,973 numbers per image:
- GIST descriptor (960 dims): overall spatial layout.
- Histogram of Oriented Gradients (2,915 dims): edge structure.
- Color histogram (4,096 dims): palette breakdown.
- ResNet18 features (1,000 dims): high-level pretrained embedding.
- Colorfulness (1 dim): vividness scalar.
- Complexity (1 dim): edge-density scalar.
Compress all that with PCA and feed it to XGBoost, and you get R² ≈ 0.055. Color alone gets 0.056. Edge structure gets 0.029. The neural-net embedding gets 0.009, basically zero.
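For reference, the ResNet18 leg of that pipeline is only a few lines with torchvision. A sketch; the paper’s exact preprocessing and PCA dimensionality are things I’d have to guess at:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained ResNet18; its 1,000-dim classification output serves as
# an off-the-shelf embedding, matching the dimension listed above.
net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def resnet18_embedding(path: str) -> torch.Tensor:
    """One image in, one 1,000-dim feature vector out."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return net(img).squeeze(0)  # shape (1000,)
```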
The clean explanation is that auction prices are set at the level of the artist, not the individual work. Two paintings by the same artist sell for similar amounts regardless of what’s on the canvas, because what’s being priced is the signature.
You can see why this would be true if you’ve ever read auction catalog notes. They spend almost all their words on provenance, exhibition history, and comparable sales. They barely describe the work itself.
If you take that 5% number seriously, the aesthetic content of a work is almost decoupled from its market value. Whatever the market rewards, it isn’t what’s on the canvas. A model that understood art at the pixel level would predict prices worse than one that only knew the artist’s CV.
## Phase 3 Was Conditional
The paper’s image-link column points to URLs that were valid in 2012. By 2026, most are dead. The plan I wrote up flagged this risk on day 8: pull a hundred random URLs and see how many resolve. If more than half were gone, skip the visual phase entirely and document why.
Visual features only explain about 5% of price variance anyway. Spending two weeks fighting bit rot to confirm a number that small isn’t a good trade. Better to write up the dead-link finding as part of the report and move on.
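The day-8 check is a dozen lines. A sketch with requests; the 50% threshold comes from the plan above:

```python
import random

import requests

def sample_link_rot(urls: list[str], n: int = 100, timeout: float = 5.0) -> float:
    """Return the fraction of a random sample of image URLs that still resolve."""
    sample = random.sample(urls, min(n, len(urls)))
    alive = 0
    for url in sample:
        try:
            r = requests.head(url, timeout=timeout, allow_redirects=True)
            alive += r.status_code < 400
        except requests.RequestException:
            pass  # dead host, DNS failure, timeout: all count as rot
    return alive / len(sample)

# Decision rule: if sample_link_rot(urls) < 0.5, skip the visual phase
# and document the dead-link finding instead.
```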
## Established vs. Emerging Markets
The paper trains separate models for established markets (USA, UK, France, Germany) versus emerging markets (19 other countries). The headline finding:
| Feature set | Established R² | Emerging R² |
|---|---|---|
| Visual only | 0.053 | 0.056 |
| Metadata (social signals) | 0.667 | 0.750 |
| Metadata + estimates | 0.916 | 0.859 |
Two things stand out. First, social signals matter more in emerging markets, not less. You might expect that established markets have richer data and therefore better social-feature performance, but it’s the opposite. In emerging markets the social signal is doing more of the predictive work.
Second, expert estimates are less accurate in emerging markets. That’s the 0.916 vs 0.859 gap once estimates are added in. In an established market, an algorithmic prediction has to compete with a Sotheby’s specialist who has decades of comparable-sales data behind them. In an emerging market, that specialist has thinner data and the algorithm has more room to add value.
If you ever wanted to commercialize a model like this, emerging markets are where it would earn its keep.
## Limitations Worth Stating
This is reproducing a paper, not building a production system. A partial list of what it doesn’t do:
- No live data pipeline. The model trains on 1996–2012 auction data and never updates.
- No deployment. Outputs are notebooks and a report, not a prediction API.
- No drift detection. Once trained, the model doesn’t know when the market has changed.
- Frozen dataset. Crypto art, NFTs, and post-COVID market shifts simply don’t exist in the training data.
- No Korean-market specialization. If you wanted to predict prices at K Auction or Seoul Auction, you’d want auction-specific features and possibly a separate model for the Korean market.
The model also reflects historical patterns. If the contemporary market changed structurally around 2015 (and there are arguments it did, with art fairs becoming a major primary-market venue and a new generation of collectors entering), then the 1996–2012 patterns may not generalize.
## From Paper to Platform
Reading the paper changed how I thought about the second project. Art Evaluator was originally framed as a crowdfunding marketplace for art exhibitions. Artists secure funding for upcoming shows, and investors get exposure to emerging artists earlier than the auction market typically allows.
The product surface is deliberately simple. Artists post upcoming shows that need funding (venue, dates, expected works). Investors back specific shows in tiered amounts ($100, $500, $1k, $5k) and share in the upside when the work sells. AI agents handle most of the artist-investor chat so that neither side has to be online for the conversation to keep moving. That’s particularly useful when a Korean gallery is talking to a U.S. collector across a 14-hour gap.
Once it sinks in that the artist is the asset and not the artwork, the data model has to change to match. What matters is:
- Who has owned each work (a chain of nodes in a graph).
- Where it has been shown (institutional recognition signals).
- What comparable works have sold for (auction comps for the artist).
- Who else owns work by this artist (collector network effects).
That’s what a knowledge graph is for. The Django backend (apps/artworks/models.py) models four primitives:
- Artist: the person making the work.
- AuctionRecord: historical sales with prices, dates, and houses.
- OwnershipRecord: who held what between which dates.
- ScrapeLog: provenance for the data itself, so we can audit where every record came from.
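A trimmed sketch of those models. Field names are illustrative, and I’ve added an assumed Artwork model to hang ownership and sales on; the real schema lives in apps/artworks/models.py:

```python
from django.db import models

class Artist(models.Model):
    name = models.CharField(max_length=255)
    birth_year = models.PositiveIntegerField(null=True, blank=True)

class Artwork(models.Model):  # assumed glue model, not one of the four primitives
    artist = models.ForeignKey(Artist, on_delete=models.CASCADE,
                               related_name="artworks")
    title = models.CharField(max_length=255)

class AuctionRecord(models.Model):
    artwork = models.ForeignKey(Artwork, on_delete=models.CASCADE)
    house = models.CharField(max_length=255)
    sale_date = models.DateField()
    price_usd = models.DecimalField(max_digits=14, decimal_places=2)

class OwnershipRecord(models.Model):
    artwork = models.ForeignKey(Artwork, on_delete=models.CASCADE)
    owner_name = models.CharField(max_length=255)
    start_date = models.DateField()
    end_date = models.DateField(null=True, blank=True)  # null = current owner

class ScrapeLog(models.Model):
    source_url = models.URLField()
    scraped_at = models.DateTimeField(auto_now_add=True)
    record_type = models.CharField(max_length=50)  # which table this run fed
```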
The mobile app (Expo + React Native + @shopify/react-native-skia) renders all of this as a force-directed graph. useGraphData hits the Django /api/graph/ endpoint, useForceLayout runs a d3-force simulation in JS each tick, and GraphCanvas draws the result via Skia. A time-range slider lets you scrub through years and watch the network evolve: which collectors entered when, which artists got picked up by which museums, which works changed hands during the 2008 dip.
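On the backend side, the /api/graph/ payload is just nodes plus time-stamped edges. A sketch of the shape useGraphData might consume; the field names are my assumptions, not the real API:

```python
from django.http import JsonResponse

from apps.artworks.models import Artist, Artwork, OwnershipRecord

def graph(request):
    """Nodes and time-stamped edges for the force-directed graph."""
    nodes, edges = {}, []
    for a in Artist.objects.all():
        nodes[f"artist:{a.pk}"] = {"type": "artist", "label": a.name}
    for w in Artwork.objects.select_related("artist"):
        nodes[f"artwork:{w.pk}"] = {"type": "artwork", "label": w.title}
        edges.append({"source": f"artist:{w.artist_id}",
                      "target": f"artwork:{w.pk}", "kind": "made"})
    for o in OwnershipRecord.objects.all():
        key = f"owner:{o.owner_name}"
        nodes.setdefault(key, {"type": "collector", "label": o.owner_name})
        # start/end dates are what the time-range slider scrubs over.
        edges.append({"source": key, "target": f"artwork:{o.artwork_id}",
                      "kind": "owned",
                      "start": o.start_date.isoformat(),
                      "end": o.end_date.isoformat() if o.end_date else None})
    return JsonResponse({
        "nodes": [{"id": k, **v} for k, v in nodes.items()],
        "edges": edges,
    })
```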
Each node type has its own shape and color:
| Node type | Shape | Color |
|---|---|---|
| Artist | Diamond | Red |
| Artwork | Rounded rect | Blue |
| Collector | Circle | Green |
| Dealer | Circle | Orange |
| Museum | Circle | Purple |
| Estate | Circle | Teal |
| Auction House | Hexagon | Grey |
The shapes aren’t decorative. When you zoom out to a few hundred nodes the silhouette tells you what type of entity you’re looking at without waiting for labels to render. A diamond surrounded by circles is an artist with a collector network. A hexagon connected to many diamonds is an auction house that handles a roster of artists. You can read the structure of a market at a glance.
A second tab shows the same data on a vis-timeline axis: ownership periods as horizontal bars, sales as point events. Same underlying graph, different lens. Useful when you want to see the sequence of who held what, rather than the network shape at a single moment.
Art Evaluator isn’t itself a price-prediction tool. It’s the data layer underneath one. If you wanted to feed something like the Lee et al. model with live, current data instead of frozen 2012 CSVs, this is the structure you’d want.
## Crowdsourced Valuation as a Game
The piece I haven’t built yet, and the part I’m most interested in, is turning art valuation into a game.
The Lee et al. paper takes auction-house pre-sale estimates as a feature and gets a big R² lift from them. Those estimates are valuable because they’re the aggregated judgment of trained specialists who have seen thousands of comparable works and have skin in the game. But auction-house specialists are a narrow, expensive resource. There are a few hundred of them globally, they work on specific consignments, and they don’t touch the long tail of artists nobody is consigning yet.
The question I keep coming back to is whether you can synthesize a comparable signal from a much wider, cheaper pool of judgments.
The rough design:
- Show a user an artwork: image, dimensions, year, artist name, brief CV.
- Ask them what it would sell for at auction.
- Let them pick a price bracket, or guess a hammer figure directly.
- Reveal the actual sale price after they commit.
- Score them with a Brier-score-style accuracy rating over time.
- Surface a leaderboard, daily streaks, and badges for hot streaks.
- Weight each user’s future predictions by their historical accuracy (scoring and weighting sketched below).
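Here’s how the scoring and weighting mechanics from that list could fit together, in a minimal sketch. The weight mapping is a design choice, not a derived formula:

```python
import numpy as np

def brier(probs: np.ndarray, outcome: int) -> float:
    """Multi-bracket Brier score: squared distance between the user's
    forecast over price brackets and the one-hot actual bracket.
    0 is a perfect, confident call; lower is better."""
    target = np.zeros_like(probs)
    target[outcome] = 1.0
    return float(np.sum((probs - target) ** 2))

def user_weight(brier_history: list[float]) -> float:
    """Map a user's Brier history to an aggregation weight; better
    (lower) average scores earn more weight."""
    return 1.0 / (1e-3 + float(np.mean(brier_history)))

def crowd_estimate(guesses_usd: np.ndarray, weights: np.ndarray) -> float:
    """Accuracy-weighted median of point guesses: the synthetic
    'pre-sale estimate' feature a model would consume."""
    order = np.argsort(guesses_usd)
    cum = np.cumsum(weights[order])
    idx = int(np.searchsorted(cum, cum[-1] / 2.0))
    return float(guesses_usd[order][idx])
```

The output of crowd_estimate is what would slot in next to the auction-house low and high estimates in the feature set.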
It’s a prediction market for art prices in the form of a guessing game. The loop is the same as GeoGuessr or chess.com puzzle ratings: repeatable, scorable, with feedback you can build skill against. People who guess well rise on the leaderboard, and their future guesses count for more in the aggregated signal.
The model side is where this gets useful. A weighted aggregation of those guesses becomes a feature you can feed into XGBoost alongside the social signals. If the crowd’s accuracy-weighted median guess correlates well with hammer prices, you have something close to a synthetic auction-house estimate for works that don’t have a real one. That’s exactly the gap on the long tail of artists where the Lee et al. model needs the most help.
There’s also a cold-start angle. The model trains well on artists with rich price histories and badly on artists with three sales to their name. Three sales plus a thousand crowd guesses is at least something to work with. Crowdsourced valuation is a way of generating training signal for the cases the underlying data doesn’t cover.
Gamification matters here because the only way to make the crowd reliable is to make participating fun enough that people do it a lot. Single-question forms get you noisy data from people who clicked through once. A daily streak with a leaderboard and a calibration badge gets you the same person making 500 carefully considered predictions over a year. Volume plus self-selection (good predictors stay, bad predictors get bored) is what gives you a signal rather than noise.
A few open design questions I’d need to figure out before building:
- What do you show the user? Just image and CV, or do you also give them comparable sales for context? More context means better-informed guesses, but it also means the user is mostly regurgitating what you’ve already shown them.
- How much price feedback do you give? Reveal the exact hammer? A range? Just whether they were within ±20%? Too granular and you train people to anchor on specific numbers; too vague and they can’t calibrate.
- How do you stop the leaderboard from being gamed? Players will gravitate toward artists they already know, and a Banksy guess is mostly trivia rather than skill. Probably forced random sampling, like Duolingo’s lesson selection, with a separate ranking for “blind” rounds where the artist’s name is hidden.
- How do you handle hindsight bias? Once a user knows a Basquiat sold for $110M, they can’t un-know it. The supply of fresh, unseen artworks is what keeps the loop honest, and that supply is finite per week.
The gamification framing also pulls the platform’s two halves together. The crowdfunding side rewards investors when shows perform well financially. The guessing-game side rewards anyone, paying investor or not, for being good at calling prices. Both are versions of the same idea: let the crowd’s collective judgment turn into a tradeable signal. One trades capital, the other trades attention. Players who develop a real edge could eventually graduate to the investor side, bringing a verifiable track record of price-calling accuracy with them. That’s a better filter than “knew the right gallerist.”
## What I Took Away
A few things I didn’t expect going in:
1. Pre-sale estimates are very good. Auction houses get a lot of grief for opaque pricing, but their pre-sale estimates capture something close to 90% of price variance on their own. Whatever specialists are doing, they’re doing it well.
2. The artwork itself barely matters statistically. This is uncomfortable if you care about art for aesthetic reasons. The market isn’t pricing what you see. It’s pricing what other people have already paid for adjacent work.
3. Time-based splits are the only honest way to evaluate price models. Random splits give you flattering R² values that don’t survive contact with a real production deployment.
4. Gradient-boosted trees are still the right tool here. XGBoost beats every neural baseline in the paper. For tabular data at this scale, with this much engineered feature structure, there’s no reason to reach for a transformer.
5. The interesting opportunity is in emerging markets. That’s where expert pricing is thinnest and algorithmic prediction has the biggest gap to fill.
6. The crowd is an underused data source. The Lee et al. model’s single biggest feature was an auction house’s pre-sale estimate. If a calibrated crowd can produce a comparable signal for the long tail of artists, that’s a real opening, both as a model input and as a product surface in its own right.
The two projects ended up more complementary than I planned. The paper reproduction taught me what predicts prices. The platform work taught me what data structure you’d need to put that knowledge to use. The gamification idea ties the two together. The platform’s knowledge graph supplies the artworks and historical sales. The guessing game generates a crowd-derived estimate feature. The model consumes that feature alongside the social signals to produce predictions for artists who would otherwise be invisible to it.
None of this is finished. The paper reproduction is a working prototype with a written report. The platform is a demo with seeded data. The crowdsourced game only exists as the design notes you just read. But the connecting idea is consistent enough that I think the right shape is reasonably clear: price the artist rather than the artwork, aggregate the crowd rather than relying only on experts, and treat valuation as something the market produces collectively rather than something specialists hand down.
That’s the version I’d ship next, given the time to build it.