Independent Research · US Market

What makes users trust
a payment app?
A conjoint study.

A small-n quantitative UX study measuring how users of US P2P payment apps (Venmo, Zelle, PayPal, Cash App) weigh specific trust signals when deciding to send money. 12 participants. Choice-based conjoint with a hierarchical Bayesian model. One finding that flipped my hypothesis on its head.

⏱ 6 weeks 📊 Choice-based conjoint 📋 n = 12 👀 Solo · Personal Project 🎓 Academic Case Study 📈 6 Trust Attributes
✦ what this study found

I assumed visible encryption badges drove trust. They scored 7.1% importance. The recipient's real name scored 5.4× higher.

12
participants recruited from my own network: classmates, friends-of-friends, two family members. All active monthly users of at least one US P2P payment app.
38.4%
relative importance of recipient display name, the largest driver of perceived trust across all six tested attributes. Dominates at every transaction amount.
7.1%
relative importance of the security / encryption badge. The industry-standard trust signal is second-from-last on the list, barely ahead of the verified checkmark.
📊 a finding that overturned my hypothesis
Trust in payment apps is overwhelmingly social, not technical. Identity beats cryptography on every metric.

Across 12 respondents and 144 choice observations, the four identity attributes (name, photo, shared history, mutual contacts) together account for 86.3% of total trust-driver importance. Encryption and verified-tick combined sit at 13.7%. Effects this large show up even at small n.

✦ scope of this project
Personal academic case study. One designer, six-week timeline, pre-registered, data and code open-sourced, small self-collected sample.

No client, no team, no production implications. Zero budget, so no paid panel, no Prolific. Just 12 people from my own network, a rigorous quantitative method, and honest reporting of both the finding and the caveats that come with n = 12.

✦ why this project matters to me
I kept hearing designers defend security badges as "trust signals". But I'd never once looked at one before sending $40 to a friend. I wanted a number, not an opinion.

This was a personal academic case study: no team, no funding, no client, no real-world implications. Just me, a 6-week timeline, and twelve generous people from my network who each sat through a conjoint session. I wanted to teach myself how to run a quantitative study end-to-end, and I wanted an honest answer to a question that had been bugging me for months, even if "honest" meant "with n = 12 caveats attached".

The Question

what actually drives trust? ✦

Every P2P payment app (Venmo, Zelle, PayPal, Cash App) wraps itself in "bank-level encryption" badges, padlock icons, fraud-protection microcopy. The assumption baked into the industry is that visible security communication drives user trust.

But when I send $20 to someone on Venmo, I don't read the encryption badge. I check the profile photo, the username, the last transaction. My trust lives somewhere else entirely.

So which is it? Is the industry right and I'm an outlier, or is the industry wrong? And if it is wrong, by how much? Conjoint analysis can answer that with a number.

PRE-REGISTERED HYPOTHESES
H1 · Security badges drive trust
H₀: β_badge = 0 · H₁: β_badge > 0 · predicted importance: ≥ 20%
The industry's implicit bet. Reject H₀ if badge importance ≥ 15% with p < .05.
H2 · Identity signals drive trust more than technical signals
H₀: Σβ_identity ≤ Σβ_technical · H₁: Σβ_identity > Σβ_technical
Directional test comparing combined importance of name + photo + mutuals + history vs. badge + verified tick.
H3 · Badge importance scales with transaction amount
H₀: β_badge×amount = 0 · H₁: β_badge×amount > 0
Interaction term test. At $400, technical signals may compensate for reduced familiarity.
H4 · Rank ordering of attributes is stable under resampling
Leave-one-out jackknife across all 12 participants. Kendall's τ > 0.80 required.
Robustness check. Small samples can produce fluke orderings; this tests for it explicitly.
Pre-registered in a timestamped Notion doc before data collection. Predicted directions and thresholds were committed before a single choice was recorded.

The Method

choice-based conjoint · 7 attributes ✦
✦ why choice-based conjoint (CBC)

If I asked people "how important is encryption to you?" they'd all say "very important". Self-report on trust is useless: everyone performs the rational answer. Conjoint analysis asks instead: which of these two options would you pick? Trade-offs force revealed preferences, and with enough repetitions across many respondents, I can decompose each choice into part-worth utilities per attribute level.

I mocked a neutral "generic pay app" confirmation screen with 6 trust attributes (varied at 2–3 levels each) plus a transaction amount as context. Each participant saw 12 forced-choice tasks drawn from a 24-card D-efficient design. I modelled the choices in Conjointly's free academic tier to recover individual-level utilities, not just aggregate averages.

ATTR 01
Recipient display name
3 levels: full legal name · first initial + last name · username only
ATTR 02
Profile photo
3 levels: real portrait · initial (monogram) avatar · default placeholder
ATTR 03
Shared history
2 levels: 4 prior transactions · first-time recipient
ATTR 04
Mutual contacts badge
3 levels: "5 mutual friends" · "1 mutual" · no mutual shown
ATTR 05
Verified checkmark
2 levels: platform-verified tick present · absent
ATTR 06
Security badge visibility
3 levels: prominent "bank-level encryption" · small padlock icon · none
ATTR 07 · transaction amount (context, not preference)
Amount tier (rotated across tasks)
3 levels: $15 · $75 · $400. Varied across tasks so I could listen for whether participants reasoned differently at higher stakes.

Study Design

sample size, power, guardrails ✦
✦ why n = 12 (and what that buys me)

I'm one person with no funding. A paid panel (Prolific, Qualtrics) was out of reach, so my recruiting pool was people I could reach personally: classmates, friends-of-friends, two family members. Twelve was the number who'd sit with me for a 30-minute Zoom each.

I kept the quantitative method (choice-based conjoint) anyway. A dedicated conjoint tool (I used Conjointly's free academic tier) can still produce interpretable utility estimates at n = 12. The error bars are wide, but the rank ordering of attribute importance was stable when I held out different participants and re-ran. That's enough to answer "which signals matter most?" with reasonable confidence, even if it's not enough to generalize the exact importance scores to the US population.

Orme's rule of thumb says ~63 respondents for aggregate CBC estimation at this design's size (12 tasks, 2 options per task, largest attribute = 3 levels). At 12, I have roughly a fifth of that, and I say so in the Limitations section. The goal here was to practice the craft and get a directionally honest answer, not publish a representative study.
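The ~63 figure can be reproduced from the rule of thumb as it is commonly written, n ≥ 500·c / (t·a), where t is tasks per respondent, a is options per task, and c is the largest number of levels in any attribute. A minimal sketch, assuming that form of the rule; the function name is illustrative, not from any library.

```python
import math

def orme_min_respondents(tasks: int, alternatives: int, max_levels: int) -> int:
    """Rule-of-thumb minimum sample for aggregate CBC: n >= 500 * c / (t * a)."""
    return math.ceil(500 * max_levels / (tasks * alternatives))

# This study's design: 12 tasks, 2 screens per task, largest attribute has 3 levels.
n_needed = orme_min_respondents(tasks=12, alternatives=2, max_levels=3)
share_collected = 12 / n_needed  # fraction of the recommended sample actually run

print(n_needed)                   # 63
print(round(share_collected, 2))  # 0.19
```

Adding tasks per person lowers the requirement, but only by trading sample size for respondent fatigue.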

METHOD
Choice-based conjoint
12 forced-choice tasks per respondent from a 24-card D-efficient design (D-eff 0.91). 3 holdout tasks for validation.
ANALYSIS
Conjoint utility model
Ran the analysis in Conjointly's free academic tier. Got per-attribute importance scores and individual-level utilities out of the box β€” no coding required.
COST
$0 cash
Zero budget. Sessions on Zoom. Stimuli in free Figma. Thanked each participant with coffee or a handwritten note.

The Participants

12 from my own network, all active P2P users ✦

Twelve participants recruited from my own network: classmates, friends-of-friends, two family members. All active monthly users of at least one P2P payment app (Venmo, Zelle, Cash App, PayPal). A convenience sample, not representative of the US. The Limitations section names this explicitly.

✦ age range of participants
20–29
7
of 12
30–44
3
of 12
45–59
2
of 12
Skews young, which reflects who I could reach. A limitation named explicitly in the Reflections section.
✦ primary P2P app used
Venmo
6
of 12
Cash App
3
of 12
Zelle
2
of 12
PayPal
1
of 12
PEOPLE ASKED
18
COMPLETED
12
CHOICE OBSERVATIONS
144
AVG SESSION
32 min

The Process

step by step. what i thought → why → what i did ✦

Twelve steps over six weeks. Each one is broken into the thought that kicked it off, why it mattered, and what I actually did. This is the part I wish other case studies showed me when I was learning.

01
week 1 · day 1–2
Turning a hunch into a falsifiable question
what I thought
"Security badges are probably performative. Nobody reads them."
why this step mattered
A hunch is not a research question. If I couldn't state a version that could be proven wrong, I'd end up p-hacking my way to my priors. I needed to lock the hypothesis before I saw any data.
what I did about it
Reframed the hunch as four competing hypotheses (H1–H4 in the Question section). Pre-registered them in a Notion doc with predicted effect sizes and minimum importance thresholds before writing the survey. This kept me honest later when I was tempted to re-interpret results.
02
week 1 · day 3–4
Deciding between self-report, behavioral, and conjoint
what I thought
"The fastest method is a Likert survey 'how important is X to you, 1–7'. But that'll give me garbage data."
why this step mattered
Choice of method determines what the data can say. Direct importance ratings suffer from social desirability bias (everyone rates security 7/7) and the no-trade-off problem (everything ends up "very important"). I needed a method that forces trade-offs.
what I did about it
Compared three options: Likert survey (fast, noisy), MaxDiff (forced ranking, but can't estimate interactions), CBC conjoint (slower to design, but I could model interactions and recover individual utilities). Picked CBC. Documented the reasoning in the pre-reg doc so I couldn't switch methods mid-study because results were inconvenient.
03
week 1 · day 5–7
Selecting and operationalizing the attributes
what I thought
"There are dozens of things on a payment confirmation screen. I can't test them all."
why this step mattered
CBC task complexity scales with attribute count. Past ~6 attributes, respondents start simplifying by ignoring some (non-compensatory decision-making), which corrupts the utility estimates. I had to be ruthless.
what I did about it
Did a screener exercise: spent 2 hours screenshotting every P2P confirmation screen I could find (Venmo, Cash App, Zelle, PayPal + three neobanks). Listed every visual element. Then I dropped the whole list into Claude and asked it to help me cluster the elements into candidate "trust signal" categories and flag blind spots I might have missed. It surfaced "shared transaction history" as a category I'd had scattered across three rows. I kept the clustering I agreed with, discarded a couple of suggestions that felt forced, and dropped anything not visible on more than half the apps. Landed on the 6 attributes + 1 context (amount) shown earlier. Defined each at 2–3 levels, keeping the largest at 3 so the sample-size calc stayed manageable.
04
week 2 · day 1–3
Generating the choice design
what I thought
"I'll just random-sample combinations. Should be fine, right?"
why this step mattered
Random designs produce huge standard errors because levels get correlated by accident. A D-efficient design minimizes those correlations and means I need far fewer respondents for the same precision. This is where most first-time CBC researchers get burned.
what I did about it
Used Conjointly's built-in design generator to produce a D-efficient partial-profile design. 24 unique tasks, blocked into 2 versions of 12 tasks each. D-efficiency came out at 0.91; I'd been warned to aim above 0.80. Also added 3 "holdout" tasks not used in model fitting, reserved for validating predictive accuracy later.
05
week 2 · day 4–6
Designing the stimulus so attributes actually land
what I thought
"I'll describe the options in text. Simpler, right?"
why this step mattered
Trust is a visual/gut response. Text bullets like "shows padlock icon · yes" don't trigger the same instinctive evaluation as an actual screen. But if I made the stimuli too realistic I'd smuggle in extra variables I hadn't controlled for.
what I did about it
Mocked a neutral "generic pay app" confirmation screen in Figma stripped of any brand elements (no Venmo blue, no Cash App green). Designed 16 component variants (one per attribute level), so every stimulus was a compositing job from the same kit. Kept typography, spacing, and CTA identical. Exported 24 images as the CBC stimuli.
06
week 3 · day 1–2
Pilot with 2 friends, and rewriting everything
what I thought
"The survey feels clear to me. Let's ship it."
why this step mattered
Surveys always feel clear to the person who wrote them. A pilot catches ambiguous wording, broken randomization, and task fatigue before you burn your tiny participant pool on a broken protocol.
what I did about it
Ran a soft pilot with 2 friends (not counted in the final 12). Median completion was 14 minutes, way too long, with consistency dropping in the last 3 tasks. Both pilot participants misread "mutual contacts" as "mutual funds". I rewrote the label, cut the task count from 14 to 12, added a micro-illustration explaining "mutual contacts" on the intro screen, and repiloted. Completion time dropped to ~8 minutes of actual task work (plus ~20 min of onboarding + debrief for the real sessions). Shipped.
07
week 3 · day 3–5
Building attention checks that actually work
what I thought
"I know everyone in my sample personally, so I don't need attention checks."
why this step mattered
Friends want to help you. That means they'll click through even when their head isn't in it. Attention checks aren't about catching cheaters here. They're about flagging tasks where someone zoned out, so I don't treat a tired click as a meaningful preference.
what I did about it
Built two lightweight checks: (1) a trap task where one option strictly dominated the other on every attribute (anyone picking the worse option there got a gentle "hey, want to revisit this one?" prompt in a follow-up email); (2) a timing filter that flagged sub-10-second responses in my analysis script. Across 12 sessions, 1 participant failed the trap, talked it through, and I kept the re-done response. Zero timing failures. Small sample means I could actually pay attention.
08
week 3 · day 6–7
Running the sessions and keeping field notes
what I thought
"Send everyone the link and collect the data."
why this step mattered
When the sample is 12 people, unmoderated completion is a waste of a rare resource. If I'm already scheduling Zoom time with each person, I can confirm they understood the task and catch quirks in real time.
what I did about it
Scheduled 30-min Zoom sessions over two weeks. Each one: 5 min walkthrough of the attributes, 12 CBC tasks completed by the participant (I stayed silent), 3 holdout tasks, 10 min debrief with open questions. Kept a field journal for each session: quirks, confusion points, verbatim remarks. After all 12 sessions wrapped, I pasted the anonymized debrief notes into Claude and asked it to help me surface recurring themes and tensions across participants. It flagged a pattern I'd half-noticed (people kept mentioning the recipient's profile photo even when I hadn't asked), which shaped how I read the quantitative results later.
09
week 4 · day 1–3
Cleaning the data before looking at any results
what I thought
"Just load the CSV and start modeling."
why this step mattered
Looking at aggregate results first anchors your expectations. By the time I was tempted to cut respondents, I'd already be influenced by what cutting them does to the headline. Clean first, analyze second.
what I did about it
Exported the raw choice data from the survey tool into a spreadsheet. Applied my pre-registered exclusion criteria in a separate tab before computing anything else: 0 of 12 excluded (after the trap-task correction from step 7). Flagged straight-liners (same option every task) in a third tab: zero cases. Saved a timestamped copy of the clean dataset before running any analysis.
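The two mechanical flags described in steps 7 and 9 (sub-10-second responses and straight-lining) are simple enough to sketch in code. This is an illustration under assumptions, not the spreadsheet workflow used in the study: the `Response` layout, the function names, and the demo rows are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Response:
    participant: str
    task: int
    choice: str     # "A" or "B"
    seconds: float  # time spent on the task

def fast_responses(rows, min_seconds=10.0):
    """Return (participant, task) pairs answered faster than the threshold."""
    return [(r.participant, r.task) for r in rows if r.seconds < min_seconds]

def straight_liners(rows):
    """Return participants who picked the same option on every task."""
    picked = {}
    for r in rows:
        picked.setdefault(r.participant, set()).add(r.choice)
    return sorted(p for p, options in picked.items() if len(options) == 1)

# Made-up rows for illustration only; not the study's data.
demo = [
    Response("X1", 1, "A", 31.0), Response("X1", 2, "B", 8.5),
    Response("X2", 1, "A", 22.0), Response("X2", 2, "A", 40.0),
]
print(fast_responses(demo))   # [('X1', 2)]
print(straight_liners(demo))  # ['X2']
```

Both functions only flag; whether a flagged response is excluded stays a pre-registered, human decision.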
10
week 4 · day 4–7
Running the conjoint analysis
what I thought
"I have the choices. Now I just count which option wins most, right?"
why this step mattered
Counting winners tells you almost nothing. Every choice involves multiple attributes at once, so you can't just say "option B won, so badges matter". Conjoint analysis decomposes the choices into per-attribute utility scores and translates those into relative importance. I'd need a tool that does this, because the math is not spreadsheet-friendly.
what I did about it
Uploaded the 12 participants' choice data into Conjointly (free academic tier). It handled the modeling; I just needed to specify my attributes, levels, and experiment design. Output: relative importance scores per attribute + individual-level utilities per participant. Then I held out one participant at a time and re-ran the analysis to see how stable the ranking was. It held up (details in step 11).
11
week 5 · day 1–4
The moment the data disagreed with me
what I thought
"Security badging will land around 20–25% importance. Lower than recipient identity, but meaningful."
why this step mattered
This is where bias eats a study. If my preferred finding is "badges don't matter", I'll want badges low. If the effect is smaller than I expected in either direction, I have to report it honestly.
what I did about it
Security badge importance came in at 7.1%, lower than I'd expected, even against my own counter-hypothesis. I sat with that for a day. Re-checked the data for entry errors. Ran the leave-one-out check: held out each of the 12 participants in turn and re-ran the analysis. Rank ordering of attributes held in every version, and badge importance ranged from 5.9% to 8.4% across the 12 leave-one-out runs. The direction of the effect was robust, even if the exact magnitude is uncertain at this sample size.
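The leave-one-out check pairs naturally with Kendall's τ, the H4 stability measure. The sketch below is a simplified stand-in: it averages per-person importance vectors instead of re-estimating the model for each run (the study re-ran the analysis in Conjointly), and the demo rows are made up. All function names are illustrative.

```python
from itertools import combinations

def mean_importance(rows):
    """Column-wise mean of per-participant importance vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists (tied pairs count as 0)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs

def jackknife_taus(rows):
    """Tau between full-sample importances and each leave-one-out version."""
    full = mean_importance(rows)
    return [kendall_tau(full, mean_importance(rows[:i] + rows[i + 1:]))
            for i in range(len(rows))]

# Made-up importances (3 attributes, 4 people); the last row is a deliberate outlier.
demo = [[50, 30, 20], [50, 30, 20], [50, 30, 20], [40, 10, 50]]
taus = jackknife_taus(demo)
print([round(t, 2) for t in taus])  # [1.0, 1.0, 1.0, 0.33]
```

In the toy data, dropping the outlier flips the order of the last two attributes and τ falls to 0.33, which is the kind of fragility a τ > 0.80 threshold is meant to catch.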
12
week 5 · day 5 → week 6
Testing interactions and writing up what it means
what I thought
"Surely badges matter more at $400 than $15. Technical signals must kick in somewhere."
why this step mattered
An average that hides a real interaction is a misleading average. If badges only matter for large amounts, the headline changes from "badges are useless" to "badges are contextual". I had to test it before writing up.
what I did about it
Split the data by amount tier ($15, $75, $400) and re-ran the analysis for each subset. Then did the same by age bracket (20–29, 30–44, 45–59). Badge importance stayed in the single digits in every cut except the 45–59 bracket, where one fraud-scarred participant (P07) pulled a two-person average up. With n = 12 split this far, I'm explicit that these subgroup numbers are noisy at best. The main result (identity > technical) held in every cut. Used Claude as a drafting partner for the case-study writeup: fed it my raw notes and the data, pushed back when it tried to soften the "badges don't matter" finding, kept the honest version. Led with the finding, flagged every small-sample caveat.

The Instrument

what participants actually saw ✦
✦ a real CBC task (as participants saw it)
"Imagine you need to send $75. Which of these screens would you feel more comfortable completing?"
OPTION A
← Confirm payment
MT
Marcus Taylor
@mtaylor · 5 mutual contacts
✓ 4 prior transactions
Last sent Oct 2025
Amount
$75.00
Send $75.00
name: real (first + last)
photo: real portrait
mutuals: 5
history: 4 prior
verified: no
security badge: absent
OPTION B
← Confirm payment
?
@cash_flow_22
no mutual contacts
! First-time recipient
No prior transactions
Amount
$75.00
🔒 Protected by bank-level encryption
Send $75.00
name: username only
photo: placeholder
mutuals: 0
history: none
verified: no
security badge: present

Neutral "generic pay app" frame: no brand colors, no Venmo blue or Cash App green. Each participant saw 12 tasks assembled from the same component kit; only the six attributes varied.

✦ task design across the 12 sessions

Tasks rotated across three amount tiers.

Every participant saw 4 tasks at each amount ($15 / $75 / $400), drawn from the 24-card D-efficient design and randomized per-participant. Below: one representative prompt per tier, plus the full 12-task layout a participant actually saw.

LOW STAKES · $15
"Imagine you need to send $15. Which would you feel more comfortable completing?"
Coffee-run money. Testing whether trust signals matter at all when stakes are low.
MID STAKES · $75
"Imagine you need to send $75. Which would you feel more comfortable completing?"
Groceries, dinner. Roughly a typical everyday Venmo send amount.
HIGH STAKES · $400
"Imagine you need to send $400. Which would you feel more comfortable completing?"
Rent share, deposit. Testing the H3 amount × badge interaction hypothesis.
ALL 12 TASKS FROM A SINGLE PARTICIPANT'S SESSION (block A)
Task · amount | Option A | Option B
T01 · $15 | M. Taylor, 5 mutuals | @cf_22, 🔒 enc
T02 · $75 | Jordan L., ✓ 4 prior | @skate_4, 1 mutual
T03 · $400 | Priya R., 5 mutuals | @bass_99, 🔒 enc
T04 · $15 | @dev_r, 0 mutuals | Sam Chen, ✓ 4 prior
T05 · $75 | @moon_03, no mutuals | Ana Vasquez, ✓ 4 prior
T06 · $400 | K. Okafor ✓, 1 mutual | @night_22, no mutuals
T07 · $15 | Leo Patel, 🔒 enc | @grr_99, 1 mutual
T08 · $75 | Maya B. ✓, 5 mutuals | @tune_3, 🔒 enc
T09 · $400 | Rohan S., ✓ 4 prior | @val_11, 0 mutuals
T10 · $15 | @box_4, no mutuals | Tara N., 1 mutual
T11 · $75 | Noah K. ✓, 🔒 enc | @ink_02, 1 mutual
T12 · $400 | Eva T., 5 mutuals | @wolf_8, 🔒 enc
Block A layout shown (6 participants); Block B rotates the remaining 12 cards from the 24-card design. Every participant saw 4 tasks per amount tier, order randomized. Attribute combinations are deliberately near-orthogonal, which is what D-efficiency measures.
✦ the stimulus kit

16 component variants, one per attribute level.

Built as a small design system so every stimulus was a compositing job from identical pieces. Typography, spacing, and CTA stayed locked β€” only the trust-signal components swapped in.

RECIPIENT DISPLAY NAME · 3 levels
L1 · Marcus Taylor · real first + last name
L2 · M. Taylor · initial + last name
L3 · @cash_flow_22 · username only
PROFILE PHOTO · 3 levels
L1 · MT · real portrait
L2 · M · initial avatar
L3 · ? · placeholder
MUTUAL CONTACTS · 3 levels
L1 · 👥 5 mutual contacts
L2 · 👥 1 mutual contact
L3 · no mutual contacts
SHARED HISTORY · 2 levels
L1 · ✓ 4 prior transactions
L2 · ! First-time recipient
VERIFIED CHECKMARK · 2 levels
L1 · Marcus Taylor ✓
L2 (absent) · Marcus Taylor
SECURITY BADGE · 3 levels ← the headline variable
L1 · 🔒 Protected by bank-level encryption
L2 · 🔒 small padlock icon only
L3 · no badge
✦ session structure

How each 30-min Zoom session ran.

01
Consent + warm-up
informed consent · quick chat to settle in
02
Demographics + P2P usage
age, primary app, frequency, recent send amount
03
CBC walkthrough + practice
5 min explainer of each attribute · 1 throwaway task
04
12 CBC tasks + 1 trap
main block · randomized · participant works silently
05
3 holdout tasks
held back for predictive validation
06
Open debrief + thanks
"walk me through your thinking" + field notes + coffee

The Results

relative attribute importance ✦
✦ relative importance scores (higher = drives more of the choice)
n = 12 · choice-based conjoint (Conjointly academic tier) · small sample, read as directional
Recipient display name
38.4%
Shared transaction history
21.2%
Profile photo
15.8%
Mutual contacts badge
10.9%
Verified checkmark
6.6%
Security badge
7.1%
In-sample hit rate: 84.0%
Log-likelihood: −58.14
McFadden R²: 0.417
Name · history p < .001 · photo p < .01
✦ part-worth utility: recipient name
Full legal name: +0.92
Initial + last name: +0.11
Username only: −1.03

A full legal name adds +0.92 utility relative to the mean; a bare username subtracts 1.03. That is a 1.95-point swing, the widest of any attribute and nearly five times the security badge's 0.40.

✦ part-worth utility: security badge
"Bank-level encryption" banner: +0.18
Small padlock icon: +0.04
No security badge: −0.22

The full swing from "no badge" to the explicit "bank-level encryption" banner is just 0.40 utility, about a fifth of the name swing (1.95). The only reliably non-zero piece is the small penalty for showing nothing; a bare padlock icon does nothing at all.
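The "swing" logic generalizes: an attribute's relative importance is its utility range divided by the sum of all ranges. A minimal sketch using the group-level part-worths reported in this section; the function name is illustrative. Note the assumption: this works from aggregate utilities, so it lands near but not exactly on the headline percentages, which match the mean of participant-level importances in the heatmap.

```python
# Group-level part-worths as reported in this section (zero-centered per attribute).
part_worths = {
    "name":     [0.92, 0.11, -1.03],
    "history":  [0.58, -0.58],
    "photo":    [0.41, 0.02, -0.43],
    "mutuals":  [0.29, 0.04, -0.33],
    "verified": [0.16, -0.16],
    "badge":    [0.18, 0.04, -0.22],
}

def relative_importance(utils):
    """Importance (%) = attribute utility range / sum of all attribute ranges."""
    ranges = {attr: max(levels) - min(levels) for attr, levels in utils.items()}
    total = sum(ranges.values())
    return {attr: 100 * r / total for attr, r in ranges.items()}

importance = relative_importance(part_worths)
for attr, pct in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{attr:9s}{pct:5.1f}%")
```

On these inputs the ordering is name, history, photo, mutuals, badge, verified, the same ranking as the reported importance scores.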

✦ interaction tests (H3 + exploratory)
Badge × transaction amount
Badge importance at $15: 6.8% · at $75: 7.0% · at $400: 7.5%. Interaction effect p = 0.18. Not significant, so H₀ stands and H3 is not supported.
Badge × age group
20–29: 5.3% · 30–44: 6.3% · 45–59: 14.5%. Interaction p = 0.14. Not significant. The 45–59 figure rides entirely on P07 (n = 2), so read it as exploratory, not pre-registered.
✦ full part-worth utilities with 95% CIs

Every attribute level, every confidence interval.

Hierarchical Bayes (HB) estimation via Conjointly, 10,000 MCMC iterations, 2,000 burn-in. Utilities are zero-centered within each attribute. CIs are 95% posterior intervals. With n = 12 they are wide, and that's the point: the case is in the ranking, not the precision.

Attribute · Level | Utility (β) | SE | z | 95% CI | p
Display name · full legal (Marcus Taylor) | +0.92 | 0.19 | 4.84 | [+0.55, +1.29] | <.001
Display name · initial + last (M. Taylor) | +0.11 | 0.17 | 0.65 | [−0.22, +0.44] | .516
Display name · username only (@cash_flow_22) | −1.03 | 0.21 | 4.90 | [−1.44, −0.62] | <.001
Shared history · 4 prior transactions | +0.58 | 0.14 | 4.14 | [+0.31, +0.85] | <.001
Shared history · first-time recipient | −0.58 | 0.14 | 4.14 | [−0.85, −0.31] | <.001
Profile photo · real portrait | +0.41 | 0.13 | 3.15 | [+0.16, +0.66] | .002
Profile photo · monogram | +0.02 | 0.12 | 0.17 | [−0.22, +0.26] | .867
Profile photo · silhouette (default) | −0.43 | 0.13 | 3.31 | [−0.68, −0.18] | .001
Mutuals · 3+ shown | +0.29 | 0.12 | 2.42 | [+0.06, +0.52] | .016
Mutuals · 1 shown | +0.04 | 0.11 | 0.36 | [−0.18, +0.26] | .720
Mutuals · none | −0.33 | 0.12 | 2.75 | [−0.57, −0.09] | .006
Verified checkmark · present | +0.16 | 0.09 | 1.78 | [−0.02, +0.34] | .075
Verified checkmark · absent | −0.16 | 0.09 | 1.78 | [−0.34, +0.02] | .075
Security badge · "bank-level encryption" | +0.18 | 0.10 | 1.80 | [−0.02, +0.38] | .072
Security badge · small padlock icon | +0.04 | 0.10 | 0.40 | [−0.16, +0.24] | .689
Security badge · none | −0.22 | 0.10 | 2.20 | [−0.42, −0.02] | .028
Significance key: p < .05 · trending (.05 ≤ p < .10) · n.s.
HB posterior estimates · zero-centered within attribute · Conjointly report export, Nov 2025
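To see what these utilities imply for a single task, the sample $75 screen pair from the Instrument section can be scored with a plain binary logit rule. This is a sketch under assumptions: the logit share formula is the textbook one, not necessarily how Conjointly simulates shares; the amount context is ignored; and the "5 mutual contacts" label is assumed to map to the table's "3+ shown" row.

```python
import math

# Group-level part-worths for the levels shown in the sample task.
option_a = {"name: full legal": 0.92, "photo: real portrait": 0.41,
            "mutuals: 5 shown": 0.29, "history: 4 prior": 0.58,
            "verified: absent": -0.16, "badge: none": -0.22}
option_b = {"name: username only": -1.03, "photo: placeholder": -0.43,
            "mutuals: none": -0.33, "history: first-time": -0.58,
            "verified: absent": -0.16, "badge: bank-level encryption": 0.18}

def total_utility(option):
    """Additive model: an option's utility is the sum of its level part-worths."""
    return sum(option.values())

def share_choosing_first(first, second):
    """Binary logit: P(first) = exp(U1) / (exp(U1) + exp(U2))."""
    diff = total_utility(first) - total_utility(second)
    return 1 / (1 + math.exp(-diff))

print(round(total_utility(option_a), 2))                   # 1.82
print(round(total_utility(option_b), 2))                   # -2.35
print(round(share_choosing_first(option_a, option_b), 3))  # 0.985
```

Option B carries the only encryption banner, yet the banner's +0.18 is swamped by the identity levels, which is the headline finding in miniature.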
✦ model fit

The model learned real signal.

Log-likelihood: −58.14 (vs. null: −99.81)
McFadden's pseudo-R²: 0.417 (0.2–0.4 = excellent fit; higher is stronger)
Likelihood-ratio χ²: 83.35 (df = 10)
Overall p-value: < .001 (vs. random-choice null)
Root Likelihood (RLH): 0.668 (null baseline: 0.500)
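These fit statistics all derive from two numbers: the model log-likelihood and the number of choice observations. A short sketch of the standard formulas, assuming a random-choice null over two options; variable names are illustrative.

```python
import math

n_obs = 144                       # 12 participants x 12 tasks
n_alternatives = 2                # forced choice between two screens
ll_model = -58.14                 # reported model log-likelihood
levels = [3, 3, 2, 3, 2, 3]       # levels per trust attribute

ll_null = n_obs * math.log(1 / n_alternatives)  # every choice a coin flip
mcfadden_r2 = 1 - ll_model / ll_null
lr_chi2 = 2 * (ll_model - ll_null)
df = sum(k - 1 for k in levels)                 # free parameters under effects coding
rlh = math.exp(ll_model / n_obs)                # geometric-mean predicted probability

print(round(ll_null, 2))      # -99.81
print(round(mcfadden_r2, 2))  # 0.42
print(round(lr_chi2, 2))      # 83.35
print(df)                     # 10
print(round(rlh, 3))          # 0.668
```

Recomputing from the rounded log-likelihood gives a pseudo-R² of about 0.4175, so the third decimal of the reported 0.417 depends on rounding.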
✦ holdout validation

Predictions held up on unseen tasks.

Holdout tasks: 36 (3 per participant)
Predicted correctly: 29 / 36 (chance baseline = 18/36)
Hit rate: 80.6% (binomial p < .001 vs. 50%)
Internal consistency: 12 / 12 trap task pass rate (after the one re-done response from step 7)
Jackknife rank stability: τ = 0.93 (Kendall's τ, H4 threshold 0.80)
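The binomial comparison against chance can be checked exactly with the standard library. A minimal sketch, assuming a one-sided exact test and treating the 36 holdout predictions as independent (they come from 12 people, 3 each, so that independence is an approximation); the function name is illustrative.

```python
from math import comb

def binomial_upper_tail(hits: int, n: int, p: float = 0.5) -> float:
    """P(X >= hits) for X ~ Binomial(n, p): a one-sided exact test."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(hits, n + 1))

hit_rate = 29 / 36
p_value = binomial_upper_tail(29, 36)

print(f"{hit_rate:.1%}")  # 80.6%
print(p_value < 0.001)    # True
```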
✦ participant-level importance heatmap

Every participant weighed attributes their own way.

HB gives an individual-level utility vector per participant. Rows = participants (P01–P12), columns = the six attributes, cell intensity = relative importance for that person. Notice: the "security badge" column is the dimmest across almost every row, so this finding isn't being dragged by one or two outliers. The heterogeneity is real (look at P07's row: the one outlier who weighted the badge far higher than anyone else, at 24%, second only to name), but the modal pattern dominates.

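Participant | NAME | HISTORY | PHOTO | MUTUALS | VERIFIED | BADGE
P01 | 41 | 19 | 18 | 10 | 6 | 6
P02 | 36 | 24 | 16 | 12 | 7 | 5
P03 | 40 | 21 | 14 | 12 | 6 | 7
P04 | 43 | 18 | 15 | 11 | 7 | 6
P05 | 34 | 25 | 18 | 10 | 7 | 6
P06 | 37 | 23 | 14 | 12 | 8 | 6
P07 ↴ | 28 | 18 | 14 | 9 | 7 | 24
P08 | 42 | 21 | 16 | 10 | 6 | 5
P09 | 39 | 22 | 16 | 12 | 6 | 5
P10 | 35 | 24 | 17 | 12 | 7 | 5
P11 | 40 | 20 | 16 | 12 | 7 | 5
P12 | 46 | 19 | 16 | 9 | 5 | 5
MEAN | 38.4 | 21.2 | 15.8 | 10.9 | 6.6 | 7.1
SD | 4.6 | 2.3 | 1.3 | 1.2 | 0.8 | 5.1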
Values are participant-level relative importance (%); each row sums to 100. P07 is the one outlier who weighted the badge far higher than anyone else (24%, second only to their own 28% on name): a 45-year-old who explicitly mentioned a recent fraud scare in debrief. Even including P07, the group-level rank ordering holds. The badge's high SD (5.1) reflects this one person, not widespread variance.
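The claim that the badge's spread comes from one person can be checked directly from the BADGE column of the heatmap. A small sketch; it assumes the SD row uses the population formula (`pstdev`), which is what reproduces the reported 5.1.

```python
from statistics import mean, pstdev

# BADGE column of the participant-level heatmap (relative importance, %).
badge = {"P01": 6, "P02": 5, "P03": 7, "P04": 6, "P05": 6, "P06": 6,
         "P07": 24, "P08": 5, "P09": 5, "P10": 5, "P11": 5, "P12": 5}

all_values = list(badge.values())
without_p07 = [v for p, v in badge.items() if p != "P07"]

print(round(mean(all_values), 1), round(pstdev(all_values), 1))    # 7.1 5.1
print(round(mean(without_p07), 1), round(pstdev(without_p07), 1))  # 5.5 0.7
```

With P07 removed, the spread collapses from 5.1 to 0.7 points, so the other eleven participants sit within a point or two of each other.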

Insights → Design

numbers into opinions with teeth ✦
INSIGHT 01
Trust is social infrastructure, not technical chrome.

Name, photo, history and mutuals together = 86.3% of the decision. Technical reassurance = 13.7%. Stop designing encryption PR pages before the product solves the "is this the right person" problem.

INSIGHT 02
Legal names are worth building for, even culturally.

Venmo's username culture optimizes for delight. But the data says a verified legal name moves trust more than every security badge combined. There's a design challenge here: surface legal identity without killing the playful handle.

INSIGHT 03
"Shared history" is a mostly-free trust boost.

21% importance. Showing "you've sent $X to this person N times before" right above the confirm button is a low-lift, high-yield change, and none of the four apps I reviewed makes this prominent in the confirm flow.

INSIGHT 04
Mutuals matter more than verified ticks.

10.9% vs 6.6%. "5 mutual friends" is a stronger trust signal than a platform-verified checkmark. Social proof beats institutional proof for transaction-level trust.

INSIGHT 05
Stakes don't change the calculus.

I expected users to shift toward technical signals at higher amounts. They didn't. At $400 they wanted more identity confirmation, not more cryptography. This has implications for step-up friction design.

THE BIG RECOMMENDATION
Move "is this the right person?" UI above the fold on every confirm screen.

Legal name, profile photo, shared-history pill, mutuals. That's the trust stack. The padlock stays, but it earns its keep at 7%.

Limits & Reflections

what the study can't tell you ✦
Stated preference, not real money.

Conjoint asks what people would do. Nobody actually sent $400 in my study. A follow-up with live-fire behavioral data (click-through on real confirm screens) could either validate or deflate these utilities.

n = 12 is small. Treat everything as directional.

The biggest honest caveat. Orme's rule for CBC wants ~63 respondents for aggregate estimation at this design's size. I'm at 12. Point estimates have wide credible intervals; the rank ordering of attribute importance was stable under leave-one-out jackknifes, but treat specific percentages as rough. A follow-up on a paid panel (Prolific, ~$600 for n ≈ 400) is the natural next step.

Convenience sample, not representative.

My recruits are people in my network, skewing young, urban, college-educated. The 45–59 bracket (n = 2) is statistically meaningless. Nothing here can be generalized to older or less digitally-fluent users. That matters a lot for a trust question, where older users may weight technical signals very differently.

Badges were visible, but neutral.

My stimuli used generic "bank-level encryption" wording. A branded trust seal (Norton, McAfee, FDIC) might perform differently. Worth a follow-up to isolate wording effects from badge-presence effects.

Trust ≠ behavior.

"Feel comfortable completing" is a proxy for behavioral intent. Some users might feel cautious but still send the money because they have to. This study measures felt trust, not revealed conversion.

✦ what I'd study next

Now I want to see if the same model holds for sending money to strangers.

Everything here assumed the recipient was someone the sender recognized. The interesting next question is marketplace payments: Facebook Marketplace, Craigslist, OfferUp. When the recipient is a stranger, does encryption badging finally earn its weight? Or does social context (mutuals, platform reputation) stay dominant?

I'd run the same conjoint with a marketplace framing, same 7 attributes, and test whether the importance ranking scrambles. If it does, interesting. If it doesn't, even more interesting.

← back to work
D.
made with care · 2026