Independent Research · US Market

What makes users trust
a payment app?
A conjoint study.

A small-n quantitative UX study measuring how users of US P2P payment apps (Venmo, Zelle, PayPal, Cash App) weigh specific trust signals when deciding to send money. 12 participants. Choice-based conjoint with a hierarchical Bayesian model. One finding that flipped my hypothesis on its head.

⏱ 6 weeks 📊 Choice-based conjoint 📋 n = 12 👤 Solo · Personal Project 🎓 Academic Case Study 📈 6 Trust Attributes

✦ what this study found

I assumed visible encryption badges drove trust. They scored 7.1% importance. The recipient's real name scored 5.4× as high.

12 participants recruited from my own network: classmates, friends-of-friends, two family members. All active monthly users of at least one US P2P payment app.

38.4%

relative importance of recipient display name — the largest driver of perceived trust across all six tested attributes. Dominates at every transaction amount.

7.1%

relative importance of the security / encryption badge. The industry-standard trust signal is second-from-last on the list — barely ahead of the verified checkmark.

📊 a finding that overturned my hypothesis

Trust in payment apps is overwhelmingly social, not technical. Identity beats cryptography on every metric.

Across 12 respondents and 144 choice observations, the four identity attributes (name, photo, shared history, mutual contacts) together account for 86.3% of total trust-driver importance. Encryption and verified-tick combined sit at 13.7%. Effects this large show up even at small n.

✦ scope of this project

Personal academic case study. One designer, six-week timeline, pre-registered, data and code open-sourced, small self-collected sample.

No client, no team, no production implications. Zero budget — so no paid panel, no Prolific. Just 12 people from my own network, a rigorous quantitative method, and honest reporting of both the finding and the caveats that come with n = 12.

✦ why this project matters to me

I kept hearing designers defend security badges as "trust signals". But I'd never once looked at one before sending $40 to a friend. I wanted a number, not an opinion.

This was a personal academic case study — no team, no funding, no client, no real-world implications. Just me, a 6-week timeline, and twelve generous people from my network who sat through a conjoint task each. I wanted to teach myself how to run a quantitative study end-to-end, and I wanted an honest answer to a question that had been bugging me for months — even if "honest" meant "with n = 12 caveats attached".

The Question

what actually drives trust? ✦

Every P2P payment app — Venmo, Zelle, PayPal, Cash App — wraps itself in "bank-level encryption" badges, padlock icons, fraud-protection microcopy. The assumption baked into the industry is that visible security communication drives user trust.

But when I send $20 to someone on Venmo, I don't read the encryption badge. I check the profile photo, the username, the last transaction. My trust lives somewhere else entirely.

So which is it? Is the industry right and I'm an outlier, or is the industry wrong? And if it is wrong, by how much? Conjoint analysis can answer that with a number.

PRE-REGISTERED HYPOTHESES

H1 · Security badges drive trust

H₀: β_badge = 0 · H₁: β_badge > 0 · predicted importance: ≥ 20%

The industry's implicit bet. Reject H₀ if badge importance ≥ 15% with p < .05.

H2 · Identity signals drive trust more than technical signals

H₀: Σβ_identity ≤ Σβ_technical · H₁: Σβ_identity > Σβ_technical

Directional test comparing combined importance of name + photo + mutuals + history vs. badge + verified tick.

H3 · Badge importance scales with transaction amount

H₀: β_badge×amount = 0 · H₁: β_badge×amount > 0

Interaction term test. At $400, technical signals may compensate for reduced familiarity.

H4 · Rank ordering of attributes is stable under resampling

Leave-one-out jackknife across all 12 participants. Kendall's τ > 0.80 required.

Robustness check. Small samples can produce fluke orderings; this tests for it explicitly.

Pre-registered in a timestamped Notion doc before data collection. Predicted directions and thresholds were committed before a single choice was recorded.

The Method

choice-based conjoint · 7 attributes ✦

✦ why choice-based conjoint (CBC)

If I asked people "how important is encryption to you?" they'd all say "very important". Self-report on trust is useless: everyone performs the rational answer. Conjoint analysis asks instead: which of these two options would you pick? Trade-offs force people to reveal their priorities, and with enough repetitions across many respondents, I can decompose each choice into part-worth utilities per attribute level.

I mocked a neutral “generic pay app” confirmation screen with 6 trust attributes (varied at 2–3 levels each) plus a transaction amount as context. Each participant saw 12 forced-choice tasks drawn from a 24-card D-efficient design. Modelled in Conjointly's free academic tier to recover individual-level utilities — not just aggregate averages.

ATTR 01

Recipient display name

3 levels: full legal name · first initial + last name · username only

ATTR 02

Profile photo

3 levels: real portrait · initial (monogram) avatar · default silhouette

ATTR 03

Shared history

2 levels: 4 prior transactions · first-time recipient

ATTR 04

Mutual contacts badge

3 levels: "5 mutual friends" · "1 mutual" · no mutual shown

ATTR 05

Verified checkmark

2 levels: platform-verified tick present · absent

ATTR 06

Security badge visibility

3 levels: prominent "bank-level encryption" · small padlock icon · none

ATTR 07 · transaction amount (context, not preference)

Amount tier (rotated across tasks)

3 levels: $15 · $75 · $400. Varied across tasks so I could listen for whether participants reasoned differently at higher stakes.

Study Design

sample size, power, guardrails ✦

✦ why n = 12 (and what that buys me)

I'm one person with no funding. A paid panel (Prolific, Qualtrics) was out of reach, so my recruiting pool was people I could reach personally — classmates, friends-of-friends, two family members. Twelve was the number who'd sit with me for a 30-minute Zoom each.

I kept the quantitative method (choice-based conjoint) anyway. A dedicated conjoint tool (I used Conjointly's free academic tier) can still produce interpretable utility estimates at n = 12. The error bars are wide, but the rank ordering of attribute importance was stable when I held out different participants and re-ran. That's enough to answer "which signals matter most?" directionally, even if it's not enough to generalize the exact importance scores to the US population.

Orme's rule of thumb says ~63 respondents for aggregate CBC estimation at this design's size (largest attribute = 3 levels). I'm at roughly a fifth of that, and I say so in the Limitations section. The goal here was to practice the craft and get a directionally honest answer, not publish a representative study.
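For anyone checking the ~63 figure, here is a minimal sketch of the commonly cited form of the rule, n ≥ 500 · c / (t · a), where c is the largest number of levels in any attribute, t the tasks per respondent, and a the alternatives per task. The function name is my own, and a = 2 assumes the two-option forced-choice tasks described in this study.

```python
import math

def orme_min_respondents(max_levels: int, tasks: int, alternatives: int) -> int:
    """Rule-of-thumb minimum n for aggregate CBC: n >= 500 * c / (t * a)."""
    return math.ceil(500 * max_levels / (tasks * alternatives))

# This study: largest attribute has 3 levels, 12 tasks, 2 options per task.
n_min = orme_min_respondents(max_levels=3, tasks=12, alternatives=2)
print(n_min)                  # 63
print(round(12 / n_min, 2))   # share of the rule-of-thumb sample collected
```

The rule is a planning heuristic for aggregate models, so it sets expectations about precision rather than a hard pass/fail line.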

METHOD

Choice-based conjoint

12 forced-choice tasks per respondent from a 24-card D-efficient design (D-eff 0.91). 3 holdout tasks for validation.

ANALYSIS

Conjoint utility model

Ran the analysis in Conjointly's free academic tier. Got per-attribute importance scores and individual-level utilities out of the box — no coding required.

COST

$0 cash

Zero budget. Sessions on Zoom. Stimuli in free Figma. Thanked each participant with coffee or a handwritten note.

The Participants

12 from my own network, all active P2P users ✦

Twelve participants recruited from my own network — classmates, friends-of-friends, two family members. All active monthly users of at least one P2P payment app (Venmo, Zelle, Cash App, PayPal). A convenience sample, not representative of the US. The Limitations section names this explicitly.

✦ age range of participants

[Chart: number of participants (of 12) in each age bracket: 20–29 · 30–44 · 45–59]

Skews young — reflects who I could reach. A limitation named explicitly in the Reflections section.

✦ primary P2P app used

[Chart: number of participants (of 12) by primary app: Venmo · Cash App · Zelle · PayPal]

PEOPLE ASKED

COMPLETED

CHOICE OBSERVATIONS

144

AVG SESSION

32 min

The Process

step by step. what i thought → why → what i did ✦

Twelve steps over six weeks. Each one is broken into the thought that kicked it off, why it mattered, and what I actually did. This is the part I wish other case studies showed me when I was learning.

week 1 · day 1–2

Turning a hunch into a falsifiable question

what I thought

"Security badges are probably performative. Nobody reads them."

why this step mattered

A hunch is not a research question. If I couldn't state a version that could be proven wrong, I'd end up p-hacking my way to my priors. I needed to lock the hypothesis before I saw any data.

what I did about it

Reframed the hunch as four competing hypotheses (H1–H4 in the Question section). Pre-registered them in a Notion doc with predicted effect sizes and minimum importance thresholds before writing the survey. This kept me honest later when I was tempted to re-interpret results.

week 1 · day 3–4

Deciding between self-report, behavioral, and conjoint

what I thought

"The fastest method is a Likert survey 'how important is X to you, 1–7'. But that'll give me garbage data."

why this step mattered

Choice of method determines what the data can say. Direct importance ratings suffer from social desirability bias (everyone rates security 7/7) and the no-trade-off problem (everything ends up "very important"). I needed a method that forces trade-offs.

what I did about it

Compared three options: Likert survey (fast, noisy), MaxDiff (forced ranking, but can't estimate interactions), CBC conjoint (slower to design, but I could model interactions and recover individual utilities). Picked CBC. Documented the reasoning in the pre-reg doc so I couldn't switch methods mid-study because results were inconvenient.

week 1 · day 5–7

Selecting and operationalizing the attributes

what I thought

"There are dozens of things on a payment confirmation screen. I can't test them all."

why this step mattered

CBC task complexity scales with attribute count. Past ~6 attributes, respondents start simplifying by ignoring some (non-compensatory decision-making), which corrupts the utility estimates. I had to be ruthless.

what I did about it

Did a screen audit: spent 2 hours screenshotting every P2P confirmation screen I could find (Venmo, Cash App, Zelle, PayPal + three neobanks). Listed every visual element. Then I dropped the whole list into Claude and asked it to help me cluster the elements into candidate "trust signal" categories and flag blind spots I might have missed. It surfaced "shared transaction history" as a category I'd had scattered across three rows. I kept the clustering I agreed with, discarded a couple of suggestions that felt forced, and dropped anything not visible on more than half the apps. Landed on the 6 attributes + 1 context (amount) shown earlier. Defined each at 2–3 levels, keeping the largest at 3 so the sample-size calc stayed manageable.

week 2 · day 1–3

Generating the choice design

what I thought

"I'll just random-sample combinations. Should be fine, right?"

why this step mattered

Random designs produce huge standard errors because levels get correlated by accident. A D-efficient design minimizes those correlations and means I need far fewer respondents for the same precision. This is where most first-time CBC researchers get burned.

what I did about it

Used Conjointly's built-in design generator to produce a D-efficient partial-profile design. 24 unique tasks, blocked into 2 versions of 12 tasks each. D-efficiency came out at 0.91 — I'd been warned to aim above 0.80. Also added 3 "holdout" tasks not used in model fitting, reserved for validating predictive accuracy later.
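To make "levels get correlated by accident" concrete, here is a small sketch. It is not Conjointly's design algorithm and it does not compute D-efficiency itself; it only compares the correlation between two hypothetical 2-level attributes (coded -1/+1) in a balanced 12-row design against 200 randomly drawn 12-row designs. The seed and the number of draws are arbitrary choices of mine.

```python
import itertools
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Balanced design: the 2x2 full factorial repeated three times (12 rows).
balanced = list(itertools.product([-1, 1], repeat=2)) * 3
balanced_r = pearson([r[0] for r in balanced], [r[1] for r in balanced])

rng = random.Random(7)  # fixed seed so the sketch is repeatable
worst = 0.0
for _ in range(200):
    rows = [(rng.choice([-1, 1]), rng.choice([-1, 1])) for _ in range(12)]
    a, b = [r[0] for r in rows], [r[1] for r in rows]
    if len(set(a)) > 1 and len(set(b)) > 1:  # skip constant columns
        worst = max(worst, abs(pearson(a, b)))

print(f"balanced design r = {balanced_r:.2f}")
print(f"largest |r| over 200 random 12-row designs = {worst:.2f}")
```

When two attributes are correlated in the design, the model cannot cleanly separate their effects, which is the problem a D-efficient generator is built to minimize.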

week 2 · day 4–6

Designing the stimulus so attributes actually land

what I thought

"I'll describe the options in text. Simpler, right?"

why this step mattered

Trust is a visual/gut response. Text bullets like "shows padlock icon · yes" don't trigger the same instinctive evaluation as an actual screen. But if I made the stimuli too realistic I'd smuggle in extra variables I hadn't controlled for.

what I did about it

Mocked a neutral "generic pay app" confirmation screen in Figma stripped of any brand elements (no Venmo blue, no Cash App green). Designed 16 component variants (one per attribute level), so every stimulus was a compositing job from the same kit. Kept typography, spacing, and CTA identical. Exported 24 images as the CBC stimuli.

week 3 · day 1–2

Pilot with 2 friends, and rewriting everything

what I thought

"The survey feels clear to me. Let's ship it."

why this step mattered

Surveys always feel clear to the person who wrote them. A pilot catches ambiguous wording, broken randomization, and task fatigue before you burn your tiny participant pool on a broken protocol.

what I did about it

Ran a soft pilot with 2 friends (not counted in the final 12). Median completion was 14 minutes, way too long, with consistency dropping in the last 3 tasks. Both pilot participants misread "mutual contacts" as "mutual funds". I rewrote the label, dropped from 14 to 12 tasks, added a micro-illustration explaining "mutual contacts" on the intro screen, and repiloted. Completion time dropped to ~8 minutes of actual task work (plus ~20 min of onboarding + debrief for the real sessions). Shipped.

week 3 · day 3–5

Building attention checks that actually work

what I thought

"I know everyone in my sample personally — I don't need attention checks."

why this step mattered

Friends want to help you. That means they'll click through even when their head isn't in it. Attention checks aren't about catching cheaters here — they're about flagging tasks where someone zoned out, so I don't treat a tired click as a meaningful preference.

what I did about it

Built two lightweight checks: (1) a trap task where one option strictly dominated the other on every attribute — anyone picking the worse option there got a gentle "hey, want to revisit this one?" prompt in a follow-up email; (2) a timing filter — sub-10-second responses flagged in my analysis script. Across 12 sessions, 1 participant failed the trap, talked it through, and I kept the re-done response. Zero timing failures. Small sample means I could actually pay attention.
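The two checks can be sketched roughly like this. It is an illustration rather than my actual analysis script: the `Response` record, its field names, and the demo rows are hypothetical, and only the 10-second threshold and the trap-task idea come from the description above.

```python
from dataclasses import dataclass

@dataclass
class Response:
    participant: str
    task: str
    seconds: float          # time spent on the task
    chose_dominated: bool   # True if the strictly worse trap option was picked

def flag_responses(responses, min_seconds=10.0):
    """Return (participant, task, reason) tuples for responses worth revisiting."""
    flags = []
    for r in responses:
        if r.seconds < min_seconds:
            flags.append((r.participant, r.task, "too fast"))
        if r.chose_dominated:
            flags.append((r.participant, r.task, "failed trap"))
    return flags

# Hypothetical rows, not the study's data.
demo = [
    Response("P01", "T03", 21.4, False),
    Response("P02", "TRAP", 15.0, True),
    Response("P03", "T09", 6.2, False),
]
for flag in flag_responses(demo):
    print(flag)
```

Flags are prompts for a conversation, not automatic exclusions, which matches how the one trap failure was handled.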

week 3 · day 6–7

Running the sessions and keeping field notes

what I thought

"Send everyone the link and collect the data."

why this step mattered

When the sample is 12 people, unmoderated completion is a waste of a rare resource. If I'm already scheduling Zoom time with each person, I can confirm they understood the task and catch quirks in real time.

what I did about it

Scheduled 30-min Zoom sessions over two weeks. Each one: 5 min walkthrough of the attributes, 12 CBC tasks completed by the participant (I stayed silent), 3 holdout tasks, 10 min debrief with open questions. Kept a field journal for each session — quirks, confusion points, verbatim remarks. After all 12 sessions wrapped, I pasted the anonymized debrief notes into Claude and asked it to help me surface recurring themes and tensions across participants. It flagged a pattern I'd half-noticed — people kept mentioning the recipient's profile photo even when I hadn't asked — which shaped how I read the quantitative results later.

week 4 · day 1–3

Cleaning the data before looking at any results

what I thought

"Just load the CSV and start modeling."

why this step mattered

Looking at aggregate results first anchors your expectations. By the time I was tempted to cut respondents, I'd already be influenced by what cutting them does to the headline. Clean first, analyze second.

what I did about it

Exported the raw choice data from the survey tool into a spreadsheet. Applied my pre-registered exclusion criteria in a separate tab before computing anything else — result was 0 of 12 excluded (after the trap-task correction from step 7). Flagged straight-liners (same option every task) in a third tab — zero cases. Saved a timestamped copy of the clean dataset before running any analysis.

week 4 · day 4–7

Running the conjoint analysis

what I thought

"I have the choices. Now I just count which option wins most, right?"

why this step mattered

Counting winners tells you almost nothing — every choice involves multiple attributes at once, so you can't just say "option B won, so badges matter". Conjoint analysis decomposes the choices into per-attribute utility scores and translates those into relative importance. I'd need a tool that does this, because the math is not spreadsheet-friendly.

what I did about it

Uploaded the 12 participants' choice data into Conjointly (free academic tier). It handled the modeling — I just needed to specify my attributes, levels, and experiment design. Output: relative importance scores per attribute + individual-level utilities per participant. Then I held out one participant at a time and re-ran the analysis to see how stable the ranking was. It held up (details in step 11).

week 5 · day 1–4

The moment the data disagreed with me

what I thought

"Security badging will land around 20–25% importance. Lower than recipient identity, but meaningful."

why this step mattered

This is where bias eats a study. If my preferred finding is "badges don't matter", I'll want badges low. Whatever the effect turns out to be, larger or smaller than I expected, I have to report it honestly.

what I did about it

Security badge importance came in at 7.1%, lower than I'd expected, even against my own counter-hypothesis. I sat with that for a day. Re-checked the data for entry errors. Ran the leave-one-out check: held out each of the 12 participants in turn and re-ran the analysis. The rank ordering of attributes stayed stable across the 12 holdouts (Kendall's τ = 0.93 against the H4 threshold of 0.80), and badge importance ranged from 5.9% to 8.4%. The direction of the effect was robust, even if the exact magnitude is uncertain at this sample size.
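A minimal sketch of how a leave-one-out rank-stability check can be scored with Kendall's τ. The `full` list holds the reported group-level importances; the `swapped` list is a hypothetical leave-one-out result that I made up to show what one rank swap costs. The function names are mine, and this is the simple tau-a variant with no tie correction.

```python
from itertools import combinations

ATTRS = ["name", "history", "photo", "mutuals", "verified", "badge"]

def kendall_tau(a, b):
    """Kendall's tau-a between two score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs

def mean_jackknife_tau(full_scores, loo_scores):
    """Average tau between the full-sample scores and each leave-one-out run."""
    return sum(kendall_tau(full_scores, s) for s in loo_scores) / len(loo_scores)

full = [38.4, 21.2, 15.8, 10.9, 6.6, 7.1]      # reported importances, ATTRS order
swapped = [38.9, 21.0, 15.6, 11.0, 7.0, 6.5]   # hypothetical: verified and badge trade places

print(kendall_tau(full, full))                # 1.0
print(round(kendall_tau(full, swapped), 3))   # 0.867
```

With six attributes there are 15 pairs, so a single swap of two neighbouring attributes gives τ = 13/15 ≈ 0.87, still above the 0.80 threshold.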

week 5 · day 5 → week 6

Testing interactions and writing up what it means

what I thought

"Surely badges matter more at $400 than $15. Technical signals must kick in somewhere."

why this step mattered

An average that hides a real interaction is a misleading average. If badges only matter for large amounts, the headline changes from "badges are useless" to "badges are contextual". I had to test it before writing up.

what I did about it

Split the data by amount tier ($15, $75, $400) and re-ran the analysis for each subset. Then did the same by age bracket (20–29, 30–44, 45–59). Badge importance stayed in the single digits in every cut except the 45–59 bracket, where one fraud-scarred participant (P07) pulled a two-person average up. With n = 12 split this far, these subgroup numbers are noisy at best. The main result (identity > technical) held in every cut. Used Claude as a drafting partner for the case-study writeup: fed it my raw notes and the data, pushed back when it tried to soften the "badges don't matter" finding, kept the honest version. Led with the finding, flagged every small-sample caveat.

The Instrument

what participants actually saw ✦

✦ a real CBC task (as participants saw it)

"Imagine you need to send $75. Which of these screens would you feel more comfortable completing?"

OPTION A

9:41 5G

← Confirm payment

Marcus Taylor

@mtaylor · 5 mutual contacts

✓

4 prior transactions
Last sent Oct 2025

Amount

$75.00

Send $75.00

name: real (first + last)

photo: real portrait

mutuals: 5

history: 4 prior

verified: no

security badge: absent

OPTION B

9:41 5G

← Confirm payment

@cash_flow_22

no mutual contacts

First-time recipient
No prior transactions

Amount

$75.00

🔒 Protected by bank-level encryption

Send $75.00

name: username only

photo: placeholder

mutuals: 0

history: none

verified: no

security badge: present

Neutral "generic pay app" frame — no brand colors, no Venmo blue or Cash App green. Each participant saw 12 tasks assembled from the same component kit; only the six attributes varied.

✦ task design across the 12 sessions

Tasks rotated across three amount tiers.

Every participant saw 4 tasks at each amount ($15 / $75 / $400), drawn from the 24-card D-efficient design and randomized per-participant. Below: one representative prompt per tier, plus the full 12-task layout a participant actually saw.

LOW STAKES · $15

"Imagine you need to send $15. Which would you feel more comfortable completing?"

Coffee-run money. Testing whether trust signals matter at all when stakes are low.

MID STAKES · $75

"Imagine you need to send $75. Which would you feel more comfortable completing?"

Groceries, dinner. Roughly a typical everyday Venmo send amount.

HIGH STAKES · $400

"Imagine you need to send $400. Which would you feel more comfortable completing?"

Rent share, deposit. Testing the H3 amount × badge interaction hypothesis.

ALL 12 TASKS FROM A SINGLE PARTICIPANT'S SESSION (block A)

T01 · $15 · A: M. Taylor, 5 mutuals · B: @cf_22, 🔒 enc
T02 · $75 · A: Jordan L., ✓ 4 prior · B: @skate_4, 1 mutual
T03 · $400 · A: Priya R., 5 mutuals · B: @bass_99, 🔒 enc
T04 · $15 · A: @dev_r, 0 mutuals · B: Sam Chen, ✓ 4 prior
T05 · $75 · A: @moon_03, no mutuals · B: Ana Vasquez, ✓ 4 prior
T06 · $400 · A: K. Okafor ✓, 1 mutual · B: @night_22, no mutuals
T07 · $15 · A: Leo Patel, 🔒 enc · B: @grr_99, 1 mutual
T08 · $75 · A: Maya B. ✓, 5 mutuals · B: @tune_3, 🔒 enc
T09 · $400 · A: Rohan S., ✓ 4 prior · B: @val_11, 0 mutuals
T10 · $15 · A: @box_4, no mutuals · B: Tara N., 1 mutual
T11 · $75 · A: Noah K. ✓, 🔒 enc · B: @ink_02, 1 mutual
T12 · $400 · A: Eva T., 5 mutuals · B: @wolf_8, 🔒 enc

Block A layout shown (6 participants); Block B rotates the remaining 12 cards from the 24-card design. Every participant saw 4 tasks per amount tier, order randomized. Attribute combinations are deliberately near-orthogonal — that's what D-efficiency measures.

✦ the stimulus kit

16 component variants, one per attribute level.

Built as a small design system so every stimulus was a compositing job from identical pieces. Typography, spacing, and CTA stayed locked — only the trust-signal components swapped in.

RECIPIENT DISPLAY NAME · 3 levels

Marcus Taylor

real first + last name

M. Taylor

initial + last name

@cash_flow_22

username only

PROFILE PHOTO · 3 levels

real portrait

initial avatar

placeholder

MUTUAL CONTACTS · 3 levels

👥 5 mutual contacts

👥 1 mutual contact

no mutual contacts

SHARED HISTORY · 2 levels

✓

4 prior transactions

First-time recipient

VERIFIED CHECKMARK · 2 levels

Marcus Taylor ✓ (present)

Marcus Taylor (absent)

SECURITY BADGE · 3 levels ← the headline variable

🔒 Protected by bank-level encryption

🔒 small padlock icon only

no badge

✦ session structure

How each 30-min Zoom session ran.

Consent + warm-up

informed consent · quick chat to settle in

Demographics + P2P usage

age, primary app, frequency, recent send amount

CBC walkthrough + practice

5 min explainer of each attribute · 1 throwaway task

12 CBC tasks + 1 trap

main block · randomized · participant works silently

3 holdout tasks

held back for predictive validation

Open debrief + thanks

"walk me through your thinking" + field notes + coffee

The Results

relative attribute importance ✦

✦ relative importance scores (higher = drives more of the choice)

n = 12 · choice-based conjoint (Conjointly academic tier) · small sample — read as directional

Recipient display name

38.4%

Shared transaction history

21.2%

Profile photo

15.8%

Mutual contacts badge

10.9%

Verified checkmark

6.6%

Security badge

7.1%

In-sample hit rate: 84.0%

Log-likelihood: −58.14

McFadden R²: 0.417

Name · history p < .001 · photo p ≤ .002

✦ part-worth utility: recipient name

Full legal name: +0.92

Initial + last name: +0.11

Username only: −1.03

A full legal name adds 0.92 utility relative to the attribute mean; a bare username subtracts 1.03. That is a 1.95-point swing, the widest of any attribute and nearly five times the security badge's 0.40.

✦ part-worth utility: security badge

"Bank-level encryption" banner: +0.18

Small padlock icon: +0.04

No security badge: −0.22

The full swing from "no badge" to the explicit "bank-level encryption" banner is just 0.40 utility — about a fifth of the name swing (1.95). The only reliably non-zero piece is the small penalty for showing nothing; a bare padlock icon does nothing at all.
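For readers who want to see how swings turn into importance percentages, here is a sketch of the standard range-based calculation: each attribute's utility range divided by the sum of all ranges. It uses the mean part-worths reported in this write-up. The results land close to, but not exactly on, the headline figures; one plausible reason (my assumption, not something confirmed from the tool's export) is that the reported importances are computed per participant and then averaged, which generally gives slightly different numbers than working from mean utilities.

```python
# Mean part-worth utilities per attribute level, as reported in this write-up.
PART_WORTHS = {
    "name":     [0.92, 0.11, -1.03],
    "history":  [0.58, -0.58],
    "photo":    [0.41, 0.02, -0.43],
    "mutuals":  [0.29, 0.04, -0.33],
    "verified": [0.16, -0.16],
    "badge":    [0.18, 0.04, -0.22],
}

def relative_importance(part_worths):
    """Importance (%) = attribute utility range / sum of all ranges."""
    ranges = {attr: max(u) - min(u) for attr, u in part_worths.items()}
    total = sum(ranges.values())
    return {attr: 100 * r / total for attr, r in ranges.items()}

importance = relative_importance(PART_WORTHS)
for attr, pct in importance.items():
    print(f"{attr:9s} {pct:5.1f}%")
```

Computed this way, name comes out near 37% and the badge near 8%, so the ranking and the rough sizes match the reported scores.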

✦ interaction tests (H3 + exploratory)

Badge × transaction amount

Badge importance at $15: 6.8% · at $75: 7.0% · at $400: 7.5%. Interaction effect p = 0.18. Not significant, so H₀ is not rejected and H3 is not supported.

Badge × age group

20–29: 5.3% · 30–44: 6.3% · 45–59: 14.5%. Interaction p = 0.14. Not significant — the 45–59 figure rides entirely on P07 (n = 2), so read it as exploratory, not pre-registered.

✦ full part-worth utilities with 95% CIs

Every attribute level, every confidence interval.

Hierarchical Bayes (HB) estimation via Conjointly, 10,000 MCMC iterations, 2,000 burn-in. Utilities are zero-centered within each attribute. CIs are 95% posterior intervals. With n=12 they are wide — and that's the point: the case is in the ranking, not the precision.

Attribute · Level | Utility (β) | SE | abs. z | 95% CI | p
Display name · full legal (Marcus Taylor) | +0.92 | 0.19 | 4.84 | [+0.55, +1.29] | <.001
Display name · initial + last (M. Taylor) | +0.11 | 0.17 | 0.65 | [−0.22, +0.44] | .516
Display name · username only (@cash_flow_22) | −1.03 | 0.21 | 4.90 | [−1.44, −0.62] | <.001
Shared history · 4 prior transactions | +0.58 | 0.14 | 4.14 | [+0.31, +0.85] | <.001
Shared history · first-time recipient | −0.58 | 0.14 | 4.14 | [−0.85, −0.31] | <.001
Profile photo · real portrait | +0.41 | 0.13 | 3.15 | [+0.16, +0.66] | .002
Profile photo · monogram | +0.02 | 0.12 | 0.17 | [−0.22, +0.26] | .867
Profile photo · silhouette (default) | −0.43 | 0.13 | 3.31 | [−0.68, −0.18] | .001
Mutuals · 5 shown | +0.29 | 0.12 | 2.42 | [+0.06, +0.52] | .016
Mutuals · 1 shown | +0.04 | 0.11 | 0.36 | [−0.18, +0.26] | .720
Mutuals · none | −0.33 | 0.12 | 2.75 | [−0.57, −0.09] | .006
Verified checkmark · present | +0.16 | 0.09 | 1.78 | [−0.02, +0.34] | .075
Verified checkmark · absent | −0.16 | 0.09 | 1.78 | [−0.34, +0.02] | .075
Security badge · "bank-level encryption" | +0.18 | 0.10 | 1.80 | [−0.02, +0.38] | .072
Security badge · small padlock icon | +0.04 | 0.10 | 0.40 | [−0.16, +0.24] | .689
Security badge · none | −0.22 | 0.10 | 2.20 | [−0.42, −0.02] | .028

Legend: significant at p < .05 · trending at .05 ≤ p < .10 · otherwise n.s.

HB posterior estimates · zero-centered within attribute · Conjointly report export, Nov 2025
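To show what these utilities imply for a single task, here is a sketch that scores the two example screens (Option A and Option B from the Instrument section) with the mean part-worths from the table and converts the difference into a choice probability using the standard binary logit rule. This is a simplification under my own assumptions: it uses group means rather than each participant's individual utilities, and the dictionary keys are shorthand labels I chose.

```python
import math

# Mean part-worths from the table above; level labels are shorthand.
UTILS = {
    "name":     {"full": 0.92, "initial": 0.11, "username": -1.03},
    "history":  {"4 prior": 0.58, "first-time": -0.58},
    "photo":    {"portrait": 0.41, "monogram": 0.02, "silhouette": -0.43},
    "mutuals":  {"5": 0.29, "1": 0.04, "none": -0.33},
    "verified": {"present": 0.16, "absent": -0.16},
    "badge":    {"banner": 0.18, "padlock": 0.04, "none": -0.22},
}

def total_utility(profile):
    """Sum the part-worths of the levels shown on one screen."""
    return sum(UTILS[attr][level] for attr, level in profile.items())

def prob_choose_a(a, b):
    """Binary logit: P(A) = 1 / (1 + exp(-(U_A - U_B)))."""
    return 1.0 / (1.0 + math.exp(-(total_utility(a) - total_utility(b))))

option_a = {"name": "full", "history": "4 prior", "photo": "portrait",
            "mutuals": "5", "verified": "absent", "badge": "none"}
option_b = {"name": "username", "history": "first-time", "photo": "silhouette",
            "mutuals": "none", "verified": "absent", "badge": "banner"}

print(round(total_utility(option_a), 2))            # 1.82
print(round(total_utility(option_b), 2))            # -2.35
print(round(prob_choose_a(option_a, option_b), 3))  # 0.985
```

Under these assumptions, the identity-rich screen with no badge is predicted to win about 98% of the time against the anonymous screen with the encryption banner.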

✦ model fit

The model learned real signal.

Log-likelihood

vs. null: −99.81

−58.14

McFadden's pseudo-R²

0.2–0.4 = excellent fit; higher is stronger

0.417

Likelihood-ratio χ²

df = 10

83.35

Overall p-value

vs. random-choice null

< .001

Root Likelihood (RLH)

null baseline: 0.500

0.668
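The fit statistics above can be re-derived from just the model log-likelihood and the number of choices. A short sketch, assuming two options per task (so a random-choice null assigns probability 0.5 to every choice) and counting parameters as levels minus one per attribute:

```python
import math

N_OBS = 144           # 12 participants x 12 tasks
LL_MODEL = -58.14     # reported model log-likelihood
LEVELS = [3, 3, 2, 3, 2, 3]   # levels per trust attribute

ll_null = N_OBS * math.log(0.5)          # random choice between 2 options
mcfadden_r2 = 1 - LL_MODEL / ll_null
lr_chi2 = 2 * (LL_MODEL - ll_null)
rlh = math.exp(LL_MODEL / N_OBS)         # geometric mean of predicted likelihoods
df = sum(k - 1 for k in LEVELS)          # free parameters

print(round(ll_null, 2))      # -99.81
print(round(mcfadden_r2, 4))  # close to the reported 0.417
print(round(lr_chi2, 2))      # 83.35
print(round(rlh, 3))          # 0.668
print(df)                     # 10
```

Small differences in the last digit are expected, because the inputs here are already rounded to two decimals.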

✦ holdout validation

Predictions held up on unseen tasks.

Holdout tasks

3 per participant · 36 total

Predicted correctly

chance baseline = 18/36

29 / 36

Hit rate

binomial p < .001 vs. 50%

80.6%

Internal consistency

trap task pass rate (after one re-done response)

12 / 12

Jackknife rank stability

Kendall's τ, H4 threshold 0.80

τ = 0.93
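The binomial comparison for the holdout hit rate can be reproduced with an exact one-sided tail sum. This sketch treats the 36 holdout predictions as independent coin flips under the null, which is an approximation, since each participant contributes three of them.

```python
from math import comb

def binom_upper_tail(k, n, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

hits, total = 29, 36
hit_rate = hits / total
p_value = binom_upper_tail(hits, total)

print(f"hit rate = {hit_rate:.1%}")     # 80.6%
print(f"one-sided p = {p_value:.5f}")   # well below .001
```

The function name is my own; only `math.comb` from the standard library is used.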

✦ participant-level importance heatmap

Every participant weighed attributes their own way.

HB gives an individual-level utility vector per participant. Rows = participants (P01–P12), columns = the six attributes, cell intensity = relative importance for that person. Notice: the "security badge" column is the dimmest across almost every row — this finding isn't being dragged by one or two outliers. The heterogeneity is real (look at P07's row — the one outlier who weighted the badge far higher than anyone else, at 24%, second only to name), but the modal pattern dominates.

[Heatmap: rows P01–P12 (P07 highlighted), columns NAME · HISTORY · PHOTO · MUTUALS · VERIFIED · BADGE]

Row | NAME | HISTORY | PHOTO | MUTUALS | VERIFIED | BADGE
MEAN | 38.4 | 21.2 | 15.8 | 10.9 | 6.6 | 7.1
SD | 4.6 | 2.3 | 1.3 | 1.2 | 0.8 | 5.1

Values are participant-level relative importance (%) — values sum to 100 per row. P07 is the one outlier who weighted the badge far higher than anyone else (24%, second only to their own 28% on name) — a 45-year-old who explicitly mentioned a recent fraud scare in debrief. Even including P07, the group-level rank ordering holds. The badge's high SD (5.1) reflects this one person, not widespread variance.

Insights → Design

numbers into opinions with teeth ✦

INSIGHT 01

Trust is social infrastructure, not technical chrome.

Name, photo, history and mutuals together = 86.3% of the decision. Technical reassurance = 13.7%. Stop designing encryption PR pages before the product solves the "is this the right person" problem.

INSIGHT 02

Legal names are worth building for, even culturally.

Venmo's username culture optimizes for delight. But the data says a full legal name moves trust more than the security badge and verified tick combined (38.4% vs. 13.7%). There's a design challenge here: surface legal identity without killing the playful handle.

INSIGHT 03

"Shared history" is a mostly-free trust boost.

21% importance. Showing "you've sent $X to this person N times before" right above the confirm button is a low-lift, high-yield change, and none of the four apps I reviewed makes this prominent in the confirm flow.

INSIGHT 04

Mutuals matter more than verified ticks.

10.9% vs 6.6%. "5 mutual friends" is a stronger trust signal than a platform-verified checkmark. Social proof beats institutional proof for transaction-level trust.

INSIGHT 05

Stakes don't change the calculus.

I expected users to shift toward technical signals at higher amounts. They didn't. At $400 they wanted more identity confirmation, not more cryptography. This has implications for step-up friction design.

THE BIG RECOMMENDATION

Move "is this the right person?" UI above the fold on every confirm screen.

Legal name, profile photo, shared-history pill, mutuals. That's the trust stack. The padlock stays, but it earns its keep at 7%.

Limits & Reflections

what the study can't tell you ✦

Stated preference, not real money.

Conjoint asks what people would do. Nobody actually sent $400 in my study. A follow-up with live-fire behavioral data (click-through on real confirm screens) could either validate or deflate these utilities.

n = 12 is small. Treat everything as directional.

The biggest honest caveat. Orme's rule for CBC wants ~63 respondents for aggregate estimation at this design's size. I'm at 12. Point estimates have wide credible intervals; the rank ordering of attribute importance was stable under leave-one-out jackknifes, but treat specific percentages as rough. A follow-up on a paid panel (Prolific, ~$600 for n ≈ 400) is the natural next step.

Convenience sample, not representative.

My recruits are people in my network — skewing young, urban, college-educated. The 45–59 bracket (n = 2) is statistically meaningless. Nothing here can be generalized to older or less digitally-fluent users. That matters a lot for a trust question, where older users may weight technical signals very differently.

Badges were visible, but neutral.

My stimuli used generic "bank-level encryption" wording. A branded trust seal (Norton, McAfee, FDIC) might perform differently. Worth a follow-up to isolate wording effects from badge-presence effects.

Trust ≠ behavior.

"Feel comfortable completing" is a proxy for behavioral intent. Some users might feel cautious but still send the money because they have to. This study measures felt trust, not revealed conversion.

✦ what I'd study next

Now I want to see if the same model holds for sending money to strangers.

Everything here assumed the recipient was someone the sender recognized. The interesting next question is marketplace payments — Facebook Marketplace, Craigslist, OfferUp. When the recipient is a stranger, does encryption badging finally earn its weight? Or does social context (mutuals, platform reputation) stay dominant?

I'd run the same conjoint with a marketplace framing, same 7 attributes, and test whether the importance ranking scrambles. If it does — interesting. If it doesn't — even more interesting.


made with care · 2026
