Anatomy of Mercor's Data Breach

A technical analysis of a complete operational data loss (production database, user and customer data)


Disclaimer: All personally identifiable information (PII) in this document has been obfuscated. Names are partially masked (e.g., T** O****), emails redacted (e.g., e****a1@gmail.com), phone numbers truncated (e.g., +4479571****), bank details masked (e.g., 000**-***), financial identifiers hidden (e.g., acct_1Rc*****), IP addresses truncated (e.g., 71.194.*.*), and MAC addresses partially redacted (e.g., 1C:93:7C:**:**:**). This analysis is conducted for educational and security research purposes.

Note on source material: This entire analysis is based on two small sample files made publicly available by Lapsus$ — a database schema sample and a database export containing table structures with example rows, plus partial Airtable workspace exports. These files were shared after Mercor allegedly paid a ransom to have the data removed from the group's leak site — a fact confirmed to us directly by Lapsus$. Despite receiving payment, the group continues to share samples and is actively engaged in selling the full dataset to private bidders. Together these two files represent a fraction of a percent of the claimed 211GB production database. We did not have access to the full database, the 939GB of source code, the 3TB of cloud storage, the Slack exports, or the Tailscale VPN data. Everything documented in this report — every bank routing number, every Apple Foundation Model output, every Persona KYC session token, every desktop screenshot URL — was found in these two small files alone. The full breach is orders of magnitude larger. What follows is the tip of the iceberg.


Table of Contents

  1. Executive Summary
  2. Why This Breach Is Serious
  3. Platform Overview
  4. Evidence - The Database Layer by Layer
  5. Reverse Engineering - Architecture and Infrastructure
  6. Exposed Surface Area Summary
  7. Technical Architecture Reverse-Engineered
  8. Grounds for Legal Action
  9. Conclusion - What Happens Now
  10. Appendix A - Complete Table Inventory

Executive Summary

This document presents a systematic technical analysis of a small sample from a database export from Mercor, an AI-powered talent marketplace that connects software engineers, AI data labelers, and knowledge workers with companies seeking contract labor. As reported by the Wall Street Journal, Mercor has rapidly become one of the key intermediaries in the AI industry — placing contractors inside organizations like Meta, OpenAI, Google DeepMind, Anthropic, Apple, and Amazon to perform AI training, data labeling, software engineering, and other knowledge work.

We analyzed two small sample files shared by Lapsus$ after Mercor allegedly paid a ransom to have the breach data removed. Despite that payment, the group continues to distribute samples and is actively selling the full dataset to private bidders. Together these files represent a tiny sliver of the claimed 211GB production database. Yet even these small samples contain over 250 table schemas with sample data rows exported from Mercor's Aurora MySQL production environment, plus Airtable workspace exports containing actual AI training data and model evaluation records. The samples cover every operational dimension of the platform, from contractor signup through identity verification, AI-conducted interviews, job placement, real-time work surveillance, and payment disbursement.

If these samples, containing just one or two rows per table, already expose full bank routing numbers, government ID verification tokens, desktop screenshot URLs, signed legal documents, and proprietary AI model outputs from Apple and Amazon, then the full 211GB database presumably contains the same categories of data for every contractor and every transaction Mercor has ever processed.

Scope of This Article and the Full Scale of the Breach

Important: This article analyzes only two small sample files from the production database, shared by Lapsus$ after Mercor allegedly paid a ransom. The full production database is 211GB, which is itself a fraction of the claimed 4-terabyte breach. Every finding documented below was derived from these small samples alone. The full database would contain the complete records for every contractor, every transaction, every screenshot, and every payment Mercor has ever processed.

The Breach at a Glance

Mercor's official account attributes the breach to a supply-chain attack on the open-source Python package LiteLLM — a widely used AI proxy library estimated to be present in 36% of cloud environments. On March 27, 2026, using a maintainer's compromised credentials, the TeamPCP hacking group published two malicious PyPI package versions (1.82.7 and 1.82.8) that were available for download for approximately 40 minutes. The reported attack chain: the poisoned dependency landed in Mercor's development environment, swept the machine for SSH keys, AWS tokens, Kubernetes secrets, and .env files, deployed privileged containers across Mercor's Kubernetes clusters, and used the stolen credentials to begin exfiltrating data through Mercor's Tailscale VPN.

However, there are reasons to question whether LiteLLM was the sole or even primary attack vector. Exfiltrating 4 terabytes of data — production databases, 939GB of source code repositories, 3TB of cloud storage including video recordings and screenshots, plus Slack, Airtable, and Tailscale exports — is not a fast operation. At typical egress speeds, this would have taken days to weeks of sustained data transfer. A 40-minute window of malicious package availability seems insufficient to establish the deep, persistent access required to systematically exfiltrate this volume of data across this many distinct systems (Aurora MySQL, S3 buckets, GitHub repositories, Airtable, Slack, Tailscale). It is entirely possible that Mercor was already compromised through other means — whether through prior credential exposure, an insider threat, or a separate vulnerability — and that the LiteLLM incident was coincidental or merely one of multiple entry points. Mercor's characterization of itself as "one of thousands of companies" affected by LiteLLM may be an attempt to deflect from deeper, more embarrassing security failures.
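The transfer-time argument above can be checked with simple arithmetic. The sketch below assumes decimal terabytes and a handful of hypothetical sustained throughput figures; the attackers' actual egress rate is not known, so the results only bound the estimate rather than reconstruct the real timeline.

```python
def transfer_days(size_tb: float, throughput_mbps: float) -> float:
    """Days needed to move size_tb (decimal terabytes) at a sustained rate in Mbit/s."""
    size_bits = size_tb * 1e12 * 8                  # terabytes -> bits
    seconds = size_bits / (throughput_mbps * 1e6)   # Mbit/s -> bit/s
    return seconds / 86_400                         # seconds -> days


# Hypothetical sustained rates; none of these figures come from the breach data.
for mbps in (10, 50, 100, 1000):
    print(f"{mbps:>5} Mbit/s -> {transfer_days(4, mbps):6.1f} days")
```

Under these assumptions, 4 TB takes roughly 37 days at 10 Mbit/s, 3.7 days at 100 Mbit/s, and about 9 hours at a sustained 1 Gbit/s. The "days to weeks" estimate therefore corresponds to sustained rates of around 100 Mbit/s or less; at any of these rates the transfer runs far longer than the 40-minute window in which the malicious packages were available.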

The Lapsus$ group subsequently claimed responsibility for the breach, posting samples of the allegedly stolen data. Lapsus$ confirmed to us directly that ransom negotiations with Mercor took place and that Mercor paid. Despite that payment, the group continues to distribute samples and is actively selling the full dataset to private bidders.

Mercor confirmed the security incident but characterized itself as "one of thousands of companies" affected by the LiteLLM compromise. The company declined to answer whether any customer or contractor data had been accessed, exfiltrated, or misused.

Security researcher Archie Sengupta noted it was a "very big breach." Y Combinator president Garry Tan was more direct: "Incredible amount of SOTA training data now just available to China thanks to @mercor_ai leak. Every major lab. Billions and billions of value and a major national security issue."

What Was Taken - The Full 4TB

The attackers claim to have exfiltrated the following assets. This article only analyzes the first item — the production database. The remaining categories are not covered in this analysis but are described here to convey the full scale of exposure.

Asset | Size | Contents
Production Database | 211 GB | The subject of this article. 250+ Aurora MySQL tables containing candidate profiles (resumes, work history, skills, education), PII (names, emails, phones, addresses, dates of birth, possibly SSNs and government ID documents), interview recordings/transcripts and AI assessment scores, employer/client data (companies, contracts, pricing), and internal user accounts and credentials.
Source Code | 939 GB | The complete contents of Mercor's GitHub organization, including the mercor-monorepo and all associated repositories. This exposes proprietary AI/ML models for candidate matching and evaluation, the full platform backend and frontend code, API keys, secrets, and internal service credentials embedded in repositories, and all infrastructure-as-code (Terraform/Terragrunt deployment configs, CI/CD pipelines, cloud architecture).
Cloud Storage Buckets | ~3 TB | The actual files referenced by the S3 URLs found in the database, organized into three categories. Video: AI interview recordings of candidates (the ai-interviewer-recordings and dailyco-recordings S3 buckets), containing face and voice biometric data. GCF-Source: Google Cloud Function source code, representing additional serverless application logic beyond the main repositories. FME Review & Verification: identity verification documents including passports, driver's licenses, and facial recognition/biometric data used in the Persona KYC flow (the mercor-background-check-photos, certn-api-s3-certn-images, certn-api-s3-one-id-images, and certn-api-s3-certn-rcmp-documents buckets). Also included: every Insightful desktop screenshot ever captured from contractor machines (the mercor-insightful-screenshots-production bucket), and signed legal documents (offer letters, CIIAs, NDAs).
Tailscale VPN Data | Included | Internal network topology and routing configurations, device certificates and authentication keys, access paths to internal services, dashboards, and admin tools. This is effectively a map of Mercor's internal network.
Slack Export | Included | A full export of Mercor's enterprise Slack workspace (mercor.enterprise.slack.com) and potentially client-specific workspaces like project-mega.slack.com and glowstone-mli-rubrics.slack.com. Slack exports include every message, file upload, DM, and channel history: candid internal discussions, client communications, incident response threads, and operational decisions.
Airtable Export | Included | Complete exports of all Airtable workspaces used for annotation and project management (6+ distinct workspace IDs found in the database). This exposes task definitions, contractor submissions, quality review data, and client project configurations, which is effectively the work product of Mercor's annotation pipeline.
Google Workspace | Unknown | It is unclear whether the attackers obtained a full export of Mercor's Google Workspace. Even the small sample analyzed here contains 30+ Google Doc URLs, 10+ shared Drive folder URLs, Google Sheets, and Google Forms. The full database would contain vastly more. If the Workspace was also exfiltrated, it would include all internal documents, email (Gmail), calendar entries, and shared drives.

Why This Matters Beyond Mercor

The database analyzed in this report is merely the index: the structured metadata that describes, catalogs, and points to the stolen assets. Think of it as the card catalog for an entire stolen library.

As Garry Tan noted, the AI training data alone — the prompts, responses, evaluations, and RLHF annotations produced by Mercor's contractors for organizations like OpenAI, Meta, and Google DeepMind — represents potentially billions of dollars in value. If this data reaches competitors — whether domestic rivals or labs in other countries — it would allow them to shortcut years of investment. The source code for Mercor's proprietary ranking algorithms (MercorScore, the Bradley-Terry tournament system, the Bayesian fraud model) adds further competitive intelligence value.

Together, this represents one of the most comprehensive corporate breaches in recent memory: not a single database table or a handful of credentials, but the complete digital footprint — code, data, communications, files, network maps, and work product — of an organization entrusted with some of the most sensitive work in the AI industry.


Why This Breach Is Serious

Why AI Training Data Is Worth Billions

To understand why this breach is significant and not just another corporate data leak, it helps to understand what AI training data is and why companies like OpenAI, Anthropic, Apple, Amazon, Meta, and Google pay enormous sums to produce it.

Modern AI models like GPT-4, Claude, and Gemini are not programmed — they are trained. The raw intelligence comes from pre-training on internet text, but the ability to follow instructions, reason carefully, and refuse harmful requests comes from a second phase that depends entirely on human-generated data. This is the data Mercor's contractors produce. It falls into several categories, all of which are present in the breach:

Supervised Fine-Tuning (SFT) data — Humans write high-quality responses to prompts, demonstrating how the model should behave. The TASKS and TASK_VERSIONS tables across Mercor's 84 Airtable workspaces contain these prompt-response pairs, organized by domain (legal, medicine, finance, coding, etc.). A single SFT dataset covering a specialized domain can cost millions of dollars to produce because it requires experts — lawyers, doctors, engineers — writing at $95/hour for months.

Reinforcement Learning (RL) preference data — Humans compare two model outputs and judge which is better. This is the core of RLHF (Reinforcement Learning from Human Feedback), the technique that transformed GPT-3 into ChatGPT. The API_PREFERENCE workspaces, PHASE_1_TASKS (Amazon), and the GPT-4 vs Claude Evaluation project all contain this data — complete with the prompts, both model responses, and the human preference judgment. This data teaches models what humans actually want, which is the hardest and most expensive part of AI development.

RL rubrics and evaluation criteria — Before humans can judge model outputs, someone must define what good looks like. The CRITERIA, RUBRIC_VERSIONS, QA_SPECS, and LLM_CALL_CONFIGURATION tables across 60+ Airtable workspaces contain these rubrics. They encode the evaluation methodology itself — the scoring frameworks, the edge cases, the quality thresholds. This is proprietary intellectual property that defines how each AI lab measures progress. A competitor with access to these rubrics doesn't just get the training data — they get the recipe.

RL environments and Chain-of-Thought data — The AMAZON_LLM_COT_EVALUATION workspace contains full Chain-of-Thought traces — the step-by-step reasoning that models produce before giving a final answer. The ACADEMIC_REASONING_SFT workspace contains a COT table explicitly for reasoning supervision. The Panacea — Consulting RL Envs project built reinforcement learning environments. This data teaches models how to think, not just what to say.

Benchmark evaluation data — The ATHENA_HLE workspaces (likely Humanity's Last Exam) and AIME_RUBRICS (AIME math competition) contain evaluation data for some of the most important AI benchmarks. The MODEL_RESPONSES and AWAITING_REVIEW_METRICS tables contain graded model outputs against these benchmarks. If this data is used to train future models, it contaminates the benchmarks — the models will appear to perform better than they actually do, undermining the entire AI evaluation ecosystem.

Pre-release model outputs — The APPLE_ENDPOINT_SANDBOX workspace contains actual outputs from Apple's unreleased Foundation Models (afm-text-083, afm-model-086). These responses reveal the model's capabilities, limitations, safety alignment, and failure modes before Apple has publicly launched them. For a competitor, this is the equivalent of obtaining a rival's product prototype.

Why this data is so expensive to reproduce:

Each data point requires a skilled human — often a domain expert — spending minutes to hours crafting, evaluating, or comparing model outputs. At Mercor's reported average rate of $95/hour across 30,000+ contractors, the annual cost of data production runs into hundreds of millions of dollars. OpenAI, Anthropic, and the other labs have each spent years and billions of dollars building these datasets incrementally, refining their rubrics, and developing their evaluation methodologies.
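The "hundreds of millions of dollars" figure follows from the two reported numbers once an average workload is assumed. The sketch below takes the $95/hour rate and the 30,000 contractor count from the text; the hours-per-contractor values are hypothetical, since the sample does not state how many hours an average contractor bills per year.

```python
HOURLY_RATE_USD = 95      # reported average rate (from the text)
CONTRACTORS = 30_000      # reported contractor count (from the text)


def annual_cost_usd(avg_hours_per_contractor: float) -> float:
    """Total yearly spend if every contractor bills the given average number of hours."""
    return HOURLY_RATE_USD * CONTRACTORS * avg_hours_per_contractor


# Hypothetical average yearly hours per contractor; not taken from the breach data.
for hours in (50, 100, 200):
    print(f"{hours:>3} h/yr per contractor -> ${annual_cost_usd(hours) / 1e6:,.1f}M")
```

Even at a modest assumed average of 100 billed hours per contractor per year, the total comes to $285 million, so the order of magnitude holds without assuming anything close to full-time work.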

The breach doesn't just expose data. It exposes the methodology — the rubrics, the evaluation criteria, the domain taxonomies, the quality control processes, and the scoring frameworks that each lab has spent years developing. Any competitor with access to this material — domestic or foreign — could replicate years of alignment research in months, at a fraction of the cost, by simply adopting the proven evaluation frameworks and training on the stolen preference data.

This is why Garry Tan called it "billions and billions of value." The data in these Airtable workspaces is not supplementary. It is the core competitive advantage of the AI labs that produced it — and it is now for sale.

The Extent - What Data Was Exposed

The breadth of personally identifiable information (PII) in this breach is staggering. The following inventory documents every category of sensitive data present in the database dump, with specific column names, source tables, and — where available — the format of the exposed data as observed in sample records. This inventory is intended to serve as a factual reference for affected individuals, regulators, and legal counsel.

1. Personal Identity Information

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
Full legal name | name, first_name, last_name | MercorUsers_New, MercorUserFinancials (embedded in Stripe JSON) | T** O****, H****i A****a (full plaintext names)
Personal email address | email | MercorUsers_New, Candidates, LinkedinWarmIntros, UserReferences, MLExperimentsJobPerformanceReviews | e****a1@gmail.com, a*****y@gmail.com, a*****s@gmail.com (full plaintext)
Phone number with country code | phone | MercorUsers_New | +4479571**** (full international format)
Date of birth | birthday | UserMetadata, Candidates, WorkAuthorization_Audit | Date field: exact DOB for each contractor
Physical home address | physicalLocation, residenceCity, residenceState, residenceZipCode | UserMetadata, UserLocation, Candidates | City, state, zip code, and country of residence
Profile photograph | profilePic | MercorUsers_New | URL to stored profile image
Country of residence | residenceCountry, countryOfResidence | UserLocation, UserMetadata, Candidates | USA, United Kingdom
LinkedIn profile URL | linkedinUrl, url | Candidates, LinkedinWarmIntros, LinkedinUsers | https://www.linkedin.com/in/s**-s**-s******-d***** (full URL with real name)

2. Government Identity Documents and Biometrics

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
Government ID verification outcome | governmentIdStatus | IDVerificationChecks | not_applicable, passed, failed
Liveness detection result | livenessStatus | IDVerificationChecks | Binary pass/fail; confirms a live facial scan was performed
Facial comparison thumbnail | thumbnail_key (in providerResponse JSON) | IDVerificationChecks | intr_AAAB*****_thumbnail.jpg (a stored facial image key)
Persona KYC session token | sessionId, sessionToken | IDVerificationChecks | face_baseline_intr_AAAB***** (replayable session ID)
Persona account identifier | persona_account_id (in providerResponse JSON) | IDVerificationChecks | act_QMTu*****
Address verification status | addressStatus | IDVerificationChecks | Confirms whether home address was verified against government records
Verification attempt count | attemptNumber, maxAttempts | IDVerificationChecks | Tracks repeated identity verification attempts

Note: The cloud storage buckets (mercor-background-check-photos, certn-api-s3-one-id-images, certn-api-s3-certn-rcmp-documents) reportedly contain the actual document images — passports, driver's licenses, and RCMP criminal record documents — referenced by these database records.

3. Financial and Banking Data

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
Bank name | bank_name (in accountDetails JSON) | MercorUserFinancials | BANK OF M******* (plaintext)
Bank routing number | routing_number (in accountDetails JSON) | MercorUserFinancials | 000**-*** (full routing number in plaintext)
Bank account last 4 digits | last4 (in accountDetails JSON) | MercorUserFinancials | 07**
Bank account holder name | account_holder_name (in accountDetails JSON) | MercorUserFinancials | H****i A****a (full legal name on bank account)
Stripe Express account ID | providerMethodId, stripeAccountId | UserPaymentMethods, MercorUsers_New | acct_1Rc*****
Full Stripe account JSON | accountDetails | MercorUserFinancials | Complete Stripe API response including all fields above plus charges_enabled, payouts_enabled, default_currency, TOS acceptance timestamp, and external account details
Wise transfer & quote IDs | wiseTransferId, wiseQuoteId | WiseDisbursements | Transfer identifiers for international payments
Payment amounts | totalPayableAmount, totalBillableAmount, totalAmount | PaymentLineItems, MoneyOut_Audit, WiseDisbursements | Amounts in cents (e.g., 250000 = $2,500.00)
Pay rates | payableRate, billableRate | Jobs, Jobs_Audit | Exact hourly/monthly compensation: both what the contractor earns and what the client pays
Tax form status | tax_form | Jobs | Tax filing status per contractor
Stripe subscription ID | stripeSubscriptionId | Jobs | Billing subscription identifier
Payout schedule and currency | schedule.interval, default_currency (in JSON) | MercorUserFinancials | daily payout with 7 day delay, currency cad
Payment failure reasons | dispatchFailureReason, failureReason | PaymentLineItems, MoneyOut_Audit, WiseDisbursements | Structured failure codes revealing payment issues

The MercorUserFinancials.accountDetails field is particularly egregious — it stores the complete Stripe Connect API response as a JSON blob, which includes the contractor's full legal name, personal email, bank name, routing number, last four digits of the account, account holder name, country, currency, and TOS acceptance details. This is not a reference or a token — it is the raw financial identity of each contractor stored in a single database column.
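To make the contrast between "a reference or a token" and a raw blob concrete, here is a minimal sketch of the kind of reduction a platform could apply before persisting a payment-provider response. It is illustrative only: the helper name, the allow-list, and the placeholder payload are all assumptions, and nothing here describes how Mercor's code actually works. Only the field names charges_enabled, payouts_enabled, and default_currency are taken from the sample described above.

```python
import json

# Hypothetical allow-list: an opaque provider reference plus non-identifying status flags.
ALLOWED_FIELDS = ("id", "charges_enabled", "payouts_enabled", "default_currency")


def minimize_account_details(raw_json: str) -> dict:
    """Reduce a provider account JSON blob to a reference and status flags."""
    account = json.loads(raw_json)
    return {key: account[key] for key in ALLOWED_FIELDS if key in account}


# Entirely fabricated placeholder payload; the field layout is an assumption.
sample = json.dumps({
    "id": "acct_EXAMPLE",
    "charges_enabled": True,
    "payouts_enabled": True,
    "default_currency": "cad",
    "email": "person@example.com",
    "external_accounts": {
        "data": [{"bank_name": "EXAMPLE BANK", "routing_number": "REDACTED", "last4": "0000"}]
    },
})

print(minimize_account_details(sample))
```

With a reduction like this, the stored row still lets the platform look up the account at the provider, but a database dump no longer carries the name, email, or bank details itself.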

4. Employment and Performance Records

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
Employment contract terms | payableRate, billableRate, commitment, expected_hours, startDate, expiresAt | Jobs, Jobs_Audit | Full contract terms including pay rate, hours, and duration
Signed offer letters | offerLetter | Jobs, WorkTrial_Audit | S3 key or base64-encoded signed legal document
Digital signatures | signature | Jobs, WorkTrial_Audit, WorkAuthorization_Audit | Contractor's digital signature on legal agreements
CIIA/NDA agreements | ciiaa_direct, ciiaaPassthrough | Jobs, WorkTrial_Audit | Confidentiality and IP assignment agreements
Terms of work | tow | Jobs, WorkTrial_Audit | Full terms of engagement
Safety waiver | safety_waiver | Jobs | Safety waiver acceptance
Dismissal date and reason | dismissalDate, dismissalReason, dismissalFlag | Jobs, JobPerformanceReviews_New | Date of termination and categorized reason
Offboarding reason | Offboarding Reason | MLExperimentsJobPerformanceReviews | Plaintext offboarding justification
Performance scores | score, Quality of Work, Engagement, performanceScore | JobPerformanceReviews_New, MLExperimentsJobPerformanceReviews, ContractorPerformance_New | Numeric ratings with text justifications
Performance review text | reviewNotes, Justification for rating, performanceSummary, jobPerformanceSummary | JobPerformanceReviews_New, MLExperimentsJobPerformanceReviews | Free-text evaluations of individual contractors
Reviewer identity | reviewedBy, Reviewer | JobPerformanceReviews_New, MLExperimentsJobPerformanceReviews | Named Mercor staff who wrote the review (e.g., A*** K*****)
Client project name | Account, Project, projectName | MLExperimentsJobPerformanceReviews, JobPerformanceReviews_New | OpenAI, Apertus - Elephant (links contractor performance to a specific client)

The MLExperimentsJobPerformanceReviews table is especially damaging: it contains the contractor's full name, email, client company name (e.g., OpenAI), project name, reviewer's name, quality score, engagement score, offboarding reason, and a free-text justification — all in a single row. Sample: A***** D****, a*****s@gmail.com, OpenAI, Apertus - Elephant, reviewed by A*** K*****, rated 4 - Redefines Expectations.

5. Criminal Background and Adverse Media Checks

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
Criminal background check status | status | BackgroundCheck, BackgroundCheck_New | clear / consider (whether criminal history was flagged)
Adverse media check status | adverseMediaCheckStatus | BackgroundCheck | Whether negative news/media was found about the individual
Background check package | package | BackgroundCheck | e.g., tasker_pro (defines which checks were run)
RCMP criminal record documents | Referenced via S3 bucket | certn-api-s3-certn-rcmp-documents-ca-central-1-production | Royal Canadian Mounted Police criminal record check documents
External candidate ID at Checkr/Certn | externalCandidateId, backgroundCheckId, reportId | BackgroundCheck | Cross-references to external background check providers
Work location for check | workLocation | BackgroundCheck | Country/jurisdiction of background check

6. Work Authorization and Immigration Status

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
Work authorization status | workAuthorizationStatus | UserMetadata, Candidates, WorkAuthorization_Audit | Whether individual is authorized to work in a given country
Physical country vs. residence country | physicalCountry vs. residenceCountry | UserLocation, WorkAuthorization_Audit | Mismatch between these fields is flagged as fraud, revealing who may be working from an unauthorized location
Location attestation with signature | agreedToLocation, signature, attestedAt | WorkAuthorization_Audit | Signed attestation of physical work location

Work authorization status can reveal citizenship or immigration status, which several US state privacy laws (for example, Virginia's and Colorado's) treat as sensitive data. Under GDPR it is personal data, although it is not itself a listed Article 9 special category. Its exposure, combined with physical location data and location mismatch fraud flags, could be used to identify individuals working from countries where they lack authorization, creating potential immigration enforcement risk.

7. Device Fingerprints, Network Identifiers and Surveillance Data

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
IP address | ip | InsightfulScreenshots | 71.194.*.* (full IPv4 address, geolocatable)
MAC address | gateways | InsightfulScreenshots | ["1C:93:7C:**:**:**"] (unique hardware identifier)
Hardware fingerprint (HWID) | hwid | InsightfulScreenshots | 8f9f16f0-****-****-****-************ (persistent device ID in UUID format)
Computer hostname | computer | InsightfulScreenshots | desktop-ue2****
Operating system & version | os, osVersion | InsightfulScreenshots | win32, 10.0.19045
Application file path | appFilePath | InsightfulScreenshots | C:\Program Files\Google\Chrome\Application\chrome.exe
Active window title | windowTitle | InsightfulScreenshots | Full window title revealing document/conversation content
Browser URL visited | browserUrl | InsightfulScreenshots | Full URL being viewed at time of screenshot
Desktop screenshot image | storageUrl | InsightfulScreenshots | Direct S3 URL to actual screenshot image file
Productivity score | externalProductivityScore | InsightfulScreenshots | Numeric productivity rating per screenshot interval
Timezone | timezone | InsightfulScreenshots, Timelog | America/Chicago (reveals approximate geographic location)
Session duration | duration, timeStart, timeEnd | Timelog | Exact milliseconds worked per session
Pay deduction reason | reasonForDeduction, appName | Deductions | Why money was subtracted from pay, linked to specific application

The combination of IP address + MAC address + HWID creates a triple device fingerprint that uniquely identifies not just the person but the specific physical machine they used. Under GDPR, online identifiers such as IP addresses and device identifiers can constitute personal data when they can be linked to an individual (Article 4(1), Recital 30). Under CCPA, unique device identifiers fall within the definition of personal information.

8. Fraud Profiling and Algorithmic Decision-Making

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
Fraud probability score | posteriorProbability, modelScore | FraudEvents, FraudSignalAuditLog | Bayesian probability (0.0–1.0) that individual is fraudulent
Fraud decision | currentDecision, status | FraudStates, FraudCheck | APPROVE / ESCALATE / REJECT (algorithmic verdict on individual)
LLM-generated fraud reasoning | currentReasoning, manual_review_rational | FraudStates, FraudCheck | AI-written paragraph explaining why individual was flagged: "The primary concern is a maximum location mismatch score of 1.0, indicating the user's IP address is entirely inconsistent with their stated profile location..."
Fraud signal inventory | currentKeySignals, flag_reasons | FraudStates, FraudCheck | ["location_mismatch: 1.0", "email_diff: 0.125", "email_is_pwned: False"]
HaveIBeenPwned result | email_is_pwned (in signals) | FraudStates | Whether contractor's email was found in known data breaches
VPN/Tor detection | Referenced in fraud signals | FraudStates, FraudSignalAuditLog | Whether VPN or Tor usage was detected
Cheating detection | isCheating, cheatingProbability, signs | CheatingDetection | Whether individual was flagged for cheating during interviews
Duplicate account detection | userIdList | DuplicateGroups | Groups of accounts believed to belong to the same person

Automated fraud decisions directly impacted individuals' ability to earn income through the platform. Under GDPR Article 22, individuals have the right not to be subject to decisions based solely on automated processing that produce legal or similarly significant effects. The exposure of the complete fraud reasoning — including the LLM-generated explanations — reveals the inner workings of an automated decision-making system that determined whether people could work and earn money.
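To show how much an exposed fraud record reveals, the sketch below parses a signal list in the format observed in the sample ("name: value" strings) and maps a posterior probability onto the three observed verdicts. The parsing follows the observed format; the decision thresholds are purely hypothetical, since the sample does not disclose the cut-offs Mercor's model used.

```python
from __future__ import annotations


def parse_signals(raw: list[str]) -> dict[str, float | bool]:
    """Turn 'name: value' strings into a typed mapping of signal name to value."""
    parsed: dict[str, float | bool] = {}
    for item in raw:
        name, _, value = item.partition(": ")
        if value in ("True", "False"):
            parsed[name] = value == "True"
        else:
            parsed[name] = float(value)
    return parsed


def decide(posterior: float, approve_below: float = 0.2, reject_above: float = 0.8) -> str:
    """Map a fraud probability to a verdict. Both thresholds are hypothetical."""
    if posterior < approve_below:
        return "APPROVE"
    if posterior > reject_above:
        return "REJECT"
    return "ESCALATE"


signals = parse_signals(["location_mismatch: 1.0", "email_diff: 0.125", "email_is_pwned: False"])
print(signals)
print(decide(0.5))
```

Anyone holding the leaked rows can run the same kind of parsing, which means the per-person signals, scores, and verdicts are trivially machine-readable rather than buried in opaque model state.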

9. Communications and Third-Party PII

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
In-platform message content | content | Comms, CommsSent | Full text of messages between contractors, recruiters, and clients
Outreach email content | subject, content, messageTemplate | EmailTemplates, OffPlatformCampaignSteps | Full email templates with subject lines
Phone call logs | Call metadata | AircallComms | Aircall VoIP call records
Professional reference PII | name, email, company, relationship | UserReferences | Third parties' names, emails, and employers (people who did not sign up for Mercor)
LinkedIn profiles of non-users | linkedinUrl, email | LinkedinWarmIntros | Full LinkedIn URLs and email addresses of people contacted for warm intros
Voucher/endorser PII | voucherUserId, candidateEmail, candidateName, candidateLinkedinId | CandidateVouches | Names, emails, and LinkedIn IDs of both vouchers and vouched-for candidates
Recruiter notes | noteBody, notesForCandidate | ListingNotes, Candidates | Candid internal commentary about individuals

The exposure of third-party PII is particularly significant for legal liability. UserReferences contains the names, email addresses, employers, and relationships of professional references — individuals who never created Mercor accounts and never consented to having their data stored in Mercor's production database. LinkedinWarmIntros contains LinkedIn URLs and emails of people contacted for recruitment outreach. These third parties had no contractual relationship with Mercor and no opportunity to consent to or opt out of data collection.

10. PostHog Behavioral Analytics De-Anonymized

Data Element | Database Column(s) | Source Table(s) | Format Observed in Sample
User email linked to analytics session | userEmail | PosthogAnalytics | Personally identified analytics sessions (defeating anonymization)
Company context | company | PosthogAnalytics | Which company the user was associated with during the session
Session timing | startTimeUtc, endTimeUtc | PosthogAnalytics | Exact session start/end times
Active/inactive time | activetime, inactivetime | PosthogAnalytics | How long the user was actively engaged vs. idle
Entry URL | startUrl | PosthogAnalytics | The URL the user was on when the session started

PostHog sessions are typically anonymous or pseudonymous. The PosthogAnalytics table explicitly links userEmail to session data — effectively de-anonymizing behavioral analytics and creating a personally identifiable record of how each contractor and company user navigated the platform.

Any single category above would trigger breach notification obligations under most privacy laws. The combination creates exposure across multiple overlapping regulatory regimes:

Regulation Applicable Data Key Provisions
GDPR (EU/UK) All categories (Mercor processes data of EU/UK contractors; sample shows United Kingdom, Harrow residence) Articles 5, 6, 9 (special categories), 13-14 (transparency), 22 (automated decisions), 33 (notification to the supervisory authority within 72 hours of becoming aware), 34 (notification to affected individuals without undue delay where the risk is high)
CCPA/CPRA (California) Personal identity, financial, employment, device identifiers, behavioral analytics Right to know, right to delete, right to opt-out of sale/sharing, private right of action for data breaches resulting from failure to maintain reasonable security
Illinois BIPA Facial geometry scans from Persona liveness detection, facial comparison thumbnails stored as image keys $1,000–$5,000 per violation statutory damages, private right of action, no harm requirement
FCRA (Federal) Background check results, adverse media checks, fraud decisions used for employment decisions Requires permissible purpose, adverse action notices, accuracy obligations, private right of action
ECPA / Wiretap Act Desktop screenshots capturing communications, browser URLs, window titles Consent requirements for interception of electronic communications
State Data Breach Notification Laws (all 50 US states) Name + financial account number, name + SSN, name + government ID Mandatory notification to affected individuals, typically within 30-60 days
PIPEDA (Canada) All categories — sample shows Canadian contractor (country: CA, BANK OF M*******, routing_number: 000**-***) Breach notification to Privacy Commissioner and affected individuals
SOX / PCI-DSS Financial account data, payment card information if present, bank routing numbers Compliance obligations for financial data handling
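The notification windows in the table are simple date arithmetic, which the following sketch works through. It assumes a single "became aware" timestamp; that timestamp is hypothetical and is not taken from the source. The 72-hour figure is the GDPR Article 33 window, and the 30 and 60 day figures are the range the table cites for US state laws, where the exact window varies by state.

```python
from datetime import datetime, timedelta, timezone

# GDPR Article 33(1): notify the supervisory authority within 72 hours
# of becoming aware of the breach.
GDPR_AUTHORITY_WINDOW = timedelta(hours=72)

# The table cites "typically within 30-60 days" for US state laws, so both
# ends of that range are computed rather than a single date.
STATE_WINDOW_DAYS = (30, 60)


def notification_deadlines(aware_at: datetime) -> dict:
    """Return illustrative deadlines measured from the moment of awareness."""
    if aware_at.tzinfo is None:
        raise ValueError("aware_at must be timezone-aware")
    return {
        "gdpr_authority": aware_at + GDPR_AUTHORITY_WINDOW,
        "state_earliest": aware_at + timedelta(days=STATE_WINDOW_DAYS[0]),
        "state_latest": aware_at + timedelta(days=STATE_WINDOW_DAYS[1]),
    }


# Hypothetical awareness time, chosen only for the example.
aware_at = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)
deadlines = notification_deadlines(aware_at)
for name, when in deadlines.items():
    print(f"{name}: {when.isoformat()}")
```

The point of the arithmetic is how short the first window is: the regulator deadline falls three days after awareness, long before most of the per-individual notices are due.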

The exposed data supports claims for:

The Scope - Who Is Affected

The breach affects multiple distinct populations, each with different legal standing:

  1. Contractors (Primary Class) — Every person who signed up, completed an interview, or performed work through Mercor has their full PII exposed: full legal name, personal email, phone number, date of birth, home address, government ID verification status, bank name and routing number, employment terms with exact pay rates, performance reviews with dismissal reasons, and in many cases desktop screenshots of their computer screens while working. The MercorUserFinancials table alone contains sufficient information for bank account fraud — the bank name, routing number, last four digits of account number, account holder name, and country are all stored in plaintext JSON.

  2. Client Companies — Companies that hired through Mercor have their project names (including OpenAI, Apertus - Elephant), internal tooling references, billing details, hiring criteria, candidate evaluation notes, Slack workspace URLs, Okta SSO group configurations, and annotation platform URLs exposed. These include some of the most valuable and secretive AI organizations on the planet.

  3. Mercor Employees — Internal staff are identifiable through the IacDeploymentRuns table (GitHub usernames as actor fields), CatfishAuditLog (Slack user IDs and real names), DATABASECHANGELOG (migration author names), MLExperimentsJobPerformanceReviews (reviewer names like A*** K*****), and the IAM table (users with ghost role assignments within client companies).

  4. Third Parties Who Never Consented — Professional references (UserReferences) provided their name, email, employer, and relationship to the contractor. LinkedIn contacts (LinkedinWarmIntros) had their profile URLs and email addresses stored. Vouching parties (CandidateVouches) provided detailed relationship information. These individuals had no direct contractual relationship with Mercor, likely received no privacy notice, and had no opportunity to consent to or opt out of data collection. Their data was collected incidentally through the contractors they were associated with.

The Scale - Mercor Client Ecosystem

What elevates this breach from a typical startup data leak to an industry-wide crisis is who Mercor's clients are.

Meta, OpenAI, and Google DeepMind are among Mercor's publicly known clients — as reported by the Wall Street Journal — but even our small sample reveals direct evidence of engagements with at least six major technology companies, plus numerous additional clients identifiable through project codenames and Airtable workspace names.

Confirmed Client Engagements Found in the Sample

The sample file contains not just the production database tables but also an ./EXPORTS/ directory with full Airtable workspace dumps — organized by client name. These exports contain the actual work product: prompts, model responses, evaluation rubrics, and contractor submissions. The client names appear directly in the directory structure:

Client Evidence in Sample What Was Exposed
Apple Airtable workspace: AIRTABLE_APPLE_ENDPOINT_SANDBOX_APP3PG4U42BALES9K containing tables: TEXT, DEEP_L, TEXT_ORCHESTRATOR, RUBRIC_AUTO_GEN Apple's proprietary AI model outputs. The TEXT table contains prompt-response pairs from Apple Foundation Models (afm-text-083, afm-model-085, afm-model-086) — Apple Intelligence's internal language models. Sample: model afm-text-083 responding to user prompts with temperature=0.7, top_p=0.9. The DEEP_L table shows translation evaluation (text→Spanish). The TEXT_ORCHESTRATOR table shows orchestrator model (afm-model-086) being tested. This is pre-release Apple Intelligence evaluation data.
Amazon Airtable workspace: AIRTABLE_AMAZON_LLM_COT_EVALUATION___UPDATED_APP0JM1SJ4XOHMAQC containing tables: DOMAINS, PHASE_1_TASKS, PHASE_1_REVIEWS, TALENT Amazon's LLM Chain-of-Thought evaluation data. The DOMAINS table shows evaluation categories (math, stem). The PHASE_1_TASKS table contains full model A vs. model B comparison data with complete Chain-of-Thought reasoning traces, final responses, and preference judgments. Tasks are claimed by named Mercor staff (e.g., n****k@mercor.com). This exposes Amazon's internal model evaluation methodology and scoring rubrics.
OpenAI Performance review record: Account: OpenAI, Project: Apertus - Elephant, reviewed by named staff. Feather platform URL: feather.openai.com/campaigns/998855ab-.... Project codename in Projects_Audit. Named contractor (A***** D****, a*****@gmail.com) rated 4 - Redefines Expectations on OpenAI project work. Direct URL to OpenAI's internal Feather annotation platform with campaign UUID.
Anthropic Airtable workspace: AIRTABLE_API_PREFERENCE containing PROMPTS, RESPONSES, ROLES, DOMAINS tables. Project: GPT-4 vs Claude Evaluation comparing GPT-4 and Claude 3.5 Sonnet. AgentSandboxes table shows agentType: claude. LLM preference evaluation data comparing Anthropic's Claude 3.5 Sonnet against GPT-4 across use cases. AI coding agent sandbox sessions running Claude. Exposes model comparison methodology and evaluation criteria.
Meta Publicly confirmed client per WSJ. Project references in Projects_Audit and ProjectIntegrations. Contractor work product, project configurations, Slack workspace integrations.
Google DeepMind Publicly confirmed client per WSJ. Contractor work product and project data in the full database.

Airtable Workspace Inventory

The sample file reveals 25+ distinct Airtable workspaces that were exported as part of the breach. Each workspace name follows a pattern that often includes the client name or project identifier. Beyond the named clients above, the Airtable exports include:

Airtable Workspace Domain Notable Tables
APEX_LEGAL APEX benchmark - Legal TASKS, CRITERIA, TALENT, LLM_CALL_CONFIGURATION
APEX_INSURANCE APEX benchmark - Insurance TASKS, CRITERIA, TALENT, IMPORTED_TABLE
APEX_DATA_SCIENCE APEX benchmark - Data science TASKS, CRITERIA, TALENT, LLM_CALL_CONFIGURATION
APEX_MECHANICAL_ENGINEERING APEX benchmark - Engineering TASKS, HELPER, FAILURE_ANALYSIS, TALENT
APEX_DIY APEX benchmark - DIY/consumer TASKS, CRITERIA, TALENT
ATHENA_HLE___RUBRICS Athena HLE (Humanity's Last Exam) rubrics TASKS, MODEL_RESPONSES, AWAITING_REVIEW_METRICS
ATHENA_HLE__STEM_ Athena HLE STEM evaluation ATHENA_STEM_V_1, QA_SPECS
BEAR_MEDICINE Medical domain tasks DISCIPLINES, REVIEWER_ASSESSMENT, WRITER_DAILY_ACTIVITY, BONUS_PAYOUTS, PODS
AIME_RUBRICS AIME (math competition) rubrics TEAMS, TASKS, USERS
ARXIV_Q_A (multiple versions) Academic paper Q&A generation WORK_QUEUE, DOUBLE_BLIND, LEAD_AUDIT_QA, TESTING_ARXIV_LINKS
AUTO_REVIEWER Automated review system SUBMISSIONS, LLM_CALL_CONFIGURATIONS, PROJECTS
09_29_CAND_MODEL_EVAL Candidate model evaluation (IB1, IB2, CML) IB_1, IB_2, CML, CML_DEPRECATED_
API_PREFERENCE API preference evaluation PROMPTS, RESPONSES, ROLES, DOMAINS, PROMPT_TEMPLATES
APEX_EXPANSION_WEBSITE_TASKS Website-related expansion CRITERION, FILE, TASK
APEX_EVALS General evaluation framework EVALUATION_RESULTS
APEX_V1_REVISION Apex V1 revision EXPERT, RUBRIC, CRITERION, ROLE

The ATHENA_HLE workspaces are particularly significant — "HLE" likely refers to Humanity's Last Exam, a high-profile AI benchmark designed to test frontier model capabilities. The MODEL_RESPONSES table in the rubrics workspace suggests Mercor contractors were grading AI model outputs against this benchmark, and the AWAITING_REVIEW_METRICS table indicates an active review pipeline. If this data reached adversarial actors, it could be used to game or contaminate one of the most important AI evaluation benchmarks.

The BEAR_MEDICINE workspace reveals medical domain annotation work with DISCIPLINES, REVIEWER_ASSESSMENT, and WRITER_DAILY_ACTIVITY tables — indicating Mercor contractors were creating or evaluating medical AI training data, adding healthcare data to the breach's sensitivity profile.

Evidence from Named Projects in the Database

Beyond the Airtable exports, the production database tables contain additional project references:

Project Codename Domain Evidence Source
Apertus — Elephant AI model evaluation (OpenAI-linked) MLExperimentsJobPerformanceReviews: Account: OpenAI
Project Mega Large-scale annotation (dedicated Slack workspace: project-mega.slack.com) ProjectIntegrations, ActionsQueue
Panacea — Consulting RL Envs Reinforcement learning environments Projects_Audit, 400+ billable hours
Agentic Code Final QC Audit AI code generation quality control (GitHub issue solving) TaskDefinitions
GPT-4 vs Claude Evaluation LLM preference ranking (GPT-4 vs Claude 3.5 Sonnet) Airtable export: AIRTABLE_AIRTABLE_AI_AGENT_DEMO
Creative Writing Evals Creative content evaluation Projects_Audit
arXiv Q&A Academic paper Q&A generation (multiple Airtable versions incl. Snowflake integration) Airtable exports (3+ copies with dates)
Queensland (litigation) Legal domain Projects_Audit
FP&A / Corporate Finance Finance domain Projects_Audit
Obsidian Human data client (billingModel: "invoice", tagged humandataclient) Company

The Magnificent Seven, Frontier AI Labs, and the Competitive Fallout

Mercor is not a niche startup. According to Big Think and TechCrunch, Mercor has signed deals with six of the seven "Magnificent Seven" tech giants — Apple, Microsoft, Alphabet, Amazon, Meta, and Nvidia — plus frontier model developers OpenAI and Anthropic. The company employs over 30,000 contractors, pays an average rate of $95/hour, and reached a $500 million annual revenue run rate within 17 months of launch. It is valued at $10 billion.

This means the stolen data (the 211GB database, the 939GB of source code, the 3TB of cloud storage, and the 84 Airtable workspaces documented below) contains the operational records, AI training data, and work product for engagements touching nearly every major AI program in the Western world.

The small sample analyzed in this report already confirms direct evidence of work for Apple (Foundation Model outputs), Amazon (LLM Chain-of-Thought evaluation), OpenAI (Feather platform, Apertus project), Anthropic (Claude evaluation), and Meta (multimedia annotation templates). The full 211GB database — which we have not seen — would contain the complete records for all six Magnificent Seven clients plus the frontier labs.

The competitive implications are severe:

  1. The training data itself is the prize. The leaked RLHF annotations, model evaluation data, and preference rankings produced by Mercor's contractors represent billions of dollars in training data investment. This data — now in the hands of Lapsus$ and available to any buyer — could be used by any competitor to accelerate their own model development without incurring the cost of generating it. As Y Combinator president Garry Tan noted: "Incredible amount of SOTA training data now just available to China thanks to @mercor_ai leak. Every major lab. Billions and billions of value."

  2. Apple Foundation Model outputs are in the dump. The AIRTABLE_APPLE_ENDPOINT_SANDBOX workspace contains actual afm-text-083 and afm-model-086 model responses — pre-release Apple Intelligence outputs. These provide direct insight into Apple's model capabilities, safety alignment approach, and weaknesses before public release. Any competitor — whether a Silicon Valley rival or a lab in Beijing, London, or Tel Aviv — now has access to Apple's unreleased model behavior.

  3. Amazon's Chain-of-Thought evaluation methodology is exposed. The AIRTABLE_AMAZON_LLM_COT_EVALUATION workspace reveals how Amazon evaluates LLM reasoning quality, including the full prompts, complete Chain-of-Thought traces, and preference rubrics. The methodology itself is as valuable as the data — it reveals what Amazon considers "good reasoning" and how they measure it.

  4. The Anthropic/Claude evaluation data could inform adversarial attacks. The preference evaluation data comparing Claude 3.5 Sonnet against GPT-4 — including the exact prompts, response pairs, and preference reasoning — could be used to identify weaknesses in Claude's alignment or to train models that specifically exploit those weaknesses.

  5. Mercor's global contractor base spans dozens of jurisdictions. With 30,000+ contractors across many countries, Mercor's database contains work authorization records, physical location data, and IP-based geolocation. The platform's fraud detection system flags contractors whose physical IP doesn't match their declared residence — meaning the database contains a map of which contractors may be working from undisclosed locations.
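The location check described in the last point can be pictured as a comparison between two fields. The sketch below is a minimal illustration under stated assumptions: the ContractorSession type, its field names, and the sample values are hypothetical, and the source does not show how Mercor's fraud detection is actually implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContractorSession:
    """Hypothetical record; field names are illustrative, not from the dump."""
    contractor_id: str
    declared_country: str      # ISO 3166-1 alpha-2, e.g. "CA"
    ip_country: Optional[str]  # country inferred from the IP, None if unknown


def location_mismatch(session: ContractorSession) -> bool:
    """True only when an IP-derived country exists and differs from the declared one."""
    if session.ip_country is None:
        return False
    return session.ip_country.upper() != session.declared_country.upper()


# Synthetic sessions for the example.
sessions = [
    ContractorSession("c1", "CA", "CA"),
    ContractorSession("c2", "GB", "us"),
    ContractorSession("c3", "US", None),
]
flagged = [s.contractor_id for s in sessions if location_mismatch(s)]
print(flagged)
```

Once a flag like this is computed and stored, it becomes sensitive data in its own right, which is why its presence in a breached database matters.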

Beyond the companies confirmed in the data, multiple sources — including former Mercor employees — claim that Mercor also maintains engagements with Chinese AI laboratories, including companies developing frontier models that compete directly with the labs whose training data is now in the breach. If true, this means Mercor was a single point of compromise connecting competing labs on opposite sides of the global AI race, with training data, evaluation methodologies, model outputs, and contractor talent pools for all of them sitting in the same breached infrastructure.

Even setting aside the question of direct Chinese client relationships, the stolen data — RLHF annotations, preference rankings, model evaluation rubrics, and Chain-of-Thought traces produced for OpenAI, Anthropic, Apple, Amazon, Meta, and Google — is now available on the black market. Given that Lapsus$ is actively auctioning the data, this material will reach whoever is willing to pay for it.

The TaskDefinitions table also references autograder configurations using openai/gpt-4.1 and openai/gpt-5 as scoring models, and task rubrics include constraints like "LLMs other than ChatGPT are prohibited" — rules that only make sense when the work product is destined for a specific model vendor's training pipeline.

The scope of client engagements extends far beyond AI companies. The Airtable workspaces alone span legal, insurance, data science, mechanical engineering, medicine, academic research, and mathematics — suggesting Mercor's contractor workforce touches data and systems across a wide range of industries. Any attacker with access to the full dump could enumerate every active client engagement by cross-referencing the Company, Projects_Audit, ProjectIntegrations, Listings_New tables, and the complete Airtable export directory.

The Airtable Export - 84 Workspaces, 1,055 Files

A separate directory tree from the breach (EXPORTS/) reveals the full structure of the exfiltrated Airtable data. The export contains 84 unique Airtable workspaces totaling 1,055 JSONL files — each file containing the complete contents of one Airtable table. This is not a sample. It is the complete export of every Airtable base connected to Mercor's Fivetran data pipeline.
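Counts like these can be reproduced by walking an export tree. The sketch below assumes a layout of one directory per workspace with one .jsonl file per table directly or indirectly beneath it; that layout, the function name, and the workspace names in the demo are assumptions made for illustration, and the demo runs against a synthetic temporary directory rather than any real export.

```python
import tempfile
from collections import Counter
from pathlib import Path


def inventory_export(root: Path) -> Counter:
    """Count .jsonl files per top-level workspace directory under root."""
    counts = Counter()
    for path in root.rglob("*.jsonl"):
        parts = path.relative_to(root).parts
        if len(parts) < 2:
            continue  # file sits directly in root, so it has no workspace
        counts[parts[0]] += 1
    return counts


# Synthetic demo tree; the names are placeholders.
demo_layout = {"WORKSPACE_A": ["TASKS", "TALENT"], "WORKSPACE_B": ["PROMPTS"]}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for workspace, tables in demo_layout.items():
        (root / workspace).mkdir()
        for table in tables:
            (root / workspace / f"{table}.jsonl").write_text("{}\n")
    counts = inventory_export(root)
    workspace_total = len(counts)
    file_total = sum(counts.values())

print(workspace_total, file_total)
```

An organization scoping its own exposure could run the same kind of inventory on its own pipeline output to know what a full export would contain.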

The directory structure reveals how Airtable sits at the center of Mercor's operation. It is used as:

  1. The annotation task management system — Every domain-specific project has its own Airtable base with a standardized schema: TASKS, TASK_VERSIONS, CRITERIA, DOMAIN, SUBDOMAIN, TALENT, QA_SPECS, WORKFLOW, LLM_CALL_CONFIGURATION, CONTROL_PANEL, and FILES. This is a fully industrialized annotation pipeline.

  2. The work product repository — Tables like PHASE_1_TASKS (Amazon), TEXT (Apple), PROMPTS/RESPONSES (API Preference), and MODEL_RESPONSES (Athena HLE) contain the actual task inputs and outputs — the prompts sent to AI models, the model responses, and the human evaluations. This is the training data itself.

  3. The talent and compensation ledger: TALENT tables appear in nearly every workspace, tracking which contractors worked on which tasks. CALCULATED_BONUSES, BONUS_PAYOUTS, TIMELOG, and CLAIMS tables track compensation. WRITER_STATS, REVIEWER_STATS, and WRITER_DAILY_ACTIVITY tables (in BEAR_MEDICINE) track individual productivity.

  4. The QA and audit system: QA_SPECS, LEAD_AUDIT_QA, DOUBLE_BLIND, and REVIEWER_ASSESSMENT tables track quality control processes.

The named workspaces can be organized into categories that reveal the full breadth of Mercor's operations:

Client-Named Workspaces (Direct Client Evidence):

Workspace Client Content
APPLE_ENDPOINT_SANDBOX Apple Apple Foundation Model outputs (afm-text-083, afm-model-086), translation testing (DEEP_L), orchestrator testing (TEXT_ORCHESTRATOR), rubric auto-generation
AMAZON_LLM_COT_EVALUATION (2 versions) Amazon LLM Chain-of-Thought evaluation: DOMAINS, PHASE_1_TASKS, PHASE_1_REVIEWS, MODEL_A_STRENGTHS
AAIE___META_MULTIMEDIA_TEMPLATE_COMMAND_CENTER Meta Meta multimedia annotation template with OVERALL_META, PROJECTS, FORMS, and TEMPLATE tables. Workspace name explicitly says "META" and "USE META_X_MULTIMEDIA_SPL_AIRTABLE_TEMPLATE"
API_PREFERENCE / API_PREFERENCE_V2 / API_PREFERENCE__COPY__FOR_BRENDAN / API_PREF___KANIX Anthropic/Multi-vendor LLM API preference evaluation: PROMPTS, RESPONSES, ROLES, DOMAINS, PROMPT_TEMPLATES, QA. Multiple versions and personal copies for named staff

APEX - Mercor's AI Benchmark Suite (Compromised):

The APEX_ prefix identifies Mercor's proprietary suite of AI benchmarks — domain-specific evaluation frameworks used to measure AI model performance across verticals. Each APEX benchmark has its own Airtable workspace with a standardized schema: TASKS, TASK_VERSIONS, CRITERIA, DOMAIN, SUBDOMAIN, QA_SPECS, WORKFLOW, LLM_CALL_CONFIGURATION, and CONTROL_PANEL. The complete APEX suite spans 15+ domains:

Workspace Benchmark Domain Notable Tables
APEX_LEGAL Legal reasoning Standard APEX schema
APEX_INSURANCE Insurance domain Standard APEX + IMPORTED_TABLE
APEX_FINANCE Financial services Standard APEX + HELPER
APEX_ACCOUNTING Accounting Standard APEX
APEX_CONSULTING Management consulting Standard APEX + TEST_HEX_TABLE
APEX_DATA_SCIENCE Data science Standard APEX
APEX_MECHANICAL_ENGINEERING Engineering Standard APEX + FAILURE_ANALYSIS, HELPER
APEX_MEDICINE Medical/healthcare Standard APEX
APEX_FOOD Food industry Standard APEX + DELIVERIES
APEX_GAMING Gaming Standard APEX
APEX_RETAIL___E_COMMERCE Retail & e-commerce Standard APEX + DOMAIN_QC
APEX_SALES___MARKETING Sales & marketing Standard APEX
APEX_SHOPPING_STYLISTS Personal shopping Standard APEX
APEX_DIY (2 versions) DIY/consumer Standard APEX
APEX_WEBSITE_TASKS / APEX_EXPANSION_WEBSITE_TASKS Web content CRITERION, FILE, TASK

The exposure of the complete APEX benchmark suite, including all tasks, criteria, scoring rubrics, and LLM_CALL_CONFIGURATION tables, renders these benchmarks untrustworthy. Any AI model trained on the leaked APEX data can appear to perform well on these benchmarks without genuinely possessing the evaluated capabilities. This is benchmark contamination at scale. Unless Mercor rebuilds the entire APEX suite from scratch with new tasks, new criteria, and new evaluation data, every APEX benchmark result produced after this breach is suspect. The EVALS workspace, which contains APEX_RESULTS, BOREALIS_RESULTS, and LUCIUS_RESULTS, further confirms that APEX was actively used to evaluate and compare models, making the contamination risk concrete and immediate.

Other AI Benchmark and Evaluation Workspaces:

Workspace Purpose Notable Tables
ATHENA_HLE___RUBRICS Humanity's Last Exam rubric grading MODEL_RESPONSES, AWAITING_REVIEW_METRICS, CLAIMS
ATHENA_HLE__STEM_ (4 versions incl. July 3, 2025 dated copy) HLE STEM vertical evaluation ATHENA_STEM_V_1
APEX_HLE_BASED_RUBRICS HLE-derived rubric system CRITERIA, LLM_CALL_CONFIGURATION
APHRODITE__SEARCH_HLE Search-based HLE evaluation HLE search variant
ACADEMIC_REASONING_SFT Supervised fine-tuning for academic reasoning COT (Chain-of-Thought), ROLES, TALENTS
AIME_RUBRICS AIME math competition rubrics TEAMS, USERS, TASKS
EVALS / EVALS__COPY_ General evaluation framework APEX_RESULTS, BOREALIS_RESULTS, LUCIUS_RESULTS, _09_04_HLE_RUBRICS
09_29_CAND_MODEL_EVAL (5 versions) Candidate model evaluation (IB1, IB2, CML) Iterative model comparison datasets

Medical Domain Workspaces:

Workspace Purpose Notable Tables
BEAR_MEDICINE Medical annotation DISCIPLINES, REVIEWER_ASSESSMENT, ASSESSMENT, WRITER_DAILY_ACTIVITY, REVIEWER_STATS, WRITER_STATS, ALL_TIME_TOP_5, BONUS_PAYOUTS, CLAIM_LOCK, AHT_STATS, ASSESSMENT_ANALYSIS, PODS
BEAR_RADIOLOGISTS Radiology-specific annotation Radiologist-specific tasks
BANKERS Financial/banking domain Banking-specific tasks

Aircall Integration (complete phone system export):

The export also includes a full Aircall directory — Mercor's VoIP phone system — containing 27 tables: CALL, CALL_TRANSCRIPTION, CALL_TRANSCRIPTION_CONTENT_UTTERANCE, CALL_SENTIMENT, CALL_SENTIMENT_PARTICIPANT, CALL_SUMMARY, CALL_ACTION_ITEM, CALL_TAG, CALL_TOPIC, CONTACT, CONTACT_EMAIL, CONTACT_NUMBER, USERS, USER_AVAILABILITY, and more. This represents the complete call history including full transcriptions, sentiment analysis, AI-generated summaries, and contact information for every recruiter phone call.

What the Airtable Export Means:

The Airtable export transforms this breach from a database leak into a complete AI training data theft. The database tables documented in the rest of this article provide the metadata — who worked on what, when, and how much they were paid. The Airtable export contains the actual work product: every prompt, every model response, every human evaluation, every rubric score, every Chain-of-Thought trace, and every preference judgment that Mercor's contractors produced for Apple, Amazon, OpenAI, Anthropic, Meta, and dozens of other clients.

The iterative versioning visible in the workspace names (e.g., APEX_RUBRICS with 12+ dated copies from August 7, 2025 through January 23, 2026) reveals that this export captured the complete historical evolution of Mercor's benchmark and evaluation pipeline — not just a snapshot, but the full development history of rubrics, task definitions, and evaluation criteria across months of refinement. For the APEX benchmarks specifically, this means every iteration of every benchmark task is now public — an attacker can study how the benchmarks evolved and craft model training data that targets the final versions.

Customer and Third-Party Platform URLs Found in the Dump

Beyond project codenames, the dump contains direct URLs to customer platforms, internal tools, and third-party services — embedded in configuration fields, JSON blobs, onboarding documents, and metadata columns across dozens of tables. An exhaustive search of the file reveals 1,800+ unique URLs. The most sensitive are catalogued below.
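A unique-URL count of this kind comes from a pattern search followed by de-duplication. The sketch below shows one way to do that on any text blob, for example when an organization scans its own database exports to see which links would be exposed. The regular expression, the function names, and the sample text are illustrative assumptions. The pattern only catches URLs that include an http or https scheme, so scheme-less values like those in the tables below would need a separate pass.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches http(s) URLs up to whitespace, quotes, angle brackets, or a backslash.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>\\]+""")


def extract_urls(text: str) -> set:
    """Return the set of unique URLs found in text, minus trailing punctuation."""
    return {m.group(0).rstrip(".,;)]}") for m in URL_PATTERN.finditer(text)}


def hosts_by_count(urls) -> Counter:
    """Count how many unique URLs point at each hostname."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            counts[host] += 1
    return counts


# Synthetic sample text; the domains are placeholders.
sample = (
    '{"annotationPlatform": "https://platform.example.com/campaigns/1"} '
    "see https://docs.example.org/guide, and "
    "https://platform.example.com/campaigns/2."
)
urls = extract_urls(sample)
hosts = hosts_by_count(urls)
print(len(urls), hosts.most_common())
```

Grouping by hostname is what turns a long URL list into a short list of affected platforms, which is the form a remediation team needs for link revocation and credential rotation.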

Client Annotation and Work Platforms

These are URLs to the actual platforms where Mercor contractors perform work for clients. Each one identifies a specific client engagement and, in many cases, a specific campaign or task within that client's systems:

URL / Domain What It Reveals Source Table
feather.openai.com/campaigns/998855ab-60e7-4aed-9f08-5fccd56fe53e OpenAI's internal Feather annotation platform — a specific campaign UUID, confirming Mercor contractors work directly inside OpenAI's tooling Projects_Audit (annotationPlatform)
alabaster-studio.com/project/abacus/conversation/7c9facb4-... A client project management / collaboration platform — captured as the live browser URL during a monitored work session InsightfulScreenshots (browserUrl)
glowstone-mli-rubrics.slack.com (channels: C0994P7BH2N, D09969QHV62) A client-specific Slack workspace for MLI rubric development — likely a client or partner organization's dedicated workspace ProjectIntegrations, ActionsQueue
project-mega.slack.com A dedicated Slack workspace for a single large-scale annotation project ProjectIntegrations
6 distinct Airtable workspace IDs (appX7l7xADlyFD3nL, appEzeshKTIKSrvBV, app9DBchZKUj2auMZ, appCZwMqiIUkP7KIQ, appLmn3266lQsaUXK, appYFQOZicXUoO2yz) Airtable used as an annotation and project management platform — each app ID is a distinct workspace, likely per-client or per-project Projects_Audit, OnboardingDocument
ta-01km6j8ztpd4vttvzb7ctgqteh-8080-ms3c95f46vnxcii7cwsi84ago.w.modal.host A Modal.com serverless deployment — indicating Mercor or a client runs ML model inference on Modal AgentSandboxes or service configuration

Mercor Internal Infrastructure URLs

These URLs expose Mercor's own internal architecture, allowing an attacker to map the entire operational surface:

URL / Domain What It Reveals Source Table
work.mercor.com Primary contractor work portal (100+ URLs with job IDs like /create/job_AAABm...) Comms, ActionsQueue
team.mercor.com Company-facing team portal Comms, EmailTemplates
talent.docs.mercor.com/how-to/okta-access Internal documentation portal — includes onboarding guides for Okta and Insightful setup ActionsQueue
api.mercor.com API gateway endpoint Configuration fields
dev.coil.mercor.com Development webhook endpoint for the coil microservice ProjectIntegrations
coil.mercor.com Production coil service endpoint ProjectIntegrations
c-mercor.okta.com Okta SSO instance — the identity provider for all contractor and staff authentication ActionsQueue, UserMetadata
linear.app/mercor Mercor's Linear issue tracker — exposes internal engineering project management Configuration metadata
pic-gen.r2.mercor.com Cloudflare R2 image generation service Asset URLs
ddcd-2601-642-4c01-5a8d-...ngrok-free.app An ngrok development tunnel — a temporary public URL exposing a local dev server, including the developer's IPv6 address embedded in the subdomain Webhook configurations

AWS S3 Buckets

Each S3 bucket below contains files that are directly addressable via URL if the bucket permissions are misconfigured. The bucket names alone reveal the categories of stored data:

S3 Bucket Contents
mercor-insightful-screenshots-production Every screenshot captured from contractor desktops during monitored work
mercor-background-check-photos Background check identity documents and photographs
ai-interviewer-recordings Audio/video recordings of AI-conducted interviews
dailyco-recordings Daily.co video call recordings
production-pdx-5557735*****-web-recordings Production call recordings (AWS account ID 5557735***** is embedded in the bucket name)
kite-uhn-brain-injury.s3.ca-central-1.amazonaws.com Medical documents — bucket name references brain injury records at UHN (University Health Network), a major Canadian hospital system
certn-api-s3-certn-images-ca-central-1-production Certn identity verification images
certn-api-s3-certn-rcmp-documents-ca-central-1-production RCMP (Royal Canadian Mounted Police) criminal record check documents
certn-api-s3-one-id-images-ca-central-1-production OneID government identity verification images

The S3 bucket kite-uhn-brain-injury is particularly alarming — it suggests that either Mercor or a client project involved handling protected medical records, and the bucket name alone leaks the nature of the data and the institution involved.

Google Workspace Documents

The dump contains direct URLs to 30+ Google Docs, 2+ Google Sheets, 2+ Google Forms, and 10+ shared Google Drive folders used for project onboarding, task instructions, rubric definitions, and team coordination:

Many of these Google Docs likely remain live and accessible if the sharing permissions are set to "anyone with the link" — a common practice for contractor onboarding materials.

Communication and Collaboration Evidence

Platform Evidence Count
Slack 4 distinct workspaces: mercor.enterprise.slack.com, project-mega.slack.com, glowstone-mli-rubrics.slack.com, 6385b64336a9545.slack.com 4 workspaces, 5+ named channels
Google Meet Meeting room codes: deo-ixih-ivt, cae-eois-jwn, hhr-erjm-svp, pmi-ogrs-aap, szd-qvcr-hfp, zoz-shgt-epy 6+ meeting rooms
LinkedIn Contractor profile URLs with full names Multiple profiles
Aircall Call recordings via media-web.aircall.io and assets.aircall.io Recruiter phone call audio
Ashby HQ Job postings at jobs.ashbyhq.com and app.ashbyhq.com Hiring platform
Certn Background check portals: mercor.certn.co/hr/applications/{uuid}/, enrollment at certn.trustmatic.ws/web-enrolment/ Identity verification flows

What This URL Inventory Means

An attacker with this data does not need to guess who Mercor's clients are or which systems contractors access. The URLs are already in the database. Specifically:

  1. OpenAI's Feather platform URL with a campaign UUID gives an attacker a direct entry point to probe OpenAI's annotation infrastructure
  2. S3 bucket names allow targeted enumeration attacks — checking whether buckets are publicly accessible or brute-forcing object keys based on the naming patterns visible in the dump
  3. Google Docs and Drive folders may still be live and accessible if shared via link — giving an attacker access to project rubrics, onboarding materials, and task instructions
  4. Slack workspace identifiers enable social engineering against teams working on specific projects
  5. The ngrok tunnel URL embeds a developer's IPv6 address, adding another vector for targeting Mercor engineering staff
  6. The AWS account ID (5557735*****) embedded in the S3 bucket name enables targeted cloud reconnaissance

The Screenshot Problem

The most dangerous element of this breach is the Insightful time-tracking screenshot system — and the danger compounds with every client Mercor serves, every platform URL catalogued above, and every S3 bucket of screenshots that can be systematically correlated.

Mercor requires contractors to install the Insightful (formerly Workpuls) monitoring agent on their computers. This agent captures a screenshot of the contractor's desktop every few minutes while they are clocked in. Each screenshot is uploaded to mercor-insightful-screenshots-production.s3.amazonaws.com and indexed in the InsightfulScreenshots table with rich metadata:

A sample screenshot record from the dump shows a contractor working in Google Chrome on alabaster-studio.com/project/abacus/conversation/..., with their IP (71.194.*.*), MAC address (1C:93:7C:**:**:**), hardware ID, and full filesystem path to Chrome all recorded.

Here is why this is catastrophic in context:

The database contains all the ingredients for a systematic visual intelligence operation. An attacker can join tables to correlate screenshots with client projects and platform URLs:

  1. Which client project a contractor was assigned to (from ProjectIAM and Jobs)
  2. Which annotation platform that project uses (from Projects_Audit.annotationPlatform — e.g., feather.openai.com, specific Airtable workspace IDs)
  3. Every screenshot taken while the contractor worked on that project (from InsightfulScreenshots filtered by contractorId and projectId)
  4. The exact URLs, window titles, and application contents visible in those screenshots — cross-referenced against the known client platform URLs to confirm which client's systems are shown

This means an attacker doesn't just get a list of Mercor's clients; they get a visual archive of what contractors saw inside those clients' systems. If the project was for OpenAI, the screenshots would show OpenAI's Feather annotation interface, the prompts being graded, and the evaluation criteria. If the project was for Meta, they would show Meta's internal tooling. If the project involved reinforcement learning environments, they would show the RL training data and reward models.

The scope of what these screenshots can reveal includes:

Perhaps most critically, the screenshots create an involuntary record of contractor misconduct. The Wall Street Journal has reported on growing concerns around AI training data supply chains, and contractors in these roles often have privileged access to sensitive client systems. If any contractor was engaged in unauthorized data exfiltration (copying proprietary datasets, screenshotting confidential research, leaking model weights, or otherwise violating their employment agreements), that activity may have been captured in the periodic screenshots taken by the monitoring system and is now available to anyone with the dump.

The monitoring system that was designed to protect Mercor's clients has become a comprehensive, timestamped, visually indexed archive of everything those clients wanted to keep secret.

This creates a cascading breach. Mercor's data exposure is not just a breach of Mercor; it is a proxy breach of every client organization whose internal systems, annotation platforms, Slack workspaces, and proprietary tooling were visible on a contractor's screen during monitored work sessions. The number of indirectly breached organizations could be as large as the number of clients whose projects ran with screenshot monitoring enabled.


Platform Overview

Mercor presents itself publicly as an AI-powered hiring platform. The database tells a more complete story: it is a full-stack labor marketplace and employment management system that spans acquisition, vetting, matching, contracting, surveillance, and payment.

The platform operates across at least three distinct product surfaces:

  1. Talent Portal — Where contractors create profiles, complete interviews, apply to listings, and track their work
  2. Company Portal — Where client companies post listings, review candidates, manage projects, and receive invoices
  3. Godmode / Internal Admin — An internal dashboard (GodmodeCompanies, GodmodeArbitraryCells) used by Mercor staff for operations

The backend is a microservices architecture with at least 13 named services: coil, site_fe, team_fe, work_fe, mercor_go, mercor_api, mercor_api_nginx, celery, workflow, db_trigger_consumer, steve, woz, and payments_temporal_worker. These are deployed on AWS ECS and managed via Terraform/Terragrunt in the mercor-monorepo GitHub repository.

The primary database is Aurora MySQL (AWS), with the analytics warehouse being Snowflake (evidenced by dbt model tables like DbtFirmSchoolRank and DbtSchoolRankings). Schema migrations are managed by Liquibase (evidenced by DATABASECHANGELOG and DATABASECHANGELOGLOCK tables).


Evidence - The Database Layer by Layer

The following sections present a systematic walk through every domain of the exposed database, with obfuscated sample records drawn directly from the dump. This is the evidence base for the claims made above.


Part I - User and Identity Layer

The Contractor Profile

At the core of Mercor's data model is the contractor. The MercorUsers_New table stores the primary user record, while MercorUsers_New_backup appears to be a historical snapshot. A sample (obfuscated):

Field Value
userId 7d10d057-0c11-438a-ace1-9a9c8a50c925
email e****a1@gmail.com
name T** O****
phone +44795718****
location United Kingdom, Harrow
createdAt 2025-08-30 09:49:20
lastLogin 2025-09-20 09:16:33
insightfulId wesvspdyd5m3zg2
stripeAccountId NULL
isDeleted 0

The insightfulId field is particularly significant — it links this user to their Insightful (formerly Workpuls) monitoring agent, meaning every screenshot taken of this person while working is tied to this identifier.

The MercorUsers_New table extends the backup with additional fields: phoneVerificationStatus, phoneVerifiedAt, phoneOptIn — indicating ongoing additions to the user data model. The authType field suggests support for multiple authentication providers (Firebase, Google OAuth, email/password).

Location and Residence Data

UserLocation stores both declared residence and physical presence:

Field Value
residenceCountry USA
physicalCountry USA
residenceState NULL
physicalState NULL

The distinction between residence and physical country is central to Mercor's fraud detection logic — a mismatch between declared location and actual IP-derived location is one of the primary fraud signals.
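
A minimal sketch of how such a signal could be computed from the two UserLocation fields. The binary 0.0/1.0 scoring, the class, and the function name are assumptions for illustration; the dump shows a location_mismatch score but not how Mercor derives it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserLocation:
    residence_country: Optional[str]  # mirrors the residenceCountry column
    physical_country: Optional[str]   # mirrors the physicalCountry column


def location_mismatch(loc: UserLocation) -> Optional[float]:
    """Return 1.0 on mismatch, 0.0 on match, None when either side is unknown."""
    if not loc.residence_country or not loc.physical_country:
        return None
    same = loc.residence_country.strip().upper() == loc.physical_country.strip().upper()
    return 0.0 if same else 1.0
```

Returning None for missing data keeps "unknown" distinct from "match", which matters when many rows carry NULL state or country values, as the sample does.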

UserMetadata enriches the contractor record with:

UserAvailability_Audit captures declared working hours: maxWeeklyHours, desiredWeeklyHours, expectedStartOffset, and timezone — allowing Mercor to understand contractor bandwidth and scheduling preferences.

Referral and Social Vouching

CandidateVouches is a comprehensive social trust mechanism. When a voucher endorses a candidate, they fill out a structured questionnaire:

Each field has a paired *Detail text field. This creates a rich graph of professional and social relationships.

UserReferences stores professional references with names, companies, relationships, and contact emails — conventional hiring data now sitting in an exposed database.

UserState tracks lifecycle metrics: resumeUploaded, interviewsCompletedCount, jobApplicationsCount, totalMillisWorked.


Part II - Identity Verification and Fraud Detection

The KYC Layer

Mercor uses Persona as its identity verification provider. The IDVerificationChecks table records each check with:

A sample Persona response shows:

{
  "type": "baseline",
  "interview_id": "intr_AAABnNOWs0wnj7Tmg0hBQpL5",
  "thumbnail_key": "intr_AAABnNOWs0wnj7Tmg0hBQpL5_thumbnail.jpg",
  "persona_account_id": "act_QMTuQh33A4QU23J8ECPSd32BBKb4"
}

The thumbnail key references a stored facial image from the verification session.

BackgroundCheck and BackgroundCheck_New record criminal background and adverse media checks (via Checkr or Certn):

Field Example
externalCandidateId Checkr candidate UUID
workLocation USA
package tasker_pro
status clear / consider
adverseMediaCheckStatus clear

ScreeningPackage defines what checks are bundled per company engagement, including checkConfig (JSON with individual check types) and graceDays (how many days a contractor has to complete checks before being blocked).

The Fraud Pipeline

Mercor operates a multi-stage fraud pipeline that is one of the most sophisticated components in the database. It runs at four stages: profile, interview, post-interview, and on-project.

FraudStates — The current fraud verdict per user, maintained as a state machine:

Field Example Value
userId 000087ef-2296-445c-b355-9d5e600e0af2
currentStage profile
currentDecision ESCALATE
currentConfidence medium
currentReasoning "The primary concern is a maximum location mismatch score of 1.0, indicating the user's IP address is entirely inconsistent with their stated profile location..."
currentKeySignals ["location_mismatch: 1.0", "email_diff: 0.125", "email_is_pwned: False"]

The reasoning field contains LLM-generated natural language explanations — almost certainly from Vertex AI / Gemini based on the signal schema.
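
The currentKeySignals entries in the sample are plain "name: value" strings. A small sketch of turning them into typed values for analysis; the parsing rules are an assumption based only on the three sample entries shown above.

```python
def parse_key_signals(signals):
    """Parse entries like 'email_diff: 0.125' into a {name: value} dict.

    Values become bool for 'True'/'False', float when numeric, else stay strings.
    """
    parsed = {}
    for item in signals:
        name, _, raw = item.partition(":")
        raw = raw.strip()
        if raw in ("True", "False"):
            value = raw == "True"
        else:
            try:
                value = float(raw)
            except ValueError:
                value = raw
        parsed[name.strip()] = value
    return parsed


sample = ["location_mismatch: 1.0", "email_diff: 0.125", "email_is_pwned: False"]
signals = parse_key_signals(sample)
```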

FraudCheck — The central fraud queue:

FraudSignalAuditLog — Every individual signal evaluated:

FraudEvents — Bayesian belief updates per event:

This is a textbook Beta-Binomial Bayesian fraud model — prior beliefs updated with evidence to produce posterior fraud probability estimates.
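
For readers unfamiliar with the technique, a minimal sketch of a Beta-Binomial update. The prior parameters and event counts are invented for illustration and are not values from the dump.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    alpha: float  # pseudo-count of fraud-like observations
    beta: float   # pseudo-count of benign observations

    def update(self, fraud_like: int, benign_like: int) -> "Beta":
        """Conjugate update: add observed counts to the prior pseudo-counts."""
        return Beta(self.alpha + fraud_like, self.beta + benign_like)

    @property
    def mean(self) -> float:
        """Posterior mean, read as the estimated probability of fraud."""
        return self.alpha / (self.alpha + self.beta)


prior = Beta(alpha=1.0, beta=9.0)                       # hypothetical prior, mean 0.10
posterior = prior.update(fraud_like=3, benign_like=1)   # becomes Beta(4, 10)
```

Because the Beta distribution is conjugate to the binomial likelihood, each new event only increments two counters, which is consistent with storing one belief update per row in an events table.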

ProductionFraudState — Final fraud disposition:

OnProjectFraudWindows — Time-based on-project fraud analysis:

CheatingDetection / CheatingDetection_Audit — Interview cheating detection:

QAReviewLog — Manual fraud review outcomes:

AutoFraudChecks — Automated rule-based checks triggered on a schedule or event.

DuplicateGroups — Groups of user IDs believed to be the same person (userIdList), with merge tracking (mergedIntoGroupId).
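
Merge tracking of this kind usually means a group can point at another group that has itself been merged. A sketch of resolving the surviving group by following mergedIntoGroupId pointers; the in-memory dict and the group IDs are hypothetical stand-ins for the table.

```python
def resolve_group(group_id, merged_into):
    """Follow mergedIntoGroupId pointers until reaching a group that was not merged.

    merged_into maps a group ID to the ID it was merged into, or None.
    """
    seen = set()
    while merged_into.get(group_id) is not None:
        if group_id in seen:
            raise ValueError(f"merge cycle at group {group_id}")
        seen.add(group_id)
        group_id = merged_into[group_id]
    return group_id


merges = {"g1": "g2", "g2": "g3", "g3": None}  # hypothetical group IDs
canonical = resolve_group("g1", merges)
```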


Part III - The Hiring Pipeline

Listings

Listings_New is the job posting table. A Mercor listing is considerably more structured than a typical job board entry:

Field Description
title Job title
description Full job description
rateMin / rateMax Pay rate range
hoursPerWeek Expected commitment
payRateFrequency hourly / monthly
workArrangement Remote / hybrid
eligibleLocation Which countries can apply
ineligibleResidenceLocation Explicitly excluded countries
listingType Job category
evaluationCriteria JSON rubric for ranking candidates
automatedCommsOn Boolean — auto-send rejection emails
automaticRejectionsOn Boolean — auto-reject below threshold
timeToAutoReject Days until auto-rejection fires
goalNumHires Target headcount
referralBoost Bonus multiplier for referred candidates
isExploreAlways Always appear on public explore page
disableApplications Freeze new applications

EvaluationCriteria stores the per-listing scoring rubric used during candidate ranking — each criterion has shortCriteria, type (hard filter or soft score), hardFilter boolean, and position for display ordering.
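
How hard filters and soft scores are combined is not visible in the sample. One plausible reading, sketched below with hypothetical criteria and per-candidate results: any failed hard filter disqualifies the candidate, and the remaining soft scores are averaged.

```python
def score_candidate(criteria, results):
    """Return None if any hard filter fails, else the mean of the soft scores.

    criteria: list of dicts with 'shortCriteria' and 'hardFilter' keys.
    results:  mapping of shortCriteria -> bool (hard filter) or score in [0, 1].
    """
    soft = []
    for criterion in criteria:
        value = results.get(criterion["shortCriteria"])
        if criterion["hardFilter"]:
            if not value:  # missing or False both count as a failed filter
                return None
        elif value is not None:
            soft.append(float(value))
    return sum(soft) / len(soft) if soft else 0.0


criteria = [
    {"shortCriteria": "Work authorization", "hardFilter": True},
    {"shortCriteria": "Python", "hardFilter": False},
    {"shortCriteria": "SQL", "hardFilter": False},
]
passed = score_candidate(criteria, {"Work authorization": True, "Python": 0.8, "SQL": 0.4})
failed = score_candidate(criteria, {"Work authorization": False, "Python": 1.0, "SQL": 1.0})
```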

ListingNotes stores internal recruiter notes per listing, including candid operational commentary. A sample (obfuscated):

"33 leads confirmed on sheet by B***** to send offers — @N*** to staff RM for conversion"

This reveals that Mercor staff are managing candidate pipelines directly, with named individuals responsible for conversions.

Candidates

Candidates / Candidates_Audit tracks every application:

Field Description
status applied / shortlisted / offered / rejected
listingStepConfigId Which step in the hiring funnel
notesForCandidate Recruiter notes visible to candidate
birthday Date of birth at application time
physicalLocation Where they were when applying
workAuthorizationStatus Work eligibility
rejectionReason Categorized rejection reason
starred Recruiter-starred flag
automaticRejectAt Scheduled auto-rejection timestamp
numCommsSent / lastCommSentAt Outreach tracking
referralId Linked referral if any

CandidateMatchScores provides ML-generated match scores:

MercorScores stores the tournament-based ranking scores:

PairwiseComparisons stores individual A/B comparisons:

This implements a Bradley-Terry tournament ranking model — candidates are repeatedly compared in pairs, with each comparison updating relative ranking scores.
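
To make the model concrete, a sketch of fitting Bradley-Terry strengths from pairwise outcomes with the standard minorization-maximization iteration. The comparison data is invented, and this is not Mercor's implementation; it assumes every candidate has at least one win, since otherwise the fitted strength collapses to zero.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict of strengths that sum to 1; P(i beats j) = p[i] / (p[i] + p[j]).
    """
    wins = defaultdict(int)
    games = defaultdict(int)  # number of comparisons per unordered pair
    players = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    p = {x: 1.0 / len(players) for x in players}
    for _ in range(iterations):
        new = {}
        for i in players:
            denom = sum(
                n / (p[i] + p[j])
                for pair, n in games.items() if i in pair
                for j in pair - {i}
            )
            new[i] = wins[i] / denom if denom else p[i]
        total = sum(new.values())
        p = {x: v / total for x, v in new.items()}
    return p


two = bradley_terry([("A", "B")] * 3 + [("B", "A")])
three = bradley_terry(
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 2 + [("C", "B")]
    + [("A", "C")] * 2 + [("C", "A")]
)
```

With two candidates and a 3-to-1 record, the fitted strengths are 0.75 and 0.25, matching the observed win rate.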

TalentViewSearchUsers and SharableTalentViewConfig enable companies to create saved talent searches and share curated candidate shortlists with colleagues. SharableTalentViewConfigUsers adds per-candidate evaluation data including likeCount, dislikeCount, and free-text feedback.


Part IV - Interviews and Assessments

AI Interview System

Mercor's interview process is AI-conducted and rubric-graded. The Forms_Audit table reveals the full interview configuration:

AssessmentRubrics defines the grading framework:

AssessmentRubricItems_Audit stores individual rubric criteria:

FormSubmissions records every interview submission:

AssessmentEvalState tracks the grading pipeline:

InterviewEvals stores scored results:

InterviewIssues records reported problems during interviews:

InterviewScores provides the final aggregate score per interview.


Part V - Work Trials and Onboarding

Work Trial Contracts

WorkTrial_Audit captures the structured trial engagement contract:

Field Description
payableAmount Amount payable to contractor (cents)
billableAmount Amount charged to company (cents)
ciiaaDirect Confidentiality agreement (direct)
ciiaaPassthrough Confidentiality agreement (passthrough)
tow Terms of work
offerLetter S3 key or base64 of signed offer letter
signature Digital signature string
startDate / endDate Trial period
projectId Linked project
billingAccountId Billing target

The presence of offerLetter and signature fields indicates that signed legal documents are stored in the production database itself or are directly referenced from it by S3 key.

WorkTrialConfig defines reusable work trial templates per company:

Onboarding Pipeline

OnboardingState defines the onboarding funnel steps:

Field Example
shortName interview_completed
name Interview Completed
threshold 1
order 0

OnboardingDocument stores the per-project onboarding materials (links, instructions, or document content) shown to newly hired contractors.

TierProgress tracks contractor progress through Mercor's internal tier/certification system — mapping contractors to planId, tierId, status, and completedAt.

PlanAssignments assigns contractors to specific plans with defined startDate, endDate, userHours allocation, and tasksCompleted tracking.


Part VI - Projects and AI Task Management

Project Structure

Projects_Audit reveals the full project configuration:

Field Description
companyId Client company
name Internal project name
screenshotEnabled Whether Insightful monitoring is active
userGroupEmail Google Group for project members
projectType Project category
annotationPlatform External annotation platform for the project (e.g., feather.openai.com)
annotationPlatformIDs External platform project identifiers
ssotLink Single source of truth document URL
taskMetricsDatastore Where task data is stored
status active / archived
notes Internal notes
offerExtendedText Custom text in offer letters for this project

ProjectIAM / ProjectIAM_Audit defines role-based access: each record maps a userId to a roleId within a projectId, with status and assignedBy for audit purposes.

ProjectIntegrations is particularly revealing — it links each project to:

This table effectively maps every production project to its Slack workspace and Okta group, providing a complete picture of Mercor's organizational structure.

AI Task System

TaskDefinitions / TaskDefinitions_Audit define the structure of AI training tasks:

Field Description
rubric JSON grading rubric for this task type
autograder Autograding configuration (model, prompts)
task_schema JSON Schema defining the task response format
metadata Additional task configuration

TaskAudits records individual task submissions for review:

Field Description
taskDefinitionId Which task definition was used
recordId The submitted task record
s3KeyPrefix S3 location of submission artifacts
authorId Contractor who submitted
auditorId Reviewer assigned
status pending / approved / rejected
outcome Final grading outcome
autoOutcome Automated grading result
dispute Dispute information if challenged
disputedBy Who filed the dispute

TaskAssignments maps tasks to specific jobs and users, with appliedBy tracking who made the assignment.

DeliverableBatches groups deliverables for invoicing:

ProjectCustomColumns adds arbitrary metadata fields to projects, with sqlQuery indicating some columns are dynamically computed from database queries. ProjectCustomColumnValueHistory tracks changes to these values over time.

ProjectArchetypes stores character/role descriptions for specific project types — suggesting Mercor operates AI roleplay or persona-based annotation tasks (archetypeText, elements).

ProductivityProjectRules defines per-project productivity monitoring rules (rules JSON, is_active, versioned).


Part VII - Time Tracking and Productivity Surveillance

The Insightful Integration

This is the most invasive component of the exposed data. Mercor uses Insightful (formerly Workpuls) — a workforce monitoring agent installed on contractors' computers — to capture screenshots and activity data.

InsightfulScreenshots — Every screenshot record contains:

Field Example (obfuscated)
storageUrl https://mercor-insightful-screenshots-production.s3.amazonaws.com/screenshots/[id]/[timestamp]_[uuid].png
storageKey screenshots/wmcw2pdyvenmluy/1767129970810_3b62edd1-...
screenshotTimestamp 1767129970810 (Unix ms)
ip 71.194.*.*
gateways ["1C:93:7C:64:**:**"] (MAC address)
os win32
osVersion 10.0.19045
agentVersion 7.9.3
computer desktop-ue2kgro
hwid 8f9f16f0-1fb7-47e4-a2a1-209838aa5c5e
appName Google Chrome
appFileName chrome.exe
appFilePath C:\Program Files\Google\Chrome\Application\chrome.exe
windowTitle Alabaster Studio - Google Chrome
browserUrl alabaster-studio.com/project/abacus/conversation/[uuid]
browserSite alabaster-studio.com
isBlurred 0
externalProductivityScore 1

Every screenshot includes the contractor's IP address, MAC address (gateway), hardware fingerprint, operating system, the exact application open, the window title, and the URL being visited — all timestamped to the millisecond.
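
The screenshotTimestamp value is a Unix epoch in milliseconds. A short sketch of converting the sample value to a readable UTC time; the function name is our own.

```python
from datetime import datetime, timedelta, timezone


def from_unix_ms(ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


captured_at = from_unix_ms(1767129970810)
print(captured_at.isoformat())  # 2025-12-30T21:26:10.810000+00:00
```

Splitting seconds and milliseconds with divmod avoids floating-point rounding. The same number appears as the prefix of the storageKey file name in the sample row.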

The storageUrl field contains direct S3 URLs to screenshot image files. The S3 bucket mercor-insightful-screenshots-production is referenced explicitly.

The hwid (hardware ID) field provides a persistent device fingerprint that can re-identify a contractor even if they change their email or create a new account.

Timelog

Timelog / Timelog_Audit records every work session:

Field Description
externalId Insightful shift/session ID
externalProjectId Insightful project ID
employeeId Insightful employee identifier
duration Session duration (ms)
timeStart / timeEnd Session timestamps
timezone Contractor's timezone
taskId / taskName Task being worked on
lineItemUid Linked payment line item
adjustmentReason If hours were manually adjusted
userId Mercor user ID
isCompleted Whether session was completed normally
linkFailReason If Insightful–Mercor link failed

Deductions

Deductions records time deducted from pay:

Field Description
durationToSubtractMs Milliseconds deducted
appName Application that triggered the deduction
reasonForDeduction Why the time was removed
payoutCycleID Which pay cycle was affected
approvedBy / approvedAt Approval chain
appliedBy / appliedAt Application record

This reveals that Mercor can and does subtract pay from contractors based on monitored activity, with an approval workflow for doing so.
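
A sketch of the arithmetic such a deduction implies, using integer milliseconds and integer cents throughout. The hourly rate, the rounding rule, and the function name are assumptions; the dump shows the columns but not the payroll calculation itself.

```python
MS_PER_HOUR = 3_600_000


def net_payable_cents(duration_ms: int, deductions_ms: list, hourly_rate_cents: int) -> int:
    """Pay for a session after subtracting deducted time, rounded to the nearest cent.

    Deductions are capped at the session length so pay never goes negative.
    """
    deducted = min(sum(deductions_ms), duration_ms)
    paid_ms = duration_ms - deducted
    # Integer division with a half-hour-of-cents offset rounds half up.
    return (paid_ms * hourly_rate_cents + MS_PER_HOUR // 2) // MS_PER_HOUR


# Hypothetical: 2-hour session, 15 minutes deducted, 25.00 per hour
payable = net_payable_cents(7_200_000, [900_000], 2500)
```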


Part VIII - Payments and Financial Infrastructure

Contractor Payment Methods

UserPaymentMethods / UserPaymentMethods_Audit stores linked payment accounts:

Field Example (obfuscated)
provider stripe
providerMethodId acct_1R0V**** (Stripe Express account)
methodType express_account
status onboarded
countryCode USA

US contractors use Stripe Express accounts. International contractors use Wise (evidenced by WiseDisbursements). The metadata field includes context like "context": "backfill" — indicating historical payment method imports.

MercorUserFinancials stores additional financial account details:

Payment Line Items

PaymentLineItems is the core payment ledger:

Field Description
cycleStartTs / cycleEndTs Pay period boundaries
totalPayableAmount Amount owed to contractor (cents)
totalBillableAmount Amount charged to company (cents)
status pending / paid / failed
jobUid Linked job contract
timelogUid Linked timelog entry
bonusUid Linked bonus if applicable
referralUid Linked referral payment
dispatchFailureReason Why a payment failed
moneyOutId Linked outbound transfer

PayoutCycles defines payment periods:

PayoutConfigs stores payout rules:

MoneyOut_Audit records every outbound payment:

WiseDisbursements records international transfers:

Company Billing

BillingAccounts manages company-side billing:

BillingConfigs defines billing rules:

BillingRateCards defines per-contract rate structures:

InvoiceLineItems records invoice line entries:

RevenueAdjustments records revenue corrections:

Referral System

Referrals / Referrals_Audit tracks the contractor referral program:

Field Description
referredUserId / referringUserId The parties
totalEarned / totalEarningsPotential Referral payment amounts
state Current state
paidAt When the referral bonus was paid
disputeStatus If disputed
isGuaranteedReferral Whether guaranteed payment applies
referral_cap Maximum referral earnings
isPaymentBlocked Payment hold

ReferralEligibility manages the conditions under which referral payments vest — including onboardingStateId requirements and criteriaId checks.

GuaranteedReferralQuota manages quota-based guaranteed referral programs:


Part IX - Communications and Outreach

Internal Messaging

Comms is the platform messaging table:

Field Description
commId Message identifier
groupId Conversation thread
senderId / receiverId Parties
content Message body
type Message type (system, human, etc.)
triggerRef What triggered this message
listingReferenceUID Associated listing

CommsSent records delivery tracking — when messages were sent, to whom, via what channel.

EmailTemplates stores company-specific email templates:

External Outreach

LinkedinWarmIntros manages LinkedIn outreach campaigns:

Field Example (obfuscated)
linkedinUrl https://www.linkedin.com/in/[username]
email s**@homeinheritance.com
referringUserId Internal user who made the intro
commEvent WARM_INTRO/OUTREACH
status sent

OffPlatformCampaigns and OffPlatformCampaignSteps manage multi-step email/LinkedIn outreach sequences:

AircallComms records phone call logs from Mercor's Aircall integration — the VoIP platform used for recruiter outbound calls, with call metadata and outcomes.

FirstTimeInvites tracks first-contact outreach to candidates:

Notification Infrastructure

AutomationTemplates defines automated workflow triggers:

ProjectAutomations links automation templates to specific projects.


Reverse Engineering - Architecture and Infrastructure

The database schema, table names, column conventions, and embedded metadata allow us to reverse-engineer Mercor's complete technical architecture — from microservice names to third-party integrations — purely from the contents of this dump.


Part X - Infrastructure and DevOps

Deployment Pipeline

IacDeploymentRuns is one of the most operationally sensitive tables:

Field Example
runType plan / apply
environment staging / production
status success / failed
commitSha 784cfd495ddfa3b67187433cb7cb66f2d27ad458
branch dacq/backend-v2
actor k*********77 (GitHub username)
githubRunId 23520976410
githubRunUrl https://github.com/Mercor-io/mercor-monorepo/actions/runs/23520976410
prNumber 26645
stacksAffected ["iac/aws/envs/staging"]
resourcesAdded 25
resourcesChanged 2
resourcesDestroyed 6
summary Full Terraform plan output (including deprecation warnings)
durationSeconds 134

This table exposes:

Named Terraform service stacks include: talent-success-coil, referrals-coil, iac/aws/envs/staging.

ProductionDeployment records ECS production releases:

PreprodDeployment records pre-production (staging) releases:

ProductionVersion maintains a single-row current version pointer:

RollbackExecution records emergency rollback events:

Database Schema Management

DATABASECHANGELOG and DATABASECHANGELOGLOCK are Liquibase tables that record every schema migration:

These tables reveal the full history of schema changes, including the names of engineers who authored migrations, the migration scripts' filenames (revealing internal project structure), and the exact timestamp each change was applied to production.

Agent Sandboxes

AgentSandboxes records AI coding agent sessions:

Field Description
agentType Type of AI agent
status active / stopped / expired
backendType Compute backend
host Sandbox hostname
stopReason Why session ended
transcriptRawUrl S3 URL of raw conversation transcript
transcriptConsolidatedUrl S3 URL of consolidated transcript
acpSessionId Agent control protocol session ID
sandboxToken Authentication token for sandbox
claimedAt / expiresAt Session lifecycle timestamps

The sandboxToken field suggests that expired sandbox tokens are persisted in the database — a potential credential exposure if these tokens have long validity windows.
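
For contrast, a common mitigation is to persist only a digest of each token so that a database dump does not yield usable credentials. The sketch below is a generic pattern, not a description of Mercor's system; the function names are our own.

```python
import hashlib
import hmac
import secrets


def issue_token():
    """Create a random token; return (token for the client, digest to store)."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, digest


def verify_token(presented: str, stored_digest: str) -> bool:
    """Hash the presented token and compare it to the stored digest in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)


token, stored = issue_token()
```

An unsalted SHA-256 is adequate here only because the token is long and random; it would not be appropriate for user-chosen passwords.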


Part XI - Analytics and ML Layer

School and Firm Rankings

DbtFirmSchoolRank contains Mercor's proprietary employer prestige scores:

Field Example
firmId 000013c1653de847e38d755ca1c310a5
firmName 75th ranger regiment, u.s. army
academicField overall
nProfiles 2
avgSchoolRank 90.00
firmSchoolRank 81723
firmSchoolRankPercentile 0.528839

This table represents a proprietary ranking of ~154,000 firms by the average educational prestige of their employees — effectively a derived signal used to score resumes. It is computed from the full contractor profile database using an empirical Bayesian model (ebPriorStrength, ebAvgSchoolRank).
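
Empirical Bayes shrinkage of this kind typically pulls a firm's raw average toward a global prior, with less pull as the number of profiles grows. A sketch under that assumption: ebPriorStrength and ebAvgSchoolRank are column names from the dump, but the prior values below (50.0 and 10.0) are invented, and the exact formula Mercor uses is not visible in the sample.

```python
def shrink(avg: float, n: int, prior_mean: float, prior_strength: float) -> float:
    """Weighted blend of the observed average and the prior mean.

    prior_strength acts like a count of pseudo-observations at prior_mean.
    """
    return (n * avg + prior_strength * prior_mean) / (n + prior_strength)


# Sample row: nProfiles = 2, avgSchoolRank = 90.00; prior values are hypothetical
small_firm = shrink(avg=90.0, n=2, prior_mean=50.0, prior_strength=10.0)
large_firm = shrink(avg=90.0, n=1000, prior_mean=50.0, prior_strength=10.0)
```

With only two profiles the estimate stays close to the prior, while a firm with a thousand profiles keeps nearly all of its raw average.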

DbtSchoolRankings ranks individual schools within academic fields:

Resume Evaluation

UserResumeEvaluation stores ML-generated resume scores:

Field Description
workExperienceScore Quality of work experience
yearsOfWorkExperience Parsed years of experience
graduationYear Estimated graduation year
mScore Composite score
inferredRole Predicted job function
educationScore Academic credential score
awardScore Competitive award weighting
rateAcademicCompetitions Participation in academic competitions
rateCompetitiveProgramming Competitive programming score
rateHackathonPerformance Hackathon achievement score
technicalSkills JSON list of detected skills
highestDegree Parsed degree level
searchFlag, imageFlag, transcriptFlag Data quality flags

Behavioral Analytics

PosthogAnalytics links PostHog behavioral sessions to user identity:

This ties click-level PostHog session data to a named user record, a significant privacy concern wherever those sessions would otherwise be pseudonymous.

SearchAnalytics records search quality metrics:

ForecastMetrics stores ML forecast outputs:

These forecasts are used for capacity planning, fill rate forecasting, and contractor supply predictions.

ML Experiments

MLExperimentsJobPerformanceReviews reveals the experimental ML pipeline:

Column                    Description
Date of review            Review date
Account                   Client company
Project                   Project name
Reviewer                  Reviewer name (Mercor staff)
Work type                 Category of work
Review type               Type of performance review
Name / Email              Contractor identity
Quality of Work           Score
Engagement                Score
Offboarding Reason        Why contractor was removed
Justification for rating  Free-text explanation

This table contains raw performance review data used to train or evaluate ML models for automated contractor performance assessment — with staff names, contractor names, and qualitative judgments all stored in plaintext.


Part XII - Reference Data Layer

Skills and Certifications

Skills is the platform's skills taxonomy:

CertificationPolicies_Audit defines the rules for earning certifications:

Certifications_Audit records individual earned certifications:

SkillCertifications_Audit and SkillCertificationsEvidence_Audit track per-skill certification with scores and source evidence.

ContractorEndorsements stores peer endorsements:

Company Data

Company stores client company records:

IAM / IAM_Audit manages company-level role assignments:

A sample IAM record shows a user with roleId: ghost being REMOVED from a company, which suggests that Mercor's internal staff operated within client company contexts under a ghost role identity.

URL Management

ShortenedUrls manages the platform's link shortening system:

UrlClicks records every click on shortened URLs:

Even with ipHash (rather than raw IP), the combination of userId, country, and timestamp enables click attribution across the contractor population.
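
How ipHash is derived is not visible in the sample. The sketch below assumes a keyed HMAC-SHA256, a common choice, and illustrates why any deterministic hash still lets clicks from the same address be linked together. The key and the addresses (from the reserved documentation range) are placeholders.

```python
import hashlib
import hmac


def ip_hash(ip: str, key: bytes) -> str:
    """Deterministic keyed hash of an IP address (hex-encoded HMAC-SHA256)."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-key"  # placeholder; a real key would be secret and rotated
first_click = ip_hash("203.0.113.7", key)
second_click = ip_hash("203.0.113.7", key)
other_click = ip_hash("203.0.113.8", key)
```

Two clicks from the same address produce the same hash, so the value works as a stable pseudonym even though the raw IP is never stored.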

Catfish Audit Log

CatfishAuditLog is a security/compliance tool:

Field Description
slackUserId / slackUserName Mercor staff member
targetEmail Person being looked up
platform Where the lookup happened
intent Declared reason for the lookup
status Success/failure

This table records every time an internal Mercor employee looks up a user's information through an internal tool called "Catfish" — indicating awareness that internal user lookup is an auditable, privacy-sensitive operation. Ironically, this audit log itself now sits in the exposed dataset.


Exposed Surface Area Summary

Domain                         Tables  Sensitivity  Key Exposure
User & Identity                ~10     Critical     PII (name, email, phone, location) for all contractors
Identity Verification & Fraud  ~12     Critical     Government ID outcomes, facial comparison tokens, fraud verdicts
Hiring Pipeline                ~10     High         Application status, rejection reasons, recruiter notes
Interviews & Assessments       ~15     High         Interview responses, scores, cheating flags, rubrics
Work Trials & Onboarding       ~6      High         Signed legal documents, offer letters, digital signatures
Projects & AI Tasks            ~15     Medium-High  Client company projects, task definitions, AI training data
Time Tracking                  ~4      Critical     Periodic desktop screenshots, browser URLs, MAC addresses, hardware fingerprints
Payments & Finance             ~20     Critical     Stripe account IDs, bank details, exact payment amounts, payout records
Communications                 ~10     Medium       Message content, outreach campaigns, phone call logs
Infrastructure & DevOps        ~10     High         Commit SHAs, GitHub URLs, ECS ARNs, Terraform configs, sandbox tokens
Analytics & ML                 ~10     Medium       Resume scores, school rankings, PostHog identity links
Reference Data                 ~15     Medium       Skills taxonomy, certifications, endorsements, company configurations

Technical Architecture Reverse-Engineered

The following architecture is entirely reconstructed from database table names, column values, JSON blobs, and embedded metadata. No source code or documentation was available — everything below was inferred from the data alone.

Backend Services

Based on the database content, Mercor's backend comprises at least 13 microservices:

Service Inferred Function
mercor_api Primary API backend
mercor_api_nginx API gateway / reverse proxy
mercor_go Go-language service (likely performance-critical paths)
coil Contractor-facing service (multiple instances by function)
site_fe Public website frontend
team_fe Company/team portal frontend
work_fe Work/task frontend
celery Async task queue
workflow Workflow orchestration
db_trigger_consumer Database event consumer
steve Internal tool/admin service
woz Fraud/ML pipeline service
payments_temporal_worker Temporal.io worker for payments

Frontend Portals

Data Infrastructure

Third-Party Integrations

Provider                  Purpose
Persona                   Identity verification (KYC)
Stripe                    US contractor payments (Express accounts)
Wise                      International contractor payments
Insightful                Workforce monitoring / screenshot capture
Okta                      SSO for company and internal access
Aircall                   Recruiter phone calls
PostHog                   Product analytics
Vertex AI / Gemini        Fraud LLM reasoning
OpenAI (GPT-4.1 / GPT-5)  AI interview conductor and task autograder
Checkr / Certn            Background checks
HaveIBeenPwned            Email breach checking
Customer.io               Transactional email
GitHub Actions            CI/CD pipeline
Terraform / Terragrunt    Infrastructure as code
Temporal.io               Payments workflow orchestration
Liquibase                 Database schema versioning

The evidence documented throughout this report supports multiple independent legal claims by distinct plaintiff classes. This section consolidates the factual basis for each claim, cross-referencing the specific database tables, column names, and sample values that constitute the evidentiary foundation.

I. Client Company Claims - Loss of Proprietary AI Training Data and Trade Secrets

This is the most consequential category of legal exposure. Mercor's client companies — Apple, Amazon, OpenAI, Anthropic, Meta, Google, and others — entrusted Mercor with their most valuable competitive assets: the data, methodologies, and evaluation frameworks that define how their AI models are built. All of it is now in criminal hands.

A. Trade Secret Misappropriation

Under the federal Defend Trade Secrets Act (DTSA) and state Uniform Trade Secrets Acts, a trade secret is information that derives economic value from not being generally known and is subject to reasonable efforts to maintain its secrecy. The breach exposes client trade secrets across three categories:

1. AI Training Data as Trade Secrets. The SFT data, RLHF preference rankings, and Chain-of-Thought traces produced by Mercor's contractors for each client constitute trade secrets. Each dataset represents millions of dollars of investment and years of iterative refinement. The TASKS, TASK_VERSIONS, and PHASE_1_TASKS tables across 84 Airtable workspaces contain the actual work product — prompts, model responses, and human evaluations — that each client paid to produce. Their value derives entirely from secrecy: once a competitor has access to another lab's RLHF preference data, they can train equivalent alignment without the cost.

2. Evaluation Methodology as Trade Secrets. How an AI lab evaluates its models — what rubrics it uses, what scoring thresholds it applies, how it structures domain-specific benchmarks — is core intellectual property. The CRITERIA, RUBRIC_VERSIONS, QA_SPECS, and LLM_CALL_CONFIGURATION tables across 60+ workspaces expose this methodology in full. Amazon's Chain-of-Thought evaluation framework, Apple's endpoint testing rubrics, and the cross-model preference evaluation criteria are all now available to any buyer. This is not just data — it is the recipe for how each lab measures AI progress.

3. Pre-Release Model Capabilities as Trade Secrets. The APPLE_ENDPOINT_SANDBOX workspace contains actual outputs from Apple's unreleased Foundation Models (afm-text-083, afm-model-086). These responses reveal the models' capabilities, safety alignment, and failure modes before public launch. Under trade secret law, the unauthorized disclosure of pre-release product capabilities is textbook misappropriation.

Key legal point: Trade secret protection requires "reasonable efforts to maintain secrecy." Mercor's storage of this data — in plaintext, behind a flat network with no segmentation, accessible via a single VPN hop — likely fails this standard. Clients may argue that they maintained secrecy on their end but that Mercor's negligent security destroyed the trade secret status of the data. This creates a damages claim for the full economic value of the lost trade secrets.

B. Breach of Confidentiality and NDA Violations

The database confirms confidentiality agreements governed the relationship. The Jobs table contains ciiaa_direct, ciiaaPassthrough, confidentiality, and tow (terms of work) fields. The WorkTrial_Audit table contains signed CIIAs and offer letters. The exposure of:

constitutes a breach of these confidentiality obligations. Each client has a separate breach of contract claim with damages measured by the economic harm caused by the disclosure.

C. Loss of Competitive Advantage

The breach doesn't just expose data — it destroys competitive moats. If a Chinese AI lab purchases the stolen data, they acquire:

Each client's AI training pipeline is now potentially replicable by any competitor with access to the stolen Airtable workspaces. The damages extend beyond the cost of producing the data — they include the competitive harm of having that data available to rivals.

D. Secondary Breach via Desktop Screenshots

The InsightfulScreenshots table creates a mechanism for visual intelligence extraction from client systems. Screenshots captured during monitored work sessions show whatever was on the contractor's screen — client internal dashboards, Slack conversations, code repositories, proprietary tools, unreleased product interfaces. Mercor stored these screenshots on S3 with metadata linking each image to the specific projectId. An attacker can systematically extract visual intelligence about every client's internal systems by filtering screenshots by project. This constitutes a secondary breach of each client's confidential systems, for which Mercor bears direct liability.

E. APEX Benchmark Contamination

Mercor's proprietary APEX benchmark suite — covering 15+ domains from legal to medicine to mechanical engineering — is now compromised. All tasks, criteria, scoring rubrics, and evaluation data are exposed. Any client that relied on APEX benchmark results for vendor selection, model comparison, or procurement decisions now faces the risk that those results are unreliable. Models trained on the leaked APEX data will appear to perform well without genuinely possessing the evaluated capabilities. Clients may claim damages for decisions made in reliance on benchmarks that are now contaminated.

II. Contractor Class Claims

A. Financial Data Exposure and Identity Theft Risk

The MercorUserFinancials table stores the complete Stripe Connect API response as plaintext JSON — including bank name, routing number, last four digits, account holder name, email, and country. This is sufficient for bank account fraud. Every contractor whose financial data is in this table faces ongoing risk of unauthorized transactions, account takeover, and identity theft. The UserPaymentMethods table adds Stripe Express account IDs and Wise transfer identifiers. The exposure of this data — unencrypted, untokenized, in a database accessible via a single VPN hop — constitutes negligence per se under multiple state data breach statutes.
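
For contrast with storing the full provider response, the sketch below shows one common alternative: keeping only an allowlisted subset of fields before anything is written to the database. It is a minimal illustration, not Mercor's code and not Stripe's actual response schema. The key names (bank_name, routing_number, last4, account_holder_name, email, country), the SAFE_FIELDS allowlist, and the redact_for_storage helper are all assumptions chosen to mirror the categories listed above.

```python
import json

# Hypothetical allowlist: only fields needed for display are persisted.
SAFE_FIELDS = {"country", "last4"}


def redact_for_storage(provider_response: dict) -> str:
    """Serialize only allowlisted keys; every other key is dropped."""
    kept = {k: v for k, v in provider_response.items() if k in SAFE_FIELDS}
    return json.dumps(kept, sort_keys=True)


# Made-up payload for illustration; the keys are assumed, not a real schema.
sample_response = {
    "bank_name": "EXAMPLE BANK",
    "routing_number": "000000000",
    "last4": "1234",
    "account_holder_name": "Jane Doe",
    "email": "jane@example.com",
    "country": "US",
}

stored = redact_for_storage(sample_response)
print(stored)
```

An allowlist is used rather than a blocklist so that any field a provider adds later is dropped by default instead of being stored silently.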

B. Surveillance Overreach and Privacy Violations

The Insightful monitoring system captured far more than work activity:

Contractors used personal computers for Mercor work (the data shows personal Chrome installations, personal hostnames like desktop-ue2kgro). The monitoring system captured personal activity on personal devices — personal emails, banking sessions, medical information, or other private content visible in background windows. All of this is now in criminal hands. Under ECPA and state wiretap laws, the capture of third-party communications visible in screenshots (Slack messages, emails, video calls) may constitute unlawful interception.

C. Wrongful Termination via Automated Fraud Decisions

The database reveals that automated fraud decisions directly determined whether contractors could earn a living:

Under FCRA, if Mercor used these automated fraud scores or background check results (BackgroundCheck.status) to deny, suspend, or terminate contractor engagements without providing required adverse action notices, each instance is a separate violation. Under GDPR Article 22, EU/UK contractors have the right not to be subject to decisions based solely on automated processing.

D. Wage-Related Claims

The Deductions table records pay subtractions based on monitored activity — exact milliseconds deducted, which application triggered the deduction, and who approved it. If deductions were applied using data from the now-compromised monitoring system, or if the breach reveals inconsistent application, contractors have wage theft claims in addition to privacy claims.

III. Statutory Violations

A. CCPA/CPRA — Private right of action for data breaches resulting from failure to maintain reasonable security (Cal. Civ. Code § 1798.150). Plaintext bank routing numbers, unencrypted PII, and excessive data collection constitute failure to implement reasonable security. Statutory damages: $100–$750 per consumer per incident.

B. GDPR — EU/UK contractors confirmed in the data (sample: United Kingdom, Harrow). Violations include data minimization failure (Article 5(1)(c)), integrity/confidentiality failure (Article 5(1)(f)), automated decision-making without safeguards (Article 22), and breach notification delays (Article 33). Fines up to €20 million or 4% of annual global turnover, whichever is higher.

C. Illinois BIPA — Persona's liveness detection requires a scan of face geometry, explicitly listed as a biometric identifier (740 ILCS 14/10). The IDVerificationChecks table confirms facial geometry scans were captured (livenessStatus), facial comparison performed (interview-face-comparison), and thumbnail images stored (thumbnail_key). Statutory damages: $1,000 per negligent violation or $5,000 per intentional or reckless violation, with no requirement to show actual harm. (Note: MAC addresses and hardware fingerprints are not biometric identifiers under BIPA.)

D. FCRA — Background check results and automated fraud scores used in employment decisions without required adverse action notices. Per-violation damages.

E. ECPA / State Wiretap Laws — Desktop screenshots capturing third-party communications visible on screen. Per-interception damages.

F. PIPEDA — Canadian contractors confirmed (sample: country: CA, BANK OF M*******). Breach notification to Privacy Commissioner and affected individuals required.
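
The two monetary figures quoted above for CCPA and GDPR reduce to simple arithmetic, sketched below. The per-consumer range ($100 to $750) and the GDPR cap (the greater of €20 million or 4% of turnover) come from the statutes cited. The consumer count and turnover values are hypothetical placeholders: this report does not establish how many affected contractors are California residents, nor Mercor's annual turnover.

```python
def ccpa_statutory_range(consumers: int) -> tuple[int, int]:
    """Statutory damages range at $100 to $750 per consumer per incident."""
    return consumers * 100, consumers * 750


def gdpr_fine_cap(annual_turnover_eur: float) -> float:
    """Upper bound: the greater of EUR 20 million or 4% of annual global turnover."""
    return max(20_000_000.0, annual_turnover_eur * 4 / 100)


# Hypothetical inputs, for illustration only.
low, high = ccpa_statutory_range(1_000)
cap_small = gdpr_fine_cap(100_000_000)    # 4% is below the EUR 20M figure
cap_large = gdpr_fine_cap(1_000_000_000)  # 4% exceeds the EUR 20M figure

print(low, high, cap_small, cap_large)
```

The CCPA figure scales linearly with the number of qualifying consumers, while the GDPR cap is flat at €20 million until 4% of turnover exceeds it.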

IV. Negligence - Security Failures Evidenced in the Data

The database structure itself constitutes evidence of systemic negligence:

V. Third-Party Claims

Individuals who never created Mercor accounts have their data exposed:

These individuals never consented to data collection and likely never received a privacy notice. Under GDPR Article 14, Mercor was required to notify them within one month. The breach exposes them to targeted social engineering using their real relationship data.

Claim | Plaintiff Class | Key Evidence
Trade secret misappropriation | Apple, Amazon, OpenAI, Anthropic, Meta, Google | Pre-release model outputs, evaluation methodologies, RLHF data, rubrics, CoT traces
Breach of confidentiality / NDA | All client companies | Signed CIIAs in database, client-named Airtable workspaces with proprietary data
Competitive harm | All client companies | Training data, evaluation frameworks, and benchmark data now available to rivals
APEX benchmark contamination | Companies relying on APEX results | Complete benchmark tasks, criteria, and scores exposed
Financial data negligence | 30,000+ contractors | Plaintext bank routing numbers, Stripe account details
Surveillance overreach | 30,000+ contractors | Desktop screenshots of personal devices, personal browsing, background windows
Automated adverse actions | Contractors denied/terminated | Fraud scores, LLM-generated reasoning, no disclosure or appeal
CCPA violations | 30,000+ contractors | Failure to maintain reasonable security
GDPR violations | EU/UK contractors | Data minimization, automated decisions, notification delays
BIPA violations | Contractors who completed Persona KYC | Facial geometry scans, liveness detection
Third-party privacy | References, LinkedIn contacts, vouchers | Data collected without consent, now in criminal hands

The client claims are likely the largest in dollar terms: the economic value of the lost trade secrets (training data, evaluation methodologies, pre-release model outputs) runs into the billions. The contractor claims are the broadest in scope, affecting every individual who ever used the platform. Together, the total legal exposure is conservatively at least in the hundreds of millions of dollars before punitive damages.


Conclusion - What Happens Now

The breach is not a past event. It is an ongoing situation with no clear resolution.

The Data Is Still in Circulation

Mercor allegedly paid the attackers to have the data removed from the Lapsus$ leak site — a fact confirmed to us directly by Lapsus$ themselves. The data was taken down briefly. It reappeared. The group is now actively selling the full dataset to private bidders while continuing to distribute samples. The two files analyzed in this report were obtained after the ransom was paid. This is the predictable outcome of paying ransom for digital assets — there is no mechanism to verify deletion, no way to revoke copies already distributed, and every economic incentive for the attackers to continue monetizing the data through private sales, selective leaks, and derivative attacks. Mercor's ransom payment bought nothing except proof that they considered the data worth paying to suppress.

The attackers now possess:

This is not a dataset that loses value over time. The PII is permanent. The bank routing numbers don't expire. The government ID verification records don't reset. The signed legal documents don't un-sign. And the AI training data — the RLHF annotations, preference rankings, and rubric evaluations produced for frontier AI labs — retains its full value to any competitor seeking to accelerate their own model development.

The Ongoing Threat

With this data, the attackers (or any subsequent buyer) can:

  1. Launch targeted phishing campaigns against every Mercor contractor, using their real name, employer, project assignment, and pay rate to craft highly convincing social engineering attacks
  2. Commit financial fraud using the bank names, routing numbers, and account holder names stored in MercorUserFinancials
  3. Blackmail contractors whose desktop screenshots may reveal confidential client information, personal browsing activity, or employment at companies their current employer doesn't know about
  4. Attack Mercor's clients using the Slack workspace URLs, Okta SSO configurations, and annotation platform campaign IDs as entry points for further social engineering or credential stuffing
  5. Sell the AI training data — the prompts, responses, evaluations, and preference rankings — to competitors or foreign actors, undermining billions of dollars of investment by OpenAI, Anthropic, Apple, Amazon, Meta, and Google DeepMind
  6. Exploit the source code to identify vulnerabilities in Mercor's (and potentially its clients') systems that have not yet been patched
  7. Impersonate Mercor staff using the internal employee names, Slack IDs, and GitHub usernames found throughout the database to conduct supply-chain attacks against Mercor's clients and partners

Each of these vectors becomes more dangerous the longer the data remains in circulation — and there is no indication it will stop circulating.

The Case for Radical Transparency

There is an uncomfortable truth that Mercor, its clients, and the affected contractors must confront: the data is out. It cannot be put back.

The current trajectory — where the breach is acknowledged in vague corporate language, specific questions are deflected, and affected individuals receive minimal information about what was exposed — serves no one except the attackers. It creates an information asymmetry where the adversary has complete knowledge of what was taken, while the victims operate in the dark.

Every contractor whose bank routing number is in MercorUserFinancials deserves to know — specifically — that their bank name, routing number, and account holder name were stored in plaintext JSON and are now in the hands of criminal actors. Every contractor whose desktop screenshots are in the mercor-insightful-screenshots-production S3 bucket deserves to know that their IP address, MAC address, browser history, and application usage during work sessions are exposed. Every client whose annotation platform URLs, Slack workspaces, and proprietary model outputs appear in the Airtable exports deserves to understand the exact scope of their secondary exposure.

The alternative to transparency is prolonged paranoia. If Mercor does not disclose the specific contents of the breach, every contractor must assume the worst about what was taken. Every client must assume their internal systems were visible on a contractor's screen. Every reference, every LinkedIn contact, every vouching party must assume their personal information was collected without their knowledge and is now compromised.

Perhaps the most constructive path forward — however counterintuitive — is full, detailed, public disclosure of exactly what the breach contained. Not the raw data itself, but a complete accounting: which tables, which fields, which categories of PII, which clients, which time periods. The world can adjust to a known breach. It cannot adjust to an unknown one. Sunlight remains the best disinfectant, and in the aftermath of a breach of this magnitude, the cost of silence far exceeds the cost of honesty.

The contractors who built the AI training data that powers the world's most valuable models deserve at least that much.

A Structural Critique - Youth, Velocity, and the Cost of Immaturity

Mercor's three founders — Brendan Foody, Adarsh Hiremath, and Surya Midha — were 21 years old when they raised their Series A. They became the world's youngest self-made paper billionaires at 22 when their Series C valued the company at $10 billion. The average age of the Mercor team was reported at 22 years old. They are Thiel Fellows — college dropouts celebrated for building fast. They stored bank routing numbers in plaintext, ran a flat network where a single VPN hop reached everything, and let 4 terabytes walk out the door without anyone noticing.

Perhaps Mercor is best understood as a phenomenon of hype and strong mimetic desire within the AI industry. Perhaps the AI labs got ahead of themselves too early. Perhaps researchers and vendor managers chose Mercor not because they evaluated the vendor thoroughly enough to handle critical workloads, but because OpenAI was already using it.

The pattern is worth examining. OpenAI was one of Mercor's earliest major customers. The relationship began when Mercor's 20-year-old CEO cold-emailed OpenAI's head of human data operations, Shaun VanWeelden, and landed a contract to recruit Math Olympiad winners for model training. VanWeelden later left OpenAI to become Mercor's managing director. Two sitting OpenAI board members — Adam D'Angelo (Quora CEO) and Larry Summers (former U.S. Treasury Secretary) — invested in Mercor's earlier funding rounds.

This is not without precedent. Much of the AI data infrastructure landscape has been shaped by proximity to OpenAI. Scale AI's Alexandr Wang was Sam Altman's roommate during the pandemic. Scale went through Y Combinator when Altman ran it. Altman and Wang later discussed an acquisition.

With Mercor, the signal was unmistakable. OpenAI used them. OpenAI's board members invested in them. OpenAI's head of data operations joined them. Once that signal propagated, perhaps the other labs followed not because of independent evaluation, but because OpenAI had validated the choice for them. The $10 billion valuation, the press coverage, and the youngest-billionaires narrative reinforced what was already a foregone conclusion.

The Girardian irony is that this breach — the scapegoating event — may produce the same mimetic cycle in reverse. The labs may collectively abandon Mercor, collectively discover the next shiny vendor, and collectively onboard without asking the hard questions about security and privacy. The sacrifice of the scapegoat restores order. The community moves on, having learned nothing structural — only that this particular vendor was the wrong one.

Having reverse-engineered Mercor's complete operational architecture from its database schema (the annotation pipeline, the evaluation frameworks, the contractor management system, the payment infrastructure), we find that the underlying business is well understood and replicable. For new entrepreneurs, the opportunity is straightforward: build the same platform, but treat security and privacy as foundational rather than an afterthought. The market for AI training data is not going away. The demand for a vendor that handles it responsibly has never been higher.



Appendix A - Complete Table Inventory

All 149+ tables organized by functional domain, with column lists and sample data where present.


Domain 1 - User and Identity

Table Key Columns Notes
MercorUsers_New userId, email, name, phone, profilePic, createdAt, lastLogin, location, isWhiteListed, source, firebaseUID, authType, isAnonymous, insightfulId, stripeAccountId, customerId, isDeleted, phoneVerificationStatus, phoneVerifiedAt, phoneOptIn Primary contractor user table. Sample: e****a1@gmail.com, T** O****, +44795718****, United Kingdom,Harrow
MercorUsers_New_backup userId, email, name, phone, profilePic, createdAt, lastLogin, location, isWhiteListed, source, firebaseUID, authType, isAnonymous, insightfulId, stripeAccountId, customerId, isDeleted Historical backup snapshot of user table
UserLocation userLocationId, userId, residenceCountry, residenceState, residenceCity, residenceZipCode, physicalCountry, physicalState, physicalCity, physicalZipCode, version, createdAt, updatedAt Tracks declared residence vs. physical location. Used in fraud detection. Sample: residenceCountry=USA, physicalCountry=USA
UserLocation_Audit All UserLocation columns + auditAction, auditTimestamp Audit trail for location changes
UserMetadata userMetadataId, userId, workAuthorizationStatus, birthday, physicalLocation, countryOfResidence, createdAt, updatedAt, maxHourCap, contractorMail, fraudStatus, oktaUserId, fraudStatusEnum, oktaAccountState, externalId, maxContracts, offPlatformEmail Extended user metadata including Okta SSO ID and fraud status
UserState id, userId, resumeUploaded, interviewsCompletedCount, jobApplicationsCount, totalMillisWorked, createdAt, updatedAt Lifecycle counters — tracks user progression through platform
UserAvailability_Audit availabilityId, version, userId, maxWeeklyHours, desiredWeeklyHours, expectedStartOffset, expectedStartOffsetUpdatedAt, earliestStartDateChoice, timezone, updatedAt, createdAt, auditAction, auditTimestamp Declared working hours and timezone preferences
UserReferences referenceId, email, name, company, relationship, userId Professional references provided by contractors
WorkAuthorization_Audit workAuthorizationId, userId, birthday, physicalCountry, workAuthorizationStatus, agreedToLocation, signature, attestedAt, source, version, createdAt, updatedAt, auditAction, auditTimestamp Work authorization attestations with digital signatures
UserPlatformStatus id, userId, status, action, source, sourceDetail, isLatest, createdAt Platform access status (active, suspended, banned)
LinkedinUsers id, name, url, email, company, position, lastUpdated LinkedIn profile cache used for warm intros and candidate sourcing
MembershipSnapshots scopeType, scopeId, userId, createdAt Point-in-time snapshots of group/project memberships
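
The UserLocation note above says the declared-residence versus physical-location comparison feeds fraud detection, and the FraudStates sample in Domain 3 shows a signal named location_mismatch with value 1.0. A minimal sketch of how such a comparison could yield a 0.0/1.0 signal follows. The function and its scoring are assumptions for illustration; the sample files do not show how the signal is actually computed.

```python
from dataclasses import dataclass


@dataclass
class UserLocation:
    # Column names taken from the UserLocation table; other columns omitted.
    residenceCountry: str
    physicalCountry: str


def location_mismatch(loc: UserLocation) -> float:
    """Return 1.0 when declared residence and physical country differ, else 0.0."""
    declared = loc.residenceCountry.strip().upper()
    observed = loc.physicalCountry.strip().upper()
    return 0.0 if declared == observed else 1.0


matching = location_mismatch(UserLocation("USA", "USA"))
differing = location_mismatch(UserLocation("USA", "GBR"))
print(matching, differing)
```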

Domain 2 - Identity Verification and Background Checks

Table Key Columns Notes
IDVerificationChecks verificationCheckId, userId, candidateId, jobId, listingId, provider, source, sessionId, sessionToken, onboardingUrl, sessionStatus, verificationStatus, governmentIdStatus, livenessStatus, addressStatus, attemptNumber, maxAttempts, providerResponse, fraudDecision, flagReasons, manualReviewStatus, createdAt, updatedAt, completedAt Persona KYC session records. providerResponse contains full JSON API response including facial thumbnail keys. provider=persona
BackgroundCheck contractorID, externalCandidateId, workLocation, package, invitationId, invitationCreatedAt, invitationCompletedAt, backgroundCheckId, reportId, status, createdAt, updatedAt, adverseMediaCheckStatus Criminal background check records (Checkr). Status: clear / consider
BackgroundCheck_New Richer version of BackgroundCheck with additional fields Updated background check schema
BackgroundCheckDetails Detailed per-check results Granular check outcomes
ScreeningPackage id, companyId, name, isActive, lastUpdatedAt, checkConfig, graceDays Per-company screening package configurations defining which checks are required

Domain 3 - Fraud Detection

Table Key Columns Notes
FraudStates userId, currentStage, currentDecision, currentConfidence, currentReasoning, currentKeySignals, currentTimestamp, previousStageDecision, createdAt, updatedAt Current fraud state per user. currentDecision: APPROVE / ESCALATE / REJECT. LLM-generated reasoning. Sample signal: location_mismatch: 1.0
FraudCheck id, user_id, stage, interviewId, jobId, triggered_on, process_status, retryCount, flag_reasons, automatedReasons, status, priority, idVerificationStatus, manual_review_status, manual_review_rational, manual_review_signs, isMostRecent, assigned_to, assigned_on, splReview Central fraud queue. Tracks automated and manual review states
FraudSignalAuditLog id, userId, userVersionId, stage, signalType, modelName, triggeredOn, status, modelScore, createdAt Per-signal audit trail. Every fraud signal evaluated is logged here
FraudEvents id, eventId, userId, eventType, stage, priorAlpha, priorBeta, priorProbability, priorStatus, posteriorAlpha, posteriorBeta, posteriorProbability, posteriorStatus, evidence, createdAt, createdBy, notes Bayesian belief update log. Each event updates prior→posterior fraud probability
ProductionFraudState id, userId, status, fraudModality, source, sourceDetail, lastEvaluatedStage, productionModelId, userVersionId, isLatest, createdAt, updatedAt Final production fraud verdict. fraudModality: identity / time / quality
AutoFraudChecks Automated rule-based fraud check records Scheduled fraud scans
OnProjectFraudWindows id, employeeId, contractorId, projectId, scanDate, startTime, endTime, fraudType, fragmentCount, flags, flagMetadata, windowMetadata, screenshotMetadata, createdAt, updatedAt, userVersionId On-project time fraud analysis windows. Analyzes screenshot patterns
QAReviewLog id, userId, reviewerId, bucketName, status, assignedOn, completedAt, isActive, lockKey, createdAt, updatedAt, comments, decision, userVersionId, stage, signalType, flags Human QA reviewer assignments and decisions for fraud cases
CheatingDetection annotationId, userId, interviewId, interviewConfigId, formResponseId, formId, isCheating, cheatingProbability, signs, notes, reportedBy, createdAt, updatedAt Interview cheating detection results
CheatingDetection_Audit All CheatingDetection columns + auditAction, auditTimestamp Audit trail for cheating detection
DuplicateGroups groupId, userIdList, mergedIntoGroupId, createdAt Groups of suspected duplicate/sock-puppet accounts
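
The FraudEvents columns (priorAlpha, priorBeta, posteriorAlpha, posteriorBeta and the matching probability fields) are consistent with a Beta-distributed belief about fraud that is updated as evidence arrives. The sketch below shows the standard Beta update under that reading. It is an assumption about what the columns mean: the sample files do not show how evidence is weighted, and the evidence_for and evidence_against parameters are hypothetical.

```python
def beta_update(prior_alpha: float, prior_beta: float,
                evidence_for: float = 0.0, evidence_against: float = 0.0) -> dict:
    """Update a Beta(alpha, beta) belief and report prior and posterior means."""
    posterior_alpha = prior_alpha + evidence_for       # weight supporting fraud
    posterior_beta = prior_beta + evidence_against     # weight against fraud
    return {
        "priorProbability": prior_alpha / (prior_alpha + prior_beta),
        "posteriorAlpha": posterior_alpha,
        "posteriorBeta": posterior_beta,
        "posteriorProbability": posterior_alpha / (posterior_alpha + posterior_beta),
    }


# Hypothetical event: a Beta(1, 9) prior plus one unit of evidence for fraud.
event = beta_update(1.0, 9.0, evidence_for=1.0)
print(event)
```

Under this reading, each FraudEvents row would record one such step, which is why both the prior and posterior parameters appear side by side in the table.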

Domain 4 - Hiring Pipeline

Table Key Columns Notes
Listings_New listingId, version, uid, companyId, title, description, commitment, referralAmount, createdAt, deletedAt, status, requiredInterviewConfigId, rateMin, rateMax, hoursPerWeek, location, formId, automatedCommsOn, payRateFrequency, isPrivate, autoRedirectToApply, evaluationCriteria, offersEquity, rejectionTemplateSubject, rejectionTemplateBody, campaignId, ownerIds, goalNumHires, goalDeadline, isExploreAlways, interviewSchedulingEnabled, interviewScheduleLink, disableApplications, isMostRecent, offerExtendedText, minHeadcount, maxHeadcount, referralBoost, timeToAutoReject, automaticRejectionsOn, computedExplorePageVisibility, workArrangement, eligibleLocation, ineligibleResidenceLocation, listingType Primary job listing table. Includes pay ranges, location eligibility, automation settings
Listings_New_Audit All Listings_New columns + auditAction, auditTimestamp Audit trail for listing changes
Candidates candidateId, userId, companyId, listingUid, createdAt, deletedAt, status, notesForCandidate, birthday, physicalLocation, workAuthorizationStatus, responseId, version, uid, source, countryOfResidence, isMostRecent, listingId, listingStepConfigId, linkedinUrl, actionItem, lastSignificantUpdatedAt, rejectionReason, updatedBy, starred, appliedAt, goalId, automaticRejectAt, addedAt, referralId, isEligible, numCommsSent, lastCommSentAt Per-application record. Tracks status, notes, scheduled auto-rejection, outreach counts
Candidates_Audit All Candidates columns + auditAction, auditTimestamp Audit trail for application changes
CandidateMatchScores candidateId, listingId, matchScore, contextualSummary ML-generated candidate-to-listing fit scores with LLM explanations
EvaluationCriteria evaluationCriteriaId, listingId, criteria, shortCriteria, type, hardFilter, position, updatedAt, evalCriterionCritique, evalCriterionCritiquePass, status Per-listing scoring rubric criteria
ListingNotes listingNoteId, listingId, authorUserId, assigneeUserId, notificationStatus, createdAt, noteBody Recruiter notes on listings. Contains candid operational commentary
SavedListings id, userId, listingId, listingUid, createdAt Candidates who bookmarked a listing
ListingPipelines Pipeline stage configurations per listing Hiring funnel stage definitions
TalentViewSearchUsers searchId, userId, score, addedAt, starredAt, deletedAt Users surfaced in talent search results
SharableTalentViewConfig viewId, name, description, userIds, userCount, maxCandidatesCount, createdAt, updatedAt, revokedAt, createdBy, expiryAt, viewCount, visibleSections, preferredTitle Shareable talent shortlist configurations
SharableTalentViewConfigUsers userId, viewId, workExperience, education, summary, createdAt, updatedAt, yearsOfExperience, interviews, forms, likeCount, dislikeCount, feedback Per-candidate data within shared talent views
TalentViewUserEvaluations criteriaId, userId, criteriaScore Per-criteria scores for talent view candidates

Domain 5 - Interviews and Assessments

Table Key Columns Notes
Forms_Audit formId, companyId, listingId, title, description, guide, evaluationCriteria, assessmentRubricId, items, isArchived, isAuthed, numQuestions, isUnified, allowFormRetakes, maxRetakeAttempts, allowCopyPaste, version, createdAt, updatedAt, createdBy, auditAction, auditTimestamp, prep, assessmentVersionId, feedbackConfig Interview/assessment form definitions. items contains full question list
FormSubmissions formResponseId, formId, companyId, userId, responseStatus, formVersion, startedAt, submittedAt, activeTimeSeconds, posthogSessionIds, createdAt, updatedAt, attempt, isLatestSubmission, assessmentVersionId, feedbackSentAt Every interview submission. Tracks time spent (activeTimeSeconds)
AssessmentRubrics assessmentRubricId, title, createdAt, instructions, sumScores, sumSquareScores, countScores, version, passThreshold Scoring rubric definitions with aggregate statistics
AssessmentRubrics_Audit All AssessmentRubrics columns + auditAction, auditTimestamp Rubric change history
AssessmentRubricItems_Audit assessmentRubricItemId, assessmentRubricId, criteria, shortName, points, position, format, relatedQuestionIds, version, auditAction, auditTimestamp, webSearch, smartScoring, type, config, createdAt, updatedAt Individual rubric criteria with AI scoring configuration
AssessmentEvalState id, submissionId, assessmentType, jobType, status, retryCount, createdAt, reason, triggerSource, triggeredByUserId, modalJobId, durationMs, operationId, assessmentId Grading pipeline execution state
AssessmentVersions Versioned assessment configurations Assessment version tracking
AssessmentAudits Assessment activity audit trail Audit log for assessment operations
GradedRubricItems Per-rubric-item graded scores Individual rubric item scores per submission
GradedRubricItems_Audit Audit trail for graded items Score change history
InterviewEvals interviewId, communicationScore, technicalScore, qaPairScores Aggregate interview scores by dimension
InterviewScores scoreId, userId, interviewId, interviewConfigId, points, createdAt Final interview score per user
InterviewIssues issueId, interviewId, issue, source, notes, startPosition, endPosition, reportedBy, createdAt, updatedAt Technical and integrity issues reported during interviews
PairwiseComparisons listingId, listingUid, interviewConfigId, winnerResumeId, loserResumeId, reasoning, winnerUserId, loserUserId Bradley-Terry tournament comparisons for candidate ranking
MercorScores candidateId, listingId, listingUid, resumeId, evaluationCriteria, interviewConfigId, mScoreRaw, mScoreNormalized, numComparisons, contextualSummary, userId, aggregateFeatureScore Final MercorScore per candidate per listing
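
The PairwiseComparisons note above names Bradley-Terry ranking, in which each candidate gets a strength value and the probability that i beats j is strength_i / (strength_i + strength_j). The sketch below fits those strengths from (winner, loser) pairs with the standard iterative update. It illustrates the general technique only; how Mercor derives mScoreRaw or mScoreNormalized from its comparisons is not shown in the sample files, and the candidate labels are made up.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs; strengths sum to 1."""
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {i: 1.0 for i in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum over opponents that item i has actually been compared with.
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and pair_counts.get(frozenset((i, j)), 0) > 0
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {i: s / total for i, s in updated.items()}
    return strength


# Hypothetical tournament between three candidates.
sample = ([("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3
          + [("C", "B")] + [("A", "C")] * 2)
scores = bradley_terry(sample)
print(sorted(scores, key=scores.get, reverse=True))
```

The fit needs every candidate to have at least one win and one loss along some chain of comparisons; a candidate with no wins is driven to a strength of zero.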

Domain 6 - Work Trials and Onboarding

Table Key Columns Notes
WorkTrial_Audit workTrialId, userId, companyId, listingStepConfigId, status, payableAmount, billableAmount, ciiaaDirect, ciiaaPassthrough, tow, offerLetter, startDate, endDate, payout, payment, paymentMethod, signature, projectId, billingAccountId, createdAt, updatedAt, version, auditAction, auditTimestamp, updatedBy Work trial contract records. Contains signed legal documents and pay amounts
WorkTrialConfig workTrialConfigId, title, payableAmount, billableAmount, ciiaaDirect, ciiaaPassthrough, tow, endDate, emailTemplateSubject, emailTemplateBody, emailTemplateSubjectExtension, emailTemplateBodyExtension, interviewIds, formIds, createdAt, updatedAt, deletedAt, companyId, isUnified, projectId Reusable work trial templates
OnboardingState id, shortName, name, threshold, createdAt, updatedAt, order Onboarding funnel steps. Sample: interview_completed threshold=1 order=0
OnboardingDocument onboardingDocumentId, onboardingDocument, createdAt, projectId Per-project onboarding materials
TierProgress id, createdAt, updatedAt, userId, tierId, planId, status, completedAt, paidAt Contractor tier/level progression tracking
PlanAssignments id, createdAt, updatedAt, userId, planId, assignedBy, startDate, endDate, userHours, tasksCompleted, status Assigns contractors to specific earning/task plans

Domain 7 - Projects and AI Task Management

Table Key Columns Notes
Projects_Audit projectId, name, createdAt, updatedAt, companyId, archivedAt, externalId, onboardingDocumentId, userId, screenshotEnabled, userGroupEmail, description, requireAvailabilityUpdates, skills, projectType, offerExtendedText, annotationPlatform, annotationPlatformIDs, ssotLink, status, notes, version, auditAction, auditTimestamp, taskMetricsDatastore Full project configuration audit trail
ProjectIAM id, projectId, userId, roleId, status, assignedBy, version, createdAt, updatedAt Role assignments within projects
ProjectIAM_Audit All ProjectIAM columns + auditAction, auditTimestamp Project IAM change history
ProjectCustomColumns id, projectId, name, dataType, position, createdBy, createdAt, updatedAt, deletedAt, sqlQuery, source Dynamic metadata columns per project. Some computed via SQL
ProjectCustomColumnValueHistory id, customColumnId, jobId, value, changedBy, createdAt History of custom column values
ProjectArchetypes archetypeId, projectId, archetypeText, createdAt, updatedAt, version, elements Character/persona definitions for annotation projects
ProjectAttributeValues Project attribute key-value pairs Flexible project attribute storage
ProjectViewConfig viewId, title, projectId, viewContext, createdByUserId, createdAt, updatedByUserId, updatedAt, deletedAt, roleId, viewType Saved view configurations for project management
ProjectIntegrations id, projectId, groupMail, autoProvision, createdAt, updatedAt, oktaGroupId, integrationsData, oktaOwnerGroupId, oktaEPMGroupId, latestGroupBatch, latestBatchMemberCount, projectShortId, workspaceNotificationChannel, ownerGwGroup, epmGwGroup, slackChannelId Project integrations with Okta groups and Slack channels
ProjectAutomations Project-specific automation configurations Automation bindings per project
ProjectFunctions id, name, description, createdAt, updatedAt Named functions available in project automation
TaskDefinitions taskDefId, projectId, rubric, autograder, version, createdAt, updatedAt, task_schema, metadata AI task type definitions with grading rubrics
TaskDefinitions_Audit All TaskDefinitions columns + auditAction, auditTimestamp Task definition change history
TaskAudits uid, taskDefinitionId, recordId, s3KeyPrefix, authorId, auditorId, status, outcome, autoOutcome, createdAt, updatedAt, dispute, disputedBy Individual task submission reviews with dispute tracking
TaskAssignments id, createdAt, updatedAt, jobId, taskId, userId, appliedBy Maps tasks to jobs and users
DeliverableBatches id, uid, name, projectId, invoiceLineItemId, status, taskCount, version, isLatest, metadata, createdAt, updatedAt, createdBy Grouped task deliverable batches for invoicing
Deliverables deliverableId, jobId, userId, projectId, entityType, entityId, status, createdAt, updatedAt Individual deliverable records
Deliverables_Audit All Deliverables columns + isMostRecent, auditAction, auditTimestamp Deliverable change history
ProductivityProjectRules id, project_id, description, rules, created_by, is_active, version, created_at Per-project productivity monitoring rule configurations
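
Many tables in this domain have an _Audit companion that repeats every column and adds auditAction and auditTimestamp, usually alongside a version column. As a rough sketch of why that matters, the snippet below walks such rows and lists which fields changed between consecutive versions. The row values, the auditAction strings, and the helper name are hypothetical; only the column names come from the schema.

```python
def field_changes(audit_rows, id_field,
                  ignore=("version", "auditAction", "auditTimestamp", "updatedAt")):
    """Return (entity_id, version, field, old, new) for each changed field."""
    by_id = {}
    for row in audit_rows:
        by_id.setdefault(row[id_field], []).append(row)

    changes = []
    for entity_id, rows in by_id.items():
        rows.sort(key=lambda r: r["version"])       # oldest version first
        for prev, curr in zip(rows, rows[1:]):      # compare neighbours
            for field, value in curr.items():
                if field == id_field or field in ignore:
                    continue
                if prev.get(field) != value:
                    changes.append(
                        (entity_id, curr["version"], field, prev.get(field), value)
                    )
    return changes


# Hypothetical TaskDefinitions_Audit rows (values invented for illustration).
rows = [
    {"taskDefId": 1, "version": 2, "rubric": "B", "autograder": "x",
     "auditAction": "UPDATE", "auditTimestamp": "t2"},
    {"taskDefId": 1, "version": 1, "rubric": "A", "autograder": "x",
     "auditAction": "INSERT", "auditTimestamp": "t1"},
]
history = field_changes(rows, "taskDefId")
```

The same approach applies to any of the _Audit tables listed in this appendix, which is why a leaked audit table exposes not just the current state but every prior state of a record.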

Domain 8 - Jobs and Contracts

Table Key Columns Notes
Jobs jobID, contractorID, companyID, status, payableRate, commitment, ciiaa_direct, ciiaa_passthrough, tow, payment, startDate, createdAt, updatedAt, expiresAt, tax_form, expected_hours, title, stripeSubscriptionId, billableRate, version, dismissalDate, insightful, paymentMethod, projectID, checkr, idVerification, uid, payout, offerLetter, listingUID, managerId, signature, backgroundCheck, isLatest, note, referralId, roleId, provisionIdpAccess, safety_waiver, sourceId, confidentiality, billingAccountID, backgroundCheckConfig Core employment contract. Contains pay rates, legal agreements, Stripe subscription
Jobs_Audit All Jobs columns + auditAction, auditTimestamp Job contract change history
JobEvents jobEventId, jobId, contractorId, actorId, actionType, metadata, createdAt Events on job contracts (status changes, communications). Sample: comm, Contract Reminder
JobEventsQueue queueItemId, sourceType, sourceId, payload, renderedPreview, editedPreview, status, response, createdAt, resolvedBy, resolvedAt, jobEventId Queued job events pending processing or review
JobEventReasonAssociations jobEventId, reasonId, createdAt Structured reasons associated with job events
JobTasks Tasks linked to specific jobs Job-task mapping
JobPerformanceMetrics_New jobPerformanceMetricsId, jobId, performanceScore, standardError, jobPerformanceSummary, version, createdAt, updatedAt ML-generated job performance metrics
JobPerformanceMetrics_Audit jobPerformanceMetricsId, jobId, version, lvr, lvrReasoning, confidenceLevel, isFraud, wasDismissedEarly, jobSummary, auditAction, auditTimestamp, createdAt, updatedAt Detailed performance metrics audit trail including fraud flags
JobPerformanceReviews_New performanceReviewId, jobId, contractorId, companyId, projectName, taskId, score, reviewNotes, performanceReasons, dismissalFlag, dismissalReason, reviewedBy, createdAt, updatedAt, oldReviewId, feedBackFlag Human-reviewed job performance assessments
WeeklyProjectFeedback weeklyProjectFeedbackId, userId, jobID, weekStart, rating, feedbackText, submittedAt, updatedAt, createdAt Weekly contractor feedback on their project experience
ContractorPerformance_New contractorPerformanceId, contractorId, standardError, performanceScore, performanceSummary, version, createdAt, updatedAt Aggregate contractor performance across all jobs
ContractorPerformance_New_Audit All ContractorPerformance_New columns + auditAction, auditTimestamp Contractor performance change history
PerformanceReviews performanceReviewId, contractorId, reviewDate, performanceDetails, stars, taskDetails, reviewBy, createdAt, updatedAt, companyId Company-authored contractor performance reviews with star ratings
MLExperimentsJobPerformanceReviews Date of review, Account, Project, Reviewer, Work type, Review type, Name, Email, Quality of Work, Engagement, Offboarding Reason, Justification for rating Raw performance data for ML model training
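
JobPerformanceMetrics_New and ContractorPerformance_New both pair performanceScore with standardError. If standardError is the standard error of the score (an assumption; the sample does not document it), a consumer could derive an approximate confidence interval as sketched below. The function names and the normal approximation are illustrative choices, not something found in the data.

```python
def score_interval(performance_score, standard_error, z=1.96):
    """Approximate 95% interval under a normal approximation (assumed)."""
    if standard_error < 0:
        raise ValueError("standard_error must be non-negative")
    margin = z * standard_error
    return (performance_score - margin, performance_score + margin)


def intervals_overlap(first, second):
    """True when two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical scores for two contractors.
contractor_a = score_interval(0.80, 0.05)
contractor_b = score_interval(0.50, 0.05)
distinguishable = not intervals_overlap(contractor_a, contractor_b)
```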

Domain 9 - Time Tracking and Productivity Surveillance

Table Key Columns Notes
InsightfulScreenshots id, externalId, contractorId, projectId, storageBucket, storageKey, storageUrl, storageProvider, fileExtension, contentType, fileSizeBytes, vendorName, schemaVersion, vendorMetadata, externalIdentifiers, screenshotTimestamp, timestampTranslated, timezoneOffset, timezone, isBlurred, isOriginal, isRemoved, removedAt, externalProductivityScore, computer, hwid, os, osVersion, agentVersion, appName, appFileName, appFilePath, windowTitle, browserUrl, document, browserSite, ip, gateways, windowId, activityId, fragmentId, createdAt, updatedAt Per-screenshot records with full device fingerprint (IP, MAC, HWID), application, URL, and S3 image link
Timelog id, externalId, externalProjectId, employeeId, duration, timeStart, timeEnd, timezone, source, taskId, taskName, lineItemUid, adjustmentReason, uid, version, userId, isCompleted, linkFailReason, insightfulCreatedAt, insightfulUpdatedAt, createdAt, updatedAt Work session records synced from Insightful
Timelog_Audit All Timelog columns + audit metadata Timelog change history
Deductions id, contractId, contractorId, durationToSubtractMs, appName, reasonForDeduction, payoutCycleID, externalProjectId, externalEmployeeId, status, approvedBy, approvedAt, appliedBy, appliedAt, createdAt, createdBy, updatedAt Pay deductions for non-productive time with approval chain
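
The Deductions table carries durationToSubtractMs together with an approval chain, which suggests that payable time is the tracked time minus approved deductions. The sketch below shows that arithmetic under stated assumptions: that Timelog.duration is in milliseconds, and that an approved deduction has status "approved". Neither the unit nor the status value is confirmed by the sample rows.

```python
def net_duration_ms(timelog_duration_ms, deductions, approved_status="approved"):
    """Tracked time minus approved deductions, floored at zero."""
    subtracted = sum(
        d["durationToSubtractMs"]
        for d in deductions
        if d["status"] == approved_status
    )
    return max(0, timelog_duration_ms - subtracted)


# Hypothetical rows: one hour tracked, ten minutes deducted, one pending item.
deductions = [
    {"durationToSubtractMs": 600_000, "status": "approved"},
    {"durationToSubtractMs": 300_000, "status": "pending"},
]
payable_ms = net_duration_ms(3_600_000, deductions)
```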

Domain 10 - Payments and Financial Infrastructure

Table Key Columns Notes
UserPaymentMethods id, userId, provider, providerMethodId, methodType, status, metadata, createdAt, updatedAt, version, countryCode Contractor payment accounts. Sample: stripe, acct_1R0V****, express_account, onboarded, USA
UserPaymentMethods_Audit All UserPaymentMethods columns + auditAction, auditTimestamp Payment method change history
MercorUserFinancials id, userId, paymentProvider, providerIdentifier, accountDetails, lastFetchedOn, createdOn, updatedOn Full financial account details including bank routing numbers
PaymentLineItems id, version, cycleStartTs, cycleEndTs, totalPayableAmount, totalBillableAmount, status, createdAt, updatedAt, uid, jobUid, dispatchFailureReason, timelogUid, bonusUid, transferId, referralUid, companyId, projectId, contractorId, timeStamp, isLatestVersion, referralId, moneyOutId, eventTime, referralEligibilityId Core payment ledger. Amounts in cents
PaymentLineItems_Audit All PaymentLineItems columns + auditAction, auditTimestamp Payment line item change history
PaymentLineItems_TransactionalAudit Transactional-level payment audit Fine-grained payment operation audit trail
MoneyOut_Audit id, statementId, entityId, userId, entity, externalAccountId, externalTransferId, cycleStartTs, cycleEndTs, totalAmount, paymentMethod, status, createdAt, failureReason, payoutCycleId, auditTimestamp, auditAction, version Outbound payment records
WiseDisbursements id, moneyOutId, amount, currency, sequenceNumber, wiseTransferId, wiseQuoteId, status, failureReason, createdAt, updatedAt, accountId International Wise payment records
PayoutCycles cycleStartTs, cycleEndTs, id, status, configId, configVersion Pay period definitions
PayoutRecords Individual payout transaction records Detailed payout ledger
PayoutConfigs payoutConfigId, status, type, configuration, version Payment configuration rules
InvoiceLineItems id, name, companyId, invoiceId, sowId, taskCount, rawAmount, adjustedAmount, status, description, metadata, createdAt, updatedAt, createdBy Company invoice line items
BillingAccounts Company billing account definitions Client billing account management
BillingConfigs id, uid, version, isLatestVersion, rules, projectId, createdAt, updatedAt, createdBy Billing rule configurations (markup, caps)
BillingRateCards billingRateCardId, uid, version, isLatestVersion, sowId, formulaType, rateRows, createdAt, updatedAt, createdBy Per-SOW rate card definitions
RevenueAdjustments id, companyId, projectId, attestationId, cancelledAdjId, amountCentsUsd, category, revenueRecognitionDate, reason, createdAt, creatorId, isCancellation, formula, labels, aggregationFields, attachments, invoices Revenue adjustments and corrections
FinanceLabels Finance label definitions Labels for financial categorization
CompanyFinanceLabels companyId, financeLabelId, createdAt, creatorId Finance label assignments to companies
ReferralEligibility id, createdAt, updatedAt, referralUid, campaignId, referrerAmount, refereeAmount, referrerLineItemId, refereeLineItemId, criteriaId, onboardingStateId, referralId, entity_id, entity_type, type, jobId, billingAccountId, toolingIdempotencyKey, creatorId Referral payment eligibility and vesting conditions
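
PaymentLineItems stores amounts in cents and keeps multiple versions per line item, flagged with isLatestVersion. A minimal sketch of reading such a ledger safely is shown below: it sums only the latest versions and converts integer cents to a decimal amount without floating-point error. The helper names and sample values are hypothetical.

```python
from decimal import Decimal


def cents_to_decimal(cents):
    """Convert integer cents to a two-decimal amount without float rounding."""
    return (Decimal(cents) / Decimal(100)).quantize(Decimal("0.01"))


def total_payable(line_items):
    """Sum totalPayableAmount over rows flagged as the latest version."""
    latest = [item for item in line_items if item["isLatestVersion"]]
    return cents_to_decimal(sum(item["totalPayableAmount"] for item in latest))


# Hypothetical ledger rows; the superseded version must not be double counted.
ledger = [
    {"uid": "li_1", "totalPayableAmount": 12000, "isLatestVersion": False},
    {"uid": "li_1", "totalPayableAmount": 12550, "isLatestVersion": True},
    {"uid": "li_2", "totalPayableAmount": 999, "isLatestVersion": True},
]
payable = total_payable(ledger)
```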

Domain 11 - Referrals and Growth

Table Key Columns Notes
Referrals referralId, referredUserId, referringUserId, createdAt, version, uid, status, reason, listingId, campaignId, totalEarned, totalEarningsPotential, state, deleted_at, paidAt, disputeStatus, isActive, referral_cap, referralIdempotencyKey, isPaymentBlocked, isGuaranteedReferral Core referral records with earnings tracking
Referrals_Audit All Referrals columns + audit metadata Referral change history
ReferralReminder referralId, createdAt, lastSentAt Referral reminder email tracking
GuaranteedReferralQuota quotaId, referringUserId, offPlatformUserId, shortenedLink, weekStart, status, createdAt, updatedAt, isEmailSent Guaranteed referral program quota management
ReferrerMeta Referrer metadata and configuration Additional referrer attributes
OffPlatformCampaigns Campaign definitions for off-platform outreach External recruitment campaign management
OffPlatformCampaignSteps campaignStepId, stepNumber, campaignId, campaignType, subject, messageTemplate, parameters, scheduledAt, status, outreachedCandidateIds, failedCandidateIds, createdAt, updatedAt Multi-step outreach sequence steps
OffPlatformRecruitingManager id, managerId, offPlatformUserId, listingId, createdAt, updatedAt, updatedBy Off-platform recruiter assignments
OffPlatformUsersMapping mappingId, userId, offPlatformUserId, createdAt, updatedAt, referringUserId, status Mapping between platform and off-platform user identities

Domain 12 - Communications and Outreach

Table Key Columns Notes
Comms commId, groupId, senderId, receiverId, content, type, triggerRef, createdAt, listingReferenceUID In-platform messaging with full message content
CommsSent Communication delivery records Message send tracking
EmailTemplates emailTemplateId, companyId, subject, content, createdBy, createdAt, updatedAt, isGlobal, tags, isPersonal Email template library
AircallComms Phone call logs from Aircall VoIP integration Recruiter call records
LinkedinWarmIntros warmIntroId, linkedinUrl, email, referringUserId, listingId, commEvent, status, createdAt, updatedAt, sentAt LinkedIn outreach campaign records
PartnerChatThreads threadId, listingId, referralId, partnerId, createdAt Chat threads with referral partners
FirstTimeInvites commId, userId, listingId, createdAt, commEvent, refListingUid, contentType, subject, listingIdCount First-contact invitations to candidates
AutomationTemplates templateId, name, description, category, handler, sourceType, sourceSql, templateBody, paramsSchema, cron, idempotency, autoApprove, version, createdAt, updatedAt, deletedAt, triggerConfig, config Automated notification/workflow templates
Feedback id, user_id, question_text, question_response, rating, device, created_at, updated_at In-app user feedback submissions

Domain 13 - Company and Access Management

Table Key Columns Notes
Company companyId, name, description, website, externalName, billingModel, logo, brandVisible, billingStartDay, billingEndDay, aboutCompany, universe Client company master records
IAM roleId, companyId, status, userId_v4, id, version Company-level role assignments
IAM_Audit roleId, companyId, status, userId_v4, id, version, auditAction, auditTimestamp IAM change history. Sample: roleId=ghost, REMOVED
IAMOutbox id, resourceType, resourceId, relation, subjectType, subjectId, operation, requestedBy, requestedByService, createdAt, callerToken IAM change outbox for event-driven propagation
GodmodeCompanies companyId, createdAt, createdBy, includeInFillRate Companies accessible via internal Godmode admin
GodmodeArbitraryCells entityType, entityGmId, acKey, acValueNumber, acValueString, acValueFormula, userId, createdAt, acMetadata Arbitrary Godmode data cells for internal operations
Audience id, projectId, companyId, audienceType, slug, anchorType, anchorId, oktaGroupId, googleGroupId, slackGroupId, insightfulTaskId, createdAt, updatedAt, slackChannelId, query Audience definitions linking projects to Okta/Slack/Insightful groups
AudienceTargetProviders id, audienceId, name, externalId, type, createdAt, metadata External providers linked to audiences
DrivePermission id, driveId, googleGroupId, permissionLevel, googlePermissionId, createdAt, updatedAt Google Drive access permissions for project documents
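
IAMOutbox is described as an outbox for event-driven propagation, which matches the common transactional outbox pattern: the permission change and the event describing it are written in one transaction, and a separate relay publishes pending events. The sketch below illustrates that pattern with an in-memory SQLite database. The table layout is trimmed, and the published flag, the 'ACTIVE' and 'ADD' values, and both function names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE iam (userId TEXT, companyId TEXT, roleId TEXT, status TEXT);
CREATE TABLE iam_outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    resourceType TEXT, resourceId TEXT, relation TEXT,
    subjectType TEXT, subjectId TEXT, operation TEXT,
    published INTEGER DEFAULT 0  -- hypothetical delivery flag
);
""")


def grant_role(conn, user_id, company_id, role_id):
    """Write the role and its outbox event atomically."""
    with conn:  # commits both inserts, or rolls both back on error
        conn.execute(
            "INSERT INTO iam VALUES (?, ?, ?, 'ACTIVE')",
            (user_id, company_id, role_id),
        )
        conn.execute(
            "INSERT INTO iam_outbox (resourceType, resourceId, relation,"
            " subjectType, subjectId, operation)"
            " VALUES ('company', ?, ?, 'user', ?, 'ADD')",
            (company_id, role_id, user_id),
        )


def drain_outbox(conn, publish):
    """Publish pending events in order and mark each one as delivered."""
    rows = conn.execute(
        "SELECT id, resourceType, resourceId, relation, subjectType,"
        " subjectId, operation FROM iam_outbox"
        " WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row in rows:
        publish(row)
        with conn:
            conn.execute("UPDATE iam_outbox SET published = 1 WHERE id = ?", (row[0],))
    return len(rows)


events = []
grant_role(conn, "u1", "c1", "admin")
first_pass = drain_outbox(conn, events.append)
second_pass = drain_outbox(conn, events.append)
```

From a breach perspective, an outbox table of this kind is a chronological record of every permission change, and the callerToken column in the real schema suggests it may also hold credentials.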

Domain 14 - Skills, Certifications, and Endorsements

Table Key Columns Notes
Skills skillId, name, description, CertificationPolicy, type, parent, createdAt Hierarchical skills taxonomy
CertificationPolicies_Audit certificationPolicyId, companyId, name, description, rules, isActive, isUnified, createdAt, icon, isRevokable, requiresApproval, version, auditAction, auditTimestamp, iconColor, showBadge, displayText Certification program definitions
Certifications_Audit certificationId, certificationPolicyId, userId, evidence, status, isCertified, earnedAt, note, createdAt, updatedAt, version, auditAction, auditTimestamp Individual earned certifications. evidence contains scoring proof
SkillCertifications_Audit uid, userId, skillId, isCertified, version, lastEvaluatedAt, auditedAt, auditAction Per-skill certification status
SkillCertificationsEvidence_Audit uid, userId, skillId, isCertified, version, sourceType, sourceId, createdAt, updatedAt, auditedAt, auditAction, score, metadata Evidence backing skill certifications
ContractorEndorsements endorsementId, endorsingJobId, endorsedJobId, endorsingUserId, endorsedUserId, contents, tags, createdAt, updatedAt, source, sentiment Peer endorsements with text content and sentiment
UserResumeEvaluation evaluationId, workExperienceScore, yearsOfWorkExperience, graduationYear, mScore, inferredRole, workExperienceSkills, resumeEvalScore, awardScore, educationScore, rateAcademicCompetitions, rateCompetitiveProgramming, rateHackathonPerformance, sumScore, technicalSkills, normalisedSumScore, highestDegree, userId ML resume evaluation scores
CandidateVouches vouchId, voucherUserId, candidateUserId, candidateEmail, candidateLinkedinId, candidateName, resumeS3Key, resumeHash, howKnowSocialPlatform, howKnowSocially, howKnowWorkedTogether, howKnowStudiedTogether, howKnowOther, reasonSkills, reasonEducation, reasonEmployer, reasonExpertise, reasonOther, createdAt, updatedAt Structured peer vouching with relationship details

Domain 15 - Analytics and ML

Table Key Columns Notes
DbtFirmSchoolRank firmId, firmName, academicField, nProfiles, avgSchoolRank, medianSchoolRank, priorMeanSchoolRank, ebPriorStrength, ebAvgSchoolRank, firmsInField, firmSchoolRank, firmSchoolRankPercentile Employer prestige scores for ~154,000 firms. Used in resume scoring
DbtSchoolRankings academicField, schoolName, schoolScore, schoolRank School prestige rankings by field
PosthogAnalytics uuid, userEmail, company, startTimeUtc, endTimeUtc, activetime, inactivetime, startUrl PostHog sessions linked to user email identity
SearchAnalytics run_id, run_timestamp, avg_relevance_score, avg_prestige_score, p99_latency_ms, position_weighted_relevance_score, avg_relevant_prestige_score Search quality metrics over time
ForecastMetrics entity, id, dt, snapshot_dt, modelVersion, predictedValue ML forecast outputs for capacity and fill rate planning
MLExperimentsJobPerformanceReviews Date of review, Account, Project, Reviewer, Work type, Review type, Name, Email, Quality of Work, Engagement, Offboarding Reason, Justification for rating Raw performance review data for ML training (also listed under Domain 8)
TalentViewUserEvaluations criteriaId, userId, criteriaScore Structured per-criteria talent evaluations
ProductivityProjectRules id, project_id, description, rules, created_by, is_active, version, created_at Per-project productivity monitoring rule definitions (also listed under Domain 7)
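
The DbtFirmSchoolRank columns nProfiles, avgSchoolRank, priorMeanSchoolRank, ebPriorStrength, and ebAvgSchoolRank look like the inputs and output of an empirical Bayes shrinkage estimate, where firms with few profiles are pulled toward a field-level prior. The formula below is the standard shrinkage form and is an inference from the column names only; the sample files do not contain the dbt model that computes it.

```python
def eb_shrunk_mean(n_profiles, avg_rank, prior_mean, prior_strength):
    """Weighted blend of the observed mean and the prior mean.

    prior_strength acts like a count of pseudo-observations at prior_mean.
    """
    weight = n_profiles + prior_strength
    if weight <= 0:
        raise ValueError("n_profiles + prior_strength must be positive")
    return (n_profiles * avg_rank + prior_strength * prior_mean) / weight


# Hypothetical firms with the same observed average but different sample sizes.
small_firm = eb_shrunk_mean(n_profiles=2, avg_rank=10, prior_mean=100, prior_strength=8)
large_firm = eb_shrunk_mean(n_profiles=200, avg_rank=10, prior_mean=100, prior_strength=8)
```

Under this reading, a firm with two profiles stays close to the prior while a firm with two hundred profiles is ranked almost entirely on its own data.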

Domain 16 - Infrastructure and DevOps

Table Key Columns Notes
IacDeploymentRuns id, runType, environment, status, commitSha, branch, actor, githubRunId, githubRunUrl, prNumber, stacksAffected, resourcesAdded, resourcesChanged, resourcesDestroyed, summary, durationSeconds, startedAt, completedAt, createdAt Terraform deployment records. Exposes GitHub monorepo URLs, engineer usernames, Terraform plan output
ProductionDeployment deploymentRecordId, releaseTag, buildHash, deployedAt, deploymentIds, taskDefinitionArns, status, createdAt, updatedAt ECS production deployment records. Contains AWS task definition ARNs
PreprodDeployment id, releaseTag, commitSha, deployedAt, loadTestPassed, releaseOwner, status, createdAt, updatedAt Staging deployment records with load test results
PreprodDeploymentTest id, test_message, created_at, updated_at Test table for pre-production deployment validation
ProductionVersion id, lastVersion, lastReleaseTag, lastBuildHash, updatedAt Single-row pointer to current production version
RollbackExecution Rollback event records including affected services Emergency rollback tracking
DATABASECHANGELOG ID, AUTHOR, FILENAME, DATEEXECUTED, MD5SUM, DESCRIPTION, COMMENTS, EXECTYPE, LIQUIBASE Liquibase schema migration history. Reveals engineer names, migration filenames
DATABASECHANGELOGLOCK Liquibase migration lock state Prevents concurrent schema migrations
AgentSandboxes sandboxId, userId, title, agentType, status, backendType, host, stopReason, transcriptRawUrl, transcriptConsolidatedUrl, snapshotId, lastSnapshotId, snapshotStorageKey, acpSessionId, backendId, sandboxToken, claimedAt, expiresAt, createdAt, updatedAt, deletedAt AI coding agent sandbox sessions. transcriptRawUrl links to S3 conversation logs
DrivePermission id, driveId, googleGroupId, permissionLevel, googlePermissionId, createdAt, updatedAt Google Drive permission records (also listed under Domain 13)

Domain 17 - Reference and Miscellaneous

Table Key Columns Notes
Country id, isoCode3, name, currency, psp, createdAt, updatedAt Country reference table with payment service provider per country
TagAssignments_Audit tagAssignmentId, tagId, entityType, entityId, createdAt, updatedAt, version, auditAction, auditTimestamp Tag assignments to entities
ShortenedUrls URL shortener records Shortened URL definitions
UrlClicks id, urlId, clickedAt, ipHash, userId, country Click tracking on shortened URLs
BeelineJobMapping External job platform mapping Maps Mercor jobs to Beeline external system
UserManagement Internal user management records Admin user management
UserManagementWorkflows User management workflow state Multi-step user management processes
ActionsQueue Queued action records General purpose action queue
GoldenReviewSample Golden reference samples for review calibration QA calibration data
References Professional reference records Additional reference management
CatfishAuditLog id, slackUserId, slackUserName, targetEmail, platform, environment, intent, status, errorMessage, slackChannelId, createdAt Internal user lookup audit. Records every lookup of user data that staff perform via the "Catfish" tool
CapacityApplicationLog id, capacityBudgetId, capacityLogId, projectId, actionsTakenJson, status, notes, createdAt Capacity budget application tracking
OffPlatformCampaignSteps campaignStepId, stepNumber, campaignId, campaignType, subject, messageTemplate, parameters, scheduledAt, status, outreachedCandidateIds, failedCandidateIds, createdAt, updatedAt Off-platform outreach campaign step execution (also listed under Domain 11)

End of Appendix A


Document prepared for security research and educational purposes. All PII has been obfuscated.
