Expert data, ready now.

Frontier-grade evaluation, fine-tuning, and RL data — curated and ready. Need something custom? We build it.

Built for model-training teams optimizing reasoning quality: benchmark hard negatives, verifier sets, RL reward-model data, chain-of-thought stress tests, and domain-specific math/physics/CS evaluation suites.

Model performance

Sample dataset — frontier-level mathematical reasoning problems

Built for model trainers who need high-signal reasoning evals: measurable failure modes, controlled difficulty tiers, and expert-verified ground truth for post-training, RLHF, reward model tuning, and frontier model regression tracking.

Olympiad-M — Mathematics

  • GPT-5.4 Thinking: 30.2%
  • Claude Opus 4.6: 23.8%
  • Gemini 3.1 Pro: 22.1%
  • Deepseek v3.2 Speciale: 12.3%
  • Qwen3.5 Plus: 6.8%
Let $p$ be an odd prime and $a$ an integer not divisible by $p$. Prove that the number of solutions to $x^2 \equiv a \pmod{p}$ is $1 + \left(\frac{a}{p}\right)$, where $\left(\frac{a}{p}\right)$ denotes the Legendre symbol.
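
For reference, a compressed sketch of the standard argument, in the LaTeX form a gold-standard solution for an item like this might take:

```latex
% Sketch of the canonical proof (illustrative of a gold-standard solution).
\begin{proof}[Sketch]
Suppose first that $a$ is a quadratic residue, so $b^2 \equiv a \pmod{p}$
for some $b \not\equiv 0$. Then $x^2 \equiv a \pmod{p}$ is equivalent to
$(x - b)(x + b) \equiv 0 \pmod{p}$, and since $\mathbb{Z}/p\mathbb{Z}$ is a
field, $x \equiv b$ or $x \equiv -b$. These are distinct because $p$ is odd,
giving exactly $2 = 1 + \left(\frac{a}{p}\right)$ solutions. If $a$ is a
non-residue, there are no solutions and $1 + \left(\frac{a}{p}\right) = 0$.
In both cases the count is $1 + \left(\frac{a}{p}\right)$.
\end{proof}
```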
Lab: We’re fine-tuning for legal contract analysis but our model hallucinates clauses. We need harder eval data.
Sciloop: We can source practicing attorneys to write adversarial contract scenarios. Want 200 problems with gold-standard annotations?
Lab: Yes — focus on M&A and IP licensing. Include edge cases around indemnification.
Sciloop: Scoping now. We’ll have a spec and sample batch for you by Thursday.
Lab: Sample batch is exactly what we needed. Can we 5x the volume?
Sciloop: On it.

How We Work With Labs

You tell us where your model breaks.

We get on a call, understand the gaps, and figure out what data moves the needle.

We scope it together.

Domain, difficulty, format, volume — we either match you with data we’ve already curated or spec out a custom production run.

You receive verified data.

Expert-crafted, cross-checked, in your format. Come back when you need more.

What's available

Ready Now

Curated evaluation, SFT, RLHF, and reward model data across math, physics, chemistry, biology, and CS. Check with us for availability and volume.

Built to Spec

Need something specific? Tell us the domain, difficulty, and format. We produce it through our expert network — crafted, cross-checked, verified.

Common Data Needs for Reasoning Model Training

  • Long-horizon chain-of-thought supervision for advanced reasoning traces
  • Hard negative examples for refusal calibration and robust safety tuning
  • Preference/ranking data for reward model and policy optimization
  • Verifier labels and outcome-check metadata for process-reward training (see the record sketch after this list)
  • Targeted failure-mode sets for tool-use, math, and physics reasoning
  • Multistep evaluation suites that resist benchmark leakage and memorization
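
To make the verifier-label item concrete, here is a minimal record sketch in Python; every field name is a hypothetical placeholder, not a fixed Sciloop schema:

```python
import json

# Hypothetical step-labeled reasoning trace for process-reward training.
# All field names are illustrative placeholders.
record = {
    "problem_id": "olym-m-0412",
    "prompt": "Let p be an odd prime and a an integer not divisible by p ...",
    "steps": [
        {"text": "Factor x^2 - a over the field F_p.", "verifier_label": "correct"},
        {"text": "Assume a is a residue, write a = b^2.", "verifier_label": "correct"},
        {"text": "Conclude there are p solutions.", "verifier_label": "incorrect"},
    ],
    "outcome_check": {"final_answer_correct": False, "checked_by": "domain expert"},
}

print(json.dumps(record, indent=2))
```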

For model trainers

Data needs for training and evaluating frontier reasoning systems

AI labs and model training teams usually need more than one dataset type. Sciloop supports end-to-end data workflows across evaluation, supervised tuning, and preference optimization with expert-authored scientific and mathematical content.

Reasoning model evaluation sets

Olympiad-style and research-grade tasks for tracking progress on multi-step mathematical and scientific reasoning, including difficult edge cases where pass@k remains low.
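
The pass@k figures such suites report are typically computed with the standard unbiased estimator from Chen et al. (2021); a minimal Python version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct,
    k the evaluation budget (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=50, c=3, k=10))  # ≈ 0.496
```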

SFT and post-training reasoning data

High-precision supervised fine-tuning data with verified solutions and structured rationales to improve chain quality, rigor, and consistency in hard domains.
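
A minimal sketch of what one such record could look like, assuming a JSON-lines delivery format (field names hypothetical):

```python
import json

# Hypothetical SFT record: verified final answer plus a structured rationale.
sft_record = {
    "prompt": "For an odd prime p and p not dividing a, how many x satisfy "
              "x^2 = a (mod p)?",
    "rationale": [
        "If a is a residue with root b, x^2 = a factors as (x - b)(x + b) = 0.",
        "Z/pZ is a field, so x = b or x = -b, and these differ since p is odd.",
        "If a is a non-residue there are no solutions.",
    ],
    "final_answer": "1 + legendre_symbol(a, p)",
    "verified_by": "independent expert cross-check",
}

print(json.dumps(sft_record))  # one line per record in a .jsonl delivery
```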

Preference and reward modeling data

Pairwise comparisons and scoring rubrics designed for RLHF and reward model training, with expert adjudication on correctness, novelty, and reasoning depth.
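
For illustration, a pairwise record of the kind a reward-model pipeline might consume; the keys and rubric dimensions are placeholders:

```python
import json

# Hypothetical preference pair with rubric scores and expert adjudication.
pair = {
    "prompt": "Prove that the square root of 2 is irrational.",
    "chosen": "Assume sqrt(2) = p/q in lowest terms; then p^2 = 2q^2, so p is even ...",
    "rejected": "sqrt(2) is irrational because its decimal expansion never ends.",
    "rubric": {
        "correctness": {"chosen": 5, "rejected": 1},
        "rigor": {"chosen": 5, "rejected": 2},
        "reasoning_depth": {"chosen": 4, "rejected": 1},
    },
    "adjudication": "expert reviewer; ties escalated to a second reviewer",
}

print(json.dumps(pair, indent=2))
```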

Custom frontier benchmark design

Bespoke benchmark construction for your internal goals: domain targeting, anti-contamination controls, calibration subsets, and operational delivery specs.
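
One common anti-contamination control is an n-gram overlap screen against candidate training corpora; a minimal sketch, where the choice of n and the flag threshold are assumptions:

```python
# Minimal n-gram overlap screen for benchmark contamination checks.
# n=8 and the ~0.2 flag threshold are illustrative assumptions.
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap(candidate: str, corpus_doc: str, n: int = 8) -> float:
    cand = ngrams(candidate, n)
    return len(cand & ngrams(corpus_doc, n)) / max(len(cand), 1)

item = "let p be an odd prime and a an integer not divisible by p"
print(overlap(item, item))  # 1.0; flag benchmark items scoring above ~0.2
```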

FAQ for AI labs and reasoning teams

What kinds of training data do reasoning models usually need?

Most reasoning teams need a blend: difficult evaluation sets, supervised training examples with verifiable solutions, preference data for policy shaping, and reward-model calibration data.

Can Sciloop support domain-specific benchmark creation?

Yes. We scope by domain, expected difficulty, output format, and evaluation objective, then produce expert-authored and expert-reviewed datasets tailored to your stack.

How do you reduce low-signal or shortcut-heavy data?

We use expert verification and structured review loops focused on novelty, correctness, and anti-shortcut design, so training and eval data better reflect true reasoning capability.

Ready to talk data?

We know you're busy, so we move fast. One call, quick scoping, and first delivery in days — not weeks.

Share your target model class, failure modes, desired output schema, and acceptance criteria. We can support eval refresh cycles, SFT data expansion, reward model pairwise ranking, and tool-use trajectory curation.
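
If it helps to structure that first message, here is a hypothetical scoping stub; every value is a placeholder to be replaced with your own targets:

```python
# Hypothetical scoping request; all values are placeholders.
scoping_request = {
    "model_class": "long-horizon reasoning, tool-augmented",
    "failure_modes": ["hallucinated citations", "skipped proof steps"],
    "output_schema": ["prompt", "gold_solution", "rationale", "difficulty_tier"],
    "acceptance_criteria": {
        "expert_agreement": ">= 90%",
        "contamination_screen": "required",
    },
    "volume": "200 items, option to 5x",
}

print(scoping_request)
```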