Gemini Fine Gray Survival Analysis Task
Source: gemini-fine-gray-survival-analysis-task.md (ingested 2026-03-28)
What You Should Do Next
The goal is to complete the deliverable to a standard that would also satisfy the metaprompt. That means producing five filled fields that are each internally complete, technically rigorous, and demonstrate that Gemini 3.0 Pro genuinely fails while a skilled human would succeed.
The wind turbine competing risks domain is a strong choice and worth keeping. Here is why it works:
The Fine-Gray sub-distribution hazard model is not natively implemented in standard Python survival analysis packages like lifelines or scikit-survival in a way that handles all edge cases. Gemini will either refuse to implement it, implement the cause-specific hazard and mislabel it as Fine-Gray, or produce code that runs but computes the wrong quantity. That is a verifiable, not adversarial, failure.
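The definitional difference the model tends to blur can be shown in a few lines of pure Python on synthetic data. The sketch below is illustrative only (it ignores the IPCW weighting a full Fine-Gray fit needs for censored subjects): in the cause-specific risk set, a subject who failed from a competing cause is gone; in the Fine-Gray sub-distribution risk set, that subject remains at risk.

```python
# Toy illustration (pure Python, synthetic data) of the risk-set difference
# between cause-specific hazards and the Fine-Gray sub-distribution hazard.
# Note: a real Fine-Gray fit also needs censoring weights (IPCW); this
# sketch only shows the risk-set definitions.

def cause_specific_risk_set(data, t):
    """Subjects still event-free and uncensored just before time t."""
    return {i for i, (time, _) in enumerate(data) if time >= t}

def subdistribution_risk_set(data, t, cause=1):
    """Fine-Gray risk set: event-free subjects, plus subjects who already
    failed from a *competing* cause (event not 0 and not `cause`)."""
    return {i for i, (time, ev) in enumerate(data)
            if time >= t or (time < t and ev not in (0, cause))}

# (time, event): event 0 = censored, 1 = cause of interest, 2 = competing
data = [(1, 2), (2, 1), (3, 2), (4, 1), (5, 0)]

print(len(cause_specific_risk_set(data, 4)))   # 2 subjects
print(len(subdistribution_risk_set(data, 4)))  # 4 subjects: competing failures stay in
```

Code that builds only the first risk set is fitting cause-specific hazards, whatever its docstring claims to be doing.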
Left truncation combined with interval censoring in the same dataset is a combination that breaks most model-generated survival analysis code in specific, documentable ways. The model will typically drop left-truncated observations silently rather than conditioning on them correctly.
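The left-truncation failure is easy to make concrete. The correct treatment conditions on delayed entry: a subject with entry time `e` only joins the risk set for event times after `e`. A minimal sketch on synthetic data, comparing the adjusted estimator with the naive one that ignores entry times:

```python
# Sketch (pure Python, synthetic data): Kaplan-Meier with delayed entry.
# Left-truncated subjects must enter the risk set only after their entry
# time; ignoring entry times (or silently dropping the subjects) biases
# the estimate.

def km(records, use_entry=True):
    """records: list of (entry, exit, event); event 1 = failure, 0 = censored."""
    event_times = sorted({x for e, x, d in records if d == 1})
    s = 1.0
    for t in event_times:
        if use_entry:
            # correct: at risk only if entry < t <= exit
            n = sum(1 for e, x, d in records if e < t <= x)
        else:
            # naive: pretend everyone was observed from time 0
            n = sum(1 for e, x, d in records if x >= t)
        d_t = sum(1 for e, x, d in records if x == t and d == 1)
        s *= 1 - d_t / n
    return s

records = [(0, 4, 1), (0, 2, 1), (2, 5, 1), (3, 6, 0)]
print(km(records, use_entry=True))   # 1/6: left-truncation-adjusted
print(km(records, use_entry=False))  # 0.25: naive, biased
```

The two estimates differ even on four observations, so a rubric can check this numerically rather than by reading the candidate's prose.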
The competing risks interpretation failure, specifically the fact that one minus the Kaplan-Meier estimator of any single cause overestimates the cause-specific cumulative incidence when competing risks are present, is a subtle conceptual error that frontier models make persistently and confidently.
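This overestimation is numerically checkable on a toy dataset, which is exactly what makes it a good rubric item. The sketch below (pure Python, synthetic data) computes both quantities: one minus the Kaplan-Meier estimate that treats competing events as censoring, and the Aalen-Johansen cumulative incidence.

```python
# Numerical check (pure Python, synthetic data): with competing risks,
# 1 - KM for one cause (competing events treated as censoring) overstates
# that cause's cumulative incidence (Aalen-Johansen estimator).

def one_minus_km(data, cause=1):
    """Competing events simply leave the risk set, as if censored."""
    s = 1.0
    for t in sorted({t for t, e in data if e == cause}):
        n = sum(1 for ti, _ in data if ti >= t)
        d = sum(1 for ti, ei in data if ti == t and ei == cause)
        s *= 1 - d / n
    return 1 - s

def aalen_johansen_cif(data, cause=1):
    """Cumulative incidence: event-free survival times cause-specific hazard."""
    s, cif = 1.0, 0.0
    for t in sorted({t for t, e in data if e != 0}):
        n = sum(1 for ti, _ in data if ti >= t)
        d_all = sum(1 for ti, ei in data if ti == t and ei != 0)
        d_c = sum(1 for ti, ei in data if ti == t and ei == cause)
        cif += s * d_c / n
        s *= 1 - d_all / n
    return cif

# (time, event): 0 = censored, 1 = cause of interest, 2 = competing
data = [(1, 2), (2, 1), (3, 2), (4, 1), (5, 0)]
print(one_minus_km(data))        # 0.625
print(aalen_johansen_cif(data))  # 0.4
```

On five observations the gap is already 0.225, so a unit test can assert the candidate reported the 0.4 figure, not the 0.625 one.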
The Execution Plan
Here is what I will build for you, in order, to complete this:
Field 1 — Task Design: A fully specified prompt in wind turbine reliability analysis that contains a synthetic dataset with known ground truth, at least one embedded data inconsistency, at least one implicit constraint, at least one red herring, and explicit requirements for the model to state assumptions and produce structured output.
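The "known ground truth" requirement can be met with a latent competing-risks generator. The sketch below is a hedged illustration: the failure modes (gearbox, blade) and the exponential rates are placeholder assumptions, not the final task parameters.

```python
# Hedged sketch of a synthetic data-generating process with known ground
# truth: latent exponential times for two competing failure modes plus
# independent censoring. Cause names and rates are illustrative only.
import random

def simulate_turbine(rng, rate_gearbox=0.04, rate_blade=0.02, rate_censor=0.03):
    t1 = rng.expovariate(rate_gearbox)  # latent time to gearbox failure (cause 1)
    t2 = rng.expovariate(rate_blade)    # latent time to blade failure (cause 2)
    c = rng.expovariate(rate_censor)    # independent censoring time
    t = min(t1, t2, c)
    event = 0 if t == c else (1 if t == t1 else 2)
    return t, event

rng = random.Random(42)  # fixed seed so the dataset is reproducible
sample = [simulate_turbine(rng) for _ in range(5)]
print(sample)
```

Because the rates are known, every target quantity (cause-specific hazards, true cumulative incidence curves) has a closed form to grade against.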
Field 2 — LLM's Response: You run the prompt against the target LLM (Gemini 3.0 Pro in this task, though the workflow should stay general enough to cover other models via their APIs and release documentation) and paste its output. I will help you annotate exactly where it fails and why, mapped to specific failure modes.
Field 3 — Golden Solution: A complete expert-level solution showing all intermediate steps, explicit assumptions, a correct competing risks analysis using Fine-Gray or correctly labeled cause-specific hazards, proper handling of left truncation and interval censoring, and numerically verifiable outputs.
Field 4 — Scoring Rubric: A 100-point rubric organized by reasoning stage, with specific criteria, partial credit rules, and at least ten automated unit test descriptions that could be implemented in Python.
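One of those automated tests might look like the sketch below. The function name, the `submission` structure, and its keys are all invented for illustration; the final rubric would define its own interface.

```python
# Hypothetical rubric check (invented interface): a submitted cumulative
# incidence curve must lie in [0, 1], be non-decreasing, and the per-cause
# curves must sum to at most 1 at every shared time point.

def check_cif_validity(submission, tol=1e-9):
    """submission: {cause: [(time, cif), ...]} on a shared, sorted time grid."""
    for curve in submission.values():
        values = [v for _, v in curve]
        assert all(0.0 <= v <= 1.0 for v in values), "CIF outside [0, 1]"
        assert all(a <= b for a, b in zip(values, values[1:])), "CIF not monotone"
    # pointwise sum across causes must not exceed 1 (total probability)
    for point in zip(*submission.values()):
        assert sum(v for _, v in point) <= 1.0 + tol, "CIFs sum above 1"
    return True

submission = {1: [(1, 0.0), (2, 0.2), (4, 0.4)],
              2: [(1, 0.2), (2, 0.2), (4, 0.4)]}
print(check_cif_validity(submission))  # True
```

Checks of this shape are mechanical, so partial credit can be assigned by counting which assertions pass rather than by judgment calls.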
Field 5 — Failure Analysis: A technical comparison of Gemini 3.0 Pro against GPT-4o and Claude 3.5 Sonnet on this specific task structure, grounded in known architectural and training limitations rather than vague claims.
What I Need From You To Start
Tell me which of these two situations applies:
Situation A: You want me to scaffold the project and all five fields right now, including the complete task prompt and golden solution samples, so that you can run the LLM (Gemini 3.0 Pro here) on the prompt yourself and then compare its output to the golden solution I produce. In this case I start immediately with the task prompt and work through all phases.
Situation B: You already have a partial execution and want me to complete the missing pieces: specifically, the golden solution continuation from where it cut off, plus the rubric, the automated tests, and the failure analysis. In this case, paste the existing partial work and I will build from there.
Either way, the wind turbine competing risks domain is sound and I will keep it. The synthetic dataset approach with known data-generating process is correct for verifiability. The three target failure modes are the right ones to document.
Tell me which situation applies and I will begin immediately with the first deliverable.