📚 Complete Guide • Updated 2025

How to Learn Stata with Generative AI

A practical guide to writing prompts that generate clean, reproducible Stata code for regression and causal inference, with a focus on applied econometrics.

OLS • IV • DiD • RD • FE • Robust • Clustered SEs • Panel Data

Getting Started

Master the art of translating econometric problems into precise Stata code through AI assistance.

Why This Guide?

  • Turn research questions into working code faster
  • Avoid common pitfalls in econometric implementation
  • Ensure reproducible and robust results

What You Will Learn

  • 📝 Structure effective prompts for AI models
  • 🔧 Access ready-made code for common methods
  • Verify results with systematic checklists

Golden Rules

Essential

Five fundamental principles for successful AI-assisted econometric programming.

1. Define Your Data Schema

Always specify variable names, types, and meanings. Include sample size and panel structure if applicable.

wage: numeric (hourly wage in $), educ: numeric (years), id: string (worker ID)
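
One way to gather this schema before writing the prompt is to let Stata confirm it. The sketch below is illustrative only: the four-row toy dataset and the name id_num are assumptions made so the block runs on its own, and the same checks would be run on your own file.

```stata
* Toy data standing in for the schema above (all values are made up)
clear
input str4 id int year double wage byte educ
"w01" 2019 15.5 12
"w01" 2020 16.0 12
"w02" 2019 22.0 16
"w02" 2020 23.5 16
end

* Confirm the types the prompt will claim
confirm numeric variable wage educ year
confirm string variable id

* Confirm that id and year together identify each row
isid id year

* xtset needs a numeric panel identifier, so encode the string id
* (id_num is a hypothetical name)
encode id, generate(id_num)
xtset id_num year

* This output can be pasted into the prompt as the schema
describe
```

Pasting the describe output into the prompt keeps the generated code tied to variable names that actually exist.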

2. State Your Research Design

Clearly identify the econometric method and identification strategy.

OLS with robust SEs, IV using quarter-of-birth, DiD with two periods

3. Specify Standard Error Treatment

Match SE choice to your research design and data structure.

Cluster at individual level, robust for heteroskedasticity, bootstrap for complex estimators
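
To see what each choice changes, the following sketch fits one model three ways. It assumes the nlsw88 example dataset that ships with Stata; the variables grade, ttl_exp, and industry belong to that dataset and stand in for your own, and clustering on industry (only a handful of clusters) is for illustration rather than a recommendation.

```stata
* Example data shipped with Stata (a stand-in for your own file)
sysuse nlsw88, clear

* Use one common sample so only the standard errors differ
drop if missing(industry)

* Heteroskedasticity-robust SEs
regress wage grade ttl_exp, vce(robust)
estimates store se_robust

* Cluster-robust SEs (few clusters here: illustration only)
regress wage grade ttl_exp, vce(cluster industry)
estimates store se_cluster

* Bootstrap SEs; 200 replications keeps the sketch fast
regress wage grade ttl_exp, vce(bootstrap, reps(200) seed(12345))
estimates store se_boot

* Side-by-side coefficients and standard errors
estimates table se_robust se_cluster se_boot, b(%9.4f) se(%9.4f)
```

The point estimates match across columns; only the standard errors move, which is why the SE choice belongs in the prompt rather than being left to the model.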

4. Request Code Only

Ask for executable code with minimal comments, avoiding interactive prompts or placeholders.

Return Stata code block only; use actual variable names; include basic diagnostics

5. Ensure Reproducibility

Include a Stata version statement, random-number seeds, and explicit package requirements.

set seed 12345, ssc install reghdfe, version 17
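
A do-file header along these lines is one possible way to combine the three elements. It is a sketch, not a required layout: it assumes Stata 17 or later, it only reports a missing reghdfe instead of installing it, and u1 and u2 are hypothetical variables used to show that a fixed seed reproduces the same draws.

```stata
* Pin interpreter behavior to a known release
version 17
clear all
set more off

* Check for a user-written package without installing it silently
capture which reghdfe
if _rc {
    display as text "reghdfe not found; install with: ssc install reghdfe"
}

* Same seed, same random draws
set obs 5
set seed 12345
generate double u1 = runiform()
set seed 12345
generate double u2 = runiform()

* Stops the do-file if the two draws ever differ
assert u1 == u2
```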

Prompt Template

Copy & Fill

Use this structured template to create clear, comprehensive prompts for any econometric task.

📋 How to Use

  1. Copy the template below
  2. Replace placeholders with your specific details
  3. Paste into your preferred AI model (Claude, ChatGPT, etc.)
  4. Review output using the QC checklist

Universal Prompt Template
prompt
Role: You are a careful Stata tutor for undergraduate econometrics.

Context:
- Study question and identification strategy: <1–2 lines>
- Dataset name: <e.g., wagepanel.dta>
- Variables (name : type : meaning):
  - wage : numeric : log hourly wage
  - educ : numeric : years of schooling
  - exper : numeric : years of experience
  - id, year : panel identifiers
- Assumptions / SEs: <e.g., cluster at id; robust>
- Output format: Stata code only, with short comments. Use explicit commands (xtset, ivregress, etc.). No placeholder variables.

Task:
- Write Stata code to accomplish: <task>
- Include: data load, minimal cleaning (few lines), model, SEs, brief postestimation.
- Constraints: no interactive prompts; set seed when simulating; avoid deprecated commands.

Return: A single code block.

Ready-to-Use Code Recipes

Copy-paste Stata code for the most common econometric methods. Each recipe includes diagnostics and best practices.

OLS with Robust Standard Errors

Recipe 1

Basic linear regression with heteroskedasticity-robust standard errors

stata
* OLS with robust SEs
clear all
use "wagepanel.dta", clear

* Basic data exploration
describe wage educ exper
summarize wage educ exper, detail

* OLS with robust (Huber-White) standard errors
regress wage educ exper, vce(robust)

* Store results for later comparison
estimates store ols_robust

* Effect size interpretation for log wages
* If the dependent variable is log wage (lnwage), uncomment:
* regress lnwage educ exper, vce(robust)
* display "Return to schooling (%) = " 100*(exp(_b[educ])-1)
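
The percent interpretation in the last comment can be tried on data that ships with Stata. This is a minimal sketch assuming the nlsw88 example dataset, with grade standing in for educ and ttl_exp for exper; the name pct_return is hypothetical.

```stata
* Example data shipped with Stata (a stand-in for your own file)
sysuse nlsw88, clear
generate double lnwage = ln(wage)

regress lnwage grade ttl_exp, vce(robust)

* Approximate percent effect of one more year of schooling: 100 * b
display "Approximate return (%) = " %5.2f 100*_b[grade]

* Exact percent effect: 100 * (exp(b) - 1)
display "Exact return (%)       = " %5.2f 100*(exp(_b[grade]) - 1)

* Delta-method standard error and confidence interval for the exact effect
nlcom (pct_return: 100*(exp(_b[grade]) - 1))
```

The two numbers are close when the coefficient is small and drift apart as it grows, so the exact formula is the safer one to report.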

Instrumental Variables (2SLS)

Recipe 2

Two-stage least squares with endogenous regressors

stata
* IV / 2SLS: wage on educ instrumented by quarter-of-birth
clear all
use "wage_iv.dta", clear

* Check instrument relevance (quarter of birth is categorical, so enter it as i.qob)
regress educ i.qob exper age, vce(robust)
testparm i.qob

* 2SLS estimation
ivregress 2sls wage (educ = i.qob) exper age, vce(robust)

* First-stage diagnostics
estat firststage
* Rule of thumb: a first-stage F-stat above 10 suggests the instruments are not weak

* Over-identification test (applies here: three quarter dummies, one endogenous regressor)
estat overid
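
Before trusting IV code on real data, it can help to run the same commands on simulated data where the true effect is known. The sketch below is an assumption-laden toy: x, z, and u are hypothetical variables, the true coefficient on x is set to 0.5, and a single continuous instrument is used for simplicity.

```stata
clear
set seed 12345
set obs 2000

generate double z = rnormal()                  // instrument
generate double u = rnormal()                  // unobserved confounder
generate double x = 0.8*z + u + rnormal()      // endogenous regressor
generate double y = 1 + 0.5*x + u + rnormal()  // true effect of x is 0.5

* OLS is biased upward because x and the error share u
regress y x, vce(robust)
scalar b_ols = _b[x]

* First-stage strength
regress x z, vce(robust)
test z
scalar F_first = r(F)

* 2SLS should land near the true value
ivregress 2sls y (x = z), vce(robust)
scalar b_iv = _b[x]

display "OLS = " %6.3f b_ols "   IV = " %6.3f b_iv "   first-stage F = " %8.1f F_first
```

If the IV estimate is not close to the value built into the simulation, the problem lies in the code rather than in the data.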

Difference-in-Differences

Recipe 3

Treatment effect estimation with before/after and treatment/control groups

stata
* Difference-in-Differences estimation
clear all
use "did_sample.dta", clear

* Variables: treat (1=treatment group), post (1=after policy), y (outcome)

* Basic DiD specification
regress y i.treat##i.post, vce(robust)

* Extract treatment effect
lincom 1.treat#1.post
display "Treatment effect: " _b[1.treat#1.post]

* Event study with multiple periods (if panel data available)
* xtset id year
* reghdfe y i.year##i.treat, absorb(id) vce(cluster id)

* Pre-trend check (parallel trends assumption)
* In the event study, test that the pre-treatment year-by-treat coefficients are jointly zero
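
The pre-trend check described in the final comment could look like the sketch below. Everything in it is assumed for illustration: a simulated panel of 200 units over 2015 to 2020, a policy starting in 2018 with a true effect of 2, and 2017 as the omitted reference year. It uses regress rather than reghdfe so that it runs without extra packages.

```stata
clear
set seed 12345
set obs 200

generate id = _n
generate byte treat = (id <= 100)
expand 6
bysort id: generate year = 2014 + _n        // 2015 through 2020

* Common trend plus a treatment effect of 2 from 2018 onward
generate double y = 1 + 0.5*treat + 0.2*(year - 2015) ///
    + 2*treat*(year >= 2018) + rnormal()

* Event study: year-by-treat interactions relative to 2017
regress y ib2017.year##i.treat, vce(cluster id)

* Joint test that the pre-period interactions are zero
test 2015.year#1.treat 2016.year#1.treat
```

A large p-value on the joint test is consistent with parallel pre-trends; it does not prove that the assumption holds after treatment.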

Regression Discontinuity

Recipe 4

Sharp RD design around a cutoff point

stata
* Sharp Regression Discontinuity around cutoff c=50
clear all
use "rd_sample.dta", clear

* Create treatment indicator based on cutoff
generate D = (running_var >= 50) if !missing(running_var)

* Optimal bandwidth RD (requires rdrobust package)
* ssc install rdrobust, replace
rdrobust outcome running_var, c(50)

* Manual local linear regression (for illustration)
* Restrict to an illustrative bandwidth of 5 around the cutoff (this drops all other observations)
keep if inrange(running_var, 45, 55)

* Center running variable at cutoff
generate centered = running_var - 50

* Local linear regression with a separate slope on each side of the cutoff
regress outcome i.D##c.centered, vce(robust)

* Treatment effect is the coefficient on 1.D (the jump at the cutoff)
lincom 1.D
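
Because the manual estimate depends on the chosen window, a bandwidth sensitivity loop is a useful companion. The following is a minimal sketch on simulated data, assuming a true jump of 3 at the cutoff of 50; the bandwidths 5, 10, and 20 are arbitrary illustrative values, not recommendations.

```stata
clear
set seed 12345
set obs 2000

generate double running_var = 100*runiform()
generate byte D = (running_var >= 50)
generate double centered = running_var - 50

* Linear in the running variable with a jump of 3 at the cutoff
generate double outcome = 10 + 0.1*centered + 3*D + rnormal()

* Re-estimate the jump within several windows around the cutoff
foreach h of numlist 5 10 20 {
    quietly regress outcome i.D##c.centered if abs(centered) <= `h', vce(robust)
    display "h = `h': jump = " %6.3f _b[1.D] "  (SE " %5.3f _se[1.D] ")"
}
```

Estimates that stay stable across windows are reassuring; large swings suggest the result is driven by the bandwidth choice.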

Panel Fixed Effects

Recipe 5

Two-way fixed effects with clustered standard errors

stata
* Panel Fixed Effects with clustering
clear all
use "panel_data.dta", clear

* Set panel structure
xtset id year

* Two-way fixed effects (requires reghdfe)
* ssc install reghdfe, replace
reghdfe wage educ exper, absorb(id year) vce(cluster id)

* Alternative: manual fixed effects
* Unit FE with year dummies
areg wage educ exper i.year, absorb(id) vce(cluster id)

* Check for serial correlation (xtserial is a user-written command)
* xtserial wage educ exper

* Hausman test: fixed vs. random effects (run with default, non-robust SEs)
* xtreg wage educ exper, fe
* estimates store fe
* xtreg wage educ exper, re
* estimates store re
* hausman fe re
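
When swapping between fixed-effects commands, it is worth confirming that they agree on the coefficient. The sketch below uses a simulated panel (all names and values are assumptions, with a true slope of 2) and compares xtreg, fe with areg. The point estimates should match, while the clustered standard errors can differ because the two commands apply different degrees-of-freedom adjustments.

```stata
clear
set seed 12345
set obs 100

generate id = _n
generate double a = rnormal()               // unit-specific effect
expand 5
bysort id: generate year = 2015 + _n        // 2016 through 2020

* x is correlated with the unit effect, so pooled OLS would be biased
generate double x = 0.5*a + rnormal()
generate double y = 1 + 2*x + a + 0.1*(year - 2016) + rnormal()

xtset id year

* Unit fixed effects via xtreg, with year dummies
xtreg y x i.year, fe vce(cluster id)
scalar b_xt = _b[x]
scalar se_xt = _se[x]

* The same model via areg
areg y x i.year, absorb(id) vce(cluster id)
scalar b_areg = _b[x]
scalar se_areg = _se[x]

display "xtreg: b = " %7.4f b_xt "  SE = " %7.4f se_xt
display "areg:  b = " %7.4f b_areg "  SE = " %7.4f se_areg
```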

Logistic Regression

Recipe 6

Binary outcome models with marginal effects

stata
* Logistic regression for binary outcomes
clear all
use "binary_outcome.dta", clear

* Basic logistic regression
logistic employed educ exper, vce(robust)

* Marginal effects at means
margins, dydx(*) atmeans

* Average marginal effects
margins, dydx(*)

* Predicted probabilities by education level
margins, at(educ=(10(2)18)) atmeans

* Goodness of fit
estat classification
estat gof, group(10)

Prompt Modification Patterns

Common patterns for refining and adjusting your prompts to get exactly what you need.

Scope Control

  • Return one Stata code block only; robust SEs; no graphs; minimal comments.
  • Include data loading, estimation, and basic diagnostics only.
  • Skip exploratory analysis; focus on the main regression.

Schema Matching

  • Match these exact variable names: [paste describe output or list]
  • Use my dataset structure: wage (numeric), educ (numeric), id (string).
  • Variables are already cleaned; no missing value handling needed.

Estimator Swapping

  • Rewrite using ivregress 2sls with instrument Z; include first-stage diagnostics.
  • Convert to fixed effects: reghdfe with unit and year absorption.
  • Change to logistic regression with marginal effects.

Standard Errors

  • Cluster at id level; two-way FE for id & year.
  • Use robust standard errors for cross-sectional data.
  • Bootstrap standard errors with 1000 replications.

Quality Control Checklist

Before Publishing

Systematic verification steps to ensure your code is correct, robust, and reproducible.

Pre-Publication Checklist

  • Variable names in code match dataset exactly
  • Sample restrictions are appropriate and documented
  • Standard error choice matches research design
  • IV diagnostics: first-stage F-stat > 10
  • DiD: check pre-trends with event study
  • RD: validate bandwidth and manipulation tests
  • Panel: test for serial correlation and choose FE vs RE
  • Reproducibility: set seed, version, explicit paths
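
Some of these checks can be written into the do-file itself so that it stops when something is off. This is a sketch of the first two items only, assuming the nlsw88 example dataset that ships with Stata; the variable names come from that dataset and n_miss is a hypothetical scalar name.

```stata
* Example data shipped with Stata (a stand-in for your own file)
sysuse nlsw88, clear

* 1. Every variable the code uses exists in the dataset
foreach v in wage grade ttl_exp idcode {
    confirm variable `v'
}

* 2. The identifier is unique and the sample restriction is documented
isid idcode
count if missing(wage, grade, ttl_exp)
scalar n_miss = r(N)
display "Observations excluded for missing values: " n_miss

* The estimation sample should equal the full sample minus those rows
regress wage grade ttl_exp, vce(robust)
assert e(N) == _N - n_miss
```

A failed confirm, isid, or assert halts execution, which turns a silent mismatch into a visible error.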

Additional Resources

Helpful tools, packages, and references to enhance your econometric workflow.

Essential Packages

  • reghdfe - High-dimensional FE
  • rdrobust - RD estimation
  • estout - Table formatting
  • ietoolkit - Impact evaluation

Best Practices

  • Keep dated do-files
  • Document data sources
  • Set project-root paths
  • Version control with Git

Ethics & Attribution

  • You verify all results
  • Cite your code, not the AI
  • Document methodology clearly
  • Share replication files