Chain-of-Thought Prompting — Complete Guide with Examples

Chain-of-Thought Prompting — a practical guide for developers 🧭

Short summary: Chain-of-Thought (CoT) prompting asks a model to show its intermediate reasoning steps. For many multi-step tasks (math, logic, planning) CoT can dramatically improve correctness. Use CoT with sampling / ensembling (self-consistency), problem decomposition (least-to-most), or exploration (tree-of-thoughts) to get better results — but expect higher cost, latency, and occasional hallucinated steps. (arXiv)


1) What is Chain-of-Thought prompting?

Definition (short): a prompting pattern that explicitly asks the model to produce intermediate reasoning steps (a "chain" of steps) before giving the final answer.

Why it works (intuition): forcing the model to decompose its reasoning into sub-steps keeps it from collapsing a complex problem into a single surface reply, and gives it room to surface the intermediate facts that lead to the correct answer. The effect is strongest in large LLMs. (arXiv)


2) Key research / methods you should know (developer TL;DR)

  • Chain-of-Thought (CoT) — original systematic study showing CoT improves multi-step reasoning on tasks like GSM8K. Use few-shot examples that include stepwise reasoning. (arXiv)
  • Self-Consistency — instead of one greedy CoT sample, sample multiple reasoning chains and pick the most consistent final answer (ensemble over sampled chains). Often improves accuracy. (arXiv)
  • Least-to-Most (L2M) — decompose a complex problem into subproblems, solve sequentially. Helpful for compositional or symbolic tasks. (arXiv)
  • Tree of Thoughts (ToT) — generalizes CoT into a search process: explore several candidate “thoughts” (partial solutions), evaluate, backtrack/lookahead. Great for tasks needing planning/exploration. (arXiv)

(Each of these is a prompting/decoding strategy rather than a model fine-tune; they layer on top of the same LLM APIs.)


3) Concrete prompt patterns (copy-paste friendly — no runnable code)

These are plain text prompt templates you can paste into any LLM prompt field.

A. Few-shot Chain-of-Thought (CoT)

Q: [example question]
A: Let's think step by step. 1) [step1]. 2) [step2]. ... Therefore, [final answer].

Q: [new question]
A: Let's think step by step.

Use 2–8 examples where each example shows the stepwise reasoning and final answer. The phrase “Let’s think step by step” is a known effective trigger for many models. (arXiv)
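The template above is easy to assemble programmatically. Here is a minimal sketch of a prompt builder; the `build_cot_prompt` helper and the arithmetic example inside it are illustrative inventions, not from any benchmark or library:

```python
# Sketch: assemble a few-shot CoT prompt from (question, steps, answer) examples.
# All names and example content here are hypothetical, for illustration only.

def build_cot_prompt(examples, new_question):
    """Format few-shot examples with stepwise reasoning, then append the new question."""
    parts = []
    for question, steps, answer in examples:
        reasoning = " ".join(f"{i}) {s}." for i, s in enumerate(steps, 1))
        parts.append(f"Q: {question}\nA: Let's think step by step. "
                     f"{reasoning} Therefore, {answer}.")
    # The new question ends with the trigger phrase so the model continues the chain.
    parts.append(f"Q: {new_question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

examples = [("What is 2 + 2 * 3?",
             ["Multiplication binds tighter, so 2 * 3 = 6",
              "Then 2 + 6 = 8"],
             "the answer is 8")]
prompt = build_cot_prompt(examples, "What is 5 + 4 * 2?")
```

Keeping the examples in data (rather than a hard-coded string) makes it easy to swap example sets per task and keep formatting consistent across them.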

B. Self-Consistency (sampling + majority vote) — usage pattern

Prompt: [CoT prompt from above]
Decoding: sample N (e.g., 20) chains with temperature > 0.7
Post-process: extract the final answer from each chain, then select the most frequent answer (argmax frequency).

This uses an ensemble over different sampled chains to reduce brittle single-chain errors. (arXiv)
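The vote itself is simple to implement. A minimal sketch, assuming the final answer can be pulled out of each chain with a regex (real systems often ask the model for a fixed "Answer: ..." format instead); `extract_final_answer` and `self_consistency` are hypothetical helper names:

```python
from collections import Counter
import re

def extract_final_answer(chain):
    """Pull the last number (or fraction) from a reasoning chain — a simple heuristic."""
    matches = re.findall(r"-?\d+(?:/\d+)?", chain)
    return matches[-1] if matches else None

def self_consistency(chains):
    """Majority vote over the final answers extracted from sampled chains."""
    answers = [a for a in (extract_final_answer(c) for c in chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Three sampled chains for the same question; one made an arithmetic slip
# and is outvoted by the two consistent chains.
chains = [
    "3/5 times 1/2 is 3/10. Answer: 3/10",
    "First 3/5, then 2/4 = 1/2, product 3/10. Answer: 3/10",
    "3/5 * 2/5 = 6/25. Answer: 6/25",
]
result = self_consistency(chains)
```

The key design point is that the vote is over *final answers*, not whole chains: different reasoning paths that converge on the same answer all count toward it.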

C. Least-to-Most (decomposition)

Prompt step 1: Decompose the problem: list subproblems.
Prompt step 2: For each subproblem in order, ask the model to solve it using previous answers as context.

Works well for structured multi-step problems (programming puzzles, symbolic math). (arXiv)
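The control flow is a simple loop over subproblems, threading earlier answers into later calls. A runnable sketch; `solve` is a hypothetical stand-in for a real LLM call:

```python
# Least-to-most loop: solve subproblems in order, feeding earlier answers
# back as context for later ones. `solve` is a stub, not a real API.

def solve(subproblem, context):
    # Hypothetical stand-in: a real system would prompt an LLM with the
    # subproblem plus the accumulated (subproblem, answer) pairs in `context`.
    return f"answer({subproblem})"

def least_to_most(subproblems):
    context = []
    for sub in subproblems:
        answer = solve(sub, context)
        context.append((sub, answer))   # later steps can see earlier answers
    return context[-1][1]               # the last subproblem's answer is the result

result = least_to_most(["parse input", "handle edge cases", "apply algorithm"])
```

The important property is the accumulating `context`: each call sees everything solved so far, which is what makes the decomposition compositional rather than a set of independent queries.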

D. Tree of Thoughts (conceptual flow)

1) Initialize a set of candidate partial solutions (thoughts).
2) Iteratively expand a subset of promising thoughts (use the model to propose next steps).
3) Evaluate candidates with a value function or heuristic (use the model or rules).
4) Keep the top-k, backtrack at dead ends, and stop when the goal is reached.

Treat the LLM as a generator + evaluator; orchestrate search in your application code. (arXiv)
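The search loop above can be sketched as a small beam search. This is a toy, assuming `propose` and `score` would be LLM calls in a real system; here they are string-building stubs so the loop runs end to end:

```python
# Minimal tree-of-thoughts-style beam search. `propose` (generator) and
# `score` (evaluator) are hypothetical stubs standing in for LLM calls.

def propose(thought):
    """Expand one partial solution into candidate next steps (stub)."""
    return [thought + "a", thought + "b"]

def score(thought):
    """Heuristic value of a partial solution (stub: prefer more 'a's)."""
    return thought.count("a")

def tree_of_thoughts(start, depth=3, beam=2):
    frontier = [start]
    for _ in range(depth):
        candidates = [t for thought in frontier for t in propose(thought)]
        # Keep the top-k by score; low-scoring branches fall off the beam,
        # which is what gives the search its implicit backtracking.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

best = tree_of_thoughts("")
```

Swapping the stubs for model calls (and `score` for a "rate this partial solution" prompt) turns this skeleton into the generator + evaluator orchestration described above.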


4) Examples (short & practical)

Math word problem (CoT few-shot):

Q: A bag has 3 red and 2 blue balls. If you draw 2 without replacement, what's P(both red)?
A: Let's think step by step. P(first red) = 3/5. After one red is removed, 2 red remain out of 4, so P(second red) = 2/4 = 1/2. Multiply: (3/5) * (1/2) = 3/10. Answer: 3/10.

Q: [new problem]
A: Let's think step by step.
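This example also illustrates the "verify the chain programmatically" advice later in this guide: the intermediate steps are exact fractions, so they can be checked symbolically rather than trusted:

```python
from fractions import Fraction

# Check the chain's arithmetic: P(both red) when drawing 2 balls
# without replacement from 3 red + 2 blue.
p_first_red = Fraction(3, 5)       # 3 red out of 5 balls
p_second_red = Fraction(2, 4)      # 2 red remain out of 4 balls
p_both = p_first_red * p_second_red   # (3/5) * (1/2) = 3/10
```

Exact-arithmetic checks like this are cheap and catch the most common CoT failure on math problems: a chain whose steps look plausible but whose numbers don't multiply out.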

Symbolic / programming reasoning (Least-to-Most):

  1. Prompt the model to break the spec into subproblems (parse input, edge cases, algorithm).
  2. Solve each subproblem sequentially and synthesize final solution.

Planning / puzzle (Tree of Thoughts sketch):

  • Use model to propose 3 candidate moves; evaluate each via a heuristic prompt; expand best two; repeat until solved.

5) When to use which method (scenarios)

  • Use CoT for: math word problems, multi-step logical reasoning, stepwise explanations.
  • Add Self-Consistency when single outputs are noisy but answers should be consistent — e.g., arithmetic or symbolic tasks.
  • Use Least-to-Most when a task naturally decomposes (complex coding tasks, multi-part QA).
  • Use Tree-of-Thoughts when the problem requires search, planning, or backtracking (games, puzzles, creative planning). (arXiv)

6) Practical engineering considerations (ML-engineer checklist)

  • Latency & cost: CoT outputs are longer → higher token cost and latency. Self-consistency multiplies queries. Budget accordingly.
  • Throughput: Batch sampling for self-consistency; parallelize multiple sampled chains.
  • Determinism: CoT with greedy decoding is deterministic; sampling + self-consistency is stochastic — log seeds and samples.
  • Monitoring: Track final answer accuracy and also chain plausibility (e.g., heuristics that detect impossible intermediate steps). Log few example chains for audit.
  • Safety / hallucination: Intermediate steps can be plausible but incorrect. Don’t rely on a chain being “true” unless verified. Add unit checks or symbolic validation where possible.
  • Evaluation: Evaluate both answer correctness and chain fidelity where relevant (e.g., grade intermediate steps).
  • API orchestration: Implement CoT orchestration in your backend: prompting templates manager, sampler, answer aggregator, and retry logic.
  • Scaling tip: For high-volume needs, consider distilling CoT behavior into a smaller model via supervised fine-tuning or RAG + verifier to reduce cost. (Research shows distillation can help but may lose some reasoning ability.)

7) Pitfalls, limitations & mitigations

  • Hallucinated steps: models invent steps confidently. Mitigate by programmatic checks, symbolic verification, or re-asking the model to justify a given step.
  • Not a silver bullet: CoT helps more for large models; smaller models may not benefit. Test model size dependency. (arXiv)
  • Over-confidence: CoT explanations make models sound confident — use external validators.
  • Long reasoning horizon: CoT chains can be long and brittle. Consider hierarchical decomposition (L2M) or pruning (ToT). (arXiv)

8) Tips & tricks (practical)

  • Use 2–8 high-quality few-shot examples that mirror the target task.
  • Use explicit step tokens (1), (2), … or “Step 1: … Step 2: …” to structure output.
  • Combine CoT + Self-Consistency: sample multiple CoT outputs, majority vote final answer. Often wins. (arXiv)
  • For deterministic needs, prefer greedy CoT plus post-verification rather than sampling.
  • When cost matters, try selective CoT: ask for steps only when confidence is low (use model confidence proxy or a lightweight classifier).
  • Use evaluation prompts: ask the model to score or check its own steps (but verify with external logic where possible).
  • Keep prompts minimal but consistent; avoid mixing styles across examples.

9) Example orchestration architecture (high level)

  1. Prompt Manager: stores templates (CoT few-shot, L2M templates, evaluation prompts).
  2. Sampler/Executor: issues N sampled CoT calls in parallel (for self-consistency) or runs ToT loop.
  3. Aggregator/Verifier: extracts final answers, runs checks (unit tests, constraints), picks most consistent.
  4. Logger/Monitor: stores chains, final answers, metrics (accuracy, latency, token usage).
  5. Retraining Trigger: if drift or repeated failures, collect failing prompts+chains to fine-tune or create validator models.
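The first four components can be wired together in a few dozen lines. A minimal sketch, assuming a single class per pipeline; `CoTPipeline` and `call_model` are hypothetical names, and the fake model stands in for a real LLM client:

```python
from collections import Counter

# Toy end-to-end pipeline wiring together the components above:
# prompt manager (template), sampler, aggregator/verifier, and logger.
# `call_model` is a hypothetical stand-in for a real LLM client.

class CoTPipeline:
    def __init__(self, template, call_model, n_samples=5):
        self.template = template          # Prompt Manager: one stored template
        self.call_model = call_model
        self.n_samples = n_samples
        self.log = []                     # Logger: chains + answers for audit

    def run(self, question):
        prompt = self.template.format(question=question)
        # Sampler: N chains (a real system would sample with temperature > 0,
        # and would issue these calls in parallel).
        chains = [self.call_model(prompt) for _ in range(self.n_samples)]
        # Aggregator: extract each final answer, majority-vote across chains.
        answers = [c.rsplit("Answer:", 1)[-1].strip() for c in chains]
        final = Counter(answers).most_common(1)[0][0]
        self.log.append({"question": question, "chains": chains, "final": final})
        return final

# Fake model for illustration: always returns the same fixed chain.
fake_model = lambda prompt: "Step 1: ... Step 2: ... Answer: 42"
pipeline = CoTPipeline("Q: {question}\nA: Let's think step by step.", fake_model)
out = pipeline.run("What is 6 * 7?")
```

The retraining trigger (component 5) would then consume `pipeline.log`, filtering for questions where the vote was split or external checks failed.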

10) Quick reference — 1-line recipes

  • Improve math accuracy: Few-shot CoT + self-consistency (sample 10–40). (arXiv)
  • Complex multi-part tasks: Least-to-Most decomposition. (arXiv)
  • Planning/puzzles: Tree of Thoughts search orchestration. (arXiv)

11) Further reading (starter papers)

  • Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022). (arXiv)
  • Wang et al., Self-Consistency Improves Chain of Thought Reasoning (2022/ICLR). (arXiv)
  • Zhou et al., Least-to-Most Prompting Enables Complex Reasoning (2022/ICLR). (arXiv)
  • Yao et al., Tree of Thoughts: Deliberate Problem Solving with Large Language Models (2023/NeurIPS). (arXiv)

Final notes

  • Use CoT and its successors as tools in your pipeline — combine with verifiers and validators.
