OpenAI GPT-5 Debuts With “PhD-Level” Claims, Coding and Health Advances, Amid Early Spelling and Geography Errors

OpenAI has released GPT-5 to all ChatGPT accounts, positioning it as “PhD-level” intelligence with significant advances in reasoning, coding, multimodal capabilities, and health support.

The unified system integrates a fast general model, a deeper reasoning variant (“GPT-5 Thinking”), and a real-time router that selects between them per query. In health, GPT-5 posts the highest scores yet on OpenAI’s HealthBench, acting as an “active thought partner” that helps users understand results, ask better questions, and make informed decisions.

Despite marketing promises of fewer hallucinations, better honesty, and more human-like accuracy, early user tests revealed elementary spelling and geography errors. Experts caution that the rollout mixes genuine technical progress with over-hype, amid ongoing debates over safety, regulation, and commercial use.

OpenAI GPT-5 Debuts - Image Credit - OpenAI

OpenAI GPT-5 Debuts – Key Points

  • Launch & Availability
    • Launched Thursday, 7 August 2025, and rolled out immediately to all 700 million weekly ChatGPT users.
    • Replaces GPT-4o, OpenAI o3, o4-mini, GPT-4.1, and GPT-4.5 as the default for signed-in users.
    • Unified architecture:
      • Fast general model for everyday queries.
      • “GPT-5 Thinking” for complex problems.
      • Real-time router decides which model to use based on query complexity, context, tool needs, and user intent (e.g., “think hard about this”).
    • Tiered access: Plus users get more usage; Pro subscribers get unlimited GPT-5 and GPT-5 Pro with extended reasoning.
    • Free users shift to GPT-5 mini after reaching usage limits.
  • High-Profile Early Errors
    • Spelling mistakes: Claimed “blueberry” contains three Bs (it has two) and that “Northern Territory” contains three Rs instead of five.
    • Geography errors: Invented “New Jefst” and “Mitroinia,” misspelled Arizona as “Krizona” and Vermont as “Vermoni,” double-listed California.
    • Guardian Australia confirmed “Northan Territor” on AI-generated maps.
  • Core Capabilities & Benchmarks
    • Performance: 94.6% on AIME 2025 (math), 74.9% on SWE-bench Verified (coding), 84.2% on MMMU (multimodal), 46.2% on HealthBench Hard (health).
    • Hallucinations: ~45% fewer than GPT-4o in real-world traffic; ~80% fewer than OpenAI o3 when using reasoning mode.
    • Honesty: Deception rates halved vs o3 (2.1% vs 4.8% in production).
    • Safety training: “Safe completions” approach balances helpfulness with risk controls, especially for dual-use domains like biology and chemistry.
  • Specialised Performance Areas
    • Health:
      • Highest OpenAI health benchmark scores to date.
      • Designed as an “active thought partner,” not a medical professional, helping users interpret results, flag potential issues, and frame better questions.
      • Adapts responses to user’s context, knowledge level, and geography.
      • Not HIPAA-compliant; Mashable notes ongoing privacy and efficacy concerns.
      • Mental health updates include better recognition of emotional distress, break reminders for users in extended sessions, and reduced sycophancy.
    • Coding: Generates full-stack apps, games, and websites from a single prompt with improved aesthetics (spacing, typography, white space).
    • Writing: Excels at structurally complex forms (e.g., unrhymed iambic pentameter, free verse) with literary depth and cultural nuance.
    • Multimodal: Higher accuracy in interpreting and reasoning over images, diagrams, and video sequences.
  • Industry Context & Rival Claims
    • Elon Musk (Grok) claimed “better than PhD level in everything” in July 2025.
    • Anthropic revoked OpenAI’s API access ahead of launch, citing ToS violations; OpenAI called such cross-evaluation “industry standard.”
  • Social Media Reactions
    • Bluesky posts documented humorous but incorrect outputs.
    • Public criticism focused on the gulf between “expert-level” marketing and observed factual reliability.
  • Ethics, Regulation & Commercial Concerns
    • Prof Carissa Véliz (Institute for Ethics in AI) questioned both profitability and the reality behind marketing.
    • Gaia Marcus (Ada Lovelace Institute) urged comprehensive AI regulation.
    • Biological safety: “GPT-5 Thinking” treated as High-capability in bio/chem domains, with 5,000+ hours of red-teaming and layered safeguards.
  • Identified Weakness
    • Dan Shipper (CEO, Every) noted hallucinations when reasoning mode was not engaged.
    • BBC’s Marc Cieslak described the experience as evolutionary rather than revolutionary.
  • ChatGPT Policy Changes & Customisation
    • Stops short of definitive answers to sensitive personal-life questions, instead facilitating the user’s own reflection.
    • Four preset personalities — Cynic, Robot, Listener, Nerd — for varied conversational styles, designed to cut sycophancy by over half.
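The routing behaviour described under Launch & Availability can be pictured with a short sketch. Everything here — the model names, the signals, and the thresholds — is an illustrative assumption for explanation only, not OpenAI’s actual router logic:

```python
# Hypothetical sketch of a query router: a fast model for everyday queries,
# a deeper "thinking" model when complexity or explicit user intent calls
# for it. Names and heuristics are illustrative assumptions.

def route(query: str) -> str:
    """Pick a model tier from coarse signals in the query text."""
    wants_reasoning = "think hard" in query.lower()  # explicit user intent
    looks_complex = len(query.split()) > 50          # crude complexity proxy
    if wants_reasoning or looks_complex:
        return "gpt-5-thinking"
    return "gpt-5-main"

print(route("What's the capital of France?"))        # gpt-5-main
print(route("Think hard about this proof sketch."))  # gpt-5-thinking
```

A production router would presumably weigh far richer signals (conversation context, tool requirements, measured error rates), but the shape of the decision is the same: classify the query, then dispatch to the cheapest model expected to handle it.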
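The letter-count errors listed above are trivially checkable by machine, which is part of why they drew ridicule. A few lines of Python (purely illustrative) confirm the correct counts GPT-5 reportedly got wrong:

```python
# Verify the letter counts GPT-5 was reported to have miscounted.
def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a string."""
    return text.lower().count(letter.lower())

print(count_letter("blueberry", "b"))           # 2, not the three claimed
print(count_letter("Northern Territory", "r"))  # 5, not three
```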

Why This Matters:

GPT-5 combines tangible technical progress — from benchmark-leading coding and multimodal reasoning to expanded, context-aware health support — with unresolved challenges in factual reliability and public trust. While its “active thought partner” health role could shift how users prepare for medical interactions, privacy gaps, mental health safeguards, and regulatory lag remain critical issues. The launch highlights the dual track of AI advancement: breakthrough capabilities paired with persistent governance and authenticity debates.
