The 12 mistakes: each with a real failure scenario and a fix
Mistake 1: No monitoring log

The scenario: A marketing manager builds an email classification and labelling automation, tests it on 15 examples, watches it work correctly for 3 days, and stops reviewing the outputs. Three weeks later, she notices her "SALES" label is getting almost no emails, which is unusual. She digs in and discovers that her email provider changed the format of email headers, causing the automation's sender filter to misclassify most emails as SPAM (an edge case she had not anticipated). For three weeks, every sales enquiry had been routed to her archive folder instead of her inbox. She had missed 23 inbound sales leads.
Why it happens: Building the monitoring log feels like unnecessary work when the automation is working correctly. The value of monitoring is invisible during good times; it only becomes obvious when something goes wrong.
The fix: Build the monitoring log before go-live as a non-negotiable step. A Google Sheets module logging every run (timestamp, input summary, AI output, action taken) takes 20 minutes to add. Set a weekly calendar reminder to review it. The monitoring log is not optional: it is the difference between catching a failure in week 1 and not catching it until it has run for a month.
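In Make.com this is just a Google Sheets "Add a row" module. As a minimal local sketch of the same idea (the CSV file, column order, and `log_run` helper are illustrative, not part of any platform):

```python
import csv
from datetime import datetime, timezone

LOG_PATH = "automation_log.csv"  # stand-in for the Google Sheet

def log_run(input_summary: str, ai_output: str, action_taken: str,
            path: str = LOG_PATH) -> None:
    """Append one row per automation run: timestamp, input, output, action."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            input_summary,
            ai_output,
            action_taken,
        ])

# Example run entry
log_run("Email from acme.com re: pricing", "SALES", "labelled + moved to Sales")
```

Whatever the storage, the point is the same: one row per run, written automatically, reviewed weekly.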
Mistake 2: Skipping shadow mode

The scenario: A developer builds a customer support response drafting automation. He tests it with 20 carefully selected examples from his company's email history: all clean, well-structured support requests that he has already handled and knows the correct responses for. The automation performs beautifully on every example. He deploys directly to production. Within 48 hours, three customers have received draft responses that were factually wrong about the product's features, because real customer emails include slang, abbreviations, and references to third-party integrations that the training examples never included.
Why it happens: Test cases are selected by the person who knows the system, so they naturally tend toward representative, well-formed examples. Real production inputs are messier and more varied than any curated test set.
The fix: Shadow mode for at least 5 working days on real inputs is not optional. Route all outputs to a review log rather than taking live actions. Real-world inputs will expose failure modes that curated test sets miss; exposing them is the purpose of shadow mode. The 5–10 days of shadow mode is an investment that prevents 80% of production failures.
Mistake 3: Vague scoring criteria

The scenario: A freelancer builds a lead scoring automation with a system prompt that reads: "Rate this lead from 1 to 10 based on how good a fit they are for our business." The automation produces scores between 2 and 9 for similar leads, with no clear pattern. When she reviews the monitoring log, she sees that the AI is essentially making up its own criteria for what constitutes a "good fit": sometimes weighting company size heavily, sometimes the contact's role, sometimes the message sentiment. The scores are not meaningful because the criteria are not defined.
Why it happens: Vague prompts that seem reasonable in natural language leave too much interpretation space for the model. The model's "best guess" at unstated criteria varies with temperature and context.
The fix: Your prompt must define the exact criteria and how to weight them. "Rate 1-10 based on ICP fit. Score 8-10 only if ALL THREE are true: (1) company is in SaaS or professional services, (2) contact holds VP or Director level or above, (3) message specifically mentions a problem we solve." Explicit criteria produce consistent, auditable scores. Vague criteria produce noise.
Mistake 4: Not enforcing the output format

The scenario: A developer builds an invoice extraction workflow. His prompt says "return the data as JSON." Most of the time, the model returns clean JSON. But occasionally (especially on invoices with unusual formatting, or when the model is uncertain) it returns: "Here is the extracted data: ```json { ... } ```" or "I've extracted the following information: { ... }". The JSON parser in his automation fails on these responses, causing the affected invoices to be silently dropped rather than routed to error handling (which he also skipped).
Why it happens: Language models are trained to be helpful and conversational; adding preambles and markdown formatting is a natural tendency that the prompt's request for "JSON" does not fully suppress without enforcement.
The fix: Use the OpenAI API's response_format: {"type": "json_object"} parameter, which enforces valid JSON at the API level and eliminates markdown fences and preambles entirely. Also add "Return ONLY the JSON object. Begin your response with { and end with }. No other text." to your system prompt as a belt-and-suspenders measure.
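Even with response_format enabled, defensive parsing costs little and turns "silently dropped" into "handled". A hedged sketch of a fallback parser (the function name and regexes are illustrative) that tolerates fences and preambles:

```python
import json
import re

def parse_json_response(raw: str) -> dict:
    """Parse a model response that should be JSON but may carry a
    conversational preamble or markdown fences. Prefer enforcing the
    format at the API level; this is a last-resort safety net."""
    text = raw.strip()
    # Strip ```json ... ``` fences if present
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1).strip()
    # Fall back to the first {...} span in the text
    if not text.startswith("{"):
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if not match:
            raise ValueError("No JSON object found in response")
        text = match.group(0)
    return json.loads(text)
```

On failure it raises instead of returning garbage, so the invoice can be routed to human review rather than vanishing.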
Mistake 5: Testing only the happy path

The scenario: A business owner tests her email classification automation on 20 typical customer emails. All classify correctly. She deploys. In production, she discovers her automation fails (misclassifies as "OTHER") on: emails in Hindi and Telugu from clients based in India, very short emails with only a subject line and no body, automated delivery receipt emails, and emails from her own email address when she tests sending from her phone. None of these were in her test set; all of them occur in her real production inbox.
The fix: Your test set must include deliberate edge cases, not just typical examples. Minimum edge cases to include: empty body emails, very short emails (1-2 sentences), emails from automated systems (noreply@, notifications@), emails in languages you receive, emails from your own address, and at least 2 examples specifically designed to challenge your classification boundaries (emails that could reasonably fit multiple categories). Add all production failure cases to your test set for regression testing.
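The edge-case list above can be encoded as a small regression set that runs after every prompt change. A sketch, with illustrative categories and emails; `classify` stands in for whatever function calls your model:

```python
# Each entry pairs a deliberately awkward input with the label a human
# would assign. Labels and examples are illustrative placeholders.
EDGE_CASES = [
    {"email": "", "expected": "OTHER"},                       # empty body
    {"email": "ok thanks", "expected": "OTHER"},              # very short
    {"email": "Your parcel was delivered at 14:02.",
     "expected": "NOTIFICATION"},                             # automated sender
    {"email": "Hum aapke product ke baare mein jaanna chahte hain.",
     "expected": "SALES"},                                    # non-English
    {"email": "Invoice attached; also interested in an upgrade.",
     "expected": "SALES"},                                    # boundary case
]

def pass_rate(classify, cases=EDGE_CASES) -> float:
    """Run the classifier over the edge cases; return the fraction correct."""
    correct = sum(1 for c in cases if classify(c["email"]) == c["expected"])
    return correct / len(cases)
```

Every production failure you later discover becomes one more entry in `EDGE_CASES`.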
Mistake 6: No error handling

The scenario: An operator's Make.com lead scoring scenario has no error handling. The OpenAI API has a brief outage. Make.com's scenario fails on 47 leads that arrived during the outage. Make.com retries 3 times and then stops processing those items. The operator never knows: there are no failed execution alerts configured, no monitoring log, and the leads' Google Sheet rows simply stay empty in the Score and Tier columns. 47 qualified leads wait unscored in a spreadsheet with no action taken.
The fix: Configure error handling before go-live: (1) Add Make.com's error notification (Settings → Email notifications for scenario errors); (2) Add an error handler route in your scenario: when any module fails, route to a "human review" Google Sheet row with the input data and error message; (3) Add retry logic for transient failures; Make.com supports automatic retries for failed scenarios.
Mistake 7: Not documenting the manual process first

The scenario: A team lead wants to automate meeting summaries. She builds a prompt that asks the AI to "summarise the meeting." The AI produces summaries, but they are not what her team considers a meeting summary. Her team's meeting summaries follow a specific format: status updates on specific project areas, blockers with owner names, next actions for specific roles. None of this was in the prompt because she had never written down exactly what her meeting summary format was. She spent 4 hours building an automation that produces summaries her team will not use.
The fix: Before writing your system prompt, spend 30 minutes writing down exactly how you currently do the task manually: every decision, every format preference, every edge case. This documentation becomes your system prompt foundation. If you cannot document the process clearly enough to hand it to a new employee, you cannot prompt an AI to do it reliably either.
Mistake 8: Building everything at once

The scenario: An enthusiastic beginner reads about AI automation and decides to build his full suite at once: email automation, lead scoring, meeting summaries, weekly reports, and social media content, all simultaneously. After 3 weekends of building, he has 5 automations in various states of completeness. None of them are properly tested. Two have significant bugs. He is not sure which bug belongs to which automation. His prompt for email classification has been modified so many times while building other things that he no longer has a reliable baseline. He abandons the project.
The fix: Build one automation at a time. Complete the full cycle: design → build → test → shadow mode → go live → monitor → optimise. Then build the next one. Each completed automation teaches you something that makes the next one significantly faster. Sequential building with learning also builds the habit infrastructure (monitoring, documentation, testing) that parallel building under time pressure skips.
Mistake 9: Defaulting to the most expensive model

The scenario: A business owner builds an email classification automation using GPT-4o because it is "the best model." Processing 500 emails per month costs $3–$7 in API fees instead of $0.10–$0.50 with GPT-4o mini. After 3 months, she has spent $9–$21 on classification that could have cost $0.30–$1.50: a 30x overspend for identical classification quality (GPT-4o mini and GPT-4o produce essentially identical classification results for well-structured prompts).
The fix: Test your automation on 20 examples with both GPT-4o mini and GPT-4o. Compare the outputs. For classification tasks with clear criteria, you will almost always find the outputs are indistinguishable; use GPT-4o mini and save 30x on that automation's API costs. Reserve GPT-4o for tasks where you can demonstrate the quality difference matters: generating nuanced long-form content, processing complex documents, or tasks where GPT-4o mini consistently produces outputs requiring significant editing.
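The overspend is simple arithmetic on token prices. A sketch with placeholder per-million-token rates (check current OpenAI pricing; the rates and the tokens-per-email figure here are assumptions for illustration only):

```python
# Illustrative monthly cost comparison for the 500-emails scenario above.
EMAILS_PER_MONTH = 500
TOKENS_PER_EMAIL = 1_200   # assumed average, prompt + completion combined

def monthly_cost(price_per_million_tokens: float) -> float:
    """Dollars per month at a given blended price per 1M tokens."""
    return (EMAILS_PER_MONTH * TOKENS_PER_EMAIL
            * price_per_million_tokens / 1_000_000)

mini_cost = monthly_cost(0.60)   # assumed blended GPT-4o mini rate
full_cost = monthly_cost(10.00)  # assumed blended GPT-4o rate
```

At these placeholder rates the mini model costs cents per month and the larger model dollars; the exact ratio shifts with pricing, but the order-of-magnitude gap is the point.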
Mistake 10: Removing human review too early

The scenario: A sales manager has a lead scoring automation that he is happy with after 1 week of human review. His approval rate is 82%; good enough, he decides. He removes the review step. In week 3, the AI starts consistently misclassifying a new type of lead (referrals from a specific partner whose emails have a distinctive format the prompt did not anticipate). Without the review step, the misclassification runs for 9 days before he notices because the hot referral leads are not appearing in his CRM pipeline.
The fix: The criteria for removing mandatory human review: 85%+ approval rate without editing for 10 consecutive working days AND a monitoring alert in place that will fire if approval rate drops below 75%. Never remove review entirely for consequential automations: transition from 100% review to sample-based review (20% random + all low-confidence flags). The sample review takes 5 minutes per day and catches systematic failures before they compound.
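Those thresholds (85% for 10 days, then 20% random sampling plus low-confidence flags) can be mechanised so the decision is not made on gut feel. A sketch, where the 0.7 confidence cut-off is an assumed value, not from the article:

```python
import random

def can_relax_review(daily_approval_rates: list[float]) -> bool:
    """True once approval without edits has been >= 85% for 10
    consecutive working days (the threshold suggested above)."""
    last_ten = daily_approval_rates[-10:]
    return len(last_ten) == 10 and all(r >= 0.85 for r in last_ten)

def select_for_review(items: list[dict], sample_frac: float = 0.20) -> list[dict]:
    """Sample-based review: every low-confidence item plus a random 20%
    of the rest. Assumes each item dict carries a 'confidence' field."""
    flagged = [i for i in items if i.get("confidence", 1.0) < 0.7]
    rest = [i for i in items if i not in flagged]
    sampled = random.sample(rest, k=round(len(rest) * sample_frac))
    return flagged + sampled
```

The separate 75% alert belongs in your monitoring log, not here; this only decides what lands in the daily review queue.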
Mistake 11: No prompt version control

The scenario: A content creator has a working content repurposing automation with a well-tuned prompt that consistently produces LinkedIn posts in her voice. She makes a "small improvement" to the prompt: she adds some guidance for how to handle technical content. After the change, the posts start sounding generic, less like her. She wants to roll back. She cannot remember exactly what the old prompt said. Her browser history does not help. She spends 4 hours trying to recreate the previous prompt through trial and error.
The fix: Before making any change to a production prompt, copy the current version into a Google Doc with the date and a brief description of the change you are about to make. A simple naming convention: "Email classification prompt v3 – 2024-11-01 – added Hindi language handling." This takes 60 seconds and provides an instant rollback capability for any prompt change. Treat your prompts like code: version control is not optional.
Mistake 12: Automating the wrong task

The scenario: A financial advisor wants to automate the initial assessment of whether clients' investment portfolio descriptions match their stated risk tolerance. He builds the automation, tests it on 15 examples, and deploys. In production, the AI gives confident assessments that are regularly wrong, not because the prompt is bad but because this task requires understanding regulatory nuance, reading between the lines of how clients describe their situation, and making judgments that depend on context that is not present in the text. The automation produces plausible-sounding but unreliable assessments that the advisor has to re-evaluate anyway.
Why it happens: The failure was not in the implementation: it was in the task selection. Financial risk assessment is a judgment task that requires human expertise and regulatory accountability. It is not a task AI automation is reliable enough for at the current state of the technology.
The fix: Run the one-hour feasibility test before building anything. Test 20+ real examples in ChatGPT. If fewer than 70% produce usable outputs even with significant prompt iteration, or if the task requires judgment that you genuinely cannot express as articulable rules, the task is not automation-ready. Build the automation for a different task and revisit this one as AI capabilities develop.
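The 70%-of-20-examples threshold is easy to encode so the go/no-go decision is recorded rather than felt. A minimal sketch; `results` is simply your manual judgement ("usable" or not) for each tested example:

```python
def automation_ready(results: list[bool], threshold: float = 0.70,
                     min_examples: int = 20) -> bool:
    """Apply the one-hour feasibility test: ready only if at least 20
    real examples were tested AND at least 70% produced usable output."""
    if len(results) < min_examples:
        return False  # too few examples to trust the percentage
    return sum(results) / len(results) >= threshold
```

Note the second condition from the fix above (judgment you cannot express as rules) still has to be checked by a human; no score captures it.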
The pattern behind all 12 mistakes
Looking across these 12 mistakes, a clear pattern emerges: they almost all come from one of three failure modes.
Moving too fast. Mistakes 2, 5, 7, 8, and 10 all come from rushing past important preparatory steps because the automation seems to be working and the temptation to ship is strong. The discipline to do shadow mode properly, to test edge cases deliberately, to document before building, to build sequentially: these are the disciplines that separate automation practitioners who have reliable production systems from those who have fragile ones that require constant firefighting.
Assuming the happy path is sufficient. Mistakes 4, 5, and 6 come from designing for how things should work rather than how they actually work in production. Real inputs are messier than test inputs. APIs do fail. Input data is sometimes malformed or empty. Error handling and edge case coverage are not defensive over-engineering: they are the minimum bar for production readiness.
Optimising too early. Mistakes 1, 9, 10, and 11 come from removing oversight and optimising costs before the automation has proven itself in production. Monitor before you optimise. Keep human review longer than feels necessary. Save prompt versions before changing them. These are the habits that make AI automation a reliable asset rather than a fragile dependency.
The pre-launch checklist that prevents all 12 mistakes: see "AI automation checklist: 10 steps before you launch". Every item on it maps directly to one or more of the mistakes above.
Frequently asked questions
Which mistake causes the most damage?

In absolute cost terms, Mistake 1 (no monitoring) and Mistake 2 (no shadow mode) cause the most damage: they are the mistakes that let failures run undetected for weeks, accumulating real consequences (missed leads, wrong emails sent, data entered incorrectly). Mistake 12 (automating the wrong task) wastes the most build time, but at least it usually becomes apparent relatively quickly. Mistake 1 can run invisibly for months before anyone realises something is wrong.
What should I do if I discover one of these mistakes in a live automation?

The specific action depends on the mistake, but the general pattern is: (1) disable the automation immediately to stop the failure from continuing; (2) assess the scope of damage: what was incorrectly processed, who was affected, what data was incorrectly entered or sent; (3) fix the root cause before re-enabling; (4) add the missing safeguard (monitoring, error handling, shadow mode review) before going live again. Do not try to fix a running broken automation. Stop it, fix it, test it, re-deploy it.
How long do the fixes take?

Monitoring log (Mistake 1): 20 minutes. Shadow mode setup (Mistake 2): 30 minutes. Output format enforcement (Mistake 4): 10 minutes to add the API parameter. Error handling (Mistake 6): 30–45 minutes. Prompt versioning system (Mistake 11): 5 minutes to create a Google Doc with a naming convention. The fixes for most of these mistakes are not time-consuming; they feel like overhead in the moment but they are cheap insurance against failures that cost significantly more time to clean up after the fact.
Are there tasks that should never be automated with AI?

Yes. Tasks that require: regulatory judgment or legal accountability (medical diagnosis, legal advice, financial recommendations in regulated contexts); genuine human empathy and relationship continuity (therapy, complex customer escalations, bereavement communication); creative work where distinctiveness and novelty are the entire value (truly original creative concepting, breakthrough research); and any judgment requiring access to information that is not present in the text being processed (context from a long relationship history, body language, tone of voice). For most business automation use cases, these are not the tasks you are trying to automate anyway, but it is worth checking the assumption before building.
Build your first AI automation the right way
The complete AI automation guide (including the full pre-launch checklist, monitoring templates, and step-by-step workflow guides) covers everything you need for production-ready automation from day one.
Read the Complete AI Automation Guide →

ThinkForAI Editorial Team
Every mistake on this list is drawn from real production AI automation failures: either personal experience or cases from practitioners we have worked with. Updated November 2024.


