AI Engineer at Nephew

Fleek

Fleek

Software Engineering, Data Science

New York, NY, USA · Remote

Posted on May 29, 2026

About the Project

Nephew is the chief of staff small business owners can't otherwise afford. It lives in SMS, handles the inbox, manages bookings, follows up on leads, reconciles invoices, sends reminders, files things in the right place. The customer is a plumber, a dentist, a vet, a hairdresser, someone running a real business between jobs, who texts lowercase and short. We're building the agent that does the work they keep meaning to get to.

A small team, a wide surface, and a customer base that's pulling us forward faster than we can build for them.

What You'll Actually Do

Build the agent runtime: The part of Nephew that turns "follow up with the three quotes from last week" into real messages that landed in real inboxes and checks its own work before anything goes out. The agent has memory, acts on its own initiative when the moment calls for it, and stays inside the audience model whether an owner messaged first or not. You'll own how all of that fits together.

Author skills: Skills are the unit of composable behavior, how we teach Nephew to do a new shape of work. Booking flows, invoice chasing, lead triage, daily reflection, whatever the next vertical needs. You'll write them, refine them, and kill the ones that don't earn their tokens.

Build the eval and observability layer: This is where we catch regressions before owners feel them. Build the datasets, score what matters, watch the production traces, and close the loop from "this looked wrong" back into the skill that produced it. If something drifts, you want to know before the owner does.

Ship integrations: Connect Nephew to the tools small businesses actually use. The work is auth flows, webhook plumbing, mapping someone else's schema into ours, and handling the failures real APIs hand you. The proof point is a live owner's calendar on the first try, not a recorded fixture, not a demo account.

Everything else: A small team covers a lot of ground. SMS is the surface now; web, email, and voice are queued behind it. Plenty of the work that matters most this quarter isn't in this list.

The Bar

Something you wrote goes live most days. A skill you authored is the reason an owner stopped manually chasing invoices. When Nephew sends something weird at 7am on a Saturday, you can trace it end to end, runtime, skill, memory, integration and fix it before lunch. The founders are writing less code because you've taken over surfaces they used to own.

Agentic coding has to already be how you work. We're not going to teach you the workflow on the side; the people we're hiring built parts of it.

Day-1 Reality

Day 1: laptop open, repo cloned, system prompt in one tab, production traces in another. You read a real owner conversation from yesterday, the actual SMS thread, the actual memory writes, the actual skill that fired. You find something that's off. A reply that was technically correct but landed wrong. A skill that loaded when it shouldn't have. A memory write that buried something the agent needed.

Week 1: you ship the fix as a PR, it merges, it goes live, and an owner gets the better version the next time they text. You'll have talked to a founder about what to build next. You'll own a surface by the end of the week.

Who You Are

An agent harness has your fingerprints on it. Maybe you wrote one, maybe you tore a working one apart and rebuilt the parts that didn't fit. Either way, you can sit down and explain which trade-offs you made and what you'd do differently now.

You've gotten an agent to check its own work instead of hallucinating success. The flavor doesn't matter, evals, replay, self-critique, post-hoc verification, browser confirmation, whatever closed the loop. What matters is that you've fought that fight and won it more than once.

The tools you reach for include ones you built yourself. Custom commands, internal CLIs, MCP servers, harnesses tuned to your own brain. You don't take the default toolchain as final.

Taste for the audience: You can read a draft SMS and tell whether a plumber between jobs would understand it. You don't add a "decompose goal" flow because the owner doesn't think that way. You feel the audience model in your bones, not as a rule you check against.

Comfortable reading production: You can sit in a trace, follow a long thread of skill calls and memory writes, and find the place where the agent went off the rails. Production data is where you do your best thinking, not a place you visit when something's broken.

Systems instincts: You see where something will fail before it does, and you make the trade-off between robust and fast on purpose, not by accident.

Frontier-aware: You've used the current generation of models, harnesses, and orchestration patterns enough to have a working theory of which ones to reach for and when. The theory updates when the frontier moves.

Range: A morning on the agent runtime, an afternoon on a new skill, an evening on an eval or an integration. There aren't enough of us for specialists.

Speed at altitude: Same day to production is the norm. We don't sit on things. We also don't ship slop, the bar is high and the cycle is short, both at once.

Curious about how this stuff actually works: You've read papers and someone else's prompt files for the same reason you read code: to learn from the people pushing on the same problem.

Remote-first: Wherever you're sharpest. We default to async, ship in public channels, and pull the team together in person a few times a year.

Why This Role Is Different

You and the founders, no one in between. The conversation about what to build next happens between the people who'll build it. There's no PM layer to translate, no ticket queue to live inside.

What you're building is the product itself. Nephew isn't an AI feature on top of some older thing, the agent is what the owner pays for. So a skill that lands cleaner, a memory write that recalls the right thing later, an integration that doesn't drop the ball, those move retention, not some metric two steps removed from it.

The customer talks back, fast and unedited. Small business owners don't soften feedback. If Nephew sent something off, you'll hear about it, often in lowercase, often within the hour. That's the fastest signal you'll ever build against.

Mornings to afternoons, not quarters to quarters. You write a skill before lunch, an owner uses it after, and you know whether it worked by the time you log off.

Even Better If

You've spent serious time on AI engineering already, agents in production, harnesses you understand from the inside, eval infrastructure you've actually used to catch something.

You've shipped something to a non-technical audience and felt the difference. Consumer, SMB, anything where the user doesn't read your error messages.

You've built memory or state systems for agents and have opinions about what works at scale, how to keep what matters, surface it at the right moment, and avoid the failure modes that show up six months in.

Previous founder or early-stage builder. You've worn every hat and liked it.

Open source contributions to the AI tooling we live in.

Tech

Python for the agent and the skills. LangSmith for evals and observability. Pipedream for integrations. The runtime itself is TypeScript and lives in its own repo. You don't need to know all of it coming in. You need to learn fast.

How we work

A handful of people, a lot of trust, almost no process. Whoever owns the surface decides. Your first PR ships in your first week. There's no recurring meeting where we align on alignment; we build, watch what happens, and adjust.

The agent's behavior is version-controlled. If it governs what Nephew does, it's a file you can edit, review, and merge, not a setting you toggle in someone else's UI.

Ownership here means a piece of the product, not a column on a board. The thing you own is something owners actually depend on between jobs. If it breaks, you're the one fixing it. If it lands, it's clear whose work made that happen.

Why Nephew

Small business owners have been waiting for software that meets them where they are. Most of what's been built for them assumes they think like founders. They don't. We do. The thing they actually need is a coworker who can text and we're early enough that the person who shows up next gets to decide a lot of what that coworker becomes.