Senior AI Platform Engineer
Shape the technical foundation of FactualIQ’s enterprise AI platform. Architect scalable infrastructure, define platform standards, and enable mission-critical multi-agent systems for enterprise clients.
About the Role
As a Senior AI Platform Engineer, you’ll architect and build the foundational platform infrastructure that powers FactualIQ’s Decision Engine at enterprise scale. You’ll drive technical architecture decisions, establish platform patterns and best practices, and build the critical systems that enable reliable multi-agent orchestration for Fortune 1000 clients. This is a senior technical role where you’ll shape the platform’s technical direction, solve complex distributed systems challenges, and build infrastructure that supports client-critical AI workflows.
What You’ll Do
- Architect and implement highly scalable multi-agent orchestration platforms that handle thousands of concurrent agent executions with sub-second latency requirements.
- Design advanced state management systems for distributed agent coordination, including checkpoint/recovery mechanisms, distributed locking strategies, and event sourcing architectures for full workflow reproducibility.
- Build sophisticated configuration management systems that enable declarative workflow definitions with versioning, A/B testing capabilities, canary deployments, and automatic rollback mechanisms.
- Architect and implement scalable security models covering data and workflow protection, access control, and service infrastructure security.
- Design and implement advanced observability platforms specifically for AI workflows, including distributed tracing across agent boundaries, custom metrics for LLM performance, cost attribution systems, and automated anomaly detection.
- Create sophisticated evaluation frameworks that combine multiple validation strategies (rule-based, statistical, LLM-as-judge) with automatic performance regression detection and workflow reliability scoring.
- Build intelligent resource optimization systems including predictive scaling for agent workloads, intelligent request routing based on model capabilities, and cost-aware execution planning for LLM inference.
- Design fault-tolerant integration patterns for external services, including circuit breakers, intelligent retry mechanisms with exponential backoff, and graceful degradation strategies when downstream services fail.
- Architect data pipeline infrastructure for agent context management, including vector database optimization, semantic caching layers, and efficient state hydration for long-running workflows.
- Manage, or participate in managing, product roadmaps, sprint design and progression, development pacing, and quality markers for your own work.
- Develop and use effective technical documentation patterns for architecture and design activities, and continuously incorporate emerging best practices.
- Research and implement best practices for both technical and process improvements, and advocate for broad adoption by communicating and illustrating their value.
What You’ll Bring
- Bachelor’s or Master’s degree in Computer Science, Distributed Systems, or related technical field (or equivalent practical experience).
- 8+ years of production engineering experience, including 5-7 years focused specifically on platform infrastructure design, distributed systems architecture, and hands-on solution development.
- Expert-level Python proficiency with deep understanding of async programming, concurrency patterns, and performance optimization at scale.
- Extensive production experience with modern agent frameworks (LangChain, LlamaIndex, AutoGen, CrewAI) and workflow orchestration systems (Temporal, Cadence, Airflow, Prefect) including custom extensions and performance tuning.
- Advanced cloud architecture expertise (AWS, GCP, or Azure) including serverless patterns, container orchestration (ECS, GKE, AKS), service mesh implementations, and multi-region deployment strategies.
- Extensive Infrastructure-as-Code expertise with production experience managing complex multi-environment deployments using Terraform or comparable tools.
- Proven track record of designing, documenting, and building enterprise multi-tenant platforms, with production experience in data isolation patterns, tenant resource quotas, cross-tenant security boundaries, and compliance framework implementation.
- Expert-level distributed systems knowledge including consensus algorithms, distributed transactions, event-driven architectures, and sophisticated service failure management.
- Production experience with advanced observability stacks including distributed tracing (OpenTelemetry), time-series databases (Prometheus, InfluxDB), log aggregation at scale, and custom instrumentation for AI/ML workloads.
- Experience in platform reliability engineering including SLI/SLO definition, load testing frameworks, and incident response automation.
- Experience mentoring other engineers in technical, research, process, documentation, and other professional skills.
- Experience working with senior company leaders on aligning business priorities with technical product roadmaps.
What You Might Bring
- Experience building production LLM infrastructure including prompt caching systems, semantic routing, model gateway design, and inference optimization strategies (batching, quantization, distillation).
- Deep knowledge of distributed state machines, workflow optimization, dynamic task scheduling, and building domain-specific languages (DSLs) for workflow definition.
- Production experience with vector databases (Pinecone, Weaviate, Qdrant) including index optimization, hybrid search strategies, and scaling to billions of embeddings.
- Background in AI safety and governance including prompt injection detection, output validation frameworks, PII redaction systems, and audit trail implementation for regulatory compliance.
- Experience with advanced testing strategies for AI systems including property-based testing, metamorphic testing, adversarial testing, and building synthetic test data generation pipelines.
- Track record of technical leadership including driving architecture design and reviews, creating technical documentation, establishing engineering standards, and mentoring other engineers on technical topics.
- Contributions to open-source projects in the agent/LLM/workflow orchestration space, published technical articles, or conference speaking experience.
- Relevant advanced certifications (AWS Certified Solutions Architect - Professional, Google Cloud Professional Cloud Architect, CKS, or similar).
What We Value
- A growth mindset, building on the recognition that a good engineer is always learning.
- Creative, entrepreneurial flexibility to try innovative approaches to solving problems, coupled with the resilience to recognize mistakes quickly, adapt, and correct course as needed to achieve success.
- Speed to solutions, with rapid, well-planned iterations.
- Design-forward approaches to building technology products, coupled with a test-heavy practice that ensures both the problem to be solved and the solution context are clearly understood before development begins.
- Transparent, frequent, and constructive communication skills and practices.
- Low-ego collaboration, where feedback is valued, everyone’s voice is heard, debates and disagreements are used for the team’s benefit, and commitment matters.
- Mission alignment and care for delivering the highest standard of quality to support our clients’ success.
Reporting
The role currently reports to FactualIQ’s President. As we build out our engineering team, the role will ultimately report to a Tech Lead or Senior Tech Lead.
Career progression from this role will likely lead to an AI Platform Tech Lead position.