Job Description
Company Description
Aetosky develops secure software platforms designed for defense and dual-use institutions to harness geospatial data for critical decision-making. By providing interoperable tools tailored to mission-critical environments, Aetosky supports operations such as battlefield intelligence, infrastructure protection, disaster response, and climate security. Focused on real-time operations and strategic foresight, our technologies empower partners to act with precision, speed, and confidence in sensitive, air-gapped environments. We collaborate with government and enterprise customers to advance geospatial intelligence capabilities in modern defense and multi-domain operations.
About the role
The Data & NLP/AI Engineer owns the full data journey within Aetosky's Multi-INT Fusion Platform: from scraping raw open-source content off the internet, through statistical filtering and semantic analysis, to orchestrating LLM-powered deep intelligence processing. This is a combined Data Engineering and NLP/AI Engineering role with end-to-end ownership: you build the ingestion infrastructure, deploy the vector database, implement anomaly detection and clustering algorithms, and design the prompt orchestration layer for agentic AI analysis. AI-assisted development (GitHub Copilot, Cursor, Claude Code, or equivalent) is the standard workflow, not an optional extra, and will be directly assessed during the hiring process.
Responsibilities
Data Infrastructure Responsibilities
• Design and build automated data collection pipelines (web scrapers, API integrations) for target platforms including X, Facebook, local forums, Instagram, TikTok, and Reddit.
• Deploy and manage the vector database (PostgreSQL with the pgvector extension), with indexing optimized for semantic similarity search at scale (see the sketch after this list).
• Implement pipeline monitoring and alerting: heartbeat checks, record-count validation, dead-letter queues, and golden-record unit tests to prevent silent data loss.
• Manage infrastructure scaling during surge events (sudden data volume spikes during geopolitical crises).
• Assess and select a secure enclave provider against target clients' security requirements.
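For illustration, here is a minimal sketch of the kind of pgvector deployment the second bullet describes, assuming pgvector 0.5+ (for HNSW index support) and 384-dimensional embeddings; the `posts` table, column names, and DSN are placeholders for this example, not the platform's actual schema.

```python
# Minimal pgvector setup plus a similarity query. Assumes pgvector >= 0.5
# (HNSW support); all table/column names and the DSN are illustrative only.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS posts (
    id        BIGSERIAL PRIMARY KEY,
    content   TEXT NOT NULL,
    embedding VECTOR(384)
);
-- HNSW index for approximate nearest-neighbour search under cosine distance.
CREATE INDEX IF NOT EXISTS posts_embedding_idx
    ON posts USING hnsw (embedding vector_cosine_ops);
"""

def nearest_posts(conn, query_embedding, k=10):
    """Return the k stored posts closest to query_embedding (cosine distance)."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content, embedding <=> %s::vector AS distance "
            "FROM posts ORDER BY distance LIMIT %s",
            (vec, k),
        )
        return cur.fetchall()

conn = psycopg2.connect("dbname=fusion user=pipeline")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```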
NLP / AI Engineering Responsibilities
• Implement the first-stage statistical filter using TF-IDF with configurable anomaly thresholds against 30-day rolling baselines (see the first sketch after this list).
• Build semantic clustering using lightweight vector embedding models, grouping near-duplicate content into representative cluster centroids for efficient analyst review.
• Implement bot-detection tripwires: velocity anomaly detection (timing-based coordinated inauthentic behavior) and lexical duplication detection (copy-paste spam arrays).
• Design and manage the prompt orchestration layer for the second-stage LLM processor: intent extraction, relationship mapping, and structured output generation within a secure cloud enclave.
• Implement cost-cap logic with graceful degradation: dynamic threshold escalation at budget warning levels, automated pause at the cap, and manual triage fallback (see the second sketch after this list).
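To make the first-stage filter concrete, here is a minimal sketch assuming a z-score reading of "configurable anomaly thresholds": terms whose TF-IDF weight in today's traffic deviates from the 30-day baseline by more than a configurable number of standard deviations get flagged. The function name and parameters (`flag_anomalous_terms`, `z_threshold`) are hypothetical, not part of the platform.

```python
# Sketch: flag terms whose TF-IDF weight in today's traffic spikes above a
# 30-day rolling baseline. z-score thresholding is one plausible reading of
# "configurable anomaly thresholds"; all names here are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def flag_anomalous_terms(todays_docs, baseline_docs, z_threshold=3.0):
    """Return {term: z_score} for terms spiking above the rolling baseline."""
    vectorizer = TfidfVectorizer(max_features=20_000)
    vectorizer.fit(baseline_docs)                  # vocabulary from the 30-day window
    baseline = vectorizer.transform(baseline_docs).toarray()  # dense for clarity
    today = vectorizer.transform(todays_docs).toarray()
    mu = baseline.mean(axis=0)
    sigma = baseline.std(axis=0) + 1e-9            # guard against zero variance
    z = (today.mean(axis=0) - mu) / sigma
    spiking = z > z_threshold
    terms = vectorizer.get_feature_names_out()
    return dict(zip(terms[spiking], z[spiking]))
```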
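And a sketch of the cost-cap bullet's graceful degradation, written as a simple state machine: the warning level escalates the forwarding threshold, and the cap pauses LLM calls entirely. The warning ratio and score thresholds are illustrative assumptions, not production values.

```python
# Sketch of cost-cap logic with graceful degradation. The warning ratio and
# score thresholds below are illustrative assumptions only.
from enum import Enum

class BudgetState(Enum):
    NORMAL = "normal"      # forward everything above the base threshold
    WARNING = "warning"    # escalate the threshold, forward less
    CAPPED = "capped"      # pause LLM calls; fall back to manual triage

def budget_state(spend: float, cap: float, warn_at: float = 0.8) -> BudgetState:
    """Map current spend to a degradation state (warning at 80% of cap)."""
    if spend >= cap:
        return BudgetState.CAPPED
    if spend >= warn_at * cap:
        return BudgetState.WARNING
    return BudgetState.NORMAL

def should_send_to_llm(score: float, spend: float, cap: float,
                       base_threshold: float = 0.5,
                       escalated_threshold: float = 0.8) -> bool:
    """Dynamic threshold escalation: nearer the cap, only higher-scoring
    items reach the second-stage LLM; at the cap, nothing does."""
    state = budget_state(spend, cap)
    if state is BudgetState.CAPPED:
        return False                               # automated pause
    if state is BudgetState.WARNING:
        return score >= escalated_threshold
    return score >= base_threshold
```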
Collaboration & Tuning Responsibilities
• Collaborate with the Full-Stack Software Developer on data contracts, API schemas, and query optimization for frontend consumption (see the sketch after this list).
• Lead the daily filter tuning cycle during the post-launch stabilization period (first 30–60 days): analyze false positive rates, processing costs, and output quality metrics.
• Document pipeline architecture, filter logic, and prompt templates to enable future team onboarding and sovereign AI transition.
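As an illustration of the data-contract bullet, a hypothetical pydantic model for the record shape the pipeline might hand to the API layer; every field name here is an assumption made for the example, not the platform's actual schema.

```python
# Hypothetical data contract for records the pipeline exposes to the API
# layer; every field name is an assumption for illustration only.
from datetime import datetime
from typing import Optional
from pydantic import BaseModel

class EnrichedPost(BaseModel):
    post_id: str
    platform: str                        # e.g. "x", "reddit", "tiktok"
    collected_at: datetime
    cluster_id: Optional[int] = None     # assigned by semantic clustering
    anomaly_score: float = 0.0           # first-stage statistical filter output
    llm_summary: Optional[str] = None    # filled by second-stage LLM processing
```

Sharing one validated model (or the equivalent JSON Schema) keeps the pipeline and the frontend developer coding against the same shape.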
Qualifications
Required
• 3+ years of combined experience spanning data engineering and applied NLP/machine learning.
• Demonstrated daily proficiency with AI-assisted development tools (GitHub Copilot, Cursor, Claude Code, or equivalent); this will be assessed in the technical evaluation.
• Strong Python and SQL skills with hands-on experience in PostgreSQL (pgvector a plus), Elasticsearch, or similar.
• Experience building web scrapers that handle anti-bot protections, rate limiting, proxy rotation, and DOM structure changes.
• Hands-on experience with text embedding models (sentence-transformers, OpenAI embeddings, or equivalent), vector similarity search, and clustering algorithms.
• Demonstrated LLM prompt engineering: designing prompts, managing context windows, evaluating output quality, and controlling inference costs.
• Familiarity with monitoring and observability tools (Prometheus, Grafana, Datadog, or equivalent).
Preferred
• Experience with multilingual NLP.
• Experience with real-time data streaming technologies (Kafka, Redis Streams, or similar).
• Background in influence operation detection, disinformation analysis, or social media intelligence.
• Demonstrated LLM cost optimization techniques (batching, caching, token management).
• Familiarity with government cloud environments (FedRAMP, ISO 27001, or equivalent regional certifications).