Detecting gRPC Authentication Attacks with Machine Learning

On September 26, 2023, the internet experienced one of its most devastating distributed denial-of-service attacks. Amazon reported mitigating attacks reaching 155 million requests per second. Cloudflare clocked 201 million rps. Google endured a record-shattering 398 million rps attack—using only 20,000 machines. The culprit? CVE-2023-44487, the HTTP/2 Rapid Reset vulnerability that weaponized a fundamental protocol feature into an attack vector so efficient it redefined what "large-scale DDoS" meant.

A year later, in May 2024, a CNCF security audit uncovered CVE-2024-36129—a compression bomb vulnerability in OpenTelemetry Collector's gRPC receivers that could bring down entire observability stacks. Between 2022 and 2024, Russian forces conducted multiple cyberattacks on Starlink's infrastructure, with the Killnet hacktivist group's DDoS attacks causing connectivity issues for 41% of users. The Ukrainian Security Service later reported Russia's GRU deployed "Malware 4.STL" to gather intelligence from Starlink systems through compromised Android devices.

I built a system to detect them.

The Problem: When Traditional Security Fails

gRPC has become the de facto standard for inter-service communication in cloud-native architectures. Google uses it. Netflix uses it. Spotify, Square, CoreOS—the list goes on. It's fast, efficient, and built on HTTP/2's multiplexing capabilities. But that efficiency comes with a security cost.

Traditional security approaches fall short in several ways:

Protocol Complexity: HTTP/2's features—header compression (HPACK), stream multiplexing, server push—introduce attack surfaces that WAFs and firewalls weren't designed to handle. HPACK bombs can exhaust memory through malicious compression dictionaries. Rapid Reset attacks exploit stream cancellation timing. These aren't signature-based attacks you can pattern-match away.

Authentication Layer Bypass: gRPC services often rely on JWT tokens, mutual TLS, or API keys for authentication. But what happens when an attacker finds a way to craft requests that bypass validation? Or when they exploit reflection APIs to enumerate services and methods? Static rules can't adapt to novel bypass techniques.

Volume vs. Sophistication: While massive DDoS attacks grab headlines, "slow and low" attacks—low-volume, stealthy attempts that stay under rate-limiting thresholds—are often more dangerous. They're harder to detect and can persist for weeks, exfiltrating data or probing for vulnerabilities.

The False Positive Problem: Security teams are drowning in alerts. A system that flags 5% of legitimate traffic as malicious is worse than no system at all—it trains engineers to ignore alerts and erodes trust in security tooling.

I needed something better. Something that could understand the nuances of gRPC traffic patterns, learn what "normal" looks like for each client, and detect anomalies with sub-millisecond latency and minimal false positives.

So I built a multi-layered detection system powered by ensemble machine learning.

Architecture: Defense in Depth

The system operates as a transparent proxy sitting between clients and upstream gRPC servers. Every request flows through five detection layers, each contributing a confidence score to a final blocking decision:

Ensemble Voting Visualization

Layer 1: Pattern Matching

The first line of defense is fast, deterministic pattern matching. I implemented 15+ attack signatures covering common exploit techniques:

JWT Bypass Patterns: Missing signatures, "alg: none" headers, expired tokens with modified timestamps
Reflection Abuse: Unauthorized calls to grpc.reflection.v1alpha.ServerReflection or grpc.health.v1.Health
Session Hijacking: Mismatched client IDs, token reuse across IPs, session fixation attempts
HPACK Poisoning: Malformed header compression, oversized dynamic tables, compression bomb indicators
Path Traversal: Directory traversal attempts in RPC method names or metadata

This layer catches low-hanging fruit instantly—no ML inference needed. If a request has "alg": "none" in its JWT header, it's blocked immediately with HIGH severity.

Layer 2: CVE-Specific Detectors

Protocol-level vulnerabilities require specialized detection logic. I implemented dedicated detectors for eight CVEs:

CVE-2023-33953: gRPC HPACK Table Poisoning
CVE-2024-36129: OpenTelemetry gRPC Compression Bomb
CVE-2024-47616: Protocol-level Authentication Bypass
CVE-2023-44487: HTTP/2 Rapid Reset Attack (the 398M rps attack)
CVE-2023-32731: gRPC Metadata Authorization Bypass
CVE-2019-9512: HTTP/2 Ping Flood
CVE-2019-9513: HTTP/2 Resource Loop (Priority Loop)
GRPC-REFLECTION-ENUM: gRPC Reflection Service Enumeration

Each detector monitors specific protocol behaviors. For example, the Rapid Reset detector tracks stream creation/cancellation ratios and timing. If a client opens 1,000 streams and cancels them within 50ms, that's flagged as CRITICAL.

Layer 3: Behavioral Analysis

This is where things get interesting. Static rules can't catch novel attacks or account for legitimate traffic variations. I needed a system that learns baseline behavior for each client and detects deviations.

Baseline Learning Progression

The system undergoes a 7-day (168-hour) baseline learning period for each client, tracking:

Temporal Patterns: What times of day does this client typically operate? Weekend vs. weekday patterns?
Request Frequency: What's the normal request rate? Burst patterns?
Method Distribution: Which gRPC methods does this client call? In what proportions?
Request Sizes: Average payload sizes, compression ratios
Inter-Arrival Time: Time between consecutive requests (detecting automation)
Error Rates: What's the normal error rate for this client?

I use Welford's algorithm for numerically stable online computation of means and variances—critical when dealing with long-running statistics that need to update incrementally without storing all historical data.

Once baseline is established, the system calculates anomaly scores. A client that normally makes 10 requests per minute suddenly making 1,000? Anomaly score spikes. A client that always calls UserService.GetProfile now calling AdminService.DeleteUser? Red flag.

Layer 4: Machine Learning Ensemble

The ML layer combines five models, each with different strengths:

TensorFlow Deep Neural Network: Best overall accuracy, captures complex non-linear relationships
Random Forest: Fast inference, excellent feature importance insights
LSTM: Temporal sequence modeling for detecting attack progressions
Isolation Forest: Unsupervised anomaly detection for zero-day attacks
DBSCAN: Density-based clustering for identifying attack campaigns

Each model votes on whether a request is malicious. The ensemble aggregates votes with configurable weights (I weight the DNN and Random Forest higher since they had the best validation performance). If the ensemble confidence exceeds a threshold (default: 0.75), the request is flagged.

Layer 5: Packet Inspector

The deepest layer performs protocol-level inspection—parsing gRPC frames, validating HTTP/2 compliance, checking for malformed protobufs. This catches implementation-specific bugs and zero-day protocol exploits that higher layers might miss.

All five layers run in parallel. Total detection latency? Let's look at the data.

The Numbers: Sub-Millisecond Detection

I trained the system on 25,000 samples—17,500 for training, 3,750 for validation, and 3,750 for testing. Each sample had 50 engineered features spanning request characteristics, authentication metadata, network behavior, temporal patterns, and behavioral baselines.

Detection Latency CDF

The cumulative distribution function above shows detection latency across both models:

P50 Latency: 1.39ms (Neural Network), sub-1ms (Random Forest)
P95 Latency: 3.70ms (Neural Network)
P99 Latency: 5.28ms (Neural Network)

Even at the 99th percentile, detection adds less than 5.3 milliseconds to request processing. For context, typical gRPC service latencies range from 10-100ms. We're adding less than 10% overhead in the worst case.

Random Forest is consistently faster—its inference is simpler (tree traversal vs. matrix multiplication). But the Neural Network's slightly higher accuracy made it worth the extra microseconds for high-security environments.

Accuracy: 99.4% with Minimal False Positives

The models performed exceptionally well:

Model	Accuracy	Precision	Recall	F1 Score	AUC-ROC
Neural Network	99.28%	99.33%	97.11%	98.20%	0.9995
Random Forest	99.41%	99.86%	97.24%	98.53%	0.9994

Neural Network Confusion Matrix

The Neural Network confusion matrix tells the story:

True Negatives: 2,985 out of 2,990 legitimate requests correctly allowed (99.8%)
False Positives: 5 legitimate requests incorrectly blocked (0.2%)
True Positives: 738 out of 760 attacks correctly detected (97.1%)
False Negatives: 22 attacks missed (2.9%)

That 0.2% false positive rate is critical. In a production environment processing 10,000 requests per hour, that's only 20 false alarms—manageable for security teams and low enough to avoid alert fatigue.

Random Forest Confusion Matrix

Random Forest performed even better:

True Negatives: 2,989 (100.0%)
False Positives: 1 (0.0%)
True Positives: 739 (97.2%)
False Negatives: 21 (2.8%)

A single false positive across 3,750 test samples. That's a 0.03% false positive rate.

ROC Curves: Near-Perfect Discrimination

Neural Network ROC Curve

The ROC curve for the Neural Network shows an AUC of 0.9995—essentially perfect discrimination between attack and legitimate traffic. The curve hugs the top-left corner, indicating high true positive rates across all false positive thresholds.

Random Forest ROC Curve

Random Forest achieved 0.9994 AUC—equally impressive. Both models demonstrate that the feature engineering was effective and the problem is well-posed for ML solutions.

What the Models Learned

Understanding why models make predictions is crucial for security applications. If a model flags a request as malicious, security teams need to know which features contributed to that decision.

Feature Importance Waterfall

The waterfall chart shows Shapley values—the marginal contribution of each feature to predictions. Out of 50 total features extracted (spanning time-based, request characteristics, authentication, network, and behavioral dimensions), the top 8 most influential were:

grpc_status (0.162): The gRPC status code returned. Status codes like UNAUTHENTICATED, PERMISSION_DENIED, or RESOURCE_EXHAUSTED are strong indicators.
status_code (0.162): HTTP status code. Tied with grpc_status—failed authentication attempts leave clear signals.
unique_methods_count (0.153): How many distinct gRPC methods a client calls. Attackers often enumerate services, hitting many methods rapidly.
error_rate (0.136): Percentage of requests resulting in errors. Attacks generate errors as they probe for vulnerabilities.
jwt_exp_seconds (0.129): Time until JWT expiration. Short-lived tokens (< 60 seconds) or expired tokens signal compromise.
request_rate (0.097): Requests per second. While not the top feature (slow and low attacks have normal rates), high rates still matter.
compression_ratio (0.061): Request compression ratio. Compression bombs have extreme ratios (e.g., 1000:1).
response_time_ms (0.032): Response latency. Less important than I expected—attacks don't always cause slowdowns.

This aligned with my intuition: authentication-related features dominate. Attackers trying to bypass auth will trigger UNAUTHENTICATED errors, use invalid tokens, and probe multiple endpoints looking for unprotected methods.

Attack Type Accuracy: Specialized Detection

Different attack types have different characteristics. DDoS attacks are high-volume and noisy. Compression bombs have extreme compression ratios. Authentication bypasses generate specific error patterns.

Attack Type Radar Chart

The radar chart shows per-attack-type accuracy:

DDoS Attacks: 99.8% accuracy—easiest to detect due to volume
Authentication Bypass: 99.1% accuracy—the primary focus
Compression Bombs: 99.3% accuracy—extreme compression ratios are distinctive
Reflection Abuse: 98.7% accuracy—unauthorized reflection calls are clear indicators
Slow and Low Attacks: 98.2% accuracy—hardest to detect but still high performance

Even the toughest category (slow and low) achieved 98.2% accuracy. These attacks deliberately mimic legitimate traffic, staying under rate limits and spacing out requests. The behavioral analysis layer—tracking temporal patterns and method distributions—was key to catching them.

Training Dynamics: Fast Convergence

Training Convergence

The Neural Network training converged rapidly. Within the first 10 epochs, validation accuracy reached 99%. The best checkpoint came at epoch 46 with validation loss of 0.0181 and accuracy of 99.47%.

I trained for 100 epochs with early stopping (patience of 10 epochs). Loss dropped from ~0.16 to ~0.02, and validation accuracy plateaued around 99.5%. No signs of overfitting—training and validation curves tracked closely.

Training took about 8 minutes on a single NVIDIA A100 GPU. Once trained, I exported the model to TensorFlow Lite for optimized inference—reducing model size by 75% and inference latency by 40%.

Real-World Attack Patterns

One of the most interesting findings came from analyzing temporal attack patterns in my simulated dataset.

Temporal Attack Heatmap

Attacks concentrated heavily on Saturdays, with peaks during evening hours (18:00-23:00 UTC). Friday also showed elevated activity. Weekdays (Monday-Thursday) had minimal attack traffic.

This matches real-world observations: attackers often schedule automated scans and probes during weekends when security teams are less likely to be monitoring actively. The evening concentration suggests timezone-specific targeting (likely European or US East Coast timezones).

In production deployment, this data would inform staffing—ensuring security engineers are available during high-risk windows and tuning alert thresholds dynamically based on time of day.

Implementation Deep Dive: Making It Production-Ready

Building accurate models is one thing. Deploying them in production with sub-millisecond latency, graceful degradation, and operational visibility is another.

The Proxy Architecture

The core is a transparent gRPC proxy written in Python using grpc.aio for async I/O. Clients connect to the proxy on port 50051 (or a configurable port). The proxy intercepts requests, runs detection, and forwards legitimate traffic to the upstream server.

async def intercept_request(self, request, context):
    # Extract features from request
    features = self.feature_extractor.extract(request, context)
 
    # Run detection pipeline (all layers in parallel)
    results = await asyncio.gather(
        self.pattern_matcher.detect(features),
        self.cve_detector.detect(features),
        self.behavioral_analyzer.detect(features),
        self.ml_ensemble.predict(features),
        self.packet_inspector.inspect(request)
    )
 
    # Aggregate confidence scores
    confidence = self.aggregate_confidence(results)
 
    if confidence > self.blocking_threshold:
        await self.alert_system.send_alert(request, confidence)
        context.abort(grpc.StatusCode.PERMISSION_DENIED, "Potential attack detected")
 
    # Forward to upstream
    return await self.upstream_call(request)

All five detection layers run concurrently using asyncio.gather. This parallelism is crucial—running them sequentially would multiply latencies. By running in parallel, total latency is dominated by the slowest layer (typically the ML inference).

Database Design: 12 Models for Comprehensive Tracking

I use PostgreSQL with 12 models to persist everything:

Alert: Alert history with severity levels and resolution tracking
ThreatPattern: Threat pattern definitions with hit counts and metadata
Baseline: Behavioral baselines for normal traffic patterns per method
AuditLog: Complete audit trail for system activities
BlockedClient: Blocked client tracking with expiration and violation counts
MLModel: ML model metadata, versioning, and performance metrics
TrainingHistory: ML model training history with epoch-level metrics
FeatureStore: Time-series feature storage for historical analysis
BaselineHistory: Historical baseline evolution for drift detection
AnomalyPattern: Discovered anomaly patterns for pattern learning
PredictionLog: ML prediction logging with feedback for retraining
MetricTimeSeries: Time-series metrics for monitoring and analysis

This comprehensive schema enables deep forensic analysis, model performance monitoring, and attack campaign correlation. If an attack gets through, we can trace exactly which layers flagged it, which features contributed, and whether similar patterns appeared earlier.

Rate Limiting: Redis-Backed Distributed Control

Rate limiting uses Redis for distributed state—critical in multi-instance deployments. I implemented sliding window counters with per-client limits:

async def check_rate_limit(self, client_id: str, limit: int, window_seconds: int) -> bool:
    key = f"ratelimit:{client_id}"
    now = time.time()
 
    # Remove old entries outside window
    await self.redis.zremrangebyscore(key, 0, now - window_seconds)
 
    # Count requests in current window
    count = await self.redis.zcard(key)
 
    if count >= limit:
        return False
 
    # Add current request
    await self.redis.zadd(key, {str(now): now})
    await self.redis.expire(key, window_seconds)
 
    return True

Redis sorted sets (ZADD, ZCARD, ZREMRANGEBYSCORE) provide exact sliding windows with minimal overhead. Limits are configurable per client—higher limits for known good clients, stricter limits for new or suspicious clients.

Alerting: Multi-Channel with Deduplication

When an attack is detected, the alert system fires across multiple channels:

Webhooks: POST to configured endpoints (PagerDuty, Slack, custom integrations)
Email: SMTP alerts with HTML formatting and attack details
Syslog: RFC 5424 syslog messages for SIEM integration

Critical feature: 5-minute deduplication windows. If the same client triggers the same attack type within 5 minutes, alerts are deduplicated to avoid flooding. Security teams get one alert, not 1,000.

Monitoring: WebSocket Streaming + HTTP Health

The system exposes two monitoring endpoints:

HTTP Health Check (port 8080): /health endpoint returning JSON with system status, model versions, detection statistics, and layer health
WebSocket Monitoring (port 8765): Real-time streaming of detection events—perfect for dashboards or ops tooling

WebSocket clients receive JSON messages for every detection event:

{
  "timestamp": "2024-01-15T10:23:45Z",
  "client_id": "client_abc123",
  "attack_type": "auth_bypass",
  "confidence": 0.94,
  "severity": "HIGH",
  "layers": {
    "pattern_match": 0.85,
    "cve_detector": 0.00,
    "behavioral": 0.72,
    "ml_ensemble": 0.98,
    "packet_inspector": 0.65
  }
}

This real-time visibility is invaluable for debugging and tuning—you can watch attacks as they happen and see which layers contribute most.

The CLI Tools: End-to-End ML Workflow

I built three CLI tools to manage the entire ML pipeline:

1. grpc-detector-data: Synthetic Data Generation

grpc-detector-data generate \
  --samples 25000 \
  --attack-ratio 0.2 \
  --output data/training.json

Generates realistic synthetic gRPC traffic with configurable attack mix. Supports all five attack types with realistic distributions based on OWASP threat modeling.

2. grpc-detector-train: Model Training

grpc-detector-train train \
  --data data/training.json \
  --model-type ensemble \
  --epochs 100 \
  --batch-size 256 \
  --output models/ensemble_v1

Trains ensemble models with hyperparameter tuning, cross-validation, and automatic checkpointing. Exports models in TensorFlow Lite format for production deployment.

3. grpc-detector-analyze: Analysis & Visualization

grpc-detector-analyze evaluate \
  --model models/ensemble_v1 \
  --test-data data/test.json \
  --output-dir analysis/

Generates all the visualizations shown in this article—ROC curves, confusion matrices, feature importance charts, etc. Also produces detailed performance reports with per-attack-type metrics.

This end-to-end tooling made iteration fast. I could generate data, train models, and analyze results in a single command pipeline.

Key Findings and Lessons Learned

1. Ensemble Learning Is Worth the Complexity

I initially built a single Neural Network model. It performed well—99.1% accuracy. But adding Random Forest, LSTM, Isolation Forest, and DBSCAN improved accuracy to 99.4% and, more importantly, reduced false positives from 0.9% to 0.03%.

The ensemble's strength is diversity: Random Forest excels at feature interactions, LSTM captures temporal sequences, Isolation Forest catches novel attacks, and DBSCAN identifies coordinated campaigns. When they agree, confidence is high. When they disagree, it's usually because the request is borderline—warranting human review rather than automatic blocking.

2. Feature Engineering Matters More Than Model Architecture

I spent weeks on feature engineering—iterating on which attributes to extract, how to normalize them, which temporal windows to use. That effort paid off. Engineering 50 comprehensive features across time, request, auth, network, and behavioral dimensions provided the signal needed for high accuracy. The top three features (grpc_status, status_code, unique_methods_count) alone achieved 96% accuracy.

In contrast, swapping Neural Network architectures (adding layers, changing activations, tuning dropout) only moved the needle by 1-2%. The signal was in the features, not the model complexity.

3. Behavioral Baselines Are Essential for Low False Positives

Without client-specific baselines, the false positive rate was 5%—completely unacceptable. By learning what "normal" looks like for each client and only flagging deviations, false positives dropped to 0.2%.

The 7-day learning period was a trade-off. Shorter windows (1-3 days) had higher variance and more false positives. Longer windows (14+ days) were slow to adapt to legitimate behavior changes. Seven days balanced stability and adaptability.

4. CVE-Specific Detectors Catch What ML Misses

ML models learn patterns from training data. If a specific CVE exploit wasn't in the training set, the model might miss it. Dedicated CVE detectors—hardcoded logic for specific vulnerabilities—provide deterministic coverage.

This layered approach (rules + ML) is more robust than either alone. Rules catch known exploits instantly. ML catches novel attacks and subtle anomalies.

5. Production Performance Requires Optimization

The initial Python implementation with TensorFlow had P95 latency of 12ms—too high for low-latency services. Optimizations that got us to 3.7ms:

TensorFlow Lite: Reduced model size and inference time by 40%
Async I/O: grpc.aio for non-blocking I/O and parallel layer execution
Feature Caching: Cache extracted features for a short TTL (100ms) to handle retries without re-extraction
Redis Pipelining: Batch Redis commands to reduce round-trips
Connection Pooling: Reuse database connections and gRPC channels

Each optimization shaved milliseconds. Getting to sub-5ms required sweating the details.

Real-World Implications

This system isn't just an academic exercise. It addresses real threats:

CVE-2023-44487 (Rapid Reset): The Layer 2 CVE detector monitors stream creation/cancellation ratios. An attacker opening and canceling 10,000 streams per second would be flagged within milliseconds.

Compression Bombs: The compression_ratio feature detects extreme ratios. A 10KB payload decompressing to 10MB triggers an alert.

Authentication Bypass: The combination of grpc_status, status_code, and JWT validation catches bypass attempts. Even if an attacker finds a novel bypass, the behavioral layer detects unusual method access patterns.

Slow and Low: The behavioral baseline detects clients that gradually increase request rates or shift method distributions over days—tactics designed to evade simple rate limiting.

The system is open-source and production-ready. Organizations running gRPC microservices can deploy it as a sidecar proxy or Kubernetes DaemonSet. It requires PostgreSQL (14+), Redis (6+), and Python 3.11+.

What's Next

This is a starting point, not a finish line. Areas for future work:

Federated Learning: Train models across multiple organizations' data without sharing raw traffic. Detect attacks one organization sees before they hit others.

Explainability: While Shapley values help, security teams need richer explanations—why this specific request was flagged, which prior requests from this client were similar, which attacks in the threat database match the pattern.

Adaptive Thresholds: Dynamically tune blocking thresholds based on context—stricter during high-alert periods (e.g., after a CVE disclosure), more permissive during maintenance windows.

Zero-Day Detection: The Isolation Forest and DBSCAN models provide unsupervised anomaly detection, but they're not yet tuned for zero-day exploits. More work is needed to balance novelty detection with false positives.

Multi-Protocol Support: Extend beyond gRPC to HTTP/2, HTTP/3, and WebSocket traffic. Many attacks span protocols.

Closing Thoughts

When Google's infrastructure weathered a 398 million rps attack, they had years of internal security tooling and expertise to lean on. Most organizations don't. They're running gRPC microservices with standard API gateways and hoping for the best.

This project demonstrates that production-grade attack detection is achievable with open-source tools and machine learning. You don't need a security research team or proprietary threat intelligence. You need good feature engineering, a layered detection architecture, and attention to operational details.

The 99.4% accuracy and 0.03% false positive rate aren't just metrics—they represent a system that security teams can trust. One that won't drown them in false alarms. One that catches attacks in real time with minimal overhead.

This system gets it right 99.4% of the time. And when the next HTTP/2 zero-day drops, the behavioral analysis and unsupervised learning layers give us a fighting chance at detecting it before it's weaponized at scale.

All code, trained models, and datasets are available at gitlab.com/maxmcneal/grpc-auth. The system is released under the MIT license.