Tags: keploy/keploy
feat(grpc): implement static dedup for gRPC (#4298)

* feat(grpc): implement static dedup for gRPC

Signed-off-by: Shivanipandey31 <shivanipande486@gmail.com>

* fix(grpc): address review comments: dedup window collision, header deep copy, parseGrpcStatus error

- Always increment GlobalTestCounter for duplicate gRPC streams so concurrent duplicates each get a unique mock-window name instead of all sharing "test-0" and colliding in the mock manager
- Deep-copy PseudoHeaders/OrdinaryHeaders via maps.Clone when promoting initial headers to trailers for error responses, preventing shared map mutation between Headers and Trailers
- Return -1 (not 0) from parseGrpcStatus on non-numeric trailer values so malformed grpc-status causes a deterministic mismatch rather than silently passing as OK

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(grpc): update stale CaptureGRPC doc comment after counter fix

The second bullet incorrectly said the counter is only incremented for non-duplicates. After the collision fix it is always incremented.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* error fix

Signed-off-by: Shivanipandey31 <shivanipande486@gmail.com>

---------

Signed-off-by: Shivanipandey31 <shivanipande486@gmail.com>
Signed-off-by: SHIVANI PANDEY <shivanipande486@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
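As an illustration of the third review fix above, here is a minimal sketch of a status parser that maps a malformed grpc-status trailer to -1. The function name comes from the commit message; its signature and body are assumptions, not keploy's actual code.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseGrpcStatus is a hypothetical stand-in for the helper named in the
// commit. It converts a grpc-status trailer value to an int and returns -1
// when the value is not numeric, so a malformed status can never compare
// equal to 0 (OK).
func parseGrpcStatus(raw string) int {
	code, err := strconv.Atoi(raw)
	if err != nil {
		return -1
	}
	return code
}

func main() {
	fmt.Println(parseGrpcStatus("0"), parseGrpcStatus("5"), parseGrpcStatus("oops")) // prints "0 5 -1"
}
```

Returning a sentinel outside the valid status range means a recorded OK (0) and a garbled live value always differ, which is the deterministic mismatch the commit asks for.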
fix(app): work-slow-not-fail for pre-run container-name removal under daemon saturation (#4309)

* fix(app): work-slow-not-fail for the pre-run container-name removal

Under a saturated docker daemon (heavy CI oversubscription) the `docker rm -f` of a leftover --name container can take well past the old 30s pre-run budget + 3 retries (~120s total), so the next `docker run --name` hits "Conflict. The container name is already in use" and the run fails: the go-docker-timefreeze flake. This removal is NOT in the SIGINT drain path, so it can afford to wait.

Make it work-slow-not-fail: preRunRemoveBudget 30s->90s and dockerRunNameConflictRetries 3->5 (~540s of removal attempts, perAttempt=budget/3=30s so each rm can finish), so a saturated daemon yields a slower removal, never a failed run. Pure budget change; teardown-drain budgets untouched.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(replay): work-slow reset-resend: more attempts + backoff for the docker-proxy reset burst

A transport-reset test request (docker's userland-proxy resetting freshly-accepted host-port conns in bursts under load) is re-sent on the mock-gated reset path. The budget was 2 back-to-back attempts, enough for a transient reset, but under heavy CI contention the reset burst outlasts both and the test fails got=0 (go-docker-timefreeze's reset mode, distinct from the container-name-conflict mode).

Make it work-slow: 2->6 attempts spread by a growing backoff (attempt*300ms) so the bounded re-sends ride out the reset burst rather than hammering it. Each attempt stays gated on no-mock-consumed + port readiness, and a non-reset failure still stops immediately, so the extra attempts only add latency on the reset path.
Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(yaml): atomic non-append WriteFileF — temp+rename so readers never see a partial file WriteFileF's non-append branch opened the destination with O_TRUNC and streamed the document, leaving a window where a concurrent or volume-lagging reader sees a zero-length or half-written file. This is the single write primitive behind every test-set report, config, and test->mock mapping (reportdb Insert/UpdateReport, testdb config, mapping writes), so on overlay/NFS/container-mounted volumes under load a reader (report-upload, localTestsPassed, reportdb.GetReport) could read it mid-sync and either hard-fail to decode or silently decode a PARTIAL report with test entries missing — a customer-facing reliability + correctness hazard. Write to a temp file in the same dir, Sync, preserve the target mode, then atomically rename over the destination (POSIX rename replaces atomically; Windows remove-then-rename). Mirrors the proven mockdb.writeMocksAtomically / testdb.upsert pattern. The append branch (NDJSON/multi-doc) is unchanged. Adds a concurrency regression test (reader never observes a partial file across 80 rewrites). Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(grpc): retry connection-refused on the SimulateGRPC dial (slow-starting app) SimulateGRPC dialed the app once with net.Dial and returned "failed to dial: connection refused" on the first failure. A slow-starting gRPC app — a docker app under CI load especially — can briefly refuse connections while still coming up, exactly the scenario the HTTP replay path already handles via doRequestWithConnRefusedRetry. gRPC had no such tolerance, so its first tests deterministically false-fail with got=0 — the likely grpc-mongo NextToken DeadlineExceeded class. 
Extract dialTCPWithConnRefusedRetry: re-dial on a pure connection-refused up to maxConnRefusedRetries with the shared growing backoff (a refused dial sent zero bytes / consumed zero mocks, so the re-dial is idempotent); any other error, attempt-exhaustion, or a cancelled context returns immediately. Reuses the existing isPreResponseConnRefused / maxConnRefusedRetries / connRefusedRetryBackoff. Adds tests: retries-then-connects, bounded-on-persistent-refusal, respects-context. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(replay): apply reset-resend on the streaming test path (docker-proxy reset) The non-streaming replay loop tolerates a docker userland-proxy reset on a freshly-accepted host-port conn (replay.go:1752 — the request never reached the app, consumed zero mocks, so retryResetOnce re-sends instead of synthesizing a false got=0). The streaming Phase 2 path (SSE/NDJSON/chunked/multipart) had no such tolerance: a reset under CI load went straight to failure++ and a synthesized got=0 — a false STATUS_CODE_CHANGED/AppConnectionError on a request that was never processed. Wire the identical mock-gated retryResetOnce into the streaming simErr branch before the failure synthesis: on a transport reset, re-send (retryResetOnce returns a fresh streaming response for a streaming tc); on the unsafe refusal path, fold its drained mocks into totalConsumedMocks exactly as the non-streaming loop does. retryResetOnce is bounded + gated on no-mock-consumed (covered by reset_resend_test.go), so no new failure modes. 
Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(app): bounded retry for transient compose-dependency startup failures When keploy records a docker-compose app and `compose up` fails because a dependency container (a DB/emulator under `depends_on: condition: service_healthy`) crashed during startup, keploy aborted the whole run ("user application terminated unexpectedly") — even though the app's own container never started and a retry brings the flaky dependency up cleanly. This is the orderflow/localstack atg-with-mocks CI flake; for a customer it means a flaky compose dependency fails their recording. Per work-slow-not-fail, tolerate it. Add a bounded (3), linearly-backed-off retry of the compose bring-up, gated strictly on the transient signature read from docker's own post-mortem state (`docker compose ps -a --format json`): the app service in state `created` (compose never started it because a service_healthy gate failed) AND some other non-agent service `exited` non-zero. A genuinely-failing app runs and exits — its service is `exited`, never `created` — so it can never match and still fails fast with its real exit code (no masking; the gate is firewalled and unit-tested). Each attempt does a clean ComposeDown + re-runs the pre-up guards; the loop, backoff, and pre-up guards all short-circuit on ctx cancellation so the teardown budget is respected. The `docker compose ps` state probe is evaluated lazily, only when a retry is plausible, so it never shells out on the success/cancel path. Also fixes a latent bug: the file-based compose down/ps/agent-id branches now carry the `-p` project flags so a user-set compose project name resolves correctly. Covered by unit tests (the firewall incl. genuine-fail-not-masked, the loop gate incl. probe-not-called-on-success, the ps parser) and dockerlive-tagged live tests (RED abort, GREEN flaky-dep recovers, NO-MASKING always-fail surfaces). 
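The retry gate for the compose bring-up described above can be read as a small predicate over per-service state. The sketch below is only an illustration: the `serviceState` struct and `isTransientDependencyFailure` are hypothetical, and parsing the `docker compose ps -a --format json` output into that struct is assumed to happen elsewhere.

```go
package main

import "fmt"

// serviceState is a hypothetical, already-parsed view of one compose
// service as reported after a failed bring-up.
type serviceState struct {
	Service  string
	State    string // for example "created", "running", "exited"
	ExitCode int
}

// isTransientDependencyFailure reports whether a failed bring-up matches
// the signature from the commit: the app service was never started
// (state "created") AND some other non-agent service exited non-zero.
func isTransientDependencyFailure(services []serviceState, appService, agentService string) bool {
	appCreated, depCrashed := false, false
	for _, s := range services {
		switch {
		case s.Service == appService:
			appCreated = s.State == "created"
		case s.Service == agentService:
			// The injected agent is never treated as a flaky dependency.
		case s.State == "exited" && s.ExitCode != 0:
			depCrashed = true
		}
	}
	return appCreated && depCrashed
}

func main() {
	flakyDep := []serviceState{{"app", "created", 0}, {"db", "exited", 1}}
	genuineFail := []serviceState{{"app", "exited", 2}, {"db", "exited", 1}}
	fmt.Println(isTransientDependencyFailure(flakyDep, "app", "keploy-agent"))    // prints "true"
	fmt.Println(isTransientDependencyFailure(genuineFail, "app", "keploy-agent")) // prints "false"
}
```

Because a genuinely failing app has run and exited, its state is `exited` rather than `created`, so the predicate cannot mask a real application failure.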
Signed-off-by: slayerjain <shubhamkjain@outlook.com> * ci: drop codfish/semantic-release-action dry-run from the PR build The keploy org blocks third-party (non-GitHub-verified) Actions, so codfish/semantic-release-action@v3 fails the build job at "Getting action download info" with "Repository access blocked", cascading to CI Gate and the record_build_replay_build cells. It was only a dry-run version preview (last step, nothing consumes its output; the real release runs elsewhere), so removing it unblocks the PR build with no loss of coverage. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(ci/proxy-stress): stop keploy on app-start bail + timeout-ceiling everything The proxy-stress-test job hung ~6h until the job timeout. RCA from the CI log (only 3 SIGINTs delivered for 4 record iterations) plus a local repro: send_request runs in the background and SIGINTs the foreground keploy record to stop it. On a flaky slow app-start (the app's /health does a DB ping and only starts after postgres is service_healthy; under contention the 40x3s=120s budget was occasionally exceeded), send_request hit exit 1 WITHOUT calling container_kill, so keploy recorded forever. The local repro confirmed keploy terminates cleanly the instant it IS signaled - the hang is purely the missing signal, not a recording-hot-path bug. - call container_kill on the health-failure bail path (the direct fix) - raise the startup budget 40 -> 80 attempts for contention-slow boot - wrap record, both replays, and both compose-downs in timeout (hard ceiling) so nothing in this script can hang the job for hours Signed-off-by: slayerjain <shubhamkjain@outlook.com> --------- Signed-off-by: slayerjain <shubhamkjain@outlook.com>
feat: strict request-body value-match and unified on-disk noise block (#4310)

* feat(schemanoise): value-match all candidates under strict mode

Under strict schema-noise enforcement (SchemaNoiseStrict), StrictReject previously short-circuited to "allowed" for any mock with no learned req_body_noise, so a value change on an unmarked field was silently served. Gate that early-return on !e.strict: in strict mode every candidate is now value-compared against its recorded request body, and a drift on a field not covered by configured/learned noise rejects it.

The recorded-vs-live diff still excludes user/global body noise and obfuscated values via KnownNoise, and the non-strict path is unchanged (it still tightens only mocks that carry learned noise). Updates the HTTP filterStrictNoiseMatches docs and the match() strict-block comment to describe the new behaviour, and extends the engine tests with no-learned-noise strict cases.

Signed-off-by: Aditya Sharma <aditya282003@gmail.com>

* refactor(mockdb): unify mock noise under a single on-disk noise block

The on-disk mock format split obfuscator value-regexes (top-level `noise:` list) from request-body schema noise (`req_body_noise:` map). Consolidate both under one `noise:` mapping:

    noise:
      req: [ body.tier_type, body.user.id ]   # schema field paths
      value: [ "^tok-.*$" ]                   # obfuscator value-regexes

noise.req is a plain path list; the per-path regex values are dropped because the strict path only ever honoured whole-field "ignore", so they were dead weight for request-body noise.

The change is confined to the serialization boundary: the in-memory models (Mock.Noise []string, MockSpec.ReqBodyNoise map) are unchanged, so the matcher/engine keep their existing vocabulary. A new DocNoise type with custom YAML/JSON unmarshalers converts at encode/decode.
Backward compatible (read both, write new only): legacy `noise:` lists and `req_body_noise:` maps still decode (the list folds into value, the map keys fold into req); new writes emit only the unified block. The gob format is unaffected (it gob-encodes models.Mock directly). Adds DocNoise unit tests and a legacy-format decode test. Signed-off-by: Aditya Sharma <aditya282003@gmail.com> * test(schema-noise): assert noise.req in the mux-elasticsearch e2e The schema-noise-detection e2e script grepped for the literal on-disk key `req_body_noise`, which the unified noise block replaced with a `noise.req` field-path list. Detection still works (the path is written under noise.req), but the Phase B assertion could no longer find the old key and failed. Update the mock_has_*_noise helpers to match a `- body.<path>` list item (unique to noise.req), refresh the display grep to print the noise block, and align the phase messages/comments with the new key name. Phase C (strict rejection) was already passing and is unchanged. Signed-off-by: Aditya Sharma <aditya282003@gmail.com> * fix: remove the codfish/semantic-release-action due to access blocked by github (attacker force-pushed malicious commit) Signed-off-by: Aditya Sharma <aditya282003@gmail.com> --------- Signed-off-by: Aditya Sharma <aditya282003@gmail.com>
fix(ci-flake): tab-safe PostgresV3 SQLNormalized YAML (CI flake robustness, batch 1) (#4305)

* fix(models): tab-safe YAML for PostgresV3 SQLNormalized

PostgresV3QuerySpec.SQLNormalized is a plain string with no write-side YAML styling. After the integrations restoreRawSQL pass it can carry embedded tabs/newlines from the original statement; yaml.v3 v3.0.1 then emits such values as a literal block scalar (`|N-`), and because yaml.Node.Encode marshals-then-reparses its own output, that block scalar fails to round-trip with "found a tab character where an indentation space is expected". The failure is inside EncodeMock's Spec.Encode, before the post-encode sanitizeYAMLStringNodes walk runs, so the node sanitizer can't rescue it; the whole PostgresV3 mock fails to marshal and is dropped.

Add a PostgresV3QuerySpec.MarshalYAML that routes SQLNormalized through PostgresV3SafeString (DoubleQuotedStyle when StringNeedsDoubleQuoted), mirroring the existing PostgresV3Notice/PostgresV3Error convention. The model field stays `string`, so JSON/BSON and all integrations assignment/read sites compile unchanged; only the YAML write side gains the double-quoting. The alias spells out every field explicitly because yaml.v3 v3.0.1 panics on the duplicated-key embed-and-shadow idiom.

Regression tests: TestYAMLRoundTrip_PostgresV3_* cover leading-tab and tab/newline-bearing SQLNormalized round-tripping through EncodeMock.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(app): retry user-app docker-run on transient container-name conflict

The v3.5.77 pre-run ensureContainerNameFreeWithin frees the user-app container name before `docker run --name X`, but on a saturated CI daemon the prior test-set's `--rm` reaper can still re-touch the name in the narrow window between the guard's "name is free" check and the run, so the run loses the race and docker exits 125 ("Conflict. The container name ...
is already in use") without starting the app — failing the whole test-set (observed: go-docker-timefreeze test-set-0->1 in pipeline 5120, post-v3.5.77, with no pre-run-budget warning, i.e. the guard ran and returned cleanly but the name was retaken before the run). Close the residual window at the run itself: when a user-app docker-run exits 125, force-remove the name and re-issue the run, bounded by dockerRunNameConflictRetries so a genuine 125 (bad image, missing mount, …) still surfaces after a few attempts. Belt-and-suspenders with the pre-run guard; only the docker-run (non-compose) path is affected. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(runner): use AgentReadyTimeout for the runner agent-readiness gate The runner's pre-instrumentation agent-readiness gate hardcoded a 120s ceiling, while the record/replay/agent gates all use the contention-tuned pkg.AgentReadyTimeout (default 330s, overridable via KEPLOY_AGENT_READY_TIMEOUT). On a saturated CI daemon the keploy agent container can take well over a minute just to start, so the 120s gate here fired first and failed an otherwise-healthy sandbox/runner bring-up with "keploy-agent did not become ready within 120s" — even though the other gates would have waited it out. Align this gate to the same budget. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(record): don't close appErrChan while the app-runner can still send On shutdown the app-runner goroutine (runAppErrGrp, both the compose and non-compose branches) sends the app's exit error on appErrChan. When the app receives SIGTERM it exits with "signal: terminated" — not ErrCtxCanceled — so the goroutine reaches `appErrChan <- runAppError`. Start's `defer close(appErrChan)` raced that send and crashed the whole run with "panic: send on closed channel" (observed in pulsar-basetopic teardown, record.go:518). appErrChan is owned by Start but written by a goroutine that can outlive Start's return, so Start must not close it. 
The sole consumer is a single receive in the select below (no close dependency) and the size-1 buffer absorbs the lone send (the two sender branches are mutually exclusive), so leaving the channel open to be GC'd is correct and race-free. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(replay): gate docker-run replay on host port readiness, not just --delay waitForAppReady only honored --health-url (poll for 2xx) or a fixed --delay sleep. For docker apps published with -p, a fixed --delay that is too short under CI contention fires replay at a not-yet-listening app, so every request fails with "status_code got=0" — the dominant remaining go-docker-timefreeze / go-dedup-docker readiness flake (distinct from the container-name race already fixed). When no --health-url is set, after the --delay floor, derive the published host port from the docker command (dockerPublishedHostPort) and additionally wait, bounded by HealthPollTimeout (default 60s), for it to accept a TCP connection via pkg.WaitForPort. Properties: - only ever waits LONGER than --delay, never shorter; instant for a ready port; - falls back to the pure --delay behavior for native apps / container-only or unparseable publishes (so nothing regresses); - proceeds anyway (returns true) if the ceiling elapses — never blocks forever, never weakens an assertion; ctx cancel still returns false promptly. Covers docker-run -p apps; compose apps (ports in the compose file, not the -c command) are a follow-up that needs the resolved app ports plumbed in. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(app): free the compose agent name before a compose up The atg-with-mocks regression flaked with "Container keploy-v3-XXXX Recreate -> Error response from daemon: No such container: <id>" aborting the run with "no linked test suites found". 
The 5135 log shows the cause precisely: a fresh `docker compose up` starts while the PREVIOUS compose is still "Stopping Gracefully" (removing the agent container). keploy injects the agent as a compose service AND force-removes it by name on teardown, so the new up reads the stale project state and tries to Recreate the being-removed agent against its old container id -> "No such container". It is the same teardown->up race the docker-run pre-run guard already closes, just on the compose agent. Before a `docker compose up`, poll the agent name (a.keployContainer) free (force-remove + wait, preRunRemoveBudget) so teardown and the next up are serialized and the up always creates a fresh agent. No-op on the first up; mirrors ensureContainerNameFreeWithin on the docker-run path. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(app): scale pre-run container force-remove budget with contention ensureContainerNameFreeWithin exists to wait out a still-running prior container under docker-daemon contention, but it passed each `docker rm -f` the tight 2s teardown-drain forceRemoveBudget — contradicting its own comment that the startup pre-run cleanup waits out contention. Under heavy contention a single rm -f can run past 2s; SIGKILLing it there (CommandContext) tears the docker CLI down mid-request so the removal may not complete, the name never frees, and the whole budget burns on repeated stillborn 2s attempts. The result is "container name still in use after the pre-run remove budget" → the next `docker run --name` hits a Conflict (exit 125) → the app never starts → every test in the set replays got=0 (observed on go-docker-timefreeze). Size each force-remove deadline to ~1/3 of the overall budget (floored at forceRemoveBudget) so the rm can finish under contention while still leaving room for retries against the async --rm reaper. 
Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(replay): re-send a test request refused before the app accepts it A replay test request that fails with a pre-response "connection refused" (errors.Is ECONNREFUSED) proves the app never received it — zero bytes sent, zero single-use mocks consumed. Under CI contention an app container can briefly refuse while still coming up, and keploy recorded that transport failure as a test failure (synthetic status_code=0), a FALSE regression. Re-send it (bounded to 3, ctx-aware backoff, body rewound via GetBody) so the suite asserts the REAL response instead. A mid-response reset (ECONNRESET)/EOF/broken pipe is NOT retried — those can occur after the app consumed mocks, where a retry would re-run non-idempotent logic against exhausted mocks and fabricate a verdict; a genuinely unreachable app still fails fast after the bounded retries. This is transport robustness, not retrying an assertion (the comparison still runs once on the real response). Covers SimulateHTTP + SimulateHTTPStreaming. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(replay): attribute connection-level test failures as APP_CONNECTION_ERROR When the app produces NO response, CreateFailedTestResult records a synthetic status_code=0 which the matcher reports as STATUS_CODE_CHANGED — making a transport/availability failure (connection refused/reset/EOF/host unreachable) masquerade as a content regression. Classify such errors and add a distinct FailureInfo.Category APP_CONNECTION_ERROR (read by k8s-proxy via TestResult.FailureInfo) so operators triage it as infra, not a regression. The raw StatusCode stays 0 — report data is unchanged, only the attribution is corrected. 
Signed-off-by: slayerjain <shubhamkjain@outlook.com> * test(replay): function-level SimulateHTTP connection-refused integration tests Guard the retry where it is WIRED into the real replay send path (not just the helper in isolation): SimulateHTTP against a sleep-before-bind listener must recover a brief refuse via the bounded retry, and a refuse longer than the ~900ms budget must fail (bounded, no fabrication). No CI credential needed, so it runs in keploy's own test suite as a permanent regression guard. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(replay): safely re-send a test request on docker-proxy connection reset Under CI load docker's userland proxy (docker-proxy) resets freshly-accepted host-port connections during setup, so keploy replay (which dials the published host port) sees "read: connection reset by peer" and synthesizes status_code=0 even for a request that consumed no mock. This is a general docker-mode replay flake; the time-freeze lanes amplify it by shifting request timing into docker-proxy's contended window (reproduced 458/8000, and with zero freeze churn). Re-send the request, but ONLY when provably idempotent: a reset is ambiguous about whether the app already consumed a single-use mock. GetConsumedMocks is per-call and DRAINING (mockmanager.go), so a non-empty result means THIS request consumed a mock -> refuse the re-send and let the original error stand (never re-run against an exhausted mock; no fabricated verdict). Bounded to 2, ctx-aware, with a host-port readiness re-poll. On the refusal path the gate's drained mocks are folded back into totalConsumedMocks so the next mock-filter still drops those exhausted mocks (the drain only affects keploy-side reporting, not the agent's serving state). Adds IsTransportConnReset and tests driving a real MockManager to pin the per-call draining semantics and the safety gate. 
Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(app): strictly gate docker-run name-conflict retry on real conflicts The user-app docker-run retry (added for the multi-test-set recreate "container name already in use" / exit 125 flake) blanket-retried on any "exit status 125", so a genuine 125 (bad image/mount/flag) was retried up to 3x — burning the pre-run remove budget and burying the real error. Docker's "Conflict. The container name ... is already in use" text is never captured (the non-PTY docker-run path streams stderr straight to os.Stderr; the returned error is only "exit status 125"). Confirm the conflict POSITIVELY from docker's own state: retry only when exit-125 AND the --name is still occupied (containerNameFree). A genuine 125 with a free name now fails fast. Adds isExit125 / isDockerRunNameConflict + table tests and a retry-path driver test. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(app): remove stale keploy-agent compose container before re-up The keploy agent is injected as a fixed compose service `keploy-agent` but with a per-process-random container_name (keploy-v3-<hash>); docker-compose tracks a managed container by (project, service) labels, not container_name. In a sandbox that runs record then auto-replay in the SAME process/project (atg-with-mocks), each phase generates a fresh keploy-v3-<hash>; record-stop force-removes the agent out-of-band (async), then replay's `docker compose up` sees the prior keploy-agent service container with a drifted container_name and plans a Recreate — whose remove-step races the concurrent out-of-band removal and loses ("No such container: <id>" / "removal already in progress") -> `compose up` exits 1 -> replay fails. Flaky: passes when the reap finishes within the fixed pre-replay sleep, fails under load. The prior name-based guard was a no-op here (it checked the NEW phase's agent name, not the prior phase's). 
Fix: before the compose up, resolve the prior agent by the compose SERVICE (`docker compose ... ps -aq keploy-agent`, scoping the project exactly as the upcoming up) and force-remove + reap it, so the up always CREATES the agent (no Recreate, no stale-id race). No-op on first up, non-compose paths, and --keep-app-alive single-up reuse. Adds parseComposePSIDs + unit tests. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(proxy): don't drop a buffered/staged mock chunk on connection teardown A once-per-boot startup mock (an outbound call at app boot, before any inbound request) was intermittently dropped from a recorded test set. Root cause is the connection relay teardown losing the mock's bytes — not the syncMock dedup (the prior 4a51765 covered a different facet): - relay/tee.go drain(): on close, `select { case out<-c: case <-shutdown: drop }` could take the shutdown branch and DROP a staged chunk even when `out` had space. Deliver non-blocking FIRST; only fall back to the blocking shutdown-escape send if `out` is genuinely full — so a staged chunk is never dropped while the consumer can still take it (deadlock-safety preserved). - fakeconn.go Read/ReadChunk: on f.closed returned ErrClosed WITHOUT draining an already-buffered chunk. Drain it (drainBufferedLocked) before reporting ErrClosed, so a chunk that arrived before close is still delivered. Both are teardown-tail behavior; steady-state and dedup unchanged. A response that raced an abort/cancel teardown is now emitted as a complete mock rather than dropped. Adds TestTee_StagedChunkSurvivesClose + TestRead{,Chunk}AfterCloseDrainsBuffered (red before / green after). Signed-off-by: slayerjain <shubhamkjain@outlook.com> --------- Signed-off-by: slayerjain <shubhamkjain@outlook.com>
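The drain() change in the fix(proxy) commit above (deliver non-blocking first, fall back to the shutdown-escape send only when the consumer is full) can be sketched with two selects. The function name `deliverStaged` and the channel types are assumptions, not the code in relay/tee.go.

```go
package main

import "fmt"

// deliverStaged hands a staged chunk to out. It first tries a non-blocking
// send, so a chunk is never dropped while out still has room, even if
// shutdown is already closed. Only when out is full does it block, with
// shutdown as the escape hatch that prevents a deadlock.
func deliverStaged(out chan<- []byte, shutdown <-chan struct{}, chunk []byte) bool {
	select {
	case out <- chunk:
		return true
	default:
		// out is full right now; fall through to the blocking path.
	}
	select {
	case out <- chunk:
		return true
	case <-shutdown:
		return false // consumer is gone and out is full: give up
	}
}

func main() {
	out := make(chan []byte, 1)
	shutdown := make(chan struct{})
	close(shutdown) // teardown has already begun

	fmt.Println(deliverStaged(out, shutdown, []byte("startup mock"))) // prints "true"
	fmt.Println(deliverStaged(out, shutdown, []byte("second chunk"))) // prints "false"
}
```

A single select over both cases picks at random when both are ready, which is how a chunk could be dropped despite free capacity. The leading non-blocking select removes that coin flip.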
feat(shutdown): drain in-flight streams in-app on SIGTERM (#4306) * feat(shutdown): drain in-flight streams in-app on SIGTERM Add an in-process graceful-shutdown drain to the root signal handler (utils.NewCtx). When KEPLOY_SIDECAR_DRAIN_SECONDS is set to a positive integer, the handler keeps the proxy and its parser goroutines live for that window after receiving SIGTERM -- before cancelling the root context -- so in-flight MITM'd streams finish cleanly, then tears down. A second signal cuts the wait short. This is the in-app replacement for the native Kubernetes `sleep` lifecycle (SleepAction) preStop hook the k8s-proxy webhook injected on the keploy-agent sidecar. SleepAction is GA only in k8s 1.30 and is rejected outright by some older apiservers, which fails pod admission for every instrumented workload. Doing the wait in-process is portable to any k8s version. The env var is set by the k8s-proxy webhook in its new app-managed-drain mode; unset (the default, and every non-sidecar invocation) preserves the historical cancel-immediately-on-signal behaviour. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Aditya Sharma <aditya282003@gmail.com> * fix(shutdown): respect the sidecar drain window across repeat signals Previously a second SIGTERM/SIGINT during the KEPLOY_SIDECAR_DRAIN_SECONDS drain cut it short. The kubelet, however, sends exactly one SIGTERM and never a second one -- it escalates to SIGKILL at terminationGracePeriodSeconds, which is uncatchable and is the real hard stop. So aborting on a "second signal" only protected against a non-existent kubelet behaviour while making a repeat `kubectl delete` (the usual source of an extra SIGTERM) silently truncate an in-flight drain. Now additional SIGTERM/SIGINT signals received during the window are logged but do NOT abort it; the drain always runs to completion (bounded by the pod's terminationGracePeriodSeconds via SIGKILL). 
An operator who wants an immediate kill uses `kubectl delete --grace-period=0 --force`. Verified end-to-end in a kind cluster (k8s 1.35) with the agent image built from this change: on `kubectl delete pod`, the keploy-agent sidecar drained for exactly 15s before shutdown; three extra SIGTERMs delivered at +3s/+7s/ +11s were each logged "ignoring to honour the 15s drain window" and the drain still completed the full 15s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Aditya Sharma <aditya282003@gmail.com> --------- Signed-off-by: Aditya Sharma <aditya282003@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(app): verify container name is free before docker-run re-run (#4304)

In docker-mode the per-test-set app lifecycle force-removes the prior container by name, then runs `docker run --name <name>` for the next test set. The pre-run cleanup was fire-and-forget: `docker rm -f` returns as soon as it has *initiated* removal, but under docker-daemon contention the `--rm` async reaper (from the prior test set's container exiting) can keep holding the name for a brief window after that, long enough for the immediately following `docker run --name` to hit "Conflict. The container name ... is already in use" (docker exit 125), which fails the test set (observed intermittently on go-docker-timefreeze's test-set-3 under CI docker contention).

Replace the pre-run force-remove with ensureContainerNameFreeWithin, which removes then polls `docker ps` until the name is actually free (retrying the remove within the existing preRunRemoveBudget) before returning. Closes the reaper race deterministically instead of racing into the conflict. Teardown-path force-removes are unchanged.

Reproduced locally under CPU saturation: the bare rm-f-then-run flow conflicts 50/50, the verify-name-free flow 0/50; go-docker-timefreeze records+replays all four test-sets with zero conflicts on the fixed binary.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
feat(agent): make capture-session state reentrant for multi-app (#4283) * feat(agent): make capture-session state reentrant for multi-app Make the userspace capture pipeline able to run multiple independent recording sessions in one process, so an enterprise multi-app agent can serve several apps concurrently without cross-contamination. This is a generic reentrancy change only — no app/tenant/session concept enters OSS, and single-session behaviour is preserved byte-for-byte through explicit fallbacks. - syncMock: add New() and NewDedupQueue() so a caller can own an independent manager + dedup queue; Get()/GetDedupQueue() remain the process-global default. Add a context carrier (NewContext/FromContext/FromContextOrGlobal) and resolve the manager at the HTTP/generic/MySQL emit sites from the context, falling back to the global when none is set. - syncMock: add a per-instance test-id counter (NextTestID) and use it from conn.Capture; deprecate the package-global GlobalTestCounter. - proxy: add SetSessionResolver/GetSessionFor(tgid) so an external composer can route a connection to its owning session; a nil resolver (the OSS default) returns the single session. - proxy: delete the SrcPortToDstURL entry on connection close (after the parser errgroup joins) to bound the map and stop a recycled source port from reading a previous connection's stale TLS destination. - mockdb: add a per-instance MockFormat so spec.mockFormat is honoured per session instead of via the process-global default. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> * feat(agent): expose session resolver on the Proxy interface Add GetSessionFor and SetSessionResolver to the agent.Proxy interface so external composers can install per-TGID session routing through the interface, not just the concrete type. The single-session default is unchanged (nil resolver returns the one session). 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> * feat(syncMock): per-session dedup queue + static-deduper context seam Extend the reentrancy seam so a per-session manager also carries its own dedup queue and an optional static-deduplication hook, keeping multi-app callers' dedup state fully isolated. Single-session behaviour is unchanged: the package-global manager falls back to the global dedup queue, and the static-deduper context value is simply unset. - SyncMockManager now owns a dedupQueue (New() allocates one); DedupQueue() returns it, or the package-global globalDedupQueue for the default instance, so callers can switch from GetDedupQueue() to mgr.DedupQueue() for per-session dedup ordering. - Add the StaticDeduper interface plus WithStaticDeduper / StaticDeduperFromContext, a context carrier defined here so a per-app static deduper can ride the parser context without the proxy and the consuming hook importing each other. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> * test(syncMock): cover per-manager dedup queue + static-deduper context Add unit tests for the per-session seam additions: New() managers own an independent dedup queue while the global instance falls back to the global queue, and the StaticDeduper context carrier round-trips (nil when unset). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> * feat(supervisor): per-session manager for the V2 EmitMock path EmitMock (the V2 mock-emit path used by all supervisor-based parsers) routed unconditionally through the package-global syncMock.Get(), so a multi-app caller could not isolate V2-parser mocks per app the way the legacy mgr.AddMock parsers now can. - supervisor.Session gains an optional Mgr field; EmitMock prefers it and falls back to the package-global when nil (single-session default — no behaviour change). 
- recordViaSupervisor sets Mgr from the parser context (syncMock.FromContext(ctx)), so when a multi-app caller carries a per-app manager on ctx, every V2 parser emits into that app's manager via the one EmitMock chokepoint. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> * feat(memoryguard): fan memory-pressure out to registered managers The pause decision stays global (pod-level cgroup memory), but the buffered mocks that consume that memory can live in many sync-mock managers: the multi-app agent runs one manager per app and the package-global Get() manager is then unused, so SetMemoryPressure on it relieves nothing. Add RegisterPressureHook so a composer can register a fan-out over its live managers; applyPausedState invokes the global manager and every registered hook. Single-app behaviour is unchanged (no hook registered → only the global manager, as before). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> * feat(conn): add transient TestCase.SourcePod carrier for per-pod attribution Reentrancy seam for the enterprise DaemonSet agent's per-pod test-case attribution. TestCase gains a SourcePod field tagged json/yaml/bson "-", so it is never persisted to stored test-case files or serialized into the upload body — it is purely an in-memory routing tag. Capture stamps it from the context (WithSourcePod / SourcePodFromContext) for both the HTTP and gRPC capture paths. OSS single-app callers never set the context value, so SourcePod stays empty and behaviour is unchanged. The enterprise reader sets it per connection from the owning pod so the uploader can carry a per-pod source to the control plane. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> * fix(daemonset): skip sys_enter_socket PID auto-registration in DS mode The sys_enter_socket tracepoint auto-registers any process calling socket() that passes the in-eBPF namespace check into the shared target_namespace_pids map. On a hostPID+hostNetwork DaemonSet agent that check matches host processes (init/PID 1, node daemons, short-lived forks), polluting the map so the proxyless capture records non-target host traffic — the test-set-0 leak. In DaemonSet mode the CRD-scoped SessionReconciler is the sole owner of target_namespace_pids and arms exactly the recorded pods' TGIDs, so the tracepoint's auto-detection is redundant. Gate it off when KEPLOY_DAEMONSET_ENABLED=true. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> * style(syncMock): gofmt context_test.go to fix lint gate Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(proxy): guard SrcPortToDstURL delete-on-close against source-port recycle handleConnection registers `defer SrcPortToDstURL.Delete(sourcePort)`, but that defer runs (LIFO) AFTER the defer that calls srcConn.Close(), which releases the client source port at the OS level. The kernel can then recycle that port to a new connection that Stores its own dst mapping before the older connection's Delete fires — so the older connection clobbers the new one's fresh entry, and the new connection's no-SNI ClientHello fails to recover its destination. Track per-port ownership (port -> connection token) and delete only via CompareAndDelete on the token: whichever connection currently owns the port is the only one allowed to delete its mapping. Correct under any interleaving and touches no readers or the stored value type. Adds a regression test that fails under the old unconditional Delete. 
Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(memoryguard): give RegisterPressureHook a deregistration path pressureHooks was an append-only slice with no way to remove a hook, so a multi-app composer that registers one hook per app/session would leak: every closed-over SyncMockManager (and its buffers) stays pinned for the life of the process and is re-invoked on every pressure transition. The 'register once' contract was convention-only with no enforcement mechanism. RegisterPressureHook now returns an idempotent unregister func (hooks keyed by token in a map). Callers that ignore the return value still compile, so this is backward compatible. Adds a test covering register/fire/unregister/no-leak and the nil-hook no-op. Signed-off-by: slayerjain <shubhamkjain@outlook.com> * fix(syncMock): warn on unwired outChan; document per-app dedup seam as consumer-driven Two seam-honesty fixes on the multi-app reentrancy surface: - New()'s doc claimed it gives the manager 'its own output channel', but it wires none — a New() manager whose owner forgets SetOutputChannel buffers every mock and silently emits nothing. Correct the doc and emit a one-time warning the first time a mock is buffered while outChan was never wired. - The per-app dedup carrier (mgr.DedupQueue()) and the WithStaticDeduper context seam have zero OSS consumers; m.dedupQueue is read only by its own getter. The comments overstated this ('per-session dedup for free'). Reword to state the isolation is opt-in and only materializes when the multi-app consumer threads DedupQueue() into ResolveJob. Add a test pinning the getter contract (private per New() instance; global fallback for the package manager and nil receiver) so a refactor that drops the per-instance queue fails CI. 
Signed-off-by: slayerjain <shubhamkjain@outlook.com> --------- Signed-off-by: Ayush Sharma <kshitij3160@gmail.com> Signed-off-by: slayerjain <shubhamkjain@outlook.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: Shubham Jain <shubhamkjain@outlook.com>
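The source-port recycle guard in the `fix(proxy)` part of this commit relies on compare-and-delete semantics. The sketch below is a simplified illustration under stated assumptions: the types `table` and `entry` and the methods `claim`, `release`, and `lookup` are hypothetical, and the sketch folds the destination and the ownership token into one map, whereas the commit describes a separate port-to-token ownership map that leaves the stored value type untouched. Only `sync.Map.CompareAndDelete` (Go 1.20+) is a real API here.

```go
package main

import (
	"fmt"
	"sync"
)

// entry doubles as the ownership token: each connection gets its own
// pointer, so pointer identity tells connections apart.
type entry struct{ dst string }

// table maps a source port to the entry of the connection that owns it.
type table struct{ m sync.Map }

// claim records dst for port and returns the token for this connection.
func (t *table) claim(port uint16, dst string) *entry {
	e := &entry{dst: dst}
	t.m.Store(port, e)
	return e
}

// release deletes the mapping only if this connection still owns the port.
func (t *table) release(port uint16, e *entry) bool {
	return t.m.CompareAndDelete(port, e)
}

// lookup returns the destination currently recorded for port.
func (t *table) lookup(port uint16) (string, bool) {
	v, ok := t.m.Load(port)
	if !ok {
		return "", false
	}
	return v.(*entry).dst, true
}

func main() {
	var t table
	older := t.claim(40000, "old.example:443")
	// The kernel recycles the port before the older connection cleans up.
	newer := t.claim(40000, "new.example:443")

	fmt.Println("stale release deleted:", t.release(40000, older))
	dst, ok := t.lookup(40000)
	fmt.Println("mapping after stale release:", dst, ok)
	fmt.Println("owner release deleted:", t.release(40000, newer))
}
```

With an unconditional delete, the older connection's cleanup would remove the newer connection's entry; with the token check, the stale release is a no-op.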
fix(app): make docker-run pre-run container cleanup robust under contention (#4303) The #4301 pre-run force-remove (free a leftover --name container before a docker-run re-run) still let "container name already in use" (docker exit 125) through on multi-run lanes (e.g. dedup, which re-runs the app per test-set), because: 1. it force-removed only a.container, which is empty when --container-name isn't passed on the CLI, so it silently no-op'd; and 2. it reused the 2s teardown forceRemoveBudget — too short to remove a STILL-RUNNING prior container under heavy docker-daemon contention, so the remove timed out and the name stayed taken. Resolve the name from --container-name (a.container) OR, when unset, the --name in the command itself (new utils.ContainerNameFromDockerRun), and force-remove with a generous preRunRemoveBudget (30s). This is a startup-time cleanup, NOT the SIGINT teardown drain path, so it can afford to wait for the prior container to actually go away. Verified against the real collision: a running --name container blocks the re-run; a `docker rm -f` first lets it succeed. Signed-off-by: slayerjain <shubhamkjain@outlook.com>
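The name-resolution rule in this commit (prefer `--container-name`, otherwise fall back to the `--name` in the command) can be sketched as below. This is not the implementation of `utils.ContainerNameFromDockerRun`, whose signature is not shown in the source; `containerNameFromCommand` and `resolveName` are hypothetical helpers, and the parsing assumes whitespace-separated arguments with no shell quoting.

```go
package main

import (
	"fmt"
	"strings"
)

// containerNameFromCommand extracts the value of --name from a docker run
// command line. Hypothetical helper: it splits on whitespace only, so quoted
// arguments and a --name passed to the container's own command are not
// handled.
func containerNameFromCommand(cmd string) string {
	fields := strings.Fields(cmd)
	for i, f := range fields {
		if strings.HasPrefix(f, "--name=") {
			return strings.TrimPrefix(f, "--name=")
		}
		if f == "--name" && i+1 < len(fields) {
			return fields[i+1]
		}
	}
	return ""
}

// resolveName prefers the explicit CLI value and falls back to the command.
func resolveName(containerFlag, cmd string) string {
	if containerFlag != "" {
		return containerFlag
	}
	return containerNameFromCommand(cmd)
}

func main() {
	cmd := "docker run --name my-app -p 8080:8080 my-image"
	fmt.Println(resolveName("", cmd))         // falls back to the command
	fmt.Println(resolveName("from-cli", cmd)) // explicit value wins
}
```

Resolving the name from either source is what keeps the pre-run cleanup from silently becoming a no-op when only the command carries the name.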
fix(app): force-remove the --name container before a docker-run re-run (#4301) EnsureRmBeforeName adds --rm so the container auto-removes on exit, but under heavy docker-daemon contention that reap can lag past the NEXT `docker run --name` on a re-run (e.g. a dedup/timefreeze lane's record -> replay), which then fails with "Conflict. The container name ... is already in use" (docker exit 125) even though the previous run finished. #4297 added this guard for compose (ComposeDown's force-remove-by-name); docker-run mode lacked it. Force-remove the named container first (bounded + best-effort via forceRemoveContainerByName — a no-op when nothing is lingering) so a re-run never collides with a still-reaping container. Signed-off-by: slayerjain <shubhamkjain@outlook.com>
fix(agent): align agent-ready wait with the agent's healthcheck budget (#4299) Under heavy CI docker-daemon contention the in-docker keploy-agent container can take ~2 minutes just to START — observed in CI as a `docker run` of an ALREADY-LOCAL image (no pull) taking 126s before the agent process ran. keploy's agent-ready waits were 60s (native/docker-run setup) and 120s (record, replay, compose), all SHORTER than the agent container's own healthcheck budget (start_period 10s + interval 5s x retries 60 ~= 310s). So keploy gave up while the agent's own healthcheck still considered it starting and tore the bring-up down, surfacing as a spurious "keploy-agent did not become ready in time" that failed otherwise-green lanes (go-dedup-docker, umami, python-postgres-tls-containerized). Add pkg.AgentReadyTimeout() — default 330s (>= the agent healthcheck budget), overridable via KEPLOY_AGENT_READY_TIMEOUT (whole seconds) — and use it at all four readiness sites. This is not papering over a hang: the agent provably comes up (it logged its HTTP server ~126s in), it was just slower than the old wait, so aligning the CLI wait to the agent's declared startup contract fixes the inconsistency. Normal runs are unaffected — the health ticker returns the instant the agent is healthy. Signed-off-by: slayerjain <shubhamkjain@outlook.com>
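A default-with-override timeout of the kind this commit describes might look like the sketch below. It is not the code of `pkg.AgentReadyTimeout()`: the lowercase `agentReadyTimeout`, the injected `getenv` parameter, and the choice to fall back to the default on empty, non-numeric, or non-positive values are assumptions made for illustration. The 330s default and the `KEPLOY_AGENT_READY_TIMEOUT` variable (whole seconds) come from the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Healthcheck budget quoted in the commit: start_period 10s plus
// interval 5s times 60 retries, which is 310s.
const healthcheckBudget = 10*time.Second + 60*5*time.Second

// Default wait from the commit; it is at least the healthcheck budget.
const defaultAgentReadyTimeout = 330 * time.Second

// agentReadyTimeout is a hypothetical stand-in for the readiness timeout
// lookup. getenv is injected so the function can be exercised without
// touching the process environment.
func agentReadyTimeout(getenv func(string) string) time.Duration {
	secs, err := strconv.Atoi(getenv("KEPLOY_AGENT_READY_TIMEOUT"))
	if err != nil || secs <= 0 {
		// Assumed behaviour: unusable values fall back to the default.
		return defaultAgentReadyTimeout
	}
	return time.Duration(secs) * time.Second
}

func main() {
	fmt.Println("healthcheck budget:", healthcheckBudget)
	fmt.Println("agent-ready timeout:", agentReadyTimeout(os.Getenv))
}
```

Keeping the default above the healthcheck budget means the CLI does not give up on an agent that its own healthcheck still treats as starting.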