Tags · keploy/keploy

v3.5.82

feat(grpc): implement static dedup for gRPC (#4298)

* feat(grpc): implement static dedup for gRPC

Signed-off-by: Shivanipandey31 <shivanipande486@gmail.com>

* fix(grpc): address review comments — dedup window collision, header deep copy, parseGrpcStatus error

- Always increment GlobalTestCounter for duplicate gRPC streams so
  concurrent duplicates each get a unique mock-window name instead of
  all sharing "test-0" and colliding in the mock manager
- Deep-copy PseudoHeaders/OrdinaryHeaders via maps.Clone when promoting
  initial headers to trailers for error responses, preventing shared
  map mutation between Headers and Trailers
- Return -1 (not 0) from parseGrpcStatus on non-numeric trailer values
  so malformed grpc-status causes a deterministic mismatch rather than
  silently passing as OK
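
A minimal Go sketch of the last two bullets, assuming flat `map[string]string` headers (for which the shallow copy made by `maps.Clone` is already fully independent). `parseStatus` and `promoteToTrailers` are hypothetical stand-ins, not the functions in the PR.

```go
package main

import (
	"fmt"
	"maps"
	"strconv"
)

// parseStatus is a hypothetical stand-in for parseGrpcStatus: a non-numeric
// value yields -1, so a malformed grpc-status can never compare equal to OK (0).
func parseStatus(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil {
		return -1
	}
	return n
}

// promoteToTrailers copies the headers into a fresh map so later writes to
// the trailers do not leak back into the headers.
func promoteToTrailers(headers map[string]string) map[string]string {
	return maps.Clone(headers)
}

func main() {
	headers := map[string]string{"grpc-status": "13"}
	trailers := promoteToTrailers(headers)
	trailers["grpc-message"] = "internal"

	fmt.Println(len(headers), len(trailers))            // 1 2
	fmt.Println(parseStatus("13"), parseStatus("oops")) // 13 -1
}
```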

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(grpc): update stale CaptureGRPC doc comment after counter fix

The second bullet incorrectly said the counter is only incremented for
non-duplicates. After the collision fix it is always incremented.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* error fix

Signed-off-by: Shivanipandey31 <shivanipande486@gmail.com>

---------

Signed-off-by: Shivanipandey31 <shivanipande486@gmail.com>
Signed-off-by: SHIVANI PANDEY <shivanipande486@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Jun 29, 2026
613358a

v3.5.81

fix(app): work-slow-not-fail for pre-run container-name removal under daemon saturation (#4309)

* fix(app): work-slow-not-fail for the pre-run container-name removal

Under a saturated docker daemon (heavy CI oversubscription) the `docker rm -f` of
a leftover --name container can take well past the old 30s pre-run budget + 3
retries (~120s total), so the next `docker run --name` hits "Conflict. The
container name is already in use" and the run fails — the go-docker-timefreeze
flake. This removal is NOT in the SIGINT drain path, so it can afford to wait.

Make it work-slow-not-fail: preRunRemoveBudget 30s->90s and
dockerRunNameConflictRetries 3->5 (~540s of removal attempts, perAttempt=budget/3
=30s so each rm can finish), so a saturated daemon yields a slower removal, never
a failed run. Pure budget change; teardown-drain budgets untouched.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(replay): work-slow reset-resend — more attempts + backoff for the docker-proxy reset burst

A transport-reset test request (docker's userland-proxy resetting freshly-accepted
host-port conns in bursts under load) is re-sent on the mock-gated reset path. The
budget was 2 back-to-back attempts — enough for a transient reset, but under heavy
CI contention the reset burst outlasts both and the test fails got=0
(go-docker-timefreeze's reset mode, distinct from the container-name-conflict mode).

Make it work-slow: 2->6 attempts spread by a growing backoff (attempt*300ms) so the
bounded re-sends ride out the reset burst rather than hammering it. Each attempt
stays gated on no-mock-consumed + port readiness, and a non-reset failure still
stops immediately, so the extra attempts only add latency on the reset path.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(yaml): atomic non-append WriteFileF — temp+rename so readers never see a partial file

WriteFileF's non-append branch opened the destination with O_TRUNC and streamed
the document, leaving a window where a concurrent or volume-lagging reader sees a
zero-length or half-written file. This is the single write primitive behind every
test-set report, config, and test->mock mapping (reportdb Insert/UpdateReport,
testdb config, mapping writes), so on overlay/NFS/container-mounted volumes under
load a reader (report-upload, localTestsPassed, reportdb.GetReport) could read it
mid-sync and either hard-fail to decode or silently decode a PARTIAL report with
test entries missing — a customer-facing reliability + correctness hazard.

Write to a temp file in the same dir, Sync, preserve the target mode, then
atomically rename over the destination (POSIX rename replaces atomically; Windows
remove-then-rename). Mirrors the proven mockdb.writeMocksAtomically / testdb.upsert
pattern. The append branch (NDJSON/multi-doc) is unchanged. Adds a concurrency
regression test (reader never observes a partial file across 80 rewrites).

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(grpc): retry connection-refused on the SimulateGRPC dial (slow-starting app)

SimulateGRPC dialed the app once with net.Dial and returned "failed to dial:
connection refused" on the first failure. A slow-starting gRPC app — a docker app
under CI load especially — can briefly refuse connections while still coming up,
exactly the scenario the HTTP replay path already handles via
doRequestWithConnRefusedRetry. gRPC had no such tolerance, so its first tests
deterministically false-fail with got=0 — the likely grpc-mongo NextToken
DeadlineExceeded class.

Extract dialTCPWithConnRefusedRetry: re-dial on a pure connection-refused up to
maxConnRefusedRetries with the shared growing backoff (a refused dial sent zero
bytes / consumed zero mocks, so the re-dial is idempotent); any other error,
attempt-exhaustion, or a cancelled context returns immediately. Reuses the
existing isPreResponseConnRefused / maxConnRefusedRetries / connRefusedRetryBackoff.
Adds tests: retries-then-connects, bounded-on-persistent-refusal, respects-context.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(replay): apply reset-resend on the streaming test path (docker-proxy reset)

The non-streaming replay loop tolerates a docker userland-proxy reset on a
freshly-accepted host-port conn (replay.go:1752 — the request never reached the
app, consumed zero mocks, so retryResetOnce re-sends instead of synthesizing a
false got=0). The streaming Phase 2 path (SSE/NDJSON/chunked/multipart) had no
such tolerance: a reset under CI load went straight to failure++ and a synthesized
got=0 — a false STATUS_CODE_CHANGED/AppConnectionError on a request that was never
processed.

Wire the identical mock-gated retryResetOnce into the streaming simErr branch
before the failure synthesis: on a transport reset, re-send (retryResetOnce
returns a fresh streaming response for a streaming tc); on the unsafe refusal
path, fold its drained mocks into totalConsumedMocks exactly as the non-streaming
loop does. retryResetOnce is bounded + gated on no-mock-consumed (covered by
reset_resend_test.go), so no new failure modes.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(app): bounded retry for transient compose-dependency startup failures

When keploy records a docker-compose app and `compose up` fails because a
dependency container (a DB/emulator under `depends_on: condition: service_healthy`)
crashed during startup, keploy aborted the whole run ("user application terminated
unexpectedly") — even though the app's own container never started and a retry
brings the flaky dependency up cleanly. This is the orderflow/localstack
atg-with-mocks CI flake; for a customer it means a flaky compose dependency fails
their recording. Per work-slow-not-fail, tolerate it.

Add a bounded (3), linearly-backed-off retry of the compose bring-up, gated
strictly on the transient signature read from docker's own post-mortem state
(`docker compose ps -a --format json`): the app service in state `created`
(compose never started it because a service_healthy gate failed) AND some other
non-agent service `exited` non-zero. A genuinely-failing app runs and exits — its
service is `exited`, never `created` — so it can never match and still fails fast
with its real exit code (no masking; the gate is firewalled and unit-tested). Each
attempt does a clean ComposeDown + re-runs the pre-up guards; the loop, backoff,
and pre-up guards all short-circuit on ctx cancellation so the teardown budget is
respected. The `docker compose ps` state probe is evaluated lazily, only when a
retry is plausible, so it never shells out on the success/cancel path.
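
The gate above reduces to a small predicate over per-service state. The sketch below assumes the `docker compose ps` output has already been parsed into a hypothetical `serviceState` struct; the field names and the function name are illustrative, not keploy's.

```go
package main

import "fmt"

// serviceState is a hypothetical, already-parsed view of one compose service.
type serviceState struct {
	Service  string
	State    string // e.g. "created", "running", "exited"
	ExitCode int
}

// isTransientDependencyFailure reports whether the app was never started
// (state "created") while some other non-agent service exited non-zero.
func isTransientDependencyFailure(states []serviceState, appService, agentService string) bool {
	appNeverStarted := false
	dependencyCrashed := false
	for _, s := range states {
		switch {
		case s.Service == appService:
			appNeverStarted = s.State == "created"
		case s.Service == agentService:
			// The agent's own state never counts as a crashed dependency.
		case s.State == "exited" && s.ExitCode != 0:
			dependencyCrashed = true
		}
	}
	return appNeverStarted && dependencyCrashed
}

func main() {
	flakyDep := []serviceState{
		{Service: "app", State: "created"},
		{Service: "localstack", State: "exited", ExitCode: 1},
		{Service: "keploy-agent", State: "running"},
	}
	genuineFailure := []serviceState{
		{Service: "app", State: "exited", ExitCode: 2},
		{Service: "localstack", State: "running"},
	}
	fmt.Println(isTransientDependencyFailure(flakyDep, "app", "keploy-agent"))       // true
	fmt.Println(isTransientDependencyFailure(genuineFailure, "app", "keploy-agent")) // false
}
```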

Also fixes a latent bug: the file-based compose down/ps/agent-id branches now
carry the `-p` project flags so a user-set compose project name resolves correctly.

Covered by unit tests (the firewall incl. genuine-fail-not-masked, the loop gate
incl. probe-not-called-on-success, the ps parser) and dockerlive-tagged live tests
(RED abort, GREEN flaky-dep recovers, NO-MASKING always-fail surfaces).

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci: drop codfish/semantic-release-action dry-run from the PR build

The keploy org blocks third-party (non-GitHub-verified) Actions, so
codfish/semantic-release-action@v3 fails the build job at "Getting action download
info" with "Repository access blocked", cascading to CI Gate and the
record_build_replay_build cells. It was only a dry-run version preview (last step,
nothing consumes its output; the real release runs elsewhere), so removing it
unblocks the PR build with no loss of coverage.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(ci/proxy-stress): stop keploy on app-start bail + timeout-ceiling everything

The proxy-stress-test job hung ~6h until the job timeout. RCA from the CI log
(only 3 SIGINTs delivered for 4 record iterations) plus a local repro:

send_request runs in the background and SIGINTs the foreground keploy record to
stop it. On a flaky slow app-start (the app's /health does a DB ping and only
starts after postgres is service_healthy; under contention the 40x3s=120s budget
was occasionally exceeded), send_request hit exit 1 WITHOUT calling
container_kill, so keploy recorded forever. The local repro confirmed keploy
terminates cleanly the instant it IS signaled - the hang is purely the missing
signal, not a recording-hot-path bug.

- call container_kill on the health-failure bail path (the direct fix)
- raise the startup budget 40 -> 80 attempts for contention-slow boot
- wrap record, both replays, and both compose-downs in timeout (hard ceiling)
  so nothing in this script can hang the job for hours

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

Jun 28, 2026
b0dea76

v3.5.80

feat: strict request-body value-match and unified on-disk noise block (#4310)

* feat(schemanoise): value-match all candidates under strict mode

Under strict schema-noise enforcement (SchemaNoiseStrict), StrictReject
previously short-circuited to "allowed" for any mock with no learned
req_body_noise, so a value change on an unmarked field was silently
served. Gate that early-return on !e.strict: in strict mode every
candidate is now value-compared against its recorded request body, and
a drift on a field not covered by configured/learned noise rejects it.

The recorded-vs-live diff still excludes user/global body noise and
obfuscated values via KnownNoise, and the non-strict path is unchanged
(it still tightens only mocks that carry learned noise). Updates the
HTTP filterStrictNoiseMatches docs and the match() strict-block comment
to describe the new behaviour, and extends the engine tests with
no-learned-noise strict cases.

Signed-off-by: Aditya Sharma <aditya282003@gmail.com>

* refactor(mockdb): unify mock noise under a single on-disk noise block

The on-disk mock format split obfuscator value-regexes (top-level
`noise:` list) from request-body schema noise (`req_body_noise:` map).
Consolidate both under one `noise:` mapping:

  noise:
    req:    [ body.tier_type, body.user.id ]   # schema field paths
    value:  [ "^tok-.*$" ]                      # obfuscator value-regexes

noise.req is a plain path list — the per-path regex values are dropped
because the strict path only ever honoured whole-field "ignore", so
they were dead weight for request-body noise.

The change is confined to the serialization boundary: the in-memory
models (Mock.Noise []string, MockSpec.ReqBodyNoise map) are unchanged,
so the matcher/engine keep their existing vocabulary. A new DocNoise
type with custom YAML/JSON unmarshalers converts at encode/decode.

Backward compatible (read both, write new only): legacy `noise:` lists
and `req_body_noise:` maps still decode (the list folds into value, the
map keys fold into req); new writes emit only the unified block. The
gob format is unaffected (it gob-encodes models.Mock directly). Adds
DocNoise unit tests and a legacy-format decode test.
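
To illustrate the "read both, write new only" decode, here is a sketch using only `encoding/json`. The type names, the `decodeNoise` helper, and the legacy map's value type are assumptions for the example; the real DocNoise also handles YAML.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sort"
)

// docNoise is a hypothetical stand-in for the unified noise block.
type docNoise struct {
	Req   []string `json:"req,omitempty"`
	Value []string `json:"value,omitempty"`
}

// UnmarshalJSON accepts the legacy list form (folded into Value) as well as
// the unified mapping form.
func (n *docNoise) UnmarshalJSON(data []byte) error {
	trimmed := bytes.TrimSpace(data)
	if len(trimmed) > 0 && trimmed[0] == '[' {
		return json.Unmarshal(trimmed, &n.Value)
	}
	type plain docNoise // same fields, no methods: avoids infinite recursion
	return json.Unmarshal(trimmed, (*plain)(n))
}

// mockDoc models only the two keys relevant to this example.
type mockDoc struct {
	Noise        docNoise                   `json:"noise"`
	ReqBodyNoise map[string]json.RawMessage `json:"req_body_noise,omitempty"` // legacy, read-only
}

// decodeNoise folds the legacy req_body_noise map keys into Noise.Req.
func decodeNoise(data []byte) (docNoise, error) {
	var doc mockDoc
	if err := json.Unmarshal(data, &doc); err != nil {
		return docNoise{}, err
	}
	paths := make([]string, 0, len(doc.ReqBodyNoise))
	for p := range doc.ReqBodyNoise {
		paths = append(paths, p)
	}
	sort.Strings(paths) // map iteration order is random; keep output stable
	doc.Noise.Req = append(doc.Noise.Req, paths...)
	return doc.Noise, nil
}

func main() {
	legacy := `{"noise":["^tok-.*$"],"req_body_noise":{"body.user.id":[]}}`
	unified := `{"noise":{"req":["body.tier_type"],"value":["^tok-.*$"]}}`
	for _, in := range []string{legacy, unified} {
		n, err := decodeNoise([]byte(in))
		fmt.Println(n.Req, n.Value, err)
	}
}
```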

Signed-off-by: Aditya Sharma <aditya282003@gmail.com>

* test(schema-noise): assert noise.req in the mux-elasticsearch e2e

The schema-noise-detection e2e script grepped for the literal on-disk
key `req_body_noise`, which the unified noise block replaced with a
`noise.req` field-path list. Detection still works (the path is written
under noise.req), but the Phase B assertion could no longer find the old
key and failed.

Update the mock_has_*_noise helpers to match a `- body.<path>` list item
(unique to noise.req), refresh the display grep to print the noise block,
and align the phase messages/comments with the new key name. Phase C
(strict rejection) was already passing and is unchanged.

Signed-off-by: Aditya Sharma <aditya282003@gmail.com>

* fix: remove codfish/semantic-release-action, which GitHub has blocked (attacker force-pushed malicious commit)

Signed-off-by: Aditya Sharma <aditya282003@gmail.com>

---------

Signed-off-by: Aditya Sharma <aditya282003@gmail.com>

Jun 27, 2026
9b95c0b

v3.5.79

fix(ci-flake): tab-safe PostgresV3 SQLNormalized YAML (CI flake robustness, batch 1) (#4305)

* fix(models): tab-safe YAML for PostgresV3 SQLNormalized

PostgresV3QuerySpec.SQLNormalized is a plain string with no write-side
YAML styling. After the integrations restoreRawSQL pass it can carry
embedded tabs/newlines from the original statement; yaml.v3 v3.0.1 then
emits such values as a literal block scalar (`|N-`), and because
yaml.Node.Encode marshals-then-reparses its own output, that block
scalar fails to round-trip with "found a tab character where an
indentation space is expected". The failure is inside EncodeMock's
Spec.Encode, before the post-encode sanitizeYAMLStringNodes walk runs,
so the node sanitizer can't rescue it — the whole PostgresV3 mock fails
to marshal and is dropped.

Add a PostgresV3QuerySpec.MarshalYAML that routes SQLNormalized through
PostgresV3SafeString (DoubleQuotedStyle when StringNeedsDoubleQuoted),
mirroring the existing PostgresV3Notice/PostgresV3Error convention. The
model field stays `string`, so JSON/BSON and all integrations
assignment/read sites compile unchanged; only the YAML write side gains
the double-quoting. The alias spells out every field explicitly because
yaml.v3 v3.0.1 panics on the duplicated-key embed-and-shadow idiom.

Regression tests: TestYAMLRoundTrip_PostgresV3_* cover leading-tab and
tab/newline-bearing SQLNormalized round-tripping through EncodeMock.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(app): retry user-app docker-run on transient container-name conflict

The v3.5.77 pre-run ensureContainerNameFreeWithin frees the user-app
container name before `docker run --name X`, but on a saturated CI daemon
the prior test-set's `--rm` reaper can still re-touch the name in the
narrow window between the guard's "name is free" check and the run, so
the run loses the race and docker exits 125 ("Conflict. The container
name ... is already in use") without starting the app — failing the whole
test-set (observed: go-docker-timefreeze test-set-0->1 in pipeline 5120,
post-v3.5.77, with no pre-run-budget warning, i.e. the guard ran and
returned cleanly but the name was retaken before the run).

Close the residual window at the run itself: when a user-app docker-run
exits 125, force-remove the name and re-issue the run, bounded by
dockerRunNameConflictRetries so a genuine 125 (bad image, missing mount,
…) still surfaces after a few attempts. Belt-and-suspenders with the
pre-run guard; only the docker-run (non-compose) path is affected.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(runner): use AgentReadyTimeout for the runner agent-readiness gate

The runner's pre-instrumentation agent-readiness gate hardcoded a 120s
ceiling, while the record/replay/agent gates all use the contention-tuned
pkg.AgentReadyTimeout (default 330s, overridable via
KEPLOY_AGENT_READY_TIMEOUT). On a saturated CI daemon the keploy agent
container can take well over a minute just to start, so the 120s gate
here fired first and failed an otherwise-healthy sandbox/runner bring-up
with "keploy-agent did not become ready within 120s" — even though the
other gates would have waited it out. Align this gate to the same budget.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(record): don't close appErrChan while the app-runner can still send

On shutdown the app-runner goroutine (runAppErrGrp, both the compose and
non-compose branches) sends the app's exit error on appErrChan. When the
app receives SIGTERM it exits with "signal: terminated" — not
ErrCtxCanceled — so the goroutine reaches `appErrChan <- runAppError`.
Start's `defer close(appErrChan)` raced that send and crashed the whole
run with "panic: send on closed channel" (observed in pulsar-basetopic
teardown, record.go:518).

appErrChan is owned by Start but written by a goroutine that can outlive
Start's return, so Start must not close it. The sole consumer is a single
receive in the select below (no close dependency) and the size-1 buffer
absorbs the lone send (the two sender branches are mutually exclusive), so
leaving the channel open to be GC'd is correct and race-free.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(replay): gate docker-run replay on host port readiness, not just --delay

waitForAppReady only honored --health-url (poll for 2xx) or a fixed --delay
sleep. For docker apps published with -p, a fixed --delay that is too short
under CI contention fires replay at a not-yet-listening app, so every request
fails with "status_code got=0" — the dominant remaining go-docker-timefreeze /
go-dedup-docker readiness flake (distinct from the container-name race already
fixed).

When no --health-url is set, after the --delay floor, derive the published host
port from the docker command (dockerPublishedHostPort) and additionally wait,
bounded by HealthPollTimeout (default 60s), for it to accept a TCP connection
via pkg.WaitForPort. Properties:
- only ever waits LONGER than --delay, never shorter; instant for a ready port;
- falls back to the pure --delay behavior for native apps / container-only or
  unparseable publishes (so nothing regresses);
- proceeds anyway (returns true) if the ceiling elapses — never blocks forever,
  never weakens an assertion; ctx cancel still returns false promptly.

Covers docker-run -p apps; compose apps (ports in the compose file, not the
-c command) are a follow-up that needs the resolved app ports plumbed in.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(app): free the compose agent name before a compose up

The atg-with-mocks regression flaked with "Container keploy-v3-XXXX Recreate
-> Error response from daemon: No such container: <id>" aborting the run with
"no linked test suites found". The 5135 log shows the cause precisely: a fresh
`docker compose up` starts while the PREVIOUS compose is still "Stopping
Gracefully" (removing the agent container). keploy injects the agent as a
compose service AND force-removes it by name on teardown, so the new up reads
the stale project state and tries to Recreate the being-removed agent against
its old container id -> "No such container".

It is the same teardown->up race the docker-run pre-run guard already closes,
just on the compose agent. Before a `docker compose up`, poll the agent name
(a.keployContainer) free (force-remove + wait, preRunRemoveBudget) so teardown
and the next up are serialized and the up always creates a fresh agent. No-op
on the first up; mirrors ensureContainerNameFreeWithin on the docker-run path.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(app): scale pre-run container force-remove budget with contention

ensureContainerNameFreeWithin exists to wait out a still-running prior
container under docker-daemon contention, but it passed each `docker rm -f`
the tight 2s teardown-drain forceRemoveBudget — contradicting its own comment
that the startup pre-run cleanup waits out contention. Under heavy contention a
single rm -f can run past 2s; SIGKILLing it there (CommandContext) tears the
docker CLI down mid-request so the removal may not complete, the name never
frees, and the whole budget burns on repeated stillborn 2s attempts. The
result is "container name still in use after the pre-run remove budget" → the
next `docker run --name` hits a Conflict (exit 125) → the app never starts →
every test in the set replays got=0 (observed on go-docker-timefreeze).

Size each force-remove deadline to ~1/3 of the overall budget (floored at
forceRemoveBudget) so the rm can finish under contention while still leaving
room for retries against the async --rm reaper.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(replay): re-send a test request refused before the app accepts it

A replay test request that fails with a pre-response "connection refused"
(errors.Is ECONNREFUSED) proves the app never received it — zero bytes sent,
zero single-use mocks consumed. Under CI contention an app container can briefly
refuse while still coming up, and keploy recorded that transport failure as a
test failure (synthetic status_code=0), a FALSE regression. Re-send it (bounded
to 3, ctx-aware backoff, body rewound via GetBody) so the suite asserts the REAL
response instead. A mid-response reset (ECONNRESET)/EOF/broken pipe is NOT
retried — those can occur after the app consumed mocks, where a retry would
re-run non-idempotent logic against exhausted mocks and fabricate a verdict; a
genuinely unreachable app still fails fast after the bounded retries. This is
transport robustness, not retrying an assertion (the comparison still runs once
on the real response). Covers SimulateHTTP + SimulateHTTPStreaming.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(replay): attribute connection-level test failures as APP_CONNECTION_ERROR

When the app produces NO response, CreateFailedTestResult records a synthetic
status_code=0 which the matcher reports as STATUS_CODE_CHANGED — making a
transport/availability failure (connection refused/reset/EOF/host unreachable)
masquerade as a content regression. Classify such errors and add a distinct
FailureInfo.Category APP_CONNECTION_ERROR (read by k8s-proxy via
TestResult.FailureInfo) so operators triage it as infra, not a regression. The
raw StatusCode stays 0 — report data is unchanged, only the attribution is
corrected.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* test(replay): function-level SimulateHTTP connection-refused integration tests

Guard the retry where it is WIRED into the real replay send path (not just the
helper in isolation): SimulateHTTP against a sleep-before-bind listener must
recover a brief refuse via the bounded retry, and a refuse longer than the
~900ms budget must fail (bounded, no fabrication). No CI credential needed, so
it runs in keploy's own test suite as a permanent regression guard.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(replay): safely re-send a test request on docker-proxy connection reset

Under CI load docker's userland proxy (docker-proxy) resets freshly-accepted
host-port connections during setup, so keploy replay (which dials the published
host port) sees "read: connection reset by peer" and synthesizes status_code=0
even for a request that consumed no mock. This is a general docker-mode replay
flake; the time-freeze lanes amplify it by shifting request timing into
docker-proxy's contended window (reproduced 458/8000, and with zero freeze churn).

Re-send the request, but ONLY when provably idempotent: a reset is ambiguous
about whether the app already consumed a single-use mock. GetConsumedMocks is
per-call and DRAINING (mockmanager.go), so a non-empty result means THIS request
consumed a mock -> refuse the re-send and let the original error stand (never
re-run against an exhausted mock; no fabricated verdict). Bounded to 2, ctx-aware,
with a host-port readiness re-poll. On the refusal path the gate's drained mocks
are folded back into totalConsumedMocks so the next mock-filter still drops those
exhausted mocks (the drain only affects keploy-side reporting, not the agent's
serving state). Adds IsTransportConnReset and tests driving a real MockManager to
pin the per-call draining semantics and the safety gate.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(app): strictly gate docker-run name-conflict retry on real conflicts

The user-app docker-run retry (added for the multi-test-set recreate "container
name already in use" / exit 125 flake) blanket-retried on any "exit status 125",
so a genuine 125 (bad image/mount/flag) was retried up to 3x — burning the
pre-run remove budget and burying the real error. Docker's "Conflict. The
container name ... is already in use" text is never captured (the non-PTY
docker-run path streams stderr straight to os.Stderr; the returned error is only
"exit status 125"). Confirm the conflict POSITIVELY from docker's own state:
retry only when exit-125 AND the --name is still occupied (containerNameFree).
A genuine 125 with a free name now fails fast. Adds isExit125 /
isDockerRunNameConflict + table tests and a retry-path driver test.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(app): remove stale keploy-agent compose container before re-up

The keploy agent is injected as a fixed compose service `keploy-agent` but with a
per-process-random container_name (keploy-v3-<hash>); docker-compose tracks a
managed container by (project, service) labels, not container_name. In a sandbox
that runs record then auto-replay in the SAME process/project (atg-with-mocks),
each phase generates a fresh keploy-v3-<hash>; record-stop force-removes the agent
out-of-band (async), then replay's `docker compose up` sees the prior keploy-agent
service container with a drifted container_name and plans a Recreate — whose
remove-step races the concurrent out-of-band removal and loses ("No such
container: <id>" / "removal already in progress") -> `compose up` exits 1 ->
replay fails. Flaky: passes when the reap finishes within the fixed pre-replay
sleep, fails under load. The prior name-based guard was a no-op here (it checked
the NEW phase's agent name, not the prior phase's).

Fix: before the compose up, resolve the prior agent by the compose SERVICE
(`docker compose ... ps -aq keploy-agent`, scoping the project exactly as the
upcoming up) and force-remove + reap it, so the up always CREATES the agent (no
Recreate, no stale-id race). No-op on first up, non-compose paths, and
--keep-app-alive single-up reuse. Adds parseComposePSIDs + unit tests.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(proxy): don't drop a buffered/staged mock chunk on connection teardown

A once-per-boot startup mock (an outbound call at app boot, before any inbound
request) was intermittently dropped from a recorded test set. Root cause is the
connection relay teardown losing the mock's bytes — not the syncMock dedup (the
prior 4a51765 covered a different facet):

- relay/tee.go drain(): on close, `select { case out<-c: case <-shutdown: drop }`
  could take the shutdown branch and DROP a staged chunk even when `out` had
  space. Deliver non-blocking FIRST; only fall back to the blocking
  shutdown-escape send if `out` is genuinely full — so a staged chunk is never
  dropped while the consumer can still take it (deadlock-safety preserved).
- fakeconn.go Read/ReadChunk: on f.closed returned ErrClosed WITHOUT draining an
  already-buffered chunk. Drain it (drainBufferedLocked) before reporting
  ErrClosed, so a chunk that arrived before close is still delivered.

Both are teardown-tail behavior; steady-state and dedup unchanged. A response that
raced an abort/cancel teardown is now emitted as a complete mock rather than
dropped. Adds TestTee_StagedChunkSurvivesClose + TestRead{,Chunk}AfterCloseDrainsBuffered
(red before / green after).
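
The tee drain fix corresponds to a two-step send. The sketch below uses a hypothetical `deliver` function to show why trying a non-blocking send first matters: Go's `select` picks randomly among ready cases, so a single `select` can take the shutdown branch even when `out` has room.

```go
package main

import "fmt"

// deliver tries a non-blocking send first. Only when out is genuinely full
// does it fall back to a blocking send that shutdown can interrupt.
// It returns false if the chunk was dropped.
func deliver(out chan<- []byte, shutdown <-chan struct{}, chunk []byte) bool {
	select {
	case out <- chunk:
		return true
	default:
	}
	select {
	case out <- chunk:
		return true
	case <-shutdown:
		return false
	}
}

func main() {
	out := make(chan []byte, 1)
	shutdown := make(chan struct{})
	close(shutdown) // teardown has already begun

	// out has space, so the staged chunk is delivered despite the shutdown.
	fmt.Println(deliver(out, shutdown, []byte("staged")))
	// out is now full and nobody is reading, so the escape hatch applies.
	fmt.Println(deliver(out, shutdown, []byte("extra")))
}
```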

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

Jun 25, 2026
8ccf57b

v3.5.78

feat(shutdown): drain in-flight streams in-app on SIGTERM (#4306)

* feat(shutdown): drain in-flight streams in-app on SIGTERM

Add an in-process graceful-shutdown drain to the root signal handler
(utils.NewCtx). When KEPLOY_SIDECAR_DRAIN_SECONDS is set to a positive
integer, the handler keeps the proxy and its parser goroutines live for
that window after receiving SIGTERM -- before cancelling the root context --
so in-flight MITM'd streams finish cleanly, then tears down. A second
signal cuts the wait short.

This is the in-app replacement for the native Kubernetes `sleep` lifecycle
(SleepAction) preStop hook the k8s-proxy webhook injected on the
keploy-agent sidecar. SleepAction is GA only in k8s 1.30 and is rejected
outright by some older apiservers, which fails pod admission for every
instrumented workload. Doing the wait in-process is portable to any k8s
version. The env var is set by the k8s-proxy webhook in its new
app-managed-drain mode; unset (the default, and every non-sidecar
invocation) preserves the historical cancel-immediately-on-signal behaviour.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Aditya Sharma <aditya282003@gmail.com>

* fix(shutdown): respect the sidecar drain window across repeat signals

Previously a second SIGTERM/SIGINT during the KEPLOY_SIDECAR_DRAIN_SECONDS
drain cut it short. The kubelet, however, sends exactly one SIGTERM and never
a second one -- it escalates to SIGKILL at terminationGracePeriodSeconds,
which is uncatchable and is the real hard stop. So aborting on a "second
signal" only protected against a non-existent kubelet behaviour while making
a repeat `kubectl delete` (the usual source of an extra SIGTERM) silently
truncate an in-flight drain.

Now additional SIGTERM/SIGINT signals received during the window are logged
but do NOT abort it; the drain always runs to completion (bounded by the
pod's terminationGracePeriodSeconds via SIGKILL). An operator who wants an
immediate kill uses `kubectl delete --grace-period=0 --force`.

Verified end-to-end in a kind cluster (k8s 1.35) with the agent image built
from this change: on `kubectl delete pod`, the keploy-agent sidecar drained
for exactly 15s before shutdown; three extra SIGTERMs delivered at +3s/+7s/
+11s were each logged "ignoring to honour the 15s drain window" and the drain
still completed the full 15s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Aditya Sharma <aditya282003@gmail.com>

---------

Signed-off-by: Aditya Sharma <aditya282003@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Jun 23, 2026
91be11f

v3.5.77

fix(app): verify container name is free before docker-run re-run (#4304)

In docker-mode the per-test-set app lifecycle force-removes the prior
container by name, then runs `docker run --name <name>` for the next
test set. The pre-run cleanup was fire-and-forget: `docker rm -f` returns
as soon as it has *initiated* removal, but under docker-daemon contention
the `--rm` async reaper (from the prior test set's container exiting) can
keep holding the name for a brief window after that — long enough for the
immediately following `docker run --name` to hit "Conflict. The container
name ... is already in use" (docker exit 125), which fails the test set
(observed intermittently on go-docker-timefreeze's test-set-3 under CI
docker contention).

Replace the pre-run force-remove with ensureContainerNameFreeWithin,
which removes then polls `docker ps` until the name is actually free
(retrying the remove within the existing preRunRemoveBudget) before
returning. Closes the reaper race deterministically instead of racing
into the conflict. Teardown-path force-removes are unchanged.

Reproduced locally under CPU saturation: the bare rm-f-then-run flow
conflicts in 50 of 50 runs, the verify-name-free flow in 0 of 50;
go-docker-timefreeze records+replays all four test-sets with zero
conflicts on the fixed binary.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

Jun 23, 2026
dcc862c

v3.5.76

feat(agent): make capture-session state reentrant for multi-app (#4283)

* feat(agent): make capture-session state reentrant for multi-app

Make the userspace capture pipeline able to run multiple independent
recording sessions in one process, so an enterprise multi-app agent can
serve several apps concurrently without cross-contamination. This is a
generic reentrancy change only — no app/tenant/session concept enters
OSS, and single-session behaviour is preserved byte-for-byte through
explicit fallbacks.

- syncMock: add New() and NewDedupQueue() so a caller can own an
  independent manager + dedup queue; Get()/GetDedupQueue() remain the
  process-global default. Add a context carrier
  (NewContext/FromContext/FromContextOrGlobal) and resolve the manager
  at the HTTP/generic/MySQL emit sites from the context, falling back to
  the global when none is set.
- syncMock: add a per-instance test-id counter (NextTestID) and use it
  from conn.Capture; deprecate the package-global GlobalTestCounter.
- proxy: add SetSessionResolver/GetSessionFor(tgid) so an external
  composer can route a connection to its owning session; a nil resolver
  (the OSS default) returns the single session.
- proxy: delete the SrcPortToDstURL entry on connection close (after the
  parser errgroup joins) to bound the map and stop a recycled source
  port from reading a previous connection's stale TLS destination.
- mockdb: add a per-instance MockFormat so spec.mockFormat is honoured
  per session instead of via the process-global default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

* feat(agent): expose session resolver on the Proxy interface

Add GetSessionFor and SetSessionResolver to the agent.Proxy interface so
external composers can install per-TGID session routing through the
interface, not just the concrete type. The single-session default is
unchanged (nil resolver returns the one session).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

* feat(syncMock): per-session dedup queue + static-deduper context seam

Extend the reentrancy seam so a per-session manager also carries its own
dedup queue and an optional static-deduplication hook, keeping multi-app
callers' dedup state fully isolated. Single-session behaviour is unchanged:
the package-global manager falls back to the global dedup queue, and the
static-deduper context value is simply unset.

- SyncMockManager now owns a dedupQueue (New() allocates one); DedupQueue()
  returns it, or the package-global globalDedupQueue for the default
  instance, so callers can switch from GetDedupQueue() to mgr.DedupQueue()
  for per-session dedup ordering.
- Add the StaticDeduper interface plus WithStaticDeduper /
  StaticDeduperFromContext, a context carrier defined here so a per-app
  static deduper can ride the parser context without the proxy and the
  consuming hook importing each other.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

* test(syncMock): cover per-manager dedup queue + static-deduper context

Add unit tests for the per-session seam additions: New() managers own an
independent dedup queue while the global instance falls back to the global
queue, and the StaticDeduper context carrier round-trips (nil when unset).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

* feat(supervisor): per-session manager for the V2 EmitMock path

EmitMock (the V2 mock-emit path used by all supervisor-based parsers) routed
unconditionally through the package-global syncMock.Get(), so a multi-app
caller could not isolate V2-parser mocks per app the way the legacy
mgr.AddMock parsers now can.

- supervisor.Session gains an optional Mgr field; EmitMock prefers it and
  falls back to the package-global when nil (single-session default — no
  behaviour change).
- recordViaSupervisor sets Mgr from the parser context
  (syncMock.FromContext(ctx)), so when a multi-app caller carries a per-app
  manager on ctx, every V2 parser emits into that app's manager via the one
  EmitMock chokepoint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

* feat(memoryguard): fan memory-pressure out to registered managers

The pause decision stays global (pod-level cgroup memory), but the buffered
mocks that consume that memory can live in many sync-mock managers: the
multi-app agent runs one manager per app and the package-global Get() manager
is then unused, so SetMemoryPressure on it relieves nothing.

Add RegisterPressureHook so a composer can register a fan-out over its live
managers; applyPausedState invokes the global manager and every registered
hook. Single-app behaviour is unchanged (no hook registered → only the global
manager, as before).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

* feat(conn): add transient TestCase.SourcePod carrier for per-pod attribution

Reentrancy seam for the enterprise DaemonSet agent's per-pod test-case
attribution. TestCase gains a SourcePod field tagged json/yaml/bson "-", so it
is never persisted to stored test-case files or serialized into the upload
body — it is purely an in-memory routing tag.

Capture stamps it from the context (WithSourcePod / SourcePodFromContext) for
both the HTTP and gRPC capture paths. OSS single-app callers never set the
context value, so SourcePod stays empty and behaviour is unchanged. The
enterprise reader sets it per connection from the owning pod so the uploader
can carry a per-pod source to the control plane.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

* fix(daemonset): skip sys_enter_socket PID auto-registration in DS mode

The sys_enter_socket tracepoint auto-registers any process calling socket()
that passes the in-eBPF namespace check into the shared target_namespace_pids
map. On a hostPID+hostNetwork DaemonSet agent that check matches host
processes (init/PID 1, node daemons, short-lived forks), polluting the map so
the proxyless capture records non-target host traffic — the test-set-0 leak.

In DaemonSet mode the CRD-scoped SessionReconciler is the sole owner of
target_namespace_pids and arms exactly the recorded pods' TGIDs, so the
tracepoint's auto-detection is redundant. Gate it off when
KEPLOY_DAEMONSET_ENABLED=true.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

* style(syncMock): gofmt context_test.go to fix lint gate

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(proxy): guard SrcPortToDstURL delete-on-close against source-port recycle

handleConnection registers `defer SrcPortToDstURL.Delete(sourcePort)`, but that
defer runs (LIFO) AFTER the defer that calls srcConn.Close(), which releases the
client source port at the OS level. The kernel can then recycle that port to a
new connection that Stores its own dst mapping before the older connection's
Delete fires — so the older connection clobbers the new one's fresh entry, and
the new connection's no-SNI ClientHello fails to recover its destination.

Track per-port ownership (port -> connection token) and delete only via
CompareAndDelete on the token: whichever connection currently owns the port is
the only one allowed to delete its mapping. Correct under any interleaving and
touches no readers or the stored value type. Adds a regression test that fails
under the old unconditional Delete.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(memoryguard): give RegisterPressureHook a deregistration path

pressureHooks was an append-only slice with no way to remove a hook, so a
multi-app composer that registers one hook per app/session would leak: every
closed-over SyncMockManager (and its buffers) stays pinned for the life of the
process and is re-invoked on every pressure transition. The 'register once'
contract was convention-only with no enforcement mechanism.

RegisterPressureHook now returns an idempotent unregister func (hooks keyed by
token in a map). Callers that ignore the return value still compile, so this is
backward compatible. Adds a test covering register/fire/unregister/no-leak and
the nil-hook no-op.
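
A token-keyed registry with an idempotent unregister func could look like the following sketch. The hook signature (`func(paused bool)`), the package-level variables, and the `fire` helper are assumptions; only the register/unregister shape is taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu     sync.Mutex
	hooks  = map[uint64]func(paused bool){} // keyed by token
	nextID uint64
)

// RegisterPressureHook stores h and returns a func that removes it.
// Calling the returned func more than once is harmless.
func RegisterPressureHook(h func(paused bool)) (unregister func()) {
	if h == nil {
		return func() {} // nil hook: nothing registered, nothing to undo
	}
	mu.Lock()
	nextID++
	id := nextID
	hooks[id] = h
	mu.Unlock()
	return func() {
		mu.Lock()
		delete(hooks, id) // deleting a missing key is a no-op
		mu.Unlock()
	}
}

// fire (hypothetical) invokes every registered hook outside the lock.
func fire(paused bool) {
	mu.Lock()
	snapshot := make([]func(bool), 0, len(hooks))
	for _, h := range hooks {
		snapshot = append(snapshot, h)
	}
	mu.Unlock()
	for _, h := range snapshot {
		h(paused)
	}
}

func main() {
	calls := 0
	unregister := RegisterPressureHook(func(bool) { calls++ })
	fire(true)
	unregister()
	unregister() // idempotent
	fire(false)
	fmt.Println("hook calls:", calls, "registered:", len(hooks))
}
```

Because existing callers can simply discard the returned func, adding the return value does not break them.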

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(syncMock): warn on unwired outChan; document per-app dedup seam as consumer-driven

Two seam-honesty fixes on the multi-app reentrancy surface:

- New()'s doc claimed it gives the manager 'its own output channel', but it
  wires none — a New() manager whose owner forgets SetOutputChannel buffers
  every mock and silently emits nothing. Correct the doc and emit a one-time
  warning the first time a mock is buffered while outChan was never wired.

- The per-app dedup carrier (mgr.DedupQueue()) and the WithStaticDeduper context
  seam have zero OSS consumers; m.dedupQueue is read only by its own getter. The
  comments overstated this ('per-session dedup for free'). Reword to state the
  isolation is opt-in and only materializes when the multi-app consumer threads
  DedupQueue() into ResolveJob. Add a test pinning the getter contract (private
  per New() instance; global fallback for the package manager and nil receiver)
  so a refactor that drops the per-instance queue fails CI.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>
Signed-off-by: slayerjain <shubhamkjain@outlook.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Shubham Jain <shubhamkjain@outlook.com>

Jun 22, 2026
115a188

v3.5.75

fix(app): make docker-run pre-run container cleanup robust under contention (#4303)

The #4301 pre-run force-remove (free a leftover --name container before a
docker-run re-run) still let "container name already in use" (docker exit 125)
through on multi-run lanes (e.g. dedup, which re-runs the app per test-set),
because:

1. it force-removed only a.container, which is empty when --container-name isn't
   passed on the CLI, so it silently no-op'd; and
2. it reused the 2s teardown forceRemoveBudget — too short to remove a STILL-
   RUNNING prior container under heavy docker-daemon contention, so the remove
   timed out and the name stayed taken.

Resolve the name from --container-name (a.container) OR, when unset, the --name
in the command itself (new utils.ContainerNameFromDockerRun), and force-remove
with a generous preRunRemoveBudget (30s). This is a startup-time cleanup, NOT the
SIGINT teardown drain path, so it can afford to wait for the prior container to
actually go away. Verified against the real collision: a running --name container
blocks the re-run; a `docker rm -f` first lets it succeed.
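
The name-resolution order described above (explicit flag first, then the `--name` inside the command) can be sketched as below. Both helpers are hypothetical and deliberately naive: they split on whitespace and do not handle shell quoting, and they are not the real utils.ContainerNameFromDockerRun.

```go
package main

import (
	"fmt"
	"strings"
)

// nameFromDockerRun (hypothetical) extracts the value of --name from a
// docker run command line, accepting "--name foo" and "--name=foo".
func nameFromDockerRun(cmd string) string {
	fields := strings.Fields(cmd)
	for i, f := range fields {
		if f == "--name" && i+1 < len(fields) {
			return fields[i+1]
		}
		if strings.HasPrefix(f, "--name=") {
			return strings.TrimPrefix(f, "--name=")
		}
	}
	return ""
}

// resolveName prefers the explicit container name and falls back to the
// one embedded in the command.
func resolveName(explicit, cmd string) string {
	if explicit != "" {
		return explicit
	}
	return nameFromDockerRun(cmd)
}

func main() {
	cmd := "docker run --rm --name my-app -p 8080:8080 my-image"
	fmt.Println(resolveName("", cmd))         // my-app
	fmt.Println(resolveName("from-cli", cmd)) // from-cli
}
```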

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

Jun 22, 2026
cf0852f

v3.5.74

fix(app): force-remove the --name container before a docker-run re-run (#4301)

EnsureRmBeforeName adds --rm so the container auto-removes on exit, but under
heavy docker-daemon contention that reap can lag past the NEXT `docker run --name`
on a re-run (e.g. a dedup/timefreeze lane's record -> replay), which then fails
with "Conflict. The container name ... is already in use" (docker exit 125) even
though the previous run finished. #4297 added this guard for compose
(ComposeDown's force-remove-by-name); docker-run mode lacked it. Force-remove the
named container first (bounded + best-effort via forceRemoveContainerByName — a
no-op when nothing is lingering) so a re-run never collides with a still-reaping
container.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

Jun 22, 2026
6e74c3d

v3.5.73

fix(agent): align agent-ready wait with the agent's healthcheck budget (#4299)

Under heavy CI docker-daemon contention the in-docker keploy-agent container can
take ~2 minutes just to START — observed in CI as a `docker run` of an
ALREADY-LOCAL image (no pull) taking 126s before the agent process ran. keploy's
agent-ready waits were 60s (native/docker-run setup) and 120s (record, replay,
compose), all SHORTER than the agent container's own healthcheck budget
(start_period 10s + interval 5s x retries 60 ~= 310s). So keploy gave up while
the agent's own healthcheck still considered it starting and tore the bring-up
down, surfacing as a spurious "keploy-agent did not become ready in time" that
failed otherwise-green lanes (go-dedup-docker, umami, python-postgres-tls-containerized).

Add pkg.AgentReadyTimeout() — default 330s (>= the agent healthcheck budget),
overridable via KEPLOY_AGENT_READY_TIMEOUT (whole seconds) — and use it at all
four readiness sites. This is not papering over a hang: the agent provably comes
up (it logged its HTTP server ~126s in), it was just slower than the old wait, so
aligning the CLI wait to the agent's declared startup contract fixes the
inconsistency. Normal runs are unaffected — the health ticker returns the instant
the agent is healthy.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

Jun 22, 2026
b2bfcc8
