Implement download command for training job #8029
jongio
left a comment
Solid start on the download command. A few things to address before this is merge-ready:
Path safety - downloadOne joins server-provided paths directly into the destination directory without validating for traversal (../). Defense-in-depth: sanitize before writing.
Retry policy - IsRetryable returns true for every non-nil error (line 41), making the url.Error check on line 38 dead code. The status-code checks below (lines 43-48) are unreachable when err != nil. Combined with callers that always pass status=0, permanent failures get retried pointlessly.
Test coverage - 915 new lines with no tests. At minimum: retry classification, path sanitization, mode selection logic (default vs named vs all), and the tracking-endpoint extraction deserve unit tests.
Partial file cleanup - failed downloads can leave empty/partial files on disk. Consider writing to a temp file and renaming on success, or cleaning up on error.
Duplicate formatBytes - azcopy/runner.go already has this function. Consider extracting to a shared util or reusing.
wbreza
left a comment
Code Review — PR #8029: Implement download command for training job
TL;DR: Adds azd custom-training job download with three modes (default artifacts, single named output, all outputs). Includes retry logic with exponential backoff and parallel downloading across 7 files (+915/-1). Supplements @jongio's existing review with additional findings.
Note: @jongio already flagged path traversal, broken retry classification, zero test coverage, partial file cleanup, and duplicate formatBytes. The findings below are net-new only.
🔴 Must Fix
1. HTTP response body never closed on error paths — pkg/client/download.go
All 5 API methods (GetModelVersion, GetModelCredentials, GetDatasetCredentials, ListRunArtifacts, GetRunArtifactContentInfo) call c.HandleError(resp) when status ≠ 200 but never close resp.Body. This leaks TCP connections under sustained error conditions.
Fix: Add defer resp.Body.Close() immediately after the Do() call, before the status check.
2. Semaphore leak on nil artifact entries — internal/download/download.go ~line 71
When info == nil, the code acquires a semaphore slot (sem <- struct{}{}) then continues the loop without releasing it. After enough nil entries, the semaphore fills and the entire download hangs.
Fix: Move sem <- struct{}{} after the nil check, or release the slot in the nil path.
3. Unvalidated tracking endpoint URL (SSRF risk) — internal/cmd/job_download.go ~line 328, pkg/client/download.go ~line 62
extractTrackingEndpoint extracts a URL from untrusted server JSON and uses it directly to construct API requests. A malicious or compromised job response could redirect artifact downloads to an attacker-controlled server.
Fix: Validate the URL scheme is https:// and the host matches expected Azure domains (e.g., *.api.azureml.ms).
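One way the check could look, as a sketch. The function name validateTrackingEndpoint is hypothetical, and the single allowed suffix is taken from the example domain above; the real allow-list may need more entries. Matching on Hostname() rather than the raw string avoids being fooled by userinfo or path tricks.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// allowedHostSuffix is an assumption based on the review's example domain.
const allowedHostSuffix = ".api.azureml.ms"

// validateTrackingEndpoint rejects anything that is not https on an
// allowed host before the URL is used to build API requests.
func validateTrackingEndpoint(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("parse tracking endpoint: %w", err)
	}
	if u.Scheme != "https" {
		return nil, fmt.Errorf("tracking endpoint must use https, got %q", u.Scheme)
	}
	host := strings.ToLower(u.Hostname())
	if !strings.HasSuffix(host, allowedHostSuffix) {
		return nil, fmt.Errorf("tracking endpoint host %q is not allowed", host)
	}
	return u, nil
}
```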
🟡 Should Fix
4. Nil dereference when API returns (nil, nil) — internal/cmd/job_download.go ~lines 168, 190, 209, 227
Pattern: var modelVer *models.ModelVersion → retry closure sets it on success → code dereferences without nil guard. If the API returns (nil, nil), the next access panics. Appears in 4 places (modelVer, creds ×2, history).
Fix: Add if modelVer == nil { return fmt.Errorf("...") } after each retry block.
5. Missing nil guard on pagination response — internal/cmd/job_download.go ~line 223
If ListRunArtifacts returns (nil, nil), accessing page.Value panics.
Fix: Guard with if page == nil { break } before appending.
6. Credential leakage in error messages — internal/download/download.go ~line 106
Failed download responses (up to 1024 bytes) are included verbatim in error messages. If the server echoes SAS tokens or credentials in error responses, these propagate to user-visible output.
Fix: Limit error details to status code; don't include raw response body.
7. Unbounded download size — internal/download/download.go ~line 103
io.Copy(f, resp.Body) streams directly to disk without Content-Length validation. A malicious server could exhaust disk space.
Fix: Validate Content-Length header or wrap resp.Body with io.LimitedReader.
8. Race condition on duplicate artifact paths — internal/download/download.go ~line 78
Multiple goroutines can race to os.Create the same file path if the server returns duplicate artifact paths.
Fix: Deduplicate artifact paths before the download loop, or use temp files with atomic rename.
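A sketch of the deduplication option, assuming artifact paths are slash-separated strings; dedupePaths is a hypothetical name. Paths are compared after path.Clean so that "a/b.txt" and "a/./b.txt" count as the same destination, and the first occurrence wins.

```go
package main

import "path"

// dedupePaths returns paths with duplicates removed, preserving the
// order of first occurrence. Comparison is on the cleaned form.
func dedupePaths(paths []string) []string {
	seen := make(map[string]struct{}, len(paths))
	out := make([]string, 0, len(paths))
	for _, p := range paths {
		key := path.Clean(p)
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, p)
	}
	return out
}
```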
9. Client initialization boilerplate duplicated — internal/cmd/job_download.go ~line 134
newDownloadClient() duplicates the exact credential/client setup from job_delete.go and job_cancel.go.
Fix: Extract a shared createAuthenticatedClient(ctx) helper.
🟢 Nitpick
10. Parallelism default (8) defined in two places — job_download.go line 305 and download.go line 56. Define once as a shared constant.
11. User-unfriendly error formatting — job_download.go ~line 32. Terminal states printed with %v render as [Completed Failed ...] instead of a readable list.
12. Debug logging may expose full URLs — pkg/client/download.go lines 53, 77. When debugBody is on, full request URLs including tokens are printed.
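For item 12, a minimal sketch of redacting the query string before a URL is logged. The helper name redactURL and the REDACTED placeholder are assumptions; it covers only tokens carried in the query string, which is where SAS signatures normally live.

```go
package main

import "net/url"

// redactURL returns raw with its query string replaced by a fixed
// placeholder, so debug logs keep the host and path but no tokens.
func redactURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		// Do not echo input we could not parse; it may still hold a token.
		return "<unparseable url>"
	}
	if u.RawQuery != "" {
		u.RawQuery = "REDACTED"
	}
	return u.String()
}
```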
Overall: solid foundation — needs one more pass
The download command is well-structured with good use of parallel processing and retry logic. The must-fix items (resource leaks, semaphore deadlock, SSRF) are crash/security risks that should be resolved before merge, along with @jongio's existing feedback.
jongio
left a comment
Addresses all my previous feedback. The retry logic is properly scoped now (context cancellation, transport errors only), path traversal is blocked via safeJoin with filepath.Abs comparison, partial downloads use tmp+rename, formatBytes is shared, and there's good test coverage on the critical paths. Clean work.
Merged commit 8de9139 into Azure:foundry-training-dev
* Custom training (#7125) * adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * feat: add job name auto-generation, fix endpoint URL, rename job get to show - Make job name optional in YAML; auto-generate {adj}_{noun}_{suffix} (matching AML SDK) - Fix buildProjectEndpoint to use services.ai.azure.com (not cognitiveservices.azure.com) - Rename 'job get' to 'job show' to match models/finetune extensions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Custom training (#7180) * adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * integrating with API * feat: enhance job list with pagination, filters, and systemData support (#7203) - Add --skip-token flag for pagination with next-page UX message - Add --tag and --properties flags for server-side filtering - Add --include-archived flag for listViewType control - Add SystemData (createdBy, createdAt) to job list output - Update doDataPlane() to support variadic query params Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: add CODEOWNERS for azure.ai.customtraining extension (#7204) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: rename job create to submit and add resolver layer for compute, code, and input resolution (#7205) - Rename job create command to job submit for consistency with finetune extension - Add resolver interfaces: ComputeResolver, CodeResolver, InputResolver - Add JobResolver orchestrator that resolves all references in JobDefinition - Wire resolver into submit flow before buildJobResource() - Stub implementations guide users to provide full ARM IDs / remote URIs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add artifact resolution for 
code and inputs in job create (#7153) * Implement cancel job command for custom training (#7272) * Impelement delete job command for custom training (#7273) * Custom training clean (#7454) * Custom training (#7180) * adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * integrating with API * adding -e -s override * fixing asset resolution * custom training: enhance job show, fix asset resolution, add full resource config support - Enhanced job show with rich output: run history, metrics, artifacts, timing, compute info - Added client APIs for run history, metrics, and artifacts endpoints - Fixed dataset version field: json:dataType -> json:type - Fixed input/output mode mapping: ro_mount -> ReadOnlyMount, rw_mount -> ReadWriteMount - Added full resource config support: instanceType, shmSize, dockerArgs, properties - Added ResourceDefinition YAML struct with AISuperComputer properties pass-through - Backward compatible: flat instance_count still works Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * custom training: add spinner progress to job show command Show animated spinner with progress text while fetching job details. Updates text as each parallel fetch (run history, metrics, artifacts) completes, showing remaining items until all data is loaded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Implement download command for custom training (#7453) * Add validate command for custom training (#7407) * Rename EnvironmentID to EnvironmentImageReference (#7891) * Revert "Implement download command for custom training (#7453)" (#7892) This reverts commit 5216202. 
* Add userAssignedIdentityId support for command jobs (#7927) * Implement stream command for training job (#7939) * Add support for experiment name (#7961) * Implement connect-ssh to job node (#7964) * Add support for gpuCount for partial SKU scenario (#8067) * Share offline validation between job validate and job submit commands (#8068) * fix(azure.ai.customtraining): honor --subscription and --project-endpoint flags over stored env values (#8093) * Implement download command for training job (#8029) * Rename azure.ai.customtraining to azure.ai.training (#8106) * Implement show services command (#8121) * Add validations for UAMI requirement (#8122) * Add template flag in init command (#8123) * Bump armcognitiveservices SDK from v1.8.0 to v2.0.0 to fix NetworkInjections unmarshal error * Bump go directive to 1.26.1 to align with repo standard and other extensions * Use Subscription.UserTenantId for credential tenant to support guest/multi-tenant users * Pin azd module to semver v1.24.3 instead of pseudo-version for stable dependency * Add UTs * Consolidate doUpload and doUploadWithTag into single method with optional tags parameter * Use PromptSubscriptionResource for interactive Foundry project selection in init * Surface azcopy scanner errors so truncated stdout doesn't mask upload failures * Add schema header and requiredAzdVersion >=1.25.1 to extension.yaml * Cap error and service-instance response body reads to prevent unbounded memory use * Add retry policy (429/502/503/504 + net errors) to Foundry data plane client * Use azd Confirm prompt and honor --no-prompt in job delete * Escape user-supplied IDs in client URL paths * Fix JSON tag mismatch for DataType * Route client debug prints to stderr to keep stdout JSON parseable * Adopt azdext.NewExtensionRootCommand and remove reserved-flag conflicts * chore(ext/azure.ai.training): add CHANGELOG, README, cspell, golangci config and CI lint + release pipeline * chore(ext/azure.ai.training): extend cspell 
dictionary to fix CI lint * chore(ext/azure.ai.training): apply go fix modernization (interface{}→any, CutPrefix, drop loopvar capture) * chore(ext/azure.ai.training): fix golangci-lint issues * fix(ext/azure.ai.training): surface azcopy failure diagnostics * refactor(ext/azure.ai.training): consolidate ServiceEndpoint helper in internal/utils * test(ext/azure.ai.training): add hash + upload_service unit tests * refactor(ext/azure.ai.training): rename job_get.go to job_show.go * chore(ext/azure.ai.training): rename Design/ to design/ and link from README * chore(ext/azure.ai.training): address PR feedback (ssh ProxyCommand % escape, design/ rename) * Fix cspell error * fix: apply go fix modernizations * Move to APIM APIs and update API paths as per latest Typespec * Temp: Print API request response for testing * Fix API paths for metrics * fix(ai.training): use delete operation result url to surface accurate job delete outcome * fix(ai.training): fan out artifact contentinfo per unique root folder for job download * Revert "Temp: Print API request response for testing" This reverts commit 6d7f2d7. * Support attaching remaining service types (jupyter_lab, tensor_board, vs_code, custom) to job * Support distribution type (pytorch, tensorflow, mpi, ray) in job YAML * Add polling for job deletion | Add a --no-wait flag * Add polling for job cancel | Add a --no-wait flag | Refactor to share same poller as delete * fix: surface all per-root errors in artifact contentinfo fan-out * fix: redact query string from --debug URL logs to avoid leaking SAS tokens * fix: harden redactSAS with SAS-marker fallback for URLs without '?' 
* fix: validate ray distribution port and dashboard_port ranges in job validator * fix: make root-folder fan-out semaphore acquire context-aware in job download * fix: write full azcopy diagnostics to side-file when terminal output is truncated * fix: resolve cspell (skoid/sktid) and gosec G104 (LRO body close) lint failures * fix: redact request body for /credentials endpoints under --debug --------- Co-authored-by: Amit Chauhan <70937115+achauhan-scc@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Notes
Testing
Happy paths:
Re-running download to the same path overwrites cleanly.
Unhappy paths:
Interrupting a download with Ctrl+C does not leave corrupted or temporary files at the destination; the operation is atomic.