Feature: provide a mechanism to inform the caller that a quota is exceeded #200

Draft
wants to merge 3 commits into
base: develop
from

Conversation

MattMcL4475
Contributor

@MattMcL4475 MattMcL4475 commented Oct 24, 2023

Currently, when Azure Batch has no quota available, a TES Task in TES on Azure stays in the INITIALIZING state until quota becomes available. This could take minutes, hours, or even days. TES needs a way to tell the caller why the task is stalled, so that the caller can surface this information in the UI and the user or IT admin knows to submit an Azure Support Request to increase their quota. Otherwise, they have no visibility into why the task is not progressing.

Ideally we actually want the caller to parse the string "Pending available quota: low-priority vCPUs" to recognize that there is a quota issue and that the specific quota is low-priority vCPUs, so I'm also open to the idea of adding a specific string property to the TES Task, such as quotaTypeExceeded, and setting it to a value such as "low-priority vCPUs" or "NVSv3 Series".
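To make the two options concrete, here is a minimal Python sketch of how a caller might handle each. The message prefix comes from the example string above, and state_description and quotaTypeExceeded are the field names mentioned in this thread. The function names, the dict-shaped task, and the exact wording TES on Azure would emit are assumptions for illustration, not part of the TES specification or of this PR.

```python
from typing import Optional

# Prefix taken from the example message in this thread; the exact wording
# that TES on Azure would emit is an assumption.
QUOTA_PREFIX = "Pending available quota:"


def parse_quota_type(state_description: Optional[str]) -> Optional[str]:
    """Return the quota named in a free-text description, or None."""
    if not state_description:
        return None
    text = state_description.strip()
    if not text.startswith(QUOTA_PREFIX):
        return None
    quota = text[len(QUOTA_PREFIX):].strip()
    return quota or None


def quota_type_exceeded(task: dict) -> Optional[str]:
    """Prefer the dedicated field (hypothetical), else parse the description."""
    explicit = task.get("quotaTypeExceeded")
    if explicit:
        return explicit
    return parse_quota_type(task.get("state_description"))
```

With a dedicated property the caller never depends on the wording of a human-readable message, which is the main argument for the second option. The string-parsing fallback breaks as soon as the message text changes.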

@MattMcL4475 MattMcL4475 changed the title from "Feature: add TesTask.state_description" to "Feature: provide a mechanism to inform the caller that the quota is exceeded" Oct 24, 2023
@MattMcL4475 MattMcL4475 changed the title from "Feature: provide a mechanism to inform the caller that the quota is exceeded" to "Feature: provide a mechanism to inform the caller that a quota is exceeded" Oct 24, 2023
@MattMcL4475
Contributor Author

@patmagee what are your thoughts on the best way to handle this?

@patmagee

@MattMcL4475 I would be in favour of a fail-fast model. I think waiting for quota for a given period of time is okay, but beyond a "reasonable limit" the TES task should fail.

I propose a new state, INSUFFICIENT_RESOURCES, to be treated as a failure state. If returned, it would tell the end user clearly why something failed. Combine that with a message for the failure, and that should allow diagnosis of most failures.
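A rough Python sketch of the fail-fast idea, under stated assumptions: INSUFFICIENT_RESOURCES is the state proposed here and is not an existing TES state. The one-hour limit, the next_state function, and the message text are hypothetical, since the thread does not define what a "reasonable limit" is.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional, Tuple


class TaskState(Enum):
    INITIALIZING = "INITIALIZING"
    # Proposed in this thread; not an existing TES state.
    INSUFFICIENT_RESOURCES = "INSUFFICIENT_RESOURCES"


# Hypothetical "reasonable limit"; the thread does not specify a value.
QUOTA_WAIT_LIMIT = timedelta(hours=1)


def next_state(
    state: TaskState,
    waiting_since: datetime,
    now: datetime,
    quota_available: bool,
    limit: timedelta = QUOTA_WAIT_LIMIT,
) -> Tuple[TaskState, Optional[str]]:
    """Return (state, failure message) for a task that may be waiting on quota."""
    if state is not TaskState.INITIALIZING or quota_available:
        return state, None
    if now - waiting_since > limit:
        # Fail fast: stop waiting and explain why.
        return (
            TaskState.INSUFFICIENT_RESOURCES,
            f"No quota became available within {limit}",
        )
    # Still within the limit: keep waiting.
    return state, None
```

Returning the message alongside the state mirrors the suggestion above to pair the new failure state with a description of what went wrong.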

@patmagee

I think this state would fit a WES workflow well too, btw.


2 participants