
Azure storage #1323

Open
wants to merge 4 commits into main

Conversation

jeffery-leirness-noaa

Prework

Related GitHub issues and pull requests

Summary

Implements basic utility functions in R/utils_azure.R and associated tests (tests/azure/test-utils_azure.R) to allow targets to interface with Azure storage.

@wlandau
Member

wlandau commented Aug 27, 2024

Thanks for putting this together, @jeffery-leirness-noaa. At the same time you worked on this PR, I implemented customizable content-addressable storage in #1322. Would you check it out and let me know if tar_repository_cas() is enough for you to use Azure storage with targets? If it works for you and Azure storage integration can exist outside the targets package itself, this will reduce the maintenance load on my end.
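For reference, a minimal sketch of what an Azure-backed tar_repository_cas() could look like, assuming the AzureStor package and a blob container addressed through two hypothetical environment variables (AZURE_CONTAINER_URL and AZURE_SAS):

```r
# Sketch only: assumes the AzureStor package and a blob container reachable
# through the hypothetical AZURE_CONTAINER_URL and AZURE_SAS environment
# variables. Each callback rebuilds the container handle from environment
# variables so it stays self-contained.
library(targets)

repository <- tar_repository_cas(
  upload = function(key, path) {
    container <- AzureStor::blob_container(
      Sys.getenv("AZURE_CONTAINER_URL"),
      sas = Sys.getenv("AZURE_SAS")
    )
    AzureStor::storage_upload(container, src = path, dest = key)
  },
  download = function(key, path) {
    container <- AzureStor::blob_container(
      Sys.getenv("AZURE_CONTAINER_URL"),
      sas = Sys.getenv("AZURE_SAS")
    )
    AzureStor::storage_download(container, src = key, dest = path)
  },
  exists = function(key) {
    container <- AzureStor::blob_container(
      Sys.getenv("AZURE_CONTAINER_URL"),
      sas = Sys.getenv("AZURE_SAS")
    )
    AzureStor::blob_exists(container, key)
  }
)

tar_option_set(repository = repository)
```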

@jeffery-leirness-noaa
Author

Hi @wlandau. I've been testing the functionality of tar_repository_cas() and I think it's a wonderfully elegant solution! So far, I've tested {targets} workflows using both Google Drive and Azure Storage as the specified repository for CAS with (mostly) success. Thank you for this brilliant integration!

I am encountering an error where tar_read() fails when using tar_option_set() to specify the custom repository and resources (via tar_resources(repository_cas = tar_resources_repository_cas(envvars = c()))). tar_make() runs without issue, but tar_read() fails unless the tar_option_set() code block is run explicitly before tar_read(). Hopefully that makes sense; if not, please let me know and I can elaborate further or attempt to provide a reproducible example.
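For context, the setup looks roughly like this (a simplified sketch; the repository object, environment variable, and target name are placeholders):

```r
# _targets.R (sketch; azure_repository comes from tar_repository_cas(),
# and AZURE_SAS is a placeholder environment variable name)
library(targets)
tar_option_set(
  repository = azure_repository,
  resources = tar_resources(
    repository_cas = tar_resources_repository_cas(
      envvars = c(AZURE_SAS = Sys.getenv("AZURE_SAS"))
    )
  )
)
list(
  tar_target(data, head(mtcars))
)
```

With this script, tar_make() succeeds, but tar_read(data) in a fresh session errors until the tar_option_set() block is rerun interactively.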

@wlandau
Member

wlandau commented Sep 4, 2024

> I think it's a wonderfully elegant solution!

Thanks! CAS turned out to be great for this because users of tar_repository_cas() don't need to futz with hashes.

I think the trickiest thing for people on your end is to make the exists function in tar_repository_cas() efficient for large pipelines. In https://docs.ropensci.org/targets/reference/tar_repository_cas.html#arguments I suggest maintaining an in-memory cache of objects: in theory, exists can submit a LIST request to list all the (relevant) objects the first time it is called, then submit a HEAD request only on cache misses. The challenge is that there are so many objects to list, and not every cloud API can restrict LIST to just the keys in tar_meta(targets_only = TRUE)$data. So it might be necessary to ask for better APIs, do some sort of filtering based on time stamps, or keep up with garbage collection.
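For example, a memoized exists() along these lines (a sketch, assuming the AzureStor package and the same hypothetical AZURE_CONTAINER_URL / AZURE_SAS environment variables) pays for one LIST up front and falls back to a per-key check only on cache misses:

```r
# Sketch of a cached exists() callback for tar_repository_cas().
# The first call issues one LIST request to populate an in-memory cache;
# later calls hit the API again (a HEAD-style check) only on cache misses.
cache <- new.env(parent = emptyenv())

exists_cached <- function(key) {
  container <- AzureStor::blob_container(
    Sys.getenv("AZURE_CONTAINER_URL"),
    sas = Sys.getenv("AZURE_SAS")
  )
  if (is.null(cache$keys)) {
    cache$keys <- AzureStor::list_blobs(container)$name  # one LIST request
  }
  if (key %in% cache$keys) {
    return(TRUE)
  }
  AzureStor::blob_exists(container, key)  # per-key check on a miss
}
```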

> tar_make() runs without issue, but tar_read() fails unless the tar_option_set() code block is run explicitly before tar_read().

Yeah, that's a problem with the native AWS integration as well if tar_resources_aws() has AWS credentials. A semi-convenient workaround is to run tar_load_globals() first, or if possible, those credentials could be stored as environment variables in ~/.Renviron.
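In other words, something like:

```r
# Workaround sketch: tar_load_globals() runs _targets.R in the current
# session, which re-applies the tar_option_set() settings before reading.
library(targets)
tar_load_globals()
tar_read(data)  # placeholder target name
```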
