
Azure storage #1323

Open
wants to merge 4 commits into main

Conversation

jeffery-leirness-noaa

Prework

Related GitHub issues and pull requests

Summary

Implements basic utility functions in R/utils_azure.R and associated tests (tests/azure/test-utils_azure.R) to allow targets to interface with Azure storage.

@wlandau
Member

wlandau commented Aug 27, 2024

Thanks for putting this together, @jeffery-leirness-noaa. At the same time you worked on this PR, I implemented customizable content-addressable storage in #1322. Would you check it out and let me know if tar_repository_cas() is enough for you to use Azure storage with targets? If it works for you and Azure storage integration can exist outside the targets package itself, this will reduce the maintenance load on my end.
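For reference, a minimal sketch of what an Azure-backed tar_repository_cas() could look like, assuming the AzureStor package and a blob container addressed through two hypothetical environment variables (AZURE_CONTAINER_URL and AZURE_SAS):

```r
# Sketch only: assumes the AzureStor package and a blob container reachable
# through the hypothetical AZURE_CONTAINER_URL and AZURE_SAS environment
# variables. Each callback rebuilds the container handle from environment
# variables so it stays self-contained.
library(targets)

repository <- tar_repository_cas(
  upload = function(key, path) {
    container <- AzureStor::blob_container(
      Sys.getenv("AZURE_CONTAINER_URL"),
      sas = Sys.getenv("AZURE_SAS")
    )
    AzureStor::storage_upload(container, src = path, dest = key)
  },
  download = function(key, path) {
    container <- AzureStor::blob_container(
      Sys.getenv("AZURE_CONTAINER_URL"),
      sas = Sys.getenv("AZURE_SAS")
    )
    AzureStor::storage_download(container, src = key, dest = path)
  },
  exists = function(key) {
    container <- AzureStor::blob_container(
      Sys.getenv("AZURE_CONTAINER_URL"),
      sas = Sys.getenv("AZURE_SAS")
    )
    AzureStor::blob_exists(container, key)
  }
)

tar_option_set(repository = repository)
```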

@jeffery-leirness-noaa
Author

Hi @wlandau. I've been testing the functionality of tar_repository_cas() and I think it's a wonderfully elegant solution! So far, I've tested {targets} workflows using both Google Drive and Azure Storage as the specified repository for CAS with (mostly) success. Thank you for this brilliant integration!

I am encountering an error where tar_read() fails when using tar_option_set() to specify the custom repository and resources (via tar_resources(repository_cas = tar_resources_repository_cas(envvars = c()))). tar_make() runs without issue, but tar_read() fails unless the tar_option_set() code block is run explicitly before tar_read(). Hopefully that makes sense; if not, please let me know and I can elaborate further or attempt to provide a reproducible example.
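For context, the setup looks roughly like this (a simplified sketch; the repository object, environment variable, and target name are placeholders):

```r
# _targets.R (sketch; azure_repository comes from tar_repository_cas(),
# and AZURE_SAS is a placeholder environment variable name)
library(targets)
tar_option_set(
  repository = azure_repository,
  resources = tar_resources(
    repository_cas = tar_resources_repository_cas(
      envvars = c(AZURE_SAS = Sys.getenv("AZURE_SAS"))
    )
  )
)
list(
  tar_target(data, head(mtcars))
)
```

With this script, tar_make() succeeds, but tar_read(data) in a fresh session errors until the tar_option_set() block is rerun interactively.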

@wlandau
Member

wlandau commented Sep 4, 2024

> I think it's a wonderfully elegant solution!

Thanks! CAS turned out to be great for this because users of tar_repository_cas() don't need to futz with hashes.

I think the trickiest thing for people on your end is to make the exists function in tar_repository_cas() efficient for large pipelines. In https://docs.ropensci.org/targets/reference/tar_repository_cas.html#arguments I suggest maintaining an in-memory cache of objects: in theory, exists can submit a LIST request to list all the (relevant) objects the first time it is called, then submit a HEAD request only on cache misses. The challenge is that there are so many objects to list, and not every cloud API can restrict LIST to just the keys in tar_meta(targets_only = TRUE)$data. So it might be necessary to ask for better APIs, do some sort of filtering based on time stamps, or keep up with garbage collection.
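For example, a memoized exists() along these lines (a sketch, assuming the AzureStor package and the same hypothetical AZURE_CONTAINER_URL / AZURE_SAS environment variables) pays for one LIST up front and falls back to a per-key check only on cache misses:

```r
# Sketch of a cached exists() callback for tar_repository_cas().
# The first call issues one LIST request to populate an in-memory cache;
# later calls hit the API again (a HEAD-style check) only on cache misses.
cache <- new.env(parent = emptyenv())

exists_cached <- function(key) {
  container <- AzureStor::blob_container(
    Sys.getenv("AZURE_CONTAINER_URL"),
    sas = Sys.getenv("AZURE_SAS")
  )
  if (is.null(cache$keys)) {
    cache$keys <- AzureStor::list_blobs(container)$name  # one LIST request
  }
  if (key %in% cache$keys) {
    return(TRUE)
  }
  AzureStor::blob_exists(container, key)  # per-key check on a miss
}
```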

> tar_make() runs without issue, but tar_read() fails unless the tar_option_set() code block is run explicitly before tar_read().

Yeah, that's a problem with the native AWS integration as well if tar_resources_aws() has AWS credentials. A semi-convenient workaround is to run tar_load_globals() first, or if possible, those credentials could be stored as environment variables in ~/.Renviron.
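In other words, something like:

```r
# Workaround sketch: tar_load_globals() runs _targets.R in the current
# session, which re-applies the tar_option_set() settings before reading.
library(targets)
tar_load_globals()
tar_read(data)  # placeholder target name
```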
