Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: change average_spectra to take an iterator over SpectrumLike #6

Merged
merged 7 commits into from
Sep 6, 2024

Conversation

douweschulte
Copy link
Contributor

I was working on averaging spectra and saw room for some improvements. This will allow any iterator of any SpectrumLike instead of a slice of MultiLayerSpectrum (but is just as convenient for that use case). Additionally I fleshed out the documentation a bit based on some discussion I has with a coworker.

I was trying to make this into a trait implementation for FromIterator which would make it trivial to make any spectrum iterator and then to .collect() and get the average. But the necessary dx default for reprofiling centroid data made this impossible, unless a good universal default could be defined.

@mobiusklein
Copy link
Owner

Thank you for working on this.

The idea of making this work with all SpectrumLike types is reasonable, although it means working through the RefPeakDataLevel type in certain places. The last release introduced more efficient iterators for that type so this would be a good exercise for them. These would be used where you're getting a compiler error about as_ref not existing on that type.

I'll write more about dx shortly.

@douweschulte
Copy link
Contributor Author

I figured out that the error was because I did not actually run the code locally and as such did not test if it actually built. Now I think I fixed my original attempt. I did not check out the new iterators yet, sounds like a good thing to look into tomorrow.

Copy link
Owner

@mobiusklein mobiusklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the comment about dx inference.

Additionally, even if you derive an algorithm for inferring dx, it raises the possibility the calculation becomes inconsistent as you add more spectra.

src/spectrum/group.rs Outdated Show resolved Hide resolved
if let Some(peaks) = scan.peaks.as_ref() {
let fpeaks: Vec<_> = peaks
let peaks = spectrum.peaks();
let fitted_peaks: Vec<_> = peaks
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will use the new iterators. It will work as expected for centroid and raw spectra which contain centroid data. For deconvolved spectra, this will probably fall apart because they aren't sorted by m/z, but neutral mass, and even if you sort the points before reprofiling, such a signal is of little use since the charge dimension is lost and cannot be recovered.

If this is a MultiLayerSpectrum, which behavior you get depends upon SpectrumLike::signal_continuity and whether or not MultiLayerSpectrum::deconvolved_peaks is populated or not.

This is probably worth documenting, but I don't think it invalidates the proposed flexibility itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on your comment I restructured the code a bit to always use the .peaks() method as this is always available and will have the same behaviour as the previous implementation. The documentation was updated to reflect this fact and properly documents the deconvoluted case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sidenote: if with the new iterators you mean PeakDataIterDispatch than this is not yet exported to the public API.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for noting that. PeakDataIterDispatch should be exported. I'll add this to the main branch.

I have yet to find a balance on how much to put into the public API of a Rust library, complexity of the interface vs. ease of extension. The mzdata::spectrum::peaks module is a good example, with seven different types exported in the public API, but only two concepts, "abstraction over peak list types" and "abstraction over iteration over abstracted peak list types", with variations in ownership.

src/spectrum/group.rs Outdated Show resolved Hide resolved
src/spectrum/group.rs Outdated Show resolved Hide resolved
@mobiusklein
Copy link
Owner

The changes as they are look good to me. Please let me know if you have any other changes you want to make before merging.

@douweschulte
Copy link
Contributor Author

Do you think it is needed to change the .peaks() to the new iterators? Otherwise this looks like a good point for merging to me. I was thinking if adding this to some existing (or new) trait is warranted as that would allow for more efficient implementations if that would be necessary. But I do not have a direct use for that and that might be enough reason to not build it for now.

Additionally if there are other parts of the library you would like another pair of eyes on feel free to ask me to take a look.

Copy link
Owner

@mobiusklein mobiusklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mistaken on Saturday when I approved the changes. I did not properly recognize that the branching between profile and centroid spectra was gone and that only the RefPeakDataLevel variant was being used.

Using RefPeakDataLevel::iter makes when you have centroid data. It might be simplest to do something like this (not tested):

let peak_data = scan.peaks();
let mode = scan.signal_continuity();

match (mode, peak_data) {
    (_, RefPeakDataLevel::Deconvoluted(_)) | (_, RefPeakDataLevel::Missing) => None,
    (_,  RefPeakDataLevel::Centroid(_)) | (SignalContinuity::Profile, RefPeakDataLevel::RawData(_)) => {
       Some(reprofile(peak_data.iter().map(|p| FittedPeak::from(p.as_centroid())), dx)),
    },
    (_, RefPeakDataLevel::RawData(arrays)) => {
         Some(ArrayPair::from(arrays.mzs().unwarp(), arrays.intensities().unwrap()))
    }
}

There are other things we can do to reduce overhead here, but you're under no obligation to pursue them unless they are of interest to you. The clearest one is to use a single mzsignal::reprofile::PeakSetReprofiler instance for all re-profiling jobs here because reprofiling also creates a costly m/z grid spaced by dx, and the reprofile top-level function just creates a PeakSetReprofiler, reprofiles a single spectrum, then throws it away, while the PeakSetReprofiler can be re-used without needing to copy the m/z grid over-and-over again. Again, this is already complicated and we can leave it for future improvements if they are needed.

src/spectrum/group.rs Outdated Show resolved Hide resolved
src/spectrum/group.rs Outdated Show resolved Hide resolved
@mobiusklein
Copy link
Owner

mobiusklein commented Sep 3, 2024

Thank you for offering to continue working with the library.

Right now, spectrum averaging is on the SpectrumGroupingIterator trait. You could create a SpectrumLike member function that does the same thing as this function as well, but that would purely be for convenience as there isn't an obvious way that doing that would introduce extra information and save work during averaging.

Something that could use another pair of eyes is the peak loading scheme. peaks() loads peaks in a SpectrumLike-implementing type-specific order. MultiLayerSpectrum is currently the only type to have more than one kind of peak list and so I have added try_build_peaks, try_build_centroids, and try_build_deconvoluted_centroids to do the unpacking if the caller expects those to be available. There may be a more streamlined way to do that without just moving those method calls into the spectrum file reader.

Alternatively, please try to use the library in an application and tell me where it fails to do what you expect it to.

@douweschulte
Copy link
Contributor Author

I implemented your feedback. I also tried my hand at the reuse of PeakSetReprofiler, I assumed (but that assumption might be wrong) that the Vec<FittedPeak> will always be sorted. If this proves to be not the case the min_by can be extended to run over the entire vector. The trait I was talking about was mainly me trying to make it as easy as possible to call average now that you need to have a ref iterator it is somewhat cumbersome to call it with an owned iterator. But it would indeed not change all that much so might be not worth the effort.

I will take a look at the peaks() function. Otherwise I am using it in rustyms and the annotator right now so I will keep my eyes out for things that I can help with. One of the things I would like to add (after this PR) is a general API for lazy loading, I saw all your file types support this in one way or another but the current mzdata::io::MZReaderType::open_path loads everything at once.

@mobiusklein mobiusklein changed the title Allowed average from an iterator and added more documentation feature: change average_spectra to take an iterator over SpectrumLike Sep 6, 2024
@mobiusklein mobiusklein merged commit 0e0582e into mobiusklein:main Sep 6, 2024
2 checks passed
@douweschulte douweschulte deleted the generic_average branch September 9, 2024 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants