Performance of computing partial derivative #77

Open
smao-astro opened this issue Dec 27, 2020 · 3 comments
Assignees: shuheng-liu
Labels: enhancement (New feature or request), good first issue (Good for newcomers), question (Further information is requested)

Comments

@smao-astro

Hi,

I am pretty new to neurodiffeq; thank you very much for the excellent library.

I am interested in how partial derivatives w.r.t. the inputs are computed, and in the computational cost of doing so.

Take a forward ODE solver (1D, one unknown variable) as an example. The input is x, a batch of coordinates, and the output of the neural network is y, the approximate solution of the equation at these coordinates. If we view the neural network as a smooth function f that approximates the solution, the forward part of training evaluates y = f(x): for each input element x_i, the network gives y_i = f(x_i), with i running from 0 to N-1, where N is the batch size. When constructing the loss function, one evaluates the residual of the equation, which usually requires evaluating \frac{\partial y_i}{\partial x_i} and higher-order derivatives.
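For instance, for an equation \frac{dy}{dx} = g(x, y), the residual at each sample would be r_i = \frac{\partial y_i}{\partial x_i} - g(x_i, y_i), and the loss is typically the mean of r_i^2 over the batch.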

My question concerns the way \frac{\partial y_i}{\partial x_i} is evaluated. For example, suppose x is an (N, 1) tensor and y is also an (N, 1) tensor, where N is the batch size. If you call autograd.grad(y, t, create_graph=True, grad_outputs=ones, allow_unused=True) as in the lines below
https://github.com/odegym/neurodiffeq/blob/718f226d40cfbcb9ed50d72119bd3668b0c68733/neurodiffeq/neurodiffeq.py#L21-L22
my understanding is that it will evaluate a Jacobian matrix of size (N, N) with elements \frac{\partial y_i}{\partial x_j} (i, j from 0 to N-1), regardless of the fact that y_i depends only on x_i, so the computation (and storage) of the off-diagonal elements is useless and unnecessary. In other words, the computation could actually be done by evaluating N gradients, but the current method does N * N times the work.
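For concreteness, here is a minimal sketch of the call pattern I mean, with a toy network standing in for the solver network (the names and sizes are just for illustration, not the actual neurodiffeq code):

import torch
from torch import autograd

# Toy setup: a small network standing in for the ODE solution network.
net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
x = torch.rand(32, 1, requires_grad=True)  # batch of N = 32 coordinates, shape (N, 1)
y = net(x)                                 # approximate solution at each coordinate, shape (N, 1)

# The call pattern in question: a single call with grad_outputs = ones
ones = torch.ones_like(y)
dy_dx, = autograd.grad(y, x, create_graph=True, grad_outputs=ones, allow_unused=True)
print(dy_dx.shape)                         # (N, 1); entry i equals dy_i/dx_i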

My questions are:

  1. Is what I state above correct, to your understanding?
  2. If so, do you think it noticeably affects the computation speed?
  3. If 1 is correct, do you know of any way to reduce the computation needed, and do you have any plan to do so?

Thanks!

@shuheng-liu
Member

shuheng-liu commented Dec 27, 2020

Thanks for your interest in neurodiffeq. Your question is very good and thought-provoking; I need to check whether it indeed computes an N x N Jacobian when there's no interdependence within the batch. My intuition is that you are right.

In the meantime, may I ask how you came to know neurodiffeq? It'd be nice to know how we can further promote it :)

@shuheng-liu shuheng-liu added enhancement New feature or request good first issue Good for newcomers question Further information is requested labels Dec 27, 2020
@shuheng-liu shuheng-liu self-assigned this Dec 27, 2020
@shuheng-liu shuheng-liu pinned this issue Dec 27, 2020
@smao-astro
Author

Hi @shuheng-liu ,

Do you have any updates you would like to share on these questions? I am not familiar with automatic differentiation, so I would like to know whether the current implementation is efficient.

In addition, I noticed that this new API in the nightly build might be helpful. What do you think?

I cannot remember the details, since I noticed it earlier last year, maybe in some paper (?).

Thanks.

@shuheng-liu
Member

> computation (and storage) on the non-diagonal elements is useless and unnecessary

I am pretty sure it won't introduce additional storage because PyTorch accumulates (sums) the gradient w.r.t. the same variable. For example

import torch

x = torch.rand(10, 1, requires_grad=True)  # shape 10 x 1
y = torch.cat([x, x*2, x*3], 1)            # shape 10 x 3
y.backward(torch.ones_like(y))             # gradients w.r.t. x are summed over the 3 columns
print(x.grad)

will give you a 10 x 1 tensor filled with 6s, instead of a 10 x 3 tensor of (1, 2, 3)s

As for the computation part, I'm thinking it will be more expensive, though, because PyTorch uses reverse-mode automatic differentiation, so when the output has a high dimension the computation will slow down. I'm not sure whether this can be fixed without switching to a framework other than PyTorch.
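To make the comparison concrete, here is a small sketch (with a made-up toy network, not neurodiffeq internals) of the single vector-Jacobian-product call versus N separate per-sample backward passes; both return the same (N, 1) result, and the open question is how their costs compare inside autograd:

import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
x = torch.rand(128, 1, requires_grad=True)   # shape (N, 1)
y = net(x)                                   # shape (N, 1)

# (a) one reverse pass: a vector-Jacobian product with a vector of ones
dy_dx, = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y),
                             create_graph=True, allow_unused=True)

# (b) N reverse passes, differentiating one scalar output y_i at a time
rows = [torch.autograd.grad(y[i, 0], x, retain_graph=True)[0][i] for i in range(x.shape[0])]
dy_dx_loop = torch.stack(rows)               # also shape (N, 1)

print(torch.allclose(dy_dx, dy_dx_loop))     # True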

torch.vmap looks interesting but I don't have much experience with it. I'm still trying to understand it.
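If it helps, here is a rough sketch of how per-sample derivatives could be expressed with the vmap-style API (this uses the torch.func / functorch interface rather than the prototype torch.vmap, and the toy network is only a placeholder, so treat it as a sketch rather than a drop-in change for neurodiffeq):

import torch
from torch.func import grad, vmap   # in older versions these live in the functorch package

# Placeholder network; in practice this would be the solver network.
net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))

def f_scalar(x_i):
    # x_i has shape (1,); return a 0-dim scalar so grad() can be applied to it
    return net(x_i).squeeze()

x = torch.rand(32, 1)                # batch of N = 32 coordinates
dy_dx = vmap(grad(f_scalar))(x)      # shape (32, 1): dy_i/dx_i computed per sample
print(dy_dx.shape)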

@sathvikbhagavan sathvikbhagavan unpinned this issue Mar 29, 2021