diff --git a/_toc.yml b/_toc.yml index 82757d7..581dc03 100644 --- a/_toc.yml +++ b/_toc.yml @@ -36,6 +36,7 @@ parts: - file: physgrad.md - file: physgrad-comparison.ipynb - file: physgrad-nn.md + - file: physgrad-code.ipynb - file: physgrad-hig.md - file: physgrad-hig-code.ipynb - file: physgrad-discuss.md diff --git a/physgrad-code.ipynb b/physgrad-code.ipynb new file mode 100644 index 0000000..0cc091e --- /dev/null +++ b/physgrad-code.ipynb @@ -0,0 +1,641 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Physical Gradients for X\n", + "\n", + "placeholder only, insert more complex PG example \n", + "\n", + "..." + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [], + "source": [ + "dont run !!! import numpy as np\n", + "import tensorflow as tf\n", + "import time\n", + "import os\n", + "\n", + "os.environ[\"CUDA_VISIBLE_DEVICES\"]='' # set GPUs\n", + "#tf.config.run_functions_eagerly(True) # deactivate Tensorflow Tracing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Physics Model\n", + "\n", + "Differential equation: Coupled Linear Oscillators with Control term\n", + "\n", + "Solver: Runge-Kutta-4 \n", + "\n", + "Nx: Number of oscillators,\n", + "Nt: Number of time evolution steps,\n", + "dt: Length of one time step" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [], + "source": [ + "Nx = 2\n", + "Nt = 24\n", + "dt = 0.5\n", + "\n", + "def build_laplace(n,boundary='0'):\n", + " if n==1:\n", + " return np.zeros((1,1),dtype=np.float32)\n", + " d1 = -2 * np.ones((n,),dtype=np.float32)\n", + " d2 = 1 * np.ones((n-1,),dtype=np.float32)\n", + " lap = np.zeros((n,n),dtype=np.float32)\n", + " lap[range(n),range(n)]=d1\n", + " lap[range(1,n),range(n-1)]=d2\n", + " lap[range(n-1),range(1,n)]=d2\n", + " if boundary=='0':\n", + " lap[0,0]=lap[n-1,n-1]=-1\n", + "\n", + " return lap\n", + "\n", + "@tf.function\n", + "def coupled_oscillators_batch( x, control):\n", + " print('coup harm osc tracing')\n", + " '''\n", + " ODE of type: x' = f(x)\n", + " :param x_in: shape: (batch, 2 * number of osc)\n", + " order second index: x_i , v_i\n", + " :param control: shape:(batch,)\n", + " :return:\n", + " '''\n", + " n_osc = x.shape[1]//2\n", + "\n", + " # natural time evo\n", + " a1 = np.array([[0,1],[-1,0]],dtype=np.float32)\n", + " a2 = np.eye(n_osc,dtype=np.float32)\n", + " A = np.kron(a1,a2)\n", + " x_dot1 = tf.tensordot(x,A,axes = (1,1))\n", + "\n", + " # interaction term\n", + " interaction_strength = 0.2\n", + " b1 = np.array([[0,0],[1,0]],dtype=np.float32)\n", + " b2 = build_laplace(n_osc)\n", + " B = interaction_strength * np.kron(b1,b2)\n", + " x_dot2 = tf.tensordot(x,B, axes=(1, 1))\n", + "\n", + " # control term\n", + " control_vector = np.zeros((n_osc,),dtype=np.float32)\n", + " control_vector[-1] = 1.0\n", + " c1 = np.array([0,1],dtype=np.float32)\n", + " c2 = control_vector\n", + " C = np.kron(c1,c2)\n", + " x_dot3 = tf.tensordot(control,C, axes=0)\n", + "\n", + " #all terms\n", + " x_dot = x_dot1 + x_dot2 +x_dot3\n", + " return x_dot\n", + "\n", + "@tf.function\n", + "def runge_kutta_4_batch(x_0, dt, control, ODE_f_batch):\n", + "\n", + " f_0_0 = ODE_f_batch(x_0, control)\n", + " x_14 = x_0 + 0.5 * dt * f_0_0\n", + "\n", + " f_12_14 = ODE_f_batch(x_14, control)\n", + " x_12 = x_0 + 0.5 * dt * f_12_14\n", + "\n", + " f_12_12 = ODE_f_batch(x_12, control)\n", + " x_34 = x_0 + dt * f_12_12\n", + "\n", + " terms = f_0_0 + 2 * f_12_14 + 2 * f_12_12 + ODE_f_batch(x_34, control)\n", + " x1 = x_0 + dt * terms / 6\n", + "\n", + " return x1\n", + "\n", + "@tf.function\n", + "def solver(x0, control):\n", + " x = x0\n", + " for i in range(Nt):\n", + " x = runge_kutta_4_batch(x, dt, control[:,i], coupled_oscillators_batch)\n", + " return x\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Network Model" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [], + "source": [ + "act = tf.keras.activations.tanh\n", + "model = tf.keras.models.Sequential([\n", + " tf.keras.layers.InputLayer(input_shape=(2*Nx)),\n", + " tf.keras.layers.Dense(20, activation=act),\n", + " tf.keras.layers.Dense(20, activation=act),\n", + " tf.keras.layers.Dense(20, activation=act),\n", + " tf.keras.layers.Dense(Nt, activation='linear')\n", + " ])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loss Function" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [], + "source": [ + "@tf.function\n", + "def loss_function(a,b):\n", + " diff = a-b\n", + " loss_batch = tf.reduce_sum(diff**2,axis=1)\n", + " loss = tf.reduce_sum(loss_batch)\n", + " return loss" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data set\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [], + "source": [ + "N = 2**12\n", + "x_train = np.random.rand(N, 2 * Nx).astype(np.float32)\n", + "y_train = x_train" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Optimization\n", + "For the optimization procedure, we need to set up some parameters. \n", + "1. Adam: The truncation parameter has no meaning here.\n", + "2. PG: The specified optimizer is the one used for network optimization. The physics inversion is done via Gauss-Newton and corresponds to an exact inversion since the physical optimization landscape is quadratic. For the Jacobian inversion in Gauss-Newton, we can specify a truncation parameter.\n", + "3. HIG: To receive the HIG algorithm, the optimizer has to be set to SGD. For the Jacobian half-inversion, we can specify a truncation parameter. Optimal batch sizes are typically lower. Learning rates can usually be chosen around 1.\n", + "\n", + "Specify the maximal simulation time in seconds via max_number." + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": {}, + "outputs": [], + "source": [ + "# Adam Training\n", + "if 0:\n", + " mode = 'GD' \n", + " optimizer = tf.keras.optimizers.Adam\n", + " batch_size = 256\n", + " learning_rate = 0.001\n", + " max_number = 30#*60\n", + " trunc = 10**-10 " + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": {}, + "outputs": [], + "source": [ + "# PG training\n", + "if 0:\n", + " mode = 'PG' \n", + " optimizer = tf.keras.optimizers.Adam\n", + " batch_size = 256\n", + " learning_rate = 0.001\n", + " max_number = 30#*60\n", + " trunc = 10**-10" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "metadata": {}, + "outputs": [], + "source": [ + "# HIG training\n", + "if 1:\n", + " mode = 'HIG' # GD, PG, HIG\n", + " optimizer = tf.keras.optimizers.SGD\n", + " batch_size = 32\n", + " learning_rate = 1.0\n", + " max_number = 30#*60\n", + " trunc = 10**-10" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Function to construct the half-inverse of matrix for HIGs" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "metadata": {}, + "outputs": [], + "source": [ + "from tensorflow.python.framework import ops\n", + "from tensorflow.python.framework import tensor_shape\n", + "from tensorflow.python.ops import array_ops\n", + "from tensorflow.python.ops import math_ops\n", + "from tensorflow.python.util import dispatch\n", + "from tensorflow.python.util.tf_export import tf_export\n", + "from tensorflow.python.ops.linalg.linalg_impl import _maybe_validate_matrix,svd\n", + "\n", + "@tf_export('linalg.gpinv')\n", + "@dispatch.add_dispatch_support\n", + "def gpinv(a, rcond=None,beta=0.5, validate_args=False, name=None):\n", + "\n", + " with ops.name_scope(name or 'pinv'):\n", + " a = ops.convert_to_tensor(a, name='a')\n", + "\n", + " assertions = _maybe_validate_matrix(a, validate_args)\n", + " if assertions:\n", + " with ops.control_dependencies(assertions):\n", + " a = array_ops.identity(a)\n", + "\n", + " dtype = a.dtype.as_numpy_dtype\n", + "\n", + " if rcond is None:\n", + "\n", + " def get_dim_size(dim):\n", + " dim_val = tensor_shape.dimension_value(a.shape[dim])\n", + " if dim_val is not None:\n", + " return dim_val\n", + " return array_ops.shape(a)[dim]\n", + "\n", + " num_rows = get_dim_size(-2)\n", + " num_cols = get_dim_size(-1)\n", + " if isinstance(num_rows, int) and isinstance(num_cols, int):\n", + " max_rows_cols = float(max(num_rows, num_cols))\n", + " else:\n", + " max_rows_cols = math_ops.cast(\n", + " math_ops.maximum(num_rows, num_cols), dtype)\n", + " rcond = 10. * max_rows_cols * np.finfo(dtype).eps\n", + "\n", + " rcond = ops.convert_to_tensor(rcond, dtype=dtype, name='rcond')\n", + "\n", + " # Calculate pseudo inverse via SVD.\n", + " # Note: if a is Hermitian then u == v. (We might observe additional\n", + " # performance by explicitly setting `v = u` in such cases.)\n", + " [\n", + " singular_values, # Sigma\n", + " left_singular_vectors, # U\n", + " right_singular_vectors, # V\n", + " ] = svd(\n", + " a, full_matrices=False, compute_uv=True)\n", + "\n", + " # Saturate small singular values to inf. This has the effect of make\n", + " # `1. / s = 0.` while not resulting in `NaN` gradients.\n", + " cutoff = rcond * math_ops.reduce_max(singular_values, axis=-1)\n", + " singular_values = array_ops.where_v2(\n", + " singular_values > array_ops.expand_dims_v2(cutoff, -1), singular_values**beta,\n", + " np.array(np.inf, dtype))\n", + "\n", + " # By the definition of the SVD, `a == u @ s @ v^H`, and the pseudo-inverse\n", + " # is defined as `pinv(a) == v @ inv(s) @ u^H`.\n", + " a_pinv = math_ops.matmul(\n", + " right_singular_vectors / array_ops.expand_dims_v2(singular_values, -2),\n", + " left_singular_vectors,\n", + " adjoint_b=True)\n", + "\n", + " if a.shape is not None and a.shape.rank is not None:\n", + " a_pinv.set_shape(a.shape[:-2].concatenate([a.shape[-1], a.shape[-2]]))\n", + "\n", + " return a_pinv" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We define a Python class to organize the optimization algorithm. Using physics solver, network model, loss function and a data set. We compute loss values after each epoch. Each epoch consists of several mini batch updates. Depending on the optimization method, the mini batch update differs:\n", + "1. Adam: Compute loss gradient, then apply the Adam\n", + "2. PG: Compute loss gradient und physics Jacobian, invert them data-point-wise, define a modified loss to compute network updates\n", + "3. HIG: Compute loss gradient and network-physics Jacobian, then compute half-inversion" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": {}, + "outputs": [], + "source": [ + "class Optimization():\n", + " def __init__(self,model,solver,loss_function,x_train,y_train):\n", + " self.model = model\n", + " self.solver = solver\n", + " self.loss_function = loss_function\n", + " self.x_train = x_train\n", + " self.y_train = y_train\n", + " self.y_dim = y_train.shape[1]\n", + " self.weight_shapes = [weight_tensor.shape for weight_tensor in self.model.trainable_weights]\n", + "\n", + " def set_params(self,batch_size,learning_rate,optimizer,max_number,mode,trunc):\n", + " self.number_of_batches = N // batch_size\n", + " self.max_number = max_number\n", + " self.batch_size = batch_size\n", + " self.learning_rate = learning_rate\n", + " self.optimizer = optimizer(learning_rate)\n", + " self.mode = mode\n", + " self.trunc = trunc\n", + "\n", + "\n", + " def computation(self,x_batch, y_batch):\n", + " control_batch = self.model(y_batch)\n", + " y_prediction_batch = self.solver(x_batch,control_batch)\n", + " loss = self.loss_function(y_batch,y_prediction_batch)\n", + " return loss\n", + "\n", + "\n", + " @tf.function\n", + " def gd_get_derivatives(self,x_batch, y_batch):\n", + "\n", + " with tf.GradientTape(persistent=True) as tape:\n", + " tape.watch(self.model.trainable_variables)\n", + " loss = self.computation(x_batch,y_batch)\n", + " loss_per_dp = loss / self.batch_size\n", + " grad = tape.gradient(loss_per_dp, self.model.trainable_variables)\n", + " return grad\n", + "\n", + "\n", + " @tf.function\n", + " def pg_get_physics_derivatives(self,x_batch, y_batch): #physical grads\n", + "\n", + " with tf.GradientTape(persistent=True) as tape:\n", + " control_batch = self.model(y_batch)\n", + " tape.watch(control_batch)\n", + " y_prediction_batch = self.solver(x_batch,control_batch)\n", + " loss = self.loss_function(y_batch,y_prediction_batch)\n", + " loss_per_dp = loss / self.batch_size\n", + "\n", + " jacy = tape.batch_jacobian(y_prediction_batch,control_batch)\n", + " grad = tape.gradient(loss_per_dp, y_prediction_batch)\n", + " return jacy,grad,control_batch\n", + "\n", + " @tf.function\n", + " def pg_get_network_derivatives(self,x_batch, y_batch,new_control_batch): #physical grads\n", + "\n", + " with tf.GradientTape(persistent=True) as tape:\n", + " tape.watch(self.model.trainable_variables)\n", + " control_batch = self.model(y_batch)\n", + " loss = self.loss_function(new_control_batch,control_batch)\n", + " #y_prediction_batch = self.solver(x_batch,control_batch)\n", + " #loss = self.loss_function(y_batch,y_prediction_batch)\n", + " loss_per_dp = loss / self.batch_size\n", + "\n", + "\n", + "\n", + " network_grad = tape.gradient(loss_per_dp, self.model.trainable_variables)\n", + " return network_grad\n", + "\n", + " @tf.function\n", + " def hig_get_derivatives(self,x_batch,y_batch):\n", + "\n", + " with tf.GradientTape(persistent=True) as tape:\n", + " tape.watch(self.model.trainable_variables)\n", + " control_batch = self.model(y_batch)\n", + " y_prediction_batch = self.solver(x_batch,control_batch)\n", + " loss = self.loss_function(y_batch,y_prediction_batch)\n", + " loss_per_dp = loss / self.batch_size\n", + "\n", + " jacy = tape.jacobian(y_prediction_batch, self.model.trainable_variables, experimental_use_pfor=True)\n", + " loss_grad = tape.gradient(loss_per_dp, y_prediction_batch)\n", + " return jacy, loss_grad\n", + "\n", + "\n", + "\n", + " def mini_batch_update(self,x_batch, y_batch):\n", + " if self.mode==\"GD\":\n", + " grad = self.gd_get_derivatives(x_batch, y_batch)\n", + " self.optimizer.apply_gradients(zip(grad, self.model.trainable_weights))\n", + "\n", + " elif self.mode==\"PG\":\n", + " jacy,grad,control_batch = self.pg_get_physics_derivatives(x_batch, y_batch)\n", + " grad_e = tf.expand_dims(grad,-1)\n", + " pinv = tf.linalg.pinv(jacy,rcond=10**-5)\n", + " delta_control_label_batch = (pinv@grad_e)[:,:,0]\n", + " new_control_batch = control_batch - delta_control_label_batch\n", + " network_grad = self.pg_get_network_derivatives(x_batch, y_batch,new_control_batch)\n", + " self.optimizer.apply_gradients(zip(network_grad, self.model.trainable_weights))\n", + "\n", + " elif self.mode =='HIG':\n", + " jacy, grad = self.hig_get_derivatives(x_batch, y_batch)\n", + " flat_jacy_list = [tf.reshape(jac, (self.batch_size * self.y_dim, -1)) for jac in jacy]\n", + " flat_jacy = tf.concat(flat_jacy_list, axis=1)\n", + " flat_grad = tf.reshape(grad, (-1,))\n", + " inv = gpinv(flat_jacy, rcond=self.trunc)\n", + " processed_derivatives = tf.tensordot(inv, flat_grad, axes=(1, 0))\n", + " #processed_derivatives = self.linear_solve(flat_jacy, flat_grad)\n", + " update_list = []\n", + " l1 = 0\n", + " for k, shape in enumerate(self.weight_shapes):\n", + " l2 = l1 + np.prod(shape)\n", + " upd = processed_derivatives[l1:l2]\n", + " upd = np.reshape(upd, shape)\n", + " update_list.append(upd)\n", + " l1 = l2\n", + " self.optimizer.apply_gradients(zip(update_list, self.model.trainable_weights))\n", + "\n", + "\n", + " def epoch_update(self):\n", + " for batch_index in range(self.number_of_batches):\n", + " position = batch_index * batch_size\n", + " x_batch = self.x_train[position:position + batch_size]\n", + " y_batch = self.y_train[position:position + batch_size]\n", + " self.mini_batch_update(x_batch, y_batch)\n", + "\n", + " def eval(self,epoch,wc_time,ep_dur):\n", + "\n", + " print('Epoch: ', epoch,' WallClockTime: ',wc_time,' EpochDuration: ',ep_dur )\n", + "\n", + " train_loss = self.computation(self.x_train,self.y_train)\n", + " train_loss_per_dp = train_loss / N\n", + " print('TrainLoss:', train_loss_per_dp)\n", + " return train_loss_per_dp\n", + "\n", + " def start_training(self):\n", + " init_loss = self.eval(0,0,0)\n", + " init_time = time.time()\n", + " time_list = [init_time]\n", + " loss_list = [init_loss]\n", + "\n", + " epoch=0\n", + " wc_time = 0\n", + "\n", + " while wc_time" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "wct_times = {}\n", + "wct_times[mode] = wct_time\n", + "print(format(wct_times.keys()))\n", + "\n", + "wct_time = np.cumsum(time_list)\n", + "plt.plot(wct_time,loss_list)\n", + "plt.yscale('log')\n", + "plt.xlabel('Wall clock time')\n", + "plt.xlabel('Loss')\n", + "plt.title('Linear Chain - '+mode+'_'+str(optimizer.__name__))\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": {}, + "outputs": [], + "source": [ + "path = './'\n", + "name1 = 'time'+mode\n", + "name2 = 'loss'+mode\n", + "np.savetxt(path+name1+'.txt',time_list)\n", + "np.savetxt(path+name2+'.txt',loss_list)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "interpreter": { + "hash": "60a46218545510a486eb79ffac0761a327b44b02462576ac242e6c15f2abc4b7" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/physgrad-discuss.md b/physgrad-discuss.md index 8a3fe11..07331da 100644 --- a/physgrad-discuss.md +++ b/physgrad-discuss.md @@ -1,6 +1,16 @@ Discussion ======================= + +xxx TODO update , include HIG discussion xxx +... discarded supervised, and PIs + +PGs higher order, custom inverse , chain PDE & NN together + +HIG more generic, numerical inversion , joint physics & NN + + + In a way, the learning via physical gradients provide the tightest possible coupling of physics and NNs: the full non-linear process of the PDE model directly steers the optimization of the NN. diff --git a/physgrad-hig.md b/physgrad-hig.md index f11856a..ba4dab5 100644 --- a/physgrad-hig.md +++ b/physgrad-hig.md @@ -106,7 +106,7 @@ Here $y^1$ and $y^2$ denote the first, and second component of $y$ (in contrast We'll use a small neural network with a single hidden layer consisting of 7 neurons with _tanh()_ activations and the objective to learn $\hat{y}$. -## Well conditioned +## Well-conditioned Let's first look at the well-conditioned case with $\lambda=1$. In the following image, we'll compare Adam as the most popular SGD-representative, Gauss-Newton (GN) as "classical" method, and the HIGs. These methods are evaluated w.r.t. three aspects: naturally, it's interesting to see how the loss evolves. In addition, we'll consider the distribution of neuron activations from the resulting neural network states (more on that below). Finally, it's also interesting to observe how the optimization influences the resulting target states produced by the neural network in $y$ space. Note that the $y$-space graph below shows only a single but fairly representative $x,y$ pair, while the other two show quantities of a larger set of validation inputs. @@ -126,9 +126,7 @@ Finally, the third graph on the right shows the evolution in terms of a single i Overall, the behavior of all three methods is largely in line with what we'd expect: while the loss surely could go down more, and some of the steps in $y$ seem to momentarily do in the wrong direction, all three methods cope quite well with this case. Not surprisingly, this picture will change when making things harder with a more ill-conditioned Jacobian resulting from a small $\lambda$ -## Ill conditioned - -xxx CONTINUE xxx +## Ill-conditioned Now we can consider a less well-conditioned case with $\lambda=0.01$. The conditioning could be much worse in real-world PDE solvers, but interestingly, this factor of $1/100$ is sufficient to illustrate the problems that arise in practice. Here are the same 3 graphs for the ill-conditioned case: @@ -144,14 +142,13 @@ The loss curves now show a different behavior: both Adam and GN do not manage to This becomes even clearer in the middle graph, showing the activations statistics. Now the red curve of GN saturates at 1 without any variance. Hence, all neurons have saturated, and do not produce meaningful signals anymore. This not only means that the target function isn't approximated well, it also means all future gradients will effectively be zero, and these neurons are lost to all future learning iterations. Hence, this is a highly undesirable case that we want to avoid in practice. It's also worth pointing out that this doesn't always happen for GN. However it regularly happens, e.g. when individual samples in a batch lead to vectors in the Jacobian that are linearly dependent (or very close to it), and thus makes GN a sub-optimal choice. -The last graph of figure {ref}`hig-toy-example-bad` +The third graph on the right side of figure {numref}`hig-toy-example-bad` shows the resulting behavior in terms of the outputs. As already indicated by the loss values, both Adam and GN do not reach the target (the black dot). Interestingly, it's also apparent that both have much more problems along the $y^2$ direction, which we used to cause the bad conditioning: they both make some progress along the x-axis of the graph ($y^1$), but don't move much towards the $y^2$ target value. This is illustrating the discussions above: GN gets stuck due to its saturated neurons, while Adam struggles to undo the scaling of $y^2$. %We've kept the $\eta$ in here for consistency, but in practice $\eta=1$ is used for Gauss-Newton +## Summary of Half-Inverse Gradients -%PGs higher order, custom inverse , chain PDE & NN together - -%HIG more generic, numerical inversion , joint physics & NN - +Note that for all examples so far, we've improved upon the _differentiable physics_ (DP) training from the previous chapters. I.e., we've focused on combinations of neural networks and PDE solving operators. The latter need to be differentiable for training with regular SGD, as well as for HIG-based training. For the physical gradients, we even need them to provide an inverse solver. Thus, the HIGs described above share more similarities with, e.g., {doc}`diffphys-code-sol` and {doc}`diffphys-code-control`, than with {doc}`physgrad-code`. +This is a good time to give a specific code example of how to train physical NNs with HIGs: we'll look at a classic case, a system of coupled oscillators.