🧐 Deep Learning Guide 14: Minibatch Stochastic Gradient Descent
At the heart of the decision to use minibatches is computational efficiency. This is most easily understood by considering parallelization across multiple GPUs and multiple servers. In that case we need to send at least one image to each GPU. With 8 GPUs per server and 16 servers, we already arrive at a minibatch size no smaller than 128.
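Below is a minimal sketch of minibatch SGD on a synthetic linear regression problem, using plain NumPy rather than any particular framework; the batch size of 128 simply mirrors the 8 GPUs × 16 servers arithmetic above, and all names (true_w, lr, num_epochs) are illustrative.

```python
# Minimal minibatch SGD sketch for linear regression (NumPy only).
# batch_size=128 mirrors the 8 GPUs x 16 servers = 128 arithmetic above;
# here the whole minibatch is processed on one device for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(d)
lr, batch_size, num_epochs = 0.1, 128, 5

for epoch in range(num_epochs):
    idx = rng.permutation(n)                 # reshuffle each epoch
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient of 0.5 * mean squared error over the minibatch.
        grad = Xb.T @ (Xb @ w - yb) / len(batch)
        w -= lr * grad

print("estimation error:", np.linalg.norm(w - true_w))
```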
🧐 Deep Learning Guide 12: Gradient Descent
Although it is rarely used directly in deep learning, an understanding of gradient descent is key to understanding stochastic gradient descent algorithms.
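As a quick reminder of the basic iteration, here is a tiny full-batch gradient descent sketch on the one-dimensional objective f(x) = x²; the step size and step count are arbitrary choices for illustration.

```python
# Minimal (full-batch) gradient descent sketch on f(x) = x**2,
# whose gradient is 2*x; the minimizer is x = 0.
def gradient_descent(x0, lr=0.2, num_steps=10):
    x = x0
    for _ in range(num_steps):
        grad = 2 * x          # f'(x) for f(x) = x**2
        x -= lr * grad        # x <- x - lr * f'(x)
    return x

print(gradient_descent(10.0))  # moves toward 0 step by step
```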
🧐 Homework Answers 12.2. Convexity
Assume that we want to verify convexity of a set by drawing all line segments between points within the set and checking whether the segments are contained in the set.
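A numerical sketch of that segment test is given below: it samples pairs of points from the set and checks that points along the connecting segment stay inside. The membership predicate `inside`, the sampled point clouds, and the example sets (a disk and an annulus) are all hypothetical choices for illustration.

```python
# Segment test for convexity, approximated numerically (NumPy).
import numpy as np

def looks_convex(inside, points, num_ts=11):
    ts = np.linspace(0.0, 1.0, num_ts)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            for t in ts:
                z = (1 - t) * points[i] + t * points[j]
                if not inside(z):        # segment leaves the set
                    return False
    return True

# Example: the unit disk (convex) vs. an annulus (not convex).
disk = lambda z: np.dot(z, z) <= 1.0
annulus = lambda z: 0.5 <= np.dot(z, z) <= 1.0

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 2))
disk_pts = np.array([p for p in pts if disk(p)])
ann_pts = np.array([p for p in pts if annulus(p)])
print(looks_convex(disk, disk_pts))    # True
print(looks_convex(annulus, ann_pts))  # almost surely False
```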
🧐 Homework Answers 11.5. Self-Attention and Positional Encoding
Implement distance-based attention by modifying the DotProductAttention code. Note that you only need the squared norms of the keys for an efficient implementation.
🧐 Homework Answers 11.4. Multi-Head Attention
Implement distance-based attention by modifying the DotProductAttention code. Note that you only need the squared norms of the keys for an efficient implementation.
🧐 Homework Answers 11.3. Attention Scoring Functions
Implement distance-based attention by modifying the DotProductAttention code. Note that you only need the squared norms of the keys for an efficient implementation.
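The reason only the keys' squared norms are needed: with scores of the form −½‖q − k‖², expanding gives −½‖q‖² + qᵀk − ½‖k‖², and because the softmax over keys is invariant to adding a constant per query, the ‖q‖² term drops out. The sketch below implements this in plain NumPy; the function name distance_attention and the −½‖q − k‖² score choice are illustrative, not the book's exact DotProductAttention API.

```python
# Distance-based attention sketch (NumPy). Scores are -0.5 * ||q - k||^2.
# Expanding ||q - k||^2 = ||q||^2 - 2 q.k + ||k||^2 and noting that softmax
# over the keys is invariant to a per-query constant, the ||q||^2 term
# drops out: only the dot products and the keys' squared norms are needed.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_attention(queries, keys, values):
    # queries: (n_q, d), keys: (n_k, d), values: (n_k, d_v)
    key_sq_norms = (keys ** 2).sum(axis=1)            # (n_k,)
    scores = queries @ keys.T - 0.5 * key_sq_norms    # ||q||^2 / 2 omitted
    weights = softmax(scores, axis=1)                 # (n_q, n_k)
    return weights @ values

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(2, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
print(distance_attention(q, k, v).shape)  # (2, 3)
```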
🧐 Homework Answers 11.1. Queries, Keys, and Values
Suppose that you wanted to reimplement approximate (key, query) matches as used in classical databases. Which attention function would you pick?