Chrome 86: Improved Focus Highlighting, WebHID, and More

Unless otherwise noted, changes described below apply to the newest Chrome beta channel release for Android, Chrome OS, Linux, macOS, and Windows. Learn more about the features listed here through the provided links or from the list on ChromeStatus.com. Chrome 86 is beta as of September 3, 2020.

CSS Pseudo-Class: focus-visible and the Quick Focus Highlight

For users who rely on a keyboard or similar assistive technology to navigate the web, the focus indicator is a crucial visual affordance. To improve both the user and developer experience of working with focus, Chrome 86 is introducing two features.

The first is a CSS selector, :focus-visible, which lets a developer opt in to the same heuristic the browser uses when it's deciding whether to display a default focus indicator.

The second is a user setting called Quick Focus Highlight. When enabled, this setting causes an additional focus indicator to appear over the active element. Importantly, this indicator is visible even if the page has disabled focus styles with CSS, and it causes any :focus or :focus-visible styles to always be displayed. For details, see Giving users and developers more control over focus.

An animation of the quick focus highlight showing how it temporarily highlights a link in a line of text and then fades out to not obscure the text content.

WebHID API

Note: The origin trial for this feature was originally announced as starting in Chrome 85. That timeline changed.

There is a long tail of human interface devices (HIDs) that are too new, too old, or too uncommon to be accessible by systems' device drivers. The WebHID API solves this by providing a way to implement device-specific logic in JavaScript.

An HID is a device that takes input from or provides output to humans. Examples include keyboards, pointing devices (mice, touchscreens, etc.), and gamepads.

The inability to access uncommon or unusual HIDs is particularly painful when it comes to gamepad support. Gamepad inputs and outputs are not well standardized, and web browsers often require custom logic for specific devices. This is unsustainable and results in poor support for the long tail of older and uncommon devices.

We're working on an article to show you how to use the new API. In the meantime, we've found some demos from a few eager engineers that you can use to try the new API. To see those demos, check out Human interface devices on the web: a few quick examples. The Origin Trials section has information on signing up and a list of other origin trials starting in this release. This origin trial is expected to run through Chrome 87 in January 2021.
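As a rough illustration, here is a minimal sketch of what requesting and reading from a device might look like with WebHID; the vendor ID filter and report handling below are placeholders, so consult the explainer and the demos above for real device details.

// Minimal WebHID sketch; the vendor ID is a placeholder and the report
// layout is entirely device-specific.
async function connectToDevice() {
  // requestDevice() must be called from a user gesture (e.g., a click).
  const [device] = await navigator.hid.requestDevice({
    filters: [{ vendorId: 0x057e }]  // hypothetical vendor ID
  });
  if (!device) return;

  await device.open();

  // Input reports arrive as DataViews; parsing depends on the device's
  // report descriptor.
  device.addEventListener('inputreport', (event) => {
    const bytes = new Uint8Array(
      event.data.buffer, event.data.byteOffset, event.data.byteLength);
    console.log(`Report ${event.reportId} from ${device.productName}:`, bytes);
  });
}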

Origin Trials

This version of Chrome introduces the origin trials described below. Origin trials allow you to try new features and give feedback on usability, practicality, and effectiveness to the web standards community. To register for any of the origin trials currently supported in Chrome, including the ones described below, visit the Origin Trials dashboard. To learn more about origin trials themselves, visit the Origin Trials Guide for Web Developers.

New Origin Trials

Cross-Screen Window Placement

Adds new screen information APIs and makes incremental improvements to existing window placement APIs, allowing web applications to offer compelling multi-screen experiences.

The existing window.screen property offers a limited view of available screen space, while window placement functions are generally restricted to the current screen. This feature unlocks modern multi-screen capabilities for web applications.
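The API shape may still change during the origin trial; as a rough sketch, assuming the getScreens() method proposed in the explainer, placing a window on a secondary display might look something like this:

// Sketch only: getScreens() is the method proposed in the explainer and
// may change; window.open() positioning is standard.
async function openOnSecondScreen(url) {
  if ('getScreens' in window) {
    const screens = await window.getScreens();
    const target = screens[1] || screens[0];
    window.open(
      url, '_blank',
      `left=${target.availLeft},top=${target.availTop},` +
      `width=${target.availWidth},height=${target.availHeight}`);
  } else {
    window.open(url, '_blank');
  }
}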

battery-savings Meta Tag

Adds a meta tag allowing a site to recommend measures for the user agent to apply in order to save battery life and optimize CPU usage. Websites that are known to have high CPU or battery costs may want to request that the UA optimize for CPU or battery, even if the user has not requested it. Most modern operating systems also have battery saving features that activate either when the battery is low or the user wishes to save battery. Ideally, websites should be able to respect these settings. Sites may wish to advise the user agent on which strategies work best for the site in these situations.
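For illustration, the snippet below adds the meta tag from script; the content tokens shown are the ones proposed in the explainer and may change during the trial, so treat them as assumptions.

// Sketch only: the token names come from the battery-savings explainer
// and may change; check the origin trial documentation.
const meta = document.createElement('meta');
meta.name = 'battery-savings';
meta.content = 'allow-reduced-framerate allow-reduced-script-speed';
document.head.appendChild(meta);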

Secure Payment Confirmation

Secure payment confirmation augments the payment authentication experience on the web with the help of the Web Authentication API. The feature adds a new PaymentCredential credential type to the Credential Management API, which allows a relying party such as a bank to create a PublicKeyCredential that can be queried by any merchant origin as part of an online checkout via the Payment Request API using the proposed secure-payment-confirmation payment method.

This feature enables a consistent, low-friction, strong authentication experience using platform authenticators. Strong authentication with the user's bank is becoming a requirement for online payments in many regions, including the European Union. The new feature provides a better user experience and stronger security than existing solutions.
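As a very rough sketch of the merchant side, assuming the proposed secure-payment-confirmation payment method, a checkout flow might construct a PaymentRequest as shown below; the fields in the data dictionary (credential IDs, challenge, timeout) are placeholders based on the proposal and may not match the final API.

// Sketch only: the `data` fields are placeholders and the amounts are
// examples; show() must be called in response to a user gesture.
async function confirmPayment(savedCredentialId, serverChallenge) {
  const request = new PaymentRequest(
    [{
      supportedMethods: 'secure-payment-confirmation',
      data: {
        credentialIds: [savedCredentialId],  // registered earlier by the relying party (e.g., a bank)
        challenge: serverChallenge,          // fresh challenge from the merchant's server
        timeout: 60000
      }
    }],
    { total: { label: 'Total', amount: { currency: 'USD', value: '25.00' } } }
  );
  return request.show();
}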

Cross-Origin-Opener-Policy Reporting API

Adds a reporting API to help developers deploy cross-origin opener policy (COOP) on their websites. In addition to reporting breakages when COOP is enforced, it provides a report-only mode that reports potential breakages that would have happened had COOP been enforced. To register for the origin trial, follow the link above. For more information, see Making your website "cross-origin isolated" using COOP and COEP.

Completed Origin Trials

The following features, previously in a Chrome origin trial, are now enabled by default.

Native File System

The new Native File System API enables developers to build powerful web apps, such as IDEs, photo and video editors, and text editors, that interact with files on the user's local device. After a user grants access, this API allows web apps to read or save changes directly to files and folders on the user's device. It does all this by invoking the platform's own open and save dialog boxes. The image below shows a web page invoking the open dialog box on macOS.

For more information, sample code, and a text editor demonstration app, see The Native File System API: Simplifying access to local files.

Note: The API surface has changed considerably from what was available in the origin trial. Differences are explained in detail in the spec repo. In the coming weeks, watch the web.dev article listed above for a full explanation of how to use the production version of the API.
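As a quick sketch of the production surface (the picker methods were renamed after the origin trial), opening a file, reading it, and writing changes back looks roughly like this; see the article above for the authoritative walkthrough.

// Rough sketch of the post-origin-trial API surface.
async function editFile() {
  // Show the platform's open dialog and let the user pick a file.
  const [handle] = await window.showOpenFilePicker();
  const file = await handle.getFile();
  let contents = await file.text();

  contents += '\n<!-- edited -->';  // example modification

  // Write the changes back; this requires the user to grant write access.
  const writable = await handle.createWritable();
  await writable.write(contents);
  await writable.close();
}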

Other Features in This Release

Altitude and Azimuth for PointerEvents v3

Adds altitude and azimuth angles to PointerEvents, along with transformations between tiltX/tiltY and altitude/azimuth in both directions, depending on which pair is available from the device. These angles are the ones commonly measured by devices. Altitude and azimuth can be calculated from tiltX and tiltY using trigonometry; from a hardware perspective, tiltX and tiltY are easier and less expensive to measure.

From a stylus app's perspective, altitude and azimuth make more sense and are more intuitive for users. Using tiltX and tiltY requires a developer to visualize the intersection angle between two imaginary planes, while azimuth and altitude are easier to visualize just by looking at the pen and the screen surface.

Adding azimuth and altitude makes the API more intuitive. Providing conversion between tiltX/tiltY and altitude/azimuth in both directions preserves backwards compatibility with apps that use tiltX and tiltY (even if newer devices might only report altitude and azimuth).
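For example, a drawing app can read the new angles directly from pointer events and fall back to tiltX/tiltY on hardware that only reports tilt; a minimal sketch:

// Minimal sketch: use altitude/azimuth when the device reports them,
// otherwise fall back to tiltX/tiltY.
document.addEventListener('pointermove', (event) => {
  if (event.altitudeAngle !== undefined) {
    console.log(`altitude: ${event.altitudeAngle}, azimuth: ${event.azimuthAngle}`);
  } else {
    console.log(`tiltX: ${event.tiltX}, tiltY: ${event.tiltY}`);
  }
});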

A Well-Known URL for Changing Passwords

Websites can set a well-known URL for changing passwords (for example, /.well-known/change-password). This URL's purpose is to redirect users to the change-password page so they can update their passwords quickly. Chrome uses this URL to help users change their passwords when it detects a saved, compromised password. For more information, see Help users change passwords easily by adding a well-known URL for changing passwords.
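On the server this can be as simple as a redirect from the well-known path to your existing change-password page; a hypothetical Express sketch (the route destination is an example):

// Hypothetical sketch: redirect the well-known URL to the site's real
// change-password page.
const express = require('express');
const app = express();

app.get('/.well-known/change-password', (req, res) => {
  res.redirect(302, '/settings/password');  // example destination
});

app.listen(3000);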

Change Encoding of Space Character when URLs are Computed by Custom Protocol Handlers

The navigator.registerProtocolHandler() handler now replaces spaces with "%20" instead of "+". This makes Chrome consistent with other browsers such as Firefox.
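For example, with a handler registered as below (the web+demo scheme and handler URL are placeholders), a space in the handled URL is now substituted into %s as %20 rather than +:

// Placeholder scheme and handler URL; %s is replaced with the escaped
// URL being handled, and spaces in it are now encoded as %20, not +.
navigator.registerProtocolHandler(
  'web+demo',
  'https://example.com/handle?target=%s',
  'Demo handler'
);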

CSS ::marker Pseudo-Element

Adds a pseudo-element for customizing numbers and bullets for <ul> and <ol> elements. This change lets developers control the color, size, bullet shape, and number type.

Document-Policy Header

Document Policy restricts the surface area of the web platform on a per-document basis, similar to iframe sandboxing, but more flexibly. It can do things like:

  • Restrict the use of poorly-performing images.
  • Disable slow synchronous JavaScript APIs.
  • Configure iframe, image, or script loading styles.
  • Restrict overall document sizes or network usage.
  • Restrict patterns which cause page re-layout.

Additionally, the header allows sites to opt out of fragment and text-fragment scrolling on load as a privacy mitigation for the scroll-to-text-fragment feature. This is the first part of the Document Policy API to ship.

EME persistent-usage-record Session

Adds a new MediaKeySessionType named "persistent-usage-record", for which the license and keys are not persisted, but a record of key usage is persisted when the keys available within the session are destroyed. This feature may help content providers understand how decryption keys are used, for purposes like fraud detection.
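A rough sketch of requesting the new session type follows; whether a given key system supports persistent-usage-record sessions is implementation-specific, and the Clear Key system below is used purely as a placeholder.

// Sketch only: key system, codec string, and capability support are
// placeholders; real deployments use their DRM vendor's key system.
async function createUsageRecordSession(video) {
  const access = await navigator.requestMediaKeySystemAccess('org.w3.clearkey', [{
    initDataTypes: ['cenc'],
    videoCapabilities: [{ contentType: 'video/mp4; codecs="avc1.42E01E"' }],
    sessionTypes: ['persistent-usage-record']
  }]);
  const mediaKeys = await access.createMediaKeys();
  await video.setMediaKeys(mediaKeys);
  // The returned session records key usage even though keys are not persisted.
  return mediaKeys.createSession('persistent-usage-record');
}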

FetchEvent.handled

A FetchEvent dispatched to a service worker is in a loading pipeline, which is performance sensitive. The new FetchEvent.handled property returns a promise that resolves when a response is returned from a service worker to its client. This enables a service worker to delay tasks that can only run after responses are complete.
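For example, a service worker can defer non-critical work, such as logging, until the response has been handed back to the page; a minimal sketch:

// In a service worker: respond immediately, then defer logging until
// the response has been delivered to the client.
self.addEventListener('fetch', (event) => {
  event.respondWith(fetch(event.request));

  event.waitUntil((async () => {
    await event.handled;  // resolves once the response is returned to the client
    console.log('Responded to', event.request.url);
  })());
});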

HTMLMediaElement.preservesPitch

Adds a property that determines whether the pitch of an audio or video element is preserved when the playback rate is adjusted. This feature is useful for creative purposes (for example, pitch-shifting in "DJ deck" style applications). It also prevents the introduction of artifacts from pitch-preserving algorithms at playback speeds very close to 1.00. It is already supported by Safari and Firefox.
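For example, a "DJ deck" style player might disable pitch preservation so the pitch shifts naturally with the playback rate (the element selector below is a placeholder):

// Placeholder selector; with preservesPitch disabled, the audio's pitch
// follows the playback rate instead of being corrected.
const deck = document.querySelector('audio');
deck.preservesPitch = false;
deck.playbackRate = 1.25;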

Imperative Shadow DOM Distribution API

Web developers can now explicitly set the assigned nodes for a slot element. This solves two problems with Shadow DOM v1:

  • Web developers must specify a slot attribute for every one of a shadow host's children (except for elements for the default slot).
  • Component creators can't change the slotting behavior based on conditions.

For information on how the new API solves these issues, see the Imperative Shadow DOM Distribution API explainer.
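As a rough sketch, assuming the manual slot assignment mode described in the explainer, a component can opt out of name-based slotting and assign nodes based on its own conditions; note that the exact shape of assign() was still under discussion during standardization.

// Sketch only: slotAssignment: 'manual' opts the shadow root out of
// name-based slotting; the data-active condition is an example.
class ConditionalList extends HTMLElement {
  constructor() {
    super();
    this.attachShadow({ mode: 'open', slotAssignment: 'manual' });
    this.shadowRoot.innerHTML = '<slot></slot>';
  }
  connectedCallback() {
    const slot = this.shadowRoot.querySelector('slot');
    const active = [...this.children].filter(c => c.hasAttribute('data-active'));
    // Early implementations accepted an array (slot.assign(active)) rather
    // than variadic arguments.
    slot.assign(...active);
  }
}
customElements.define('conditional-list', ConditionalList);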

Move window.location.fragmentDirective

The window.location.fragmentDirective property has been moved to document.fragmentDirective. This is a change to the text fragments feature.

New Display Values for the <fieldset> Element

The <fieldset> element now supports 'inline-grid', 'grid', 'inline-flex', and 'flex' keywords for the CSS 'display' property.

ParentNode.replaceChildren() Method

Adds a method that replaces all children of the ParentNode with the passed-in nodes; a usage sketch follows the list below. Previously, there were a couple of different ways to replace a node's children with a new set of nodes, including:

  • Using node.innerHTML and node.append() to clear and replace all child nodes.
  • Using node.removeChild() and node.append() in a loop.
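With replaceChildren(), the same result is a single call. A minimal sketch (the list container and items below are placeholders):

// Placeholder container and nodes; replaceChildren() removes all existing
// children and appends the given nodes in one step.
const list = document.querySelector('ul');
const items = ['one', 'two', 'three'].map((text) => {
  const li = document.createElement('li');
  li.textContent = text;
  return li;
});
list.replaceChildren(...items);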

Safelist Distributed Web Schemes for registerProtocolHandler()

Chrome has extended the list of URL schemes that can be overridden via registerProtocolHandler() to include cabal, dat, did, dweb, ethereum, hyper, ipfs, ipns, and ssb. Extending the list to include decentralized web protocols allows links to generic entities to be resolved independently of the website or gateway providing access to them. For more information, see Programmable Custom Protocol Handlers at Are We Distributed Yet?

text/html Support for the Asynchronous Clipboard API

Previously, the Asynchronous Clipboard API did not support the text/html format. Chrome 86 adds support for copying and pasting HTML via the clipboard. The HTML is sanitized when it is read from and written to the clipboard. This change enables use cases such as:

  • Web editors, to copy and paste rich text with images and links.
  • Remote desktop applications, to synchronize text/html payloads across devices.

This change is also intended to help replace document.execCommand() for copy and paste functionality.
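A minimal sketch of writing and reading sanitized HTML through the asynchronous API follows; the markup is a placeholder, and clipboard access requires user activation and the appropriate permission.

// Placeholder markup; the browser sanitizes the HTML on write and read.
async function copyAndPasteHtml() {
  const html = '<p>Hello, <strong>world</strong>!</p>';
  await navigator.clipboard.write([
    new ClipboardItem({ 'text/html': new Blob([html], { type: 'text/html' }) })
  ]);

  const items = await navigator.clipboard.read();
  for (const item of items) {
    if (item.types.includes('text/html')) {
      const blob = await item.getType('text/html');
      console.log(await blob.text());
    }
  }
}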

VP9 for macOS Big Sur

The VP9 video codec is now available on macOS Big Sur whenever it's supported in the underlying hardware. If developers use the Media Capabilities API to detect playback smoothness and power efficiency, the logic in their player should automatically start preferring VP9 at higher resolutions without any action on their part. To take full advantage of this feature, developers should encode their VP9 files in multiple resolutions to accommodate varying user bandwidths and connections.
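For example, a player can ask the Media Capabilities API whether a given VP9 configuration is likely to be smooth and power efficient before preferring it; the codec string, resolution, and bitrate below are examples.

// Example configuration: "vp09.00.10.08" identifies VP9 profile 0.
async function shouldPreferVp9() {
  const result = await navigator.mediaCapabilities.decodingInfo({
    type: 'media-source',
    video: {
      contentType: 'video/webm; codecs="vp09.00.10.08"',
      width: 1920,
      height: 1080,
      bitrate: 2500000,
      framerate: 30
    }
  });
  return result.supported && result.smooth && result.powerEfficient;
}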

WebRTC Insertable Streams

Enables the insertion of user-defined processing steps in the encoding and decoding of a WebRTC MediaStreamTrack. This allows applications to insert custom data processing. An important use case this supports is end-to-end encryption of the encoded data transferred between RTCPeerConnections via an intermediate server.
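A rough sketch of attaching a transform to a sender's encoded frames is shown below, assuming the createEncodedStreams() surface from the explainer; the method and configuration names evolved during development, so check the current documentation. The transform here simply passes frames through, whereas a real application might encrypt frame.data.

// Sketch only: pipe a sender's encoded frames through a (no-op)
// TransformStream; requires insertable streams to be enabled on the
// RTCPeerConnection.
function addPassThroughTransform(sender) {
  const { readable, writable } = sender.createEncodedStreams();
  const transform = new TransformStream({
    transform(encodedFrame, controller) {
      // encodedFrame.data is an ArrayBuffer containing the encoded payload.
      controller.enqueue(encodedFrame);
    }
  });
  readable.pipeThrough(transform).pipeTo(writable);
}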

Deprecations and Removals

This version of Chrome introduces the deprecations and removals listed below. Visit ChromeStatus.com for lists of current deprecations and previous removals.

Remove WebComponents v0 from WebView

Web Components v0 was removed from desktop and Android in Chrome 80. Chromium 86 removes them from WebView. This removal includes Custom Elements v0, Shadow DOM v0, and HTML Imports.

Deprecate FTP Support

Chrome is deprecating and removing support for FTP URLs. The current FTP implementation in Google Chrome has no support for encrypted connections (FTPS) or proxies. Usage of FTP in the browser is sufficiently low that it is no longer viable to invest in improving the existing FTP client. In addition, more capable FTP clients are available on all affected platforms.

Chrome 72 and later removed support for fetching document subresources over FTP and rendering top-level FTP resources. Currently, navigating to FTP URLs results in showing a directory listing or a download, depending on the type of resource. A bug in Google Chrome 74 and later resulted in dropping support for accessing FTP URLs over HTTP proxies. Proxy support for FTP was removed entirely in Google Chrome 76.

The remaining capabilities of Google Chrome's FTP implementation are restricted to either displaying a directory listing or downloading a resource over unencrypted connections.

Deprecation of support will follow this timeline:

Chrome 86

FTP is still enabled by default for most users, but turned off for pre-release channels (Canary and Beta) and will be experimentally turned off for one percent of stable users. In this version you can re-enable it from the command line using either the --enable-ftp command line flag or the --enable-features=FtpProtocol flag.

Chrome 87

FTP support will be disabled by default for fifty percent of users but can be enabled using the flags listed above.

Chrome 88

FTP support will be disabled.

Chrome Beta for Android Update

 Hi everyone! We've just released Chrome Beta 86 (86.0.4240.16) for Android: it's now available on Google Play.

You can see a partial list of the changes in the Git log. For details on new features, check out the Chromium blog, and for details on web platform updates, check here.

If you find a new issue, please let us know by filing a bug.

Ben Mason
Google Chrome

Beta Channel Update for Desktop

The Chrome team is excited to announce the promotion of Chrome 86 to the beta channel for Windows, Mac and Linux. Chrome 86.0.4240.22 contains our usual under-the-hood performance and stability tweaks, but there are also some cool new features to explore - please head to the Chromium blog to learn more!



A full list of changes in this build is available in the log. Interested in switching release channels?  Find out how here. If you find a new issue, please let us know by filing a bug. The community help forum is also a great place to reach out for help or learn about common issues.



Google Chrome
Prudhvikumar Bommana

Giving users and developers more control over focus

Chrome 86 introduces two new features that improve both the user and developer experience when it comes to working with focus.


The :focus-visible pseudo-class is a CSS selector that lets developers opt in to the same heuristic the browser uses when it's deciding whether to show a default focus indicator. This makes styling focus more predictable.


The Quick Focus Highlight is a user preference that causes the currently focused element to display an indicator for two seconds. The Quick Focus Highlight will always display, even if a page has disabled focus styles using CSS. It will also cause all CSS focus styles to match regardless of the input device that is interacting with the page.


An animation of the quick focus highlight showing how it temporarily highlights a link in a line of text and then fades out to not obscure the text content.

What is focus?

When a user interacts with an element the browser will often show an indicator to signal that the element has "focus". This is sometimes referred to as the "focus ring" because browsers typically put a solid or dashed ring around the focused element.


A button with a default blue focus ring


The focus ring signals to the user which element will receive keyboard events. If a user is tabbing through a form, the focus ring indicates which text field they can type into, or if they've focused a submit button they will know that pressing Enter or Spacebar will activate that button.

Problems with focus

For users who rely on a keyboard or other assistive technology to access the page, the focus ring acts as their mouse pointer - it's how they know what they are interacting with.


Unfortunately, many websites hide the focus ring using CSS. Oftentimes they do this because the underlying behavior of focus can be difficult to understand, and styling focus can have surprising consequences.


For example, a custom dropdown menu should use the tabindex attribute to make itself keyboard operable. But adding a tabindex to an element causes all browsers to show a focus ring on that element if it is clicked with a mouse. If a developer is surprised to see the focus ring when they click the menu, they might use the following CSS to hide it:


.custom-dropdown-menu:focus {
  outline: none;
}

This "fixes" their issue, insofar as they no longer see the focus ring when they click the menu. However, they have unknowingly broken the experience for users relying on a keyboard to access the page. As mentioned earlier, for users who rely on a keyboard to access the page, the focus ring acts as their mouse pointer. Therefore, CSS that removes the focus ring (without providing an alternative) is akin to hiding the mouse pointer.


To improve on this situation, developers need a better way to style focus - one that matches their expectations of how focus should work, and doesn't run the risk of breaking the experience for users. At the same time, users need to have the final say in the experience and should be able to choose when and how they see focus. This is where :focus-visible and the Quick Focus Highlight come in.

:focus-visible

Whenever you click on an element, browsers use an internal heuristic to determine whether they should display a default focus indicator. This is why in Chrome tabbing to a <button> shows a focus ring, but clicking it with a mouse does not. 


When you use :focus to style an element, it tells the browser to ignore its heuristic and to always show your focus style. For some situations this can break the user's expectation and lead to a confusing experience.


:focus-visible, on the other hand, will invoke the same heuristic that the browser uses when it's deciding whether to show the default focus indicator. This allows focus styles to feel more intuitive. In Chrome 86 and beyond, this should be all you need to style focus:


/* Focusing the button with a keyboard will show a dashed black line. */
button:focus-visible {
  outline: 4px dashed black;
}

By combining :focus-visible with :focus you can take things a step further and provide different focus styles depending on the user's input device. This can be helpful if you want the focus indicator to depend on the precision of the input device:


/* Focusing the button with a keyboard will show a dashed black line. */
button:focus-visible {
  outline: 4px dashed black;
}
  
/* Focusing the button with a mouse, touch, or stylus will show a subtle drop shadow. */
button:focus:not(:focus-visible) {
  outline: none;
  box-shadow: 1px 1px 5px rgba(1, 1, 0, .7);
}

The snippet above says that if the browser would normally show a focus indicator, then it should do so using a 4px dashed black outline. Additionally, the example relies on the existing :focus behavior and says that if an element has focus, but the browser would not normally show a default focus ring, then it should show a drop shadow. 


Since the browser doesn't usually show a default focus ring when a user clicks on a button, the :focus:not(:focus-visible) pattern can be an easy way to specifically target mouse/touch focus.


Note that not all browsers set focus in the same way, so the above snippet will work in Chromium based browsers, but may not work in others.

The :focus-visible heuristic

Understanding the browsers’ heuristics for focus indicators will help you understand when to use :focus-visible. Unfortunately, the heuristic has never been specified, so the behavior is subtly different in every browser. The :focus-visible specification suggests one possible heuristic based on the behavior browsers currently demonstrate. Here's a quick breakdown:


Has the user expressed a preference to always see a focus indicator?

If the user has indicated that they always want to see a focus indicator, then :focus-visible will always match on the focused element, just like :focus does.


Does the element require text input?

:focus-visible will always match when an element which requires text input (for example, <input type="text">) is focused.


A quick way to know if an element is likely to require text input is to ask yourself "If I were to tap on this element using a mobile device, would I expect to see a virtual keyboard?" If the answer is "yes" then the element will match :focus-visible.


What input device is being used?

If the user is using a keyboard to navigate the page, then :focus-visible will match on any interactive element (including any element with tabindex) which becomes focused. If they're using a mouse or touch screen, then it will only match if the focused element requires text input. 


Was focus moved programmatically?

If focus is moved programmatically by calling focus(), the newly focused element will only match :focus-visible if the previously focused element matched it as well. 


For example, if a user presses a physical key, and the event handler opens a menu and moves focus to the first menu item, :focus-visible will still match and the menu item will have a focus style.


Because mouse users may frequently use keyboard shortcuts, Chrome's implementation will bypass "keyboard mode" if a meta key (such as command, control, etc.) is pressed. For example, if a user who was previously using a mouse pressed a keyboard shortcut which shows a settings dialog, :focus-visible would not match on the focused element in the settings dialog.

Support and polyfill

Currently, :focus-visible is only supported in Chrome 86 and other Chromium-based browsers, though there's work underway to add support to Firefox. Refer to the MDN browser compatibility table to keep track of current support.


If you'd like to use :focus-visible today you can do so with the help of the :focus-visible polyfill. Once the polyfill is loaded, you can use the .focus-visible class instead of :focus-visible to achieve similar results:


/* Define mouse/touch focus indicators. */
.js-focus-visible :focus:not(.focus-visible) {
  …
}
  
/* Define keyboard focus indicators. */
.js-focus-visible .focus-visible {
  …
}

Note that the MDN support table shows Firefox supports a similar selector, :-moz-focusring, on which :focus-visible is based; however, the behavior of the two selectors is quite different, so it's recommended to use the :focus-visible polyfill if you need cross-browser support.


Quick Focus Highlight

:focus-visible makes it easier for developers to selectively style focus and avoids pitfalls with the existing :focus selector. While this is a great addition to the developer toolbox, some users, particularly those with cognitive impairments, find it helpful to always see a focus indicator, and they may find it distressing when the indicator appears less often due to selective styling with :focus-visible.


For these users, Chrome 86 adds a setting called Quick Focus Highlight.


Quick Focus Highlight temporarily highlights the currently focused element, and causes :focus-visible to always match.


To enable Quick Focus Highlight:


  1. Go to Chrome's settings menu (or type chrome://settings into the address bar).

  2. Click Advanced then Accessibility.

  3. Enable the toggle switch to Show a quick highlight on the focused object.


Once Quick Focus Highlight is enabled, focused elements will show a white-blue outline with a blue glow (see the image below). The Highlight uses these alternating colors to ensure that it has proper contrast on any background.


The quick focus highlight on a white, black, and blue background. The rings are visible in all scenarios.


The Highlight is outset from the focused element to avoid interfering with that element's existing focus styles or drop shadows. The Highlight will fade out after two seconds to avoid obscuring page content, such as text.


FAQ

User input can be multi-modal; for example, some 2-in-1 laptops support mouse, keyboard, touch, and stylus. How does :focus-visible work with these devices?

Because :focus-visible uses the same heuristic as a default focus indicator, the experience should match what users expect on these platforms when they interact with unstyled HTML elements.


In other words, if developers use :focus-visible as their primary means to style focus, then the experience should be more consistent for all users regardless of their input device.

Does :focus-visible expose sensitive information?

Most of the time, :focus-visible matching only indicates that a user is using the keyboard, or has focused an element which takes text input.


:focus-visible could potentially be used to detect that a user has enabled a preference to always show a focus indicator, by tracking mouse and keyboard events and checking matches(":focus-visible") on elements which were focused when the keyboard is not being used. Since the precise details of when :focus-visible should match are left up to the browser's implementation, this would not be a completely reliable method.

What's the impact on users with low vision or cognitive impairments?

:focus-visible and the Quick Focus Highlight were designed to work together to help these users.


:focus-visible aims to address the common anti-pattern of developers removing the focus indicator from all of their controls. Using the browser's focus heuristic helps by creating fewer surprises for developers when the focus ring appears, meaning fewer reasons to use CSS to hide the ring.


For some users the browser's default behavior may still be insufficient. They may want to see a focus ring regardless of the type of control they're interacting with, or the input device they're using. That's where the Quick Focus Highlight can help.


The Quick Focus Highlight lets users increase the visibility of the focus indicator, and makes it so :focus-visible always matches, regardless of their input device. This combination of effects should make the currently focused element much easier to identify.


Why not have an "always on" focus indicator?

The Quick Focus Highlight does not currently support an "always on" mode because it's difficult to design a universal focus overlay that does not obscure page content. As a result, the Highlight fades out after two seconds and relies on either the browser's default focus indicator or the page author's :focus and :focus-visible styles.


Because the Highlight is a user preference, its behavior can be changed in the future if users would prefer that it always stay on.

Should we also add :focus-visible-within?

There has been discussion around adding :focus-visible-within, but a proposal will require additional use cases. If you feel like you have a good use case for :focus-visible-within please add it to the discussion issue.


We welcome your feedback!

:focus-visible and the Quick Focus Highlight are the product of years of work and feedback from developers in the :focus-visible WICG repo and the standards bodies. We'd like to say thank you to everyone who helped shape these features.


Give :focus-visible and the new Highlight a shot, and tell us what you think.

If you've found an issue with the Quick Focus Highlight, attach a screenshot and send it to our support tracker.

If you've found an issue with :focus-visible, use this template to file a chromium bug.

KeyPose: Estimating the 3D Pose of Transparent Objects from Stereo

Estimating the position and orientation of 3D objects is one of the core problems in computer vision applications that involve object-level perception, such as augmented reality and robotic manipulation. In these applications, it is important to know the 3D position of objects in the world, either to directly affect them, or to place simulated objects correctly around them. While there has been much research on this topic using machine learning (ML) techniques, especially Deep Nets, most have relied on the use of depth sensing devices, such as the Kinect, which give direct measurements of the distance to an object. For objects that are shiny or transparent, direct depth sensing does not work well. For example, the figure below includes a number of objects (left), two of which are transparent stars. A depth device does not find good depth values for the stars, and gives a very poor reconstruction of the actual 3D points (right).

Left: RGB image of transparent objects. Right: A four-panel image showing the reconstructed depth for the scene on the left. The top row includes depth images and the bottom row presents the 3D point cloud. The left panels were reconstructed using a depth camera and the right panels are output from the ClearGrasp model. Note that although ClearGrasp inpaints the depth of the stars, it misjudges the actual depth of the rightmost one.

One solution to this problem, such as that proposed by ClearGrasp, is to use a deep neural network to inpaint the corrupted depth map of the transparent objects. Given a single RGB-D image of transparent objects, ClearGrasp uses deep convolutional networks to infer surface normals, masks of transparent surfaces, and occlusion boundaries, which it uses to refine the initial depth estimates for all transparent surfaces in the scene (far right in the figure above). This approach is very promising, and allows scenes with transparent objects to be processed by pose-estimation methods that rely on depth.  But inpainting can be tricky, especially when trained completely with synthetic images, and can still result in errors in depth.

In “KeyPose: Multi-View 3D Labeling and Keypoint Estimation for Transparent Objects”, presented at CVPR 2020 in collaboration with the Stanford AI Lab, we describe an ML system that estimates the depth of transparent objects by directly predicting 3D keypoints. To train the system we gather a large real-world dataset of images of transparent objects in a semi-automated way, and efficiently label their pose using 3D keypoints selected by hand. We then train deep models (called KeyPose) to estimate the 3D keypoints end-to-end from monocular or stereo images, without explicitly computing depth. The models work on objects both seen and unseen during training, for both individual objects and categories of objects. While KeyPose can work with monocular images, the extra information available from stereo images allows it to improve its results by a factor of two over monocular image input, with typical errors from 5 mm to 10 mm, depending on the objects. It substantially improves over state-of-the-art in pose estimation for these objects, even when competing methods are provided with ground truth depth. We are releasing the dataset of keypoint-labeled transparent objects for use by the research community.

Real-World Transparent Object Dataset with 3D Keypoint Labels
To facilitate gathering large quantities of real-world images, we set up a robotic data-gathering system in which a robot arm moves through a trajectory while taking video with two devices, a stereo camera and the Kinect Azure depth camera.

Automated image sequence capture using a robot arm with a stereo camera and an Azure Kinect device.

The AprilTags on the target enable accurate tracking of the pose of the cameras. By hand-labelling only a few images in each video with 2D keypoints, we can extract 3D keypoints for all frames of the video using multi-view geometry, thus increasing the labelling efficiency by a factor of 100.

We captured imagery for 15 different transparent objects in five categories, using 10 different background textures and four different poses for each object, yielding a total of 600 video sequences comprising 48k stereo and depth images. We also captured the same images with an opaque version of the object, to provide accurate ground truth depth images. All the images are labelled with 3D keypoints. We are releasing this dataset of real-world images publicly, complementing the synthetic ClearGrasp dataset with which it shares similar objects.

KeyPose Algorithm Using Early Fusion Stereo
The idea of using stereo images directly for keypoint estimation was developed independently for this project; it has also appeared recently in the context of hand-tracking. The diagram below shows the basic idea: the two images from a stereo camera are cropped around the object and fed to the KeyPose network, which predicts a sparse set of 3D keypoints that represent the 3D pose of the object. The network is trained using supervision from the labelled 3D keypoints.

One of the key aspects of stereo KeyPose is the use of early fusion to intermix the stereo images, and allow the network to implicitly compute disparity, in contrast to late fusion, in which keypoints are predicted for each image separately, and then combined. As shown in the diagram below, the output of KeyPose is a 2D keypoint heatmap in the image plane along with a disparity (i.e., inverse depth) heatmap for each keypoint. The combination of these two heatmaps yields the 3D coordinate of the keypoint, for each keypoint.
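Concretely, for a rectified stereo pair with focal length f, baseline B, and principal point (cx, cy), the peak of the 2D heatmap gives the keypoint's image coordinates (u, v) and the disparity heatmap gives d, so the 3D position follows from standard triangulation: Z = f·B / d, X = (u − cx)·Z / f, and Y = (v − cy)·Z / f. (This is the textbook pinhole-stereo relation, included here for illustration; the paper describes the exact formulation used by KeyPose.)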

KeyPose system diagram. Stereo images are passed to a CNN model to produce a probability heatmap for each keypoint. This heatmap yields 2D image coordinates U,V for the keypoint. The CNN model also produces a disparity (inverse depth) heatmap for each keypoint, which, when combined with the U,V coordinates, gives a 3D position (X,Y,Z).

When compared to late fusion or to monocular input, early fusion stereo typically is twice as accurate.

Results
The images below show qualitative results of KeyPose on individual objects. On the left is one of the original stereo images; in the middle are the predicted 3D keypoints projected onto the image. On the right, we visualize points from a 3D model of the bottle, placed at the pose determined by the predicted 3D keypoints. The network is efficient and accurate, predicting keypoints with an MAE of 5.2 mm for the bottle and 10.1 mm for the mug using just 5 ms on a standard GPU.

The following table shows results for KeyPose on category-level estimation. The test set used a background texture not seen by the training set. Note that the MAE varies from 5.8 mm to 9.9 mm, showing the accuracy of the method.

Quantitative comparison of KeyPose with the state-of-the-art DenseFusion system, on category-level data. We provide DenseFusion with two versions of depth, one from the transparent objects, and one from opaque objects. <2cm is the percent of estimates with errors less than 2 cm. MAE is the mean absolute error of the keypoints, in mm.

For a complete accounting of quantitative results, as well as ablation studies, please see the paper, the supplementary materials, and the KeyPose website.

Conclusion
This work shows that it is possible to accurately estimate the 3D pose of transparent objects from RGB images without reliance on depth images. It validates the use of stereo images as input to an early fusion deep net, where the network is trained to extract sparse 3D keypoints directly from the stereo pair. We hope the availability of an extensive, labelled dataset of transparent objects will help to advance the field. Finally, while we used semi-automatic methods to efficiently label the dataset, we hope to employ self-supervision methods in future work to do away with manual labelling.

Acknowledgements
I want to thank my co-authors, Xingyu Liu of Stanford University, and Rico Jonschkowski and Anelia Angelova, as well as the many who helped us through discussions during the project and paper writing, including Andy Zheng, Shuran Song, Vincent Vanhoucke, Pete Florence, and Jonathan Tompson.

The Technology Behind our Recent Improvements in Flood Forecasting

Flooding is the most common natural disaster on the planet, affecting the lives of hundreds of millions of people around the globe and causing around $10 billion in damages each year. Building on our work in previous years, earlier this week we announced some of our recent efforts to improve flood forecasting in India and Bangladesh, expanding coverage to more than 250 million people, and providing unprecedented lead time, accuracy and clarity.

To enable these breakthroughs, we have devised a new approach for inundation modeling, called a morphological inundation model, which combines physics-based modeling with machine learning (ML) to create more accurate and scalable inundation models in real-world settings. Additionally, our new alert-targeting model makes it possible to identify areas at risk of flooding at unprecedented scale using end-to-end machine learning models and data that is publicly available globally. In this post, we also describe developments for the next generation of flood forecasting systems, called HydroNets (presented at ICLR AI for Earth Sciences and EGU this year), a new architecture built specifically for hydrologic modeling across multiple basins while still optimizing for accuracy at each location.

Forecasting Water Levels
The first step in a flood forecasting system is to identify whether a river is expected to flood. Hydrologic models (or gauge-to-gauge models) have long been used by governments and disaster management agencies to improve the accuracy and extend the lead time of their forecasts. These models receive inputs like precipitation or upstream gauge measurements of water level (i.e., the absolute elevation of the water above sea level) and output a forecast for the water level (or discharge) in the river at some time in the future.

The hydrologic model component of the flood forecasting system described in this week’s Keyword post doubled the lead time of flood alerts for areas covering more than 75 million people. These models not only increase lead time, but also provide unprecedented accuracy, achieving an R2 score of more than 99% across all basins we cover, and predicting the water level within a 15 cm error bound more than 90% of the time. Once a river is predicted to reach flood level, the next step in generating actionable warnings is to convert the river level forecast into a prediction for how the floodplain will be affected.

Morphological Inundation Modeling
In prior work, we developed high quality elevation maps based on satellite imagery, and ran physics-based models to simulate water flow across these digital terrains, which allowed warnings with unprecedented resolution and accuracy in data-scarce regions. In collaboration with our satellite partners, Airbus, Maxar and Planet, we have now expanded the elevation maps to cover hundreds of millions of square kilometers. However, in order to scale up the coverage to such a large area while still retaining high accuracy, we had to re-invent how we develop inundation models.

Inundation modeling estimates what areas will be flooded and how deep the water will be. This visualization conceptually shows how inundation could be simulated, how risk levels could be defined (represented by red and white colors), and how the model could be used to identify areas that should be warned (green dots).

Inundation modeling at scale suffers from three significant challenges. Due to the large areas involved and the resolution required for such models, they necessarily have high computational complexity. In addition, most global elevation maps don’t include riverbed bathymetry, which is important for accurate modeling. Finally, the errors in existing data, which may include gauge measurement errors, missing features in the elevation maps, and the like, need to be understood and corrected. Correcting such problems may require collecting additional high-quality data or fixing erroneous data manually, neither of which scale well.

Our new approach to inundation modeling, which we call a morphological model, addresses these issues by using several innovative tricks. Instead of modeling the complex behaviors of water flow in real time, we compute modifications to the morphology of the elevation map that allow one to simulate the inundation using simple physical principles, such as those describing hydrostatic systems.

First, we train a pure-ML model (devoid of physics-based information) to estimate the one-dimensional river profile from gauge measurements. The model takes as input the water level at a specific point on the river (the stream gauge) and outputs the river profile, which is the water level at all points in the river. We assume that if the gauge increases, the water level increases monotonically, i.e., the water level at other points in the river increases as well. We also assume that the absolute elevation of the river profile decreases downstream (i.e., the river flows downhill).

We then use this learned model and some heuristics to edit the elevation map to approximately “cancel out” the pressure gradient that would exist if that region were flooded. This new synthetic elevation map provides the foundation on which we model the flood behavior using a simple flood-fill algorithm. Finally, we match the resulting flooded map to the satellite-based flood extent with the original stream gauge measurement.

This approach abandons some of the realistic constraints of classical physics-based models, but in data scarce regions where existing methods currently struggle, its flexibility allows the model to automatically learn the correct bathymetry and fix various errors to which physics-based models are sensitive. This morphological model improves accuracy by 3%, which can significantly improve forecasts for large areas, while also allowing for much more rapid model development by reducing the need for manual modeling and correction.

Alert Targeting
Many people reside in areas that are not covered by the morphological inundation models, yet access to accurate predictions is still urgently needed. To reach this population and to increase the impact of our flood forecasting models, we designed an end-to-end ML-based approach, using almost exclusively data that is globally publicly available, such as stream gauge measurements, public satellite imagery, and low-resolution elevation maps. We train the model to use the data it receives to directly infer the inundation map in real time.

A direct ML approach from real-time measurements to inundation.

This approach works well “out of the box” when the model only needs to forecast an event that is within the range of events previously observed. Extrapolating to more extreme conditions is much more challenging. Nevertheless, proper use of existing elevation maps and real-time measurements can enable alerts that are more accurate than presently available for those in areas not covered by the more detailed morphological inundation models. Because this model is highly scalable, we were able to launch it across India after only a few months of work, and we hope to roll it out to many more countries soon.

Improving Water Levels Forecasting
In an effort to continue improving flood forecasting, we have developed HydroNets, a specialized deep neural network architecture built specifically for water level forecasting, which allows us to utilize some exciting recent advances in ML-based hydrology in a real-world operational setting. Two prominent features distinguish it from standard hydrologic models. First, it is able to differentiate between model components that generalize well between sites, such as the modeling of rainfall-runoff processes, and those that are specific to a given site, like the rating curve, which converts a predicted discharge volume into an expected water level. This enables the model to generalize well to different sites, while still fine-tuning its performance to each location. Second, HydroNets takes into account the structure of the river network being modeled, by training a large architecture that is actually a web of smaller neural networks, each representing a different location along the river. This allows neural networks that are modeling upstream sites to pass information encoded in embeddings to models of downstream sites, so that every model can know everything it needs without a drastic increase in parameters.

The animation below illustrates the structure and flow of information in HydroNets. The output from the modeling of upstream sub-basins is combined into a single representation of a given basin state. It is then processed by the shared model component, which is informed by all basins in the network, and passed on to the label prediction model, which calculates the water level (and the loss function). The output from this iteration of the network is then passed on to inform downstream models, and so on.

An illustration of the HydroNets architecture.

We’re incredibly excited about this progress, and are working hard on improving our systems further.

Acknowledgements
This work is a collaboration between the Google Flood Forecasting Initiative, the Google Geo and Crisis Response teams, Google.org and many other research teams at Google, and is part of our AI for Social Good efforts. We would also like to thank the Partnerships and Policy teams.