344 Commits

Author SHA1 Message Date
Ridwan Sharif 3e0e3b9b11 Added stub FUSE filesystem
Allow FUSE filesystems to be mounted using libfuse.
The appropriate flags and mount options are parsed and
understood by fusefs.
2020-07-23 17:13:24 -04:00
Bhasker Hariharan 71bf90c55b Support for receiving outbound packets in AF_PACKET.
Updates #173

PiperOrigin-RevId: 322665518
2020-07-22 15:33:33 -07:00
Ting-Yu Wang db653bb34b fdbased: Vectorized write for packet; relax writev syscall filter.
Now it calls pkt.Data.ToView() when writing the packet. This may require
copying when the packet is large, which makes the worst case even worse.

This is sent out as a separate preparatory change because it requires syscall
filter changes. It will be followed by the change that adopts the new
PacketHeader API.

PiperOrigin-RevId: 321447003
2020-07-15 15:05:32 -07:00
gVisor bot 8939fae0af Merge pull request #3165 from ridwanmsharif:ridwanmsharif/fuse-off-by-default
PiperOrigin-RevId: 321411758
2020-07-15 12:14:42 -07:00
Fabricio Voznika 1bfb556ccd Prepare boot.Loader to support multi-container TTY
- Combine process creation code that is shared between
  root and subcontainer processes
- Move root container information into a struct for
  clarity

Updates #2714

PiperOrigin-RevId: 321204798
2020-07-14 12:02:03 -07:00
gVisor bot c81ac8ec3b Merge pull request #2672 from amscanne:shim-integrated
PiperOrigin-RevId: 321053634
2020-07-13 16:10:58 -07:00
Ridwan Sharif abffebde7b Gate FUSE behind a runsc flag
This change gates all FUSE commands (by gating /dev/fuse) behind a runsc
flag. In order to use FUSE commands, use the --fuse flag with the --vfs2
flag. Check if FUSE is enabled by running dmesg in the sandbox.
2020-07-09 02:01:29 -04:00
Fabricio Voznika c4815af947 Add shared mount hints to VFS2
Container restart test is disabled for VFS2 for now.

Updates #1487

PiperOrigin-RevId: 320296401
2020-07-08 17:12:29 -07:00
Ayush Ranjan efa2615eb0 [vfs2] Remove VFS1 usage in VDSO.
Removed VDSO dependency on VFS1.

Resolves #2921

PiperOrigin-RevId: 320122176
2020-07-07 21:37:08 -07:00
Ridwan Sharif 2828806fb0 Test that the fuse device can be opened
2020-06-25 15:46:30 -04:00
Ridwan Sharif a63db7d903 Moved FUSE device under the fuse directory
2020-06-25 14:22:21 -04:00
Nicolas Lacasse 58880bf551 Port /dev/net/tun device to VFS2.
Updates #2912 #1035

PiperOrigin-RevId: 318162565
2020-06-24 16:23:44 -07:00
Bhasker Hariharan b070e218c6 Add support for Stack level options.
Linux controls socket send/receive buffers using a few sysctl variables
  - net.core.rmem_default
  - net.core.rmem_max
  - net.core.wmem_max
  - net.core.wmem_default
  - net.ipv4.tcp_rmem
  - net.ipv4.tcp_wmem

The first 4 control the default socket buffer sizes for all sockets
raw/packet/tcp/udp and also the maximum permitted socket buffer that can be
specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...).

The last two control the TCP auto-tuning limits and override the default
specified in rmem_default/wmem_default as well as the max limits.

Netstack today implements only tcp_rmem/tcp_wmem, and incorrectly uses them
both to limit the maximum size in setsockopt() and for raw/udp sockets.

This changelist introduces the other 4 and updates the udp/raw sockets to use
the newly introduced variables. The min/max values match the current
tcp_rmem/wmem values, and the default buffer size for UDP/RAW sockets is
raised from the very low current value of 32 KiB to the Linux value of 212 KiB.
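
The relationship between a default size and a min/max clamp on setsockopt()
requests can be sketched roughly as follows. This is an illustration only, not
netstack's code: the bufSizeOption type, its clamp method, and the Min/Max
values are hypothetical. Only the 212 KiB default comes from the message above.

```go
package main

import "fmt"

// bufSizeOption is a hypothetical stand-in for a stack-level buffer size
// option, loosely modelled on the rmem/wmem sysctl triples described above.
type bufSizeOption struct {
	Min     int // smallest size a socket may be given
	Default int // size used when the application never calls setsockopt
	Max     int // largest size setsockopt may request
}

// clamp bounds a requested buffer size to the [Min, Max] range.
func (o bufSizeOption) clamp(requested int) int {
	if requested < o.Min {
		return o.Min
	}
	if requested > o.Max {
		return o.Max
	}
	return requested
}

func main() {
	// Min and Max are assumed values chosen for illustration.
	opt := bufSizeOption{Min: 4 << 10, Default: 212 << 10, Max: 4 << 20}
	fmt.Println(opt.Default)        // size of a fresh UDP/raw socket
	fmt.Println(opt.clamp(8 << 20)) // oversized request is capped at Max
	fmt.Println(opt.clamp(1))       // undersized request is raised to Min
}
```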

Updates #3043
Fixes #3043

PiperOrigin-RevId: 318089805
2020-06-24 10:24:20 -07:00
Nicolas Lacasse 0f328beb0d Port /dev/tty device to VFS2.
Support is limited to the functionality that exists in VFS1.

Updates #2923 #1035

PiperOrigin-RevId: 317981417
2020-06-23 18:48:37 -07:00
Kevin Krakauer 28b8a5cc3a iptables: remove metadata struct
Metadata was useful for debugging and safety, but enough tests exist that we
should see failures when (de)serialization is broken. It made stack
initialization more cumbersome and it's also getting in the way of ip6tables.

PiperOrigin-RevId: 317210653
2020-06-18 17:02:16 -07:00
Bhasker Hariharan 07ff909e76 Support setsockopt SO_SNDBUF/SO_RCVBUF for raw/udp sockets.
Updates #173,#6
Fixes #2888

PiperOrigin-RevId: 317087652
2020-06-18 06:07:20 -07:00
gVisor bot dbf786c6b3 Add runsc options to set checksum offloading status
--tx-checksum-offload=<true|false>
  enable TX checksum offload (default: false)
--rx-checksum-offload=<true|false>
  enable RX checksum offload (default: true)
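
For illustration, the two options and their defaults could be declared with
Go's standard flag package roughly as below. This is a sketch, not runsc's
actual flag registration: offloadConfig and parseOffloadFlags are hypothetical
names. Only the flag names and default values come from the message above.

```go
package main

import (
	"flag"
	"fmt"
)

// offloadConfig is a hypothetical holder for the two settings.
type offloadConfig struct {
	TXChecksumOffload bool
	RXChecksumOffload bool
}

// parseOffloadFlags parses the two checksum offload flags from args,
// applying the defaults listed in the commit message.
func parseOffloadFlags(args []string) (offloadConfig, error) {
	var c offloadConfig
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	fs.BoolVar(&c.TXChecksumOffload, "tx-checksum-offload", false, "enable TX checksum offload")
	fs.BoolVar(&c.RXChecksumOffload, "rx-checksum-offload", true, "enable RX checksum offload")
	err := fs.Parse(args)
	return c, err
}

func main() {
	c, err := parseOffloadFlags([]string{"--tx-checksum-offload=true"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("tx=%t rx=%t\n", c.TXChecksumOffload, c.RXChecksumOffload)
}
```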

Fixes #2989

PiperOrigin-RevId: 316781309
2020-06-16 16:34:26 -07:00
Ian Lewis 8ea99d58ff Set the HOME environment variable for sub-containers.
Fixes #701

PiperOrigin-RevId: 316025635
2020-06-11 19:31:24 -07:00
Jamie Liu 77c206e371 Add //pkg/sentry/fsimpl/overlay.
Major differences from existing overlay filesystems:

- Linux allows lower layers in an overlay to require revalidation, but not the
  upper layer. VFS1 allows the upper layer in an overlay to require
  revalidation, but not the lower layer. VFS2 does not allow any layers to
  require revalidation. (Now that vfs.MkdirOptions.ForSyntheticMountpoint
  exists, no uses of overlay in VFS1 are believed to require upper layer
  revalidation; in particular, the requirement that the upper layer support the
  creation of "trusted." extended attributes for whiteouts effectively required
  the upper filesystem to be tmpfs in most cases.)

- Like VFS1, but unlike Linux, VFS2 overlay does not attempt to make mutations
  of the upper layer atomic using a working directory and features like
  RENAME_WHITEOUT. (This may change in the future, since not having a working
  directory makes error recovery for some operations, e.g. rmdir, particularly
  painful.)

- Like Linux, but unlike VFS1, VFS2 represents whiteouts using character
  devices with rdev == 0; the equivalent of the whiteout attribute on
  directories is xattr trusted.overlay.opaque = "y"; and there is no equivalent
  to the whiteout attribute on non-directories since non-directories are never
  merged with lower layers.

- Device and inode numbers work as follows:

    - In Linux, modulo the xino feature and a special case for when all layers
      are the same filesystem:

        - Directories use the overlay filesystem's device number and an
          ephemeral inode number assigned by the overlay.

        - Non-directories that have been copied up use the device and inode
          number assigned by the upper filesystem.

        - Non-directories that have not been copied up use a per-(overlay,
          layer)-pair device number and the inode number assigned by the lower
          filesystem.

    - In VFS1, device and inode numbers always come from the lower layer unless
      "whited out"; this has the adverse effect of requiring interaction with
      the lower filesystem even for non-directory files that exist on the upper
      layer.

    - In VFS2, device and inode numbers are assigned as in Linux, except that
      xino and the samefs special case are not supported.

- Like Linux, but unlike VFS1, VFS2 does not attempt to maintain memory mapping
  coherence across copy-up. (This may have to change in the future, as users
  may be dependent on this property.)

- Like Linux, but unlike VFS1, VFS2 uses the overlayfs mounter's credentials
  when interacting with the overlay's layers, rather than the caller's.

- Like Linux, but unlike VFS1, VFS2 permits multiple lower layers in an
  overlay.

- Like Linux, but unlike VFS1, VFS2's overlay filesystem is
  application-mountable.

Updates #1199

PiperOrigin-RevId: 316019067
2020-06-11 18:34:53 -07:00
Fabricio Voznika 4e96b94915 Combine executable lookup code
The executable lookup code paths for run vs. exec and for VFS1 vs. VFS2 were
slightly different from each other. Combine them all into the same logic.

PiperOrigin-RevId: 315426443
2020-06-08 23:08:23 -07:00
Rahat Mahmood 21b6bc7280 Implement mount(2) and umount2(2) for VFS2.
This is mostly syscall plumbing; VFS2 already implements the internals of
mounts. In addition to the syscall definitions, the following mount-related
mechanisms are updated:

- Implement MS_NOATIME for VFS2, but only for tmpfs and goferfs. The other VFS2
  filesystems don't implement node-level timestamps yet.

- Implement the 'mode', 'uid' and 'gid' mount options for VFS2's tmpfs.

- Plumb mount namespace ownership, which is necessary for checking appropriate
  capabilities during mount(2).

Updates #1035

PiperOrigin-RevId: 315035352
2020-06-05 19:12:03 -07:00
Nicolas Lacasse e4e11f2798 Expand syscall filters to support MSAN.
PiperOrigin-RevId: 314997564
2020-06-05 14:33:50 -07:00
Ting-Yu Wang 41da7a568b Fix copylocks error about copying IPTables.
IPTables.connections contains a sync.RWMutex. Copying it will trigger copylocks
analysis. Tested by manually enabling nogo tests.

sync.RWMutex is added to IPTables for the additional race condition discovered.
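
As a rough illustration of the copylocks concern: a struct that holds a
sync.RWMutex should be shared by pointer, because copying the struct value
also copies the lock. The table type below is a hypothetical stand-in and is
not the real IPTables definition.

```go
package main

import (
	"fmt"
	"sync"
)

// table is a hypothetical struct that guards a map with a lock, similar in
// shape to a type that contains a sync.RWMutex.
type table struct {
	mu    sync.RWMutex
	conns map[string]int
}

// newTable returns a pointer so callers never copy the embedded lock.
func newTable() *table {
	return &table{conns: make(map[string]int)}
}

// add uses a pointer receiver; a value receiver would copy mu on every call,
// which is the pattern the vet copylocks check reports.
func (tb *table) add(key string) {
	tb.mu.Lock()
	defer tb.mu.Unlock()
	tb.conns[key]++
}

// count reads under the shared lock.
func (tb *table) count(key string) int {
	tb.mu.RLock()
	defer tb.mu.RUnlock()
	return tb.conns[key]
}

func main() {
	tb := newTable()
	alias := tb // copies only the pointer; both names share one lock
	alias.add("flow")
	fmt.Println(tb.count("flow"))
}
```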

PiperOrigin-RevId: 314817019
2020-06-05 11:29:09 -07:00
Fabricio Voznika ca5912d13c More runsc changes for VFS2
- Add /tmp handling
- Apply mount options
- Enable more container_test tests
- Forward signals to child process when test respawns process
  to run as root inside namespace.

Updates #1487

PiperOrigin-RevId: 314263281
2020-06-01 21:32:09 -07:00
Jamie Liu 3a987160aa Handle gofer blocking opens of host named pipes in VFS2.
Using tee instead of read to detect when a O_RDONLY|O_NONBLOCK pipe FD has a
writer circumvents the problem of what to do with the byte read from the pipe,
avoiding much of the complexity of the fdpipe package.
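
A minimal Linux-only sketch of the tee-based probe follows. It is not the
gofer client's code: pipeHasWriter and demo are hypothetical names, the
scratch pipe is an assumption of this sketch, and spliceFNonblock is defined
locally because the standard syscall package does not export
SPLICE_F_NONBLOCK. The idea is that tee(2) on an empty pipe fails with EAGAIN
while a writer exists and returns 0 once there is none, and that it never
consumes data from the probed pipe.

```go
package main

import (
	"fmt"
	"syscall"
)

// spliceFNonblock mirrors Linux's SPLICE_F_NONBLOCK (0x02).
const spliceFNonblock = 0x2

// pipeHasWriter probes the pipe read end fd without consuming its data.
// scratchR and scratchW are the two ends of a scratch pipe.
func pipeHasWriter(fd, scratchR, scratchW int) (bool, error) {
	n, err := syscall.Tee(fd, scratchW, 1, spliceFNonblock)
	if err == syscall.EAGAIN {
		return true, nil // empty, but a writer is attached
	}
	if err != nil {
		return false, err
	}
	if n == 0 {
		return false, nil // empty and no writer
	}
	// Non-empty, so the pipe has (or previously had) a writer. Discard the
	// duplicated byte from the scratch pipe; fd still holds the original.
	var b [1]byte
	_, err = syscall.Read(scratchR, b[:])
	return true, err
}

// demo probes an anonymous pipe before and after closing its write end.
func demo() (withWriter, withoutWriter bool, err error) {
	var src, scratch [2]int
	if err = syscall.Pipe2(src[:], syscall.O_NONBLOCK); err != nil {
		return
	}
	defer syscall.Close(src[0])
	if err = syscall.Pipe2(scratch[:], syscall.O_NONBLOCK); err != nil {
		syscall.Close(src[1])
		return
	}
	defer syscall.Close(scratch[0])
	defer syscall.Close(scratch[1])

	withWriter, err = pipeHasWriter(src[0], scratch[0], scratch[1])
	syscall.Close(src[1]) // drop the only writer
	if err != nil {
		return
	}
	withoutWriter, err = pipeHasWriter(src[0], scratch[0], scratch[1])
	return
}

func main() {
	a, b, err := demo()
	fmt.Println(a, b, err)
}
```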

PiperOrigin-RevId: 314216146
2020-06-01 15:33:30 -07:00