Support Configurable Large Messages (> 1 MiB) via Zero-Copy Vectored I/O by abhishek10004 · Pull Request #198 · jacobsa/fuse

abhishek10004 · 2026-06-22T12:58:47Z

Overview

Previously, this library had a hardcoded FUSE message buffer size of 1 MiB + pageSize (corresponding to the standard 1 MiB FUSE payload limit). Because of this hardcoded limit, large reads and writes (> 1 MiB) were not supported at all.

This PR adds support for large I/O operations by introducing a configurable MaxMessageSize in MountConfig, allowing daemons to read and write messages larger than 1 MiB.

To prevent the severe performance regressions, memory fragmentation, and garbage collection (GC) pressure that would arise from allocating giant contiguous buffers (e.g., 4 MiB, 8 MiB, or 16 MiB) on the heap for every request, this support is implemented via Vectored I/O:

- Large Reads (> 1 MiB) are read from the FUSE device directly into non-contiguous, block-pooled buffers (1 MiB blocks) via the readv system call. The filesystem can then write the read payload directly into these blocks via ReadFileOp.DstBufs (Zero-Copy Vectored Reads).
- Large Writes (> 1 MiB) bypass copying the incoming payload into a single contiguous slice, instead exposing the raw non-contiguous block slices directly to the filesystem via WriteFileOp.DataBlocks (Zero-Copy Vectored Writes).

Key Benefits

- Support for Large I/O (> 1 MiB): Enables high-throughput FUSE operations by removing the hardcoded 1 MiB message size ceiling.
- Zero-Copy & Low GC Pressure: Avoids massive heap allocations and contiguous memory copies for large transfers by leveraging a thread-safe, block-pooled allocator (BlockPool1M and BlockPool1MPlusPage) and the readv system call.
- Backward Compatibility: Preserves contiguous buffers (Dst and Data) for reads/writes under 1 MiB or when vectored I/O is disabled, ensuring existing filesystem implementations continue to work out-of-the-box.

Commit-by-Commit Walkthrough

1. Enable Setting FUSE Buffer Sizes Dynamically

Commit: ddc386c

Purpose: Lays the foundation for configurable message sizes by allowing the buffer size to be set dynamically prior to mounting, rather than relying on a hardcoded global limit.
Key Changes:
- Added MaxMessageSize uint32 to mount_config.go.
- Dynamically calculates c.inMessageSize in connection.go based on MaxMessageSize (or defaults to the maximum of MaxReadSize/MaxWriteSize + 1 page).
- Configures the FUSE protocol initialization (Init call) to announce this dynamically calculated size as MaxWrite and MaxPages to the FUSE kernel driver.
- Refactored InMessage in internal/buffer/in_message.go to accept a dynamic allocation size instead of using a global static bufSize.
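
The size calculation described in this commit can be sketched roughly as below. This is an illustration only: inMessageSize, initLimits, and the default constants are hypothetical names, and the sketch assumes MaxMessageSize is a payload limit to which one page is added for headers, which the description does not state explicitly.

```go
package main

import "fmt"

const (
	miB = 1 << 20
	// Hypothetical stand-ins for the library's MaxReadSize / MaxWriteSize.
	defaultMaxReadSize  = miB
	defaultMaxWriteSize = miB
)

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// inMessageSize returns the incoming-message buffer size. Assumption: the
// configured value is a payload limit and one page is reserved for headers.
func inMessageSize(maxMessageSize uint32, pageSize int) int {
	payload := maxInt(defaultMaxReadSize, defaultMaxWriteSize)
	if maxMessageSize != 0 {
		payload = int(maxMessageSize)
	}
	return payload + pageSize
}

// initLimits derives the values announced to the kernel during init,
// following the "maxPayload := inMessageSize - pageSize" line in the diff.
func initLimits(inMsgSize, pageSize int) (maxWrite uint32, maxPages uint16) {
	maxPayload := inMsgSize - pageSize
	return uint32(maxPayload), uint16(maxPayload / pageSize)
}

func main() {
	size := inMessageSize(4*miB, 4096)
	maxWrite, maxPages := initLimits(size, 4096)
	fmt.Println(size, maxWrite, maxPages)
}
```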

2. Refactor Error Handling using standard `errors.Is`

Commit: c263028

Purpose: Cleans up error-handling logic when reading FUSE requests to make it more robust.
Key Changes:
- Refactored Connection.readMessage() in connection.go to use standard errors.Is(err, syscall.ENODEV) and errors.Is(err, syscall.EINTR) checks instead of type-casting to *os.PathError and inspecting inner fields. This ensures compatibility in case errors are wrapped.
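
A small standalone sketch of why errors.Is makes the manual unwrapping unnecessary. The classify and examples helpers are hypothetical and are not the library's readMessage; they only show that errors.Is sees through *os.PathError and through additional fmt.Errorf wrapping.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// classify reports how a read loop might react to an error from the device.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, syscall.EINTR):
		return "retry"
	case errors.Is(err, syscall.ENODEV):
		return "unmounted"
	default:
		return "fail"
	}
}

// examples builds a bare PathError, a doubly wrapped PathError, and an
// unrelated error.
func examples() []error {
	return []error{
		&os.PathError{Op: "read", Path: "/dev/fuse", Err: syscall.ENODEV},
		fmt.Errorf("reading request: %w",
			&os.PathError{Op: "read", Path: "/dev/fuse", Err: syscall.EINTR}),
		errors.New("something else"),
	}
}

func main() {
	for _, err := range examples() {
		fmt.Println(classify(err))
	}
}
```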

3. Implement Zero-Copy Vectored Reads Support

Commit: 72acb69

Purpose: Implements the infrastructure for reading FUSE requests from the device directly into non-contiguous blocks, and exposing these blocks to the filesystem via ReadFileOp.DstBufs to support large reads (> 1 MiB) without heap thrashing.
Key Changes:
- Block-Pool Allocator:
  - Defined thread-safe pools BlockPool1M (1 MiB blocks) and BlockPool1MPlusPage (1 MiB + hardware page size) in internal/buffer/in_message.go to recycle buffers and avoid heap thrashing.
  - Refactored InMessage to allocate non-contiguous blocks (blocks [][]byte) rather than a single contiguous slice. Block 0 is always sized 1 MiB + pageSize (holding headers and small payloads), and additional 1 MiB blocks are allocated to satisfy larger message limits.
- Zero-Copy Syscall (readv):
  - Added internal/buffer/readv.go which implements a wrapper around the SYS_READV syscall, converting block slices to unix.Iovec pointers to perform a single-system-call read into multiple non-contiguous memory segments.
  - Retained backward compatibility on macOS (FuseT) by implementing a contiguous fallback pool (fuseTContiguousPool).
- Vectored Read API:
  - Added EnableVectoredReads to MountConfig.
  - Added DstBufs [][]byte to fuseops/ops.go. When enabled and the read size is larger than block 0, Dst is set to nil and DstBufs is populated with the block-sliced buffers, allowing the filesystem to write read payloads directly into the FUSE message blocks.
- Testing:
  - Added over 760 lines of comprehensive unit tests in internal/buffer/in_message_test.go covering block allocations, boundary-spanning data consumption, vector slicing, and pool returns.
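
The block layout described above (block 0 of 1 MiB + pageSize, then 1 MiB blocks, with the final block truncated to the remaining size) can be sketched as follows. blockSizes is a hypothetical helper written for illustration; the PR's InMessage borrows real blocks from the pools rather than computing a list of lengths.

```go
package main

import "fmt"

const miB = 1 << 20

// blockSizes returns the usable length of each block for a message buffer of
// total bytes. Block 0 always has size 1 MiB + pageSize.
func blockSizes(total, pageSize int) []int {
	first := miB + pageSize
	sizes := []int{first}
	for remaining := total - first; remaining > 0; {
		alloc := miB
		if remaining < alloc {
			alloc = remaining // truncate the final block to what is needed
		}
		sizes = append(sizes, alloc)
		remaining -= alloc
	}
	return sizes
}

func main() {
	fmt.Println(blockSizes(4*miB+4096, 4096))
	fmt.Println(blockSizes(miB+4096+100, 4096))
}
```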

4. Implement Zero-Copy Vectored Writes Support & MemFS Optimization

Commit: 8005ed3

Purpose: Adds zero-copy support for incoming large FUSE write requests, bypassing contiguous payload reconstruction, and optimizes the memfs memory filesystem.
Key Changes:
- Vectored Writes API:
  - Added EnableVectoredWrites to MountConfig.
  - Added DataBlocks [][]byte and a TotalSize() int helper method to fuseops/ops.go.
  - When enabled, convertInMessage slices the write payload directly into DataBlocks (using ConsumeVector), completely avoiding copying the payload into a single contiguous slice.
- MemFS Optimization:
  - Added WriteBlocksAt(blocks [][]byte, off int64) to samples/memfs/inode.go to copy block-by-block directly into the inode's storage slice.
  - Updated WriteFile in samples/memfs/memfs.go to leverage WriteBlocksAt when DataBlocks is populated.
- Wirelog, Debug, and Test Cleanups:
  - Simplified write size calculations in debug.go and wirelog.go to use the new WriteFileOp.TotalSize() helper method.
  - Added the VectoredWritesTest integration tests in samples/memfs/memfs_test.go.
  - Fixed out-of-cache benchmarks in internal/buffer/out_message_test.go by allocating larger arrays (80 MiB) on the heap to successfully defeat the CPU cache.
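
A minimal sketch of what a block-by-block write into an inode's storage slice might look like. The inode type, the return value, and totalSize are assumptions made for illustration; only the WriteBlocksAt(blocks [][]byte, off int64) parameter list comes from the description above.

```go
package main

import "fmt"

// inode is a hypothetical stand-in for the memfs inode.
type inode struct {
	contents []byte
}

// totalSize sums the lengths of all blocks.
func totalSize(blocks [][]byte) int {
	n := 0
	for _, b := range blocks {
		n += len(b)
	}
	return n
}

// WriteBlocksAt copies each block into contents starting at off, growing the
// storage once up front. Any gap before off is left zero-filled.
func (in *inode) WriteBlocksAt(blocks [][]byte, off int64) int {
	end := int(off) + totalSize(blocks)
	if end > len(in.contents) {
		grown := make([]byte, end)
		copy(grown, in.contents)
		in.contents = grown
	}
	n := 0
	for _, b := range blocks {
		n += copy(in.contents[int(off)+n:], b)
	}
	return n
}

func main() {
	in := &inode{contents: []byte("hello")}
	n := in.WriteBlocksAt([][]byte{[]byte("AB"), []byte("CDE")}, 3)
	fmt.Println(n, string(in.contents))
}
```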

Architectural Design: Why Vectored I/O?

Supporting message sizes larger than 1 MiB by dynamically allocating contiguous buffers (e.g., a single 8 MiB buffer for an 8 MiB read/write) would be highly inefficient because of severe GC pressure and heap fragmentation.

Instead, the library now implements a non-contiguous, block-based architecture:

           +-----------------------+      +-------------------+
InMessage  | Block 0 (1MB + page)  | ---> | Block 1 (1MB)     | ---> ...
           +-----------------------+      +-------------------+
           | Headers | Small data  |      | Large data blocks |
           +-----------------------+      +-------------------+

1. Zero-Copy Reads

When a FUSE read request is received:

- The daemon uses the readv syscall on Linux to read each message from the /dev/fuse descriptor directly into the pooled blocks, so no contiguous staging buffer or extra user-space copy is needed.
- If the read size is larger than Block 0, the library populates ReadFileOp.DstBufs with these blocks.
- The filesystem writes the data directly into DstBufs, requiring no extra allocations or copy operations.
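
For readers unfamiliar with readv, here is a rough, Linux-oriented sketch of a single-syscall read into several non-contiguous blocks. It uses the standard syscall package and a pipe in place of /dev/fuse; the PR's wrapper is described as using unix.Iovec, so treat readv and pipeReadv below as hypothetical illustrations rather than the PR's code.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
	"unsafe"
)

// readv fills blocks in order with one readv(2) call and returns the number
// of bytes read. Empty blocks are skipped.
func readv(fd uintptr, blocks [][]byte) (int, error) {
	iovs := make([]syscall.Iovec, 0, len(blocks))
	for _, b := range blocks {
		if len(b) == 0 {
			continue
		}
		iov := syscall.Iovec{Base: &b[0]}
		iov.SetLen(len(b))
		iovs = append(iovs, iov)
	}
	if len(iovs) == 0 {
		return 0, nil
	}
	n, _, errno := syscall.Syscall(syscall.SYS_READV, fd,
		uintptr(unsafe.Pointer(&iovs[0])), uintptr(len(iovs)))
	runtime.KeepAlive(blocks)
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}

// pipeReadv writes msg to a pipe and reads it back into blocks of the given
// sizes, standing in for a read from the FUSE device.
func pipeReadv(msg string, sizes ...int) (int, [][]byte, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return 0, nil, err
	}
	defer r.Close()
	defer w.Close()
	if _, err := w.Write([]byte(msg)); err != nil {
		return 0, nil, err
	}
	blocks := make([][]byte, len(sizes))
	for i, s := range sizes {
		blocks[i] = make([]byte, s)
	}
	n, err := readv(r.Fd(), blocks)
	return n, blocks, err
}

func main() {
	n, blocks, err := pipeReadv("headerpayload", 6, 16)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d %q %q\n", n, blocks[0], blocks[1][:n-6])
}
```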

2. Zero-Copy Writes

When a FUSE write request is received:

- The write payload is read into the pooled blocks.
- Instead of allocating a single contiguous buffer and copying all blocks into it, the library returns the non-contiguous slices directly in WriteFileOp.DataBlocks.
- Filesystems optimized for vectored writes (such as the updated memfs using WriteBlocksAt) can consume these blocks directly.

Configuration Reference

Three new fields are introduced in MountConfig to manage the large message and vectored I/O behavior:

| Field | Type | Description |
| --- | --- | --- |
| `MaxMessageSize` | `uint32` | Configures the maximum size of FUSE messages the daemon is prepared to read/write. Setting this larger than 1 MiB enables large reads and writes. |
| `EnableVectoredReads` | `bool` | If true, large read operations bypass contiguous buffer allocations in `ReadFileOp.Dst` and instead populate `ReadFileOp.DstBufs`. |
| `EnableVectoredWrites` | `bool` | If true, large write operations bypass copying payload blocks into a single contiguous slice in `WriteFileOp.Data` and instead populate `WriteFileOp.DataBlocks`. |

Important

For MaxMessageSize values greater than 1 MiB, enabling both EnableVectoredReads and EnableVectoredWrites is highly recommended to avoid significant performance regressions due to large heap allocations and contiguous copies.
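
On the filesystem side, a read handler that works with either setting might look roughly like the sketch below. readOp is a hypothetical stand-in that mirrors only the fields discussed here (Dst, DstBufs, and a byte counter); serveRead and newSource are illustrative helpers, not part of the library.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readOp mirrors only the fields discussed above (hypothetical stand-in).
type readOp struct {
	Offset    int64
	Dst       []byte   // contiguous destination (small or non-vectored reads)
	DstBufs   [][]byte // block destinations (large vectored reads)
	BytesRead int
}

// serveRead fills whichever destination is populated, block by block.
func serveRead(src io.ReaderAt, op *readOp) error {
	bufs := op.DstBufs
	if bufs == nil {
		bufs = [][]byte{op.Dst}
	}
	off := op.Offset
	for _, b := range bufs {
		n, err := src.ReadAt(b, off)
		op.BytesRead += n
		off += int64(n)
		if err == io.EOF {
			return nil // short read at end of file
		}
		if err != nil {
			return err
		}
	}
	return nil
}

// newSource returns a reader over the bytes 0, 1, ..., n-1.
func newSource(n int) *bytes.Reader {
	data := make([]byte, n)
	for i := range data {
		data[i] = byte(i)
	}
	return bytes.NewReader(data)
}

func main() {
	op := &readOp{Offset: 2, DstBufs: [][]byte{make([]byte, 3), make([]byte, 3)}}
	if err := serveRead(newSource(10), op); err != nil {
		panic(err)
	}
	fmt.Println(op.BytesRead, op.DstBufs)
}
```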

Testing & Verification

- Unit Tests: internal/buffer/in_message_test.go verifies all edge cases of multi-block message parsing (consuming across block boundaries, shrinking, and block recycling).
- Integration Tests: samples/memfs/memfs_test.go contains test cases ensuring that MemFS correctly handles vectored write operations when EnableVectoredWrites is enabled.
- Benchmarks: Benchmarks in internal/buffer/out_message_test.go have been updated and verified to measure reset, growth, and shrink performance accurately.
All tests pass successfully.

This introduces support for vectored reads (reading FUSE requests from the device via readv into non-contiguous block buffers, and passing them up to the filesystem via DstBufs). Includes multi-block allocation infrastructure in InMessage, platform-specific support for FuseT (contiguous fallback), and configuration.

This introduces support for vectored writes, bypassing copying write payload bytes into a single contiguous slice in WriteFileOp.Data, instead providing the raw non-contiguous blocks in WriteFileOp.DataBlocks. Includes the optimization and implementation in MemFS, along with wirelog, debug, and test updates.

abhishek10004 · 2026-06-22T13:59:40Z

+}
+
+var BlockPool1M = newBlockPool(48, func() []byte {
+	return make([]byte, MiB)


I've not used mmap here because I currently overflow to a sync.Pool, so even with a lot of parallelism we would not be paying the allocation penalty again and again.

In case we take over the memory allocation using mmap and bypass the go runtime, then there would be 2 options:
a) fixed size buffer pool but that would mean constant allocation/deallocation in case parallelism is higher than the configured limits
b) dynamic pool that keeps growing/shrinking but this would be a slightly larger change & would need more testing.
Hence, I've parked it for later, either as a separate commit or a new change.
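
One way to read "a fixed-size pool that overflows to a sync.Pool" is sketched below. Only the newBlockPool(capacity, alloc) call shape is taken from the diff; the channel-based retention and the Get/Put bodies are assumptions made for illustration and may differ from the PR's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// blockPool retains up to a fixed number of blocks in a channel and hands
// anything beyond that to a sync.Pool, which the GC is free to drain.
type blockPool struct {
	retained chan []byte
	overflow sync.Pool
}

func newBlockPool(capacity int, alloc func() []byte) *blockPool {
	p := &blockPool{retained: make(chan []byte, capacity)}
	p.overflow.New = func() interface{} { return alloc() }
	return p
}

// Get prefers a retained block and falls back to the overflow pool, which
// allocates a fresh block when it is empty.
func (p *blockPool) Get() []byte {
	select {
	case b := <-p.retained:
		return b
	default:
		return p.overflow.Get().([]byte)
	}
}

// Put retains the block if there is room, otherwise overflows it.
// Note: storing a slice value in sync.Pool boxes the slice header on each
// Put; a real implementation might store pointers instead.
func (p *blockPool) Put(b []byte) {
	select {
	case p.retained <- b:
	default:
		p.overflow.Put(b)
	}
}

func main() {
	pool := newBlockPool(2, func() []byte { return make([]byte, 1<<20) })
	b := pool.Get()
	fmt.Println(len(b))
	pool.Put(b)
}
```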

vadlakondaswetha

yet to review the testcases and samples

vadlakondaswetha · 2026-06-24T05:02:03Z

 	initOp.MaxReadahead = maxReadahead
-	initOp.MaxWrite = buffer.MaxWriteSize
+
+	maxPayload := c.inMessageSize - buffer.GetPageSize()


Do we anticipate different sizes for read and write? If not, can we have just one variable which tells the size of the request for both reads and writes?

vadlakondaswetha · 2026-06-24T07:44:08Z

+}
+
+// NewInMessage creates a new InMessage.
+func NewInMessage(size int) *InMessage {


Why are you taking a size parameter if it's not used?

vadlakondaswetha · 2026-06-24T07:54:16Z

-				err = nil
-				continue
-			}
+		if errors.Is(err, syscall.ENODEV) {


We are removing a typecast here? Are these changes intentional? If yes, how did they work earlier?

The old code was the idiomatic way to handle nested errors before errors.Is was released.
When reading from a device or a file, the error returned is typically wrapped in a PathError. Hence, we were unwrapping it earlier and then comparing it. With errors.Is, this is no longer needed.

vadlakondaswetha · 2026-06-24T07:56:09Z

 	var err error
 	if fusekernel.IsPlatformFuseT {
-		n, err = m.ReadSingle(r)
+		if len(m.blocks) == 1 {


As discussed, please remove all changes for Mac and return a not-supported error when messageSize is bigger. Let's not check in changes which are not reviewed.

vadlakondaswetha · 2026-06-24T08:12:44Z

 	return pageSize
 }

+type blockPool struct {


Please create a separate file for blockPool.

vadlakondaswetha · 2026-06-24T08:25:46Z

+		block := BlockPool1M.Get()
+		m.borrowedBlocks = append(m.borrowedBlocks, block)
+		allocSize := MiB
+		if remaining < allocSize {


what happens if you remove this check?

Then the final block won't be truncated to the actual remaining requested size and we'd be passing a buffer which is larger than the requested data.

vadlakondaswetha · 2026-06-24T08:26:36Z

+	// Since n doesn't fit in block 0, and block 0 has size 1MB + pageSize,
+	// n is necessarily larger than 1MB (assuming typical small offset like
+	// sizeof(ReadIn)). Thus we always allocate directly on the heap.
+	return make([]byte, n)


In the existing code, if we don't have the required buffer space, we return nil, whereas here we are creating a new buffer.
Also, I didn't understand in what scenarios it would cross 1MB?

I'm doing a new allocation here since the read request can go beyond 1MB reads, and if the user has not enabled vectored reads, then we'd have to pass a single buffer of the requested size to the user.

Regarding earlier case, in that scenario we had a 1MiB+page size buffer and the header is less than page size, so the original buffer would always have 1MiB space & hence the condition "n > len(m.storage)-m.size" would never be true.

vadlakondaswetha · 2026-06-24T08:30:10Z

 		}
 		// Use part of the incoming message storage as the read buffer.
-		to.Dst = inMsg.GetFree(int(in.Size))
+		if config.EnableVectoredReads && int(in.Size) > buffer.MiBPlusPageSize {


Brainstorming a bit here. why do we need to support both vectoredReads and non vectoredReads. Can we just pass 2-D array always. It would be a minor change on the GCSFuse side. How big of a change will it be on GCSFuse side? I am guessing we can just pick the first block from the array and pass it downstream when messageSize is 1MB?

I kept support for non-vectored I/O as well so that it doesn't become a breaking change. Otherwise, anyone updating the library would be forced to make changes to their code, even though those changes would mostly be minor.

vadlakondaswetha · 2026-06-24T08:45:23Z

+		var buf []byte
+		var dataBlocks [][]byte
+
+		if config.EnableVectoredWrites && inMsg.Len() > uintptr(buffer.MiBPlusPageSize) {


by moving everything to vectoredReads/writes we need not do if-else every where. the code becomes much simpler.

reduces the number of configs too.

yes, it would become easier & cleaner as well but as mentioned above it'd be a breaking change.

vadlakondaswetha · 2026-06-24T08:57:50Z

+	// In production, any spanning allocation is larger than 1MB (since block 0
+	// is 1MB + pageSize and fits all normal headers/payloads). Thus we always
+	// allocate directly from the heap.
+	res := make([]byte, n)


Same as reads? why would we overflow here and not earlier?

This new allocation will be used when the user has not enabled vectored writes and increased the fuse max pages limit beyond 1M. In that case, we'd have to allocate a single buffer which can contain the whole content to be written.

geertj · 2026-06-25T22:44:02Z

Hey @abhishek10004. Thanks for putting this together!

I know that I advocated for the approach of using readv() to allow large messages (e.g. 16MB reads which is what we need on GCS Standard), while at the same time not having to make all messages 16MB. However, after reviewing this PR in some detail, I am not very comfortable with the complexity of what we're adding and the size of this change.

I wonder if instead we need to do something simpler: add a MaxPages config option to MountConfig. If set, that's what we pass in Connection::Init. If unset, we keep the current value based on MaxRead/WriteSize. Note that we'd keep MaxWrite as it is right now.

I think the result would be:

A read() from the socket never returns more than 1MB + 4KB i.e. our current buffer size for inMessage. This is because the only operation that has an unbounded payload is FUSE_WRITE, and that would be clamped by what we set for MaxWrite in FUSE_INIT.
The higher value of MaxPages would allow FUSE to send us large FUSE_READ operations, which is what we're after. The request handler needs to use ReadFileOp::Data instead of ReadFileOp::Dst. I think that is a fair API contract for when you set a MaxPages value > 1MB. It is also self documenting because a developer can do a len(op.Dst) and notice that it does not contain sufficient space for the requested bytes.

What do you think? Would this work?

abhishek10004 · 2026-06-26T17:16:37Z

We can set MaxPages & MaxWrite to different values. In this case, while the reads would be limited by the MaxPages value (which means we can get read requests larger than 1MB), the write requests would be limited by the minimum of what MaxPages & MaxWrite evaluate to. Hence, if we keep setting MaxWrite to 1M, we'd not need to use readv and vectored io at all. This would significantly reduce the amount of change.
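
As a worked example of the limits described in this comment, the sketch below computes the largest read and write requests for a given MaxPages and MaxWrite. effectiveLimits is a hypothetical helper that simply encodes the rule as stated in the comment; it is not kernel or library code.

```go
package main

import "fmt"

// effectiveLimits applies the rule from the comment above: reads are capped
// by maxPages*pageSize, writes by the smaller of that and maxWrite.
func effectiveLimits(maxPages, pageSize, maxWrite int) (readLimit, writeLimit int) {
	readLimit = maxPages * pageSize
	writeLimit = readLimit
	if maxWrite < writeLimit {
		writeLimit = maxWrite
	}
	return readLimit, writeLimit
}

func main() {
	// 4096 pages of 4 KiB allow 16 MiB reads, while MaxWrite stays at 1 MiB.
	r, w := effectiveLimits(4096, 4096, 1<<20)
	fmt.Println(r>>20, "MiB reads,", w>>20, "MiB writes")
}
```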

The only issue would be that in case of writes, we'd not get larger requests, but that would be okay because we still buffer the writes locally (in streaming writes, we buffer in the in-memory block till it's full, whereas in staging writes we buffer full writes on disk). Hence, smaller requests would not make that much of a difference, as the major bottleneck in writes would actually be the upload time.

geertj · 2026-06-26T18:30:21Z

Exactly. Writes will always have to be buffered /somewhere/ (in an actual buffer, or in a network stream) due to the semantics of object storage that don't allow you to just do random overwrites. So I think that we don't really need large requests for writes. Or at least, we don't need them in the near future.

abhishek10004 added 4 commits June 22, 2026 09:33

First set of changes to enable setting buffer size before mount

ddc386c

Refactor: use errors.Is for ENODEV/EINTR error handling

c263028

abhishek10004 force-pushed the abhishek/vectored_io branch from 1891bb9 to 8005ed3 Compare June 22, 2026 13:15
